A Data Resource for Prediction of Gas-Phase Thermodynamic Properties of Small Molecules

Abstract: The thermodynamic properties of a substance are key to predicting its behavior in physical and chemical systems. Specifically, the enthalpy of formation and entropy of a substance can be used to predict whether reactions involving that substance will proceed spontaneously under conditions of constant temperature and pressure, and if they do, what the heat and work yield of those reactions would be. Prediction of enthalpy and entropy of substances is therefore of value for substances for which those parameters have not been experimentally measured. We developed a database of 2869 experimental values of enthalpy of formation and 1403 values for entropy for substances composed of stable small molecules, derived from the literature. We developed a model for predicting enthalpy of formation and entropy from semiempirical quantum mechanical calculations of energy and atom counts, and applied the model to a comprehensive database of 16,417 small molecules. The database of small-molecule thermodynamic properties will be useful for predicting the outcome of any process that might involve the generation or destruction of volatile products, such as atmospheric chemistry, volcanism, or waste pyrolysis. Additionally, the collected experimental thermodynamic values will be of value to others developing models to predict enthalpy and entropy. Dataset: 10.5281/zenodo.4661783. Dataset License: CC BY (SA).


Summary
We present a dataset for the prediction of thermodynamic parameters for compounds and its application to a set of 16,417 small molecules [1]. The dataset addresses the gas-phase enthalpy of formation, the entropy of molecules, and the change of both parameters with temperature. The dataset contains a compilation of measured enthalpy of formation for 2869 compounds, measured entropy for 1403 compounds, and temperature dependence of parameters for 172 compounds. These can be used as a reference source in their own right, or used to build a model for predicting these values for new compounds. We describe building such a model and applying it to the 16,417 small molecules in the 'All Small Molecules' collection [1].
For context, we first provide material on the importance of enthalpy and entropy. The enthalpy and entropy of formation of a compound are key parameters for predicting whether reactions involving a compound will proceed spontaneously under isobaric and isothermal (constant pressure and temperature) conditions. Specifically, if the Gibbs free energy change (∆G) of a system in which a chemical reaction takes place is negative, and the system is at constant pressure and temperature, then that system, at equilibrium, will contain more of the products of that reaction than of the reactants, and the reaction is said to proceed spontaneously in the forward direction. This condition is usually met in chemistry happening in an unenclosed space at the surface of a planet, where the pressure is constant but the reactants can change volume as they form products. (If volume is constant but pressure changes, as might be true in confined gas bubbles in a rock, for example, then the change in Helmholtz free energy (∆F) is the appropriate measure of reaction spontaneity, but this can also be calculated from change in enthalpy and in pressure.) A negative ∆G does not imply that the reaction will proceed. How fast a reaction proceeds is the domain of kinetics, not of thermodynamics. However, if the reaction does proceed, then at equilibrium thermodynamics will predict whether the products of the reaction will dominate.
Because ∆G is only measured by changes of state of a system, the free energy of a chemical is practically defined as the free energy change when forming a molecule from its constituent elements at standard state (298 K, 1 bar), an energy change that is called the standard free energy of formation of a compound (∆G°). Knowledge of the ∆G° values of the reactants and products of a reaction allows the Gibbs free energy change of that reaction to be calculated:
$$\Delta G = \sum \Delta G^{\circ}_{p} - \sum \Delta G^{\circ}_{r} \qquad (1)$$
where ∆G°p are the standard free energies of formation of the products and ∆G°r are the standard free energies of formation of the reactants. ∆G° can be calculated from the enthalpy of formation (∆H, heat released when forming a compound from its elements) and the entropy change of the reaction (∆S) via
$$\Delta G = \Delta H - T \Delta S \qquad (2)$$
where T is the absolute temperature and ∆S is defined as
$$\Delta S = \sum S^{\circ}_{p} - \sum S^{\circ}_{r} \qquad (3)$$
where S°p is the entropy of the products and S°r is the entropy of the reactants. Thus, knowledge of ∆H and S of a substance is important in predicting the likely outcome of a chemical reaction involving that substance.
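As an illustration of how ∆G follows from ∆H and S, the short Python sketch below evaluates ∆G = ∆H − T∆S for the formation of ammonia from its elements, with ∆S taken as the entropy of the products minus that of the reactants. The numerical inputs are commonly tabulated, rounded gas-phase values at 298.15 K and are assumptions made for illustration; they are not read from the dataset described here, and the function names are hypothetical.

```python
# Worked example for 1/2 N2 + 3/2 H2 -> NH3 at 298.15 K.
# Inputs are rounded, commonly tabulated values (illustrative assumptions),
# not entries taken from the dataset described in this paper.

T = 298.15  # absolute temperature, K

DELTA_H_F_NH3 = -45.9  # enthalpy of formation of NH3 (gas), kJ/mol

# Absolute entropies at standard state, J/mol/K
S_STANDARD = {"NH3": 192.8, "N2": 191.6, "H2": 130.7}


def reaction_entropy(products, reactants, entropies):
    """Sum of product entropies minus sum of reactant entropies (J/mol/K).

    `products` and `reactants` map species name -> stoichiometric coefficient.
    """
    s_products = sum(n * entropies[name] for name, n in products.items())
    s_reactants = sum(n * entropies[name] for name, n in reactants.items())
    return s_products - s_reactants


def gibbs_energy(delta_h_kj, delta_s_j, temperature):
    """Return dG = dH - T*dS in kJ/mol, converting dS from J to kJ."""
    return delta_h_kj - temperature * delta_s_j / 1000.0


delta_s = reaction_entropy({"NH3": 1.0}, {"N2": 0.5, "H2": 1.5}, S_STANDARD)
delta_g = gibbs_energy(DELTA_H_F_NH3, delta_s, T)

print(f"dS = {delta_s:.2f} J/mol/K")
print(f"dG = {delta_g:.2f} kJ/mol")
```

With these rounded inputs the entropy change is negative (two moles of gas become one), so the T∆S term offsets roughly two thirds of the favourable enthalpy, leaving a modestly negative ∆G.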
∆H and S have been experimentally measured for thousands of compounds, but this is a small fraction of the millions known, and of the almost boundless number of possible molecules [2]. Computational methods for predicting ∆H and S are therefore valuable. A range of approaches have been used, including quantum mechanical (QM) ab initio and semiempirical methods, molecular mechanics (MM) methods, and group-additive methods, as well as combined methods (e.g., [3,4]). The QM methods seek to predict molecular properties from first principles based on the arrangement of electron orbitals around the nuclei in a molecule (for example, see [5–9]). MM methods treat atoms as indivisible and model their interactions through empirically derived force fields [10]. Group-additive methods seek an empirical approach of providing a table of ∆H and ∆S values contributed by different chemical groups in a molecule; the ∆H and ∆S of the molecule is then the sum of the values of those chemical groups (for example, [11–13]). MM and group-additive approaches can be very accurate when parameterized for narrowly defined sets of molecules (e.g., alkanes [5,6,10]), but are inaccurate if applied outside their specific domains.
In this paper, we present a dataset for a combined QM and group-additive approach. We provide a set of reference data on the measured gas-phase enthalpy and entropy of formation of compounds, computed QM and group parameters from which ∆H and ∆S can be calculated, the results of that modelling, and the application of those models to the 'All Small Molecules' dataset of 16,417 small, potentially volatile molecules generated for atmospheric chemistry studies [1]. Because this dataset was developed to be deployed in atmospheric chemistry studies, the relevant thermodynamic parameters for the gas phase have been collected and modelled. However, the presented data resource and model also provide a basis that can be built on to provide energies of vaporization and condensed phase data for compounds such as urea and glycine, which are unlikely to be present in the gas phase.

Summary of Data
The dataset presented in this paper is a set of data for modelling the thermodynamic properties (enthalpy of formation and entropy) of arbitrary molecules in the gas phase containing the elements H, B, C, N, O, F, Si, P, S, Cl, Ge, As, Se, Br, and I. The data are used in two ways, as summarized in Figure 1. In building the models, molecular structures are used to generate quantum mechanics-calculated estimates of enthalpy and entropy, which are then adjusted to fit known values (red lines in Figure 1) using an algorithm based on the count of the number of atoms in a molecule, built in the StarDrop software (http://www.optibrium.com/stardrop/). The same model can then be used with the same input but without literature value input (i.e., without the red lines in Figure 1) to predict the thermodynamic properties of molecules for which the thermodynamic parameters are unknown. The semiempirical QM methods also directly predict the change in ∆H° and S°, and hence in ∆G°, with temperature.
Data 2022, 7, 33
In this paper, we present the training data for building the models, with links to the original literature, the model files for StarDrop, and the result of predicting the thermodynamic properties of 16,417 molecules in a comprehensive dataset of small molecules [1].

Measured Values for Enthalpy
Enthalpy data were collected from 11 compilations of enthalpy, supplemented with 11 smaller sets of data to fill in the data for elements not commonly used in organic chemistry, notably B, Ge, Si, and Se. A number of collections contained enthalpy data on radicals and isolated ions; these were not included as the purpose of the dataset was to predict the enthalpy of stable molecules. Some of the collections had multiple values, and so are represented by several columns in the data. The statistics on the sources of data, number of columns, and number of compounds represented are shown in Table 1. Overall, enthalpy data on 2869 compounds were collected. The data schema for this dataset is summarized in Table 2. The schema includes data used in modelling enthalpy (discussed below in Section 3.1).
Entropy data were collected from eight data collections, as listed in Table 3. The data schema for the dataset is summarized in Table 4. Note that what is listed in this dataset is absolute entropy S, not entropy of formation ∆S. Entropy of formation can readily be derived from absolute entropy from Equation (3). As was the case for the enthalpy data, only molecules with conventional bonding, and not radicals or isolated ions, were included, and sources with multiple entries are represented by multiple columns. Only a subset of data was taken from Yaws. The statistics on the sources of data, number of columns, and number of compounds represented are shown in Table 3. Overall, entropy data on 1403 compounds were collected.
Enthalpy of formation and entropy change with temperature. Data on the change in enthalpy of formation and entropy with temperature of a set of 174 molecules were extracted from [19].
The data for trichloromethylsilane (CH₃SiCl₃) and trifluoromethylsilane (CH₃SiF₃) were internally inconsistent, in that the tabulated free energy of formation was not the same as the free energy of formation that can be calculated from tabulated entropy, enthalpy of formation, and the respective elemental entropies. No other silicon or fluorine compound shows this inconsistency, so this is not a systemic problem with this dataset. As there is no obvious explanation for this inconsistency, or indication of which of the tabulated ∆H, S, or ∆G are in error, these entries were removed from the dataset. The resulting 172 molecules are provided in a dataset. The data are provided as described in Table 5 in the form of the difference between the respective values at temperatures between 300 and 1500 K and the value at 298 K (standard state).
We previously described a list of 16,417 molecules (ASM) containing no more than 6 non-hydrogen atoms [1] as a repository of potentially volatile compounds. The ASM database of small molecules was built as a comprehensive list of potential biosignature gases, that is, gases that indicate the presence of life on a world (see [41–44] for a review of biosignatures). To extend the value of the ASM dataset, we calculated entropy and enthalpy of formation for these molecules, as described below. The extended dataset contains the original dataset, the calculated values required for calculating enthalpy and entropy as described in Section 3 below, and the output results. The schema for the data is described in Table 6; it includes the calculated free energy of formation for 13 temperatures between 300 K and 1500 K, derived from modelled enthalpy, entropy, and PM7 outputs, in kJ/mol.
The columns 'Model + Measured_Enthalpy' and 'Model + Measured_Entropy' list measured values for enthalpy and entropy, respectively, where those are known, and modelled values where no measured values are known. These values are therefore the most accurate values of enthalpy and entropy available. The Gibbs free energy (relevant to reaction at constant pressure and temperature, as noted) is listed. Calculating Gibbs free energy requires the elemental entropy and H−H° values to be known; these are provided in the file 'elemental_thermodynamics.xlsx', with data for As, Se, and Ge derived from [25,46,47], and all other values from [19].
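The rule behind the 'Model + Measured' columns (take the measured value where one exists, otherwise fall back to the modelled value) can be written as a minimal Python sketch. The function name, row layout, and numerical values below are hypothetical and serve only to show the selection logic; they do not reproduce the actual file schema.

```python
from typing import Optional


def model_plus_measured(measured: Optional[float], modelled: float) -> float:
    """Return the measured value where one is known, else the modelled value."""
    return measured if measured is not None else modelled


# Hypothetical rows: (name, measured enthalpy or None, modelled enthalpy), kJ/mol.
rows = [
    ("compound_a", -74.6, -70.1),  # measured value available: it is used
    ("compound_b", None, 12.3),    # no measurement: modelled value is used
]

combined = {name: model_plus_measured(meas, mod) for name, meas, mod in rows}
print(combined)
```

Testing against `None` rather than truthiness matters here, because a measured value of exactly 0.0 would otherwise be discarded in favour of the model.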
The ∆G values from the thermodynamic data have been integrated with the prior 'All Small Molecules' (ASM) database. The data schema for the flat file version of the dataset is shown in Table 7. The ASM dataset is available for download at www.allmols.org.

Methods
To calculate the free energy of formation of a substance, its enthalpy of formation and entropy need to be known. These values were calculated separately using quantum mechanical calculations, and then corrected for systematic biases using heuristics developed from reference datasets described above.

Measured Thermodynamic Values
Literature compilations of thermodynamic values were identified initially by search of Google Scholar (scholar.google.com) with keywords for thermodynamics (thermodynamics, entropy, enthalpy, free energy, heat of formation) and data collections (database, collection, table). Compounds of specific elements were further identified using thermodynamic terms and terms relevant to the element (e.g., arsenic, arsenous, organoarsenic). These initial papers were followed up by searching for (a) references in the papers identified as relevant and (b) papers citing the papers found.

Inconsistencies and Errors in Published ∆H
Of 2869 substances, 1602 were present only in one data source. For substances for which ∆H° values were present in more than one source, in some cases there was substantial difference in the values provided by those sources. Thus, for example, the ∆H° of sulfur hexachloride (SCl₆) is reported as 91.58 kJ/mol by [9] but −82.80 kJ/mol by [24]. Tetraiodomethane (CI₄) is variously reported to have a ∆H° of 267.94, 326.9, or 452.49 kJ/mol. While half of the 1236 substances represented by more than one data source had ranges of 1 kJ/mol or less, for a substantial fraction the range of ∆H° values was much larger (Figure 2). This was after correcting for typographical errors and correcting some of the most egregious differences by recalculating from the original literature. Despite these data correction procedures, 35 compounds had a range of listed ∆H values in excess of 50 kJ/mol. Compounds with ranges of >50 kJ/mol were excluded from further analysis. Some spot checks suggested that applying a lower exclusion limit did not improve the match between QM-predicted ∆H and experimental values. The excluded values are retained in the database for future reference and flagged in column 27. The filtered set contained 2834 molecules, of which 1232 had more than one source for the ∆H° value.
The presented data correction and curation procedures do not remove all errors. For example, [22] lists the condensed phase ∆H of 2-fluoro-2,2-dinitroethanol as −480.3 kJ/mol but the gas-phase ∆H as −181.8 kJ/mol, implying a heat of vaporization of ~300 kJ/mol, which is similar to that of diamond. This example was excluded from the dataset, but others less obvious in error and present as only a single-source entry may have been retained.
Three entries in the NIST-JANAF online tables are inconsistent between the PDF version [18], including the PDFs on the online database, and the online version [19]. Specifically, entries for phosphoryl tribromide (Br₃OP), thiophosphoryl tribromide (Br₃PS), and phosphine (PH₃) were significantly different between the two versions. In addition, the data in the PDF version of the entry for phosphine were internally inconsistent: the ∆G values tabulated were different from those that could be calculated from the tabulated ∆H and S° values. This was not a systematic error in phosphorus compounds, as other phosphorus compounds did not show these inconsistencies. The online database values of the ∆G° values for phosphine were systematically higher (i.e., more positive) than those from the PDF versions. We note that [50] used the values from the PDF version in [18] in all calculations. Bains et al.'s [50] conclusions would not be changed by using the updated online values; indeed they would be strengthened, suggesting that phosphine is less likely to be formed in Venus's atmosphere than they calculated in their paper.
In some cases, initial modelling pointed to errors in experimental ∆H values, which we could correct. For example, on an initial run of the model the largest difference between modelled and experimental ∆H value was for diphenyl disulfone, with a modelled value of −279.09 kJ/mol and a reported experimental value of −481.02 kJ/mol. The extreme value of this difference for a relatively unexceptional molecule led us to recalculate the experimental value from the original data given in [51]. After applying a small correction for the heat of formation of liquid water [18], assuming that sulfuric acid dissolved in water in the bomb calorimeter at the end of the experiment would be in the form of sulfate ions and not undissociated sulfuric acid (∆H values taken from [52]), and updating the heat of vaporization of water, we recalculated the heat of formation as −240.04 kJ/mol. This is not a unique example, and Stewart comments that one use of such modelling is to point out potentially questionable reported experimental data [7].
With these corrections made where this was possible, an average ∆H° was used in this work. Future work could recalculate ∆H° from original literature data for all the compounds (if the data are published, and not just the derived thermodynamic parameters), but using modern values for reference enthalpies of elements and end products of combustion.
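The curation rule described in this section (exclude compounds whose reported values span more than 50 kJ/mol, otherwise use the average of the reported values) can be sketched in a few lines of Python. This is an illustrative reading of the procedure, not the authors' code: the function and variable names are hypothetical, and two of the input entries are invented placeholders, as marked in the comments.

```python
from statistics import mean

RANGE_CUTOFF_KJ_MOL = 50.0  # exclusion threshold stated in the text


def curate(values_by_compound):
    """Split compounds into averaged values and excluded (flagged) names.

    `values_by_compound` maps a compound name to its list of reported
    enthalpies of formation in kJ/mol (one entry per literature source).
    """
    averaged, excluded = {}, []
    for name, values in values_by_compound.items():
        value_range = max(values) - min(values)
        if value_range > RANGE_CUTOFF_KJ_MOL:
            excluded.append(name)  # kept in the database, but flagged
        else:
            averaged[name] = mean(values)
    return averaged, excluded


reported = {
    # Values for SCl6 and CI4 are those quoted in the text above.
    "SCl6": [91.58, -82.80],
    "CI4": [267.94, 326.9, 452.49],
    # Hypothetical placeholder entries, for illustration only.
    "example_single_source": [-120.0],
    "example_consistent": [-50.2, -50.9],
}

averaged, excluded = curate(reported)
print("excluded:", excluded)
print("averaged:", averaged)
```

A single-source compound has a range of zero and therefore always passes the filter, which is why errors in single-source entries cannot be detected by this rule.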

Measured Entropy (S°) Values
Measured values of entropy (S°) of compounds were collected from literature sources [19,21,23,25,38–40]. In contrast to ∆H data, the entropy data were much more internally consistent. Among the 418 entries for which more than one value was available, 381 had ranges of <8 J/mol/K (Figure 3). The most extreme range was for acetic acid (CH₃COOH), with values between 282.84 [21] and 404.04 [39] J/mol/K. A difference in S° of 121 J/mol/K at 298 K is equivalent to a difference of 36 kJ/mol in ∆G (Equation (2)). Although it is large, the S° difference for acetic acid implies a ∆G difference of less than the 50 kJ/mol cutoff used to eliminate extremely divergent values from the ∆H dataset, so for consistency with the enthalpy dataset, no values were excluded from the entropy dataset. The distribution of ranges in the 418 entries for which more than one value was found is shown in Figure 3.
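The conversion from an entropy range to the equivalent free energy difference uses only the T∆S term of Equation (2). A minimal Python sketch, using the two acetic acid entropy values quoted above and a hypothetical function name, is:

```python
T = 298.0  # K, temperature used in the text


def entropy_range_as_free_energy(s_low, s_high, temperature=T):
    """Return T * (s_high - s_low) in kJ/mol for entropies in J/mol/K."""
    return temperature * (s_high - s_low) / 1000.0


# Lowest and highest reported S° for acetic acid, J/mol/K (from the text).
acetic_acid_gap = entropy_range_as_free_energy(282.84, 404.04)
print(f"{acetic_acid_gap:.1f} kJ/mol")
```

The result is compared against the 50 kJ/mol cutoff applied to the enthalpy data to decide whether an entropy entry would count as extremely divergent.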

Modelling Method
In principle, enthalpy of formation can be calculated for any molecule using ab initio quantum mechanics (QM) methods. In practice, this is not feasible for the molecules considered here for two reasons. First, ab initio computational methods are computationally intensive, especially if high accuracy is required. The enthalpy of formation of a molecule can be calculated from the difference between the total energy of the molecule and the total energy of its component elements. Total energy (the energy released by assembling the molecule from nuclei and electrons at infinite separation) is an output of ab initio methods. However, the total energy is a very large number; for example, the total energies of H₂, O₂, and H₂O calculated to the B3LYP/6-311G level of accuracy are −3071.5, −394,346.8, and −200,532.4 kJ/mol, respectively. These values have to be calculated to at least five significant figures to calculate the enthalpy of formation to within 20 kJ/mol, which, due to the computing time required, is impractical for the large number (16,417) of molecules collected in the ASM database. Second, the most accurate QM methods are not parameterized for atoms heavier than neon, and so most of the molecules of interest would be inaccessible to them.
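The precision problem can be made concrete with the three total energies quoted above. The Python sketch below takes the difference for H₂ + ½ O₂ → H₂O and compares it with the size of the total energies involved. It is only a difference of total energies: zero-point and thermal corrections are omitted, so the result illustrates the scale of the problem rather than giving a calculated enthalpy of formation. The function name is hypothetical.

```python
# Total energies quoted in the text (B3LYP/6-311G), kJ/mol.
E_TOTAL = {"H2": -3071.5, "O2": -394_346.8, "H2O": -200_532.4}


def reaction_energy(products, reactants, energies):
    """Total energy of products minus total energy of reactants (kJ/mol).

    `products` and `reactants` map species name -> stoichiometric coefficient.
    """
    e_products = sum(n * energies[name] for name, n in products.items())
    e_reactants = sum(n * energies[name] for name, n in reactants.items())
    return e_products - e_reactants


# H2 + 1/2 O2 -> H2O
delta_e = reaction_energy({"H2O": 1.0}, {"H2": 1.0, "O2": 0.5}, E_TOTAL)

# Fraction of the H2O total energy that the reaction energy represents.
relative_size = abs(delta_e / E_TOTAL["H2O"])

print(f"energy difference = {delta_e:.1f} kJ/mol")
print(f"fraction of total energy = {relative_size:.4f}")
```

The difference is a few hundred kJ/mol, on the order of a tenth of a percent of the total energies, so an uncertainty in the fourth or fifth significant figure of a total energy translates into tens of kJ/mol in the derived reaction energy.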
We therefore chose the semiempirical QM methods [8] as the basis for calculating enthalpy of formation. Specifically, we used the MOPAC2016 [53] implementation of PM3 [24,54], PM6 [55], and PM7 [56] semiempirical calculations of thermodynamic parameters. The three methods represent a successive improvement of the semiempirical approach, so we used all three to test their accuracy on our specific dataset. We comment further on the comparison between ab initio and semiempirical methods below.
The accuracy of the three methods in predicting enthalpy of formation and entropy is listed in Table 8. Unexpectedly, PM6 proved more accurate in this dataset for predicting entropy than PM7. It is unclear why this might be, but PM6 was used for entropy calculations and PM7 for enthalpy calculations for all subsequent modelling.
A ∆H that is only accurate to within 37 kJ/mol is not sufficiently accurate to predict the outcome of a reaction. As an example, the reaction of nitrogen with hydrogen to form ammonia
½ N₂ + 1½ H₂ ↔ NH₃
has a free energy of reaction of −16.327 kJ/mol at 25 °C [18], predicting that the reaction will form NH₃ at 25 °C if the reaction happens at all. An error of 38 kJ/mol on this value would suggest a range of −54.3 to +21.7 kJ/mol; the former value suggests that an equilibrium mixture of N₂, H₂, and NH₃ at 25 °C would contain essentially 100% NH₃, and the latter value suggests that an equilibrium mixture would contain 6 × 10⁻⁵ NH₃. We therefore sought to improve the accuracy of the energy of formation calculation with a group-additive approach. We tried atom counts, bond counts, and larger functional group counts as the basis for the possible improvement of the accuracy of the energy of formation calculation, but found that atom counts gave as good a match as bond or group counts, and required the fewest free variables.
Modelling was performed in Optibrium's StarDrop software (www.optibrium.com/stardrop), which is optimized for matching molecular properties to molecular structure [57]. The reader is directed to StarDrop user documentation for details of this technology. In summary, data are input as a set of structures (coded as SMILES strings), enthalpy endpoints, and atom counts. The AutoModeller function of StarDrop then proceeds as follows:
1. Splits the data into three sets: 50% of the structures into a training set, 25% into a validation set, 25% into a test set. Splitting is performed on the basis of Tanimoto coefficient clustering of molecules.
2. Builds a set of candidate models on the training set.
3. Applies all models to the independent validation set and selects the best model based on validation set fit.
4. Applies this model to the test set to provide an independent measure of model accuracy.
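To show why an error of this size matters, the following Python sketch converts the free energy of the ammonia reaction quoted above, and the two ends of its error range, into equilibrium constants using the standard relation K = exp(−∆G/RT). The relation is textbook thermodynamics rather than part of the dataset; the function and variable names are hypothetical.

```python
import math

R = 8.314462618e-3  # gas constant, kJ/mol/K
T = 298.15          # 25 °C expressed in K


def equilibrium_constant(delta_g_kj_mol, temperature=T):
    """Return K = exp(-dG / RT) for a reaction free energy in kJ/mol."""
    return math.exp(-delta_g_kj_mol / (R * temperature))


DELTA_G = -16.327  # kJ/mol, free energy of reaction quoted in the text
ERROR = 38.0       # kJ/mol, error considered in the text

k_low = equilibrium_constant(DELTA_G - ERROR)   # dG = -54.3 kJ/mol
k_mid = equilibrium_constant(DELTA_G)           # dG = -16.3 kJ/mol
k_high = equilibrium_constant(DELTA_G + ERROR)  # dG = +21.7 kJ/mol

for label, k in (("low", k_low), ("quoted", k_mid), ("high", k_high)):
    print(f"{label:>6}: K = {k:.3g}")
```

Because K depends exponentially on ∆G, the two ends of the error range differ by many orders of magnitude in K, which is the difference between a mixture dominated by NH₃ and one containing only a trace of it.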

Modelling Enthalpy of Formation
The method above was used to model the enthalpy of formation based on the measured values in the dataset described in Section 2.2.1. Using atom counts and PM7 semiempirical QM output as inputs, a radial basis function (RBF) model was found to give the best prediction, with r² = 0.997 and RMS error of 24.33 kJ/mol on the test set. Adding bond counts or ab initio QM calculations to the atom counts did not significantly change the accuracy of the model. Model performance on the validation and test data subsets is shown in Figure 4 (because a fitted radial basis function is required to pass through all the training data points, the training data are always exactly matched).
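The remark that an RBF fit reproduces its training data exactly can be illustrated with a generic one-dimensional Gaussian RBF interpolation in pure Python. This is a textbook sketch, not StarDrop's implementation: the kernel, its width, the toy data, and all function names are assumptions chosen for illustration, and the real model uses PM7 output and atom counts as inputs rather than a single variable.

```python
import math


def gaussian(r, width=1.0):
    """Gaussian radial basis function of the distance r."""
    return math.exp(-(r / width) ** 2)


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_rbf(xs, ys):
    """Fit one weight per training point so the model passes through them all."""
    kernel = [[gaussian(abs(xi - xj)) for xj in xs] for xi in xs]
    weights = solve(kernel, ys)

    def predict(x):
        return sum(w * gaussian(abs(x - xi)) for w, xi in zip(weights, xs))

    return predict


# Hypothetical toy training data (single input variable).
train_x = [0.0, 1.0, 2.0, 3.0]
train_y = [1.0, -0.5, 2.0, 0.3]

model = fit_rbf(train_x, train_y)
residuals = [abs(model(x) - y) for x, y in zip(train_x, train_y)]
print("largest training residual:", max(residuals))
```

Since the number of weights equals the number of training points, the training residuals are zero to numerical precision, which is why only the validation and test subsets carry information about model accuracy.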

Comparison with Other Methods
A wide range of methods have been used to calculate enthalpy, so it is useful to benchmark this method against them. Published data on model performance are rarely comparable, as they are tested on different sets of molecules. Those that are benchmarked against similar molecule sets usually select chemically limited molecules (e.g., alkanes), which do not represent the chemical diversity we are capturing with this work. We therefore used the same method as described above to develop models optimized for our dataset, but based on different input parameters, specifically, ab initio quantum mechanics, semiempirical quantum mechanics, and group contribution. Ab initio QM methods were implemented in GAMESS [58]. Because of the diversity of the compounds being considered in this work, the only group contribution method that can be applied is to consider the smallest possible 'group': two atoms joined by a bond. This is the same as calculating the enthalpy of a molecule as being the sum of the enthalpy of formation of its component bonds. We deployed this sum_of_bonds method here. The results are summarized in Table 9; more details on the methods used and the performance of specific methods are given in Appendix A. We emphasize that much better performance can be obtained with all the methods listed in Table 9 for more limited chemical spaces, and group contribution methods can be used for them. However, as our goal was to predict the thermodynamic properties of any covalent molecule containing any of 15 elements, our approach of semiempirical QM corrected by atom counts in an RBF model is the most accurate solution.

Entropy Modelling
A prediction accuracy of 29 J/mol/K in predicting entropy is also insufficient for our purposes, and so we also sought to improve the accuracy of entropy prediction. Entropy modelling was performed using the same procedure as enthalpy modelling. StarDrop modelling was then performed as described above to correct the GAMESS output based on element counts. The best model fit was found to be GP2DSearch, with r² = 0.9248 and RMS error of 12.85 on the test set. Model performance on the three data subsets is shown in Figure 5.
We note that the semiempirical methods make a number of simplifications that could contribute to the inaccuracy of prediction of enthalpy and entropy. For example, entropy calculations do not include conformational terms, which could contribute significantly for some molecules. These will not be adequately corrected by any modelling that includes just atom or bond counts, such as the modelling described above. Thus there is room for further work to improve the predictions of thermodynamic parameters reported here.
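The two fit statistics quoted above can be computed as in the following minimal sketch. It assumes that r² is the coefficient of determination (the text does not say which definition the modelling software reports), and the entropy values are made up for illustration.

```python
import math

def rms_error(predicted, measured):
    """Root mean square difference between predicted and measured values."""
    if len(predicted) != len(measured) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - m) ** 2 for p, m in zip(predicted, measured)]
    return math.sqrt(sum(squared) / len(squared))

def r_squared(predicted, measured):
    """Coefficient of determination, 1 - SS_res / SS_tot (assumed definition)."""
    mean_measured = sum(measured) / len(measured)
    ss_res = sum((m - p) ** 2 for p, m in zip(predicted, measured))
    ss_tot = sum((m - mean_measured) ** 2 for m in measured)
    return 1.0 - ss_res / ss_tot

# Made-up entropies in J/mol/K, for illustration only (not from the dataset).
measured = [200.0, 250.0, 300.0, 350.0]
predicted = [210.0, 240.0, 310.0, 340.0]

print("RMS error:", rms_error(predicted, measured))
print("r squared:", r_squared(predicted, measured))
```

Both statistics should be evaluated on molecules held out from model fitting, as the text does with its test set.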

Change in Enthalpy and Entropy with Temperature
In contrast with enthalpy of formation and entropy, the change in enthalpy and entropy with temperature was well predicted by the PM7 method, as shown in Figure 6.
Figure 6. Errors in semiempirical prediction of the difference between enthalpy at 298 K and enthalpy at other temperatures (H−H°) and the entropy at 298 K and at other temperatures (S−S°). X-axis: temperature. Y-axis: root mean square difference between predicted S−S° and actual S−S° (left axis) and predicted H−H° and actual H−H° (right axis). Actual values are taken from [19]. This shows that the semiempirical methods correctly predict the change in enthalpy of formation and change in entropy of the reference molecule set with temperature with an error of less than the average range of measured values.
We therefore used the PM7 predicted values for the change in enthalpy and entropy with temperature without further adjustment (except to convert from calories to joules). We note, however, that the MOPAC semiempirical methods do not predict the change in enthalpy of formation. The PM7 output provides for
$$\Delta H = \Delta H^\circ + [H - H^\circ]$$
whereas a prediction of ∆H should calculate
$$\Delta H = \Delta H^\circ + [H - H^\circ] - \sum_e n_e \, [H - H^\circ]_e$$
where ∆H° is the enthalpy of formation of the compound at 298 K (modelled in Section 3.2.2 above), $[H - H^\circ]$ is the increase in absolute enthalpy of the compound between 298 K and the target temperature, $n_e$ is the number of atoms of element e in the molecule, and $[H - H^\circ]_e$ is the increase in absolute enthalpy of element e between 298 K and the target temperature.
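As a minimal sketch of the correction described above, the function below adds the compound's enthalpy increment to the 298 K enthalpy of formation and subtracts the increments of the constituent elements. The function name, argument layout, and all numbers are hypothetical. It assumes the increments are supplied in cal/mol, and that each element increment is expressed per mole of atoms of the element in its standard state.

```python
CAL_TO_J = 4.184  # thermochemical calorie in joules

def enthalpy_of_formation_at_T(dHf_298_kJ, compound_increment_cal,
                               atom_counts, element_increment_cal):
    """Enthalpy of formation at the target temperature, in kJ/mol.

    dHf_298_kJ             -- enthalpy of formation at 298 K (kJ/mol)
    compound_increment_cal -- [H - H0] of the compound (cal/mol, assumed unit)
    atom_counts            -- {element symbol: n_e}
    element_increment_cal  -- {element symbol: [H - H0]_e per mole of atoms (cal/mol)}
    """
    compound_kJ = compound_increment_cal * CAL_TO_J / 1000.0
    elements_cal = sum(n * element_increment_cal[e] for e, n in atom_counts.items())
    elements_kJ = elements_cal * CAL_TO_J / 1000.0
    return dHf_298_kJ + compound_kJ - elements_kJ

# Made-up numbers for a hypothetical CO2-like molecule, for illustration only.
example = enthalpy_of_formation_at_T(
    dHf_298_kJ=-100.0,
    compound_increment_cal=2000.0,
    atom_counts={"C": 1, "O": 2},
    element_increment_cal={"C": 300.0, "O": 700.0},
)
print(example)
```

For a diatomic element such as O2, the per-atom increment used here would be half of the molecular value.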

Application to the All Small Molecules (ASM) Database
The models described above were run on the All Small Molecules (ASM) dataset [1] to provide predicted Gibbs free energy of formation data for those molecules, which is applicable to calculating equilibria in the gas phase at constant temperature and pressure. Modelling was performed exactly as above, and ∆G was calculated according to Equations (2)-(4). Both the inputs to the models and the outputs from the models are included in the data file accompanying this dataset so that others can develop improved models.
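The step from predicted enthalpy and entropy to a Gibbs free energy of formation can be sketched as follows. The paper's Equations (2)-(4) are not reproduced in this section, so the sketch assumes the standard relation ∆G = ∆H − T∆S, with ∆S taken as the entropy of the compound minus the entropies of its constituent elements. The function name is hypothetical, and the numbers are approximate textbook values for water vapour, used only as a sanity check.

```python
def gibbs_energy_of_formation(dH_kJ, S_compound, atom_counts, S_elements, T):
    """Gibbs free energy of formation in kJ/mol (assumed relation dG = dH - T*dS).

    dH_kJ       -- enthalpy of formation at temperature T (kJ/mol)
    S_compound  -- absolute entropy of the compound at T (J/mol/K)
    atom_counts -- {element symbol: number of atoms in the molecule}
    S_elements  -- {element symbol: entropy per mole of atoms of the
                    element in its standard state (J/mol/K)}
    T           -- temperature (K)
    """
    dS = S_compound - sum(n * S_elements[e] for e, n in atom_counts.items())
    return dH_kJ - T * dS / 1000.0  # divide by 1000 to convert J to kJ

# Approximate textbook values for H2O(g) at 298.15 K, for illustration only.
# Element entropies are per atom: half of the H2 and O2 molar entropies.
dG_water = gibbs_energy_of_formation(
    dH_kJ=-241.8,
    S_compound=188.8,
    atom_counts={"H": 2, "O": 1},
    S_elements={"H": 65.35, "O": 102.6},
    T=298.15,
)
print(round(dG_water, 2))
```

The result is close to the commonly tabulated value of about −228.6 kJ/mol for water vapour, which suggests that the sign and unit conventions in the sketch are self-consistent.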
We note that the ASM molecule list is a list of small molecules with a wide range of volatility (with boiling point as a proxy for volatility; see [1] for details on the ASM molecule selection process and the creation of the ASM database itself). Some molecules in the list, such as urea or glycine, are very unlikely to be stably present in the gas phase except at extremely low pressures. The calculations presented in this work are for the gas phase only. However, we included the results for less volatile molecules here as well for two reasons. First, this work could be extended with estimates of heats and entropies of vaporization to predict enthalpy and entropy of the solid state. The gas-phase data, therefore, act as a base on which further work could be built. Second, it is possible that such chemical species could be fleeting intermediates in gas-phase chemistry (as phosphorous acid has been proposed to be in the phosphorus chemistry of Venus's lower atmosphere, despite its thermal instability below the clouds [50]). The thermodynamics of such less volatile molecules could therefore be of interest for modelling such processes. Future work will seek to build comparable models for solid-phase thermodynamics, and hence for heats of vaporization, so that such mixed-phase calculations can be performed.
The same calculation of ∆G, starting from PM6 and PM7 output and atom counts, was performed for the 172 molecules from the [19] dataset used above in Section 3.2.5. For these molecules, measured values of ∆H and ∆S are known, and so a 'measured' value of ∆G can be derived. The root mean square difference between ∆G calculated from semiempirical QM methods and atom counts as described and that tabulated in [19] is shown in Figure 7.
We note that the [19] set of compounds does not include any As, Se, or Ge compounds, and so this is only an estimate of the error in the wider dataset. The error we expect in the set of compounds is
$$e(\Delta G) = e(\Delta H) + e(\Delta S) \cdot T \tag{5}$$
where e(∆G) = RMS error in ∆G, e(∆H) = RMS error in ∆H, e(∆S) = RMS error in ∆S, and T = temperature. Surprisingly, the error in ∆G is substantially smaller than this estimate. This suggests that errors in estimating the various input parameters are not independent and partially cancel each other when values of ∆H and S estimated from semiempirical QM methods are used to calculate ∆G.
Figure 7. Accuracy of ∆G estimates for the [19] set of compounds. Y-axis: RMS error (kJ/mol). X-axis: temperature (K). Blue line: RMS error in ∆G. Red line: sum of RMS errors in ∆H and T·∆S, calculated from the same set of molecules according to Equation (5). This shows that the overall error in the prediction of ∆G for the reference dataset is of the same order as the range of experimental values for the enthalpy of formation.
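The comparison between the expected error and the observed RMS error in ∆G can be illustrated with the minimal sketch below. It assumes that ∆G = ∆H − T∆S, so the per-molecule error in ∆G is the enthalpy error minus T times the entropy error. The function names and per-molecule errors are hypothetical, chosen so that the two error sources are correlated.

```python
import math

def rms(values):
    """Root mean square of a list of per-molecule errors."""
    return math.sqrt(sum(v * v for v in values) / len(values))

def compare_errors(dH_errors_kJ, dS_errors_J, T):
    """Return (expected error e(dH) + T*e(dS), observed RMS error in dG), in kJ/mol."""
    expected = rms(dH_errors_kJ) + T * rms(dS_errors_J) / 1000.0
    # Per-molecule error in dG, assuming dG = dH - T*dS.
    dG_errors = [eh - T * es / 1000.0 for eh, es in zip(dH_errors_kJ, dS_errors_J)]
    return expected, rms(dG_errors)

# Made-up, positively correlated errors (kJ/mol and J/mol/K), for illustration only.
dH_errors = [10.0, -10.0, 5.0, -5.0]
dS_errors = [10.0, -10.0, 5.0, -5.0]
expected, observed = compare_errors(dH_errors, dS_errors, T=500.0)
print(expected, observed)
```

The sum of the two RMS errors is an upper limit on the RMS error in ∆G, reached only when the two error sources always reinforce each other. An observed error well below that limit is therefore consistent with the partial cancellation the text describes.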
Appendix A
Ab initio QM calculations on the enthalpy test set were done using GAMESS [58]. Calculations were done using DFT with B3LYP at the 3-21 level of accuracy. Higher levels of accuracy frequently failed to converge for molecules containing atoms heavier than neon, and many are not parameterized for atoms heavier than argon. Data were extracted from the output file for the absolute energy (AE) and five other intermediate energy contributions: total potential energy, total kinetic energy, 1-electron energy, 2-electron energy, nuclear repulsion energy and nuclear-electron interaction energy. Models were built with just the total Ab Initio energy ("Ab Initio") or using all six energy measures as input ("Ab Initio Plus").
In principle the enthalpy of formation of a compound can be determined from the absolute energy by subtracting the absolute energy of the elements from which the compound is composed. We attempted two approaches to this. The first was to optimize the values of $E_e$ in Equation (A1)
$$\Delta H = A - \sum_e n_e E_e \tag{A1}$$
where A is the absolute energy of the molecule as calculated by GAMESS, $n_e$ is the number of atoms of element e in the molecule and $E_e$ is the notional energy of that element in the standard state. Values of $E_e$ were adjusted using a simulated annealing approach [59] to minimize the RMS error in predicting ∆H. This approach resulted in some $E_e$ values that were positive, which is unphysical, and in any case produced a poor match (Method 1 in Table A1). We therefore used the StarDrop model building software to build a model from either Ab Initio (Method 2) or Ab Initio Plus (Method 3) data and the count of the number of atoms, using the same protocol as described in the main paper. This results in output that can be a non-linear function of the inputs, and improved performance considerably, at the expense of not being readily physically interpretable.
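The first approach can be sketched as a small simulated annealing loop that adjusts one notional element energy at a time to reduce the RMS error in ∆H. This is a generic illustration, not the procedure of [59]. The cooling schedule, step size, function names, and synthetic data are all assumptions, and the sketch assumes that absolute energies and enthalpies are already expressed in the same units.

```python
import math
import random

def predict_dH(A, atoms, E):
    """Estimate dH as A minus the sum over elements of n_e * E_e."""
    return A - sum(n * E[e] for e, n in atoms.items())

def rms_error(molecules, E):
    sq = [(predict_dH(m["A"], m["atoms"], E) - m["dH"]) ** 2 for m in molecules]
    return math.sqrt(sum(sq) / len(sq))

def anneal(molecules, start, steps=5000, t_start=1.0, t_end=1e-3, step=0.5, seed=0):
    """Perturb one element energy per step; accept uphill moves with Boltzmann probability."""
    rng = random.Random(seed)
    current, current_err = dict(start), rms_error(molecules, start)
    best, best_err = dict(current), current_err
    elements = sorted(current)
    for i in range(steps):
        temp = t_start * (t_end / t_start) ** (i / steps)  # geometric cooling
        trial = dict(current)
        trial[rng.choice(elements)] += rng.gauss(0.0, step)
        trial_err = rms_error(molecules, trial)
        if trial_err <= current_err or rng.random() < math.exp((current_err - trial_err) / temp):
            current, current_err = trial, trial_err
            if current_err < best_err:
                best, best_err = dict(current), current_err
    return best, best_err

# Synthetic data: absolute energies are generated from made-up element energies.
true_E = {"C": -38.0, "H": -0.5}
raw = [({"C": 1, "H": 4}, -0.03), ({"C": 2, "H": 6}, -0.03),
       ({"C": 2, "H": 4}, 0.02), ({"C": 3, "H": 8}, -0.04)]
molecules = [{"atoms": a, "dH": dH, "A": dH + sum(n * true_E[e] for e, n in a.items())}
             for a, dH in raw]

start = {"C": 0.0, "H": 0.0}
start_err = rms_error(molecules, start)
fitted, fitted_err = anneal(molecules, start)
positive = [e for e, v in fitted.items() if v > 0]  # unphysical values, if any
print(fitted, fitted_err, positive)
```

Checking for positive fitted energies, as in the last lines, mirrors the physical plausibility check mentioned in the text.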
An approximation to the total enthalpy of a molecule is the sum of the enthalpies of each of the bonds in the molecule. This is a substantial simplification of the energetics of a molecule; for example, it neglects aromatization energies, delocalization of electrons across several atoms, and partial bond structures (such as the overlap of bonds in the amide bond, which prevents rotation around the C-N bond in peptides). Despite these limitations, 'typical bond energies' are often cited in chemistry textbooks as meaningful, so we used StarDrop to model ∆H based solely on the counts of bonds between atoms, each combination of two atoms and one bond type (single, double, or triple) being counted separately. The result (Model 4) was a poor match, and this approach was not explored further.
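The bond-count descriptors described above can be sketched as follows: each bond is keyed by its two elements (in sorted order, so that C-H and H-C are the same type) and its bond order. The linear sum shown is the simplest possible model on these counts. The model in the text was built with StarDrop and need not be linear, and the per-bond contributions below are hypothetical.

```python
from collections import Counter

def bond_key(a, b, order):
    """Canonical key for a bond type: element pair in sorted order plus bond order."""
    first, second = sorted((a, b))
    return (first, second, order)

def bond_counts(bonds):
    """Count bonds by type from a list of (element, element, order) tuples."""
    return Counter(bond_key(a, b, order) for a, b, order in bonds)

def sum_of_bonds(counts, contributions):
    """Linear estimate: sum of per-bond contributions (raises KeyError for unseen types)."""
    return sum(n * contributions[key] for key, n in counts.items())

# Formaldehyde, H2C=O, written as an explicit bond list.
formaldehyde = [("H", "C", 1), ("C", "H", 1), ("O", "C", 2)]
counts = bond_counts(formaldehyde)

# Hypothetical per-bond contributions in kJ/mol, for illustration only.
contributions = {("C", "H", 1): -10.0, ("C", "O", 2): -90.0}
estimate = sum_of_bonds(counts, contributions)
print(dict(counts), estimate)
```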
Semiempirical methods PM3, PM6 and PM7 (Methods 4, 5, 6) are included here for comparison. All are better than any ab initio calculation in this modelling, but none are good enough for useful chemical prediction. We therefore sought to reduce the errors in the semiempirical method by including data on the atom (model 8) or bond (model 9) counts in the StarDrop input. Including atom counts reduced errors by ~40%, and so this approach was adopted for the main paper. Including bond data resulted in slightly poorer performance, which was unexpected, as the atom count data are implicit in the bond count data. The poorer performance may simply be due to the sparsely populated, wide input dataset: there are 15 elements but 151 bond types as input, many of which are present in only 1 or 2 molecules, and the noise in these data may overwhelm any realistic modelling.
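A minimal sketch of how a semiempirical value and atom counts might be combined into one input row, and how sparsely populated descriptor columns can be flagged, is shown below. The function names, the element list, and the PM7 value are hypothetical, and the formula parser handles only simple formulas without parentheses or charges.

```python
import re
from collections import Counter

def atom_counts(formula):
    """Count atoms in a simple molecular formula such as 'C2H6O'."""
    counts = Counter()
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] += int(number) if number else 1
    return counts

def descriptor_row(pm7_dH, formula, elements):
    """One model input row: semiempirical enthalpy followed by per-element atom counts."""
    counts = atom_counts(formula)
    return [pm7_dH] + [counts.get(e, 0) for e in elements]

def sparse_columns(rows, names, max_molecules=2):
    """Names of count columns that are non-zero in at most max_molecules rows."""
    sparse = []
    for j, name in enumerate(names):
        present = sum(1 for row in rows if row[j] != 0)
        if present <= max_molecules:
            sparse.append(name)
    return sparse

elements = ["C", "H", "O", "N"]  # illustrative subset, not the paper's 15 elements
row = descriptor_row(-235.0, "C2H6O", elements)  # -235.0 is a made-up PM7 value

count_rows = [[atom_counts(f).get(e, 0) for e in elements]
              for f in ("CH4", "C2H6O", "CH5N")]
rare = sparse_columns(count_rows, elements)
print(row, rare)
```

Applying the same sparsity check to bond-type columns would show how many of them are populated by only one or two molecules, the situation the text suggests may add noise to the model.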