Calculation of VS,max and Its Use as a Descriptor for the Theoretical Calculation of pKa Values for Carboxylic Acids.

The theoretical calculation of pKa values for Brønsted acids is a challenging task that involves sophisticated and time-consuming methods. Therefore, heuristic approaches are efficient and appealing methodologies to approximate these values. Herein, we used the maximum surface electrostatic potential (VS,max) on the acidic hydrogen atoms of carboxylic acids to describe the H-bond interaction with water (the same descriptor that is used to characterize σ-bonded complexes) and correlate the results with experimental pKa values to obtain a predictive model for other carboxylic acids. We benchmarked six different methods, all including an implicit solvation model (water): Five density functionals and the Møller⁻Plesset second order perturbation theory in combination with six different basis sets for a total of thirty-six levels of theory. The ωB97X-D/cc-pVDZ level of theory stood out as the best one for consistently reproducing the reported pKa values, with a predictive power of 98% correlation in a test set of ten other carboxylic acids.


Introduction
The hydrogen bond is a strong, directional, non-covalent interaction responsible for a large number of chemical phenomena spanning from chemistry to biochemistry [1][2][3], which has become a paradigm amongst the toolbox of chemical concepts [4,5]. Hydrogen bonding is also a major driving force of chemical reactivity. For instance, the deprotonation of Brønsted acids in aqueous media, where the reaction constant, K a , and its associated logarithmic quantity, pKa = −logKa, are an intrinsic characteristic of each acid. This process occurs through the abstraction of the acidic hydrogen atom using a water molecule (Equation (1)). It is commonly regarded that the deprotonation reaction is mainly promoted by electrostatic interactions, given the partial positive charge on the acid hydrogen. Characterization of a protic acid through the pKa values is of practical importance and usefulness in various steps of the chemical design rationale, and therefore, it is an important quantity.
The accurate prediction of pKa values for carboxylic acids by means of computational methods covers a wide range of potential applications from chemical design and biochemistry research to drug development [6][7][8]. However, calculating the equilibrium constant for the deprotonation reaction of a Brønsted acid implies the calculation of the Gibbs Free Energy change, (−∆G), which in turn Since the interaction of water molecules with the acidic hydrogen atom is not isotropic, some parallels between hydrogen and σ-hole bonded systems arise. The formation of these directional interactions implies the presence of an electrostatic potential maximum located on the opposite side of the O-H σ-bond, which can be quantified by the maximum surface electrostatic potential, V S,max . By assuming that the deprotonation of a carboxylic acid begins with the formation of a hydrogen bond with a water molecule, RCOOH···H 2 O, we propose the use of V S,max as a suitable descriptor for the strength of this interaction, which in turn correlates with the corresponding pKa values, in a similar fashion to how a σ-hole-based interaction is quantified in halogen or tetrel bonds. Previously, the nucleophilicities and electrophilicities of Lewis acids and bases, respectively, have been derived from the interpolation of the mutual dissociation energies [11].
Previous efforts for deriving suitable pKa descriptors from ab initio or DFT descriptors have been successfully published, in some cases mixing implicit and explicit solvation models [12]. Electrostatic properties, such as the total molecular electrostatic potential (MEP) on the acidic hydrogen, combined with the sum of the valence Natural Atomic Orbital (NAO) energies on the acidic atom and the leaving proton for amino acids and nucleotides exhibits a correlation coefficient R 2 = 0.91 [13] with the experimental pKa values. Monard and Jensen used various kinds of atomic charges of the conjugated phenolates, alkoxides, or thiolates, with the best correlations being observed for the atomic electrostatic charges from a Natural Population Analysis (NPA) calculated at the B3LYP/3-21G (R 2 = 0.995) and M06-2X/6-311G (R 2 = 0.986) levels of theory for alcohols and thiols, in implicit solvent, respectively. Other efforts include correlations on the excited states of photoacids [14] using Time Dependent DFT at the ωB97X-D/6-31G(d) level of theory for a family of hydroxyl-substituted aromatic compounds. QSPR models have yielded, for instance, a three parameters model which uses the MEP maxima, the number of carboxylic acid and amine groups for phenols, at the HF/6-31G(d,p) and B3LYP/6-31G(d,p) levels of theory (R 2 = 0.96) [15]. It involves a four parameters linear equation comprising the highest normal mode vibrational frequency, the partial positive and negative charges divided by the total surface area and a reactivity index, defined in terms of a population analysis on the frontier orbital HOMO (R 2 ca. 0.95 sic.) [16] for N-Base ligands at the semi empirical AM1 level of theory, as well as a Principal Components Analysis (PCA) for organic and inorganic acids (RMSE = 0.0195) [17]. Moreover, genetic algorithms (GA) and neural networks (NN) have employed frontier orbital energies for a chemical space of sixty commercial drugs [18] (GA, R 2 = 0.703; NN, R 2 = 0.929). Thus far, the only major commercial program capable of including the effects of molecular conformations on the estimation of pKa values is 'Jaguar pKa [19,20]. For more thorough reviews on the development of pKa descriptors, please refer to References [21][22][23].
Non-covalent interactions like tetrel [24], pnicogen [25], chalcogen [26,27], carbon [28], and halogen [29][30][31] bonds offer some resemblances to H-bonded systems, both in structural and reactivity terms. All these forms of bonding correspond to directional, intermolecular non-covalent interactions of an electrostatic nature involving elements in groups 14 through 17, respectively. These atoms behave as electrophiles through the interaction with either n or π electrons from Lewis bases [32,33]. The formation of these non-covalent interactions stems from a similar origin, via the presence of σ-holes [34][35][36], a localized region of positive electrostatic potential on the surface of the bridge atom (prominently present in atoms of group 17), and opposite to the internuclear axis of one of the covalent σ bonds, hence the σ-holes. A stretch of this label has been applied to hydrogen bonding, despite the absence of p electrons on hydrogen atoms and the high polarizability of the hydrogen bonds [37][38][39].
Energetically, the strength of these interactions increases as the bridging atom increases its atomic number, the electronegativity of the atom bonded opposite to the non-covalent bond, and the number of electron-withdrawing groups bonded to the bridging atom. Tetrel bonds, for instance, are stabilizing interactions in nature [40,41] that form cooperative networks [26,[42][43][44][45][46], a feature that is used as a powerful tool for the design of crystal structures [47][48][49]. The stabilization arising from these interactions ranges from 1 kcal/mol to 50 kcal/mol [50]. Therefore, the formation and strength of these interactions closely depends on the polarization of the electron density surrounding the bridging atom. In the particular case of tetrel bonds, these factors have been extensively investigated by Scheiner, who has further assessed the electronic [50,51] and steric [52] contributions.
Several computational studies on the nature of tetrel bonds have been published so far, from their strict quantum treatment [53] to their charge transfer dynamics in the attoseconds regime [54], and the tunneling bond-breaking processes promoted by σ-holes [55].
Thus, the importance of the study of non-covalent interactions has large implications for crystal engineering [56], biochemistry, and the understanding of chemical reactivity [57][58][59]. In our research group, we have reported the chemical reduction of a trichloromethyl group into a methyl group via the attack of σ-holes on chlorine atoms by thiophenolate anions, a reaction mechanism which is extensible to other trichloromethyl compounds [60].
Herein, we presented a benchmark of linear models which correlate the V S,max calculations with various DFT methods, and used MP2 as a reference (see methodology section), to the pKa values of carboxylic acids. Physically, the obtained value of V S,max on the acidic hydrogen atom reflects the attractive interaction between it and a water molecule, and thus in turn can be used to describe the deprotonation process in electrostatic terms.

Results
Thirty (30) different carboxylic acids with reported pKa values were selected from Lange s Handbook of Chemistry [61], and they were optimized and the surface electrostatic potential calculated (see methods section for full details). The structures of the acids are shown in Figure 1. The levels of theory used were obtained from the combination of the following functionals: ωB97X-D (A) [62], B3LYP (B) [63], LC-ωPBE (C) [64], M06-2X (D) [65] and PBE0 (E) [66], as well as the Møller-Plesset second-order perturbation theory, MP2 (F), and the following basis sets, 6-31+G(d,p) (1), 6-311++G(d,p) (2), cc-pVDZ (3), cc-pVTZ (4), aug-cc-pVTZ (5), and Def2-TZVP (6). In total, thirty-six levels of theory were used to calculate the electronic structure of the thirty carboxylic acids, which comprised the chemical space under study for a total of 1080 different wave functions, upon which the maximum surface potential, VS,max, was calculated and plotted against the experimental pKaexp value. Our model was based on simple linear regressions to obtain the best fittings. The VS,max on each acidic hydrogen atom was used for the correlations, as an example, Figure  2 depicts the location of VS,max on the acid hydrogen atom for compound 14. This value was calculated on the isodensity surface ρ = 0.001 a.u., and it was used as a descriptor for the magnitude of the attractive interaction RCOOH···H2O. In total, thirty-six levels of theory were used to calculate the electronic structure of the thirty carboxylic acids, which comprised the chemical space under study for a total of 1080 different wave functions, upon which the maximum surface potential, V S,max , was calculated and plotted against the experimental pKa exp value. Our model was based on simple linear regressions to obtain the best fittings. The V S,max on each acidic hydrogen atom was used for the correlations, as an example, Figure 2 depicts the location of V S,max on the acid hydrogen atom for compound 14. This value was calculated on the isodensity surface = 0.001 a.u., and it was used as a descriptor for the magnitude of the attractive interaction RCOOH···H 2 O. All the correlation coefficients, slopes, and intercepts for all thirty-six levels of theory are collected in Table 1.  Figure 3 for method (A) only, the plots with the rest of the methods (B)-(F) are presented in the Supporting Information section ( Figures S3, S5, S7, S9 and S11). All the correlation coefficients, slopes, and intercepts for all thirty-six levels of theory are collected in Table 1. The obtained linear model is shown in Figure 3 for method (A) only, the plots with the rest of the methods (B)-(F) are presented in the Supporting Information section ( Figures S3, S5, S7, S9 and S11).
A physical interpretation of the trends observed in Figure 3 can be rationalized in terms of the polarization of the O-H bond in the carboxylic acid motif. When the electron density of this bond was more polarized towards the oxygen atom, then the hydrogen atom possessed a more positive electrostatic potential, at the same time it was more labile and readily available for water to abstract it, thus having a lower pKa.
To further analyze the obtained models, a comparison between the experimental and calculated pKa values was made by calculating the ∆pKa = pKa exp − pKa cal . Figure 4 shows these plots for the results obtained with the functional (A), where the corresponding ∆pKa plots for the other levels of theory are collected in the Supporting Information section (Figures S4, S6, S8, S10 and S12).  A physical interpretation of the trends observed in Figure 3 can be rationalized in terms of the polarization of the O-H bond in the carboxylic acid motif. When the electron density of this bond was more polarized towards the oxygen atom, then the hydrogen atom possessed a more positive electrostatic potential, at the same time it was more labile and readily available for water to abstract it, thus having a lower pKa.
To further analyze the obtained models, a comparison between the experimental and calculated pKa values was made by calculating the ΔpKa = pKaexp − pKacal. Figure 4 shows these plots for the results obtained with the functional (A), where the corresponding ΔpKa plots for the other levels of theory are collected in the Supporting Information section (Figures S4, S6, S8, S10 and S12).  The set of models obtained for functionals (C) and (A) had the highest correlation values across the basis sets employed (see the discussion in Section 3 for further results analysis). Particularly, the A3 level of theory (ωB97X-D/cc-pVDZ) exhibited simultaneously, a high correlation (R 2 = 0.9680) and the lowest ΔpKa values. Table S8 shows the pKa intervals for all levels of theory and it can be observed that all (C) models have a ΔpKa interval above 1.  The set of models obtained for functionals (C) and (A) had the highest correlation values across the basis sets employed (see the discussion in Section 3 for further results analysis). Particularly, the A3 level of theory (ωB97X-D/cc-pVDZ) exhibited simultaneously, a high correlation (R 2 = 0.9680) and the lowest ∆pKa values. Table S8 shows the pKa intervals for all levels of theory and it can be observed that all (C) models have a ∆pKa interval above 1. To assess the predictive capabilities of our model, given by Equation (2), we built a test set with ten carboxylic acids ( Figure 5), with pKa values that lay within the range covered by the original training set. The experimental pKa values were reported in Reference [61] and were reproduced in Table 2, together with the calculated values for the test set and the differences, which lay in the range of ∆pKa = ±0.3 units. Figure 6 shows the remarkable correlation between the experimental and calculated values with a correlation coefficient R 2 = 0.9801. To assess the predictive capabilities of our model, given by Equation 2, we built a test set with ten carboxylic acids ( Figure 5), with pKa values that lay within the range covered by the original training set. The experimental pKa values were reported in Reference [61] and were reproduced in Table 2, together with the calculated values for the test set and the differences, which lay in the range of ΔpKa = ±0.3 units. Figure 6 shows the remarkable correlation between the experimental and calculated values with a correlation coefficient R 2 = 0.9801.

Computational Method: DFT or Ab Initio?
As a comparison standard, the Møller-Plesset second-order perturbation theory, MP2 (F), was included in the study, not only to assess its accuracy, but to compare the DFT and at least one wave function method as well. From all the tested levels of theory, the highest R 2 correlation coefficients (Table 1) between VS,max and pKaexp values were obtained consistently with the ab initio MP2 method. Nevertheless, the DFT functional ωB97X-D functional (A), yielded comparably similar results at a fraction of the computational cost. The lowest correlation coefficients were obtained with the B3LYP functional (B), which, despite being one of the most popular ones to model organic molecules, could be describing the surface electrostatic potential inadequately. A similar performance to that of B3LYP was observed for the PBE0 functional (E), which in turn, was slightly improved when long range corrections were included in the case of LC-ωPBE (C). The latter functional was thought to yield much better results due to this long-range correlation term; however, that was not the case.
The M06-2X functional (D) also showed to be properly describing the surface electrostatic potentials, as shown in the high correlation coefficients. This was plausibly because of the dispersion terms included in its formulation. The ωB97X-D functional included an empirical dispersion term which was added a posteriori to correct the energy, but not the electron density [67].
Although the M06-2X functional is widely used and regarded as probably the best functional to model organic reactions [65], in this case, it yielded a larger discrepancy in the ΔpKa plots than the plots obtained with ωB97X-D (Figures 4 and S8).
The fact that the MP2/cc-pVDZ and the ωB97X-D/cc-pVDZ yielded comparable results showed that for the case of modeling surface electrostatic potentials, a computationally expensive method may not always be preferred, as very similar or even better results can be obtained with a less demanding approach in just a fraction of the time.

Computational Method: DFT or Ab Initio?
As a comparison standard, the Møller-Plesset second-order perturbation theory, MP2 (F), was included in the study, not only to assess its accuracy, but to compare the DFT and at least one wave function method as well. From all the tested levels of theory, the highest R 2 correlation coefficients (Table 1) between V S,max and pKa exp values were obtained consistently with the ab initio MP2 method. Nevertheless, the DFT functional ωB97X-D functional (A), yielded comparably similar results at a fraction of the computational cost. The lowest correlation coefficients were obtained with the B3LYP functional (B), which, despite being one of the most popular ones to model organic molecules, could be describing the surface electrostatic potential inadequately. A similar performance to that of B3LYP was observed for the PBE0 functional (E), which in turn, was slightly improved when long range corrections were included in the case of LC-ωPBE (C). The latter functional was thought to yield much better results due to this long-range correlation term; however, that was not the case.
The M06-2X functional (D) also showed to be properly describing the surface electrostatic potentials, as shown in the high correlation coefficients. This was plausibly because of the dispersion terms included in its formulation. The ωB97X-D functional included an empirical dispersion term which was added a posteriori to correct the energy, but not the electron density [67].
Although the M06-2X functional is widely used and regarded as probably the best functional to model organic reactions [65], in this case, it yielded a larger discrepancy in the ∆pKa plots than the plots obtained with ωB97X-D (Figure 4 and Figure S8).
The fact that the MP2/cc-pVDZ and the ωB97X-D/cc-pVDZ yielded comparable results showed that for the case of modeling surface electrostatic potentials, a computationally expensive method may not always be preferred, as very similar or even better results can be obtained with a less demanding approach in just a fraction of the time.

Basis Set: Is Larger Better?
Most of the reported benchmarks to model organic molecules deal with the selection of the proper DFT functional methods [68][69][70]. However, little attention is paid to the basis set, or more precisely, to the proper functional/basis set combination (i.e., the level of theory). For further details, refer to Figures S13-S18, where the obtained models are organized by basis set.
Four out of the six methods yielded the strongest V S,max -pKa correlations when using the relatively medium size cc-pVDZ basis set. Surprisingly, the M06-2X functional presented the largest ∆pKa deviations when combined with the largest basis set aug-cc-pVTZ ( Figures S7 and S8).
In the case of the MP2 calculations ( Figure S12), increasing the so-called quality of the basis set may not be beneficial in all cases. When comparing the split-valence Pople's basis sets, practically the same correlation was found with the double-ζ set and the corresponding triple-ζ quality one, 0.9613 versus 0.9625, respectively. On the other hand, the Dunning-Huzinaga basis showed a decrease in correlation when increasing the set size from cc-pVDZ to cc-pVTZ, 0.9702 and 0.9661, respectively. However, the ∆pKa deviations were practically consistent among the MP2 levels of theory.
In terms of the difference between the experimental and correlated pKa values, the A3 level of theory yielded the smallest ∆pKa deviations, with most of the differences kept under 0.5 pKa units, showing that, for this case, a larger basis set size may not always be better.

A Final Remark
In the thirty-six levels of theory tested in this study, the calculation of the V S,max of three compounds (6, 7, and 12)  Most of the reported benchmarks to model organic molecules deal with the selection of the proper DFT functional methods [68][69][70]. However, little attention is paid to the basis set, or more precisely, to the proper functional/basis set combination (i.e., the level of theory). For further details, refer to Figures S13-S18, where the obtained models are organized by basis set.
Four out of the six methods yielded the strongest VS,max-pKa correlations when using the relatively medium size cc-pVDZ basis set. Surprisingly, the M06-2X functional presented the largest ΔpKa deviations when combined with the largest basis set aug-cc-pVTZ ( Figures S7 and S8).
In the case of the MP2 calculations ( Figure S12), increasing the so-called quality of the basis set may not be beneficial in all cases. When comparing the split-valence Pople's basis sets, practically the same correlation was found with the double-ζ set and the corresponding triple-ζ quality one, 0.9613 versus 0.9625, respectively. On the other hand, the Dunning-Huzinaga basis showed a decrease in correlation when increasing the set size from cc-pVDZ to cc-pVTZ, 0.9702 and 0.9661, respectively. However, the ΔpKa deviations were practically consistent among the MP2 levels of theory.
In terms of the difference between the experimental and correlated pKa values, the A3 level of theory yielded the smallest ΔpKa deviations, with most of the differences kept under 0.5 pKa units, showing that, for this case, a larger basis set size may not always be better.

A Final Remark
In the thirty-six levels of theory tested in this study, the calculation of the VS,max of three compounds (6, 7, and 12) required an average of conformers, where the angle (D1 = O=C-O-H) was either 0.0° or 180.0°. The conformation D1 = 180.0° was stable due to strong delocalization effects from nearby п bonds to the σ*O-H orbital in the acidic hydrogen atom or intramolecular hydrogen bonding with Lewis basic motifs (Figure 7). For such kind of compounds, further improvements are required in the methodology for our linear models. So far, the applicability domain of these regressions is limited by the pKa data used to construct the models (0.5 < pKa < 5.0). Caution must be taken when using the linear models presented herein for molecules outside this range.

Materials and Methods
Geometry optimizations and wave function printouts for the 30 carboxylic acids were performed using the Gaussian 09 rev. E01 suite of programs as in Reference [71], at each of the different levels of theory (see text). All calculations included the Conductor-like Polarizable Continuum Model (CPCM) implicit solvation model (water) as described in References [72,73]. The radii for cavity construction was the UFF default which takes the radii from the UFF (Universal Force Field) scaled by 1.1 with explicit spheres for hydrogen atoms. Frequency analyses were performed at the end of each geometry optimization at the same level of theory to verify that the found geometries corresponded to the energy minima. The ultrafine integration grid was used in all the calculations.
The maximum surface potential (VS,max) calculations were performed on the wave function files with the 'MultiWFN' program, version 3.3.8 as in Reference [74], using an isodensity value of 0.001 a.u. All the computed values were collected in the Supporting Information (Tables S1-S6). So far, the applicability domain of these regressions is limited by the pKa data used to construct the models (0.5 < pKa < 5.0). Caution must be taken when using the linear models presented herein for molecules outside this range.

Materials and Methods
Geometry optimizations and wave function printouts for the 30 carboxylic acids were performed using the Gaussian 09 rev. E01 suite of programs as in Reference [71], at each of the different levels of theory (see text). All calculations included the Conductor-like Polarizable Continuum Model (CPCM) implicit solvation model (water) as described in References [72,73]. The radii for cavity construction was the UFF default which takes the radii from the UFF (Universal Force Field) scaled by 1.1 with explicit spheres for hydrogen atoms. Frequency analyses were performed at the end of each geometry optimization at the same level of theory to verify that the found geometries corresponded to the energy minima. The ultrafine integration grid was used in all the calculations.
The maximum surface potential (V S,max ) calculations were performed on the wave function files with the 'MultiWFN' program, version 3.3.8 as in Reference [74], using an isodensity value of 0.001 a.u. All the computed values were collected in the Supporting Information (Tables S1-S6).

Conclusions
V S,max is a scalar quantity that characterizes a σ-hole, and according to our calculations, it has also proven to be a suitable descriptor to be correlated with the pKa value of carboxylic acids, yielding differences in pKa of high accuracy. ∆pKa = ±0.30 when the ωB97X-D/cc-pVDZ level of theory was used to calculate the associated electron density upon which the V S,max value was obtained. By means of straightforward DFT calculations with a simple implicit solvation model (CPCM), the value of the V S,max could be calculated and Equation (2) obtained herein, could be used to estimate the pKa values without the need for a full thermodynamic cycle calculation; thus, avoiding long computations of solvation free energies and other costly quantities which require high accuracy methods.
The ωB97X-D/cc-pVDZ level of theory (A3) yielded the lowest ∆pKa values, standing as the best choice for estimating the pKa of any given acid through the calculation of the V S,max . Hence, we highly recommend this level of theory for geometry optimization and wave function file print. Care must be taken as the pKa value sought after should be between 0.5 and 5.0 pH units, for this is the applicability domain of our resulting equations, given the chemical space covered herein.
Further testing is needed for these regression models to become universal. However, it is important to stress that V S,max has turned out to be a powerful descriptor for predicting the pKa values of carboxylic acids as it is reflected by low, yet distinguishable differences across all methods studied herein. The presence of intramolecular non-covalent interactions, for example, hydrogen bonding, as well as highly electron-delocalizing groups within the chemical space, are key features to consider in the inclusion of an average of the V S,max for the most stable conformers. Our proposed descriptor is also dependent of the isodensity value for the definition of the surface upon which it is calculated, and it is highly recommended to keep the value suggested by Bader et al. [75] of = 0.001 a.u. However, by taking these considerations into account as part of the parametrical requirements of Equation (2), then extremely accurate pKa results are obtained in a straightforward fashion.
Supplementary Materials: The following are available online, Tables S1-S6: Calculated V S,max values for carboxylic H atoms to the different levels of theory studied, Table S7: Reported pKa values for carboxylic acids studied. Figures S1, S3, S5, S7, S9 and S11: Correlation of pKa exp vs V S,max . Figures S2, S4, S6, S8, S10 and S12: Difference between the experimental and calculated pKa values (∆pKa exp−cal ).