A Density Functional Theory and Information-Theoretic Approach Study of Interaction Energy and Polarizability for Base Pairs and Peptides

Using density functional theory (DFT) and the information-theoretic approach (ITA) quantities to appreciate the energetics and properties of biopolymers is still an unaccomplished and ongoing task. To this end, we studied the building blocks of nucleic acid base pairs and small peptides. For base pairs, we have dissected the relative importance of energetic components by using two energy partition schemes in DFT. Our results convincingly show that the exchange-correlation effect predominantly governs the molecular stability of base pairs while the electrostatic potential plays a minor but indispensable role, and the steric effect is trivial. Furthermore, we have revealed that simple density-based ITA functions are in good relationships with molecular polarizabilities for a series of 30 hydrogen-bonded base pairs and all 20 natural α-amino acids, 400 dipeptides, and 8000 tripeptides. Based on these lines, one can easily predict the molecular polarizabilities of larger peptides, even proteins as long as the total molecular wavefunction is available, rather than solving the computationally demanding coupled-perturbed Hartree–Fock (CPHF) equation or its DFT counterpart coupled-perturbed Kohn–Sham (CPKS) equation.


Introduction
Nucleic acids and proteins are important biological polymers in living organisms. Proteins are responsible for the catalysis of biological processes, while nucleic acids or base pairs play a role as carriers of the genetic information. However, ab initio calculations of entire nucleic acids or proteins are still a tough nut to crack or even computationally intractable. Rather, one can easily delve into their building blocks, base pairs of nucleic acids and amino acids of proteins or peptides, respectively.
For base pairs, there exists a type of important noncovalent interaction: hydrogenbonding interactions, which could stabilize the two strands of nucleic acids (DNA and RNA). To figure out the origin and nature of hydrogen-bonding interactions in base pairs is still of heated discussion and an ongoing task in the literature [1][2][3][4][5]. Normally, the role of electrostatic potential is greatly emphasized [2], and in most cases, this is the dominantly important energetic component of the interaction energy. However, this is not the whole picture of H-bonded interactions! Parthasarathi et al. [4] have shown that the presence of critical points between H-bonded DNA base pairs is indicative of the closedshell kind of interactions. With Bader's quantum theory of atoms in molecules (QTAIM), Toczyłowski et al. [5] have revealed that the electrostatic and exchange energies are found to be the most important components of the overall interaction energy, although the dispersion and the induction energies also play important roles. Of note, the exchange term used is not the same quantity as that in DFT [6]. In this work, we also resort to the supermolecule

Validation
Hait et al. [22] have verified that hybrid density functionals are satisfactory in calculating molecular polarizabilities. A few popular density functionals, such as M06-2X [23], B3LYP [24,25], CAM-B3LYP [26], PBE0 [27] and ωB97XD [28] were chosen together with a series of atomic-centered Gaussian-type basis sets 6-311G(d,p) [29], Def2-SVP [30], Def2-TZVP [30] and aug-cc-pVTZ [31]. We calculated the molecular polarizabilities for a total of 20 natural α-amino acids (in its neutral form in the gas phase) at different theoretical levels as shown in Table 1. The MP2 results from the literature are also collected for comparison. Two statistical parameters, mean unsigned error (MUE) and mean signed error (MSE), are employed to gauge the quality of the method. The experimental values [32] of α iso are taken as reference to obtain MUE and MSE values. It is clearly shown that the combination of 6-311G(d,p) with 5 functionals from this work or MP2 (MSE =~16 Bohr 3 ) from the literature [33] leads to large deviations compared with the experiment. This is also the case for a small double-zeta basis set Def2-SVP, when combined with M06-2X, B3LYP, PBE0, and ωB97XD, with an exception of CAM-B3LYP. The reason behind is unknown. When it comes to a relatively larger triple-zeta basis set Def2-TZVP, the MSE dramatically decreases, for example from 18.4 (Def2-SVP) to 8.9 Bohr 3 for M06-2X. If an even better basis set augcc-pVTZ is used, the MSE is only 3.8 Bohr 3 . Of note, the relative deviation of 5 density functionals should be comparable when combined with, for example, Def2-TZVP. In addition, though B3LYP performs well in predicting molecular polarizabilities here, its failure in delineating dispersion will undoubtedly hinder the wide applicability of ITA quantities for various kinds of inorganic, organic and biological systems. The results will be presented in our forthcoming publication. Collectively, with both accuracy and efficiency taken into consideration, we selected M06-2X/Def2-TZVP for predicting molecular polarizabilities.

Total Energy Decomposition of Base Pairs
To determine the nature of hydrogen bonding interactions in base pairs is of heated discussion in the literature. In this work, we aim to use two energy decomposition schemes as mentioned above to dictate which energetic component is responsible for the hydrogen bonding interactions. Using the supramolecular approach, the interaction energy (∆E) and its components, including kinetic (∆T s ), exchange-correlation (∆E xc ), electrostatic potential (∆E e ), steric hindrance (∆E s ), and quantum effect (∆E q ), are obtained at the M06-2X/Def2-TZVP level, as collected in Table 2. Here, we only consider the electronic energy, with thermal contributions and BSSE (basis set superposition error) [34,35] corrections neglected. The total energy difference is always negative, indicating that hydrogen-bonding interactions are energetically favorable. In addition, one can see that exchange-correlation (∆E xc ), electrostatic potential (∆E e ), and steric hindrance (∆E s ) all make positive contributions to the total energy difference. However, the large negative values of steric hindrance are compensated by the positive Fermionic quantum effect in the new scheme. In the conventional scheme, the positive kinetic energy also cancels out part of the exchange-correlation effect and electrostatic potential. Next, we will figure out the relative importance of these three components. Shown in Figure 1 are the linear correlations between the total energy difference (∆E) and ∆E s , ∆E e , and ∆E xc . These lines indicate that single-variable linear regression cannot distinguish the relative importance of these three energetic components. To resolve this issue, two-variable fits are used to make that happen. Shown in Figure 2 are the strong linear correlations between the fitted and predicted total energy difference. The fitted regression equations for Figure 2a,

Molecular Polarizabilities of Base Pairs
In Table 3, we have collected the molecular polarizabilities, molecular volumes, and ITA quantities for 30 base pairs, which are obtained at the M06-2X/Def2-TZVP level. Also included in Table 3 are the correlation coefficients (R 2 ) between the molecular polarizabilities and molecular volumes and ITA quantities. One can observe that Shannon entropy  It is lucidly shown that it is the exchange-correlation effect that dominates the hydrogenbonding interactions, while the electrostatic potential also plays an important role and the steric hindrance is trivial.

Molecular Polarizabilities of Base Pairs
In Table 3, we have collected the molecular polarizabilities, molecular volumes, and ITA quantities for 30 base pairs, which are obtained at the M06-2X/Def2-TZVP level. Also included in Table 3 are the correlation coefficients (R 2 ) between the molecular polarizabilities and molecular volumes and ITA quantities. One can observe that Shannon entropy (S S ), Fisher information (I F ), 2nd and 3rd relative Reényi entropy ( r R 2 and r R 3 ), and G 3 are in strong linear relationships with molecular polarizabilities, with R 2 > 0.8. It is intriguing to note that the absolute values of G 3 are very close to molecular polarizabilities. However, the reason is unknown at present. In Table 4, we compare the polarizability results from conventional calculations, G 3 data, and those predicted on top of the molecular wavefunctions obtained at the M06-2X/Def2-TZVP level. One can easily observe that the G 3 data are very accurate in comparison to conventional polarizability data with MSE(%) to 1.5. However, the two sets of predicted data employing the original Tkatchenko-Scheffler (TS) formula [36] on top of Becke [37] or Hirshfeld [38] partitions are either strongly underestimated or overestimated, with MUE(%) up to -23.4 and 11.6, respectively. It is found that a mean value can greatly reduce the MSE(%) to 5.9. Though a rational theoretical explanation is lacking at the moment, we can take a careful look of the TS formula, , where the sum runs over all atoms in a molecule.
, measuring the volume ratio for atom A in a molecule to the free atom A in vacuum, can be obtained by the Becke or Hirshfeld partitioning of the electron density. Of note, α free A denotes the atomic polarizabilities and they have been obtained accurately [39]. Thus, the main problem should come from the weights. Very recently,

4/3 has been proposed for
small molecules with improved performance [40]. Yet, the performance of both original TS method and its variant for macromolecules requires more extensive work.

Molecular Polarizabilities of Amino Acids, Dipeptides and Tripeptides
We first cross-validate the accuracy of the molecular polarizabilities of 8000 tripeptides. The two theoretical levels are M06-2X/Def2-TZVP//M06-2X/Def2-SVP and B3LYP/6-31+G(d,p)//B3LYP/6-31G level, respectively. Shown in Figure 3 is the strong correlation coefficient (R 2 = 0.992) between the two sets of calculated molecular polarizabilities. The gap between the two methods is only 11.95 Bohr 3 , indicating that one can use a relatively cheaper theoretical method to obtain accurate results. In addition, we have found some strong correlations for a total of 20 amino acids, 400 dipeptides, and 8000 tripeptides between molecular polarizabilities and molecular volumes, GBP entropy (S GBP ), 2nd relative Rényi entropy ( r R 2 ), and G 3 as shown in Figure 4. The correlation coefficients are in the range of 0.87 to 0.95. We have to point out that molecular polarizabilities can be linearly correlated with molecular volumes at both atomic and molecular levels [41][42][43][44][45][46][47][48]. Here, we have verified that ITA quantities can serve as good indicators of molecular properties, in our case, molecular polarizability. This means that one can directly predict molecular polarizabilities of larger peptides, or even proteins. It is well-documented that it is time-consuming to solve the CPHF/CPKS equations [15][16][17]. When the molecular system becomes larger, it may be even intractable. For ITA quantities, as long as the total molecular wavefunction is obtained in a single-point calculation, numerical integration of ITA quantities normally requires much less time than iteratively solving the CPHF/CPKS equations. It is anticipated when it comes to real systems, like proteins, one can combine the linear-scaling fragment-based methods, such as GEBF (generalized-energy based fragmentation) [49][50][51][52], and ITA quantities to accurately and efficiently predict the molecular polarizabilities.
We first cross-validate the accuracy of the molecular polarizabilities of 8000 tripeptides. The two theoretical levels are M06-2X/Def2-TZVP//M06-2X/Def2-SVP and B3LYP/6-31+G(d,p)//B3LYP/6-31G level, respectively. Shown in Figure 3 is the strong correlation coefficient (R 2 = 0.992) between the two sets of calculated molecular polarizabilities. The gap between the two methods is only 11.95 Bohr 3 , indicating that one can use a relatively cheaper theoretical method to obtain accurate results. In addition, we have found some strong correlations for a total of 20 amino acids, 400 dipeptides, and 8000 tripeptides between molecular polarizabilities and molecular volumes, GBP entropy (SGBP), 2nd relative Rényi entropy ( r R2), and G3 as shown in Figure 4. The correlation coefficients are in the range of 0.87 to 0.95. We have to point out that molecular polarizabilities can be linearly correlated with molecular volumes at both atomic and molecular levels [41][42][43][44][45][46][47][48]. Here, we have verified that ITA quantities can serve as good indicators of molecular properties, in our case, molecular polarizability. This means that one can directly predict molecular polarizabilities of larger peptides, or even proteins. It is well-documented that it is timeconsuming to solve the CPHF/CPKS equations [15][16][17]. When the molecular system becomes larger, it may be even intractable. For ITA quantities, as long as the total molecular wavefunction is obtained in a single-point calculation, numerical integration of ITA quantities normally requires much less time than iteratively solving the CPHF/CPKS equations. It is anticipated when it comes to real systems, like proteins, one can combine the linearscaling fragment-based methods, such as GEBF (generalized-energy based fragmentation) [49][50][51][52], and ITA quantities to accurately and efficiently predict the molecular polarizabilities.

Discussion
We have systematically investigated the energetics of 30 base pairs and molecular polarizabilities of 30 base pairs and 20 amino acids, 400 dipeptides, and 8000 tripeptides. It is the first time that ITA quantities have been applied to correlate with molecular polarizabilities. Though at present, in theory we have not verified the existence of such a linear

Discussion
We have systematically investigated the energetics of 30 base pairs and molecular polarizabilities of 30 base pairs and 20 amino acids, 400 dipeptides, and 8000 tripeptides. It is the first time that ITA quantities have been applied to correlate with molecular polarizabilities. Though at present, in theory we have not verified the existence of such a linear correlation between ITA quantities and molecular polarizabilities, they are both related to electron density and its derivatives. Additionally, our results presented here demonstrate that one can determine the molecular polarizability by numerically integrating the simple density-based functions when the molecular wavefunction of a given system is obtained from single-point calculations, rather than resorting to iteratively obtaining the molecular orbital derivatives. The implication of our work is straightforward, that one can use the linear relationships as shown in the text to predict molecular polarizabilities of relatively larger peptides, or even larger proteins. Can the ITA quantities be widely applied to various inorganic, organic and biological systems? More work along this line is required.
We have to point out that we are not the first to predict the molecular polarizabilities just as employing the electron density as the input. Jayatilaka et al. [53] has cast the molecular polarizability in terms of moments of the ground-state electron density matrix and the results are reasonably good against CPHF results. The corresponding performance for hyperpolarizability is far from satisfactory. Since hyperpolarizability is not considered in this work, we will figure out if the abovementioned density-based ITA functions can predict the molecular hyperpolarizability well.

Energy Decomposition Schemes in DFT
In Kohn-Shan DFT, the total energy difference (∆E) can be decomposed into its components via [3,4] ∆E and where T s , E e , and E xc are the noninteracting kinetic, electrostatic, and exchange-correlation terms, respectively. The electrostatic potential E e has three components: the nuclear−electron attraction, (V ne ), the classical Coulombic repulsion, (J), and the nuclear−nuclear repulsion (V nn ). The last term, E xc , consists of exchange (E x ) and correlation (E c ) components. In Equation (2), E s stands for the steric hindrance, and E q signifies the contribution from Fermionic quantum effect. The steric effect E s has been proved to be simply the Weizsäcker kinetic energy, Combining Equations (1)-(3), one can simply define E q , which reads This new formulation has its own distinct physical meaning with a corresponding physical state. It has been applied to a number of molecular systems and phenomena, whose results are consistent with our chemical intuition and conventional wisdom [54].

Information-Theoretic Approach Quantities
Shannon entropy S S [8] is a measure of the spatial delocalization of the electron density, and Fisher information I F [9] measures the sharpness or localization of the same. They are defined as Equations (5) and (6), respectively.
Additionally, for atoms and molecules, I F has an equivalent expression [55] in terms of the Laplacian of the electron density, ∇ 2 ρ(r).
Ghosh−Berkowitz−Parr (GBP) entropy S GBP [10] where t(r; ρ) is the kinetic energy density, which is related to the total kinetic energy T S via t(r; ρ)dr = T S (9) t TF (r; ρ) is the Thomas−Fermi (TF) kinetic energy density given by with K as the Boltzmann constant (set to be unity for convenience in this work), c = (5/3) + ln(4πc K /3), and c K = (3/10)(3π 2 ) 2/3 , the specific form of the local kinetic energy More recently, additional ITA quantities have been introduced as new reactivity descriptors in conceptual density functional theory (CDFT) [11]. One example is Onicescu information energy [11] of order n E n = 1 n − 1 ρ n (r)dr (12) relative Rényi entropy [8] of order n and information gain [12] (also called Kullback−Leibler divergence) I G is given in Equation (14) where ρ 0 (r) is the reference-state density satisfying the same normalization condition as ρ(r). Very recently [13,14], we have proposed another three functions G 1 , G 2 , and G 3 , whose analytical forms as shown below: The G 2 function involves Laplacian contribution and can be regarded as the relative Laplacian contribution to the steric potential. The quantifications and applications of Equations (15)- (17) can be found in [14]. Suffice to note, during the past decade, we have attempted to seamlessly glue the density functional theory and information theory together, as electron density can be used as a linker of these two theories. The progress and applications can be found in our recent review [54]. For example, very recently we have applied the information-theoretic approach to appreciate homochirality [56,57], which is another very fundamental problem in biology and we will look into it in the near future.

Amino Acids, Dipeptides, and Tripeptides
Shown in Scheme 2 are the 20 natural amino acids, directly taken from the template of the GaussView [58] program. We used the tleap module in the AmberTools package to generate a series of capped (ACE and NME) dipeptides (400) and tripeptides (8000). For each of the peptides, a simulation box was built with water molecules and Na + and Cl − ions. We performed a total of 1 ns MD simulations with the ff19SB [59] force field for the

Amino Acids, Dipeptides, and Tripeptides
Shown in Scheme 2 are the 20 natural amino acids, directly taken from the template of the GaussView [58] program. We used the tleap module in the AmberTools package to generate a series of capped (ACE and NME) dipeptides (400) and tripeptides (8000). For each of the peptides, a simulation box was built with water molecules and Na + and Cl − ions. We performed a total of 1 ns MD simulations with the ff19SB [59] force field for the peptide and TIP3P [60] for the water molecules. An NPT (1 atm, 300 K) ensemble was used with a time step of 2 fs. The temperature was controlled with underdamped Langevin simulations of the "virtual" solvent with the damping coefficient γ = 5 ps −1 [61]. The pressure was held constant by applying the Langevin piston method [62,63]. Periodic long-range electrostatic interactions were computed using the particle mesh Ewald summation [64]. Covalent bonds associated with hydrogen atoms were constrained by the SHAKE algorithm [65]. All classical MD simulations were carried out with the Amber20 [66,67] CUDA version. A total of 10 MD snapshot structures of each peptide were evenly extracted for further quantum chemical calculations. Since we were not going to locate the global minimum conformers, we first performed single-point calculations of the 10 structures at the semi-empirical PM7 [68] level, and selected the structure with the lowest energy for subsequent analysis. All density functional theory (DFT) calculations were conducted with the Gaussian 16 [69] package with ultrafine integration grids and tight self-consistent field convergence.
A full geometric optimization at the M06-2X/Def2-SVP level without any symmetry constraint was conducted for 30 base pairs and 8420 peptides. The optimized Cartesian coordinates are supplied in the Supplementary Materials. Since it is computationally demanding to perform harmonic vibrational frequency analysis for all systems in this study, we chose 30 base pairs, 20 amino acids, 400 dipeptides, and 10 randomly selected tripeptides and no imaginary frequencies were observed. Molecular polarizabilities, volumes, and wavefunctions were obtained at the M06-2X/Def2-TZVP level. Note that only the isotropic molecular polarizability α iso = (α xx + α yy + α zz )/3 is reported in this work. Total energy components were obtained via the keyword iop(5/33 = 1). The Multiwfn 3.8 [70] program was utilized to calculate all ITA quantities introduced above by using the checkpoint or wavefunction file from the Gaussian calculations as the input. Molecular volumes were obtained at 0.001 e/Bohr 3 contour surface of electronic density. The stockholder Hirshfeld partition scheme of atoms in molecules was employed when atomic contributions were concerned. The reference density was the neutral atom calculated at the same level of theory as molecules.

Conclusions
In this work, we have explored the interaction energy of base pairs and molecular properties of both base pairs and amino acids, dipeptides, and tripeptides. Using the total energy decomposition schemes, one can find that the exchange-correlation effect makes the predominant contribution to the molecular stability, followed by the electrostatic potential, and steric hindrance plays a trivial role. We further revealed that molecular polarizabilities can be linearly correlated with ITA quantities. These strong linear correlations can be used to predict the molecular polarizabilities of larger peptides and even proteins. We mention in passing that two directions merit further studies. (i) Verification of the wide applicability of ITA quantities to various systems needs intensive work. (ii) When a fragment-based method, such as the linear-scaling generalized energy-based fragmentation (GEBF) approach, is taken into consideration, one can directly predict the subsystem (fragment of a large system, usually with a few atoms within a distance threshold) polarizabilities rather than solving the derivatives of molecular orbitals and obtain the molecular polarizability of macromolecular systems via a linear combination of subsystem polarizabilities. More work along this line is ongoing, and results will be presented elsewhere in the very near future.