Differential Scanning Calorimetry of Proteins and the Two-State Model: Comparison of Two Formulas

Abstract: Differential Scanning Calorimetry (DSC) is a routine and powerful tool for measuring the specific heat profile of various materials. To connect the measured profile to the properties of a particular protein, a model must be fitted to the data. We discuss here the application of an exact two-state formula and its approximation to the processing of experimental DSC data on protein folding in water. The approximate formula relies on the smallness of the transition interval, which is different for each protein. Using a set of 33 different proteins as an example, we show the practical validity of the approximation and the equivalence of the exact and approximate two-state formulas for processing DSC data.


Introduction
Differential Scanning Calorimetry (DSC) is a powerful technique used to measure the temperature dependence of the specific heat of various solid [1] and soft materials, including biopolymer solutions [2]. Structural transformations taking place in the material under study show up as peaks in the heat capacity. Strictly speaking, if no model assumptions are invoked, only the transition temperature (T_m) can be determined, from the position of the peak maximum. Anything beyond this requires additional assumptions in order to process the DSC data.
If the baselines before and after the transition are parallel, their difference is temperature-independent (constant), and the baseline subtraction can be performed. The area under the specific heat peak can then be estimated numerically, giving the enthalpic cost (∆H) of the transformation.
Quite often, however, a typical protein unfolding in solution does not have constant baselines. Sometimes the baselines are linear [3–5], and the (unjustified) subtraction gives a closed contour whose enclosed area is taken as ∆H. Such a procedure is not only unjustified but also not unique: each particular choice of baseline points defines a different line, which closes a different contour with a different ∆H. Sometimes the baselines are not linear at all [6–10], and no analysis is possible beyond the determination of the transition temperature.
Let us assume that the baseline subtraction can be performed. One must then decide which model to fit to the specific heat profile. The most common approach to analyzing DSC unfolding data is the two-state model [11]. It assumes that a protein exists in either a fully folded or a fully unfolded state, and that the unfolding transition is reversible and cooperative [12]. The approach relies essentially on a temperature-independent (constant) difference of specific heats, ∆C_P = C_P^U − C_P^N, between the unfolded (U) and native (N) states and allows one to derive a formula for the Gibbs free energy cost of protein (un)folding (∆G). The resulting formula can then be fitted to the experimental data.
While powerful and general, the two-state model is limited in its ability to accurately describe the complex thermodynamics of protein folding, especially for proteins with more than one domain or subunit [12].
Yet another well-known approach, suggested by Hawley [13,14], is applicable to protein folding. It reproduces the elliptic phase diagram of protein (un)folding and allows for cold denaturation. Hawley assumes that the free energy difference between the native and denatured states of a protein is a quadratic function of pressure and temperature [13]. The resulting bell-shaped surface ∆G(P, T) is intersected with the plane ∆G = 0 to obtain an elliptical phase diagram. The approach was quite successful not only for protein folding but was later also applied to describe the re-entrant melting transition in DNA [15].
If we disregard the pressure-dependent terms in Hawley's approach [13], it becomes very similar to the expressions suggested by Privalov for the two-state model [11]. As shown by Smeller [14], under the assumption of a small transition interval, the logarithmic term in the free energy expression can be expanded into a Taylor series and truncated at the second order, resulting in a quadratic expression. How well the assumption of a small transition interval holds depends on the particular protein: some have slightly wider, and some slightly narrower, temperature intervals. To what extent the approximate formula is equivalent to the exact one can only become clear after the same two-state approach is applied to a set of data and the fitted parameters are compared.
In this publication, we compare the use of the logarithmic and square formulas for the free energy of protein folding within the two-state model. We first discuss the theoretical foundations and review the derivations, and then apply both formulas to a set of 33 experimental DSC curves. Despite the obvious differences in the transition intervals of the individual proteins, the fitted quantities obtained with the two methods are very similar. We thus conclude that the assumption of a small transition interval is often valid, at least for the random set of 33 proteins of our choice. This means that the two formulas can be used interchangeably to fit DSC data.

Materials and Methods
The widely adopted two-state model for analyzing heat capacity profiles [2,11,12] is fundamentally based on the premise that folding constitutes a phase transition between the native and denatured states. While protein folding could in theory be conceptualized as a coil-globule transition, the limited system size and the heterogeneous composition of polypeptides result in a finite transition temperature range, which deviates significantly from the idealized notion of a phase transition that the two-state model is specifically designed to accommodate.
The process of determining the free energy difference between the native (N) and unfolded (U) states begins with formulating expressions for the enthalpy and entropy of each phase. The specific heat at constant pressure can be expressed in two equivalent ways, through the temperature derivative of either the enthalpy or the entropy, for both phases α = U, N (Equation (1)). Integrating these relations from the transition point gives the enthalpy and entropy of each phase (Equation (2)), where the subscript 0 denotes the value of a quantity at the transition point (chosen for convenience, although any reference temperature would work). Taking the difference between the two phases yields the enthalpy and entropy differences between the native and denatured states (Equations (3a) and (3b)). Up to this point, no assumptions have been made apart from the existence of two phases.
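For reference, Equations (1)–(3) can be written out as below. This is a sketch assembled from the standard thermodynamic relations that the paragraph describes; the exact notation and the assignment of equation numbers are assumptions.

```latex
\begin{align}
C_P^{\alpha} &= \left(\frac{\partial H^{\alpha}}{\partial T}\right)_P
              = T\left(\frac{\partial S^{\alpha}}{\partial T}\right)_P,
              \qquad \alpha = U, N \tag{1} \\
H^{\alpha}(T) &= H_0^{\alpha} + \int_{T_0}^{T} C_P^{\alpha}\,dT', \qquad
S^{\alpha}(T) = S_0^{\alpha} + \int_{T_0}^{T} \frac{C_P^{\alpha}}{T'}\,dT' \tag{2} \\
\Delta H(T) &= \Delta H_0 + \int_{T_0}^{T} \Delta_N^U C_P\,dT' \tag{3a} \\
\Delta S(T) &= \Delta S_0 + \int_{T_0}^{T} \frac{\Delta_N^U C_P}{T'}\,dT' \tag{3b}
\end{align}
```

Here $\Delta X = X^{U} - X^{N}$ for any quantity $X$, so that positive $\Delta H_0$ corresponds to heat absorbed on unfolding.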
To proceed, following Privalov [11], we assume that the difference in specific heat between the two phases is constant and can be replaced by its value at the transition point, i.e., ∆_N^U C_P = ∆C_P^0 = const. In other words, within the transition region, the difference in specific heats between the two phases does not vary with temperature. This is often interpreted as parallel baselines, as proposed by Privalov [11]. The implications and consequences of such an assumption are not entirely clear, but once it is accepted, the integrals in Equation (3) can be evaluated explicitly (Equation (4)), which leads to the logarithmic formula for the free energy difference, Equation (5). Here, we use the fact that at the transition point T_0 the free energy difference equals zero, implying ∆S_0 = ∆H_0/T_0.
On the other hand, for a phase transition in proteins, which are finite-length heteropolymers, the transition interval should not be zero but cannot be excessively large either. This justifies the approximation T/T_0 ≈ 1 (for any temperature T within the transition interval), enabling the expansion of the last term of Equation (5) into a Taylor series. The resulting expression, Equation (6), is quadratic in temperature. Hawley pioneered the use of such a formula to describe the cold denaturation of chymotrypsinogen [13,14].
However, Formulas (5) and (6) for the Gibbs free energy cost of unfolding are not directly suitable for interpreting experimental DSC data on protein folding. Recall that we assumed a constant jump in heat capacity to derive Equation (5), whereas unfolding DSC experiments yield a temperature-dependent, dome-shaped curve rather than the fixed jump modeled with ∆_N^U C_P = const. To fit this curve, an adjustment [12] to the two-state model is necessary. Specifically, the enthalpy difference of Equation (3a) is multiplied by the fraction of unfolded units in the protein (Equation (7)), where θ_U is the degree of denaturation and K_NU is the equilibrium constant between the native (N) and unfolded (U) states. The modified expression for the enthalpy is given by Equation (8), and its temperature derivative gives the specific heat, Equation (9). The ansatz in Equation (8) breaks the equivalence between the two specific heat formulas in Equation (1). Nevertheless, Equation (9) is used to fit the experimental data within the two-state model [12]. It involves three fitting parameters: T_0, ∆H_0, and ∆C_P^0.
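The chain of steps described above can be sketched as follows. The forms below follow from the constant-∆C_P assumption and standard two-state thermodynamics; the numbering, the sign convention K_NU = [U]/[N], and the symbol $\Delta H^{\mathrm{mod}}$ are assumptions made for illustration and may differ in detail from the original equations.

```latex
\begin{align}
\Delta H(T) &= \Delta H_0 + \Delta C_P^0\,(T - T_0), \qquad
\Delta S(T) = \frac{\Delta H_0}{T_0} + \Delta C_P^0 \ln\frac{T}{T_0} \tag{4} \\
\Delta G(T) &= \Delta H - T\Delta S
             = \Delta H_0\left(1 - \frac{T}{T_0}\right)
             + \Delta C_P^0\,(T - T_0)
             - \Delta C_P^0\,T \ln\frac{T}{T_0} \tag{5} \\
\Delta G(T) &\approx \Delta H_0\left(1 - \frac{T}{T_0}\right)
             - \Delta C_P^0\,\frac{(T - T_0)^2}{2T_0} \tag{6} \\
\theta_U &= \frac{K_{NU}}{1 + K_{NU}}, \qquad
K_{NU} = \exp\!\left(-\frac{\Delta G}{RT}\right) \tag{7} \\
\Delta H^{\mathrm{mod}}(T) &= \theta_U(T)\,\Delta H(T) \tag{8} \\
C_P(T) &= \frac{d}{dT}\bigl[\theta_U\,\Delta H\bigr]
        = \theta_U\,\Delta C_P^0 + \Delta H(T)\,\frac{d\theta_U}{dT} \tag{9}
\end{align}
```

Equation (6) follows from Equation (5) by writing $x = (T - T_0)/T_0$ and using $\ln(1+x) \approx x - x^2/2$, keeping terms up to second order in $T - T_0$; the first neglected term is smaller than the quadratic one by a factor of about $x/3$.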

Results and Discussion
With the detailed derivation we performed above, we can now proceed to the main goal of this paper: the comparison of fitting procedures based on Equation ( 5) and its approximation Equation (6).
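To make the comparison concrete, the two free energy expressions and the resulting heat capacity curve can be sketched in Python. This is a minimal illustration, not the code used for the paper: the functional forms are the standard constant-∆C_P two-state expressions assumed for Equations (5)–(9), all function names are hypothetical, energies are in kJ/mol, and the temperature derivative is taken numerically only to keep the sketch short.

```python
import numpy as np

R = 8.314e-3  # gas constant in kJ/(mol K)


def delta_g_log(T, T0, dH0, dCp0):
    """Logarithmic two-state free energy (assumed form of Equation (5))."""
    return dH0 * (1.0 - T / T0) + dCp0 * (T - T0 - T * np.log(T / T0))


def delta_g_square(T, T0, dH0, dCp0):
    """Quadratic approximation (assumed form of Equation (6))."""
    return dH0 * (1.0 - T / T0) - dCp0 * (T - T0) ** 2 / (2.0 * T0)


def unfolded_fraction(T, T0, dH0, dCp0, delta_g):
    """theta_U = K / (1 + K) with K = exp(-dG / (R T))."""
    dG = delta_g(T, T0, dH0, dCp0)
    return 1.0 / (1.0 + np.exp(dG / (R * T)))


def excess_cp(T, T0, dH0, dCp0, delta_g=delta_g_log, h=1e-3):
    """Heat capacity as the temperature derivative of theta_U * dH(T)."""
    T = np.asarray(T, dtype=float)

    def weighted_enthalpy(t):
        dH = dH0 + dCp0 * (t - T0)  # enthalpy difference for constant dCp
        return unfolded_fraction(t, T0, dH0, dCp0, delta_g) * dH

    # Central finite difference with step h (in K).
    return (weighted_enthalpy(T + h) - weighted_enthalpy(T - h)) / (2.0 * h)


if __name__ == "__main__":
    # Illustrative parameters only: T0 in K, dH0 in kJ/mol, dCp0 in kJ/(mol K).
    T = np.linspace(300.0, 360.0, 601)
    params = (330.0, 300.0, 5.0)
    cp_log = excess_cp(T, *params, delta_g=delta_g_log)
    cp_sq = excess_cp(T, *params, delta_g=delta_g_square)
    print("peak position (log):   ", T[np.argmax(cp_log)])
    print("peak position (square):", T[np.argmax(cp_sq)])
    print("max |difference|:      ", np.max(np.abs(cp_log - cp_sq)))
```

Both free energy functions vanish at T_0 by construction, so any difference between the two heat capacity curves comes only from the higher-order terms dropped in the quadratic formula.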
To analyze experimental data, we converted heat capacity plots from various publications into digital format (see Table 1 for references). For digitization, we took the graphs from the original papers and uploaded them to WebPlotDigitizer [16], an online open-source tool that uses computer vision to assist in extracting numerical data from images, including plots, maps, and other visual materials. WebPlotDigitizer combines computer vision algorithms with manual techniques to ensure accurate data extraction.
All data are expressed in molar units, with specific heat measured per residue. The data we digitized primarily originate from curves with baselines subtracted by the original authors who conducted the measurements; only once did we use a linear extrapolation of the initial slope of the heat capacity function to perform the subtraction ourselves. Consequently, we refrain from discussing the efficacy or limitations of the baseline subtraction procedures employed by other authors. Here, we focus on comparing how the two versions of the two-state model fit the heat capacity data reported by other authors. The least-squares fit was performed with in-house code in Python, using the 'optimize.curve_fit' function of the 'scipy' library.
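A fit of this kind can be sketched with `scipy.optimize.curve_fit` as below. This is an illustrative sketch and not the authors' code: the model functions assume the standard constant-∆C_P forms for Equations (5), (6) and (9), the data are synthetic (generated from the logarithmic formula with added Gaussian noise), and the helper names and the initial-guess heuristic are hypothetical.

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.special import expit

R = 8.314e-3  # gas constant in kJ/(mol K)


def delta_g_log(T, T0, dH0, dCp0):
    """Assumed form of Equation (5)."""
    return dH0 * (1.0 - T / T0) + dCp0 * (T - T0 - T * np.log(T / T0))


def delta_g_square(T, T0, dH0, dCp0):
    """Assumed form of Equation (6)."""
    return dH0 * (1.0 - T / T0) - dCp0 * (T - T0) ** 2 / (2.0 * T0)


def make_cp_model(delta_g, h=1e-3):
    """Build Cp(T; T0, dH0, dCp0) = d/dT [theta_U * dH] for a given dG formula."""

    def weighted_enthalpy(T, T0, dH0, dCp0):
        # expit(-dG/RT) equals K/(1+K) and is safe against overflow.
        theta_u = expit(-delta_g(T, T0, dH0, dCp0) / (R * T))
        return theta_u * (dH0 + dCp0 * (T - T0))

    def cp_model(T, T0, dH0, dCp0):
        T = np.asarray(T, dtype=float)
        upper = weighted_enthalpy(T + h, T0, dH0, dCp0)
        lower = weighted_enthalpy(T - h, T0, dH0, dCp0)
        return (upper - lower) / (2.0 * h)  # central finite difference

    return cp_model


def fit_two_state(T, cp, delta_g):
    """Least-squares fit; returns the array (T0, dH0, dCp0)."""
    t0_guess = T[np.argmax(cp)]
    # Heuristic: for dCp0 = 0 the peak height equals dH0^2 / (4 R T0^2).
    dh0_guess = 2.0 * t0_guess * np.sqrt(R * np.max(cp))
    popt, _ = curve_fit(make_cp_model(delta_g), T, cp,
                        p0=(t0_guess, dh0_guess, 1.0))
    return popt


def make_synthetic_curve(seed=0):
    """Synthetic stand-in for a digitized DSC curve (not real data)."""
    rng = np.random.default_rng(seed)
    T = np.linspace(300.0, 360.0, 121)
    cp = make_cp_model(delta_g_log)(T, 330.0, 300.0, 5.0)
    return T, cp + rng.normal(0.0, 0.2, T.size)


T_data, cp_data = make_synthetic_curve()
fit_log = fit_two_state(T_data, cp_data, delta_g_log)
fit_square = fit_two_state(T_data, cp_data, delta_g_square)
print("log fit    (T0, dH0, dCp0):", fit_log)
print("square fit (T0, dH0, dCp0):", fit_square)
```

Working in kJ keeps the three parameters within a few orders of magnitude of each other, which tends to make the default Levenberg-Marquardt optimizer better conditioned than a fit in J.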
The results of fitting the 33 experimental datasets to both the logarithmic (Equation (5)) and square (Equation (6)) formulas (see Appendix A) are shown in Table 1. In general, both formulas result in very similar fits, and the qualitative similarity between the fitted curves can be seen visually in Figure 1. However, it is not clear a priori what the statistical consequences of using the approximate formula would be when processing a dataset of experimental points. To study the quantitative similarity, in Figure 2 we compare the histograms of the fitted parameters and their mean values for the 33 proteins in Table 1. The histograms were constructed with in-house code in Python (https://www.python.org/), with the help of the 'histogram' function of the 'numpy' library. As can be seen from Figure 2, although the histograms differ a little, the mean values of the fitting parameters (see the legends) for the logarithmic and square equations are very close.
Additionally, the two datasets of fitted values can be compared using the Root Mean Square Deviation (RMSD) and the Normalized Root Mean Square Deviation (NRMSD). These measures quantify how far the two datasets are from each other. Usually, such an analysis is used to compare two conformations of a protein, but nothing stops us from using it for our purposes. The analysis gives NRMSD(∆H_0) = 0.02, NRMSD(∆C_P^0) = 0.11 and NRMSD(T_0) = 0.002, which means that the parameter values obtained from the fits with the two formulas are very close to each other.
With the results of the fitting summarized in Table 1, we can answer the question posed: is the condition of smallness of the reduced transition interval ∆T/T_0 satisfied for the 33 randomly chosen globular proteins in our set? Under the assumption that all the proteins considered are two-state, we can rewrite Equation (12) of Ref. [26] in the form of Equation (13), where R = 8.314 J/(mol K) is the gas constant, and estimate the reduced interval using the fitted values from Table 1. The results of this estimate, presented in Figure 3 (red dots, with the mean shown as a blue dashed line), show a certain spread of values around the mean of 0.04 across the set. Nevertheless, the parameter remains small, justifying the series expansion and resulting in the practical equivalence of the logarithmic (Equation (5)) and square (Equation (6)) formulas.
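The RMSD and NRMSD comparison of two sets of fitted parameters can be sketched as follows. The normalization used here (division by the mean of the two dataset means) is an assumption, since other conventions such as division by the range are also common, and the numbers in the example are made up for illustration and are not taken from Table 1.

```python
import numpy as np


def rmsd(a, b):
    """Root mean square deviation between two equally long datasets."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("both datasets must have the same shape")
    return float(np.sqrt(np.mean((a - b) ** 2)))


def nrmsd(a, b):
    """RMSD divided by the mean of the two dataset means (assumed convention)."""
    pooled_mean = 0.5 * (np.mean(a) + np.mean(b))
    return rmsd(a, b) / abs(float(pooled_mean))


# Made-up values for illustration only; these are NOT the entries of Table 1.
dH0_log = [310.0, 455.0, 280.0, 390.0]     # kJ/mol, hypothetical
dH0_square = [305.0, 460.0, 284.0, 386.0]  # kJ/mol, hypothetical
print("RMSD: ", rmsd(dH0_log, dH0_square))
print("NRMSD:", nrmsd(dH0_log, dH0_square))
```

Because the NRMSD is dimensionless, it allows the agreement in ∆H_0, ∆C_P^0 and T_0 to be compared on a common scale even though the three parameters carry different units.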

Conclusions
We fitted experimental specific heat data (33 samples) using two formulas for the free energy difference between the native and unfolded states within the two-state model. Although the square formula is just an approximation of the logarithmic one, valid for small transition intervals, both give very similar outcomes, at least for the random pick of experimental data we have used. The fitted curves often coincide visually, with coefficients of determination of at least R^2 > 0.9 and mostly R^2 > 0.99, indicating very good fits to the data points for both formulas. Although there are some differences in the histograms of fitted parameters depending on the formula used in the fit, the averages are very close to each other. The comparison of the two datasets of fitted parameters by means of the NRMSD analysis has also shown a high degree of similarity. Although the estimate of the reduced transition intervals shows some spread around the mean value of 0.04, the values remain small and justify the use of the series expansion. To summarize, the analysis above allows us to consider the exact logarithmic and the approximate square formulas for the free energy difference of (un)folding to be practically equivalent and suggests the use of Equation (5) or Equation (6) at one's convenience.

Figure 2. Histograms of fitted parameters for the dataset of 33 proteins. The blue line corresponds to the logarithmic formula given by Equation (5) and the dashed red line to the square formula given by Equation (6) for the two-state model. The mean and NRMSD values are calculated not from the histograms but from the parameter sets recorded in Table 1. (a) The histogram of ∆H_0 with a bin width of 25 kJ/mol. NRMSD(∆H_0) = 0.02. (b) The histogram of ∆C_P^0 with a bin width of 1 kJ/(mol K). NRMSD(∆C_P^0) = 0.11. (c) The histogram of T_0 with a bin width of 5 K. NRMSD(T_0) = 0.002.

Table 1. The results of the fits using the logarithmic and square formulas.

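Figure 3. Reduced transition intervals of the 33 samples from Table 1, calculated under the assumption of the two-state formula, using Equation (13) (red dots). The mean value is 0.04, shown as a blue dashed line.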