Generalized Statistical Thermodynamics

We develop the mathematical theory of generalized statistical thermodynamics by constructing phase spaces (canonical and microcanonical) of probability distributions. Generalized thermodynamics is independent of physical hypotheses; it is applicable to probability distributions in general and provides the means to incorporate our knowledge, hypotheses and physical models about a stochastic process. The second law, the Gibbs-Shannon entropy and the Kullback-Leibler divergence all have straightforward and noncontroversial interpretations in this theory. We obtain statistical mechanics as a special case and make contact with Information Theory and Bayesian inference.


I. INTRODUCTION
What is thermodynamics? The question, so central to physics, has been asked numerous times and has been given nearly as many different answers. To quote just a few: thermodynamics is the branch of science concerned with the relations between heat and other forms of energy involved in physical and chemical processes [1]; the study of the restrictions on the possible properties of matter that follow from the symmetry properties of the fundamental laws of physics [2, p3]; concerned with the relationships between certain macroscopic properties of a system in equilibrium [3, p1]; a phenomenological theory of matter [4, p3]. While strictly true, such statements emphasize aspects of thermodynamics that are far too narrow to converge to a definition of sufficient generality as to what to call thermodynamics or how to use it outside physics. And yet, since Gibbs [5], Shannon [6] and Jaynes [7] drew quantitative connections between entropy and probability distributions, thermodynamics has been spreading to new fields. The tools of statistical thermodynamics are now used in network theory [8], ecology [9], epidemics [10], neuroscience [11], financial markets [12], and in the study of complexity in general. What motivates the intuitive impulse to apply thermodynamics to such vastly diverse problems? Is thermodynamics even applicable outside classical or quantum mechanical systems? And if so, what is the range of its applicability?
Here we answer these fundamental questions: Thermodynamics is variational calculus applied to probability distributions, and by extension to stochastic processes in general; it is independent of physical hypotheses but provides the means to incorporate our knowledge and model assumptions about the particular problem. We define a space of probability distributions and formulate a functional that assigns probabilities to this space through a biased sampling process. The maximization of the probability functional expresses a distribution (any distribution) via a set of parameters (microcanonical partition function, canonical partition function and generalized temperature) that are connected via mathematical relationships that we recognize as thermodynamic equations. Entropy, the second law, the Kullback-Leibler divergence and the Gibbs inequality have simple interpretations in this theory. As with all calculus, generalized thermodynamics is independent of physical assumptions and gains physical context when applied to specific problems.
We obtain statistical mechanics as a special case and make contact with Information Theory and Bayesian inference.

A. Random Sampling
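Consider the continuous probability distribution $h_0(x) \ge 0$ that satisfies the normalization

$$\int h_0(x)\,dx = 1. \tag{1}$$

We discretize the domain of $h_0$ into intervals of width $\Delta$ centered at the grid points $x_i$, $i = 1, 2, \cdots, K+1$, such that the probability to sample a value of $x$ in the $i$th interval is

$$p_i = h_0(x_i)\,\Delta \tag{2}$$

if $\Delta$ is sufficiently small. We sample $N$ values from $h_0$ and construct the frequency distribution $\mathbf{n} = (n_1, n_2, \cdots)$, where $n_i$ is the number of sampled values that lie in the $i$th interval. The probability to observe distribution $\mathbf{n}$ in a random sample of size $N$ is given by the multinomial distribution,

$$P(\mathbf{n}|\mathbf{p}) = N! \prod_i \frac{p_i^{n_i}}{n_i!}, \tag{3}$$

and its logarithm is

$$\log P(\mathbf{n}|\mathbf{p}) = \log\frac{N!}{\prod_i n_i!} + \sum_i n_i \log p_i \approx -\sum_i n_i \log\frac{n_i}{N p_i}, \tag{4}$$

where $\mathbf{p} = (p_1, p_2, \cdots)$ (boldface variables refer to ordered lists on the grid $\{x_i\}$) and the second form follows from Stirling's approximation for large $N$. We define

$$h(x_i) = \frac{n_i}{N\Delta}.$$

Substituting this definition and (2) into (4) we obtain

$$\frac{\log P(\mathbf{n}|\mathbf{p})}{N} = -\sum_i h(x_i) \log\frac{h(x_i)}{h_0(x_i)}\,\Delta,$$

and in the continuous limit, $\Delta \to 0$, $N \to \infty$,

$$\frac{\log \delta P(h|h_0, N)}{N} = -\int h(x) \log\frac{h(x)}{h_0(x)}\,dx \equiv -D(h\|h_0), \tag{5}$$

where $\delta P(h|h_0, N)$ is the probability to sample region $(h, h + \delta h)$ in the continuous space of distributions, while taking a random sample of size $N$ from $h_0$ (hereafter all integrals are understood to be taken in the domain of $h_0$). Any probability distribution $h(x)$ defined in the domain of $h_0$ may materialize in a random sample taken from $h_0$. Clearly, the most probable distribution in this space is $h_0$, and indeed $h_0$ maximizes Eq. (5). For all other distributions we must have

$$D(h\|h_0) \ge 0, \tag{6}$$

with the equal sign only for $h = h_0$. The probability to obtain $h_0$ relative to the probability to obtain any other distribution in the same domain in the limit $N \to \infty$ is

$$\frac{\delta P(h_0|h_0, N)}{\delta P(h|h_0, N)} = e^{N D(h\|h_0)} \to \infty. \tag{7}$$

Accordingly, $h_0$ is overwhelmingly more probable than any other distribution in its domain.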
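The statement that the multinomial log-probability per sampled value approaches minus the relative entropy can be checked numerically. The sketch below is illustrative only: the grid on [0, 10], the choice of an exponential for the sampled distribution and of a gamma density (shape 2) for the observed one, and all function names are assumptions made for the example, not part of the theory.

```python
import numpy as np
from math import lgamma

# Grid of K interval midpoints on [0, 10] (an illustrative truncation of x >= 0).
K, x_max = 200, 10.0
delta = x_max / K
x = (np.arange(K) + 0.5) * delta

# Interval probabilities of the sampled distribution h0 (exponential) and of a
# candidate distribution h (gamma, shape 2), both renormalized on the grid.
p0 = np.exp(-x) * delta
p0 /= p0.sum()
ph = x * np.exp(-x) * delta
ph /= ph.sum()

def log_multinomial(n, p):
    """Exact log of the multinomial probability of counts n (lgamma allows non-integer n)."""
    N = n.sum()
    log_coeff = lgamma(N + 1.0) - sum(lgamma(k + 1.0) for k in n)
    return log_coeff + float(np.sum(n * np.log(p)))

def kl_divergence(q, p):
    """Discrete relative entropy D(q||p) = sum_i q_i log(q_i / p_i)."""
    return float(np.sum(q * np.log(q / p)))

D = kl_divergence(ph, p0)

# log P / N for counts n_i = N * ph_i at increasing sample size N.
per_sample = {N: log_multinomial(N * ph, p0) / N for N in (1e3, 1e5, 1e8)}
for N, value in per_sample.items():
    print(f"N = {N:.0e}:  log P / N = {value:+.5f}   (-D = {-D:+.5f})")
```

As N grows, the printed values should approach -D, which is strictly negative here because the two distributions differ; the gap at small N comes from the sub-leading terms of Stirling's approximation.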
The quantity $D(h\|h_0)$ defined in Eq. (5) is the relative entropy or Kullback-Leibler divergence of distribution $h$ relative to $h_0$ [13]; Eq. (6) is the Gibbs inequality and a well known property of relative entropy. Both concepts are of fundamental importance in Information Theory, statistical inference and statistical mechanics. Relative entropy is interpreted as a "distance" or "divergence" between statistical populations [13,14], as cost associated with a code whose implicit probabilities

B. Biased Sampling
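We now apply a bias such that a random sample of size $N$ from $h_0$ with distribution $h$ is accepted with probability proportional to $W[Nh]$. The bias functional $W$ is such that $\log W$ is homogeneous of degree 1, $\log W[Nh] = N \log W[h]$, with

$$\log W[h] = \int h(x) \log w(x; h)\,dx, \tag{8}$$

where $\log w(x; h)$ is the variational derivative of $\log W[h]$ with respect to $h$. The probability to obtain a sample with distribution $Nh = (n_1, n_2, \cdots)$ under such biased sampling is

$$P(\mathbf{n}|\mathbf{p}, W) = \frac{W[Nh]}{r^N}\, N! \prod_i \frac{p_i^{n_i}}{n_i!}, \tag{9}$$

where $r^N$ is the normalization constant; its logarithm in the continuous limit is

$$\frac{\log \delta P(h|h_0, N)}{N} = -\int h(x) \log\frac{h(x)}{w(x; h)\,h_0(x)}\,dx - \log r. \tag{10}$$

If we choose $w(x; h) = f(x)/h_0(x)$, where $f$ is some other normalized distribution in the domain of $h_0$, we obtain $h^* = f$, where $h^*$ denotes the most probable distribution under the bias. A suitable bias therefore may be constructed for any distribution in the domain; conversely, any distribution $h_0$ may be used to generate a sample of any other distribution $f$ over the same domain by biased sampling.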
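The claim that a bias w = f/h_0 turns samples of h_0 into samples of f can be illustrated with a short simulation. This is a sketch under stated assumptions: the bias is taken to be independent of h (log W linear in h), in which case accepting a whole sample with probability proportional to W is equivalent to accepting each sampled value independently with probability proportional to w(x). The target f(x) = 4x exp(-2x) and the function name `biased_sample` are hypothetical choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def biased_sample(n_draws, rng):
    """Draw from h0(x) = exp(-x) and accept each value with probability w(x)/w_max.

    Illustrative choice: target f(x) = 4 x exp(-2x), so w(x) = f/h0 = 4 x exp(-x),
    which peaks at x = 1 with w_max = 4/e.
    """
    x = rng.exponential(scale=1.0, size=n_draws)
    w = 4.0 * x * np.exp(-x)
    w_max = 4.0 / np.e
    accept = rng.random(n_draws) < w / w_max
    return x[accept]

n_draws = 200_000
sample = biased_sample(n_draws, rng)
acceptance_rate = sample.size / n_draws

print(f"accepted fraction: {acceptance_rate:.3f}  (1/w_max = {np.e / 4:.3f})")
print(f"sample mean:       {sample.mean():.3f}  (mean of f = 1)")
print(f"sample variance:   {sample.var():.3f}  (variance of f = 0.5)")
```

The accepted values should reproduce the mean and variance of f rather than those of the exponential they were drawn from.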
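We define the probability functional

$$\log \varrho[h|h_0, W] = -\int h(x) \log\frac{h(x)}{w(x; h)\,h_0(x)}\,dx - \log r \tag{11}$$

so that $\delta P(h|h_0, N) = \varrho^N[h|h_0, W]$. The ratio of the probability to sample $h^*$, the distribution that maximizes $\varrho$, relative to the probability to sample any other distribution in the continuous limit is

$$\frac{\delta P(h^*|h_0, N)}{\delta P(h|h_0, N)} = \left(\frac{\varrho[h^*|h_0, W]}{\varrho[h|h_0, W]}\right)^N \to \infty. \tag{12}$$

As in random sampling, the most probable distribution is overwhelmingly more probable than any other feasible distribution. Since $\delta P(h^*|h_0, N) > 0$ in this limit, we must have $\varrho[h^*|h_0, W] = 1$ and therefore

$$-\int h(x) \log\frac{h(x)}{w(x; h)\,h_0(x)}\,dx \le \log r, \tag{13}$$

with the equal sign only for $h = h^*$. The distribution that maximizes $\varrho$ and satisfies (13) as an equality is

$$h^*(x) = \frac{w(x; h^*)\,h_0(x)}{r}. \tag{14}$$

Equation (13) is the generalization of the Gibbs inequality for biased sampling from arbitrary distribution $h_0$; Eq. (14) is the distribution that satisfies it as an exact equality.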

C. Canonical Sampling
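Any distribution $h_0$ may be used to generate all other distributions in the same domain. We choose the generating function to be the normalized exponential distribution with parameter $\beta$:

$$h_0(x) = \beta e^{-\beta x}, \quad x \ge 0, \tag{15}$$

and write the probability functional $\varrho$ in Eq. (11) as

$$\log \varrho[h|W, \beta] = -\int h(x) \log\frac{h(x)}{w(x; h)}\,dx - \beta\bar{x} - \log q, \tag{16}$$

where $\bar{x}$ is the mean of $h(x)$, $q = r/\beta$ and $r$ is the normalization constant in Eq. (11). We call this space canonical. The probability of $h$ in a large sample of size $N$ collected from the exponential distribution with bias $W$ is $\varrho^N[h|W, \beta]$; then by the same argument that led to Eq. (13) we obtain

$$-\int h(x) \log\frac{h(x)}{w(x; h)}\,dx \le \beta\bar{x} + \log q, \tag{17}$$

with the equal sign for $h = h^*$. The distribution that maximizes $\varrho[h|W, \beta]$ and satisfies (13) as an equality is obtained from Eq. (14) with $r = \beta q$,

$$h^*(x) = \frac{w(x; h^*)\,e^{-\beta x}}{q}, \tag{18}$$

and its mean satisfies

$$\bar{x} = -\frac{d \log q}{d\beta}. \tag{19}$$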
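The relation between the mean of the most probable distribution and the canonical partition function, Eq. (19) (mean equal to minus the derivative of log q with respect to β), can be verified by direct quadrature. The following is a minimal sketch, assuming a hypothetical bias w(x) = x² that does not depend on h, a truncated midpoint grid, and a central finite difference for the derivative; for this choice q = 2/β³ and the mean is 3/β.

```python
import numpy as np

# Midpoint grid on [0, 60]; the upper limit truncates x >= 0 (adequate for beta >= 0.5).
dx = 0.002
x = (np.arange(30_000) + 0.5) * dx

def w(x):
    """Hypothetical bias that does not depend on h: w(x) = x**2."""
    return x**2

def log_q(beta):
    """Log of the canonical partition function q(beta) = integral of w(x) exp(-beta x) dx."""
    return np.log(np.sum(w(x) * np.exp(-beta * x)) * dx)

def mean_of_h_star(beta):
    """Mean of the most probable distribution h* = w(x) exp(-beta x) / q."""
    h_star = w(x) * np.exp(-beta * x - log_q(beta))
    return np.sum(x * h_star) * dx

beta, eps = 1.5, 1e-4
minus_dlogq_dbeta = -(log_q(beta + eps) - log_q(beta - eps)) / (2 * eps)

print(f"mean of h*       = {mean_of_h_star(beta):.6f}")
print(f"-d log q / dbeta = {minus_dlogq_dbeta:.6f}")
print(f"analytic 3/beta  = {3 / beta:.6f}")
```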

D. Microcanonical Sampling
Distributions of the canonical space may have any mean $\bar{x} \ge 0$. We now define the microcanonical space as the subset of the canonical space of distributions with fixed mean $\bar{x}$. The generating distribution is again the exponential function, which we now write as

$$h_0(x) = \frac{e^{-x/\bar{x}}}{\bar{x}} \tag{20}$$

with $\bar{x}$ fixed. The probability to observe distribution $h$ while sampling $h_0$ is still given by Eq. (10) but $r$ is replaced with $r'$ such that normalization is satisfied in the microcanonical space. We define the microcanonical probability functional

$$\log \varrho[h|W; \bar{x}] = -\int h(x) \log\frac{h(x)}{w(x; h)}\,dx - \log\omega \tag{21}$$

with $\log\omega = 1 + \log\bar{x} + \log r'$ and write the probability of $h$ as $\varrho^N[h|W; \bar{x}]$. The argument that produced Eqs. (13) and (17) now gives

$$-\int h(x) \log\frac{h(x)}{w(x; h)}\,dx \le \log\omega, \tag{22}$$

with the equal sign for $h = h^*$. We obtain $h^*$ by Lagrange maximization of the microcanonical functional under the constraints that $h^*$ obeys normalization and its mean is fixed. The result is

$$h^*(x) = \frac{w(x; h^*)\,e^{-\beta x}}{q}, \tag{23}$$

with $q$ and $\beta$ related to the two Lagrange multipliers. This is the same distribution as in the canonical maximization; it is overwhelmingly more probable than all other distributions and its mean also satisfies Eq. (19). We insert Eq. (23) into (22) to obtain

$$\log\omega = S[h^*] + \log W[h^*], \tag{24}$$

where $S[h^*]$ is the Gibbs-Shannon entropy of the most probable distribution,

$$S[h^*] = -\int h^*(x) \log h^*(x)\,dx. \tag{25}$$

Substituting Eq. (23) for $h^*$ in (24) we obtain a relationship between $\omega$, $\beta$, $q$ and $\bar{x}$:

$$\log\omega = \beta\bar{x} + \log q. \tag{26}$$

In combination with Eq. (19), this result defines $\log\omega(\bar{x})$ as the Legendre transformation of $\log q(\beta)$ with respect to $\beta$. By the reciprocal property of the transformation we then have

$$\frac{d \log\omega}{d\bar{x}} = \beta. \tag{27}$$

The maximization of the microcanonical functional implies that $\varrho[h|W; \bar{x}]$ is a concave functional in $h$. It follows that $\log\omega$ is a concave function of $\bar{x}$, therefore we must have

$$\frac{d^2 \log\omega}{d\bar{x}^2} \le 0. \tag{28}$$

The concave property of $\varrho[h|W; \bar{x}]$ imposes certain restrictions on the possible form of $\log W$. Equations (18) and (23) give $h^*$ in implicit form, since $h^*$ appears on both sides of the equation through $w(x; h^*)$.
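The Legendre structure that links log ω, log q, β and the mean can be traced numerically along a family of most probable distributions. The sketch below assumes a hypothetical h-independent bias w(x) = 1 + x², a truncated midpoint grid, and finite-difference derivatives from `numpy.gradient`; it computes log ω from Eq. (26) and then checks that its slope with respect to the mean reproduces β and that its curvature is non-positive.

```python
import numpy as np

dx = 0.004
x = (np.arange(20_000) + 0.5) * dx          # midpoint grid on [0, 80]
w = 1.0 + x**2                               # hypothetical h-independent bias

betas = np.linspace(0.5, 3.0, 251)
log_q = np.empty_like(betas)
x_bar = np.empty_like(betas)
for k, beta in enumerate(betas):
    weights = w * np.exp(-beta * x)          # unnormalized h*
    q = np.sum(weights) * dx                 # canonical partition function
    log_q[k] = np.log(q)
    x_bar[k] = np.sum(x * weights) * dx / q  # mean of h*

log_omega = betas * x_bar + log_q            # Eq. (26)

# Derivatives along the path, using x_bar as the independent variable.
slope = np.gradient(log_omega, x_bar)        # expected to reproduce beta
curvature = np.gradient(slope, x_bar)        # expected to be non-positive

print("max |d log(omega)/d x_bar - beta| =", np.max(np.abs(slope[1:-1] - betas[1:-1])))
print("largest interior curvature        =", np.max(curvature[2:-2]))
```

The end points are excluded from the comparison because `numpy.gradient` falls back to one-sided differences there, which are less accurate than the interior estimates.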
These results are summarized in Table I and their derivations are given in the Appendix. The multivariate case $h = h(x_1, x_2, \cdots)$ and the case of discrete distributions are straightforward extensions of the theory and will not be written down here. The microcanonical inequality, Eq. (22), is the statement of the second law and expresses the elementary fact that any distribution in the phase space is less probable than the most probable distribution $h^*$. The relationships between $\omega$, $\beta$, $q$ and $\bar{x}$ are those of classical thermodynamics. They do not involve distribution $h^*$, but without the probabilistic context of the phase space they have no simple interpretation. The set of equations that involve $h^*$ are those obtained by Gibbs (statistical thermodynamics). Gibbs's statistical mechanics is obtained when these are applied to a system of Newtonian particles. The entire set of equations constitutes generalized thermodynamics.

III. GENERALIZED THERMODYNAMICS
We have obtained $h^*$ as the most probable distribution by biased sampling of the exponential function, and in the process we encountered thermodynamics. Any distribution in the domain of the exponential may be obtained in this manner under an appropriate bias functional. Thermodynamics may be condensed into the microcanonical inequality,

$$S[h] + \log W[h] = -\int h(x) \log\frac{h(x)}{w(x; h)}\,dx \le \log\omega, \tag{29}$$

which applies to all normalized $h(x)$, $x \ge 0$, with fixed mean $\bar{x}$. It is a generalized statement of the second law and expresses the relative probability of distributions under biased sampling from the exponential distribution. All relationships between $\omega$ (microcanonical partition function), $q$ (canonical partition function), $\beta$ (generalized inverse temperature) and $\bar{x}$ follow from the maximization of this inequality and have equivalents in familiar thermodynamics. In the special case $W[h] = 1$ for all $h$ (unbiased sampling) the most probable distribution is exponential. Equation (29) in combination with (24) gives

$$S[h] \le S[h^*] \tag{30}$$

and states that the entropy of any distribution in the microcanonical space is less than the entropy of the most probable distribution. This is the mathematical statement of the second law of thermodynamics. In the general case $h^*$ may be any distribution in $\mathbb{R}^+$ and is uniquely determined by the bias $W$. We call $W$ the selection functional because it can be constructed to pick out any feasible distribution. The derivatives $d\log q/d\beta$ and $d\log\omega/d\bar{x}$ in Eqs. (19) and (27) may be viewed as equations of change along a path in the space of distributions under fixed bias $W$. This path is described parametrically in terms of $\bar{x}$ and represents a nonstationary stochastic process, a continuous transition between distributions that each maximize the microcanonical functional at the local value of $\bar{x}$ under a common $W$. We call this process quasistatic to indicate that it consists of a sequence of distributions, each maximizing the thermodynamic functional at the local value of $\bar{x}$.
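For unbiased sampling (W = 1) the second law reduces to the statement that, among distributions on x ≥ 0 with the same mean, the exponential has the largest entropy, equal to 1 plus the log of the mean. A small numerical comparison, assuming an illustrative mean of 2, a truncated midpoint grid, and a handful of arbitrarily chosen competitors with that mean:

```python
import numpy as np
from math import lgamma

x_bar = 2.0
dx = 0.001
x = (np.arange(100_000) + 0.5) * dx          # midpoint grid on [0, 100]

def entropy(h):
    """Gibbs-Shannon entropy S[h] = -integral of h log h dx (0 log 0 taken as 0)."""
    positive = h > 0
    return -np.sum(h[positive] * np.log(h[positive])) * dx

def gamma_pdf(shape):
    """Gamma density with the given shape and mean x_bar (rate = shape / x_bar)."""
    rate = shape / x_bar
    return np.exp(shape * np.log(rate) + (shape - 1) * np.log(x) - rate * x - lgamma(shape))

# All candidates are normalized and have mean x_bar.
candidates = {
    "exponential (h*)": np.exp(-x / x_bar) / x_bar,
    "gamma, shape 2": gamma_pdf(2.0),
    "gamma, shape 5": gamma_pdf(5.0),
    "uniform on [0, 2 x_bar]": np.where(x < 2 * x_bar, 1 / (2 * x_bar), 0.0),
}
entropies = {name: entropy(h) for name, h in candidates.items()}
S_max = 1 + np.log(x_bar)                    # entropy of the exponential with mean x_bar

for name, S in entropies.items():
    print(f"{name:26s} S = {S:.4f}   (1 + log x_bar = {S_max:.4f})")
```

Only the exponential should reach the bound; every other candidate falls below it.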
Application of thermodynamics to stochastic processes amounts to converting the rules that govern the process, the transition probabilities for example, into an appropriate selection bias.

A. Contact with Statistical Mechanics
The obvious way to make contact with statistical mechanics is to take $h^*$ to be the probability of microstate at fixed temperature, volume and number of particles. Then, $W = w = 1$, $x$ is the energy $E_i$ of microstate $i$, $\beta$ is $1/k_B T$, $q$ is the thermodynamic canonical partition function, $\omega$ is the thermodynamic microcanonical partition function, and Eqs. (23)-(28) map to standard thermodynamic relationships. From Eq. (24) we obtain $\varrho = e^{S[h]}/\omega$, i.e., the canonical probability $h^*$ maximizes entropy, and thus we obtain complete correspondence with statistical thermodynamics.
The condition $w = 1$ is a statement of the postulate of equal a priori probabilities, a model assumption external to thermodynamics that assigns equal probability to all microstates with the same energy, volume, and number of particles. This is not the only way to establish contact with statistical mechanics. Generalized thermodynamics applies to any probability distribution and we may choose $h^*$ to be some other probability distribution in the problem, for example, the probability to find a macroscopic system of fixed $(T, V, N)$ at energy $E$. We write the energy distribution in the form of Eq. (23) with $w$, $\beta$ and $q$ to be determined. From Eqs. (19), (27) and (26) with $\bar{x} = \bar{E}$ we have

$$\bar{E} = -\frac{d\log q}{d\beta}, \qquad \beta = \frac{d\log\omega}{d\bar{E}}, \qquad \log\omega = \beta\bar{E} + \log q.$$

Comparison with established thermodynamic relationships leads to the identifications $\beta \to 1/k_B T$, $\log q \to -F/k_B T$ (free energy), $\log\omega \to$ thermodynamic entropy. To identify $w$ we require input from physics, and this comes via the observation that the probability density of macroscopic energy $E$ is asymptotically a Dirac delta function at $E = \bar{E}$. Then $S[h^*] = 0$ (this is the entropy of the energy distribution, not to be confused with thermodynamic entropy). From Eqs. (8) and (24) we find $\log W[h^*] = \log w(x; h^*) = \log\omega$, and conclude that $\log w$ is the thermodynamic entropy. (21) appears in the role of a Bayesian uniform prior [17], an improper prior. In the context of generalized thermodynamics $w$ is not a prior probability distribution, indeed not a probability distribution at all; it is a functional derivative that expresses our prior knowledge (in this case, a hypothesis) about the process and produces a properly normalized distribution $h^*$ when the probability functional is maximized. Generalized thermodynamics offers the means to encode this knowledge into the bias functional.
One other example has been given in the literature, where it is shown how to transcribe into W the rules (in this case rate laws) that specify the transition probabilities in irreversible aggregation, including a case that produces a phase transition [18,19].

IV. CONCLUSIONS
Stripped to its core, what we call statistical thermodynamics is a mapping between a probability distribution h * and a set of functions, {w, β, q, ω} from which the distribution may be reconstructed.
What we call classical thermodynamics is the set of relationships between $\{\beta, q, \omega, \bar{x}\}$; these relationships are the same for all distributions. What we call the second law is the variational condition that selects $h^*$ among all other distributions in the same domain with the same mean. What we call a stochastic process is a path in the space of distributions under fixed $W$. Physics enters through $W$. This generic mathematical formalism applies to any distribution. To use an analogy, thermodynamics is a universal grammar that becomes a language when applied to specific problems.
In a fitting coincidence (or perhaps as its inevitable consequence), it was the human desire to maximize the amount of useful work in the steam engine that led to the first encounter with the variational foundation of thermodynamics. Gibbs's breakthrough was to connect thermodynamics to a probability distribution, and that of Shannon and Jaynes, to transplant it outside physics.
In the time since, the vocabulary of statistical thermodynamics has felt intuitively familiar across disciplines in a déjà vu sort of manner, even as its grammar remained undeciphered. This intuition can now be understood: The common thread that runs through every discipline that has adopted the thermodynamic language is an underlying stochastic process, and where there is probability, there is thermodynamics.
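Homogeneity allows us to express $\log W$ as an integral over the variational derivatives $\log w(x; h)$,

$$\log W[h] = \int h(x) \log w(x; h)\,dx. \tag{A1}$$

This is Eq. (8) in the text. We also have

$$\int h(x)\,\delta\log w(x; h)\,dx = 0, \tag{A2}$$

or equivalently,

$$\int h(x)\,\frac{\partial \log w(x; h)}{\partial t}\,dx = 0, \tag{A3}$$

where $t$ is any parameter other than $x$ on which $h$ may depend (for example, $\bar{x}$, $\beta$, etc., or any function of these variables). In the special but important case that $\log W[h]$ is a linear functional of $h$,

$$\log W[h] = \int a(x)\,h(x)\,dx,$$

where $a(x)$ is a fixed function of $x$, Eq. (A1) is satisfied with $\log w(x; h) = a(x)$, and Eq. (A3) is satisfied trivially, since in this case $\delta a(x)/\delta h = 0$ ($a(x)$ does not depend on $h$).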
Equations (A1) and (A3) are the functional equivalents of the following two results for homogeneous functions $f(x_1, x_2, \cdots)$ of degree 1 with respect to all $x_i$:

$$f = \sum_i x_i f_i, \qquad \sum_i x_i\,df_i = 0, \qquad f_i = \frac{\partial f}{\partial x_i}.$$

Equation (A2) is used throughout the paper. Equation (A3) is used in the derivation of Eq. (19) later in this supplement.
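The two properties of homogeneous functions of degree 1 invoked here (the function equals the sum of its arguments times its partial derivatives, and the arguments weighted by the differentials of those derivatives sum to zero) are easy to confirm numerically. The sketch assumes an illustrative degree-1 function, f = Σ x_i log(x_i / Σ x_j), and finite-difference gradients; the function, the test point and the displacement are arbitrary choices for the example.

```python
import numpy as np

def f(x):
    """Homogeneous function of degree 1 (illustrative): f = sum_i x_i log(x_i / sum_j x_j)."""
    return float(np.sum(x * np.log(x / x.sum())))

def grad(func, x, eps=1e-6):
    """Central finite-difference gradient of func at x."""
    g = np.empty_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        g[i] = (func(x + step) - func(x - step)) / (2 * eps)
    return g

x = np.array([0.5, 1.2, 2.0, 3.3])
f_i = grad(f, x)

# Euler's theorem: f - sum_i x_i f_i should vanish.
euler_gap = f(x) - np.dot(x, f_i)

# Second identity: sum_i x_i d(f_i) = 0 for an arbitrary small displacement of x.
displacement = 1e-4 * np.array([1.0, -2.0, 0.5, 3.0])
d_f_i = (grad(f, x + displacement) - grad(f, x - displacement)) / 2
duhem_gap = np.dot(x, d_f_i)

print(f"f - sum x_i f_i  = {euler_gap:.2e}")
print(f"sum x_i d(f_i)   = {duhem_gap:.2e}   (typical |d f_i| = {np.abs(d_f_i).mean():.2e})")
```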

Most Probable Distribution in Biased Sampling (Equation 14)
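We maximize the generic probability functional (Eq. (10) in the text)

$$\log\varrho[h|W, h_0] = -\int h(x) \log\frac{h(x)}{w(x; h)\,h_0(x)}\,dx - \log r \tag{A7}$$

with respect to $h$ under the normalization constraint

$$\int h(x)\,dx = 1.$$

Using the Lagrange multiplier $\lambda_0$, the equivalent unconstrained maximization problem is

$$\max_h \left\{ -\int h \log\frac{h}{w(x; h)\,h_0}\,dx - \log r - \lambda_0\left(\int h\,dx - 1\right) \right\}$$

with $\lambda_0$ and $r$ fixed. We set the variational derivative at $h = h^*$ equal to zero (the term that arises from the dependence of $w$ on $h$ vanishes by Eq. (A2)),

$$-\log\frac{h^*}{w(x; h^*)\,h_0} - 1 - \lambda_0 = 0,$$

and solve for $h^*$ to obtain

$$h^* = \frac{w(x; h^*)\,h_0}{\alpha}$$

with $\alpha = e^{1+\lambda_0}$. To evaluate $r$ we apply the condition $\varrho[h^*|W, h_0] = 1$. Noting that

$$\log\frac{h^*}{w(x; h^*)\,h_0} = -\log\alpha$$

and

$$\int h^*(x)\,dx = 1,$$

we obtain $\alpha = r$. The most probable distribution is

$$h^*(x) = \frac{w(x; h^*)\,h_0(x)}{r}. \tag{A12}$$

This is Eq. (14) in the text.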

a. Canonical Probability Functional (Equation 16)
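We obtain the canonical functional by setting $h_0(x) = \beta e^{-\beta x}$ in Eq. (A7):

$$\log\varrho = -\int h(x) \log\frac{h(x)}{w(x; h)}\,dx - \beta\bar{x} + \log\beta - \log r,$$

where $\bar{x}$ is the mean of $h$. We define $q = r/\beta$ and write the canonical functional as

$$\log\varrho[h|W, \beta] = -\int h(x) \log\frac{h(x)}{w(x; h)}\,dx - \beta\bar{x} - \log q. \tag{A14}$$

This is Eq. (16) in the text.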

b. Most Probable Distribution in Canonical Space (Equation 18)
The canonical functional in Eq. (A14) is a special case of the generic functional in Eq. (A7) with $h_0 = \beta e^{-\beta x}$ and $q = r/\beta$. The most probable distribution of the generic probability functional is given in Eq. (A12); accordingly, the most probable distribution in the canonical space is obtained from that equation with $h_0(x) = \beta e^{-\beta x}$ and $r = q\beta$:

$$h^*(x) = \frac{w(x; h^*)\,e^{-\beta x}}{q},$$

which is Eq. (18) in the text.

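c. The q-β-x̄ Relationship (Equation 19)
We write Eq. (18) as

$$q = \int w(x; h^*)\,e^{-\beta x}\,dx$$

and take the derivative $d(\log q)/d\beta$:

$$\frac{d\log q}{d\beta} = -\frac{1}{q}\int x\,w(x; h^*)\,e^{-\beta x}\,dx + \frac{1}{q}\int e^{-\beta x}\,\frac{\partial w(x; h^*)}{\partial\beta}\,dx = -\bar{x} + \int h^*(x)\,\frac{\partial\log w(x; h^*)}{\partial\beta}\,dx.$$

The last integral is identically equal to zero by virtue of Eq. (A3). The final result is

$$\frac{d\log q}{d\beta} = -\bar{x},$$

which is Eq. (19) in the text.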

a. Microcanonical Probability Functional (Equation 21)
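The microcanonical functional in the continuous limit is

$$\log\varrho[h|W; \bar{x}] = -\int h(x) \log\frac{h(x)}{w(x; h)\,h_0(x)}\,dx - \log r'$$

with $r'$ such that normalization is satisfied. Setting $h_0 = e^{-x/\bar{x}}/\bar{x}$ we obtain

$$\log\varrho[h|W; \bar{x}] = -\int h(x) \log\frac{h(x)}{w(x; h)}\,dx - \log\omega, \qquad \log\omega = 1 + \log\bar{x} + \log r',$$

which is Eq. (21) in the text.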

b. Most Probable Distribution in Microcanonical Space (Equation 23)
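We maximize the microcanonical functional with respect to $h$ under the constraints

$$\int h(x)\,dx = 1, \qquad \int x\,h(x)\,dx = \bar{x}.$$

The equivalent unconstrained maximization is

$$\max_h \left\{ -\int h \log\frac{h}{w(x; h)}\,dx - \log\omega - \lambda_0\left(\int h\,dx - 1\right) - \lambda_1\left(\int x\,h\,dx - \bar{x}\right) \right\},$$

where $\lambda_0$ and $\lambda_1$ are Lagrange multipliers and $\bar{x}$ and $\omega$ are fixed. We set the variational derivative with respect to $h$ equal to zero:

$$-\log\frac{h^*}{w(x; h^*)} - 1 - \lambda_0 - \lambda_1 x = 0,$$

and solve for $h^*$:

$$h^* = w(x; h^*)\,e^{-1-\lambda_0-\lambda_1 x}.$$

Setting $q = e^{1+\lambda_0}$, $\beta = \lambda_1$ we obtain

$$h^*(x) = \frac{w(x; h^*)\,e^{-\beta x}}{q},$$

which is Eq. (23) in the text.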
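In practice the two multipliers are fixed by the two constraints: q by normalization and β by the prescribed mean. A minimal numerical sketch of that step, assuming a hypothetical h-independent bias w(x) = x², a truncated midpoint grid, and plain bisection on β (the helper names are illustrative). For this bias the exact answer is β = 3 divided by the mean and q = 2/β³.

```python
import numpy as np

dx = 0.002
x = (np.arange(50_000) + 0.5) * dx           # midpoint grid on [0, 100]
w = x**2                                      # hypothetical h-independent bias

def h_star(beta):
    """Most probable distribution w(x) exp(-beta x) / q and its partition function q."""
    weights = w * np.exp(-beta * x)
    q = np.sum(weights) * dx
    return weights / q, q

def mean_of(beta):
    """Mean of h* at the given beta."""
    h, _ = h_star(beta)
    return np.sum(x * h) * dx

def solve_beta(x_bar, lo=0.2, hi=50.0, tol=1e-10):
    """Bisection on beta: the mean of h* decreases monotonically as beta increases."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean_of(mid) > x_bar:
            lo = mid                          # mean too large, so beta must increase
        else:
            hi = mid
    return 0.5 * (lo + hi)

x_bar = 2.0
beta = solve_beta(x_bar)
h, q = h_star(beta)

log_omega = beta * x_bar + np.log(q)          # Eq. (26)
functional = -np.sum(h * np.log(h / w)) * dx  # S[h*] + log W[h*], Eq. (24)

print(f"beta = {beta:.6f}, q = {q:.6f}")
print(f"log omega from Eq. (26) = {log_omega:.6f}, from Eq. (24) = {functional:.6f}")
```

The two evaluations of log ω should agree, which is the content of Eqs. (24) and (26) taken together.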

c. Relationships for log ω (Equations 24 and 26)
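We write the microcanonical probability functional in the equivalent form

$$\log\varrho[h|W; \bar{x}] = S[h] + \log W[h] - \log\omega,$$

where

$$S[h] = -\int h(x) \log h(x)\,dx.$$

Applying the condition $\varrho[h^*|W; \bar{x}] = 1$ we obtain

$$\log\omega = S[h^*] + \log W[h^*], \tag{A31}$$

which is Eq. (24) in the text.
The entropy of the most probable distribution is

$$S[h^*] = -\int h^* \log\frac{w(x; h^*)\,e^{-\beta x}}{q}\,dx = -\log W[h^*] + \beta\bar{x} + \log q.$$

We substitute this result into Eq. (A31) to obtain

$$\log\omega = \beta\bar{x} + \log q.$$

This is Eq. (26) in the text.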

Curvature of log ω (Equation 28)
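Consider the microcanonical spaces of distributions with means $\bar{x}_1$ and $\bar{x}_2$ and let $h^*_1$ and $h^*_2$ be the most probable distributions in these spaces. We form the distribution

$$h = \alpha h^*_1 + (1 - \alpha) h^*_2$$

with $0 \le \alpha \le 1$ whose mean is $\bar{x} = \alpha\bar{x}_1 + (1 - \alpha)\bar{x}_2$. Let $h^*$ be the most probable distribution in the space of distributions with mean $\bar{x}$. We then have:

$$S[h^*] + \log W[h^*] \ge S[h] + \log W[h] \ge \alpha\left(S[h^*_1] + \log W[h^*_1]\right) + (1 - \alpha)\left(S[h^*_2] + \log W[h^*_2]\right),$$

where the first inequality is the microcanonical inequality and the second follows from the concavity of the functional in $h$. By Eq. (A31) this result reads

$$\log\omega\left(\alpha\bar{x}_1 + (1 - \alpha)\bar{x}_2\right) \ge \alpha\log\omega(\bar{x}_1) + (1 - \alpha)\log\omega(\bar{x}_2)$$

and states that $\log\omega(\bar{x})$ is a concave function of $\bar{x}$. It follows that

$$\frac{d^2\log\omega}{d\bar{x}^2} \le 0,$$

which is Eq. (28) in the text.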
We collect a random sample from $h$ and form the distribution $\mathbf{n} = (n_1, n_2, \cdots)$, where $n_i$ is the number of samples in the $i$th interval. Then, We define the extensive distribution $H$ as which satisfies i.e., $H$ is a homogeneous copy of $h$ with scale factor $\lambda$. We take the logarithm of the multinomial coefficient and apply Stirling's approximation: From Eq. (B4) we also have $\bar{x} = x_a + \bar{i}\Delta$, where $\bar{i}$ is the mean of $i$ in $\mathbf{n}$. Equation (B8) becomes which restates the fact that the functional defined in Eq. (B1) is homogeneous with degree 1.

Comments:
(a) The sampled distribution $\mathbf{n}$ depends on the discretization step $\Delta$. If $\Delta$ is decreased by some factor, all $n_i$ decrease by that factor, but the number of possible outcomes increases by the same factor and overall $\log \mathbf{n}!/N$ increases. By subtracting $\log\bar{i}$ we obtain a quantity that converges to an intensive limit that depends only on $h$.
(b) At fixed discretization, log n!/N and S[h] are equal to within an additive constant.
We may interpret entropy as the intensive log of the multinomial coefficient with the understanding that the precise relationship involves the additive constant $-\log(\bar{x} - x_a) + \log\bar{i}$.
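The additive constant can be made concrete: since the mean measured from x_a equals ī∆, the constant −log(x̄ − x_a) + log ī equals −log ∆. The sketch below checks that, at fixed discretization, the log of the multinomial coefficient per sampled value matches S[h] − log ∆. It assumes an exponential h with mean 2, a truncated midpoint grid, expected (non-integer) counts, and `lgamma` in place of factorials; all names are illustrative.

```python
import numpy as np
from math import lgamma

def log_multinomial_coefficient(n):
    """Log of N! / prod_i n_i!, via lgamma so that non-integer counts are allowed."""
    return lgamma(n.sum() + 1.0) - sum(lgamma(k + 1.0) for k in n)

def compare(delta, N=1e9, x_max=40.0, x_bar=2.0):
    """Return (log n!/N, S[h]) for an exponential h discretized with step delta."""
    x = (np.arange(int(round(x_max / delta))) + 0.5) * delta
    h = np.exp(-x / x_bar) / x_bar
    h /= np.sum(h) * delta                    # renormalize on the truncated grid
    n = N * h * delta                         # expected counts n_i = N h(x_i) delta
    entropy = -np.sum(h * np.log(h)) * delta
    return log_multinomial_coefficient(n) / N, entropy

for delta in (0.4, 0.2, 0.1):
    log_coeff, S = compare(delta)
    print(f"delta = {delta}:  log n!/N = {log_coeff:.4f},  "
          f"S[h] - log(delta) = {S - np.log(delta):.4f}")
```

Halving ∆ raises log n!/N by about log 2 while S[h] stays essentially unchanged, in line with comment (a).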