Entropy and Uncertainty

We give a survey of the basic statistical ideas underlying the definition of entropy in information theory and their connections with the entropy in the theory of dynamical systems and in statistical mechanics.


Introduction
The concept of entropy originated in the physical and engineering sciences but now plays a ubiquitous role in all areas of science and in many non-scientific disciplines. A quick search of the ANU library catalogue gives books on entropy in mathematics, physics, chemistry, biology, communication theory and engineering but also in economics, linguistics, music, architecture, urban planning, social and cultural theory and even in creationism. Many of the scientific applications will be described in the lectures over the next three weeks. In this brief introductory lecture we describe some of the theoretical ideas that underpin these applications.
Entropy is an encapsulation of the rather nebulous notions of disorder or chaos, uncertainty or randomness. It was introduced by Clausius in the 19th century in thermodynamics and was an integral part of Boltzmann's theory. In the thermodynamic context the emphasis was on entropy as a measure of disorder. Subsequently the probabilistic nature of the concept emerged more clearly with Gibbs' work on statistical mechanics. Entropy then had a major renaissance in the middle of the 20th century with the development of Shannon's mathematical theory of communication. It was one of Shannon's great insights that entropy could be used as a measure of information content. There is some ambiguity in the notion of information and in Shannon's theory it is less a measure of what one communicates than of what one could communicate, i.e. it is a measure of one's freedom of choice when one selects a message to be communicated. Shannon also stressed the importance of the relative entropy as a measure of redundancy. The relative entropy gives a comparison between two probabilistic systems and typically compares the actual entropy with the maximal possible entropy. It is the relative entropy that has played the key role in many of the later developments and applications.
Another major landmark in the mathematical theory of the entropy was the construction by Kolmogorov and Sinai of an isomorphy invariant for dynamical systems. The invariant corresponds to a mean value over time of the entropy of the system. It was remarkable as it differed in character from all previous spectral invariants and it provided a mechanism for the complete classification of some important systems. Other landmarks were the definition of mean entropy as an affine functional over the state space of operator algebras describing models of statistical mechanics and the application of this functional to the characterization of equilibrium states. Subsequently entropy became a useful concept in the classification of operator algebras independently of any physical background.
In the sequel we discuss entropy, relative entropy and conditional entropy in the general framework of probability theory. In this context the entropy is best interpreted as a measure of uncertainty. Subsequently we develop some applications, give some simple examples and indicate how the theory of entropy has been extended to non-commutative settings such as quantum mechanics and operator algebras.

Entropy and uncertainty
First consider a probabilistic process with $n$ possible outcomes. If $n = 2$ this might be something as simple as the toss of a coin or something as complex as a federal election. Fortunately the details of the process are unimportant for the sequel. Initially we make the simplifying assumption that all $n$ outcomes are equally probable, i.e. each outcome has probability $p = 1/n$. It is clear that the inherent uncertainty of the system, the uncertainty that a specified outcome actually occurs, is an increasing function of $n$. Let us denote the value of this function by $f(n)$. It is equally clear that $f(1) = 0$ since there is no uncertainty if there is only one possible outcome. Moreover, if one considers two independent systems with $n$ and $m$ outcomes, respectively, then the combined system has $nm$ possible outcomes and one would expect the uncertainty to be the sum of the individual uncertainties. In symbols one would expect

$$f(nm) = f(n) + f(m).$$

The additivity reflects the independence of the two processes. But if this property is valid for all positive integers $n, m$ then it is easy to deduce that $f(n) = \log n$ although the base of logarithms remains arbitrary. (It is natural to choose base 2 as this ensures that $f(2) = 1$, i.e. a system such as coin tossing is defined to have unit uncertainty, but other choices might be more convenient.) Thus the uncertainty per outcome is given by $(1/n) \log n$ or, expressed in terms of the probability,

$$\text{uncertainty per outcome} = -p \log p.$$
Secondly, consider a process with $n$ possible outcomes with probabilities $p_1, p_2, \ldots, p_n$, respectively. Then it is natural to ascribe uncertainty $-p_i \log p_i$ to the $i$-th outcome. This is consistent with the foregoing discussion and leads to the hypothesis

$$\text{total uncertainty} = -\sum_{i=1}^{n} p_i \log p_i.$$

This is the standard expression for the entropy of the probabilistic process and we will denote it by the symbol $H(p)$, or $H(p_1, \ldots, p_n)$. (This choice of notation dates back to Boltzmann.) Explicitly, the entropy is defined by

$$H(p) = H(p_1, \ldots, p_n) = -\sum_{i=1}^{n} p_i \log p_i. \tag{1}$$

Although the argument we have given to identify the entropy with the inherent uncertainty is of a rather ad hoc nature, applications establish that it gives a surprisingly efficient description. This will be illustrated in the sequel. It is a case of 'the proof of the pudding is in the eating'.

Before proceeding we note that the entropy $H(p)$ satisfies two simple bounds. If $x \in (0, 1]$ then $-x \log x \in [0, 1]$ and one can extend the function $x \mapsto -x \log x$ to the closed interval $[0, 1]$ by setting $-0 \log 0 = 0$. Then $H(p) \ge 0$ with equality if and only if one outcome has probability one and the others have probability zero. It is also straightforward to deduce that $H(p)$ is maximal if and only if the probabilities are all equal, i.e. if and only if $p_1 = \ldots = p_n = 1/n$. Therefore one has bounds

$$0 \le H(p_1, \ldots, p_n) \le \log n. \tag{2}$$

There is a third less precise principle: the most probable outcomes give the major contribution to the total entropy.

The entropy enters the Boltzmann-Gibbs description of equilibrium statistical mechanics through the prescription that the state of equilibrium is given by the microscopic particle configurations which maximize the entropy under the constraints imposed by the observation of macroscopic quantities such as the energy and density. Thus if the configuration with assigned probability $p_i$ has energy $e_i$ the idea is to maximize $H(p)$ with $E(p) = \sum_{i=1}^{n} p_i e_i$ held fixed. If one formulates this problem in terms of a Lagrange multiplier $\beta \in \mathbb{R}$ then one must maximize the function

$$p \mapsto H(p) - \beta E(p). \tag{3}$$

We will discuss this problem later in the lecture.
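The definition of $H(p)$ and its two bounds can be checked numerically. The following is a minimal sketch, not part of the original lecture; the function name `entropy` and the sample probability vectors are illustrative assumptions.

```python
import math

def entropy(p, base=2.0):
    """Entropy H(p) = -sum_i p_i log p_i, with the convention 0 log 0 = 0."""
    if any(x < 0 for x in p) or abs(sum(p) - 1.0) > 1e-9:
        raise ValueError("p must be a probability vector")
    return -sum(x * math.log(x) / math.log(base) for x in p if x > 0)

# Illustrative probability vectors (assumptions, not taken from the text).
fair_coin = entropy([0.5, 0.5])            # unit uncertainty in base 2
certain = entropy([1.0, 0.0, 0.0])         # lower bound: no uncertainty
biased = entropy([0.7, 0.2, 0.1])          # strictly between the two bounds
uniform3 = entropy([1 / 3, 1 / 3, 1 / 3])  # upper bound: log 3

print(fair_coin, certain, biased, uniform3)
```

With base 2 the fair coin has unit entropy, in line with the normalization $f(2) = 1$ discussed above.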

Entropy and multinomial coefficients
In applications to areas such as statistical mechanics one is usually dealing with systems with a large number of possible configurations. The significance of entropy is that it governs the asymptotic behaviour of the multinomial coefficients

$${}^{n}C_{n_1 \ldots n_m} = \frac{n!}{n_1! \cdots n_m!}$$

where $n_1 + \ldots + n_m = n$. These coefficients express the number of ways one can divide $n$ objects into $m$ subsets of $n_1, n_2, \ldots, n_m$ objects, respectively. If $n$ is large then it is natural to examine the number of partitions into $m$ subsets with fixed proportions $p_1 = n_1/n, \ldots, p_m = n_m/n$. Thus one examines

$$P_n(p) = \frac{n!}{(p_1 n)! \cdots (p_m n)!}$$

with $p_1 + \ldots + p_m = 1$ as $n \to \infty$. Since the sum over all possible partitions, $\sum_{n_1, \ldots, n_m} {}^{n}C_{n_1 \ldots n_m} = m^n$, increases exponentially with $n$ one expects a similar behaviour for $P_n(p)$. Therefore the asymptotic behaviour will be governed by the function $n^{-1} \log P_n(p)$. But this is easily estimated by use of the Stirling-type bounds

$$(2\pi n)^{1/2} n^n e^{-n} e^{1/(12n+1)} \le n! \le (2\pi n)^{1/2} n^n e^{-n} e^{1/(12n)}$$

for the factorials. One finds

$$n^{-1} \log P_n(p) = H(p) + o(1)$$

as $n \to \infty$ where $H(p)$ is the entropy. Thus the predominant asymptotic feature of the $P_n$ is an exponential increase $\exp(nH)$ with $H$ the entropy of the partition $p_1, \ldots, p_m$.

Next consider $n$ independent repetitions of an experiment with $m$ possible outcomes and corresponding probabilities $q_1, \ldots, q_m$. The probability that these outcomes occur with frequencies $p_1, \ldots, p_m$ is given by

$$P_n(p|q) = \frac{n!}{(p_1 n)! \cdots (p_m n)!} \, q_1^{p_1 n} \cdots q_m^{p_m n}.$$

Then estimating as before one finds

$$n^{-1} \log P_n(p|q) = H(p|q) + o(1)$$

as $n \to \infty$ where $H(p|q)$ is given by

$$H(p|q) = -\sum_{i=1}^{m} (p_i \log p_i - p_i \log q_i). \tag{6}$$

The latter expression is the relative entropy of the frequencies $p_i$ with respect to the probabilities $q_i$. Now it is readily established that $H(p|q) \le 0$ with equality if, and only if, $p_i = q_i$ for all $i \in \{1, \ldots, m\}$. Thus if the $p_i$ and $q_i$ are not equal then $P_n(p|q)$ decreases exponentially as $n \to \infty$. Therefore the only results which effectively occur are those for which the frequencies closely approximate the probabilities.

One can relate the variational principle (3) defining the Boltzmann-Gibbs equilibrium states to the relative entropy. Set $q_i = e^{-\beta e_i}/Z$ where $Z = \sum_i e^{-\beta e_i}$. Then

$$H(p) - \beta E(p) - \log Z = H(p|q) \le 0.$$

Therefore the maximum is obtained for $p_i = q_i$ and the maximal value is $\log Z$.
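Both the asymptotic statement and the variational principle lend themselves to a quick numerical check. The sketch below is illustrative only: the distributions `p` and `q`, the energy levels `energies` and the value of `beta` are hypothetical choices, and `math.lgamma` is used in place of the Stirling bounds to evaluate the factorials. Natural logarithms are used throughout.

```python
import math

def entropy(p):
    """H(p) = -sum_i p_i log p_i (natural logarithm)."""
    return -sum(x * math.log(x) for x in p if x > 0)

def relative_entropy(p, q):
    """H(p|q) = -sum_i (p_i log p_i - p_i log q_i), which is <= 0."""
    return -sum(x * (math.log(x) - math.log(y)) for x, y in zip(p, q) if x > 0)

def log_multinomial(counts):
    """log of n!/(n_1! ... n_m!), computed with lgamma to avoid overflow."""
    n = sum(counts)
    return math.lgamma(n + 1) - sum(math.lgamma(k + 1) for k in counts)

# Hypothetical frequencies p and probabilities q (assumptions for illustration).
p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]
rates = {}
for n in (10, 100, 1000, 10000):
    counts = [round(x * n) for x in p]      # p_i n is an integer for these n
    log_P = log_multinomial(counts)         # log P_n(p)
    log_Pq = log_P + sum(k * math.log(y) for k, y in zip(counts, q))  # log P_n(p|q)
    rates[n] = (log_P / n, log_Pq / n)

# Variational principle: H(p) - beta E(p) is maximal at the Gibbs weights.
beta = 1.5                                  # hypothetical multiplier
energies = [0.0, 1.0, 2.0]                  # hypothetical energy levels e_i
Z = sum(math.exp(-beta * e) for e in energies)
gibbs = [math.exp(-beta * e) / Z for e in energies]

def functional(dist):
    """H(p) - beta E(p) for a probability vector dist."""
    return entropy(dist) - beta * sum(x * e for x, e in zip(dist, energies))

for n, (r, rq) in rates.items():
    print(n, r, rq)
print(entropy(p), relative_entropy(p, q), functional(gibbs), math.log(Z))
```

The printed rates approach $H(p)$ and $H(p|q)$ slowly, with a correction of order $n^{-1} \log n$, which is why fairly large $n$ is needed to see the limit clearly.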

Conditional entropy and information
Next we consider two processes $\alpha$ and $\beta$ and introduce the notation $A_1, \ldots, A_n$ for the possible outcomes of $\alpha$ and $B_1, \ldots, B_m$ for the possible outcomes of $\beta$. The corresponding probabilities are denoted by $p(A_1), \ldots, p(A_n)$ and $p(B_1), \ldots, p(B_m)$, respectively. The joint process, $\alpha$ followed by $\beta$, is denoted by $\alpha\beta$ and the probability of the outcome $A_i B_j$ is given by $p(A_i B_j)$. We assume that $p(A_i B_j) = p(B_j A_i)$ although this condition has to be relaxed in the subsequent non-commutative settings. If $\alpha$ and $\beta$ are independent processes then $p(A_i B_j) = p(A_i) p(B_j)$ and the identity is automatic.
The probability that $B_j$ occurs given the prior knowledge that $A_i$ has occurred is called the conditional probability for $B_j$ given $A_i$. It is denoted by $p(B_j|A_i)$. A moment's reflection establishes that

$$p(A_i B_j) = p(B_j|A_i) \, p(A_i).$$

Since the possible outcomes $B_j A_i$ of the process $\beta\alpha$ are the same as the outcomes $A_i B_j$ of the process $\alpha\beta$ one has the relation

$$p(B_j|A_i) \, p(A_i) = p(A_i|B_j) \, p(B_j)$$

for the conditional probabilities. If the two processes are independent one obviously has $p(B_j|A_i) = p(B_j)$.

The conditional entropy of the process $\beta$ given the outcome $A_i$ for $\alpha$ is then defined by

$$H(\beta|A_i) = -\sum_{j=1}^{m} p(B_j|A_i) \log p(B_j|A_i)$$

in direct analogy with the (unconditional) entropy of $\beta$ defined by

$$H(\beta) = -\sum_{j=1}^{m} p(B_j) \log p(B_j).$$

In fact if $\alpha$ and $\beta$ are independent then $H(\beta|A_i) = H(\beta)$. Since $A_i$ occurs with probability $p(A_i)$ it is natural to define the conditional entropy of $\beta$ dependent on $\alpha$ by

$$H(\beta|\alpha) = \sum_{i=1}^{n} p(A_i) \, H(\beta|A_i).$$

The entropy $H(\beta)$ is interpreted as the uncertainty in the process $\beta$ and $H(\beta|\alpha)$ is the residual uncertainty after the process $\alpha$ has occurred.
Based on the foregoing intuition Shannon defined the difference

$$I(\alpha, \beta) = H(\beta) - H(\beta|\alpha)$$

as the information about $\beta$ gained by knowledge of the outcome of $\alpha$. It corresponds to the reduction in uncertainty of the process $\beta$ arising from knowledge of $\alpha$.
The usefulness of these various concepts depends on a range of properties that are all traced back to simple features of the function x → −x log x.
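First, one has the key relation

$$H(\alpha\beta) = H(\alpha) + H(\beta|\alpha). \tag{10}$$

This follows by calculating that

$$H(\alpha\beta) = -\sum_{i,j} p(A_i B_j) \log p(A_i B_j) = -\sum_{i,j} p(A_i) \, p(B_j|A_i) \big( \log p(A_i) + \log p(B_j|A_i) \big) = H(\alpha) + \sum_{i} p(A_i) \, H(\beta|A_i),$$

where the last step uses $\sum_j p(B_j|A_i) = 1$. Note that if $\alpha$ and $\beta$ are independent then $H(\beta|\alpha) = H(\beta)$ and the relation (10) asserts that $H(\alpha\beta) = H(\alpha) + H(\beta)$. Note also that it follows from (10) that the information is given by

$$I(\alpha, \beta) = H(\alpha) + H(\beta) - H(\alpha\beta). \tag{11}$$

This latter identity establishes the symmetry

$$I(\alpha, \beta) = I(\beta, \alpha).$$

Next remark that $H(\beta|\alpha) \ge 0$ by definition. Hence $H(\alpha\beta) \ge H(\alpha)$ and, by symmetry, $H(\alpha\beta) \ge H(\beta)$. Thus one has the lower bounds

$$H(\alpha\beta) \ge \max\big( H(\alpha), H(\beta) \big).$$

But it also follows by a concavity argument that $H(\beta|\alpha) \le H(\beta)$. The argument is as follows. Since the function $x > 0 \mapsto \log x$ is concave one has

$$\sum_{i=1}^{n} \lambda_i \log x_i \le \log \Big( \sum_{i=1}^{n} \lambda_i x_i \Big)$$

for all $\lambda_i \ge 0$ with $\sum_{i=1}^{n} \lambda_i = 1$ and all $x_i > 0$. Therefore, choosing $\lambda_i = p(A_i B_j)/p(B_j)$ and $x_i = 1/p(B_j|A_i)$ for each fixed $j$,

$$H(\beta|\alpha) = \sum_{j} p(B_j) \sum_{i} \frac{p(A_i B_j)}{p(B_j)} \log \frac{1}{p(B_j|A_i)} \le \sum_{j} p(B_j) \log \Big( \sum_{i} \frac{p(A_i)}{p(B_j)} \Big) = H(\beta)$$

because $\sum_{i=1}^{n} (p(A_i B_j)/p(B_j)) = 1$. (Here I have been rather cavalier in assuming $p(B_j|A_i)$ and $p(B_j)$ are strictly positive but it is not difficult to fill in the details.) It follows that

$$I(\alpha, \beta) = H(\beta) - H(\beta|\alpha) \ge 0,$$

i.e. the information is positive. But then using the identity (11) one deduces that

$$H(\alpha\beta) \le H(\alpha) + H(\beta). \tag{15}$$

This is a generalization of the property of subadditivity $f(x + y) \le f(x) + f(y)$ for functions of a real variable. It is subsequently of fundamental importance.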
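Finally we note that the information is increasing in the sense that

$$I(\alpha, \beta\gamma) \ge I(\alpha, \beta)$$

or, equivalently,

$$H(\alpha\beta\gamma) - H(\beta\gamma) \le H(\alpha\beta) - H(\beta). \tag{17}$$

The latter property is established by calculating that, with $C_k$ denoting the possible outcomes of $\gamma$,

$$H(\alpha\beta\gamma) - H(\beta\gamma) - H(\alpha\beta) + H(\beta) = \sum_{i,j,k} p(A_i B_j C_k) \log \frac{p(A_i B_j) \, p(B_j C_k)}{p(A_i B_j C_k) \, p(B_j)} \le \sum_{i,j,k} \Big( \frac{p(A_i B_j) \, p(B_j C_k)}{p(B_j)} - p(A_i B_j C_k) \Big) = 0$$

where we have used the bound $-x \log x \le 1 - x$. This property (17) is usually referred to as strong subadditivity as it reduces to the subadditive condition (15) if $\beta$ is the trivial process with only one outcome.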
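The relations between entropy, conditional entropy and information can be verified on a small joint distribution. The following sketch is an illustration under assumptions: the table `joint` of probabilities $p(A_i B_j)$ is hypothetical and base 2 logarithms are used.

```python
import math

# Hypothetical joint probabilities p(A_i B_j): rows index A_i, columns index B_j.
joint = [
    [0.30, 0.10],
    [0.05, 0.25],
    [0.10, 0.20],
]

def H(probs):
    """Entropy in bits of a list of probabilities, with 0 log 0 = 0."""
    return -sum(x * math.log2(x) for x in probs if x > 0)

p_A = [sum(row) for row in joint]           # marginal probabilities p(A_i)
p_B = [sum(col) for col in zip(*joint)]     # marginal probabilities p(B_j)

H_alpha = H(p_A)
H_beta = H(p_B)
H_joint = H([x for row in joint for x in row])

# H(beta|alpha) = sum_i p(A_i) H(beta|A_i), with p(B_j|A_i) = p(A_i B_j)/p(A_i)
H_beta_given_alpha = sum(
    pa * H([x / pa for x in row]) for pa, row in zip(p_A, joint) if pa > 0
)
information = H_beta - H_beta_given_alpha

print(H_alpha, H_beta, H_joint, H_beta_given_alpha, information)
```

For this table the key relation (10), the identity (11), the lower bounds and the subadditivity (15) all hold to rounding error; a different table would serve equally well.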
Example There are two cities, for example Melbourne and Canberra, and the citizens of one always tell the truth but the citizens of the other never tell the truth. An absent-minded mathematician forgets where he is and attempts to find out by asking a passerby, who could be from either city. What is the least number of questions he must ask if the only replies are 'yes' and 'no'? Alternatively, how many questions must he pose to find out where he is and where the passerby lives?
Since there are two towns there are two possible outcomes to the experiment $\alpha$ of questioning. If the mathematician really has no idea where he is then the entropy $H(\alpha) = \log 2$ represents the total information. Then if one uses base 2 logarithms $H(\alpha) = 1$. So the problem is to ask a question $\beta$ that gives unit information, i.e. such that $I(\alpha, \beta) = H(\alpha) = 1$ or, equivalently, $H(\alpha|\beta) = 0$. Thus the question must be unconditional, i.e. the reply must determine the location whichever city the passerby comes from. This could be achieved by asking 'Do you live here?'.
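The claim that 'Do you live here?' yields one full bit can be checked by enumeration. The sketch below assumes, beyond what the text states explicitly, that the four combinations of location and passerby's home city are equally likely; the city labels and function names are hypothetical. For contrast it also evaluates the question 'Are you a truth-teller?', which gives no information about the location.

```python
import math
from collections import defaultdict
from itertools import product

CITIES = ("truth", "liar")   # hypothetical labels for the two cities

def answer(location, home, question):
    """Reply (True = 'yes') of a passerby from `home` met in `location`."""
    truth = question(location, home)
    return truth if home == "truth" else not truth

def H(probs):
    """Entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information(question):
    """I(alpha, beta) in bits, with alpha the location and beta the reply."""
    joint = defaultdict(float)
    for location, home in product(CITIES, CITIES):   # assumed equally likely
        joint[(location, answer(location, home, question))] += 0.25
    p_loc = defaultdict(float)
    p_ans = defaultdict(float)
    for (loc, ans), p in joint.items():
        p_loc[loc] += p
        p_ans[ans] += p
    return H(p_loc.values()) + H(p_ans.values()) - H(joint.values())

def do_you_live_here(location, home):
    return home == location

def are_you_a_truth_teller(location, home):
    return home == "truth"

info_live_here = information(do_you_live_here)
info_truth_teller = information(are_you_a_truth_teller)
print(info_live_here, info_truth_teller)
```

Every passerby answers 'yes' to the second question, so the reply is independent of the location, whereas the reply to the first question is 'yes' exactly when the mathematician is in the truthful city.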
Alternatively to find out where he is and to also decide where the passerby lives he needs to resolve the outcome of a joint experiment $\alpha_1 \alpha_2$ where $\alpha_1$ consists of finding his own location and $\alpha_2$ consists of finding the residence of the passerby. But then the total information is $H(\alpha_1 \alpha_2) = \log 4 = 2$. Hence one question will no longer suffice but two clearly do suffice: he can find out where he is with one question and then find out where the passerby lives with a second question. This is consistent with the fact that $H(\alpha_1 \alpha_2) = H(\alpha_1) + H(\alpha_2) = 2$. This is all rather easy. The following problem is slightly more complicated but can be resolved by similar reasoning.
Problem Assume in addition there is a third city, Sydney say, in which the inhabitants alternately tell the truth or tell a lie. Argue that the mathematician can find out where he is with two questions but needs four questions to find out in addition where the passerby lives.

Dynamical systems
Let $(X, \mathcal{B})$ denote a measurable space, i.e. a set $X$ equipped with a σ-algebra $\mathcal{B}$ of subsets of $X$. Further let $\mu$ denote a probability measure on $(X, \mathcal{B})$. Then $(X, \mathcal{B}, \mu)$ is called a probability space. A finite partition $\alpha = (A_1, \ldots, A_n)$ of the space is a collection of a finite number of disjoint elements $A_i$ of $\mathcal{B}$ such that $\bigcup_{i=1}^{n} A_i = X$. Given two partitions $\alpha = (A_1, \ldots, A_n)$ and $\beta = (B_1, \ldots, B_m)$ the join $\alpha \vee \beta$ is defined as the partition composed of the subsets $A_i \cap B_j$ with $i \in \{1, \ldots, n\}$ and $j \in \{1, \ldots, m\}$.
If $\alpha = (A_1, \ldots, A_n)$ is a partition of the space then $0 \le \mu(A_i) \le 1$ and $\sum_{i=1}^{n} \mu(A_i) = 1$ because $\mu$ is a probability measure. Thus the $\mu(A_i)$ correspond to the probabilities introduced earlier and $\alpha \vee \beta$ now corresponds to the joint process $\alpha\beta$. Therefore we can now use the definitions of entropy, conditional entropy, etc. introduced previously but with the replacements $p(A_i) \to \mu(A_i)$ and $p(A_i B_j) \to \mu(A_i \cap B_j)$.

Next let $T$ be a measure-preserving invertible transformation of the probability space $(X, \mathcal{B}, \mu)$. In particular $T\mathcal{B} = \mathcal{B}$ and $\mu(TA) = \mu(A)$ for all $A \in \mathcal{B}$. Then $(X, \mathcal{B}, \mu, T)$ is called a dynamical system. In applications one also encounters dynamical systems in which $T$ is replaced by a measure-preserving flow, i.e. a one-parameter family of measure-preserving transformations $\{T_t\}_{t \in \mathbb{R}}$ such that $T_s T_t = T_{s+t}$, $T_{-s} = (T_s)^{-1}$ and $T_0 = I$ is the identity. The flow is usually interpreted as a description of the change with time $t$ of the observables $A$. The single automorphism $T$ can be thought of as the change with unit time and $T^n$ is the change after $n$ units of time.
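The entropy of the partition $\alpha = (A_1, \ldots, A_n)$ of the probability space is given by

$$H(\alpha; \mu) = -\sum_{i=1}^{n} \mu(A_i) \log \mu(A_i)$$

and we now define the mean entropy of the partition of the dynamical system by

$$H(\alpha; \mu, T) = \lim_{n \to \infty} n^{-1} H(\alpha \vee T\alpha \vee \ldots \vee T^{n-1}\alpha; \mu) \tag{18}$$

where $T\alpha = (TA_1, \ldots, TA_n)$. Then the mean entropy of the automorphism $T$ is defined by

$$H(\mu, T) = \sup_{\alpha} H(\alpha; \mu, T) \tag{19}$$

where the supremum is over all finite partitions. It is of course necessary to establish that the limit in (18) exists. But this is a consequence of subadditivity. Set

$$f(n) = H(\alpha \vee T\alpha \vee \ldots \vee T^{n-1}\alpha; \mu).$$

Then

$$f(n + m) \le f(n) + H(T^n(\alpha \vee T\alpha \vee \ldots \vee T^{m-1}\alpha); \mu) = f(n) + f(m)$$

for all $n, m \in \mathbb{N}$ where we have used (15) and the $T$-invariance of $\mu$. It is, however, an easy consequence of subadditivity that the limit of $n^{-1} f(n)$ exists as $n \to \infty$.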
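The subadditivity of $f$ and the convergence of $n^{-1} f(n)$ can be observed on a concrete shift-invariant measure. The sketch below is an illustration under assumptions not made in the text: $X = \{0,1\}^{\mathbb{Z}}$, $T$ is the shift, $\alpha$ is the partition according to the symbol at the origin, and $\mu$ is a hypothetical stationary two-state Markov measure. The elements of $\alpha \vee T\alpha \vee \ldots \vee T^{n-1}\alpha$ are then labelled by words of length $n$.

```python
import math
from itertools import product

# Hypothetical stationary two-state Markov measure on {0,1}^Z (an assumption).
P = [[0.9, 0.1],
     [0.4, 0.6]]          # transition probabilities
pi = [0.8, 0.2]           # stationary distribution, satisfying pi P = pi

def block_entropy(n):
    """f(n): entropy of the join of n shifted copies of the time-zero partition."""
    total = 0.0
    for word in product((0, 1), repeat=n):
        prob = pi[word[0]]
        for a, b in zip(word, word[1:]):
            prob *= P[a][b]
        if prob > 0:
            total -= prob * math.log(prob)
    return total

f = {n: block_entropy(n) for n in range(1, 13)}
mean_entropy_estimates = [f[n] / n for n in sorted(f)]

# Standard entropy rate of a stationary Markov chain, used only as a check.
entropy_rate = -sum(pi[i] * P[i][j] * math.log(P[i][j])
                    for i in range(2) for j in range(2))

for n in sorted(f):
    print(n, f[n] / n)
print("entropy rate", entropy_rate)
```

For this measure $f(n) = f(1) + (n-1)h$ with $h$ the entropy rate, so $n^{-1} f(n)$ decreases to $h$ with an error of order $1/n$.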
The entropy $H(\mu, T)$ was introduced by Kolmogorov and Sinai and is often referred to as the Kolmogorov-Sinai invariant. This terminology arises since $H(\mu, T)$ is an isomorphy invariant. Two dynamical systems $(X_1, \mathcal{B}_1, \mu_1, T_1)$ and $(X_2, \mathcal{B}_2, \mu_2, T_2)$ are defined to be isomorphic if there is an invertible measure-preserving map $U$ of $(X_1, \mathcal{B}_1, \mu_1)$ onto $(X_2, \mathcal{B}_2, \mu_2)$ which intertwines $T_1$ and $T_2$, i.e. which has the property $U T_1 = T_2 U$. If this is the case then $H(\mu_1, T_1) = H(\mu_2, T_2)$. Thus to show that two dynamical systems are not isomorphic it suffices to prove that $H(\mu_1, T_1) \ne H(\mu_2, T_2)$. Of course this is not necessarily easy since it requires calculating the entropies. This is, however, facilitated by a result of Kolmogorov and Sinai which establishes that the supremum in (19) is attained for a special class of partitions. The partition $\alpha$ is defined to be a generator if the partitions $T^k \alpha$, $k \in \mathbb{Z}$, together generate the σ-algebra $\mathcal{B}$, and then

$$H(\mu, T) = H(\alpha; \mu, T) \tag{20}$$

for each finite generator $\alpha$.
Another way of formulating the isomorphy property is in Hilbert space terms. Let $\mathcal{H} = L^2(X; \mu)$. Then it follows from the $T$-invariance property of $\mu$ that there is a unitary operator $U$ on $\mathcal{H}$ such that $(Uf)(x) = f(T^{-1}x)$ for all $f \in \mathcal{H}$. If two dynamical systems are isomorphic then there is a unitary operator $V$ from $\mathcal{H}_1$ to $\mathcal{H}_2$ which intertwines the two unitary representatives $U_1$ and $U_2$ of the maps $T_1$ and $T_2$. Since one has $U_1 = V^{-1} U_2 V$ the spectra of $U_1$ and $U_2$ are also isomorphy invariants. The Kolmogorov-Sinai entropy was the first invariant which was not of a spectral nature.
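The mean entropy has another interesting property as a function of the measure $\mu$. The probability measures form a convex subset of the dual of the bounded continuous functions $C_b(X)$. Now if $\mu_1$ and $\mu_2$ are two probability measures and $\lambda \in [0, 1]$ then

$$H(\alpha; \lambda\mu_1 + (1-\lambda)\mu_2) \ge \lambda H(\alpha; \mu_1) + (1-\lambda) H(\alpha; \mu_2). \tag{21}$$

The concavity inequality (21) is a direct consequence of the definition of $H(\alpha; \mu)$ and the concavity of the function $x \mapsto -x \log x$. Conversely, one has inequalities

$$-\log\big(\lambda\mu_1(A_i) + (1-\lambda)\mu_2(A_i)\big) \le -\log\big(\lambda\mu_1(A_i)\big), \qquad -\log\big(\lambda\mu_1(A_i) + (1-\lambda)\mu_2(A_i)\big) \le -\log\big((1-\lambda)\mu_2(A_i)\big)$$

because $x \mapsto -\log x$ is decreasing. Therefore one obtains the 'convexity' bound

$$H(\alpha; \lambda\mu_1 + (1-\lambda)\mu_2) \le \lambda H(\alpha; \mu_1) + (1-\lambda) H(\alpha; \mu_2) - \lambda \log \lambda - (1-\lambda) \log(1-\lambda).$$

Now replacing $\alpha$ by $\alpha \vee T\alpha \vee \ldots \vee T^{n-1}\alpha$ in (21) and in the convexity bound, dividing by $n$ and taking the limit $n \to \infty$ gives

$$\lambda H(\alpha; \mu_1, T) + (1-\lambda) H(\alpha; \mu_2, T) \le H(\alpha; \lambda\mu_1 + (1-\lambda)\mu_2, T) \le \lambda H(\alpha; \mu_1, T) + (1-\lambda) H(\alpha; \mu_2, T)$$

since the term $-\lambda \log \lambda - (1-\lambda) \log(1-\lambda)$ is independent of $n$. Hence one concludes that the map $\mu \mapsto H(\alpha; \mu, T)$ is affine, i.e.

$$H(\alpha; \lambda\mu_1 + (1-\lambda)\mu_2, T) = \lambda H(\alpha; \mu_1, T) + (1-\lambda) H(\alpha; \mu_2, T)$$

for each partition $\alpha$, each pair $\mu_1$ and $\mu_2$ of $T$-invariant probability measures and each $\lambda \in [0, 1]$. Finally it follows from the identification (20) that the mean entropy is also affine,

$$H(\lambda\mu_1 + (1-\lambda)\mu_2, T) = \lambda H(\mu_1, T) + (1-\lambda) H(\mu_2, T)$$

for each pair $\mu_1$ and $\mu_2$ of $T$-invariant probability measures and each $\lambda \in [0, 1]$. This is somewhat surprising and is of great significance in the application of mean entropy in statistical mechanics.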

Mean entropy and statistical mechanics
The simplest model of statistical mechanics is the one-dimensional ferromagnetic Ising model. This describes atoms at the points of a one-dimensional lattice $\mathbb{Z}$ with two degrees of freedom which we label as 0, 1 and which we think of as a spin orientation. Thus $X = \{0, 1\}^{\mathbb{Z}}$ and a point $x \in X$ is a doubly-infinite array of 0s and 1s. The labels indicate whether a particle at a given point of the lattice has negative or positive spin orientation. Two neighbouring atoms with identical orientation are ascribed a negative unit of energy and neighbouring atoms with opposite orientation are ascribed a positive unit of energy. Therefore it is energetically favourable for the atoms to align and provide a spontaneous magnetism. Since the configurations $x$ of particles are doubly infinite the total energy ascribed to each $x$ is usually infinite but the mean energy, i.e. the energy per lattice site, is always finite.
The model generalizes to $d$ dimensions in an obvious way. Then $X = \{0, 1\}^{\mathbb{Z}^d}$ and a point $x \in X$ corresponds to a $d$-dimensional array of 0s and 1s. If one now assigns a negative unit of energy to each pair of nearest neighbours in the lattice $\mathbb{Z}^d$ with similar orientations and a positive unit of energy to the pairs with opposite orientation then the energy of a configuration of particles on a cubic subset of $\mathbb{Z}^d$ with side length $L$ grows as $L^d$, i.e. as the $d$-dimensional volume. Therefore the energy per lattice site is again finite.
The group $\mathbb{Z}^d$ of shifts acts in an obvious manner on $X$. Let $T_1, \ldots, T_d$ denote the unit shifts in each of the $d$ directions. Further let $\mu$ denote a $\mathbb{Z}^d$-invariant probability measure over $X$. Then the energy $E(\mu)$ per lattice site is well defined, it has a value in $[-1, 1]$ and $\mu \mapsto E(\mu)$ is an affine function. Now consider the entropy per lattice site.
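Let $\alpha$ denote the partition of $X$ into two subsets, the subset $A_0$ of configurations with a 0 at the origin of $\mathbb{Z}^d$ and the subset $A_1$ of configurations with a 1 at the origin. Then the translates $T^k \alpha = T_1^{k_1} \cdots T_d^{k_d} \alpha$, $k \in \mathbb{Z}^d$, determine each configuration site by site and the partition $\alpha$ is a generator of $X$. Now the previous definition of the mean entropy generalizes to

$$H(\mu) = \lim_{L \to \infty} L^{-d} \, H\Big( \bigvee_{k \in \Lambda_L} T^k \alpha; \mu \Big),$$

where $\Lambda_L$ denotes a cubic subset of $\mathbb{Z}^d$ with side length $L$, and the limit exists by an extension of the earlier subadditivity argument to the $d$-dimensional setting. The Boltzmann-Gibbs approach sketched earlier would designate the equilibrium state of the system at fixed mean energy as the measure which maximizes the functional

$$\mu \mapsto H(\mu) - \beta E(\mu).$$

This resembles the earlier algorithm but there is one vital difference. Now the supremum is taken over the infinite family of invariant probability measures $\mu$ over $X$. There is no reason that the supremum is uniquely attained. In fact this is not usually the case.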
There is a competition between two effects. Assuming $\beta > 0$ the energy term $-\beta E(\mu)$ is larger if $E(\mu)$ is negative and this requires alignment of the spins, i.e. ordered configurations are preferred. But the entropy term $H(\mu)$ is largest if the system is disordered, i.e. if all possible configurations are equally probable. If $\beta$ is large the energy term tends to prevail but if $\beta$ is small then the entropy term prevails. In fact $\beta$ is interpretable as the inverse temperature and there is a tendency to ordering at low temperatures and to disorder at high temperatures. Since there are two possible directions of alignment of the spins this indicates that there are two distinct maximizing measures at low temperature and only one at high temperatures. The advantage of this description is that it reflects reality. The Ising model, with $d \ge 2$, indeed gives a simple description of a phase transition for which there is a spontaneous magnetization at low temperatures.
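The competition between energy and entropy can be seen in a small finite stand-in for the model. The sketch below is an illustration only: it replaces the infinite lattice by a hypothetical periodic chain of $N = 10$ sites, enumerates all $2^N$ configurations and evaluates the energy and entropy per site of the Gibbs measure at a small and a large value of $\beta$. A one-dimensional finite chain shows the trend towards order but not a phase transition, which requires $d \ge 2$ and the infinite-volume limit.

```python
import math
from itertools import product

def gibbs_per_site(beta, N=10):
    """Energy, entropy and log Z per site for the Gibbs measure on a periodic chain."""
    energies = []
    weights = []
    for x in product((0, 1), repeat=N):
        # -1 for each aligned nearest-neighbour pair, +1 for each opposed pair
        e = sum(-1 if x[i] == x[(i + 1) % N] else 1 for i in range(N))
        energies.append(e)
        weights.append(math.exp(-beta * e))
    Z = sum(weights)
    probs = [w / Z for w in weights]
    E = sum(p * e for p, e in zip(probs, energies)) / N
    H = -sum(p * math.log(p) for p in probs if p > 0) / N
    return E, H, math.log(Z) / N

hot = gibbs_per_site(beta=0.1)    # high temperature: entropy dominates
cold = gibbs_per_site(beta=3.0)   # low temperature: energy dominates

print("beta = 0.1:", hot)
print("beta = 3.0:", cold)
```

At small $\beta$ the entropy per site is close to its maximum $\log 2$ and the energy per site is close to zero, while at large $\beta$ the energy per site approaches $-1$ and the entropy per site is small. In both cases $H - \beta E$ equals $N^{-1} \log Z$, as the finite variational principle requires.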
Although we have described the model with a nearest neighbour interaction which favours alignment of the model atoms, the same general features pertain if the interaction favours anti-parallel alignment, i.e. if the alignment of neighbours has positive energy and the anti-alignment negative energy. Then it is still energetically favourable to have an ordered state but the type of ordering is different. The model then describes a phenomenon called antiferromagnetism.
The description of the invariant equilibrium states as the invariant measures which maximize the mean entropy at fixed mean energy has many other positive aspects. Since $\mu \mapsto H(\mu) - \beta E(\mu)$ is an affine function it tends to attain its maximum at extremal points of the convex weak*-compact set of invariant measures $\mathcal{E}$. In fact if the maximum is unique then the maximizing measure is automatically extremal. If, however, the maximum is not uniquely attained then the maximizing measures form a face $\Delta_\beta$ of the convex set $\mathcal{E}$ and each $\mu \in \Delta_\beta$ has a unique decomposition as a convex combination of extremal measures in $\Delta_\beta$. This indicates that the extremal measures correspond to pure phases and in the case of a phase transition there is a unique prescription of the phase separation. This interpretation is corroborated by the observation that the extremal invariant states are characterized by the absence of long range correlations.
The foregoing description of the thermodynamic phases of macroscopic systems was successfully developed in the 1970s and 1980s and also extended to the description of quantum systems. But the latter extension requires the development of a non-commutative generalization of the entropy.

Quantum mechanics and non-commutativity
The Ising model has a simple quantum-mechanical extension. Again one envisages atoms at the points of a cubic lattice $\mathbb{Z}^d$ but each atom now has more structure. The simplest assumption is that the observables corresponding to the atom at the point $x \in \mathbb{Z}^d$ are described by an algebra $\mathcal{A}_{\{x\}}$ of $2 \times 2$ matrices. Then the observables corresponding to the atoms at the points of a finite subset $\Lambda \subset \mathbb{Z}^d$ are described by an algebra $\mathcal{A}_\Lambda$ of $2^{|\Lambda|} \times 2^{|\Lambda|}$ matrices where $|\Lambda|$ indicates the number of points in $\Lambda$. Thus

$$\mathcal{A}_\Lambda = \bigotimes_{x \in \Lambda} \mathcal{A}_{\{x\}}$$

where the product is a tensor product of matrices. A quantum-mechanical state $\omega_\Lambda$ of the subsystem $\Lambda$ is then determined by a positive matrix $\rho_\Lambda$ with $\mathrm{Tr}_\Lambda(\rho_\Lambda) = 1$ where $\mathrm{Tr}_\Lambda$ denotes the trace over the matrices $\mathcal{A}_\Lambda$. The value of an observable $A \in \mathcal{A}_\Lambda$ in the state $\omega_\Lambda$ is then given by

$$\omega_\Lambda(A) = \mathrm{Tr}_\Lambda(\rho_\Lambda A).$$

Now if $\Lambda \subset \Lambda'$ one can identify $\mathcal{A}_\Lambda$ as a subalgebra of $\mathcal{A}_{\Lambda'}$ and for consistency the matrices $\rho_\Lambda$ that determine the state must satisfy the condition

$$\mathrm{Tr}_{\Lambda'}(\rho_{\Lambda'} A) = \mathrm{Tr}_\Lambda(\rho_\Lambda A)$$

for all $A \in \mathcal{A}_\Lambda$. The natural generalization of the classical entropy is now given by the family of entropies

$$H_\Lambda(\omega) = -\mathrm{Tr}_\Lambda(\rho_\Lambda \log \rho_\Lambda)$$

as $\Lambda$ varies over the bounded subsets of $\mathbb{Z}^d$. The previous mean entropy should then be defined by

$$H(\omega) = \lim_{\Lambda \to \infty} \frac{H_\Lambda(\omega)}{|\Lambda|}$$

if the limit exists. The existence of the limit is now a rather different problem than before. Nevertheless it can be established for translation invariant states by an extension of the earlier subadditivity argument which we now briefly describe.

First if $\rho$ and $\sigma$ are two positive matrices both with unit trace then the entropy of $\rho$ relative to $\sigma$ is defined by

$$H(\rho|\sigma) = -\mathrm{Tr}(\rho \log \rho - \rho \log \sigma)$$

in direct analogy with the earlier definition (6). The key point is that one still has the property $H(\rho|\sigma) \le 0$.
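This is established as follows. Let $\rho_i$ and $\sigma_i$ denote the eigenvalues of $\rho$ and $\sigma$. Further let $\psi_i$ denote an orthonormal family of eigenfunctions of $\rho$ corresponding to the eigenvalues $\rho_i$. Then

$$H(\rho|\sigma) = -\sum_i \big( \rho_i \log \rho_i - \rho_i (\psi_i, (\log \sigma) \psi_i) \big) \le -\sum_i \big( \rho_i \log \rho_i - \rho_i \log (\psi_i, \sigma \psi_i) \big) \le \sum_i \big( (\psi_i, \sigma \psi_i) - \rho_i \big) = 0 \tag{26}$$

where we have used concavity of the logarithm and the inequality $-x \log x \le 1 - x$. Now suppose that $\Lambda_1$ and $\Lambda_2$ are two disjoint subsets of $\mathbb{Z}^d$ and write $H_\Lambda(\omega) = -\mathrm{Tr}_\Lambda(\rho_\Lambda \log \rho_\Lambda)$ for the entropy of the subsystem $\Lambda$. Then one can apply the foregoing with $\rho = \rho_{\Lambda_1 \cup \Lambda_2}$ and $\sigma = \rho_{\Lambda_1} \otimes \rho_{\Lambda_2}$. But using (26) and the identity

$$-\mathrm{Tr}_{\Lambda_1 \cup \Lambda_2}\big( \rho_{\Lambda_1 \cup \Lambda_2} \log(\rho_{\Lambda_1} \otimes \rho_{\Lambda_2}) \big) = H_{\Lambda_1}(\omega) + H_{\Lambda_2}(\omega)$$

one immediately deduces that

$$H_{\Lambda_1 \cup \Lambda_2}(\omega) \le H_{\Lambda_1}(\omega) + H_{\Lambda_2}(\omega).$$

This corresponds to the earlier subadditivity and suffices to prove the existence of the mean entropy.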
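The inequality $H(\rho|\sigma) \le 0$ can be checked for a pair of non-commuting $2 \times 2$ density matrices. The sketch below is an illustration under assumptions: the matrices `rho` and `sigma` are hypothetical, they are taken real symmetric so that the diagonalization can be written in closed form with the standard library alone, and `sigma` is assumed invertible.

```python
import math

def eig_sym2(m):
    """Eigenvalues and orthonormal eigenvectors of a real symmetric 2x2 matrix."""
    (a, b), (_, d) = m
    theta = 0.5 * math.atan2(2 * b, a - d)   # rotation angle that diagonalizes m
    c, s = math.cos(theta), math.sin(theta)
    v1, v2 = (c, s), (-s, c)
    l1 = a * c * c + 2 * b * c * s + d * s * s
    l2 = a * s * s - 2 * b * c * s + d * c * c
    return (l1, l2), (v1, v2)

def expect(m, v):
    """Quadratic form (v, m v) for a real vector v."""
    (a, b), (_, d) = m
    x, y = v
    return a * x * x + 2 * b * x * y + d * y * y

def entropy(rho):
    """-Tr(rho log rho), with the convention 0 log 0 = 0."""
    vals, _ = eig_sym2(rho)
    return -sum(l * math.log(l) for l in vals if l > 1e-15)

def relative_entropy(rho, sigma):
    """H(rho|sigma) = -Tr(rho log rho - rho log sigma); sigma assumed invertible."""
    s_vals, s_vecs = eig_sym2(sigma)
    # Tr(rho log sigma) evaluated in the eigenbasis of sigma
    tr_rho_log_sigma = sum(math.log(l) * expect(rho, v)
                           for l, v in zip(s_vals, s_vecs))
    return entropy(rho) + tr_rho_log_sigma

# Hypothetical density matrices: positive, unit trace, and non-commuting.
rho = [[0.7, 0.2], [0.2, 0.3]]
sigma = [[0.5, -0.1], [-0.1, 0.5]]

print(entropy(rho), relative_entropy(rho, sigma), relative_entropy(rho, rho))
```

The relative entropy is strictly negative for this pair and vanishes, up to rounding, when the two matrices coincide. Checking the subadditivity for a two-site system would require $4 \times 4$ matrices and a general eigenvalue routine, which is omitted here.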
These simple observations on matrix algebras are the starting point of the development of a noncommutative entropy theory.

Bibliography
There is an enormous literature on entropy but the 1948 paper A mathematical theory of communication by Claude Shannon in the Bell System Technical Journal remains one of the most readable accounts of its properties and significance [1]. This paper can be downloaded from http://plan9.bell-labs.com/cm/ms/what/shannonday/shannon1948.pdf. Note that in later versions it became The mathematical theory of communication.
Another highly readable account of the entropy and its applications to communication theory is given in the book Probability and Information by A. M. Yaglom and I. M. Yaglom. This book was first published in Russian in 1956 as Verojatnost' i informacija and republished in 1959 and 1973. It has also been translated into French, German, English and several other languages. The French version was published by Dunod and an expanded English version was published in 1983 by the Hindustan Publishing Corporation [2]. I was unable to find any downloadable version.
The example of the absentminded mathematician was adapted from this book which contains many more recreational examples and a variety of applications to language, music, genetics etc.
The discussion of the asymptotics of the multinomial coefficients is apocryphal. I have followed the discussion in the Notes and Remarks to Chapter VI of Operator Algebras and Quantum Statistical Mechanics 2 by Bratteli and Robinson, Springer-Verlag, 1981 [3]. This book is now available as two searchable pdf files: http://folk.uio.no/bratteli/bratrob/VOL-1S1.pdf and http://folk.uio.no/bratteli/bratrob/VOL-2.pdf.
There are now many books that describe the ergodic theory of dynamical systems including the theory of entropy. The earliest source in English which covered the Kolmogorov-Sinai theory was, I believe, the 1962-63 Aarhus lecture notes of Jacobs, Ergodic theory I and II. These are now difficult to find but are worth reading if you can locate a copy. Another early source is the book by Arnold and Avez, Problèmes ergodiques de la mécanique classique, Gauthier-Villars, Paris 1967 [4].
More recent books which I have found useful are An Introduction to Ergodic Theory by Ya. G. Sinai, Princeton University Press, Mathematical Notes, 1976 [5]; An Introduction to Ergodic Theory by P. Walters, Springer-Verlag, Graduate Texts in Mathematics, 1981 [6]; and Topics in Ergodic Theory by W. Parry, Cambridge University Press, 1981 [7]. But there are many more.
Chapter VI of Operator Algebras and Quantum Statistical Mechanics 2 by Bratteli and Robinson [3] contains a description of the applications of entropy to spin systems but the theory has moved on since then. Another source which covers more recent developments in the quantum-mechanical applications