1. Introduction
The concept of entropy originated in the physical and engineering sciences but now plays a ubiquitous role in all areas of science and in many non-scientific disciplines. A quick search of the ANU library catalogue gives books on entropy in mathematics, physics, chemistry, biology, communication theory and engineering but also in economics, linguistics, music, architecture, urban planning, social and cultural theory and even in creationism. Many of the scientific applications will be described in the lectures over the next three weeks. In this brief introductory lecture we describe some of the theoretical ideas that underpin these applications.
Entropy is an encapsulation of the rather nebulous notions of disorder or chaos, uncertainty or randomness. It was introduced by Clausius in the 19th century in thermodynamics and was an integral part of Boltzmann’s theory. In the thermodynamic context the emphasis was on entropy as a measure of disorder. Subsequently the probabilistic nature of the concept emerged more clearly with Gibbs’s work on statistical mechanics. Entropy then had a major renaissance in the middle of the 20th century with the development of Shannon’s mathematical theory of communication. It was one of Shannon’s great insights that entropy could be used as a measure of information content. There is some ambiguity in the notion of information: in Shannon’s theory it is less a measure of what one does communicate than of what one could communicate, i.e. it is a measure of one’s freedom of choice when one selects a message to be communicated. Shannon also stressed the importance of the relative entropy as a measure of redundancy. The relative entropy gives a comparison between two probabilistic systems and typically compares the actual entropy with the maximal possible entropy. It is the relative entropy that has played the key role in many of the later developments and applications.
Another major landmark in the mathematical theory of entropy was the construction by Kolmogorov and Sinai of an isomorphy invariant for dynamical systems. The invariant corresponds to a mean value over time of the entropy of the system. It was remarkable as it differed in character from all previous spectral invariants and it provided a mechanism for the complete classification of some important systems. Other landmarks were the definition of mean entropy as an affine functional over the state space of the operator algebras describing models of statistical mechanics and the application of this functional to the characterization of equilibrium states. Subsequently entropy became a useful concept in the classification of operator algebras independently of any physical background.
In the sequel we discuss entropy, relative entropy and conditional entropy in the general framework of probability theory. In this context the entropy is best interpreted as a measure of uncertainty. Subsequently we develop some applications, give some simple examples and indicate how the theory of entropy has been extended to non-commutative settings such as quantum mechanics and operator algebras.
  2. Entropy and uncertainty
First consider a probabilistic process with $n$ possible outcomes. If $n=2$ this might be something as simple as the toss of a coin or something as complex as a federal election. Fortunately the details of the process are unimportant for the sequel. Initially we make the simplifying assumption that all $n$ outcomes are equally probable, i.e. each outcome has probability $1/n$. It is clear that the inherent uncertainty of the system, the uncertainty that a specified outcome actually occurs, is an increasing function of $n$. Let us denote the value of this function by $f(n)$. It is equally clear that $f(1)=0$ since there is no uncertainty if there is only one possible outcome. Moreover, if one considers two independent systems with $n$ and $m$ outcomes, respectively, then the combined system has $nm$ possible outcomes and one would expect the uncertainty to be the sum of the individual uncertainties. In symbols one would expect
    $f(nm) = f(n) + f(m)$ .
The additivity reflects the independence of the two processes. But if this property is valid for all positive integers $n$ and $m$ then it is easy to deduce that $f(n)=\log n$, although the base of logarithms remains arbitrary. (It is natural to choose base 2 as this ensures that $f(2)=1$, i.e. a system such as coin tossing is defined to have unit uncertainty, but other choices might be more convenient.) Thus the uncertainty per outcome is given by $f(n)/n=(\log n)/n$ or, expressed in terms of the probability $p=1/n$, by $-p\log p$.
Secondly, consider a process with $n$ possible outcomes with probabilities $p_1,\dots,p_n$, respectively. Then it is natural to ascribe uncertainty $-p_i\log p_i$ to the $i$-th outcome. This is consistent with the foregoing discussion and leads to the hypothesis that the total uncertainty is the sum of the uncertainties of the individual outcomes. This is the standard expression for the entropy of the probabilistic process and we will denote it by the symbol $H$, or $H(p_1,\dots,p_n)$. (This choice of notation dates back to Boltzmann.) Explicitly, the entropy is defined by
    $H(p_1,\dots,p_n) = -\sum_{i=1}^{n} p_i\log p_i$ .        (1)
Although the argument we have given to identify the entropy with the inherent uncertainty is of a rather ad hoc nature, applications establish that it gives a surprisingly efficient description. This will be illustrated in the sequel. It is a case of `the proof of the pudding is in the eating’.
Before proceeding we note that the entropy $H$ satisfies two simple bounds. If $p\in(0,1]$ then $-p\log p\geq 0$ and one can extend the function $p\mapsto -p\log p$ to the closed interval $[0,1]$ by setting $0\log 0=0$. Then $H(p_1,\dots,p_n)\geq 0$ with equality if and only if one outcome has probability one and the others have probability zero. It is also straightforward to deduce that $H$ is maximal if and only if the probabilities are all equal, i.e. if and only if $p_i=1/n$ for all $i$. Therefore one has bounds
    $0 \leq H(p_1,\dots,p_n) \leq \log n$ .        (2)
There is a third less precise principle: the most probable outcomes give the major contribution to the total entropy.
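As a purely illustrative aside (not part of the original argument), the definition (1) and the bounds (2) are easily checked numerically. The following Python sketch is one possible implementation; the function name and the choice of base are arbitrary.

```python
import numpy as np

def entropy(p, base=2):
    """Shannon entropy H(p_1,...,p_n) = -sum_i p_i log p_i, with 0 log 0 = 0."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]                      # drop zero entries: 0 log 0 = 0 by convention
    return -np.sum(nz * np.log(nz)) / np.log(base)

n = 6                                  # arbitrary example size
uniform = np.ones(n) / n               # all outcomes equally probable
certain = np.eye(n)[0]                 # one outcome has probability one

print(entropy(uniform), np.log2(n))    # maximal value log_2 n: the upper bound in (2)
print(entropy(certain))                # 0, no uncertainty: the lower bound in (2)
print(entropy([0.5, 0.5]))             # 1 bit, a fair coin toss
```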
The entropy enters the Boltzmann–Gibbs description of equilibrium statistical mechanics through the prescription that the state of equilibrium is given by the microscopic particle configurations which maximize the entropy under the constraints imposed by the observation of macroscopic quantities such as the energy and density. Thus if the configuration with assigned probability $p_i$ has energy $\varepsilon_i$ the idea is to maximize $H(p_1,\dots,p_n)$ with the mean energy $\sum_i p_i\varepsilon_i$ held fixed. If one formulates this problem in terms of a Lagrange multiplier $\beta$ then one must maximize the function
    $(p_1,\dots,p_n) \mapsto H(p_1,\dots,p_n) - \beta\sum_i p_i\varepsilon_i$ .        (3)
We will discuss this problem later in the lecture.
  3. Entropy and multinomial coefficients
In applications to areas such as statistical mechanics one is usually dealing with systems with a large number of possible configurations. The significance of entropy is that it governs the asymptotic behaviour of the multinomial coefficients
    $C^{\,n}_{n_1\ldots n_m} = \dfrac{n!}{n_1!\cdots n_m!}$
where $n_1+\cdots+n_m=n$. These coefficients express the number of ways one can divide $n$ objects into $m$ subsets of $n_1,\dots,n_m$ objects, respectively. If $n$ is large then it is natural to examine the number of partitions into $m$ subsets with fixed proportions $p_1,\dots,p_m$. Thus one examines $C^{\,n}_{n_1\ldots n_m}$ with $n_i=p_i\,n$, where $p_1+\cdots+p_m=1$, as $n\to\infty$. Since the sum over all possible partitions,
    $\sum_{n_1+\cdots+n_m=n} C^{\,n}_{n_1\ldots n_m} = m^n$ ,
increases exponentially with $n$ one expects a similar behaviour for $C^{\,n}_{p_1n\ldots p_mn}$. Therefore the asymptotic behaviour will be governed by the function $n^{-1}\log C^{\,n}_{p_1n\ldots p_mn}$. But this is easily estimated by use of the Stirling-type bounds
    $e\,(n/e)^n \leq n! \leq e\,n\,(n/e)^n$
for the factorials. One finds
    $n^{-1}\log C^{\,n}_{p_1n\ldots p_mn} \to H(p_1,\dots,p_m)$        (4)
as $n\to\infty$ where $H(p_1,\dots,p_m)$ is the entropy. Thus the predominant asymptotic feature of the $C^{\,n}_{p_1n\ldots p_mn}$ is an exponential increase $e^{nH}$ with $H$ the entropy of the partition $(p_1,\dots,p_m)$.
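The convergence (4) can be observed directly. The following sketch computes $n^{-1}\log C^{\,n}_{p_1n\ldots p_mn}$ via the log-Gamma function and compares it with the entropy; the particular proportions are an arbitrary example.

```python
import numpy as np
from math import lgamma

def log_multinomial(counts):
    """log( n! / (n_1! ... n_m!) ) computed via log-Gamma to avoid overflow."""
    n = sum(counts)
    return lgamma(n + 1) - sum(lgamma(k + 1) for k in counts)

def entropy(p):
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return -np.sum(nz * np.log(nz))

p = np.array([0.5, 0.3, 0.2])          # fixed proportions p_1,...,p_m (arbitrary example)
for n in (10, 100, 1000, 10000):
    counts = np.round(p * n).astype(int)
    counts[0] += n - counts.sum()      # make the counts add up to n exactly
    print(n, log_multinomial(counts) / n, entropy(counts / n))
```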
Next consider $n$ independent repetitions of an experiment with $m$ possible outcomes and corresponding probabilities $q_1,\dots,q_m$. The probability that these outcomes occur with frequencies $p_1,\dots,p_m$ is given by
    $P_n = C^{\,n}_{p_1n\ldots p_mn}\; q_1^{\,p_1n}\cdots q_m^{\,p_mn}$ .        (5)
Then estimating as before one finds
    $n^{-1}\log P_n \to -H(p\,|\,q)$
as $n\to\infty$ where $H(p\,|\,q)$ is given by
    $H(p\,|\,q) = \sum_{i=1}^{m} p_i\log(p_i/q_i)$ .        (6)
The latter expression is the relative entropy of the frequencies $p_1,\dots,p_m$ with respect to the probabilities $q_1,\dots,q_m$. Now it is readily established that $H(p\,|\,q)\geq 0$ with equality if, and only if, $p_i=q_i$ for all $i$. Thus if the $p_i$ and $q_i$ are not equal then $P_n$ decreases exponentially as $n\to\infty$. Therefore the only results which effectively occur are those for which the frequencies closely approximate the probabilities.
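This exponential decay is again easy to see numerically. The sketch below, with an arbitrary choice of frequencies $p$ and probabilities $q$, compares $n^{-1}\log P_n$ with $-H(p\,|\,q)$; it is only an illustration of the asymptotics.

```python
import numpy as np
from math import lgamma

def rel_entropy(p, q):
    """Relative entropy H(p|q) = sum_i p_i log(p_i / q_i)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    nz = p > 0
    return np.sum(p[nz] * np.log(p[nz] / q[nz]))

def log_prob_of_frequencies(counts, q):
    """log P_n for observing the given counts in n independent trials with probabilities q."""
    n = sum(counts)
    log_mult = lgamma(n + 1) - sum(lgamma(k + 1) for k in counts)
    return log_mult + np.sum(np.asarray(counts) * np.log(q))

q = np.array([0.25, 0.25, 0.5])        # true probabilities (arbitrary example)
p = np.array([0.4, 0.2, 0.4])          # observed frequencies, different from q
for n in (10, 100, 1000, 10000):
    counts = np.round(p * n).astype(int)
    counts[0] += n - counts.sum()
    print(n, log_prob_of_frequencies(counts, q) / n, -rel_entropy(counts / sum(counts), q))
```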
One can relate the variational principle (3) defining the Boltzmann–Gibbs equilibrium states to the relative entropy. Set $q_i = e^{-\beta\varepsilon_i}/Z$ where $Z=\sum_i e^{-\beta\varepsilon_i}$. Then
    $H(p_1,\dots,p_n) - \beta\sum_i p_i\varepsilon_i = -H(p\,|\,q) + \log Z$ .
Therefore the maximum is obtained for $p_i=q_i$ and the maximal value is $\log Z$.
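The identification of the maximizer can also be checked numerically. The following sketch (illustrative only; the energies and the value of $\beta$ are arbitrary) evaluates the functional (3) at the Gibbs distribution and at random points of the simplex.

```python
import numpy as np

rng = np.random.default_rng(0)

def entropy(p):
    nz = p[p > 0]
    return -np.sum(nz * np.log(nz))

eps = np.array([0.0, 1.0, 2.5, 4.0])    # energies of the configurations (arbitrary example)
beta = 1.3                              # Lagrange multiplier / inverse temperature

def F(p):                               # the functional (3): H(p) - beta * sum_i p_i eps_i
    return entropy(p) - beta * np.dot(p, eps)

Z = np.sum(np.exp(-beta * eps))         # partition function
gibbs = np.exp(-beta * eps) / Z         # the claimed maximizer p_i = exp(-beta eps_i) / Z

print(F(gibbs), np.log(Z))              # maximal value equals log Z
trials = rng.dirichlet(np.ones(len(eps)), size=20000)      # random points of the simplex
print(max(F(p) for p in trials) <= F(gibbs) + 1e-12)       # no trial beats the Gibbs measure
```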
  4. Conditional entropy and information
Next we consider two processes $\alpha$ and $\beta$ and introduce the notation $a_1,\dots,a_n$ for the possible outcomes of $\alpha$ and $b_1,\dots,b_m$ for the possible outcomes of $\beta$. The corresponding probabilities are denoted by $p_1,\dots,p_n$ and $q_1,\dots,q_m$, respectively. The joint process, $\alpha$ followed by $\beta$, is denoted by $\alpha\vee\beta$ and the probability of the outcome $(a_i,b_j)$ is given by $p_{ij}$. We assume the compatibility conditions $\sum_j p_{ij}=p_i$ and $\sum_i p_{ij}=q_j$, although these have to be relaxed in the subsequent non-commutative settings. If $\alpha$ and $\beta$ are independent processes then $p_{ij}=p_i\,q_j$ and the identities are automatic.
The probability that $b_j$ occurs given the prior knowledge that $a_i$ has occurred is called the conditional probability for $b_j$ given $a_i$. It is denoted by $P(b_j|a_i)$. A moment’s reflection establishes that
    $P(b_j|a_i) = p_{ij}/p_i$ .        (7)
Since the possible outcomes $(a_i,b_j)$ of the process $\alpha\vee\beta$ are the same as the outcomes $(b_j,a_i)$ of the process $\beta\vee\alpha$ one has the relation
    $P(b_j|a_i)\,p_i = P(a_i|b_j)\,q_j$
for the conditional probabilities. If the two processes are independent one obviously has $P(b_j|a_i)=q_j$.
The conditional entropy of the process $\beta$ given the outcome $a_i$ for $\alpha$ is then defined by
    $H(\beta|a_i) = -\sum_{j=1}^{m} P(b_j|a_i)\log P(b_j|a_i)$        (8)
in direct analogy with the (unconditional) entropy of $\beta$ defined by $H(\beta)=-\sum_j q_j\log q_j$. In fact if $\alpha$ and $\beta$ are independent then $H(\beta|a_i)=H(\beta)$. Since $a_i$ occurs with probability $p_i$ it is natural to define the conditional entropy of $\beta$ dependent on $\alpha$ by
    $H(\beta|\alpha) = \sum_{i=1}^{n} p_i\,H(\beta|a_i)$ .        (9)
The entropy $H(\beta)$ is interpreted as the uncertainty in the process $\beta$ and $H(\beta|\alpha)$ is the residual uncertainty after the process $\alpha$ has occurred.
Based on the foregoing intuition Shannon defined the difference
    $I(\alpha,\beta) = H(\beta) - H(\beta|\alpha)$
as the information about $\beta$ gained by knowledge of the outcome of $\alpha$. It corresponds to the reduction in uncertainty of the process $\beta$ arising from knowledge of $\alpha$.
The usefulness of these various concepts depends on a range of properties that are all traced back to simple features of the function $x\mapsto -x\log x$.
First, one has the key relation
    $H(\alpha\vee\beta) = H(\alpha) + H(\beta|\alpha)$ .        (10)
This follows by calculating that
    $H(\alpha\vee\beta) = -\sum_{i,j} p_{ij}\log p_{ij} = -\sum_{i,j} p_i\,P(b_j|a_i)\bigl(\log p_i + \log P(b_j|a_i)\bigr) = H(\alpha) + H(\beta|\alpha)$ .
Note that if $\alpha$ and $\beta$ are independent then $H(\beta|\alpha)=H(\beta)$ and the relation (10) asserts that $H(\alpha\vee\beta)=H(\alpha)+H(\beta)$. Note also that it follows from (10) that the information is given by $I(\alpha,\beta)=H(\beta)-H(\beta|\alpha)=H(\alpha)+H(\beta)-H(\alpha\vee\beta)$. This latter identity establishes the symmetry
    $I(\alpha,\beta) = H(\alpha) + H(\beta) - H(\alpha\vee\beta) = I(\beta,\alpha)$ .        (11)
Next remark that $H(\beta|a_i)\geq 0$ by definition. Hence $H(\beta|\alpha)\geq 0$ and, by symmetry, $H(\alpha|\beta)\geq 0$. Thus, by (10), one has the lower bounds
    $H(\alpha\vee\beta) \geq H(\alpha)\;,\qquad H(\alpha\vee\beta) \geq H(\beta)$ .        (12)
But it also follows by a convexity argument that $H(\beta|\alpha)\leq H(\beta)$. The argument is as follows. Since the function $x\mapsto x\log x$ is convex one has
    $\bigl(\textstyle\sum_i\lambda_i x_i\bigr)\log\bigl(\sum_i\lambda_i x_i\bigr) \leq \sum_i \lambda_i\, x_i\log x_i$        (13)
for all $x_i\geq 0$ and all $\lambda_i\geq 0$ with $\sum_i\lambda_i=1$. Therefore
    $H(\beta) = -\sum_j \bigl(\textstyle\sum_i p_i P(b_j|a_i)\bigr)\log\bigl(\sum_i p_i P(b_j|a_i)\bigr) \geq -\sum_{i,j} p_i\,P(b_j|a_i)\log P(b_j|a_i) = H(\beta|\alpha)$
because $\sum_i p_i P(b_j|a_i)=q_j$. (Here I have been rather cavalier in assuming the $p_i$ and $P(b_j|a_i)$ are strictly positive but it is not difficult to fill in the details.)
It follows from $H(\beta|\alpha)\leq H(\beta)$ that
    $I(\alpha,\beta) \geq 0$ ,        (14)
i.e. the information is positive. But then using the identity (11) one deduces that
    $H(\alpha\vee\beta) \leq H(\alpha) + H(\beta)$ .        (15)
This is a generalization of the property of subadditivity, $f(x+y)\leq f(x)+f(y)$, for functions of a real variable. It is subsequently of fundamental importance.
Finally we note that the information is increasing in the sense that for a third process $\gamma$
    $I(\alpha\vee\beta,\gamma) \geq I(\beta,\gamma)$        (16)
or, equivalently,
    $H(\gamma|\alpha\vee\beta) \leq H(\gamma|\beta)\;,\quad\mbox{i.e.}\quad H(\alpha\vee\beta\vee\gamma) + H(\beta) \leq H(\alpha\vee\beta) + H(\beta\vee\gamma)$ .        (17)
The latter property is established by calculating that
    $H(\gamma|\alpha\vee\beta) = \sum_{i,j} p_{ij}\,H(\gamma\,|\,(a_i,b_j)) = \sum_j q_j\sum_i P(a_i|b_j)\,H(\gamma\,|\,(a_i,b_j)) \leq \sum_j q_j\,H(\gamma|b_j) = H(\gamma|\beta)$ ,
where $H(\gamma\,|\,(a_i,b_j))$ denotes the entropy of $\gamma$ conditioned on the joint outcome $(a_i,b_j)$ and we have used the bound $H(\gamma|\alpha)\leq H(\gamma)$ established above, applied for each fixed outcome $b_j$ to the probabilities conditioned on $b_j$. This property (17) is usually referred to as strong subadditivity as it reduces to the subadditive condition (15) if $\beta$ is the trivial process with only one outcome.
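For readers who prefer to experiment, the identities and inequalities (10)–(15) can be verified numerically for any joint distribution. The sketch below uses a randomly generated joint distribution; none of the names or numbers are significant.

```python
import numpy as np

def H(p):
    """Entropy in bits of a probability vector or array (zero entries are ignored)."""
    p = np.asarray(p, dtype=float).ravel()
    nz = p[p > 0]
    return -np.sum(nz * np.log2(nz))

rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(12)).reshape(3, 4)   # joint probabilities p_ij of the process α∨β

p = P.sum(axis=1)                              # marginal probabilities p_i of α
q = P.sum(axis=0)                              # marginal probabilities q_j of β

cond = P / p[:, None]                          # conditional probabilities P(b_j|a_i), as in (7)
H_cond = np.sum(p * np.array([H(row) for row in cond]))   # (9): H(β|α) = Σ_i p_i H(β|a_i)

print(np.isclose(H(P), H(p) + H_cond))         # the key relation (10)
info = H(q) - H_cond                           # information I(α,β) = H(β) - H(β|α)
print(np.isclose(info, H(p) + H(q) - H(P)))    # the symmetric identity (11)
print(H(P) >= max(H(p), H(q)))                 # the lower bounds (12)
print(info >= 0, H(P) <= H(p) + H(q))          # positivity (14) and subadditivity (15)
```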
Example  There are two cities, for example Melbourne and Canberra, and the citizens of one always tell the truth but the citizens of the other never tell the truth. An absent-minded mathematician forgets where he is and attempts to find out by asking a passerby, who could be from either city. What is the least number of questions he must ask if the only replies are `yes’ and `no’? Alternatively, how many questions must he pose to find out both where he is and where the passerby lives?
 Since there are two towns there are two possible outcomes to the experiment $\alpha$ of determining his location by questioning. If the mathematician really has no idea where he is then the entropy $H(\alpha)$ represents the total information required. Then if one uses base 2 logarithms $H(\alpha)=1$. So the problem is to ask a question $\beta$ that gives unit information, i.e. such that $I(\alpha,\beta)=1$ or, equivalently, $H(\alpha|\beta)=0$. Thus the reply must determine his location unconditionally, i.e. independently of where the passerby lives. This could be achieved by asking `Do you live here?’.
Alternatively, to find out where he is and also to decide where the passerby lives he needs to resolve the outcome of a joint experiment $\alpha\vee\gamma$ where $\alpha$ consists of finding his own location and $\gamma$ consists of finding the residence of the passerby. But then the total information is $H(\alpha\vee\gamma)=2$. Hence one question will no longer suffice but two clearly do suffice: he can find out where he is with one question and then find out where the passerby lives with a second question. This is consistent with the fact that $H(\alpha\vee\gamma)=H(\alpha)+H(\gamma)$ because the two experiments are independent.
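One can also check the bookkeeping of the example mechanically. The following sketch (an illustration only, with the truth-telling city labelled 0) builds the joint distribution of the mathematician's location and the reply to `Do you live here?' and confirms that the question yields exactly one unit of information.

```python
import numpy as np
from itertools import product

def H(p):
    p = np.asarray(p, dtype=float).ravel()
    nz = p[p > 0]
    return -np.sum(nz * np.log2(nz))

# Location L of the mathematician and residence R of the passerby: 0 = truth-telling
# city, 1 = lying city; both unknown, so all four (L, R) pairs have probability 1/4.
def answer(L, R):
    truthful = (R == L)                 # truthful reply to "Do you live here?"
    return truthful if R == 0 else not truthful   # citizens of city 1 always lie

joint = np.zeros((2, 2))                # joint distribution of (location, reply)
for L, R in product((0, 1), repeat=2):
    joint[L, int(answer(L, R))] += 0.25

pL = joint.sum(axis=1)                  # marginal of the location
pA = joint.sum(axis=0)                  # marginal of the reply
info = H(pL) + H(pA) - H(joint)         # I(α, β) in bits
print(info)                             # 1.0: the single question removes all uncertainty
```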
This is all rather easy. The following problem is slightly more complicated but can be resolved by similar reasoning.
Problem  Assume in addition there is a third city, Sydney say, in which the inhabitants alternately tell the truth or tell a lie. Argue that the mathematician can find out where he is with two questions but needs four questions to find out in addition where the passerby lives.
   5. Dynamical systems
Let $(X,\mathcal{B})$ denote a measurable space, i.e. a set $X$ equipped with a $\sigma$-algebra $\mathcal{B}$ of subsets of $X$. Further let $\mu$ denote a probability measure on $(X,\mathcal{B})$. Then $(X,\mathcal{B},\mu)$ is called a probability space. A finite partition $\alpha=(A_1,\dots,A_k)$ of the space is a collection of a finite number of disjoint elements $A_i$ of $\mathcal{B}$ such that $\bigcup_i A_i=X$. Given two partitions $\alpha=(A_1,\dots,A_k)$ and $\beta=(B_1,\dots,B_l)$ the join $\alpha\vee\beta$ is defined as the partition composed of the subsets $A_i\cap B_j$ with $i=1,\dots,k$ and $j=1,\dots,l$.
If $\alpha=(A_1,\dots,A_k)$ is a partition of the space then $\mu(A_i)\geq 0$ and $\sum_i\mu(A_i)=1$ because $\mu$ is a probability measure. Thus the $\mu(A_i)$ correspond to the probabilities introduced earlier and $\alpha\vee\beta$ now corresponds to the joint process. Therefore we can now use the definitions of entropy, conditional entropy, etc. introduced previously but with the replacements $p_i\to\mu(A_i)$, $p_{ij}\to\mu(A_i\cap B_j)$, etc. Note that since $\bigcup_j B_j=X$ one automatically has $\sum_j\mu(A_i\cap B_j)=\mu(A_i)$.
Next let $T$ be a measure-preserving invertible transformation of the probability space $(X,\mathcal{B},\mu)$. In particular $TA\in\mathcal{B}$ and $\mu(TA)=\mu(A)$ for all $A\in\mathcal{B}$. Then $(X,\mathcal{B},\mu,T)$ is called a dynamical system. In applications one also encounters dynamical systems in which $T$ is replaced by a measure-preserving flow, i.e. a one-parameter family of measure-preserving transformations $t\in\mathbf{R}\mapsto T_t$ such that $T_sT_t=T_{s+t}$, $\mu\circ T_t=\mu$ and $T_0$ is the identity. The flow is usually interpreted as a description of the change with time $t$ of the observables $A$. The single automorphism $T$ can be thought of as the change with unit time and $T^n$ is the change after $n$ units of time.
The entropy of the partition $\alpha=(A_1,\dots,A_k)$ of the probability space is given by
    $H(\alpha) = -\sum_{i=1}^{k}\mu(A_i)\log\mu(A_i)$
and we now define the mean entropy of the partition of the dynamical system by
    $h(T,\alpha) = \lim_{n\to\infty} n^{-1}\,H(\alpha\vee T\alpha\vee\cdots\vee T^{\,n-1}\alpha)$        (18)
where $T^j\alpha=(T^jA_1,\dots,T^jA_k)$. Then the mean entropy of the automorphism $T$ is defined by
    $h(T) = \sup_{\alpha} h(T,\alpha)$        (19)
where the supremum is over all finite partitions.
It is of course necessary to establish that the limit in (18) exists. But this is a consequence of subadditivity. Set
    $a_n = H(\alpha\vee T\alpha\vee\cdots\vee T^{\,n-1}\alpha)\;$;
then
    $a_{n+m} \leq a_n + a_m$
for all $n,m\geq 1$, where we have used (15) and the $T$-invariance of $\mu$. It is, however, an easy consequence of subadditivity that the limit of $a_n/n$ exists as $n\to\infty$.
The entropy $h(T)$ was introduced by Kolmogorov and Sinai and is often referred to as the Kolmogorov–Sinai invariant. This terminology arises since $h(T)$ is an isomorphy invariant. Two dynamical systems $(X,\mathcal{B},\mu,T)$ and $(X',\mathcal{B}',\mu',T')$ are defined to be isomorphic if there is an invertible measure-preserving map $U$ of $X$ onto $X'$ which intertwines $T$ and $T'$, i.e. which has the property $UT=T'U$. If this is the case then $h(T)=h(T')$. Thus to show that two dynamical systems are not isomorphic it suffices to prove that $h(T)\neq h(T')$. Of course this is not necessarily easy since it requires calculating the entropies. This is, however, facilitated by a result of Kolmogorov and Sinai which establishes that the supremum in (19) is attained for a special class of partitions. The partition $\alpha$ is defined to be a generator if $\bigvee_{n=-\infty}^{\infty}T^n\alpha$ is the partition of $X$ into points. Then
    $h(T) = h(T,\alpha)$        (20)
for each finite generator $\alpha$.
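As a concrete illustration of (18)–(20), for a stationary Markov shift the entropy of the partition into cylinder sets of length $n$ can be computed explicitly and $n^{-1}H(\alpha\vee T\alpha\vee\cdots\vee T^{\,n-1}\alpha)$ converges to the familiar value $-\sum_{i,j}\pi_iP_{ij}\log P_{ij}$. The Python sketch below does this by brute force for an arbitrarily chosen two-state chain; it is a numerical illustration only.

```python
import numpy as np
from itertools import product

def H(p):
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return -np.sum(nz * np.log(nz))

# A stationary two-state Markov shift: P is the transition matrix, pi the invariant vector.
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])
pi = np.array([0.8, 0.2])               # satisfies pi P = pi for this P

def cylinder_entropy(n):
    """Entropy H(α ∨ Tα ∨ ... ∨ T^{n-1}α) of the partition into words of length n."""
    probs = []
    for word in product((0, 1), repeat=n):
        p = pi[word[0]]
        for a, b in zip(word, word[1:]):
            p *= P[a, b]
        probs.append(p)
    return H(np.array(probs))

h_exact = -np.sum(pi[:, None] * P * np.log(P))   # Kolmogorov–Sinai entropy of the shift
for n in (1, 2, 4, 8, 14):
    print(n, cylinder_entropy(n) / n, h_exact)   # the ratio converges to h_exact
```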
Another way of formulating the isomorphy property is in Hilbert space terms. Let $\mathcal{H}=L^2(X;d\mu)$. Then it follows from the $T$-invariance of $\mu$ that there is a unitary operator $U_T$ on $\mathcal{H}$ such that $(U_Tf)(x)=f(T^{-1}x)$ for all $f\in\mathcal{H}$. Now if the two dynamical systems $(X,\mathcal{B},\mu,T)$ and $(X',\mathcal{B}',\mu',T')$ are isomorphic then there is a unitary operator $V$ from $\mathcal{H}$ to $\mathcal{H}'$ which intertwines the two unitary representatives $U_T$ and $U_{T'}$ of the maps $T$ and $T'$. Since one has $VU_TV^{-1}=U_{T'}$ the spectra of $U_T$ and $U_{T'}$ are also isomorphy invariants. The Kolmogorov–Sinai entropy was the first invariant which was not of a spectral nature.
The mean entropy has another interesting property as a function of the measure $\mu$. To emphasize the dependence on the measure we now write $H_\mu(\alpha)$, $h_\mu(T,\alpha)$ and $h_\mu(T)$. The probability measures form a convex subset of the dual of the bounded continuous functions over $X$. Now if $\mu_1$ and $\mu_2$ are two probability measures and $\lambda\in[0,1]$ then
    $H_{\lambda\mu_1+(1-\lambda)\mu_2}(\alpha) \geq \lambda\,H_{\mu_1}(\alpha) + (1-\lambda)\,H_{\mu_2}(\alpha)$ .        (21)
The concavity inequality (21) is a direct consequence of the definition of $H_\mu(\alpha)$ and the concavity of the function $x\mapsto -x\log x$. Conversely, one has the inequalities
    $H_{\lambda\mu_1+(1-\lambda)\mu_2}(\alpha) \leq \lambda\,H_{\mu_1}(\alpha) + (1-\lambda)\,H_{\mu_2}(\alpha) + \lambda\log\lambda^{-1} + (1-\lambda)\log(1-\lambda)^{-1}$
and
    $\lambda\log\lambda^{-1} + (1-\lambda)\log(1-\lambda)^{-1} \leq \log 2$ ,
the first because $x\mapsto-\log x$ is decreasing and the second by the bound (2). Therefore one obtains the `convexity’ bound
    $H_{\lambda\mu_1+(1-\lambda)\mu_2}(\alpha) \leq \lambda\,H_{\mu_1}(\alpha) + (1-\lambda)\,H_{\mu_2}(\alpha) + \log 2$ .        (22)
Now replacing $\alpha$ by $\alpha\vee T\alpha\vee\cdots\vee T^{\,n-1}\alpha$ in (21), dividing by $n$ and taking the limit $n\to\infty$ gives
    $h_{\lambda\mu_1+(1-\lambda)\mu_2}(T,\alpha) \geq \lambda\,h_{\mu_1}(T,\alpha) + (1-\lambda)\,h_{\mu_2}(T,\alpha)$ .
Similarly from (22), since $n^{-1}\log 2\to0$ as $n\to\infty$, one deduces the converse inequality
    $h_{\lambda\mu_1+(1-\lambda)\mu_2}(T,\alpha) \leq \lambda\,h_{\mu_1}(T,\alpha) + (1-\lambda)\,h_{\mu_2}(T,\alpha)$ .
Hence one concludes that the map $\mu\mapsto h_\mu(T,\alpha)$ is affine, i.e.
    $h_{\lambda\mu_1+(1-\lambda)\mu_2}(T,\alpha) = \lambda\,h_{\mu_1}(T,\alpha) + (1-\lambda)\,h_{\mu_2}(T,\alpha)$        (23)
for each partition $\alpha$, each pair $\mu_1$ and $\mu_2$ of probability measures and each $\lambda\in[0,1]$.
Finally it follows from the identification (20) that the mean entropy is also affine, i.e.
    $h_{\lambda\mu_1+(1-\lambda)\mu_2}(T) = \lambda\,h_{\mu_1}(T) + (1-\lambda)\,h_{\mu_2}(T)$
for each pair $\mu_1$ and $\mu_2$ of probability measures and each $\lambda\in[0,1]$. This is somewhat surprising and is of great significance in the application of mean entropy in statistical mechanics.
  6. Mean entropy and statistical mechanics
The simplest model of statistical mechanics is the one-dimensional ferromagnetic Ising model. This describes atoms at the points of the one-dimensional lattice $\mathbf{Z}$ with two degrees of freedom which we label as $0$ and $1$ and which we think of as a spin orientation. Thus $X=\{0,1\}^{\mathbf{Z}}$ and a point $x\in X$ is a doubly-infinite array of 0s and 1s. The labels indicate whether a particle at a given point of the lattice has negative or positive spin orientation. Two neighbouring atoms with identical orientation are ascribed a negative unit of energy and neighbouring atoms with opposite orientation are ascribed a positive unit of energy. Therefore it is energetically favourable for the atoms to align and provide a spontaneous magnetism. Since the configurations $x$ of particles are doubly infinite the total energy ascribed to each $x$ is usually infinite but the mean energy, i.e. the energy per lattice site, is always finite.
The model generalizes to $d$ dimensions in an obvious way. Then $X=\{0,1\}^{\mathbf{Z}^d}$ and a point $x\in X$ corresponds to a $d$-dimensional array of 0s and 1s. If one now assigns a negative unit energy to each pair of nearest neighbours in the lattice $\mathbf{Z}^d$ with similar orientations and positive unit energy to the pairs with opposite orientation then the energy of a configuration of particles on a cubic subset of $\mathbf{Z}^d$ with side length $L$ grows as $L^d$, i.e. as the $d$-dimensional volume. Therefore the energy per lattice site is again finite.
The group $\mathbf{Z}^d$ of shifts acts in an obvious manner on $X$. Let $T_1,\dots,T_d$ denote the unit shifts in each of the $d$ directions. Further let $\mu$ denote a $\mathbf{Z}^d$-invariant probability measure over $X$. Then the energy $e(\mu)$ per lattice site is well defined, it has a value in $[-d,d]$ and $\mu\mapsto e(\mu)$ is an affine function. Now consider the entropy per lattice site.
Let $\alpha$ denote the partition of $X$ into two subsets, the subset $X_0$ of configurations with a 0 at the origin of $\mathbf{Z}^d$ and the subset $X_1$ of configurations with a 1 at the origin. Then $\bigvee_{x\in\mathbf{Z}^d}T_x\alpha$, where $T_x$ denotes the shift by $x$, is the partition of $X$ into points and the partition $\alpha$ is a generator of $X$. Now the previous definition of the mean entropy generalizes and
    $s(\mu) = \lim_{\Lambda\to\mathbf{Z}^d} |\Lambda|^{-1}\,H_\mu\bigl(\bigvee_{x\in\Lambda}T_x\alpha\bigr)$
exists by an extension of the earlier subadditivity argument to the $d$-dimensional setting.
The Boltzmann–Gibbs approach sketched earlier would designate the equilibrium state of the system at fixed mean energy as the measure which maximizes the functional
    $\mu \mapsto s(\mu) - \beta\,e(\mu)$ .        (24)
This resembles the earlier algorithm but there is one vital difference. Now the supremum is taken over the infinite family of invariant probability measures $\mu$ over $X$. There is no reason that the supremum is uniquely attained. In fact this is not usually the case.
There is a competition between two effects. Assuming $\beta>0$ the energy term $-\beta\,e(\mu)$ is larger if $e(\mu)$ is negative and this requires alignment of the spins, i.e. ordered configurations are preferred. But the entropy term $s(\mu)$ is largest if the system is disordered, i.e. if all possible configurations are equally probable. If $\beta$ is large the energy term tends to prevail but if $\beta$ is small then the entropy term prevails. In fact $\beta$ is interpretable as the inverse temperature and there is a tendency to ordering at low temperatures and to disorder at high temperatures. Since there are two possible directions of alignment of the spins this indicates that there are two distinct maximizing measures at low temperature and only one at high temperatures. The advantage of this description is that it reflects reality. The Ising model, with $d\geq 2$, indeed gives a simple description of a phase transition for which there is a spontaneous magnetization at low temperatures.
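For the one-dimensional model the maximization can be carried out explicitly: the maximum of $s(\mu)-\beta\,e(\mu)$ is given by the classical transfer-matrix formula, here $\log(2\cosh\beta)$, and the maximizing measure is a Markov measure. The following sketch (an illustration under these standard identifications, with an arbitrary value of $\beta$) compares the transfer-matrix value with a crude numerical maximization over two-state Markov measures.

```python
import numpy as np

rng = np.random.default_rng(2)
beta = 1.0                               # inverse temperature (arbitrary example)

# Nearest-neighbour energies of the one-dimensional Ising model in the 0/1 labelling:
# -1 for aligned neighbours, +1 for anti-aligned neighbours.
eps = np.array([[-1.0, 1.0],
                [1.0, -1.0]])

# Transfer-matrix value of the maximum of s(mu) - beta e(mu) (the pressure).
transfer = np.exp(-beta * eps)
pressure = np.log(np.max(np.linalg.eigvalsh(transfer)))    # = log(2 cosh beta) here

def functional(a, b):
    """s(mu) - beta e(mu) for the two-state Markov measure with transition matrix [[a,1-a],[1-b,b]]."""
    P = np.array([[a, 1 - a], [1 - b, b]])
    pi = np.array([1 - b, 1 - a]) / (2 - a - b)             # stationary distribution
    joint = pi[:, None] * P                                 # probabilities of neighbour pairs
    s = -np.sum(joint * np.log(P))                          # mean entropy of the Markov measure
    e = np.sum(joint * eps)                                 # mean energy per site
    return s - beta * e

best = max(functional(a, b) for a, b in rng.uniform(0.01, 0.99, size=(20000, 2)))
print(pressure, np.log(2 * np.cosh(beta)), best)            # best is below, but close to, the pressure
```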
Although we have described the model with a nearest-neighbour interaction which favours alignment of the model atoms, the same general features pertain if the interaction favours anti-parallel alignment, i.e. if the alignment of neighbours has positive energy and the anti-alignment negative energy. Then it is still energetically favourable to have an ordered state but the type of ordering is different. The model then describes a phenomenon called antiferromagnetism.
The description of the invariant equilibrium states as the invariant measures which maximize the mean entropy at fixed mean energy has many other positive aspects. Since $\mu\mapsto s(\mu)-\beta\,e(\mu)$ is an affine function it tends to attain its maximum at extremal points of the convex, weakly compact set $E$ of invariant measures. In fact if the maximum is unique then the maximizing measure is automatically extremal. If, however, the maximum is not uniquely attained then the maximizing measures form a face $F$ of the convex set $E$ and each $\mu\in F$ has a unique decomposition as a convex combination of extremal measures in $F$. This indicates that the extremal measures correspond to pure phases and in the case of a phase transition there is a unique prescription of the phase separation. This interpretation is corroborated by the observation that the extremal invariant states are characterized by the absence of long range correlations.
The foregoing description of the thermodynamic phases of macroscopic systems was successfully developed in the 1970s and 1980s and also extended to the description of quantum systems. But the latter extension requires the development of a non-commutative generalization of the entropy.
  7. Quantum mechanics and non-commutativity
The Ising model has a simple quantum-mechanical extension. Again one envisages atoms at the points of a cubic lattice $\mathbf{Z}^d$ but each atom now has more structure. The simplest assumption is that the observables corresponding to the atom at the point $x\in\mathbf{Z}^d$ are described by an algebra $M_x$ of $2\times2$-matrices. Then the observables corresponding to the atoms at the points of a finite subset $\Lambda\subset\mathbf{Z}^d$ are described by an algebra $M_\Lambda$ of $2^{|\Lambda|}\times2^{|\Lambda|}$-matrices where $|\Lambda|$ indicates the number of points in $\Lambda$. Thus $M_\Lambda=\bigotimes_{x\in\Lambda}M_x$ where the product is a tensor product of matrices. A quantum-mechanical state $\omega_\Lambda$ of the subsystem $\Lambda$ is then determined by a positive matrix $\rho_\Lambda\in M_\Lambda$ with $\mathrm{Tr}_\Lambda(\rho_\Lambda)=1$ where $\mathrm{Tr}_\Lambda$ denotes the trace over the matrices $M_\Lambda$. The value of an observable $A\in M_\Lambda$ in the state $\omega_\Lambda$ is then given by
    $\omega_\Lambda(A) = \mathrm{Tr}_\Lambda(\rho_\Lambda\,A)$ .        (25)
Now if $\Lambda_1\subseteq\Lambda_2$ one can identify $M_{\Lambda_1}$ as a subalgebra of $M_{\Lambda_2}$ and for consistency the matrices $\rho_\Lambda$ that determine the state must satisfy the condition
    $\mathrm{Tr}_{\Lambda_2\backslash\Lambda_1}(\rho_{\Lambda_2}) = \rho_{\Lambda_1}$
where the partial trace is taken over the matrices attached to the points of $\Lambda_2\backslash\Lambda_1$. The natural generalization of the classical entropy is now given by the family of entropies
    $S_\Lambda = -\mathrm{Tr}_\Lambda(\rho_\Lambda\log\rho_\Lambda)$
as $\Lambda$ varies over the bounded subsets of $\mathbf{Z}^d$. The previous mean entropy should then be defined by
    $s(\omega) = \lim_{\Lambda\to\mathbf{Z}^d}|\Lambda|^{-1}\,S_\Lambda$
if the limit exists. The existence of the limit is now a rather different problem than before. Nevertheless it can be established for translation-invariant states by an extension of the earlier subadditivity argument which we now briefly describe.
First if $\rho$ and $\sigma$ are two positive matrices both with unit trace then the entropy of $\rho$ relative to $\sigma$ is defined by
    $S(\rho\,|\,\sigma) = \mathrm{Tr}\bigl(\rho\,(\log\rho-\log\sigma)\bigr)$
in direct analogy with the earlier definition (6). The key point is that one still has the property $S(\rho\,|\,\sigma)\geq 0$. This is established as follows. Let $\rho_i$ and $\sigma_i$ denote the eigenvalues of $\rho$ and $\sigma$. Further let $\varphi_i$ denote an orthonormal family of eigenfunctions of $\rho$ corresponding to the eigenvalues $\rho_i$. Then
    $S(\rho\,|\,\sigma) = \sum_i \rho_i\bigl(\log\rho_i - (\varphi_i,\log\sigma\,\varphi_i)\bigr) \geq \sum_i \rho_i\bigl(\log\rho_i - \log(\varphi_i,\sigma\,\varphi_i)\bigr) \geq \sum_i \bigl(\rho_i - (\varphi_i,\sigma\,\varphi_i)\bigr) = 0$        (26)
where we have used the concavity of the logarithm and the inequality $\log x\leq x-1$.
Now suppose that $\Lambda_1$ and $\Lambda_2$ are two disjoint subsets of $\mathbf{Z}^d$. Set $\rho_{12}=\rho_{\Lambda_1\cup\Lambda_2}$, $\rho_1=\rho_{\Lambda_1}$ and $\rho_2=\rho_{\Lambda_2}$. Then it follows from the foregoing that $S(\rho_{12}\,|\,\rho_1\otimes\rho_2)\geq 0$. But using (26) and the identity
    $\mathrm{Tr}_{\Lambda_1\cup\Lambda_2}\bigl(\rho_{12}\log(\rho_1\otimes\rho_2)\bigr) = \mathrm{Tr}_{\Lambda_1}(\rho_1\log\rho_1) + \mathrm{Tr}_{\Lambda_2}(\rho_2\log\rho_2)$
one immediately deduces that
    $S_{\Lambda_1\cup\Lambda_2} \leq S_{\Lambda_1} + S_{\Lambda_2}$ .
This corresponds to the earlier subadditivity and suffices to prove the existence of the mean entropy.
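As a final numerical aside, the statements above are easy to test with matrices. The sketch below (illustrative only; the dimensions and the random construction of the density matrices are arbitrary) computes von Neumann entropies and partial traces with numpy and checks the positivity (26), the identity used above and the resulting subadditivity.

```python
import numpy as np

rng = np.random.default_rng(3)

def random_density_matrix(dim):
    """A random positive matrix with unit trace (full rank almost surely)."""
    A = rng.normal(size=(dim, dim)) + 1j * rng.normal(size=(dim, dim))
    rho = A @ A.conj().T
    return rho / np.trace(rho).real

def S(rho):
    """Von Neumann entropy -Tr(rho log rho)."""
    ev = np.linalg.eigvalsh(rho)
    ev = ev[ev > 1e-12]
    return -np.sum(ev * np.log(ev))

def relative_entropy(rho, sigma):
    """S(rho|sigma) = Tr(rho (log rho - log sigma)), via eigen-decompositions."""
    er, Vr = np.linalg.eigh(rho)
    es, Vs = np.linalg.eigh(sigma)
    log_rho = Vr @ np.diag(np.log(er)) @ Vr.conj().T
    log_sigma = Vs @ np.diag(np.log(es)) @ Vs.conj().T
    return np.trace(rho @ (log_rho - log_sigma)).real

def partial_trace(rho12, d1, d2, keep):
    """Partial trace of a density matrix on C^{d1} (x) C^{d2}, keeping subsystem 1 or 2."""
    r = rho12.reshape(d1, d2, d1, d2)
    return np.einsum('ijkj->ik', r) if keep == 1 else np.einsum('ijil->jl', r)

d1, d2 = 2, 3
rho12 = random_density_matrix(d1 * d2)
rho1 = partial_trace(rho12, d1, d2, keep=1)
rho2 = partial_trace(rho12, d1, d2, keep=2)

print(relative_entropy(rho12, np.kron(rho1, rho2)) >= 0)   # positivity, as in (26)
print(S(rho12) <= S(rho1) + S(rho2))                       # subadditivity S_12 <= S_1 + S_2
print(np.isclose(relative_entropy(rho12, np.kron(rho1, rho2)),
                 S(rho1) + S(rho2) - S(rho12)))            # the identity behind the proof
```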
These simple observations on matrix algebras are the starting point of the development of a non-commutative entropy theory.
  Bibliography
There is an enormous literature on entropy but the 1948 paper 
A mathematical theory of communication by Claude Shannon in the Bell System Technical Journal remains one of the most readable accounts of its properties and significance [
1]. This paper can be downloaded from
Note that in later versions it became The mathematical theory of communication.
Another highly readable account of the entropy and its applications to communication theory is given in the book 
Probability and Information by A. M. Yaglom and I. M. Yaglom. This book was first published in Russian in 1956 as 
Verojatnost’ i informacija and republished in 1959 and 1973. It has also been translated into French, German, English and several other languages. The French version was published by Dunod and an expanded English version was published in 1983 by the Hindustan Publishing Corporation [
2]. I was unable to find any downloadable version.
The example of the absentminded mathematician was adapted from this book which contains many more recreational examples and a variety of applications to language, music, genetics etc.
The discussion of the asymptotics of the multinomial coefficients is apocryphal. I have followed the discussion in the Notes and Remarks to Chapter VI of 
Operator Algebras and Quantum Statistical Mechanics 2 by Bratteli and Robinson, Springer-Verlag, 1981 [
3]. This book is now available as two searchable pdf files:
There are now many books that describe the ergodic theory of dynamical systems including the theory of entropy. The earliest source in English which covered the Kolmogorov–Sinai theory was I believe the 1962–63 Aarhus lecture notes of Jacobs 
Ergodic theory I and II. These are now difficult to find but are worth reading if you can locate a copy. Another early source is the book by Arnold and Avez 
Problèmes ergodiques de la mécanique classique, Gauthier-Villars, Paris, 1967 [
4].
More recent books which I have found useful are 
An Introduction to Ergodic Theory by Ya. G. Sinai, Princeton University Press, Mathematical Notes, 1976 [
5]; 
An Introduction to Ergodic Theory by P. Walters, Springer-Verlag, Graduate Texts in Mathematics, 1981 [
6]; 
Topics in Ergodic Theory by W. Parry, Cambridge University Press, 1981 [
7]. But there are many more.
Chapter VI of 
Operator Algebras and Quantum Statistical Mechanics 2 by Bratteli and Robinson [
3] contains a description of the applications of entropy to spin systems but the theory has moved on since then. Another source which covers more recent developments in the quantum-mechanical applications is 
Quantum entropy and its use by M. Ohya and D. Petz, Springer-Verlag, 1993 [
8]. Finally a recent comprehensive treatment of the extension of the theory to the structural study of operator algebras is given in 
Dynamical Entropy in Operator Algebras by S. Neshveyev and E. Størmer, Springer-Verlag, 2006 [
9].