Basic concepts, identities and inequalities – the Toolkit of Information Theory

Basic concepts and results of the part of Information Theory which is often referred to as "Shannon theory" are discussed, with the focus mainly on the discrete case. The paper is expository, with some new proofs and extensions of results and concepts.


Codes
Though we are not interested in technical coding, the starting point of Information Theory may well be taken there. Consider Table 1. It shows a codebook pertaining to the first six letters of the alphabet. The code this defines maps the letters to binary code words. The efficiency is determined by the code word lengths, respectively 3, 4, 3, 3, 1 and 4. If the frequencies of the individual letters are known, say respectively 20, 8, 15, 10, 40 and 7 percent, efficiency can be related to the average code length, in the example equal to 2.35 measured in bits (binary digits). The shorter the average code length, the higher the efficiency. Thus average code length may be taken as the key quantity to worry about. It depends on the distribution (P) of the letters and on the code (κ). Actually, it does not depend on the internal structure of the code words, only on the associated code word lengths. Therefore, we take κ to stand for the map providing these lengths (κ(a) = 3, ···, κ(f) = 4). Then average code length may be written, using bracket notation, as ⟨κ, P⟩.

a → 100
b → 1110
c → 101
d → 110
e → 0
f → 1111

Table 1: A codebook

Clearly, not every map which takes letters to natural numbers is acceptable as one coming from a "sensible" code. We require that the code be prefix-free, i.e. that no code word in the codebook can be the beginning of another code word in the codebook. The good sense in this requirement may be realized if we imagine that the binary digits in a code word corresponding to an initially unknown letter are revealed to us one by one, e.g. by a "guru", as replies to a succession of questions: "is the first digit a 1?", "is the second digit a 1?" etc. The prefix-free property guarantees the "instantaneous" nature of the procedure. By this we mean that once we receive information which is consistent with one of the code words in the codebook, we are certain which letter is the one we are looking for.
The code shown in Table 1 is compact in the sense that we cannot make it more efficient, i.e. decrease one or more of the code word lengths, by simple operations on the code, say by deleting one or more binary digits. With the above background we can now prove the key result needed to get the theory going.
Theorem 1.1 (Kraft's inequality). Let A be a finite or countably infinite set, the alphabet, and κ any map of A into N₀ = {0, 1, 2, . . .}. Then the necessary and sufficient condition that there exist a prefix-free code of A with code word lengths as prescribed by κ is that Kraft's inequality

∑_{i∈A} 2^{−κ(i)} ≤ 1 (1.1)

holds. Furthermore, Kraft's equality holds if and only if there exists no prefix-free code of A with code word lengths given by a function ρ such that ρ(i) ≤ κ(i) for all i ∈ A and ρ(i₀) < κ(i₀) for some i₀ ∈ A.
Proof. With every binary word we associate a binary interval contained in the unit interval [0; 1] in the "standard" way. Thus, to the empty code word, which has length 0, we associate [0; 1], and if J ⊆ [0; 1] is the binary interval associated with ε_1 ··· ε_k, then we associate the left half of J with ε_1 ··· ε_k 0 and the right half with ε_1 ··· ε_k 1. To any collection of possible code words associated with the elements ("letters") in A we can then associate a family of binary sub-intervals of [0; 1], indexed by the letters in A, and vice versa. We realize that in this way the prefix-free property corresponds to the property that the associated family of binary intervals consists of pairwise disjoint sets. A moment's reflection shows that all parts of the theorem follow from this observation.
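As a quick numerical companion to Theorem 1.1, the Kraft sum and the average code length for the code of Table 1 can be computed directly. The following Python snippet is an illustration only, not part of the mathematical development.

```python
# Code word lengths from Table 1 and the letter frequencies used above.
lengths = {"a": 3, "b": 4, "c": 3, "d": 3, "e": 1, "f": 4}
freqs = {"a": 0.20, "b": 0.08, "c": 0.15, "d": 0.10, "e": 0.40, "f": 0.07}

def kraft_sum(lengths, base=2):
    """Left-hand side of Kraft's inequality (1.1)."""
    return sum(base ** -l for l in lengths.values())

# The sum equals 1, so the prefix-free code of Table 1 is compact:
# no code word length can be decreased (Theorem 1.1).
print(kraft_sum(lengths))

# Average code length <kappa, P> in bits; for these frequencies it is 2.35.
avg_length = sum(freqs[x] * lengths[x] for x in lengths)
print(avg_length)
```

For the Table 1 code the Kraft sum is exactly 1, reflecting compactness; shortening any single code word would push the sum above 1 and destroy the prefix-free property.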
In the sequel, A denotes a finite or countably infinite set, the alphabet.
We are not interested in combinatorial or other details pertaining to actual coding with binary code words. For the remainder of this paper we idealize by allowing arbitrary non-negative numbers as code lengths. Then we may as well consider e as the base for the exponentials occurring in Kraft's inequality. With this background we define a general code of A as a map κ : A → [0; ∞] satisfying Kraft's inequality ∑_{i∈A} e^{−κ_i} ≤ 1, and a compact code as a map κ : A → [0; ∞] satisfying the corresponding equality ∑_{i∈A} e^{−κ_i} = 1. The set of general codes is denoted ~K(A), the set of compact codes K(A). The notation κ_i is now preferred to κ(i) and is referred to as the code length associated with i. The compact codes are the most important ones and, for short, they are referred to simply as codes.

Entropy, redundancy and divergence
The set of probability distributions on A, just called distributions or sometimes sources, is denoted M¹₊(A), and the set of non-negative measures on A with total mass at most 1, called general distributions, is denoted ~M¹₊(A); the corresponding point probabilities are denoted by p_i, q_i, ···. There is a natural bijective correspondence between ~M¹₊(A) and ~K(A), expressed notationally by writing P ↔ κ or κ ↔ P, and defined by the formulas

κ_i = log (1/p_i) ,  p_i = e^{−κ_i} ;  i ∈ A.

Here, log is used for natural logarithms. Note that the values κ_i = ∞ and p_i = 0 correspond to each other. When the above formulas hold, we call (κ, P) a matching pair and we say that κ is adapted to P or that P is the general distribution which matches κ.
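The correspondence between codes and general distributions is easy to realize concretely. The following Python sketch (an illustration with an arbitrarily chosen distribution, not part of the theory) builds the adapted code and recovers the matching distribution, including the boundary case κ_i = ∞ ↔ p_i = 0.

```python
import math

def adapted_code(P):
    """Code adapted to P: kappa_i = log(1/p_i), natural logarithm."""
    return {i: math.inf if p == 0 else -math.log(p) for i, p in P.items()}

def matching_distribution(kappa):
    """General distribution matching kappa: p_i = exp(-kappa_i)."""
    return {i: math.exp(-k) for i, k in kappa.items()}

P = {"a": 0.5, "b": 0.25, "c": 0.25, "d": 0.0}
kappa = adapted_code(P)           # kappa_d = infinity, matching p_d = 0
Q = matching_distribution(kappa)  # recovers P
```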
As in Section 1, ⟨κ, P⟩ denotes average code length. We may now define entropy as minimal average code length:

H(P) = min_{κ∈~K(A)} ⟨κ, P⟩ , (2.5)

and redundancy D(P‖κ) as actual average code length minus minimal average code length, i.e.

D(P‖κ) = ⟨κ, P⟩ − H(P) . (2.6)

Some comments are in order. In fact (2.6) may lead to the indeterminate form ∞ − ∞. Nevertheless, D(P‖κ) may be defined as a definite number in [0; ∞] in all cases. Technically, it is convenient first to define the divergence D(P‖Q) between a probability distribution P and a, possibly incomplete, distribution Q by

D(P‖Q) = ∑_{i∈A} p_i log (p_i/q_i) . (2.7)

Entropy defined by (2.5) also makes sense and the minimum is attained for the code adapted to P, i.e.

H(P) = − ∑_{i∈A} p_i log p_i . (2.8)

If H(P) < ∞, the minimum is only attained for the code adapted to P.
Finally, for every P ∈ M¹₊(A) and κ ∈ ~K(A), the following identity holds with Q the distribution matching κ:

⟨κ, P⟩ = H(P) + D(P‖Q) . (2.9)

Proof. By the inequality

x log (x/y) ≥ x − y ;  x ≥ 0, y ≥ 0 , (2.10)

we realize that the sum of the negative terms in (2.7) is bounded below by −1, hence D(P‖Q) is well defined. The same inequality then shows that D(P‖Q) ≥ 0. The discussion of equality is easy as there is strict inequality in (2.10) in case x ≠ y. The validity of (2.9) with −∑ p_i log p_i in place of H(P) then becomes a triviality, and (2.8) follows.
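The linking identity (2.9) lends itself to direct numerical verification. The following Python snippet (an illustration with arbitrary small distributions) computes entropy, divergence and average code length and checks that ⟨κ, P⟩ = H(P) + D(P‖Q) when κ is adapted to Q.

```python
import math

def entropy(P):
    """H(P) = -sum p_i log p_i (natural log, with 0 log 0 = 0)."""
    return -sum(p * math.log(p) for p in P.values() if p > 0)

def divergence(P, Q):
    """D(P||Q) = sum p_i log(p_i/q_i); infinite if Q vanishes where P does not."""
    total = 0.0
    for i, p in P.items():
        if p > 0:
            if Q.get(i, 0.0) == 0:
                return math.inf
            total += p * math.log(p / Q[i])
    return total

P = {"a": 0.5, "b": 0.3, "c": 0.2}
Q = {"a": 0.25, "b": 0.25, "c": 0.5}
kappa = {i: -math.log(q) for i, q in Q.items()}  # code adapted to Q

avg = sum(P[i] * kappa[i] for i in P)  # <kappa, P>
# Linking identity (2.9): <kappa, P> = H(P) + D(P||Q)
```

The same run also exhibits the two basic inequalities: D(P‖P) = 0 and ⟨κ, P⟩ ≥ H(P), i.e. redundancy is non-negative.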
The simple identity (2.9) is important. It connects three basic quantities: entropy, divergence and average code length. We call it the linking identity. Among other things, it shows that in case H(P) < ∞, the definition (2.6) yields the result D(P‖κ) = D(P‖Q) with Q ↔ κ. We therefore now define redundancy D(P‖κ), where P ∈ M¹₊(A) and κ ∈ ~K(A), by D(P‖κ) = D(P‖Q) with Q ↔ κ. Divergence we think of, primarily, as just a measure of discrimination between P and Q. Often it is more appropriate to think in terms of redundancy as indicated in (2.6). Therefore, we often write the linking identity in the form

⟨κ, P⟩ = H(P) + D(P‖κ) . (2.12)


Some topological considerations

On the set of general distributions ~M¹₊(A), the natural topology to consider is that of pointwise convergence. When restricted to the space M¹₊(A) of probability distributions, this topology coincides with the topology of convergence in total variation (ℓ₁-convergence). To be more specific, denote by V(P, Q) the total variation

V(P, Q) = ∑_{i∈A} |p_i − q_i| . (3.13)

Then we have

Lemma 3.1. Let (P_n)_{n≥1} and P be probability distributions over A and assume that (P_n)_{n≥1} converges pointwise to P, i.e. p_{n,i} → p_i as n → ∞ for every i ∈ A. Then P_n converges to P in total variation, i.e. V(P_n, P) → 0 as n → ∞.
Proof. The result is known as Scheffé's lemma (in the discrete case). To prove it, consider P_n − P as a function on A. The negative part (P_n − P)⁻ converges pointwise to 0 and, for all n, 0 ≤ (P_n − P)⁻ ≤ P, hence, e.g. by Lebesgue's dominated convergence theorem, ∑_{i∈A} (P_n − P)⁻(i) → 0.
As, for the positive part, ∑_{i∈A} (P_n − P)⁺(i) = ∑_{i∈A} (P_n − P)⁻(i), we find that ∑_{i∈A} (P_n − P)⁺(i) also converges to 0. As V(P_n, P) = ∑_{i∈A} |P_n − P|(i) and, generally, |x| = x⁺ + x⁻, we now conclude that V(P_n, P) → 0.
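Scheffé's lemma can be watched at work numerically. In the following hypothetical example a bit of mass escapes to ever-new points, yet pointwise convergence still forces the total variation to zero.

```python
def total_variation(P, Q):
    """V(P, Q) = sum over i of |p_i - q_i|, as in (3.13)."""
    keys = set(P) | set(Q)
    return sum(abs(P.get(i, 0.0) - Q.get(i, 0.0)) for i in keys)

P = {1: 0.5, 2: 0.5}

def P_n(n):
    # Probability distributions converging pointwise to P: the point n + 2
    # carries mass 1/n, so every fixed point's probability converges.
    return {1: 0.5 - 0.5 / n, 2: 0.5 - 0.5 / n, n + 2: 1.0 / n}

variations = [total_variation(P_n(n), P) for n in (10, 100, 1000)]
# V(P_n, P) = 2/n here, so the values decrease to 0 as the lemma predicts.
```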
We denote convergence in M¹₊(A) by P_n →_V P. As the lemma shows, it is immaterial whether we here have the topology of pointwise convergence or the topology of convergence in total variation in mind.
Another topological notion of convergence in M¹₊(A) is expected to come to play a significant role but has only recently been discovered. This is the notion defined as follows: For (P_n)_{n≥1} ⊆ M¹₊(A) and P ∈ M¹₊(A), we say that (P_n)_{n≥1} converges in divergence to P, and write P_n →_D P, if D(P_n‖P) → 0 as n → ∞. The new and somewhat unexpected observation is that this is indeed a topological notion. In fact, there exists a strongest topology on M¹₊(A), the information topology, such that P_n →_D P implies that (P_n)_{n≥1} converges in this topology to P, and for this topology, convergence in divergence and convergence in the topology are equivalent concepts. We stress that this only holds for ordinary sequences and does not extend to generalized sequences or nets. A subset P ⊆ M¹₊(A) is open in the information topology if and only if, for any sequence (P_n)_{n≥1} with P_n →_D P and P ∈ P, one has P_n ∈ P eventually. Equivalently, P is closed if and only if (P_n)_{n≥1} ⊆ P and P_n →_D P imply P ∈ P. The quoted facts can either be proved directly or they follow from more general results, cf. [7] or [1]. We shall not enter into this here but refer the reader to [6].
Convergence in divergence is, typically, a much stronger notion than convergence in total variation. This follows from Pinsker's inequality

D(P‖Q) ≥ ½ V(P, Q)² ,

which we shall prove in Section 4. In case A is finite, it is easy to see that the convergence P_n →_D P amounts to the usual convergence P_n →_V P together with the equality supp(P_n) = supp(P) for n sufficiently large. Here, "supp" denotes support, i.e. the set of elements in A with positive probability.
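Pinsker's inequality can be probed numerically before it is proved. The sketch below samples random pairs of fully supported distributions and records the worst-case slack D(P‖Q) − ½V(P, Q)², which the inequality asserts is non-negative.

```python
import math
import random

def divergence(P, Q):
    return sum(p * math.log(p / Q[i]) for i, p in P.items() if p > 0)

def total_variation(P, Q):
    return sum(abs(P[i] - Q[i]) for i in P)

def random_distribution(k, rng):
    w = [rng.random() for _ in range(k)]
    s = sum(w)
    return {i: x / s for i, x in enumerate(w)}

rng = random.Random(0)
# Check D(P||Q) >= (1/2) V(P,Q)^2 on many random pairs of distributions.
worst_slack = min(
    divergence(P, Q) - 0.5 * total_variation(P, Q) ** 2
    for P, Q in ((random_distribution(6, rng), random_distribution(6, rng))
                 for _ in range(500))
)
# worst_slack is non-negative, as Pinsker's inequality demands.
```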
We turn to some more standard considerations regarding lower semi-continuity. It is an important fact that entropy and divergence are lower semi-continuous, even with respect to the usual topology. More precisely:

Theorem 3.2. With respect to the usual topology, the following continuity results hold: (i) P ↦ H(P) is lower semi-continuous on M¹₊(A); (ii) (P, Q) ↦ D(P‖Q) is jointly lower semi-continuous on M¹₊(A) × ~M¹₊(A).

Proof. We need a general abstract result: Let X be a topological space, let (φ_n)_{n≥1} be a sequence of lower semi-continuous functions φ_n : X → ]−∞; ∞], and assume that φ = ∑₁^∞ φ_n is a well defined function φ : X → ]−∞; ∞]. Assume also that there exist continuous minorants ψ_n ≤ φ_n with ∑₁^∞ ψ_n = 0 pointwise. Then φ is lower semi-continuous.

To prove this auxiliary result, let (x_ν) be an ordinary or generalized sequence and x an element of X such that x_ν → x. We have to prove that lim inf φ(x_ν) ≥ φ(x). Fix N ∈ N, note that φ ≥ ∑₁^N φ_n + ∑_{N+1}^∞ ψ_n = ∑₁^N (φ_n − ψ_n), and use the fact that a finite sum of lower semi-continuous functions is lower semi-continuous to conclude that

lim inf φ(x_ν) ≥ ∑₁^N (φ_n − ψ_n)(x) .

As this holds for all N ∈ N, and as the terms φ_n − ψ_n are non-negative, we conclude that lim inf φ(x_ν) ≥ φ(x).

In particular, a sum of non-negative real valued lower semi-continuous functions is lower semi-continuous. The statement (i) follows from this fact as x ↦ −x log x is non-negative and continuous on [0; 1].
We now turn to the proof of (ii). First we remark that we may restrict attention to D defined on the space M¹₊(A) × M¹₊(A). To see this, take any (ordinary or generalized) sequence which converges in the larger space to (P, Q). By taking a proper subsequence if necessary, we may assume that the sequence (D(P_ν‖Q_ν))_ν is convergent and also that (Q_ν(A))_ν converges. Then we may add a point, the "point at infinity", to the space A and extend all measures considered to the new space in a natural way such that all measures become probability distributions. Denoting the extended measures with a star, we then find that (P*_ν, Q*_ν)_ν → (P*, Q*) and we have lim inf_ν D(P*_ν‖Q*_ν) ≥ D(P*‖Q*), provided lower semi-continuity has been established for true probability distributions. Then lim_ν D(P_ν‖Q_ν) ≥ D(P‖Q) follows and we conclude that the desired semi-continuity property also holds if the Q's are allowed to be improper distributions.
To prove (ii) we may thus restrict attention to the space M¹₊(A) × M¹₊(A). Further, we may assume that A = N. For each n, denote by φ_n the map (P, Q) ↦ p_n log (p_n/q_n). Then φ_n is lower semi-continuous. Denote by ψ_n the map (P, Q) ↦ p_n − q_n. Then ψ_n is a continuous minorant to φ_n and ∑₁^∞ ψ_n = 0. As D = ∑₁^∞ φ_n, the auxiliary result applies and the desired conclusion follows.
In Section 5 we shall investigate further the topological properties of H and D. For now we point out one simple continuity property of divergence which has as its point of departure not so much convergence in the space M¹₊(A) as convergence in A itself. The result we have in mind is only of interest if A is infinite, as it concerns approximations of A by finite subsets. Denote by P₀(A) the set of finite subsets of A, ordered by inclusion. Then (P₀(A), ⊆) is an upward directed set and we can consider convergence along this set, typically denoted by lim_{A∈P₀(A)}.
Theorem 3.3. For P ∈ M¹₊(A) and Q ∈ ~M¹₊(A),

lim_{A∈P₀(A)} D(P|A ‖ Q) = D(P‖Q) ,

P|A denoting, as usual, the conditional distribution of P given A.

Proof. First note that the result makes good sense as P(A) > 0 if A is large enough. The result follows by writing D(P|A ‖ Q) in the form

D(P|A ‖ Q) = (1/P(A)) ∑_{i∈A} p_i log (p_i/q_i) − log P(A) ,

since lim_{A∈P₀(A)} P(A) = 1 and since lim_{A∈P₀(A)} ∑_{i∈A} p_i log (p_i/q_i) = D(P‖Q).

Datareduction
Let, again, A be the alphabet and consider a decomposition θ of A. We shall think of θ as defining a datareduction. We often denote the classes of θ by A_i, with i ranging over a certain index set which, in pure mathematical terms, is nothing but the quotient space of A w.r.t. θ. We denote this quotient space by ∂A, or, if need be, by ∂_θ A, and call ∂A the derived alphabet (the alphabet derived from the datareduction θ). Thus ∂A is nothing but the set of classes for the decomposition θ. Now assume that we are also given a source P ∈ M¹₊(A). By ∂P (or ∂_θ P) we denote the derived source, defined as the distribution ∂P ∈ M¹₊(∂A) of the quotient map A → ∂A or, if you prefer, as the image measure of P under the quotient map. Thus, more directly, ∂P is the probability distribution over ∂A given by (∂P)(A) = P(A); A ∈ ∂A.
If we choose to index the classes in ∂A we may write ∂P = (P(A_i))_i.

Remark. Let A₀ be a basic alphabet, e.g. A₀ = {0, 1}, and consider natural numbers s and t with s < t. If we take A to be the set of words x₁ ··· x_t of length t from the alphabet A₀, i.e. A = A₀^t, and θ to be the decomposition induced by the projection of A onto A₀^s, then the quotient space ∂A₀^t can be identified with the set A₀^s. The class corresponding to x₁ ··· x_s consists of all words in A₀^t which begin with x₁ ··· x_s. In this example, we may conveniently think of x₁ ··· x_s as representing the past (or the known history) and x_{s+1} ··· x_t the future.

Often, we think of a datareduction as modelling either conditioning or given information. Imagine, for example, that we want to observe a random element x ∈ A which is governed by a distribution P, and that direct observation is impossible (for practical reasons or because the planned observation involves what will happen at some time in the future, cf. Example ??). Instead, partial information about x is revealed to us via θ, i.e. we are told which class A_i ∈ ∂A the element x belongs to. Thus "x ∈ A_i" is a piece of information (or a condition) which partially determines x.
Considerations as above lie behind two important definitions: By the conditional entropy of P given θ we understand the quantity

H_θ(P) = ∑_i P(A_i) H(P|A_i) . (4.14)

As usual, P|A_i denotes the conditional distribution of P given A_i (when well defined). Note that when P|A_i is undefined, the corresponding term in (4.14) is, nevertheless, well defined (and = 0).
Note that the conditional entropy is really the average uncertainty (entropy) that remains after the information about θ has been revealed.
Similarly, the conditional divergence between P and Q given θ is defined by the equation

D_θ(P‖Q) = ∑_i P(A_i) D(P|A_i ‖ Q|A_i) . (4.15)

There is one technical comment we have to add to this definition: It is possible that for some i, P(A_i) > 0 whereas Q(A_i) = 0. In such cases P|A_i is well defined whereas Q|A_i is not. We agree that in such cases D_θ(P‖Q) = ∞. This corresponds to an extension of the basic definition of divergence by agreeing that the divergence between a (well defined) distribution and an undefined distribution is infinite.
In analogy with the interpretation regarding entropy, note that, really, D θ (P Q) is the average divergence after information about θ has been revealed.
We also note that D_θ(P‖Q) does not depend on the full distribution Q but only on the family (Q|A_i) of conditional distributions (with i ranging over indices with P(A_i) > 0). Thinking about it, this is also quite natural: If Q is conceived as a predictor then, if we know that information about θ will be revealed to us, the only thing we need to predict is the conditional distributions given the various A_i's.
Whenever convenient we will write H(P|θ) in place of H_θ(P), whereas a similar notation for divergence appears awkward and will not be used.
From the defining relations (4.14) and (4.15) it is easy to identify circumstances under which H_θ(P) or D_θ(P‖Q) vanish. For this we need two new notions: We say that P is deterministic modulo θ, and write P = 1 (mod θ), provided the conditional distribution P|A_i is deterministic for every i with P(A_i) > 0. And we say that Q equals P modulo θ, and write Q = P (mod θ), provided Q|A_i = P|A_i for every i with P(A_i) > 0. This condition is to be understood in the sense that if P(A_i) > 0, the conditional distribution Q|A_i must be well defined (i.e. Q(A_i) > 0) and coincide with P|A_i. It should be noted that the relation "equality mod θ" is not symmetric: The two statements Q = P (mod θ) and P = Q (mod θ) are only equivalent if, for every i, P(A_i) = 0 if and only if Q(A_i) = 0. We leave the simple proof of the following result to the reader:

Theorem 4.1. (i) H_θ(P) ≥ 0, and a necessary and sufficient condition that H_θ(P) = 0 is that P be deterministic modulo θ.
(ii) D_θ(P‖Q) ≥ 0, and a necessary and sufficient condition that D_θ(P‖Q) = 0 is that Q be equal to P modulo θ.
Intuitively, it is to be expected that entropy and divergence decrease under datareduction: H(P) ≥ H(∂P) and D(P‖Q) ≥ D(∂P‖∂Q). Indeed, this is so, and we can even identify the amount of the decrease in information-theoretical terms:

Theorem 4.2 (datareduction identities). Let P and Q be distributions over A and let θ denote a datareduction. Then the following two identities hold:

H(P) = H(∂P) + H_θ(P) , (4.16)
D(P‖Q) = D(∂P‖∂Q) + D_θ(P‖Q) . (4.17)

The identity (4.16) is called Shannon's identity (most often given in a notation involving random variables, cf. Section 7).
Proof. Below, sums are over i with P(A_i) > 0. For the right hand side of (4.16) we find the expression

∑_i P(A_i) log (1/P(A_i)) + ∑_i P(A_i) ∑_{a∈A_i} (p_a/P(A_i)) log (P(A_i)/p_a) ,

which can be rewritten as ∑_i ∑_{a∈A_i} p_a log (1/p_a), easily recognizable as the entropy H(P).
For the right hand side of (4.17) we find the expression

∑_i P(A_i) log (P(A_i)/Q(A_i)) + ∑_i P(A_i) ∑_{a∈A_i} (p_a/P(A_i)) log ((p_a/P(A_i)) / (q_a/Q(A_i))) ,

which can be rewritten as ∑_i ∑_{a∈A_i} p_a log (p_a/q_a), easily recognizable as the divergence D(P‖Q).
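Shannon's identity (4.16) and the divergence identity (4.17) are easy to confirm on a small example. The following Python sketch uses an arbitrarily chosen decomposition of a four-letter alphabet and arbitrary distributions.

```python
import math

def entropy(P):
    return -sum(p * math.log(p) for p in P.values() if p > 0)

def divergence(P, Q):
    return sum(p * math.log(p / Q[i]) for i, p in P.items() if p > 0)

# Decomposition theta of A = {a, b, c, d} into two classes.
classes = [("a", "b"), ("c", "d")]
P = {"a": 0.4, "b": 0.2, "c": 0.3, "d": 0.1}
Q = {"a": 0.1, "b": 0.3, "c": 0.2, "d": 0.4}

def derived(M):
    """The derived distribution: (dM)(A_i) = M(A_i)."""
    return {C: sum(M[x] for x in C) for C in classes}

def conditional(M, C):
    """Conditional distribution M|A_i."""
    m = sum(M[x] for x in C)
    return {x: M[x] / m for x in C}

cond_H = sum(derived(P)[C] * entropy(conditional(P, C)) for C in classes)
cond_D = sum(derived(P)[C] * divergence(conditional(P, C), conditional(Q, C))
             for C in classes)

lhs_H, rhs_H = entropy(P), entropy(derived(P)) + cond_H                       # (4.16)
lhs_D, rhs_D = divergence(P, Q), divergence(derived(P), derived(Q)) + cond_D  # (4.17)
```

The run also exhibits the datareduction inequalities: since H_θ(P) ≥ 0 and D_θ(P‖Q) ≥ 0, entropy and divergence can only decrease when passing to the derived distributions.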
Of course, these basic identities can, more systematically, be written as H(P) = H(∂_θ P) + H_θ(P), and similarly for (4.17).

We end this section with an important inequality mentioned in Section 3:

Corollary 4.5 (Pinsker's inequality). For any two probability distributions,

D(P‖Q) ≥ ½ V(P, Q)² .

Proof. Put A⁺ = {i : p_i ≥ q_i} and A⁻ = {i : p_i < q_i}, and let ∂P and ∂Q refer to the datareduction defined by the decomposition A = A⁺ ∪ A⁻. Put p = P(A⁺) and q = Q(A⁺). By the datareduction inequality, D(P‖Q) ≥ D(∂P‖∂Q), and V(P, Q) = V(∂P, ∂Q) = 2(p − q). Keep p fixed and assume that 0 ≤ q ≤ p. Then

D(∂P‖∂Q) − ½ V(∂P, ∂Q)² = p log (p/q) + (1 − p) log ((1 − p)/(1 − q)) − 2(p − q)² ,

and elementary considerations via differentiation w.r.t. q (two times!) show that this expression is non-negative for 0 ≤ q ≤ p. The result follows.

Approximation with finite partition
In order to reduce certain investigations to cases which only involve a finite alphabet and in order to extend the definition of divergence to general Borel spaces, we need a technical result on approximation with respect to finer and finer partitions.We leave the usual discrete setting and take an arbitrary Borel space (A, A) as our basis.Thus A is a set, possibly uncountable, and A a Borel structure (the same as a σ-algebra) on A. By Π σ (A, A) we denote the set of countable decompositions of A in measurable sets (sets in A), ordered by subdivision.We use "≺" to denote this ordering, i.e. for π, ρ ∈ Π σ (A, A), π ≺ ρ means that every class in π is a union of classes in ρ.By π ∨ ρ we denote the coarsest decomposition which is finer than both π and ρ, i.e. π ∨ ρ consists of all non-empty sets of the form A ∩ B with A ∈ π, B ∈ ρ.
Clearly, Π σ (A, A) is an upward directed set, hence we may consider limits based on this set for which we use natural notation such as lim π , lim inf π etc.
By Π 0 (A, A) we denote the set of finite decompositions in Π σ (A, A) with the ordering inherited from Π σ (A, A).Clearly, Π 0 (A, A) is also an upward directed set.
By M¹₊(A, A) we denote the set of probability measures on (A, A). For P ∈ M¹₊(A, A) and π ∈ Π_σ(A, A), ∂_π P denotes the derived distribution, defined in consistency with the definitions of the previous section. If A_π denotes the σ-algebra generated by π, ∂_π P may be conceived as a measure in M¹₊(A, A_π) given by the measures of the atoms of A_π: (∂_π P)(A) = P(A) for A ∈ π. In the result below, we use Π₀ to denote the set Π₀(A, A).
Theorem 5.1. Let τ ∈ Π_σ(A, A) and let π range over the decompositions in Π₀ which are coarser than τ. Then:

(i) For any P ∈ M¹₊(A, A),

H(∂_τ P) = lim_π H(∂_π P) ; (5.19)

(ii) for any P, Q ∈ M¹₊(A, A),

D(∂_τ P ‖ ∂_τ Q) = lim_π D(∂_π P ‖ ∂_π Q) . (5.20)

Proof. We realize that we may assume that (A, A) is the discrete Borel structure on N and that τ is the decomposition of N consisting of all singletons {n}; n ∈ N.
For any non-empty subset A of A = N, denote by x_A the first element of A, and, for P ∈ M¹₊(A) and π ∈ Π₀, we put

P_π = ∑_{A∈π} P(A) δ_{x_A} ,

with δ_x denoting a unit mass at x. Then P_π →_V P along the directed set Π₀ and H(∂_π P) = H(P_π); π ∈ Π₀.
Combining lower semi-continuity and the datareduction inequality (i) of Corollary 4.3, we find that

H(P) ≤ lim inf_π H(P_π) = lim inf_π H(∂_π P) ≤ lim sup_π H(∂_π P) ≤ H(P) ,

and (i) follows. The proof of (ii) is similar.

As the nets in (5.19) and in (5.20) are weakly increasing, we may express the results more economically by using the sign "↑" in a standard way. The type of convergence established also points to martingale-type considerations, cf. [8].

Motivated by the above results, we now extend the definition of divergence to cover probability distributions on arbitrary Borel spaces. For P, Q ∈ M¹₊(A, A) we simply define D(P‖Q) by

D(P‖Q) = sup_{π∈Π₀} D(∂_π P ‖ ∂_π Q) . (5.21)

By Theorem 5.1 we then also have

D(P‖Q) = sup_{π∈Π_σ} D(∂_π P ‖ ∂_π Q) . (5.22)

The definition given is found to be the most informative when one recalls the separate definition given earlier for the discrete case, cf. (2.6) and (2.7). However, it is also important to note the following result, which most authors use as definition. It gives a direct analytical expression for divergence which can be used in the discrete as well as in the general case.
Theorem 5.2 (divergence in analytic form). Let (A, A) be a Borel space and let P, Q ∈ M¹₊(A, A). Then

D(P‖Q) = ∫ log (dP/dQ) dP , (5.23)

where dP/dQ denotes a version of the Radon–Nikodym derivative of P w.r.t. Q. If this derivative does not exist, i.e. if P is not absolutely continuous w.r.t. Q, then (5.23) is to be interpreted as giving the result D(P‖Q) = ∞.
Proof. First assume that P is not absolutely continuous w.r.t. Q. Then P(A) > 0 and Q(A) = 0 for some A ∈ A, and, considering the decomposition consisting of A and its complement, we see that D(P‖Q) = ∞, in agreement with (5.23).

Then assume that P is absolutely continuous w.r.t. Q and put f = dP/dQ. Furthermore, put I = ∫ log f dP. Then I can also be written as ∫ φ(f) dQ with φ(x) = x log x. As φ is convex, an application of Jensen's inequality on each set of a decomposition π shows that D(∂_π P ‖ ∂_π Q) ≤ I for every π, hence D(P‖Q) ≤ I.

In order to prove the reverse inequality, let t < I be given and choose s > 1 such that I − log s > t. As P({f = 0}) = 0, we find that the sets A_k = {s^k ≤ f < s^{k+1}}; k ∈ Z, together with {f = 0}, form a decomposition π ∈ Π_σ with ∑_{k∈Z} P(A_k) = 1. Then, from the right-hand inequality of the double inequality

s^k Q(A_k) ≤ P(A_k) ≤ s^{k+1} Q(A_k) ,  k ∈ Z , (5.24)

we find that

I = ∫ log f dP ≤ ∑_{k∈Z} (k + 1) P(A_k) log s ,

and, using also the left-hand inequality of (5.24), it follows that

D(∂_π P ‖ ∂_π Q) ≥ ∑_{k∈Z} k P(A_k) log s ≥ I − log s > t .

As π ∈ Π_σ, and as t < I was arbitrary, it follows from (5.22) that D(P‖Q) ≥ I. Together with the inequality D(P‖Q) ≤ I established above, this proves (5.23).
For the above discussion and results concerning the divergence D(P‖Q) between measures on arbitrary Borel spaces, we only had the case of probability distributions in mind. However, it is easy to extend the discussion to cover also the case when Q is allowed to be an incomplete distribution. Details are left to the reader.
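In the discrete case, the definition (5.21) can be watched at work: coarse finite partitions give smaller divergence, and refining them recovers the full value. A small Python illustration, with an arbitrarily chosen example:

```python
import math

def divergence(P, Q):
    return sum(p * math.log(p / Q[i]) for i, p in P.items() if p > 0)

P = {0: 0.4, 1: 0.3, 2: 0.2, 3: 0.1}
Q = {0: 0.25, 1: 0.25, 2: 0.25, 3: 0.25}

def coarsened(M, partition):
    """The derived distribution of M under a finite partition (a list of tuples)."""
    return {C: sum(M[x] for x in C) for C in partition}

coarse = [(0, 1), (2, 3)]          # a coarse partition of A
fine = [(0,), (1,), (2,), (3,)]    # the partition into singletons

d_coarse = divergence(coarsened(P, coarse), coarsened(Q, coarse))
d_fine = divergence(coarsened(P, fine), coarsened(Q, fine))
# d_coarse <= d_fine = D(P||Q): the sup in (5.21) is attained by refining.
```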
It does not make sense to extend the basic notion of entropy to distributions on general measure spaces, as the natural quantity to consider, sup_π H(∂_π P) with π ranging over Π₀ or Π_σ, can only yield a finite quantity if P is essentially discrete.

Mixing, convexity properties
Convexity properties are of great significance in Information Theory.Here we develop the most important of these properties by showing that the entropy function is concave whereas divergence is convex.
The setting is, again, a discrete alphabet A. On A we study various probability distributions. If (P_ν)_{ν≥1} is a sequence of such distributions, then a mixture of these distributions is any distribution P₀ of the form

P₀ = ∑_{ν≥1} α_ν P_ν , (6.25)

with α = (α_ν)_{ν≥1} any probability vector (α_ν ≥ 0 for ν ≥ 1, ∑₁^∞ α_ν = 1). In case α_ν = 0 eventually, (6.25) defines a normal convex combination of the P_ν's. The general case covered by (6.25) may be called an ω-convex combination.
As usual, a non-negative function f : M¹₊(A) → [0; ∞] is convex if f(∑ α_ν P_ν) ≤ ∑ α_ν f(P_ν) for every convex combination ∑ α_ν P_ν, and concave if the reverse inequality always holds. If we instead allow ω-convex combinations, we obtain the notions we shall call ω-convexity, respectively ω-concavity. And f is said to be strictly ω-convex if f is ω-convex and if, provided f(∑ α_ν P_ν) < ∞, equality can only hold when the P_ν with α_ν > 0 all coincide.

It is an important feature of the convexity (concavity) properties which we shall establish that the inequalities involved can be deduced from identities, which must then be considered the more basic properties.

Theorem 6.1 (identities for mixtures). Let P₀ = ∑₁^∞ α_ν P_ν be a mixture of distributions over A and let Q ∈ M¹₊(A). Then

H(P₀) = ∑_{ν≥1} α_ν H(P_ν) + ∑_{ν≥1} α_ν D(P_ν‖P₀) , (6.26)

D(P₀‖Q) = ∑_{ν≥1} α_ν D(P_ν‖Q) − ∑_{ν≥1} α_ν D(P_ν‖P₀) . (6.27)

Proof. By the linking identity, the right hand side of (6.26) equals ∑_ν α_ν ⟨κ₀, P_ν⟩, where κ₀ is the code adapted to P₀, and this may be rewritten as ⟨κ₀, ∑_ν α_ν P_ν⟩, i.e. as ⟨κ₀, P₀⟩, which is nothing but the entropy of P₀. This proves (6.26). Now add the term ∑_ν α_ν D(P_ν‖Q) to each side of (6.26) and you get the following identity:

H(P₀) + ∑_ν α_ν D(P_ν‖Q) = ∑_ν α_ν (H(P_ν) + D(P_ν‖Q)) + ∑_ν α_ν D(P_ν‖P₀) .

Conclude from this, once more using the linking identity, that

H(P₀) + ∑_ν α_ν D(P_ν‖Q) = ∑_ν α_ν ⟨κ, P_ν⟩ + ∑_ν α_ν D(P_ν‖P₀) ,

this time with κ adapted to Q. As ∑_ν α_ν ⟨κ, P_ν⟩ = ⟨κ, P₀⟩ = H(P₀) + D(P₀‖Q), we then see, upon subtracting the term H(P₀), that (6.27) holds provided H(P₀) < ∞. The general validity of (6.27) is then deduced by a routine approximation argument, appealing to Theorem 3.3.

Theorem 6.2 (basic convexity/concavity properties). The function P ↦ H(P) is strictly ω-concave and, for fixed Q, the function P ↦ D(P‖Q) is strictly ω-convex. This follows from Theorem 6.1.

The second part of the result studies D(P‖Q) as a function of the first argument P. It is natural also to look into divergence as a function of its second argument. To that end we introduce the geometric mixture of the probability distributions Q_ν; ν ≥ 1 w.r.t. the weights (α_ν)_{ν≥1} (as usual, α_ν ≥ 0 for ν ≥ 1 and ∑ α_ν = 1). By definition, this is the incomplete probability distribution Q₀^g, notationally denoted ∑^g α_ν Q_ν, which is defined by

Q₀^g(x) = ∏_{ν≥1} Q_ν(x)^{α_ν} ;  x ∈ A. (6.28)

In other words, the point probabilities Q₀^g(x) are the geometric averages of the corresponding point probabilities Q_ν(x); ν ≥ 1 w.r.t. the weights α_ν; ν ≥ 1.
That Q₀^g is indeed an incomplete distribution follows from the standard inequality connecting the geometric and arithmetic means. According to that inequality, Q₀^g ≤ Q₀^a, where Q₀^a denotes the usual arithmetic mixture Q₀^a = ∑ α_ν Q_ν. To distinguish this distribution from Q₀^g, we may write it as ∑^a α_ν Q_ν. If we change the point of view by considering instead the adapted codes κ_ν ↔ Q_ν; ν ≥ 1 and κ₀ ↔ Q₀^g then, corresponding to (6.28), we find that

κ₀ = ∑_{ν≥1} α_ν κ_ν ,

which is the usual arithmetic average of the codes κ_ν. We can now prove:

Theorem 6.3 (2nd convexity identity for divergence). Let P and Q_ν; ν ≥ 1 be probability distributions over A and let (α_ν)_{ν≥1} be a sequence of weights. Then the identity

∑_{ν≥1} α_ν D(P‖Q_ν) = D(P ‖ ∑^g α_ν Q_ν) (6.29)

holds.

Proof. Assume first that H(P) < ∞. Then, from the linking identity, we get (using notation as above):

∑_ν α_ν D(P‖Q_ν) = ∑_ν α_ν ⟨κ_ν, P⟩ − H(P) = ⟨κ₀, P⟩ − H(P) = D(P‖Q₀^g) ,

so that (6.29) holds in this case.
In order to establish the general validity of (6.29) we first approximate P by the conditional distributions P|A; A ∈ P₀(A) (which all have finite entropy). Recalling Theorem 3.3, and using the result established in the first part of this proof, we get:

∑_ν α_ν D(P‖Q_ν) ≤ lim inf_{A∈P₀(A)} ∑_ν α_ν D(P|A ‖ Q_ν) = lim_{A∈P₀(A)} D(P|A ‖ Q₀^g) = D(P‖Q₀^g) .

This shows that the inequality "≤" in (6.29) holds quite generally. But we can see more from the considerations above, since the only inequality appearing (obtained by an application of Fatou's lemma, if you wish) can be replaced by equality in case α_ν = 0 eventually (so that the α's really determine a finite probability vector). This shows that (6.29) holds in case α_ν = 0 eventually.
For the final step of the proof we introduce, for each n, the approximating finite probability vector (α_{n1}, α_{n2}, ···, α_{nn}, 0, 0, ···) with α_{nν} = α_ν / ∑_{μ≤n} α_μ, and denote by Q^g_{n0} the corresponding geometric mixture of Q₁, ..., Q_n. It is easy to see that Q^g_{n0} → Q^g₀ as n → ∞. By the results obtained so far and by lower semi-continuity of D we then have:

D(P‖Q₀^g) ≤ lim inf_n D(P‖Q^g_{n0}) = lim_n ∑_{ν≤n} α_{nν} D(P‖Q_ν) = ∑_{ν≥1} α_ν D(P‖Q_ν) ,

hereby establishing the missing inequality "≥" in (6.29).
The content of Corollary 6.5 is nothing but a convenient reformulation of Theorem 6.3. By the usual inequality connecting the geometric and arithmetic means, and by the result concerning situations with equality in this inequality, we find as a further corollary that the following convexity result holds:

D(∑_n α_n P_n ‖ ∑_n α_n Q_n) ≤ ∑_n α_n D(P_n‖Q_n) . (6.30)

In case the left-hand side in (6.30) is finite, equality holds in (6.30) if and only if either there exist P and Q such that P_n = P and Q_n = Q for all n with α_n > 0, or else P_n = Q_n for all n with α_n > 0.
Proof. We have

D(∑_n α_n P_n ‖ ∑_n α_n Q_n) = ∑_{i∈A} (∑_n α_n p_{n,i}) log ((∑_n α_n p_{n,i}) / (∑_n α_n q_{n,i})) ≤ ∑_{i∈A} ∑_n α_n p_{n,i} log ((α_n p_{n,i}) / (α_n q_{n,i})) = ∑_n α_n D(P_n‖Q_n) .

Here we used the well-known "log-sum inequality":

∑_ν x_ν log (x_ν/y_ν) ≥ (∑_ν x_ν) log ((∑_ν x_ν) / (∑_ν y_ν)) .

As equality holds in this inequality if and only if (x_ν) and (y_ν) are proportional, we see that, under the finiteness condition stated, equality holds in (6.30) if and only if, for each i ∈ A, there exists a constant c_i such that q_{n,i} = c_i p_{n,i} for all n. From this observation, the stated result can be deduced.
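The identities and inequalities of this section lend themselves to direct numerical verification. The following sketch checks the mixture identity (6.26), the geometric-mixture identity (6.29) and the joint convexity inequality (6.30) on small, arbitrarily chosen distributions.

```python
import math

def entropy(P):
    return -sum(p * math.log(p) for p in P.values() if p > 0)

def divergence(P, Q):
    return sum(p * math.log(p / Q[i]) for i, p in P.items() if p > 0)

alphas = [0.5, 0.3, 0.2]
Ps = [{"a": 0.7, "b": 0.3}, {"a": 0.2, "b": 0.8}, {"a": 0.5, "b": 0.5}]
Qs = [{"a": 0.4, "b": 0.6}, {"a": 0.6, "b": 0.4}, {"a": 0.5, "b": 0.5}]

def mixture(Ms):
    return {x: sum(a * M[x] for a, M in zip(alphas, Ms)) for x in ("a", "b")}

P0 = mixture(Ps)

# (6.26): H(P0) = sum a_v H(P_v) + sum a_v D(P_v || P0)
lhs_26 = entropy(P0)
rhs_26 = (sum(a * entropy(P) for a, P in zip(alphas, Ps))
          + sum(a * divergence(P, P0) for a, P in zip(alphas, Ps)))

# (6.29): sum a_v D(P || Q_v) = D(P || geometric mixture of the Q_v)
P = {"a": 0.6, "b": 0.4}
Qg = {x: math.exp(sum(a * math.log(Q[x]) for a, Q in zip(alphas, Qs)))
      for x in ("a", "b")}
lhs_29 = sum(a * divergence(P, Q) for a, Q in zip(alphas, Qs))
rhs_29 = divergence(P, Qg)

# (6.30): D(mix of P's || mix of Q's) <= sum a_v D(P_v || Q_v)
lhs_30 = divergence(mixture(Ps), mixture(Qs))
rhs_30 = sum(a * divergence(P_, Q_) for a, P_, Q_ in zip(alphas, Ps, Qs))
```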
The language of the probabilist

Previously, we expressed all definitions and results via probability distributions. Though these are certainly important in probability theory and statistics, it is often more suggestive to work with random variables or, more generally, for objects that do not assume real values, with random elements. Recall that a random element is nothing but a measurable map defined on a probability space, say X : Ω → S, where (Ω, F, P) is a probability space and (S, S) a Borel space. As we shall work in the discrete setting, S will be a discrete set and we will then take S = P(S) as the basic σ-algebra on S. Thus a discrete random element is a map X : Ω → S where Ω = (Ω, F, P) is a probability space and S a discrete set. As we are accustomed to, there is often no need to mention explicitly the "underlying" probability measure. If misunderstanding is unlikely, "P" is used as the generic letter for "probability of". By P_X we denote the distribution of X. If several random elements are considered at the same time, it is understood that the underlying probability space (Ω, F, P) is the same for all random elements considered, whereas the discrete sets where the random elements take their values may, in general, vary.
The entropy of a random element X is defined to be the entropy of its distribution: H(X) = H(P_X). The conditional entropy H(X|B) given an event B with P(B) > 0 then readily makes sense as the entropy of the conditional distribution of X given B. If Y is another random element, the joint entropy H(X, Y) also makes sense, simply as the entropy of the random element (X, Y) : ω ↦ (X(ω), Y(ω)). Another central and natural definition is the conditional entropy H(X|Y) of X given the random element Y, which is defined as

H(X|Y) = ∑_y P(Y = y) H(X|Y = y) . (7.31)

Here it is understood that summation extends over all possible values of Y. We see that H(X|Y) can be interpreted as the average entropy of X that remains after observation of Y. If X and X′ are random elements which take values in the same set, D(X‖X′) is another notation for the divergence D(P_X‖P_{X′}) between the associated distributions.
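The chain rule H(X, Y) = H(Y) + H(X|Y), the random-variable form of Shannon's identity (4.16), can be checked directly from a joint distribution. A small Python illustration with an arbitrary joint table:

```python
import math

def entropy(P):
    return -sum(p * math.log(p) for p in P.values() if p > 0)

# Joint distribution of (X, Y) as a dict {(x, y): probability}.
joint = {("0", "s"): 0.3, ("0", "t"): 0.2, ("1", "s"): 0.1, ("1", "t"): 0.4}

def marginal_Y(joint):
    PY = {}
    for (x, y), p in joint.items():
        PY[y] = PY.get(y, 0.0) + p
    return PY

def cond_entropy(joint):
    """H(X|Y) = sum_y P(Y = y) H(X | Y = y), as in (7.31)."""
    PY = marginal_Y(joint)
    return sum(py * entropy({x: p / py for (x, y), p in joint.items() if y == yy})
               for yy, py in PY.items())

H_XY = entropy(joint)           # joint entropy H(X, Y)
H_Y = entropy(marginal_Y(joint))
H_X_given_Y = cond_entropy(joint)
# Chain rule: H(X, Y) = H(Y) + H(X|Y)
```

For this table one also sees the "decrease in uncertainty" reading: H(X|Y) does not exceed the entropy of the marginal of X.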
Certain extreme situations may occur, e.g. if X and Y are independent random elements or, a possible scenario in the "opposite" extreme, if X is a consequence of Y , by which we mean that, for every y with P (Y = y) positive, the conditional distribution of X given Y = y is deterministic.
Often we try to economize on notation without running a risk of misunderstanding, cf. Table 2. For instance, (7.31) may be written H(X|Y) = ∑_y P(y) H(X|y). Let us collect some results formulated in the language of random elements which follow from results of the previous sections. Note that "the information about X contained in Y" is the same as "the information about Y contained in X". Because of this symmetry, we choose to use a more "symmetric" terminology rather than the directional "information about X contained in Y". Finally, we declare a preference for the "saving in coding effort" definition, simply because it is quite general, as opposed to the "decrease in uncertainty" definition, which leads to (8.36) and could result in the indeterminate form ∞ − ∞.
With the above discussion in mind we are now prepared to define I(X ∧ Y), the mutual information of X and Y, by

I(X ∧ Y) = ∑_y P(y) D(X|y ‖ X) . (8.38)

As we saw above, mutual information is symmetric in case H(X|Y) and H(Y|X) are finite. However, symmetry holds in general, as we shall now see. Let us collect these and other basic results in one theorem:

Theorem 8.1. Let X and Y be discrete random elements with distributions P_X, respectively P_Y, and let P_{X,Y} denote the joint distribution of (X, Y). Then I(X ∧ Y) = I(Y ∧ X) and

I(X ∧ Y) = D(P_{X,Y} ‖ P_X ⊗ P_Y) . (8.40)

It is instructive to derive the somewhat surprising identity (8.40) directly from the more natural datareduction identity (4.17). To this end, let X : Ω → A and Y : Ω → B be the random variables concerned, denote their distributions by P₁, respectively P₂, and let P₁₂ denote their joint distribution. Further, let π : A × B → A be the natural projection. By (4.17), applied to the datareduction defined by π,

D(P₁₂ ‖ P₁ ⊗ P₂) = D(P₁‖P₁) + ∑_x P₁(x) D(Y|x ‖ Y) = I(Y ∧ X) .

By symmetry, e.g. by considering the natural projection A × B → B instead, we then see that I(X ∧ Y) = I(Y ∧ X).
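The symmetry expressed by (8.40) is easy to confirm numerically: both directional forms of (8.38) equal the divergence between the joint distribution and the product of the marginals. A small Python sketch with a hypothetical joint table:

```python
import math

def divergence(P, Q):
    return sum(p * math.log(p / Q[i]) for i, p in P.items() if p > 0)

joint = {("0", "s"): 0.3, ("0", "t"): 0.2, ("1", "s"): 0.1, ("1", "t"): 0.4}

def marginal(joint, coord):
    M = {}
    for key, p in joint.items():
        M[key[coord]] = M.get(key[coord], 0.0) + p
    return M

PX, PY = marginal(joint, 0), marginal(joint, 1)
product = {(x, y): PX[x] * PY[y] for x in PX for y in PY}

# I(X ^ Y) as D(P_{X,Y} || P_X x P_Y), manifestly symmetric in X and Y:
I_div = divergence(joint, product)

def mutual_information(joint, coord):
    """Directional form (8.38): sum over values v of the conditioning
    variable of P(v) * D(other|v || other)."""
    PV = marginal(joint, coord)
    other = marginal(joint, 1 - coord)
    total = 0.0
    for v, pv in PV.items():
        cond = {key[1 - coord]: p / pv for key, p in joint.items()
                if key[coord] == v}
        total += pv * divergence(cond, other)
    return total

I_xy = mutual_information(joint, 1)  # condition on Y, as in (8.38)
I_yx = mutual_information(joint, 0)  # condition on X
```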

Information Transmission
An important aspect of many branches of mathematics is that to a smaller or larger extent one is free to choose/design/optimize the system under study.For key problems of information theory this freedom lies in the choice of a distribution or, equivalently, a code.In this and in the next section we look at two models which are typical for optimization problems of information theory.A detailed study has to wait until later chapters.For now we only introduce some basic concepts and develop their most fundamental properties.

Corollary 6.5 (Convexity of D in the second argument). For each fixed P ∈ M¹₊(A), the function Q ↦ D(P‖Q) defined on M¹₊(A) is strictly ω-convex.

It lies nearby to investigate joint convexity of D(·‖·) with both first and second argument varying.

Theorem 6.6 (joint convexity of divergence). D(·‖·) is jointly ω-convex, i.e. for any sequences (P_n) and (Q_n) of distributions and any probability vector (α_n), the inequality (6.30) holds.

short notation | full notation or definition
P(x), P(y), P(x, y) | P(X = x), P(Y = y), P((X, Y) = (x, y))
P(x|y), P(y|x) | P(X = x|Y = y), P(Y = y|X = x)
X|y | conditional distribution of X given Y = y
Y|x | conditional distribution of Y given X = x

Table 2: Short notation
An important corollary to Theorems 4.1 and 4.2 is the following:

Corollary 4.3 (datareduction inequalities). With notation as above the following results hold: (i) H(P) ≥ H(∂P) and, in case H(P) < ∞, equality holds if and only if P is deterministic modulo θ. (ii) D(P‖Q) ≥ D(∂P‖∂Q) and, in case D(P‖Q) < ∞, equality holds if and only if Q equals P modulo θ.

Corollary 4.4 (Shannon's inequality for conditional entropy). (i) H(P) ≥ H_θ(P) and, in case H(P) < ∞, equality holds if and only if the support of P is contained in one of the classes A_i defined by θ. (ii) D(P‖Q) ≥ D_θ(P‖Q) and, in case D(P‖Q) < ∞, equality holds if and only if ∂P = ∂Q, i.e. if and only if, for all classes A_i defined by θ, P(A_i) = Q(A_i).