Maximum Entropy on Compact Groups

On a compact group the Haar probability measure plays the role of uniform distribution. The entropy and rate distortion theory for this uniform distribution are studied. New results and simplified proofs on convergence of convolutions on compact groups are presented, and they can be formulated as entropy increasing to its maximum. Information theoretic techniques and Markov chains play a crucial role. The convergence results are also formulated via rate distortion functions. The rate of convergence is shown to be exponential.


Introduction
It is a well-known and celebrated result that the uniform distribution on a finite set can be characterized as having maximal entropy. Jaynes used this idea as a foundation of statistical mechanics [1], and the Maximum Entropy Principle has become a popular principle for statistical inference [2][3][4][5][6][7][8]. Often it is used as a method to get prior distributions. On a finite set, for any distribution P we have H(P) = H(U) − D(P ‖ U), where H is the Shannon entropy, D is information divergence, and U is the uniform distribution. Thus, maximizing H(P) is equivalent to minimizing D(P ‖ U). Minimization of information divergence can be justified by the conditional limit theorem by Csiszár [9, Theorem 4]. So if we have a good reason to use the uniform distribution as prior distribution we automatically get a justification of the Maximum Entropy Principle. The conditional limit theorem cannot justify the use of the uniform distribution itself, so we need something else. Here we shall focus on symmetry.
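The identity H(P) = H(U) − D(P ‖ U) on a finite set can be checked numerically. The following is a minimal sketch; the distribution p is a hypothetical loaded die chosen only for illustration, and all quantities are measured in nats.

```python
import math

def entropy(p):
    """Shannon entropy in nats of a probability vector p."""
    return -sum(x * math.log(x) for x in p if x > 0)

def divergence(p, q):
    """Information divergence D(p || q) in nats; assumes q > 0 wherever p > 0."""
    return sum(x * math.log(x / y) for x, y in zip(p, q) if x > 0)

# Hypothetical loaded die (an assumption made for this example only).
p = [0.10, 0.15, 0.15, 0.20, 0.20, 0.20]
u = [1 / 6] * 6  # uniform distribution on six outcomes

lhs = entropy(p)
rhs = entropy(u) - divergence(p, u)
print(lhs, rhs)  # the two numbers agree up to rounding
```

Since D(P ‖ U) is non-negative, the sketch also makes visible why maximizing H(P) and minimizing D(P ‖ U) are the same problem.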
Example 1. A die has six sides that can be permuted via rotations of the die. We note that not all permutations can be realized as rotations and not all rotations will give permutations. Let G be the group of permutations that can be realized as rotations. We shall consider G as the symmetry group of the die and observe that the uniform distribution on the six sides is the only distribution that is invariant under the action of the symmetry group G.
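The symmetry group of the die can be enumerated explicitly. The sketch below is an illustration under stated assumptions: the face labelling and the two generating quarter turns are choices made here, not taken from the text. It generates the group by closure and averages a hypothetical non-uniform distribution over the group, which yields the uniform distribution because the action is transitive on the faces.

```python
def compose(a, b):
    """Return the permutation 'apply b first, then a' on the six faces."""
    return tuple(a[b[i]] for i in range(6))

# Assumed labelling: 0 top, 1 bottom, 2 front, 3 back, 4 left, 5 right.
# Entry i of a permutation is the position that face i is moved to.
TURN_VERTICAL = (0, 1, 5, 4, 2, 3)    # quarter turn about the top-bottom axis
TURN_HORIZONTAL = (2, 3, 1, 0, 4, 5)  # quarter turn about the left-right axis

def generate(generators):
    """Close the set of generators under composition (finite group)."""
    identity = tuple(range(6))
    group = {identity}
    frontier = [identity]
    while frontier:
        g = frontier.pop()
        for s in generators:
            h = compose(s, g)
            if h not in group:
                group.add(h)
                frontier.append(h)
    return group

def push_forward(p, g):
    """Distribution of g(X) when X has distribution p on the faces."""
    q = [0.0] * 6
    for i, mass in enumerate(p):
        q[g[i]] += mass
    return q

G = generate([TURN_VERTICAL, TURN_HORIZONTAL])

# Hypothetical non-uniform distribution on the faces.
p = [0.5, 0.1, 0.1, 0.1, 0.1, 0.1]
averaged = [sum(push_forward(p, g)[i] for g in G) / len(G) for i in range(6)]
print(len(G), averaged)
```

Averaging over the group always produces an invariant distribution, so the fact that the average is uniform for every starting distribution is consistent with uniqueness of the invariant distribution.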
Example 2. G = R/2πZ is a commutative group that can be identified with the group SO(2) of rotations in 2 dimensions. This is the simplest example of a group that is compact but not finite.
For an object with symmetries the symmetry group defines a group action on the object, and any group action on an object defines a symmetry group of the object. A special case of a group action of the group G is left translation of the elements in G. Instead of studying distributions on objects with symmetries, in this paper we shall focus on distributions on the symmetry groups themselves. It is no serious restriction because a distribution on the symmetry group of an object will induce a distribution on the object itself.
Convergence of convolutions of probability measures was studied by Stromberg [10], who proved weak convergence of convolutions of probability measures. An information theoretic approach was introduced by Csiszár [11]. Classical methods involving characteristic functions have been used to give conditions for uniform convergence of the densities of convolutions [12]. See [13] for a review of the subject and further references.
Finally it is shown that convergence in information divergence corresponds to uniform convergence of the rate distortion function and that weak convergence corresponds to pointwise convergence of the rate distortion function. In this paper we shall mainly consider convolutions as Markov chains. This will give us a tool that allows us to prove convergence of i.i.d. convolutions, and the rate of convergence is proved to be exponential.
The rest of the paper is organized as follows. In Section 2 we establish a number of simple results on distortion functions on compact sets. These results will be used in Section 4. In Section 3 we define the uniform distribution on a compact group as the uniquely determined Haar probability measure. In Section 4 it is shown that the uniform distribution is the maximum entropy distribution on a compact group in the sense that it maximizes the rate distortion function at any positive distortion level. Convergence of convolutions of a distribution to the uniform distribution is established in Section 5 using Markov chain techniques, and the rate of convergence is discussed in Section 6. The group SO(2) is used as our running example. We finish with a short discussion.

Distortion on compact groups
Let G be a compact group where * denotes the composition. The neutral element will be denoted e and the inverse of the element g will be denoted g^{-1}.
Figure 1. Squared Euclidean distance between the rotation angles x and y.
We shall start with some general comments on distortion functions on compact sets. Assume that the group plays the role of both source alphabet and reproduction alphabet. A distortion function d : G × G → R is given and we will assume that d(x, y) ≥ 0 with equality if and only if x = y. We will also assume that the distortion function is continuous. This is illustrated in Figure 1.
The distortion function might be a metric, but even if the distortion function is not a metric, the relation between the distortion function and the topology is the same as if it was a metric. One way of constructing a distortion function on a group is to use the squared Hilbert-Schmidt norm in a unitary representation of the group. The following lemma may be considered as a kind of uniform continuity of the distortion function or as a substitute for the triangular inequality when d is not a metric.
Lemma 5. If C is a compact set and d : C × C → R is a non-negative continuous distortion function such that d(x, y) = 0 if and only if x = y, then there exists a continuous function
Proof. Assume that the lemma does not hold. Then there exists ε > 0 and a net (x_λ, y_λ, z_λ)_{λ∈Λ} such that d(x_λ, y_λ) − d(z_λ, y_λ) > ε and d(z_λ, y_λ) → 0. A net in a compact set has a convergent subnet, so without loss of generality we may assume that the net (x_λ, y_λ, z_λ)_{λ∈Λ} converges to some triple (x_∞, y_∞, z_∞). By continuity of the distortion function we get
and d(z_∞, y_∞) = 0, which implies z_∞ = y_∞ and we have a contradiction.
We note that if a distortion function satisfies (1) then it defines a topology in which the distortion balls are open.
In order to define the weak topology on probability distributions we extend the distortion function from C × C to M_+^1(C) × M_+^1(C), where M_+^1(C) denotes the set of probability measures on C, by
d(P, Q) = inf E[d(X, Y)],
where X and Y are random variables with values in C and the infimum is taken over all joint distributions on (X, Y) such that the marginal distribution of X is P and the marginal distribution of Y is Q. The distortion function is continuous so (x, y) → d(x, y) has a maximum that we denote d_max.
Proof. According to Lemma 5 there exists a function f_1 satisfying (1). We use that
This holds for all δ > 0 and in particular for δ = (d(S, P))^{1/2}, which proves the theorem.
The theorem can be used to construct the weak topology on M_+^1(C) with the sets {P ∈ M_+^1(C) | d(P, Q) < ε}, Q ∈ M_+^1(C), ε > 0, as open balls that generate the topology. We note without proof that this definition is equivalent to the quite different definition of weak topology that one will find in most textbooks.
For a group G we assume that the distortion function is right invariant in the sense that d(x * z, y * z) = d(x, y) for all x, y, z ∈ G. A right invariant distortion function satisfies d(x, y) = d(x * y^{-1}, e), so right invariant continuous distortion functions on a group can be constructed from non-negative functions with a minimum in e.
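The construction of a right invariant distortion function from a non-negative function with its minimum in e can be sketched on a small non-commutative group. In the sketch below the group is the permutation group on three points, and the function phi (the number of points moved) is a hypothetical choice made only for illustration; right invariance is taken to mean d(x * z, y * z) = d(x, y).

```python
from itertools import permutations

def compose(a, b):
    """Group composition a * b on permutations: apply b first, then a."""
    return tuple(a[b[i]] for i in range(len(a)))

def inverse(a):
    """Inverse permutation of a."""
    inv = [0] * len(a)
    for i, ai in enumerate(a):
        inv[ai] = i
    return tuple(inv)

def phi(g):
    """Non-negative function that is zero only at the neutral element:
    the number of points moved by g (hypothetical choice)."""
    return sum(1 for i, gi in enumerate(g) if i != gi)

def d(x, y):
    """Distortion built as d(x, y) = phi(x * y^{-1})."""
    return phi(compose(x, inverse(y)))

G = list(permutations(range(3)))

right_invariant = all(
    d(compose(x, z), compose(y, z)) == d(x, y)
    for x in G for y in G for z in G
)
zero_only_on_diagonal = all((d(x, y) == 0) == (x == y) for x in G for y in G)
print(right_invariant, zero_only_on_diagonal)
```

The check works for any choice of phi that vanishes only at e, because (x * z) * (y * z)^{-1} = x * y^{-1}.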

The Haar measure
We use * to denote convolution of probability measures on G. For g ∈ G we shall use g * P to denote the g-translation of the measure P or, equivalently, the convolution with a measure concentrated in g. The n-fold convolution of a distribution P with itself will be denoted P^{*n}. For random variables with values in G one can formulate an analog of the central limit theorem. We recall some facts about probability measures on compact groups and their Haar measures.
Definition 7. Let G be a group. A measure P is said to be a left Haar measure if g * P = P for any g ∈ G. Similarly, P is said to be a right Haar measure if P * g = P for any g ∈ G. A measure is said to be a Haar measure if it is both a left Haar measure and a right Haar measure.
is a density of a probability distribution P on SO(2) if the Fourier coefficients a_n are sufficiently small so that f is non-negative. A sufficient condition for f to be non-negative is that Σ_{n=1}^∞ |a_n| ≤ 1. Translation by y gives a distribution with density
The distribution P is invariant if and only if f is 1 or, equivalently, all Fourier coefficients (a_n)_{n∈N} are 0.
A measure P on G is said to have full support if the support of P is G, i.e. P(A) > 0 for any non-empty open set A ⊆ G. The following theorem is well-known [14][15][16].
Theorem 9. Let U be a probability measure on the compact group G. Then the following five conditions are equivalent.
• U is a left Haar measure.
• U is a right Haar measure.
• U has full support and is idempotent in the sense that U * U = U.
• There exists a probability measure P on G with full support such that P * U = U.
• There exists a probability measure P on G with full support such that U * P = U.
In particular a Haar probability measure is unique.
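On a finite group the conditions of Theorem 9 can be verified directly. The following sketch uses the cyclic group Z_6 with addition modulo 6 as a stand-in for a compact group; the distribution p is a hypothetical full-support distribution chosen for illustration.

```python
def convolve(p, q):
    """Convolution of two distributions on the cyclic group Z_m."""
    m = len(p)
    r = [0.0] * m
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            r[(i + j) % m] += pi * qj
    return r

def translate(g, p):
    """g * P: the distribution of g + X when X has distribution p."""
    m = len(p)
    return [p[(i - g) % m] for i in range(m)]

def close(p, q, tol=1e-12):
    """Componentwise comparison up to rounding."""
    return all(abs(a - b) < tol for a, b in zip(p, q))

m = 6
u = [1 / m] * m
p = [0.4, 0.3, 0.1, 0.1, 0.05, 0.05]  # hypothetical, full support

haar = all(close(translate(g, u), u) for g in range(m))  # g * U = U
idempotent = close(convolve(u, u), u)                      # U * U = U
absorbing = close(convolve(p, u), u)                       # P * U = U
p_is_invariant = all(close(translate(g, p), p) for g in range(m))
print(haar, idempotent, absorbing, p_is_invariant)
```

Because Z_6 is commutative, left and right translation coincide here, so the sketch does not distinguish the left and right versions of the conditions.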
In [14][15][16] one can find the proof that any locally compact group has a Haar measure. The unique Haar probability measure on a compact group will be called the uniform distribution and denoted U. For probability measures P and Q the information divergence from P to Q is defined by
D(P ‖ Q) = ∫ log(dP/dQ) dP if P ≪ Q, and D(P ‖ Q) = ∞ otherwise.
We shall often calculate the divergence from a distribution to the uniform distribution U, and introduce the notation
D(P) = D(P ‖ U).
For a random variable X with values in G we will sometimes write D(X ‖ U) instead of D(P ‖ U) when X has distribution P.
Example 10. The distribution P with density f given by (2) has
Let G be a compact group with uniform distribution U and let F be a closed subgroup of G. Then the subgroup has a Haar probability measure U_F and
where

The rate distortion theory
We will develop aspects of the rate distortion theory of a compact group G. Let P be a probability measure on G. We observe that compactness of G implies that a covering of G by distortion balls of radius δ > 0 contains a finite covering. If k is the number of balls in a finite covering then R_P(δ) ≤ log(k), where R_P is the rate distortion function of the probability measure P. In particular the rate distortion function is upper bounded. The entropy of a probability distribution P is given by H(P) = R_P(0). If the group is finite then the uniform distribution maximizes the Shannon entropy R_P(0), but if the group is not finite then in principle there is no entropy maximizer. As we shall see, the uniform distribution still plays the role of entropy maximizer in the sense that the uniform distribution maximizes the value R_P(δ) of the rate distortion function for any positive distortion level δ > 0. The rate distortion function R_P can be studied using its convex conjugate R*_P given by
The rate distortion function is then recovered by the formula
The techniques are pretty standard [17].
Theorem 11. The rate distortion function of the uniform distribution is given by
where Z is the partition function defined by
The rate distortion function of an arbitrary distribution P satisfies
Proof. First we prove a Shannon type lower bound on the rate distortion function of an arbitrary distribution P on the group. Let X be a random variable with values in G and distribution P, and let X̂ be a random variable coupled with X such that the mean distortion E[d(X̂, X)] equals δ. Then
Now, E[d(X̂, X)] = E[d(X̂ * X^{-1}, e)] and
where P_β is the distribution that minimizes divergence under the constraint E[d(Y, e)] = δ when Y has distribution P_β. The distribution P_β is given by the density
where β is determined by the condition δ = Z′(β)/Z(β).
If P is uniform then a joint distribution is obtained by choosing X uniformly distributed, and choosing Y distributed according to P_β and independent of X. Then X̂ = Y * X is distributed according to P_β * U = U, and we have equality in (7). Hence the rate determined by the lower bound (7) is achievable for the uniform distribution, which proves the first part of the theorem and the left inequality in (4).
The joint distribution on (X, X̂) that achieves the rate distortion function when X has a uniform distribution defines a Markov kernel Ψ : X → X̂ that is invariant under translations in the group. For any distribution P the joint distribution on (X, X̂) determined by P and Ψ gives an achievable pair of distortion and rate that is on the rate distortion curve of the uniform distribution. This proves the right inequality in Equation (4).
Example 12. For the group SO(2) the rate distortion function can be parametrized using the modified Bessel functions I_j, j ∈ N_0. The partition function is given by
Z(β) = exp(2β) · I_0(−2β).
Hence R*_U(β) = log(Z(β)) = 2β + log(I_0(−2β)). The distortion δ corresponding to β is given by
and the corresponding rate is
These joint values of distortion and rate can be plotted with β as parameter as illustrated in Figure 2.
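The parametrization by β can also be explored numerically without Bessel functions. The sketch below rests on assumptions that should be read as such: the density of P_β with respect to U is taken to be exp(β · d(x, e))/Z(β) with β ≤ 0 and d(x, e) = 2 − 2 cos(x), the distortion is computed as the mean of d under P_β, and the rate is computed as the divergence D(P_β ‖ U) = β · δ − log Z(β). Integrals over SO(2) are approximated by a midpoint rule with a grid size n chosen here.

```python
import math

def d(x):
    """Distortion d(x, e) = 2 - 2 cos(x) on SO(2)."""
    return 2.0 - 2.0 * math.cos(x)

def curve_point(beta, n=2000):
    """Return (distortion, rate in nats) for the assumed density
    exp(beta * d(x)) / Z(beta) with respect to the uniform distribution."""
    xs = [2.0 * math.pi * (k + 0.5) / n for k in range(n)]
    w = [math.exp(beta * d(x)) for x in xs]
    z = sum(w) / n                                    # partition function Z(beta)
    delta = sum(wi * d(x) for wi, x in zip(w, xs)) / (n * z)  # mean distortion
    rate = beta * delta - math.log(z)                 # divergence D(P_beta || U)
    return delta, rate

for beta in (0.0, -0.5, -1.0, -2.0, -4.0):
    delta, rate = curve_point(beta)
    print(f"beta={beta:5.1f}  distortion={delta:.4f}  rate={rate:.4f} nats")
```

At β = 0 the sketch returns distortion 2 and rate 0, which agrees with the critical distortion stated in the caption of Figure 2; more negative β gives smaller distortion and larger rate.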
The minimal rate of the uniform distribution is achieved when X and X̂ are independent. In this case the distortion is E[d(X̂, X)] = ∫_G d(x, e) dP(x). This distortion level will be called the critical distortion and will be denoted d_crit. On the interval ]0; d_crit] the rate distortion function is decreasing and the distortion rate function is the inverse R_P^{-1} of the rate distortion function R_P on this interval. The distortion rate function satisfies:
Theorem 13. The distortion rate function of an arbitrary distribution P satisfies
for some increasing continuous function f_2 satisfying f_2(0) = 0.
Proof. The right hand side follows because R_U is decreasing in the interval [0; d_crit]. Let X be a random variable with distribution P and let Y be a random variable coupled with X. Let Z be a random variable coupled with X such that E[d(X, Z)] = d(P, U). The couplings between X and Y, and between X and Z, can be extended to a joint distribution on X, Y and Z such that Y and Z are independent given X. For this joint distribution we have
We have to prove that
and this follows because

Convergence of convolutions
We shall prove that under certain conditions the n-fold convolutions P^{*n} converge to the uniform distribution.
Example 14. The function
is a density of a probability distribution P on G if the Fourier coefficients a_n are sufficiently small. If (a_n) and (b_n) are Fourier coefficients of P and Q then the convolution has density
Therefore the n-fold convolution has density
Therefore each of the Fourier coefficients is exponentially decreasing.
Clearly, if P is uniform on a proper subgroup then convergence does not hold. In several papers on this topic [13, 18, and references therein] it is claimed and "proved" that if convergence does not hold then the support of P is contained in the coset of a proper normal subgroup. The proofs therefore contain errors that seem to have been copied from paper to paper. To avoid this problem and make this paper more self-contained we shall reformulate and reprove some already known theorems. In the theory of finite Markov chains it is well-known that there exists an invariant probability measure. Certain Markov chains exhibit periodic behavior where a certain distribution is repeated after a number of transitions. All distributions in such a cycle will lie at a fixed distance from any (fixed) measure, where the distance is given by information divergence or total variation (or any other Csiszár f-divergence). It is also well-known that finite Markov chains without periodic behavior are convergent. In general a Markov chain will converge to a "cyclic" behavior as stated in the following theorem [19].
Theorem 15. Let Φ be a transition operator on a state space A with an invariant probability measure Q in . If D(S ‖ Q) < ∞ then there exists a probability measure
We shall also use the following proposition that has a purely computational proof [20].
Proposition 16. Let P_x, x ∈ X be distributions and let Q be a probability distribution on X. Then
We denote the set of probability measures on G by M_+^1(G).
Theorem 17. Let P be a distribution on a compact group G and assume that the support of P is not contained in any nontrivial coset of a subgroup of G. Then, if D(S ‖ U) is finite then
denote the Markov kernel Ψ(g) = P * g. Then P^{*n} * S = Ψ^n(P * S). Thus there exists a probability measure
First we note that
Therefore g * Q = P * Q for P almost every g ∈ G. Thus there exists at least one g_0 ∈ G such that
Ψ^{i−1} P.
According to [19] this ergodic mean will converge to a distribution T such that Ψ(T) = T so that T * P = T. Hence we also have that T * T = T, i.e. T is idempotent and therefore supported by a subgroup of G. We know that P is not contained in any nontrivial subgroup of G so the support of T must be G. We also get Q = T * Q, which together with Theorem 9 implies that Q = U.
By choosing S = P we get the following corollary.
Corollary 18. Let P be a probability measure on the compact group G with Haar probability measure U. Assume that the support of P is not contained in any coset of a proper subgroup of G and D(P ‖ U) is finite. Then D(P^{*n} ‖ U) → 0 for n → ∞.
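The role of the coset condition in Corollary 18 can be illustrated on the cyclic group Z_6, used here as a finite stand-in for a compact group. The two distributions below are hypothetical examples: the first has support {0, 1}, which is not contained in any coset of a proper subgroup, and the second has support {1, 3}, which lies in the coset 1 + {0, 2, 4}. Divergences are in nats.

```python
import math

def convolve(p, q):
    """Convolution of two distributions on the cyclic group Z_m."""
    m = len(p)
    r = [0.0] * m
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            r[(i + j) % m] += pi * qj
    return r

def divergence_from_uniform(p):
    """D(p || U) in nats, where U is uniform on len(p) points."""
    m = len(p)
    return sum(x * math.log(x * m) for x in p if x > 0)

def divergences(p, steps):
    """Return [D(P^{*n} || U) for n = 1, ..., steps]."""
    out, q = [], list(p)
    for _ in range(steps):
        out.append(divergence_from_uniform(q))
        q = convolve(q, p)
    return out

good = [0.5, 0.5, 0.0, 0.0, 0.0, 0.0]  # support {0, 1}
bad = [0.0, 0.5, 0.0, 0.5, 0.0, 0.0]   # support {1, 3}, inside 1 + {0, 2, 4}

divs_good = divergences(good, 60)
divs_bad = divergences(bad, 60)
print(divs_good[0], divs_good[-1])  # decreases towards 0
print(min(divs_bad), math.log(2))   # never drops below log 2
```

In the second case every convolution power is supported on either the even or the odd residues, so its divergence from the uniform distribution stays at or above log 2.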
Corollary 18 together with Theorem 11 implies the following result.
Corollary 19. Let P be a probability measure on the compact group G with Haar probability measure U. Assume that the support of P is not contained in any coset of a proper subgroup of G and D(P ‖ U) is finite. Then the rate distortion function of P^{*n} converges uniformly to the rate distortion function of the uniform distribution.
We also get weak versions of these results.
Corollary 20. Let P be a probability measure on the compact group G with Haar probability measure U. Assume that the support of P is not contained in any coset of a proper subgroup of G. Then P^{*n} converges to U in the weak topology, i.e. d(P^{*n}, U) → 0 for n → ∞.
Proof. If we take S = P_β then D(P_β) is finite and D(P^{*n} * P_β ‖ U) → 0 for n → ∞. We have

Rate of convergence
Normally the rate of convergence will be exponential. If the density is lower bounded this is well-known. We give a simplified proof of this. For any distribution Q on G we have
Here we have used convexity of divergence.
If a distribution P has support in a proper subgroup F then
Therefore D(P) < 1 bit implies that P cannot be supported by a proper subgroup, but it implies more.
Proposition 23. If P is a distribution on the compact group G and D(P) < 1 bit then d(P * P)/dU is lower bounded by a positive constant.
Proof. The condition D(P) < 1 bit implies that U(dP/dU > 0) > 1/2. Hence there exists ε > 0 such that U(dP/dU > ε) > 1/2. We have
Using the inclusion-exclusion inequalities we get
for all y ∈ G.
Combining Theorem 17, Lemma 22, and Proposition 23 we get the following result.
Theorem 24. Let P be a probability measure on a compact group G with Haar probability measure U. If the support of P is not contained in any coset of a proper subgroup of G and D(P ‖ U) is finite then the rate of convergence of D(P^{*n} ‖ U) to zero is exponential.
As a corollary we get the following result that was first proved by Kloss [21] for total variation.
Corollary 25. Let P be a probability measure on the compact group G with Haar probability measure U. If the support of P is not contained in any coset of a proper subgroup of G and D(P ‖ U) is finite then P^{*n} converges to U in variation and the rate of convergence is exponential.
Proof. This follows directly from Pinsker's inequality [22,23]
Corollary 26. Let P be a probability measure on the compact group G with Haar probability measure U. If the support of P is not contained in any coset of a proper subgroup of G and D(P ‖ U) is finite, then the density dP^{*n}/dU converges to 1 pointwise almost surely for n tending to infinity.
Proof. The variation norm can be written as
The result follows by the exponential rate of convergence of P^{*n} to U in total variation combined with the Borel-Cantelli Lemma.
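The step from divergence to total variation in Corollary 25 relies on Pinsker's inequality, which in nats and with the variation distance taken as the L1 distance reads ‖P − Q‖ ≤ (2 D(P ‖ Q))^{1/2}. The sketch below checks this along the convolution powers of a hypothetical distribution on the cyclic group Z_5, a finite stand-in chosen for illustration.

```python
import math

def convolve(p, q):
    """Convolution of two distributions on the cyclic group Z_m."""
    m = len(p)
    r = [0.0] * m
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            r[(i + j) % m] += pi * qj
    return r

def divergence_from_uniform(p):
    """D(p || U) in nats."""
    m = len(p)
    return sum(x * math.log(x * m) for x in p if x > 0)

def variation_from_uniform(p):
    """L1 distance between p and the uniform distribution."""
    m = len(p)
    return sum(abs(x - 1.0 / m) for x in p)

p = [0.4, 0.3, 0.1, 0.1, 0.1]  # hypothetical distribution on Z_5

rows, q = [], list(p)
for n in range(1, 9):
    rows.append((n, variation_from_uniform(q),
                 math.sqrt(2.0 * divergence_from_uniform(q))))
    q = convolve(q, p)

for n, tv, bound in rows:
    print(f"n={n}  variation={tv:.3e}  Pinsker bound={bound:.3e}")
```

Both columns shrink geometrically in n, so an exponential rate for the divergence carries over to an exponential rate for the variation distance.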

Discussion
In this paper we have assumed the existence of the Haar measure by referring to the literature. With the Haar measure we have then proved convergence of convolutions using Markov chain techniques. The Markov chain approach can also be used to prove the existence of the Haar measure by simply referring to the fact that a homogeneous Markov chain on a compact set has an invariant distribution. The problem with this approach is that the proof that a Markov chain on a compact set has an invariant distribution is not easier than the proof of the existence of the Haar measure and is less known.
We have shown that the Haar probability measure maximizes the rate distortion function at any distortion level. The normal proofs of the existence of the Haar measure use a kind of covering argument that is very close to the techniques found in rate distortion theory. There is a chance that one can get an information theoretic proof of the existence of the Haar measure. It seems obvious to use concavity arguments as one would do for Shannon entropy but, as proved by Ahlswede [24], the rate distortion function at a given distortion level is not a concave function of the underlying distribution, so some more refined technique is needed.
As noted in the introduction, for any algebraic structure A the group Aut(A) can be considered as symmetry group, and if it has a compact subgroup then the results of this paper apply to that subgroup. It would be interesting to extend the information theoretic approach to the algebraic object A itself, but in general there is no known equivalent to the Haar measure for other algebraic structures. Algebraic structures are used extensively in channel coding theory and cryptography, so although the theory may become more involved, extensions of the results presented in this paper are definitely worthwhile.

Example 3. As distortion function on SO(2) we use the squared Euclidean distance between the corresponding points on the unit circle, i.e. d(x, y) = 4 sin²((x − y)/2) = 2 − 2 cos(x − y).
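The two expressions for the distortion on SO(2), and their interpretation as a squared chord length, can be checked numerically. This is a minimal sketch; the random angles and the seed are arbitrary choices made for the check.

```python
import math
import random

def d(x, y):
    """Distortion on SO(2) written as 4 sin^2((x - y) / 2)."""
    return 4.0 * math.sin((x - y) / 2.0) ** 2

def d_cos(x, y):
    """The same distortion written as 2 - 2 cos(x - y)."""
    return 2.0 - 2.0 * math.cos(x - y)

def chord_squared(x, y):
    """Squared Euclidean distance between the points on the unit circle."""
    return (math.cos(x) - math.cos(y)) ** 2 + (math.sin(x) - math.sin(y)) ** 2

random.seed(0)
samples = [(random.uniform(0, 2 * math.pi),
            random.uniform(0, 2 * math.pi),
            random.uniform(0, 2 * math.pi)) for _ in range(1000)]

max_gap = max(max(abs(d(x, y) - d_cos(x, y)),
                  abs(d(x, y) - chord_squared(x, y)),
                  abs(d(x + z, y + z) - d(x, y)))  # invariance under rotation by z
              for x, y, z in samples)
print(max_gap)           # rounding error only
print(d(0.0, math.pi))   # maximal distortion, attained at opposite points
```

The last line shows the maximal distortion 4, which matches the value d_max = 4 stated in the caption of Figure 2.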

Theorem 4. If C is a compact set and d : C × C → R is a non-negative continuous distortion function such that d(x, y) = 0 if and only if x = y, then the topology on C is generated by the distortion balls {x ∈ C | d(x, y) < r} where y ∈ C and r > 0.

Proof. We have to prove that a subset B ⊆ C is open if and only if for any y ∈ B there exists a ball that is a subset of B and contains y. Assume that B ⊂ C is open and that y ∈ B. Then the complement C \ B is compact. Hence, the function x → d(x, y) has a minimum r on C \ B, and r must be positive because r = d(x, y) = 0 would imply that x = y ∈ C \ B. Therefore {x ∈ C | d(x, y) < r} ⊆ B. Continuity of d implies that the balls {x ∈ C | d(x, y) < r} are open. If any point in B is contained in an open ball that is a subset of B, then B is a union of open sets and hence open.

Example 8. The uniform distribution on SO(2) or R/2πZ has density 1/(2π) with respect to the Lebesgue measure on [0; 2π[. The function

Figure 2. The rate distortion region of the uniform distribution on SO(2) is shaded. The rate distortion function is the lower bounding curve. In the figure the rate is measured in nats. The critical distortion d_crit equals 2, and the dashed line indicates d_max = 4.

Lemma 22. Let P be a probability distribution on the compact group G with Haar probability measure U. If dP/dU ≥ c > 0 and D(P) is finite, then D(P^{*n}) ≤ (1 − c)^{n−1} D(P).
Proof. First we write P = (1 − c) · S + c · U, where S denotes the probability measure S = (P − cU)/(1 − c).
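The bound D(P^{*n}) ≤ (1 − c)^{n−1} D(P) can be checked on a finite example. The sketch below uses the cyclic group Z_5 as a stand-in for a compact group and a hypothetical distribution p; on a group with m elements the density dP/dU at a point i equals m · p_i, so the largest admissible constant is c = m · min(p). Divergences are in nats.

```python
import math

def convolve(p, q):
    """Convolution of two distributions on the cyclic group Z_m."""
    m = len(p)
    r = [0.0] * m
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            r[(i + j) % m] += pi * qj
    return r

def divergence_from_uniform(p):
    """D(p || U) in nats."""
    m = len(p)
    return sum(x * math.log(x * m) for x in p if x > 0)

p = [0.4, 0.3, 0.1, 0.1, 0.1]  # hypothetical distribution on Z_5
m = len(p)
c = m * min(p)                 # lower bound on the density dP/dU
d_first = divergence_from_uniform(p)

rows, q = [], list(p)
for n in range(1, 11):
    rows.append((n, divergence_from_uniform(q), (1 - c) ** (n - 1) * d_first))
    q = convolve(q, p)

for n, div, bound in rows:
    print(f"n={n:2d}  D(P^*n)={div:.3e}  bound={bound:.3e}")
```

In this example the actual divergence decays faster than the bound, which is what one expects from an estimate that only uses the lower bound c on the density.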
1/2 implying that d(P^{*n} * P_β, U) → 0 for n → ∞. Now
|d(P^{*n}, U) − d(P^{*n} * P_β, U)| ≤ f_2(d(P^{*n} * P_β, P^{*n})) ≤ f_2(d(P_β, e)).
lim sup_{n→∞} d(P^{*n}, U) ≤ f_2(d(P_β, e)) for all β, which implies that
Corollary 21. Let P be a probability measure on the compact group G with Haar probability measure U. Assume that the support of P is not contained in any coset of a proper subgroup of G and D(P ‖ U) is finite. Then R_{P^{*n}} converges to R_U pointwise on the interval ]0; d_max[ for n → ∞.
Proof. Corollary 20 together with Theorem 13 implies uniform convergence of the distortion rate function for distortion less than d_crit. This implies pointwise convergence of the rate distortion function on ]0; d_crit[ because rate distortion functions are convex functions. The same argument works in the interval ]d_crit; d_max[. Pointwise convergence in d_crit must also hold because of continuity.