Symmetry Studies and Decompositions of Entropy

This paper describes a group-theoretic method for decomposing the entropy of a finite ensemble when symmetry considerations are of interest. The cases in which the elements in the ensemble are indexed by {1,2,...,n} and by the permutations of a finite set are considered in detail and interpreted as particular cases of ensembles with elements indexed by a set subject to the actions of a finite group. Decompositions for the entropy in binary ensembles and in ensembles indexed by short DNA words are discussed. Graphical descriptions of the decompositions of the entropy in geological samples are illustrated. The decompositions derived in the present cases follow from a systematic data analytic tool to study entropy data in the presence of symmetry considerations.


Introduction
The entropy H = −Σ_j p_j log p_j of a finite set of n mutually exclusive events with corresponding probabilities p_1, ..., p_n measures the amount of uncertainty associated with those events [1, p.3]. Its value is zero when any of the events is certain, it is positive otherwise, and it attains its maximum value (log n) when the events are equally likely, that is, p_1 = ... = p_n = 1/n. Alternatively [2, p.7], H is the mean value of the quantities −log p_j and can be interpreted as the mean information in an observation obtained to ascertain the mutually exclusive and exhaustive events (or the hypotheses defined by those events).
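The definition can be sketched numerically. The following fragment (an illustrative sketch, not part of the original paper) computes H with the usual convention 0 log 0 = 0 and checks the two boundary cases just described: a certain event yields zero entropy, and the uniform distribution attains log n.

```python
import math

def entropy(p):
    """Shannon entropy H = -sum_j p_j log p_j (natural log), with 0*log 0 = 0."""
    return -sum(pj * math.log(pj) for pj in p if pj > 0)

# A certain event has zero entropy.
assert entropy([1.0, 0.0]) == 0.0
# The uniform distribution on n = 4 events attains the maximum, log 4.
assert abs(entropy([0.25] * 4) - math.log(4)) < 1e-12
```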
In general, the probabilities of an ensemble with n elements are indexed by the set V = {1, ..., n} and commonly indicated by p_1, p_2, ..., p_n. However, there are applications in which the elements in the ensemble exhibit intrinsic symmetries. Consider, for example, an election in which voters select their ordered preferences among the candidates in the list {a, b, c} by choosing one of the 6 permutations in the set V = {abc, cab, bca, bac, cba, acb}.
The resulting distribution p = (p_abc, p_cab, ..., p_acb) of relative frequencies of those choices is then an example of empirical probabilities indexed by permutations. Accordingly, the corresponding entropy is written as H = −Σ_{s∈V} p_s log p_s.
In another example, the experimental results might be the empirical probabilities p = (p_AA, p_AG, ..., p_TT) with which the 16 binary (two-letter) sequences V = {AA, AG, ..., TT} in the DNA alphabet {A, G, C, T} appear in a given reference sequence of interest. These probabilities, similarly, are indexed by a set of labels structurally more interesting than {1, 2, ..., n}.
In fact, the 16 binary words can be classified, for example, by permutations shuffling the letters in the DNA alphabet or, distinctly, by permutations shuffling the positions of those letters. The objective of this paper is to demonstrate that there is a systematic relationship among the labels, the (frequency) data indexed by those labels, and the symmetries consistent with them, and to show that the methodology describing this relationship leads to the statistical analysis of the observed entropy of those frequency distributions.
The basic elements of symmetry studies are introduced in the next section and applied to the derivation of the standard (Section 3) and regular (Section 4) decompositions of entropy. In Section 5 a symmetry study of the entropy in Sloan Charts, used in the quantification of visual acuity, is presented. Additional comments are included in Section 6. The algebraic aspects present in this paper follow from the elements of the theory of linear representations of finite groups and can be found, for example, in [3, pp.1-24].

Symmetry Studies
The voting and the molecular biology data mentioned in the Introduction are examples of structured data in symmetry studies [4,5]. These studies are centered on the notion of data (x) indexed by a finite set V of indices or labels (s) upon which certain symmetry transformations can be defined. Briefly, symmetry studies explore the symmetry transformations identified by the set of labels to facilitate the classification, interpretation and statistical analysis of the data {x(s), s ∈ V} indexed by these labels. A finite group (G) with g elements acts on V and determines a linear representation (ρ) of G that operates in the data vector space (V). The resulting factorization of V into a direct sum of invariant subspaces follows from the construction of algebraically orthogonal projections of the form P = (n/g) Σ_{τ∈G} χ(τ^{-1}) ρ(τ), one for each irreducible representation (of dimension n and character χ) of G. The canonical projections are the key elements leading to the explicit calculation and interpretation of the canonical invariants P_1 x, P_2 x, ... in the data. If G has h irreducible representations, then there are h projections, and the identity operator I in V reduces according to the sum I = P_1 + P_2 + ... + P_h of algebraically orthogonal (P_i P_j = P_j P_i = 0 for i ≠ j) projection (P_i² = P_i, i = 1, ..., h) matrices. A formal connection with the data analytic component of any symmetry study follows from the observation that basic decompositions of the form (x|y) = (x|P_1 y) + (x|P_2 y) + ... + (x|P_h y), for any inner product (•|•) defined in V, can then always be studied within the context of statistical inference for quadratic forms [6,7]. As a consequence, symmetry-related hypotheses defined by the canonical invariants can be identified and assessed [8-10].
The present paper concentrates on two basic types of canonical decompositions. In the standard case the components of the probability distributions p = (p_1, ..., p_n) are indexed by the set V = {1, ..., n}, and the symmetry transformations are defined by the group S_n of all permutations of those indices, which acts on V according to τ: (1, 2, ..., n) → (τ1, τ2, ..., τn), τ ∈ S_n.
In the regular case the components (p_σ) of the probability distributions are indexed by the elements σ of a finite group (G, ∗) acting on itself by translation, τ: σ → τ ∗ σ. In closing this section it is opportune to remark that applications of group-theoretic principles in statistics and probability have a long history and tradition of their own, from the Legendre-Gauss least squares principle through R.A. Fisher's method of variance decomposition. According to E.J. Hannan [11], the work [12] of A. James appears to be among the first describing the group-theoretic nature of Fisher's argument, giving meaning to the notion of the relationship algebra of an experimental design. In that same early period, U. Grenander [13] showed the effectiveness of harmonic analysis techniques in extending classical limit theorems to algebraic structures such as locally compact groups, Banach spaces and topological algebras. Two decades later, the relevance of group invariance and group representation arguments in statistical inference would become evident [14-19]. The integral of Haar, object of L. Nachbin's 1965 monograph [20], became a familiar tool among statisticians. The work of S. Andersson on invariant normal models [21] is now recognized as a landmark concept extending and setting definitive boundaries to multivariate statistical analysis [22] in the tradition of T.W. Anderson. The collection of contemporary work in [23] clearly documents the present-day interest in those methods.

The Standard Decomposition of Entropy
Consider first a probability distribution p indexed by the set V = {1, 2}, that is, p = (p_1, p_2). The group of permutations is S_2 = {1, t}, where 1 indicates the identity and t indicates the transposition (12). S_2 acts on V by permutation of the indices of its elements, and the resulting linear representation assigns to each τ ∈ S_2 the matrix ρ(τ) defined by the change of basis {e_1, e_2} → {e_τ1, e_τ2} in the data vector space R², so that ρ(1) = I and ρ(t) interchanges the two coordinates. The canonical projections associated with ρ are defined by A = [ρ(1) + ρ(t)]/2 and Q = [ρ(1) − ρ(t)]/2, where the coefficients {1, 1} and {1, −1} determining A and Q are the two irreducible characters of S_2. Note that A + Q = I. Let ℓ = (log p_1, log p_2), so that H = −ℓ'p is the entropy of the probability distribution p. It then follows from I = A + Q that H decomposes as H = H_1 + H_2, with H_1 = −ℓ'Ap and H_2 = −ℓ'Qp. The first component can be expressed in terms of the log geometric mean of the components of p, H_1 = −log √(p_1 p_2), and the second as H_2 = −(p_1 − p_2)(log p_1 − log p_2)/2. Because −H_2 can be expressed as Σ_j (p_j − 1/2) log p_j, it follows that −H_2 is precisely Kullback's [2, p.110] divergence between p and the uniform distribution e/2 = (1, 1)/2, thus justifying the interpretation of entropy as a measure of nonuniformity.
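This two-component decomposition can be checked numerically. In the following sketch (a NumPy illustration; the distribution p = (0.8, 0.2) is an arbitrary choice, not data from the paper) the projections A and Q are built from the representation matrices and the two irreducible characters of S_2, and the identities H = H_1 + H_2, H_1 = −log √(p_1 p_2), and −H_2 = Σ_j (p_j − 1/2) log p_j are verified.

```python
import numpy as np

# Representation of S2 on R^2: the identity and the transposition (12).
rho_1 = np.eye(2)
rho_t = np.array([[0.0, 1.0], [1.0, 0.0]])

# Canonical projections from the irreducible characters {1, 1} and {1, -1}.
A = (rho_1 + rho_t) / 2
Q = (rho_1 - rho_t) / 2

p = np.array([0.8, 0.2])          # an arbitrary two-component distribution
ell = np.log(p)                   # ell = (log p1, log p2)

H = -ell @ p
H1, H2 = -ell @ A @ p, -ell @ Q @ p
assert np.isclose(H, H1 + H2)
# H1 is minus the log geometric mean of the components of p.
assert np.isclose(H1, -np.log(np.sqrt(p[0] * p[1])))
# -H2 is Kullback's divergence between p and the uniform distribution e/2.
assert np.isclose(-H2, np.sum((p - 0.5) * np.log(p)))
```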
The standard decomposition of the entropy obtained for two-component distributions extends to n-component distributions in the expected way. Writing A = ee'/n and Q = I − ee'/n, where ee' is the n × n matrix of ones, it then follows that the canonical decomposition of the data vector space is V = V_a + V_q, where V_a is the one-dimensional subspace spanned by e and V_q is its orthogonal complement. The irreducibility of V_q, proved using character theory [3, p.17; 5, p.70], shows that the reduction V = V_a + V_q is exactly the canonical reduction determined by ρ. This decomposition is referred to as the standard decomposition or reduction. Applying the standard reduction I = A + Q to the entropy H = −ℓ'p of a distribution p = (p_1, ..., p_n), where ℓ = (log p_1, ..., log p_n), it follows, in summary, that H = H_1 + H_2 with H_1 = −ℓ'Ap and H_2 = −ℓ'Qp. Similarly to the case n = 2 described earlier, (−1×) the entropy of an n-component distribution decomposes as the sum of the log geometric mean and Kullback's divergence between p and the uniform distribution e/n. This can be easily verified by direct evaluation of the RHS of the above equality.
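The n-component case can be sketched the same way. In the fragment below (an illustration with a randomly generated 5-component distribution, not data from the paper), A = ee'/n and Q = I − A are verified to be orthogonal idempotents summing to the identity, the decomposition H = H_1 + H_2 is checked, and tr Q = n − 1 gives the dimension of the divergence subspace.

```python
import numpy as np

n = 5
rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(n))        # a random 5-component distribution
ell = np.log(p)

e = np.ones(n)
A = np.outer(e, e) / n               # A = ee'/n
Q = np.eye(n) - A                    # Q = I - ee'/n

# Projection identities: A + Q = I, AQ = 0, A^2 = A, Q^2 = Q.
assert np.allclose(A + Q, np.eye(n))
assert np.allclose(A @ Q, 0) and np.allclose(A @ A, A) and np.allclose(Q @ Q, Q)

H = -ell @ p
H1, H2 = -ell @ A @ p, -ell @ Q @ p
assert np.isclose(H, H1 + H2)
# tr Q = n - 1 is the dimension of the invariant subspace carrying H2.
assert np.isclose(np.trace(Q), n - 1)
```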

Graphical displays of {H_1, H_2}
It is observed that both H_1 and H_2 remain invariant under S_n, in the sense that, for P = A, Q, −(ρ(τ)ℓ)'P(ρ(τ)p) = −ℓ'ρ(τ)'Pρ(τ)p = −ℓ'Pp for all τ ∈ S_n (3.1). The last equality in (3.1) is a consequence of the fact that P and ρ(τ) commute for all τ ∈ S_n. Therefore, H_1 and H_2 define a set of one-dimensional invariants that can be jointly displayed and interpreted, along with any additional covariates. Graphical displays such as these are generically called invariant plots. Table (3.3) shows the observed frequencies with which the words in the permutation orbit of the DNA word ACT appear in 9 subsequent regions of the BRU isolate K02013 of the Human Immunodeficiency Virus Type I. To locate the sequence in the National Center for Biotechnology Information (http://www.ncbi.nlm.nih.gov) data base, use the accession number K02013. Each region is 900 base pairs in length. Table (3.3) also shows the entropy (H) of each distribution and its standard decomposition: the log geometric mean (−H_1) and the divergence (−H_2) relative to the uniform distribution e/6. In the present example S_6 acts on V by permutation of the six DNA words. In the next section the same frequency data will be studied under the permutation action of S_3 on the letters of the DNA words.
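The invariance claim in (3.1) can be verified by brute force: relabeling the components of a distribution leaves H_1 and H_2 unchanged. The sketch below checks this over all 720 relabelings of a hypothetical 6-component distribution (the numbers are illustrative, not the paper's frequency data).

```python
import numpy as np
from itertools import permutations

p = np.array([0.05, 0.10, 0.15, 0.20, 0.22, 0.28])  # hypothetical 6-component distribution
n = len(p)
A = np.ones((n, n)) / n
Q = np.eye(n) - A

def components(q):
    """Standard invariant components (H1, H2) of the entropy of q."""
    ell = np.log(q)
    return -ell @ A @ q, -ell @ Q @ q

H1, H2 = components(p)
# H1 and H2 are unchanged under any permutation of the component labels.
for perm in permutations(range(n)):
    G1, G2 = components(p[list(perm)])
    assert np.isclose(G1, H1) and np.isclose(G2, H2)
```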
The invariant plot of Figure 3.1 shows the (color coded) entropy of the ACT orbit and its invariant components (H_1, H_2) in each one of the nine regions. Note, for example, that the entropies in regions 3 and 6 are equal and yet their locations in the invariant plot differ, showing a slightly increased divergence from the uniform distribution e/6 in region 3. Although the entropy in regions 6 and 8 is essentially the same, there is a three-fold ratio between their divergence components; this is noticeable by inspecting the (lack of) uniformity in the corresponding frequency distributions.
Differentiations of that nature are possible because the dimension of the invariant subspace associated with H_2 is tr Q = n − 1, where n is the number of components in the distribution under consideration. The regular decomposition described in the next section will further illustrate these differentiations and their relation to the symmetries imposed on the set of labels for the components of the distributions.

The standard decomposition of the entropy of the Sloan fonts
Table (3.4) shows the ten Sloan fonts [24, Table 5] used for the assessment of visual acuity in the Early Treatment Diabetic Retinopathy Study, along with their estimated difficulty (probability of incorrectly identifying the letter), the corresponding entropy and their standard invariant components H_1 and H_2. Figure 3.2 shows the standard decomposition for each letter. The divergence from the uniform distribution e/2 ranges from 0.58 when the distribution is (0.844, 0.156) to 0.001 for (0.516, 0.484). The study of the entropy of the Sloan fonts will continue in Section 5.

Regular Decompositions of Entropy
In this section the decomposition of the entropy for probability ensembles with components indexed by S_3, the group of permutations of three objects, is considered. Decompositions for ensembles indexed by other finite groups can be obtained similarly. To illustrate, consider the DNA word ACT and its permutation orbit V = {ACT, CTA, TAC, CAT, TCA, ATC} introduced earlier in Section 3.1. Note that V can be generated from the word ACT by applying, respectively, the permutations defining the group S_3, so that V is isomorphic to S_3. Indicate by p = (r_1, r_2, r_3, t_1, t_2, t_3) the respective relative frequencies with which the words in V, or equivalently, the elements in S_3, appear in a given DNA reference region; the r components correspond to the cyclic permutations and the t components to the transpositions. These relative frequencies are, consequently, data indexed by S_3. Let ℓ indicate the vector of the log components of p, so that H = −ℓ'p is the entropy of the probability distribution p indexed by S_3.

The regular decomposition
The permutations in S_3 act on the DNA words by shuffling the letters. This gives a linear representation ρ of S_3, in which each matrix ρ(τ) is defined in the vector space for the data indexed by S_3. Associated with ρ there are three canonical projections P_1, P_2, P_3, one for each irreducible character of S_3; the characters are constant on the conjugacy classes {1}, t = {(12), (13), (23)} and r = {(123), (132)}, and the projections are given by P_i = (n_i/6) Σ_{τ∈S_3} χ_i(τ^{-1}) ρ(τ). Observe that P_i P_j = P_j P_i = 0 for i ≠ j, P_i² = P_i for i = 1, 2, 3, and I = P_1 + P_2 + P_3. The underlying invariant image subspaces {P_i x, x ∈ R⁶} are of dimension 1, 4 and 1, respectively. It then follows that H = −ℓ'p = −ℓ'P_1 p − ℓ'P_2 p − ℓ'P_3 p, with the regular components of the entropy given by H_i = −ℓ'P_i p, i = 1, 2, 3. The interpretation of these components can be expressed in terms of the three axial reflections (transpositions) and the three-fold rotations (cyclic permutations) of the regular triangle: −H_1 is the log geometric mean of the components of p, and direct evaluation expresses the remaining components in terms of r_•, the marginal probability r_1 + r_2 + r_3 of a rotation; t_•, the marginal probability t_1 + t_2 + t_3 of a reflection; and D(r̄ : u), Kullback's divergence between the rotation subcomposition [25, p.33] r̄ = (r_1, r_2, r_3)/(r_1 + r_2 + r_3) and the uniform distribution u (similarly for the reflection subcomposition). It is then possible to summarize the regular decomposition of the entropy of a distribution p indexed by S_3 as a three-component sum: the log geometric mean of p, the total uniform divergence within rotations and within reflections, and a component measuring the separation between rotations and reflections. The within-region and between-word contrast in the ratios of the entropy regular components for the three DNA words considered in this example is striking. In particular, the ratio H_2/H_3 clearly differentiates the ACG orbit from ACT and AGT. The between-region range of the ratios H_2/H_3 is 0.295 − 1.45 in the ACG orbits, in contrast to 0.222 − 108 in the AGT orbits and 0.114 − 72.3 in the ACT orbits.
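The three canonical projections of the regular representation of S_3 can be constructed explicitly from the character formula P_i = (n_i/6) Σ_τ χ_i(τ^{-1}) ρ(τ). The sketch below (an illustration; the encoding of S_3 as tuples is a choice of this sketch, not the paper's) builds the regular representation by left multiplication and verifies that the projections sum to the identity with invariant subspaces of dimension 1, 4 and 1. Since every character of S_3 is real, χ(τ^{-1}) = χ(τ) below.

```python
import numpy as np
from itertools import permutations

elems = list(permutations(range(3)))               # S3, identified with the ACT orbit
idx = {g: i for i, g in enumerate(elems)}
compose = lambda a, b: tuple(a[b[i]] for i in range(3))

def rho(t):
    """Regular representation: permutation matrix of left multiplication by t."""
    M = np.zeros((6, 6))
    for g in elems:
        M[idx[compose(t, g)], idx[g]] = 1.0
    return M

def parity(t):
    inv = sum(1 for i in range(3) for j in range(i + 1, 3) if t[i] > t[j])
    return -1 if inv % 2 else 1

# Irreducible characters: trivial, two-dimensional (standard), and sign.
chi = {
    1: (lambda t: 1, 1),
    2: (lambda t: 2 if t == (0, 1, 2) else (0 if parity(t) < 0 else -1), 2),
    3: (lambda t: parity(t), 1),
}
P = {i: (n_i / 6) * sum(c(t) * rho(t) for t in elems) for i, (c, n_i) in chi.items()}

assert np.allclose(P[1] + P[2] + P[3], np.eye(6))
# Dimensions of the invariant subspaces: tr P_i = 1, 4, 1.
assert [round(np.trace(P[i])) for i in (1, 2, 3)] == [1, 4, 1]
```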

A Symmetry Study of Sloan Charts
This study will combine the notions developed in the previous sections by introducing the Sloan Charts and considering each line in the chart as the sampling unit (a total of 42 lines); identifying the group G of planar symmetries of the Sloan fonts; determining the invariants associated with the regular representation of G; and studying the line entropy in that space. Cyclic permutation orbits have been used [26] to describe the evolutionary strategy of the HIV-1 virus.

The Sloan Charts and the Sloan Fonts symmetries
Tables (5.2) show a Sloan Chart [24, Table 5] developed for use in the Early Treatment Diabetic Retinopathy Study and the 10 individual Sloan letters appearing in Example 3.2, their symmetry transformations, the single-letter difficulty (the estimated [24] probability of incorrectly identifying the letter), and the corresponding entropy. The group of interest here is the planar point group G = {1, o, v, h}, in which 1 is the identity, o the point symmetry or inversion, v the vertical axis reflection and h the horizontal axis reflection. The font is considered centered at the center of inversion, with its natural vertical and horizontal directions along the corresponding axes of reflection. Observe that the symmetries indicated in the RHS table in (5.2) are the subgroups of G leaving the letter fixed (the letter stabilizers). G is an Abelian group isomorphic to C_2 × C_2 and its multiplication table is given by

  ∗ | 1  o  v  h
  --+-----------
  1 | 1  o  v  h
  o | o  1  h  v
  v | v  h  1  o
  h | h  v  o  1     (5.1)

The objective here is to study the relationship between line entropy and letter symmetry based on 42 ETDRS lines, obtained from three charts similar to the one shown in the LHS table in (5.2).

The regular decomposition
The canonical projections for the regular representation (G acting on itself) are given by P_β = (1/4) Σ_{τ∈G} β(τ^{-1}) ρ(τ), one for each of the four irreducible one-dimensional representations β of G. Indicating, for simplicity of notation, by x = (x(1), x(o), x(v), x(h)) a generic vector of data indexed by G, direct evaluation shows that the components of the projections P_β x are determined by x̂(β), the Fourier transforms of x evaluated at the irreducible one-dimensional representations of G.
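These projections can be computed directly. In the sketch below (an illustration; the encoding of {1, o, v, h} as pairs in Z_2 × Z_2 and the data vector x are choices of this sketch, not the paper's), the four characters β_a(τ) = (−1)^{a·τ} yield four rank-one projections summing to the identity, and each P_a x is seen to be the character times the Fourier transform x̂(a)/4, as stated in the text.

```python
import numpy as np

# G = {1, o, v, h}, abelian and isomorphic to C2 x C2, encoded as Z2 x Z2
# with componentwise addition mod 2 (one possible labeling).
elems = [(0, 0), (0, 1), (1, 0), (1, 1)]          # 1, o, v, h
idx = {g: i for i, g in enumerate(elems)}

def rho(t):
    """Regular representation: permutation matrix of translation by t."""
    M = np.zeros((4, 4))
    for g in elems:
        s = ((t[0] + g[0]) % 2, (t[1] + g[1]) % 2)
        M[idx[s], idx[g]] = 1.0
    return M

def chi(a, t):
    """The four irreducible (one-dimensional) characters (-1)^(a . t)."""
    return (-1) ** (a[0] * t[0] + a[1] * t[1])

P = {a: sum(chi(a, t) * rho(t) for t in elems) / 4 for a in elems}
assert np.allclose(sum(P.values()), np.eye(4))

x = np.array([1.0, 2.0, 3.0, 4.0])                # hypothetical data indexed by G
xhat = {a: sum(chi(a, t) * x[idx[t]] for t in elems) for a in elems}
# Entry-wise, (P_a x)(tau) = chi(a, tau) * xhat(a) / 4 (here tau = tau^{-1}).
for a in elems:
    assert np.allclose(P[a] @ x, np.array([chi(a, t) for t in elems]) * xhat[a] / 4)
```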
Note, as shown in Appendix B, that the entry x(τ) of the regular β-invariant P_β x is precisely x_β(τ) = n_β tr[β(τ^{-1}) x̂(β)]/g. Consequently, the data assignment x_β identifies the properties of the regular canonical projection P_β with the properties of the Fourier transform x̂(β) evaluated at β. In that sense, the indexing x_β should retain the interpretations associated with the invariant subspace of P_β, and, jointly, the data vectors {x_β, β ∈ Ĝ} should fully describe the regular symmetry invariants. Here Ĝ indicates the set of all irreducible representations of G. Any indexing x of G decomposes (via the Fourier inversion formula) as the linear superposition Σ_{β∈Ĝ} x_β(τ) of regular invariants x_β of G; in short, x = Σ_{β∈Ĝ} x_β.

Sorting the line entropy by font symmetry type
Indicate by fix(τ) = {s ∈ V; φ(τ, s) = s} ⊆ V the set of elements in V that remain fixed by the symmetry transformation τ applied to s ∈ V according to the rule φ. Then there is a general method of indexing the data by G, called the regular indexing, constructed as follows: to each element τ ∈ G associate the evaluation x(τ) of a scalar summary of x defined over the set fix(τ). That is, x(τ) indicates a summary of the data over those elements (if any) in V that share the symmetry of τ; the summary of interest may be, for example, the average of x over fix(τ). As an example, the regular indexing is applied to the font entropy data x(s) of each font s in each line (V) of the Sloan Chart shown in the LHS table in (5.2). Here fix(τ) is the set of fonts with the symmetry of τ ∈ G, and x(τ) is the average font entropy over the set of fonts with the symmetry of τ, or the mean line entropy conditional on fonts with the symmetry of τ. Table (5.3) shows the line entropy indexed, or sorted, by the symmetries of G. Note that line 2 was excluded because there were no fonts with vertical symmetry in that line. One can then proceed with the study of the regular decomposition for x as data indexed by the group G of planar symmetries. The results suggest a (statistically) significant drop in mean line entropy conditional on fonts with point symmetry. This result should be explored further taking into account, for example, the fact that the mean numbers of letters per line (± standard deviation) with point, horizontal and vertical symmetry are, respectively, 2.5 ± 0.804, 2.0 ± 0.698, and 1.5 ± 0.741. In addition, Figure 5.2 shows the distribution of the Fourier transforms x̂_2, x̂_3 and x̂_4 (second, third and fourth invariants) for the entropy data of Table (5.3) indexed by the font symmetries. Consistent with the previous interpretation, the distribution of the invariant x̂_o is markedly shifted from the distributions of x̂_v and x̂_h.

Summary and Comments
This paper introduced the standard decomposition H = H_1 + H_2 of the entropy of any finite distribution and the method for obtaining the regular decomposition H = H_1 + ... + H_h for the entropy of distributions indexed by arbitrary finite groups.
1. The standard decomposition appears in analogy with the decomposition of the sum of squares or, more specifically, with the decomposition of an inner product (y|x) of two vectors x and y in the same finite-dimensional vector space. In fact, the standard decomposition I = A + Q has a well-known role in statistics: it leads to the usual decomposition y'y = y'Ay + y'Qy = nȳ² + Σ_j (y_j − ȳ)² of a total sum of squares into its mean and residual components.
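The analogy can be verified directly: with A = ee'/n and Q = I − A as before, the quadratic forms y'Ay and y'Qy recover the familiar mean and residual sums of squares. A minimal sketch (the data vector y is an arbitrary illustration):

```python
import numpy as np

y = np.array([2.0, 4.0, 4.0, 6.0, 9.0])   # arbitrary illustrative data
n = len(y)
A = np.ones((n, n)) / n                    # A = ee'/n
Q = np.eye(n) - A                          # Q = I - A

# y'y = y'Ay + y'Qy: total SS = n*mean^2 + sum of squared deviations.
lhs = y @ y
rhs = y @ A @ y + y @ Q @ y
assert np.isclose(lhs, rhs)
assert np.isclose(y @ A @ y, n * y.mean() ** 2)
assert np.isclose(y @ Q @ y, np.sum((y - y.mean()) ** 2))
```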

Appendix B
With the notation introduced in Section 5.2, direct evaluation shows that the entry x(τ) of the regular β-invariant P_β x is precisely x_β(τ) = n_β tr[β(τ^{-1}) x̂(β)]/g.

Figure 3.1: Color coded entropy levels (H) and their invariant components {H_1, H_2} in 9 subsequent regions of the HIV virus Type I (BRU isolate).

3.3 Geological compositions

Tables A.1 in Appendix A appear in [25, pp.354, 358] and show the geological compositions of albite, blandite, cornite, daubite and endite in 25 samples of coxite and hongite, in addition to the porosity (percentage of void space) of each sample of coxite. In this example, probability distributions are referred to as compositions, following [25, p.26]. Figures 3.3 and 3.4 show, respectively, the color coded entropy and porosity levels for the coxite samples, and their invariant components {H_1, H_2}. The entropy among all compositions is within the range of 75 − 87 percent of the maximum entropy (log 5 = 1.6), thus being concentrated in a relatively narrow region. The joint distribution of porosity and H_2 shown in Figure 3.5 suggests that porosity is negatively correlated with H_2 or, equivalently, positively correlated with the divergence. In fact, the observed sample correlation coefficient based on the 25 samples of these two variables is 0.78.

Figure 3.6 shows the color coded entropy in the hongite samples and their invariant components {H_1, H_2}. Figure 3.7 clearly shows the contrasting difference in entropy range between the two minerals, also evident in the range of the divergence in those compositions. This noticeable feature of the compositions [25, p.4], namely that coxite compositions are much less variable than hongite compositions, is clearly captured in these invariant plots.

Figure 3.3: Color coded entropy levels (H) and their invariant components {H_1, H_2} in the geological compositions of 25 samples of coxite.

Figure 3.4: Color coded porosity levels (P) and their invariant entropy components {H_1, H_2} in the geological compositions of 25 samples of coxite.

Figure 3.6: Color coded entropy levels (H) and their invariant components {H_1, H_2} in the geological compositions of 25 samples of hongite.

Figure 3.7: Entropy levels in the standard invariant plot for the coxite and hongite data.

Figure 5.2: Distribution of the entropy invariants EINVO ≡ x̂_o, EINVV ≡ x̂_v and EINVH ≡ x̂_h for the entropy data indexed by font symmetries.

Table (4.1) shows the regular decomposition H = H_1 + H_2 + H_3 for the permutation orbit of ACT in each one of 9 adjacent regions of the isolate. Table (4.2) shows the ratios H_1/H and H_2/H_3 for potential comparisons among these regions. Similar ratios were calculated for the words AGT and ACG (Tables (4.3) and (4.4), respectively). Regions 1 and 4 were removed from the calculations for ACG due to the presence of zeros in the frequency distributions.