## 1 Introduction

The entropy $H = -\sum_{j} p_{j} \log p_{j}$ of a finite set of $n$ mutually exclusive events with corresponding probabilities $p_1, \ldots, p_n$ measures the amount of uncertainty associated with those events [1, p.3]. Its value is zero when any of the events is certain, it is positive otherwise, and it attains its maximum value ($\log n$) when the events are equally likely, that is, $p_1 = \cdots = p_n = 1/n$. Alternatively [2, p.7], $H$ is the mean value of the quantities $-\log p_j$ and can be interpreted as the mean information in an observation obtained to ascertain the mutually exclusive and exhaustive (hypotheses defined by those) events.
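The three properties stated above (zero for a certain event, positive otherwise, maximal at the uniform distribution) can be checked with a minimal sketch. The function name `entropy` and the example distributions are illustrative assumptions, and natural logarithms are assumed throughout.

```python
import math

def entropy(p):
    """Entropy H = -sum_j p_j log p_j (natural log), taking 0 log 0 as 0."""
    return -sum(pj * math.log(pj) for pj in p if pj > 0)

n = 4
certain = [1.0, 0.0, 0.0, 0.0]   # one event is certain
uniform = [1.0 / n] * n          # equally likely events
skewed = [0.7, 0.1, 0.1, 0.1]    # hypothetical intermediate case

h_certain = entropy(certain)
h_uniform = entropy(uniform)
h_skewed = entropy(skewed)
```

Here `h_certain` vanishes, `h_uniform` agrees with `math.log(n)`, and `h_skewed` falls strictly between the two.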

$p_1, p_2, \ldots, p_n$. However, there are applications in which the elements in the ensemble exhibit intrinsic symmetries. Consider, for example, an election in which voters select their ordered preferences among the candidates in the list $\{a, b, c\}$ by choosing one of the 6 permutations in the set

$(p_{abc}, p_{cab}, \ldots, p_{acb})$

$(p_{AA}, p_{AG}, \ldots, p_{TT})$

## 2 Symmetry Studies

$\mathcal{P} = n \sum_{\tau \in G} \chi(\tau^{-1})\rho(\tau)/g$, one for each irreducible representation (in dimension of $n$ and character $\chi$) of $G$. The canonical projections are the key elements leading to the explicit calculation and interpretation of the canonical invariants $\mathcal{P}_1 x, \mathcal{P}_2 x, \ldots$ in the data. If $G$ has $h$ irreducible representations, then there are $h$ projections, and the identity operator $I$ in $\mathcal{V}$ reduces according to the sum $I = \mathcal{P}_1 + \mathcal{P}_2 + \cdots + \mathcal{P}_h$ of algebraically orthogonal ($\mathcal{P}_i\mathcal{P}_j = \mathcal{P}_j\mathcal{P}_i = 0$ for $i \neq j$) projection ($\mathcal{P}_i^2 = \mathcal{P}_i$, $i = 1, \ldots, h$) matrices. A formal connection with the data analytical component of any symmetry study follows from the observation that basic decompositions of the form

$(x|y) = (x|\mathcal{P}_1 y) + (x|\mathcal{P}_2 y) + \cdots + (x|\mathcal{P}_h y),$

$(p_1, \ldots, p_n)$ are indexed by the set $V = \{1, \ldots, n\}$, the symmetry transformations are defined by the group $S_n$ of all permutations of those indices, which then acts on $V$ according to

$(p_\sigma)$ of the probability distributions are indexed by the elements $\sigma$ of a finite group $(G, *)$ acting on itself according to

$(\sigma)_{\sigma \in G} \to (\tau * \sigma)_{\sigma \in G}, \quad \tau \in G.$

## 3 The Standard Decomposition of Entropy

$(p_1, p_2)$. The group of permutations is $S_2 = \{1, t\}$, where 1 indicates the identity and $t$ indicates the transposition (12). $S_2$ acts on $V$ by permutation of the indices of its elements and the resulting linear representation

$\{e_1, e_2\} \to \{e_{\tau 1}, e_{\tau 2}\}$ in the data vector space $\mathbb{R}^2$. The canonical projections associated with $\rho$ are defined by

_{2}. Note that

$\mathcal{A}^2 = \mathcal{A}$, $\mathcal{Q}^2 = \mathcal{Q}$,

$\ell' = (\log(p_1), \log(p_2))$, so that $H = -p'\ell$ is the entropy in the probability distribution $p$. It then follows from $I = \mathcal{A} + \mathcal{Q}$ that $H$ decomposes as

$H_1 = -p'\mathcal{A}\ell = -\log\sqrt{p_1 p_2}$

$H_2$ can be expressed as

$-H_2$ is precisely Kullback's [2, p.110] divergence between $p$ and the uniform distribution $e'/2 = (1, 1)/2$, thus justifying the interpretation of entropy as a measure of non-uniformity.
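As a worked check of the two-component case, the following derivation assumes the explicit matrices $\mathcal{A} = (\rho(1) + \rho(t))/2$ and $\mathcal{Q} = (\rho(1) - \rho(t))/2$, which is what the general projection formula gives for the trivial and sign characters of $S_2$ with $g = 2$. It is a sketch under that assumption and uses $p_1 + p_2 = 1$.

```latex
\begin{aligned}
\mathcal{A} &= \tfrac{1}{2}\begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}, \qquad
\mathcal{Q} = \tfrac{1}{2}\begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix}, \\
H_1 &= -p'\mathcal{A}\ell = -\tfrac{1}{2}(p_1 + p_2)(\log p_1 + \log p_2) = -\tfrac{1}{2}\log(p_1 p_2), \\
H_2 &= -p'\mathcal{Q}\ell = -\tfrac{1}{2}(p_1 - p_2)(\log p_1 - \log p_2), \\
-H_2 &= \sum_{j=1}^{2} \left(p_j - \tfrac{1}{2}\right) \log\frac{p_j}{1/2}.
\end{aligned}
```

The last line holds because the terms in $\log 2$ cancel, since $\sum_j (p_j - 1/2) = 0$; it has the form $\sum_j (p_j - u_j)\log(p_j/u_j)$ of a divergence between $p$ and the uniform distribution $u = (1, 1)/2$.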

$\mathcal{A}^2 = \mathcal{A}$, $\mathcal{Q}^2 = \mathcal{Q}$ and $\mathcal{A}\mathcal{Q} = \mathcal{Q}\mathcal{A} = 0$. Moreover, $\mathcal{A}$ projects $\mathcal{V} = \mathbb{R}^n$ into a subspace $\mathcal{V}_a$ of dimension $\operatorname{tr} \mathcal{A} = 1$ generated by $e = e_1 + \cdots + e_n = (1, 1, \ldots, 1)' \in \mathcal{V}$, whereas $\mathcal{Q}$ projects $\mathcal{V}$ into an irreducible subspace $\mathcal{V}_q$ in dimension $\operatorname{tr} \mathcal{Q} = n - 1$, the orthogonal complement of $\mathcal{V}_a$ in $\mathcal{V}$. The irreducibility of $\mathcal{V}_q$, proved using character theory [3, p.17; 5, p.70], shows that the reduction $\mathcal{V} = \mathcal{V}_a + \mathcal{V}_q$ is exactly the canonical reduction determined by $\rho$. This decomposition is referred to as the standard decomposition or reduction. Applying the standard reduction $I = \mathcal{A} + \mathcal{Q}$ to the entropy $H = -p'\ell$ of a distribution $p' = (p_1, \ldots, p_n)$, where $\ell' = (\log(p_1), \ldots, \log(p_n))$, it follows, in summary,

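**Proposition 3.1**

The standard decomposition of the entropy $H$ of a distribution $p' = (p_1, \ldots, p_n)$ is $H = H_1 + H_2$ with $H_1 = -p'\mathcal{A}\ell$ and $H_2 = -p'\mathcal{Q}\ell$.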

#### 3.1 Graphical displays of $\{H_1, H_2\}$

$H_1$ and $H_2$ remain invariant under $S_n$, in the sense that for $\mathcal{P} = \mathcal{A}, \mathcal{Q}$,

$(\rho(\tau)p)'\,\mathcal{P}\,(\rho(\tau)\ell) = p'\mathcal{P}\ell$ for all $\tau \in S_n$. Therefore, $H_1$ and $H_2$ define a set of one-dimensional invariants that can be jointly displayed and interpreted, along with any additional covariates. Graphical displays such as these are generically called invariant plots.
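The standard decomposition and its invariance under relabeling can be checked numerically. The sketch below assumes the explicit form $\mathcal{A} = ee'/n$ (the orthogonal projection onto the line generated by $e$) and $\mathcal{Q} = I - \mathcal{A}$. The function names and the example distribution are hypothetical, and natural logarithms are assumed.

```python
import math
from itertools import permutations

def standard_decomposition(p):
    """Return (H, H1, H2) with H1 = -p'A l and H2 = -p'Q l, A = ee'/n, Q = I - A."""
    n = len(p)
    logs = [math.log(pj) for pj in p]                  # assumes every p_j > 0
    total = sum(p)
    h = -sum(pj * lj for pj, lj in zip(p, logs))
    h1 = -total * sum(logs) / n                        # -p'(ee'/n) l
    h2 = -sum((pj - total / n) * lj for pj, lj in zip(p, logs))
    return h, h1, h2

def divergence_from_uniform(p):
    """Kullback's divergence sum_j (p_j - 1/n) log(p_j / (1/n))."""
    n = len(p)
    return sum((pj - 1.0 / n) * math.log(n * pj) for pj in p)

p = [0.5, 0.2, 0.2, 0.1]                               # hypothetical distribution
h, h1, h2 = standard_decomposition(p)

# Relabel the components in every possible way and compare the two invariants.
invariant = all(
    abs(standard_decomposition(list(q))[1] - h1) < 1e-12
    and abs(standard_decomposition(list(q))[2] - h2) < 1e-12
    for q in permutations(p)
)
```

In this sketch `h1 + h2` recovers `h`, `-h2` agrees with `divergence_from_uniform(p)`, and `invariant` is true for all 24 relabelings.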

$(H_1)$ and divergence $(-H_2)$ relative to the uniform distribution $e'/6$. In the present example $S_6$ acts on $V$ by permutation of the six DNA words. In the next section the same frequency data will be studied under the permutation action of $S_3$ on the letters of the DNA words.

$(H_1, H_2)$ in each one of nine regions. Note, for example, that the entropies in regions 3 and 6 are equal and yet their locations in the invariant plot differ, showing a slightly increased divergence from the uniform distribution $e'/6$ in region 3. Although the entropy in regions 6 and 8 is essentially the same, there is a three-fold ratio between their divergence components; this is noticeable by inspecting the (lack of) uniformity in the corresponding frequency distributions. Differentiations of that nature are possible because the dimension of the invariant subspace associated with $H_2$ is $\operatorname{tr} \mathcal{Q} = n - 1$, where $n$ is the number of components in the distribution under consideration. The regular decomposition described in the next section will further illustrate these differentiations and their relation to the symmetries imposed in the set of labels for the components of the distributions.

**Figure 3.1:** Color coded entropy levels ($H$) and their invariant components $\{H_1, H_2\}$ in 9 subsequent regions of the HIV virus Type I (BRU isolate).

#### 3.2 The standard decomposition of the entropy of the Sloan fonts

$H_1$ and $H_2$. Figure 3.2 shows the standard decomposition for each letter. The divergence from the uniform distribution $e'/2$ ranges from 0.58 when the distribution is (0.844, 0.156) to 0.001 for (0.516, 0.484). The study of the entropy of Sloan fonts will continue later on in Section 5.
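The two extreme divergence values quoted above can be reproduced from the two-component form of $-H_2$. This is a minimal sketch assuming natural logarithms; the helper name `divergence_two` is hypothetical.

```python
import math

def divergence_two(p1):
    """-H2 for the distribution (p1, 1 - p1): sum_j (p_j - 1/2) log p_j."""
    p2 = 1.0 - p1
    return (p1 - 0.5) * math.log(p1) + (p2 - 0.5) * math.log(p2)

most_skewed = divergence_two(0.844)   # distribution (0.844, 0.156)
most_even = divergence_two(0.516)     # distribution (0.516, 0.484)
```

Rounded to the precision used in the text, `most_skewed` gives 0.58 and `most_even` gives 0.001.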

#### 3.3 Geological compositions

$\{H_1, H_2\}$. The entropy among all compositions is within the range of 75 to 87 percent of the maximum entropy ($\log 5 = 1.6$), thus being concentrated in a relatively narrow region. The joint distribution of porosity and $H_2$ shown in Figure 3.5 suggests that porosity is negatively correlated with $H_2$, or, equivalently, positively correlated with the divergence. In fact, the observed sample correlation coefficient based on 25 samples of these two variables is 0.78.

$\{H_1, H_2\}$. Figure 3.7 clearly shows the contrast in entropy range between the two minerals, also evident in the range of the divergence in those compositions. This noticeable feature of the compositions [25, p.4], namely that coxite compositions are much less variable than hongite compositions, is clearly captured in these invariant plots.

**Figure 3.3:** Color coded entropy levels ($H$) and their invariant $\{H_1, H_2\}$ components in the geological compositions of 25 samples of coxite.

## 4 Regular Decompositions of Entropy

$S_3$, the group of permutations of three objects, will be considered. Decompositions for ensembles indexed by other finite groups can be obtained similarly.

**Figure 3.4:** Color coded porosity levels ($P$) and their invariant entropy components $\{H_1, H_2\}$ in the geological compositions of 25 samples of coxite.

$S_3$, so that $V$ is isomorphic to $S_3$. Indicate by

$p' = (r_1, r_2, r_3, t_1, t_2, t_3)$

$S_3$, appear in a given DNA reference region. These relative frequencies are, consequently, data indexed by $S_3$. Let also $\ell$ indicate the vector of the log components of $p$, so that $H = -p'\ell$ is the entropy in the probability distribution $p$ indexed by $S_3$.

#### 4.1 The regular decomposition

$S_3$ act on the DNA words by shuffling the letters. This gives a linear representation

$S_3$. Each matrix $\rho(\tau)$ is defined in the vector space for the data indexed by $S_3$. Associated with $\rho$ there are three canonical projections

$S_3$, with $n_1 = n_3 = 1$, $n_2 = 2$, $g = 6$, $t = \{(12), (13), (23)\}$, $r = \{(123), (132)\}$, given by

**Figure 3.6:** Color coded entropy levels ($H$) and their invariant $\{H_1, H_2\}$ components in the geological compositions of 25 samples of hongite.

$\mathcal{P}_i\mathcal{P}_j = \mathcal{P}_j\mathcal{P}_i = 0$ for $i \neq j$, $\mathcal{P}_i^2 = \mathcal{P}_i$, $i = 1, 2, 3$, and $I = \mathcal{P}_1 + \mathcal{P}_2 + \mathcal{P}_3$. The underlying invariant image subspaces $\mathcal{P}_i x$, $x \in \mathbb{R}^6$, are in dimension of 1, 4, 1 respectively.

$H = -p'\mathcal{P}_1\ell - p'\mathcal{P}_2\ell - p'\mathcal{P}_3\ell,$

where $-H_1$ is the log geometric mean of the components of $p$; direct evaluation shows that

$-H_2 = r_\bullet D(r : u) + t_\bullet D(t : u)$

where $r_\bullet$ is the marginal probability $r_1 + r_2 + r_3$ of a rotation, $t_\bullet$ is the marginal probability $t_1 + t_2 + t_3$ of a reflection, $D(r : u)$ is Kullback's divergence between the rotation subcomposition [25, p.33]

$(r_1, r_2, r_3)/(r_1 + r_2 + r_3)$

$(t_1, t_2, t_3)/(t_1 + t_2 + t_3)$ and the uniform distribution $u$; and

$S_3$ as a three-component sum: the log geometric mean of $p$, the total uniform divergence within rotations and within reflections, and a component measuring the between rotations and reflections separation.
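The three-component decomposition can be reproduced numerically from the characters of $S_3$. The sketch below assumes that the regular projections have entries $n_\beta \chi_\beta(\sigma^{-1}\tau)/g$ (the form given in Appendix B) and that the components of $p$ are ordered as three rotations followed by three reflections. All function names and the example distribution are hypothetical.

```python
import math
from itertools import permutations

def sign(s):
    """Sign of a permutation given as a tuple of images."""
    inversions = sum(1 for i in range(3) for j in range(i + 1, 3) if s[i] > s[j])
    return 1 if inversions % 2 == 0 else -1

def compose(s, t):
    """Product s * t, acting as i -> s(t(i))."""
    return tuple(s[t[i]] for i in range(3))

def inverse(s):
    out = [0, 0, 0]
    for i, image in enumerate(s):
        out[image] = i
    return tuple(out)

def fixed_points(s):
    return sum(1 for i in range(3) if s[i] == i)

# S_3 ordered as three rotations (even permutations) followed by three reflections.
G = sorted(permutations(range(3)), key=lambda s: -sign(s))

# (dimension, character) of the trivial, two-dimensional and sign representations.
IRREPS = [(1, lambda s: 1), (2, lambda s: fixed_points(s) - 1), (1, sign)]

def projection(n_beta, chi):
    """Matrix with entries n_beta * chi(sigma^{-1} tau) / g."""
    g = len(G)
    return [[n_beta * chi(compose(inverse(s), t)) / g for t in G] for s in G]

def regular_decomposition(p):
    """Return [H1, H2, H3] with H_i = -p' P_i l, p indexed in the order of G."""
    logs = [math.log(x) for x in p]   # assumes all components are positive
    return [-sum(p[i] * P[i][j] * logs[j] for i in range(6) for j in range(6))
            for P in (projection(n, chi) for n, chi in IRREPS)]

def divergence_from_uniform(q):
    """Kullback's divergence between q / sum(q) and the uniform distribution."""
    n, total = len(q), sum(q)
    return sum((x / total - 1.0 / n) * math.log(n * x / total) for x in q)

p = [0.30, 0.20, 0.10, 0.15, 0.15, 0.10]   # hypothetical (r1, r2, r3, t1, t2, t3)
h1, h2, h3 = regular_decomposition(p)
r, t = p[:3], p[3:]
traces = [round(sum(projection(n, chi)[i][i] for i in range(6))) for n, chi in IRREPS]
```

With these definitions the traces come out as 1, 4, 1, the three components add up to the entropy of `p`, and `-h2` agrees with `sum(r) * divergence_from_uniform(r) + sum(t) * divergence_from_uniform(t)`.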

$H = H_1 + H_2 + H_3$ for the permutation orbit of ACT in each one of 9 adjacent regions of the isolate. Table (4.2) shows the ratios $H_1/H$ and $H_2/H_3$ for potential comparisons among these regions. Similar ratios were calculated for the words AGT and ACG (Tables (4.3) and (4.4), respectively). Regions 1 and 4 were removed from the calculations for ACG due to the presence of zeros in the frequency distributions.

$H_2/H_3$ clearly differentiates the ACG orbit from ACT and AGT. The between-region range of the ratios $H_2/H_3$ is 0.295 to 1.45 in the ACG orbits, in contrast to 0.222 to 108 in the AGT orbits and 0.114 to 72.3 in the ACT orbits.

## 5 A Symmetry Study of Sloan Charts

#### 5.1 The Sloan Charts and the Sloan Fonts symmetries

$C_2 \times C_2$ and its multiplication table is given by

#### 5.2 The regular decomposition

_{1}= 1 + o + v + h,

_{o}= 1 +o − v − h,

_{v}= 1 + v − o − h,

_{h}= 1 + h − o − v,

$\mathcal{P}_\beta x$ is precisely $x_\beta(\tau) = n_\beta \operatorname{tr}[\beta(\tau^{-1})\,\hat{x}(\beta)]/g$. Consequently, the data assignment $x_\beta$ identifies the properties of the regular canonical projection $\mathcal{P}_\beta$ with the properties of the Fourier transform $\hat{x}(\beta)$ evaluated at $\beta$. In that sense, the indexing $x_\beta$ should retain the interpretations associated with the invariant subspace of $\mathcal{P}_\beta$, and, jointly, the data vectors $\{x_\beta, \beta \in \widehat{G}\}$ should fully describe the regular symmetry invariants. Here $\widehat{G}$ indicates the set of all irreducible representations of $G$.

**Any** indexing $x$ of $G$ decomposes (via the Fourier inverse formula) as the linear superposition $x(\tau) = \sum_{\beta \in \widehat{G}} x_\beta(\tau)$ of regular invariants $x_\beta$ of $G$. In short, $x = \sum_{\beta \in \widehat{G}} x_\beta$.
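For the group $C_2 \times C_2 = \{1, o, v, h\}$ the superposition can be verified directly. The sketch below assumes that $v$ and $h$ generate the group with $o = vh$, and that the four irreducible characters carry the signs of the combinations $1 \pm o \pm v \pm h$ listed above. The data vector `x` and all function names are hypothetical.

```python
ELEMENTS = ["1", "o", "v", "h"]

# Klein four-group as pairs of bits: v = (1, 0), h = (0, 1), o = v * h = (1, 1).
BITS = {"1": (0, 0), "o": (1, 1), "v": (1, 0), "h": (0, 1)}
NAME = {bits: name for name, bits in BITS.items()}

def mul(a, b):
    (a1, a2), (b1, b2) = BITS[a], BITS[b]
    return NAME[((a1 + b1) % 2, (a2 + b2) % 2)]

# Characters indexed 1, o, v, h (all one-dimensional, so n_beta = 1 and g = 4).
CHARACTERS = {
    "1": {"1": 1, "o": 1, "v": 1, "h": 1},
    "o": {"1": 1, "o": 1, "v": -1, "h": -1},
    "v": {"1": 1, "o": -1, "v": 1, "h": -1},
    "h": {"1": 1, "o": -1, "v": -1, "h": 1},
}

def regular_invariant(x, beta):
    """x_beta(tau) = chi_beta(tau^{-1}) * xhat(beta) / 4, with xhat the Fourier transform."""
    chi = CHARACTERS[beta]
    xhat = sum(x[t] * chi[t] for t in ELEMENTS)
    return {t: chi[t] * xhat / 4 for t in ELEMENTS}   # each element is its own inverse

def project(x, beta):
    """Apply the matrix with entries chi_beta(sigma^{-1} tau) / 4 to x."""
    chi = CHARACTERS[beta]
    return {s: sum(chi[mul(s, t)] * x[t] for t in ELEMENTS) / 4 for s in ELEMENTS}

x = {"1": 0.9, "o": 0.4, "v": 0.7, "h": 0.2}          # hypothetical data indexed by the group
invariants = {beta: regular_invariant(x, beta) for beta in ELEMENTS}
recovered = {t: sum(invariants[beta][t] for beta in ELEMENTS) for t in ELEMENTS}
```

Here `recovered` reproduces `x` component by component, and `project(x, beta)` coincides with `regular_invariant(x, beta)` for each of the four characters.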

#### 5.3 Sorting the line entropy by font symmetry type

**Figure 5.2:** Distribution of the entropy invariants EINVO ≡ _{o}, EINVV ≡ _{v} and EINVH ≡ _{h} for the entropy data indexed by font symmetries.

_{2},

_{3}and

_{4}(second, third and fourth invariants) for the entropy data

_{o}is markedly shifted from the distributions of

_{v}and

_{h}.

## 6 Summary and Comments

$H = H_1 + H_2$ of the entropy of any finite distribution and the method for obtaining the regular decomposition $H = H_1 + \cdots$ for the entropy of distributions indexed by arbitrary finite groups.

- The standard decomposition is analogous to the decomposition of the sum of squares or, more specifically, to the decomposition of an inner product $(y|x)$ of two vectors $x$ and $y$ in the same finite-dimensional vector space. In fact, the standard decomposition $I = \mathcal{A} + \mathcal{Q}$ has a well-known role in statistics. It leads to the usual decomposition $x'x = x'\mathcal{A}x + x'\mathcal{Q}x$ of the sum of squares;
- The methodology presented in this paper should provide an additional tool to study the entropy in distributions of nucleotide sequences in molecular biology data. This case is also of statistical and algebraic interest because it extends the decompositions introduced in the present paper to the case in which the probability distributions are indexed by the short nucleotide sequences upon which a group may act by symbol symmetry or by position symmetry [5, p.40];
- Two-dimensional invariant plots for the regular decomposition of the entropy by $S_3$ can be obtained by jointly displaying the three pairwise combinations of the invariant components $\{H_1, H_2, H_3\}$. The characteristics of the regions of constant entropy in these planes need to be investigated;
- The large-sample theory for multinomial distributions, described in detail, for example, in [27, p.469], can be applied to derive the moments of the entropy components.

## Appendix A

## Appendix B

$(T_\beta)_{\sigma\tau} = n_\beta \operatorname{tr}[\beta(\sigma^{-1}\tau)]/g,$

**Proposition B.1**

$\mathcal{P}_\beta = T_\beta$.

**Proof:**

$^{-1}$, so that $(\mathcal{P}_\beta)_{\sigma\tau} = n_\beta \chi_\beta(\sigma\tau^{-1})/g = n_\beta \operatorname{tr} \beta(\sigma\tau^{-1})/g = (T_\beta)_{\sigma\tau}$, as stated.

$\hat{x}(\beta) = \sum_{\tau \in G} x(\tau)\beta(\tau)$ of the scalar function $x$ is also an element in , the indexing $x_\beta(\tau) = n_\beta \operatorname{tr}[\beta(\tau^{-1})\,\hat{x}(\beta)]/g$ is well-defined. The following proposition shows that it reduces as $\beta$.

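**Proposition B.2**

$x_\beta(\tau) = n_\beta \operatorname{tr}[\beta(\tau^{-1})\,\hat{x}(\beta)]/g$ is an invariant of $\mathcal{P}_\beta$.

**Proof:**

From $\mathcal{P}_\beta = T_\beta$ derived above, for an arbitrary vector $x \in \mathbb{R}^g$, it follows that

$\mathcal{P}_\beta x$ is precisely $x_\beta(\tau) = n_\beta \operatorname{tr}[\beta(\tau^{-1})\,\hat{x}(\beta)]/g$.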
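The identity between $T_\beta x$ and the indexing $x_\beta$ can be illustrated for a non-abelian group. The sketch below assumes a concrete realization of $S_3$ as the dihedral group generated by a rotation $R$ through $2\pi/3$ and a reflection $F = \operatorname{diag}(1, -1)$, with elements written $R^k F^f$. The data vector `x` and all names are hypothetical.

```python
import math

# Elements (k, f) stand for R^k F^f; rotations first, then reflections.
G = [(k, f) for f in (0, 1) for k in (0, 1, 2)]

def mul(a, b):
    """Group product, using F R F = R^{-1}."""
    (k1, f1), (k2, f2) = a, b
    return ((k1 + (k2 if f1 == 0 else -k2)) % 3, (f1 + f2) % 2)

def inv(a):
    return next(b for b in G if mul(a, b) == (0, 0))

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def trace(A):
    return sum(A[i][i] for i in range(len(A)))

def trivial(a):
    return [[1.0]]

def sign_rep(a):
    return [[-1.0 if a[1] else 1.0]]

def two_dim(a):
    """Matrix of R^k F^f in the two-dimensional irreducible representation."""
    k, f = a
    c, s = math.cos(2 * math.pi * k / 3), math.sin(2 * math.pi * k / 3)
    flip = -1.0 if f else 1.0
    return [[c, -s * flip], [s, c * flip]]

IRREPS = [trivial, two_dim, sign_rep]

def fourier(x, beta):
    """Fourier transform xhat(beta) = sum_tau x(tau) beta(tau)."""
    d = len(beta((0, 0)))
    out = [[0.0] * d for _ in range(d)]
    for t in G:
        B = beta(t)
        for i in range(d):
            for j in range(d):
                out[i][j] += x[t] * B[i][j]
    return out

def regular_invariant(x, beta):
    """x_beta(tau) = n_beta tr[beta(tau^{-1}) xhat(beta)] / g."""
    d, xhat = len(beta((0, 0))), fourier(x, beta)
    return {t: d * trace(matmul(beta(inv(t)), xhat)) / len(G) for t in G}

def apply_T(x, beta):
    """(T_beta x)(sigma) with (T_beta)_{sigma tau} = n_beta tr[beta(sigma^{-1} tau)] / g."""
    d = len(beta((0, 0)))
    return {s: sum(d * trace(beta(mul(inv(s), t))) * x[t] for t in G) / len(G) for s in G}

x = dict(zip(G, [0.30, 0.20, 0.10, 0.15, 0.15, 0.10]))   # hypothetical data indexed by G
```

For each of the three representations `apply_T(x, beta)` and `regular_invariant(x, beta)` agree entry by entry, and summing the three invariants recovers `x`, in line with the Fourier inverse formula.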

## Acknowledgments

## References

- Khinchin, A. I. Information theory; Dover: New York, NY, 1957.
- Kullback, S. Information theory and statistics; Dover: New York, NY, 1968.
- Serre, J.-P. Linear representations of finite groups; Springer-Verlag: New York, NY, 1977.
- Viana, M. Symmetry studies: an introduction; IMPA Institute for Pure and Applied Mathematics Press: Rio de Janeiro, Brazil, 2003.
- Viana, M. Lecture notes on symmetry studies; Technical Report 2005-027, EURANDOM, Eindhoven University of Technology: Eindhoven, The Netherlands, 2005. Electronic version http://www.eurandom.tue.nl.
- Eaton, M. L. Multivariate statistics: a vector space approach; Wiley: New York, NY, 1983.
- Muirhead, R. J. Aspects of multivariate statistical theory; Wiley: New York, NY, 1982.
- Lakshminarayanan, V.; Viana, M. Dihedral representations and statistical geometric optics I: Spherocylindrical lenses. J. Optical Society of America A **2005**, 22, no. 11, 2843–89.
- Viana, M. Invariance conditions for random curvature models. Methodology and Computing in Applied Probability **2003**, 5, 439–453.
- Viana, M.; Lakshminarayanan, V. Dihedral representations and statistical geometric optics II: Elementary instruments. J. Modern Optics **2006**, to appear.
- Hannan, E. J. Group representations and applied probability. Journal of Applied Probability **1965**, 2, 1–68.
- James, A. T. The relationship algebra of an experimental design. Annals of Mathematical Statistics **1957**, 28, 993–1002.
- Grenander, U. Probabilities on algebraic structures; Wiley: New York, NY, 1963.
- Bailey, R. A. Strata for randomized experiments. Journal of the Royal Statistical Society B **1991**, no. 53, 27–78.
- Dawid, A. P. Symmetry models and hypothesis for structured data layouts. Journal of the Royal Statistical Society B **1988**, no. 50, 1–34.
- Diaconis, P. Group representation in probability and statistics; IMS: Hayward, California, 1988.
- Eaton, M. L. Group invariance applications in statistics; IMS-ASA: Hayward, California, 1989.
- Farrell, R. H. (Ed.) Multivariate calculation; Springer-Verlag: New York, NY, 1985.
- Wijsman, R. A. Invariant measures on groups and their use in statistics; Vol. 14, IMS: Hayward, California, 1990.
- Nachbin, L. The Haar integral; Van Nostrand: Princeton, N.J., 1965.
- Andersson, S. Invariant normal models. Annals of Statistics **1975**, 3, no. 1, 132–154.
- Perlman, M. D. Group symmetry covariance models. Statistical Science **1987**, 2, 421–425.
- Viana, M.; Richards, D., Eds.; Vol. 287, American Mathematical Society: Providence, RI, 2001.
- Ferris, F. L., 3rd; Freidlin, V.; Kassoff, A.; Green, S. B.; Milton, R. C. Relative letter and position difficulty on visual acuity charts from the early treatment diabetic retinopathy study. Am J Ophthalmol **1993**, 15, 735–40.
- Aitchison, J. The statistical analysis of compositional data; Chapman and Hall: New York, NY, 1986.
- Doi, H. Importance of purine and pyrimidine content of local nucleotide sequences (six bases long) for evolution of human immunodeficiency virus type 1. Evolution **1991**, 88, no. 3, 9282–9286.
- Bishop, Y. M. M.; Fienberg, S. E.; Holland, P. W. Discrete multivariate analysis: Theory and practice; MIT Press: Cambridge, Massachusetts, 1975.

© 2006 by MDPI (http://www.mdpi.org). Reproduction for noncommercial purposes permitted.