The Ordering of Shannon Entropies for the Multivariate Distributions and Distributions of Eigenvalues

In this paper, we prove Shannon entropy inequalities for multivariate distributions via the notion of convex ordering of two multivariate distributions. We further characterize the multivariate totally positive of order 2 (MTP2) property of the distribution functions of eigenvalues of both the central Wishart and central MANOVA models, and of both the noncentral Wishart and noncentral MANOVA models, under a general population covariance matrix setup. These results apply directly both to the comparison of two Shannon entropy measures and to the power monotonicity problem for the MANOVA problem.


Introduction
Let X1, X2, · · · be independent random vectors with a common continuous distribution function F(x) that has density function f(x), x ∈ R^p, with respect to Lebesgue measure. The Shannon entropy for f(x) is denoted by I_S(f) = −∫_{R^p} f(x) log f(x) dx. We further use F to denote the class of all density functions whose Shannon entropies exist, that is, F = { f : |I_S(f)| < ∞ }. Entropy inequalities have been studied by several authors, such as Karlin and Rinott ([1,2], and references therein), who used the notions of majorization and Schur functions to study Shannon (differential) entropy comparisons. Zografos and Nadarajah [3] pointed out that some comparisons of Shannon entropies can be induced by affine transformations; those results reflect the fact that Shannon entropy is not scale invariant. Tsai [4] used the notion of convex ordering to investigate Shannon inequalities for univariate distributions. The novelty of his approach is to adopt the difference of two Shannon entropy measures as a new measure, which is essentially symmetric. This new measure has some advantages over the well-known Kullback-Leibler divergence (relative entropy): the former is designed for small or moderate deviations, whereas the latter is most often used for large deviations. The measure of the difference between two Shannon entropies also enjoys a finite-sample property.
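As a quick numerical illustration of the definition above (a sketch only; the quadrature grid, integration bounds, and the choice of a normal density are ours, not from the text), the differential entropy I_S(f) = −∫ f log f can be approximated by quadrature and compared with the closed form (1/2) log(2πeσ²) for an N(0, σ²) density:

```python
import numpy as np

def shannon_entropy(f, lo, hi, n=200_001):
    # I_S(f) = -integral of f log f, approximated with the trapezoidal rule on [lo, hi]
    x = np.linspace(lo, hi, n)
    fx = f(x)
    safe = np.where(fx > 0, fx, 1.0)                 # avoid log(0) off the support
    integrand = np.where(fx > 0, fx * np.log(safe), 0.0)
    return -float(np.sum(0.5 * (integrand[1:] + integrand[:-1])) * (x[1] - x[0]))

sigma = 2.0
normal = lambda t: np.exp(-t**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
est = shannon_entropy(normal, -40.0, 40.0)
exact = 0.5 * np.log(2 * np.pi * np.e * sigma**2)
print(est, exact)  # the two values agree to several decimals
```

The bounds ±40 make the truncated Gaussian tails numerically negligible for this σ.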
The key point of this method is to transform the convex ordering into the Lorenz ordering, which neatly depicts the stochastic ordering of two normalized distributions with the same starting and ending points. The difference between two Shannon entropy measures can be represented in terms of a standard form of the Kullback-Leibler divergence; it is interesting to note that the original Kullback-Leibler divergence does not admit such a symmetric representation. Furthermore, the new measure is a monotone likelihood ratio of the two normalized density functions. These properties give the new measure some advantages for statistical inference.
The ordering of Shannon entropies arises naturally when the two underlying distributions have a convex ordering or concave ordering relationship, via the help of the Lorenz ordering. In this paper, one of the main goals is to extend the result of Tsai [4] to the multivariate case, which is studied in Section 3. Please note that the notions of convex ordering of two distributions, monotone likelihood ratio (MLR), and totally positive of order 2 (TP2) are essentially equivalent. We then adopt the notion of multivariate totally positive of order 2 (MTP2) or multivariate reverse rule of order 2 (MRR2) to obtain a more complete theory of Shannon entropy comparisons. Eigenvalue problems have attracted considerable attention recently. In Section 4, we further study the MTP2 property of the distribution functions of eigenvalues of both the central Wishart and central MANOVA models, and of both the noncentral Wishart and noncentral MANOVA models, under the general population covariance matrix setup. The results can be applied directly to the comparison of two Shannon entropy measures as well as to the power monotonicity problem for the MANOVA problem, which has long been open in the literature (Perlman and Olkin [5]). For a high-dimensional Wishart matrix whose population covariance matrix is an unknown scale times the identity matrix, the TP2 property of the limiting empirical density function of eigenvalues is also studied, in Section 4.5. Mixture density functions, such as the noncentral chi-square density function and those presented in Sections 4.3 and 4.4, play an important role in statistical inference. We also connect this notion with that of the well-known Fisher multivariate analysis of variance in the final remarks section.

The Univariate Distributions
Tsai [4] adopted the notion of convex ordering of distribution functions to study the orderings of Shannon entropy measures. For two univariate distributions F, G ∈ F, it is said that F is c-ordered (convex ordered) with respect to G (F ≤_c G) if and only if G^{-1}F is convex on the interval where 0 < F(x) < 1 (van Zwet [6]). The transformation of Barlow and Doksum [7] turns the convex ordering of the distributions F and G into a stochastic ordering of distributions; it can essentially be viewed as a kind of Lorenz ordering (Gastwirth [8]). Consider the ratio of the two density functions evaluated at the corresponding quantiles, which can be written as f(F^{-1}(u))/g(G^{-1}(u)), where u = F(x), u ∈ [0, 1]. Following Barlow and Doksum [7], we denote by H_F(u), u ∈ [0, 1], the inverse function of the transformation considered there. Please note that the notation in Barlow and Doksum [7] is originally defined so that F(0) = 0 and F^{-1}(0) = 0; we use the general conventions F(−∞) = 0 and F^{-1}(0) = −∞ in this paper. Obviously, H_G(u) = u, u ∈ [0, 1]. Under the assumption that F(x) ≤_c G(y), Barlow and Doksum [7] pointed out that both H_F(u) and H_G(u), u ∈ [0, 1], are distribution functions. Let h_F and h_G be the corresponding density functions of H_F and H_G, respectively; then h_F(u) = f(F^{-1}(u))/g(G^{-1}(u)) and h_G(u) = 1, u ∈ [0, 1]. It is then easy to note that the difference of two Shannon entropies can be represented as the Kullback-Leibler divergence of the uniform density h_G from the density h_F. Please note that if F(x) ≤_c G(y), then h_F(u) is nondecreasing in u ∈ [0, 1]; namely, it enjoys the monotone likelihood ratio property. In addition, it crosses the uniform density h_G(u) = 1, u ∈ [0, 1], at most once, and the sign of the difference h_F(u) − h_G(u), u ∈ [0, 1], changes from negative to nonnegative.
Similarly, if F(x) ≤_c G(y), then F(x) ≤ G(y), and C_F(x) = G^{-1}F(x) is a convex function in x, x ∈ (−∞, ∞) (for the details see Shaked and Shanthikumar [9]). Namely, the convex ordering implies the stochastic ordering; however, the converse need not be true. We provide a simple counterexample. For 0 ≤ x ≤ 1, let F(x) = x and G(x) = (3x − 2)³/9 + 8/9, so that f(x) = 1, g(x) = (3x − 2)², and G^{-1}(u) = (2 + (9u − 8)^{1/3})/3, u ∈ [0, 1], where the cube root is taken to be real. Thus f(F^{-1}(u)) = 1, g(G^{-1}(u)) = (9u − 8)^{2/3}, and h_F(u) = (9u − 8)^{−2/3}. It is easy to see that F(u) ≤ G(u). However, h_F(u) is not nondecreasing in u, u ∈ [0, 1], which means that F(x) ≤_c G(y) does not hold in this example. As such, we may conclude that if F(x) ≤_c G(y), then F(x) ≤ G(y); however, F(x) ≤ G(y) does not generally imply F(x) ≤_c G(y).
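The counterexample can be verified numerically; the following sketch (grid size and tolerances are our choices) checks that F(u) ≤ G(u) holds pointwise while h_F fails to be nondecreasing:

```python
import numpy as np

u = np.linspace(1e-6, 1 - 1e-6, 100_000)
F = u
G = (3 * u - 2) ** 3 / 9 + 8 / 9

stochastic_ok = bool(np.all(F <= G + 1e-12))   # F(u) <= G(u) on [0, 1]

# h_F(u) = f(F^{-1}(u)) / g(G^{-1}(u)) = (9u - 8)^{-2/3}, with the real cube root
h = 1.0 / np.cbrt(9 * u - 8) ** 2
not_monotone = bool(np.any(np.diff(h) < 0))    # h_F decreases somewhere

print(stochastic_ok, not_monotone)  # True True: stochastic ordering without convex ordering
```

h_F increases on [0, 8/9) and decreases on (8/9, 1], so the single sign change of np.diff is exactly the failure of the MLR property that rules out F ≤_c G.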
Under the convex ordering assumption, i.e., F(x) ≤_c G(y), the corresponding Lorenz ordering plays an important role in the proof of Theorem 1. To check whether the condition F(x) ≤_c G(y) holds, one should simultaneously check whether the condition F(x) ≤ G(y) holds. Similarly, define F(x) ≤_concave G(x) if and only if G^{-1}F(x) is a concave function in x, x ∈ (−∞, ∞); then we have the following corollary.
Corollary 1. If F(x) ≤_concave G(y), then I_S(f) ≤ I_S(g).
Proof. It follows easily from Theorem 1 by changing the sign of log h_F(u).
It can be proved similarly. We also note that if K is TP2 and L is RR2, then the composition M(x, y) = ∫ K(x, z)L(z, y) dσ(z) is RR2, by the basic composition formula.

The Multivariate Distributions
By the convex ordering transformation, we can neatly transform the difference of two Shannon entropy measures into the Lorenz ordering. In this section, we intend to extend this result to the multivariate case. It is natural to do so in the sense of conditional convex ordering. Let X = (X1, . . . , Xp) and Y = (Y1, . . . , Yp) be random vectors, and let f and g be the density functions of X and Y, respectively. Please note that MTP2 holds if and only if all the pairwise TP2 properties hold (Fact 4.3.2 of Tong [11]); namely, to examine MTP2 is the same as to examine the pairwise TP2 property for every fixed pair. Hence, without loss of generality, we may take p = 2. Please note that f(x1, x2) = f1(x1) f2|1(x2|x1), and that the product of two TP2 functions is TP2. Thus we may conclude that f(x1, x2) is TP2 if and only if both f1(x1) and f2|1(x2|x1) are TP2. That is to say, F(x1, x2) ≤_c G(y1, y2) iff F1(x1) ≤_c G1(y1) and F2|1(x2|x1) ≤_c G2|1(y2|y1). Continuing this process for the general p case, we have the following.
Proof. For simplicity, we may consider the situation p = 2 first. Let g12(y1, y2) be the joint density function of the random variables Y1 and Y2, g2|1(y2|y1) be the corresponding conditional density function of Y2 given Y1 = y1, and G2|1(y2|y1) be the corresponding conditional distribution function.
Let f12(x1, x2), f2|1(x2|x1), and F2|1(x2|x1) be defined similarly for the variables X1 and X2. Furthermore, by Proposition 2, we may note that F(x1) ≤_c G(y1), and thus, via the arguments in Section 2, the corresponding h-function on [0, 1] is a density function. Thus, by the information inequality, the first term on the right-hand side of Equation (3) is non-negative, and the second term inside the bracket is non-negative too; hence the second term on the right-hand side of Equation (3) is also non-negative. Thus we may conclude that I_S(f12) ≥ I_S(g12). Continuing this process, we obtain the result for general p. Hence, the theorem follows.
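The pairwise TP2 (lattice) condition underlying this argument can be probed numerically. The sketch below uses a bivariate normal with correlation ρ = 0.6 (our illustrative choice; MTP2 for the bivariate normal requires ρ ≥ 0) and checks f(x ∨ y) f(x ∧ y) ≥ f(x) f(y) on random pairs:

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.6  # positive correlation: illustrative choice, not from the text

def f(x1, x2):
    # bivariate normal density with zero means, unit variances, correlation rho
    q = (x1**2 - 2 * rho * x1 * x2 + x2**2) / (1 - rho**2)
    return np.exp(-q / 2) / (2 * np.pi * np.sqrt(1 - rho**2))

x = rng.normal(size=(100_000, 2))
y = rng.normal(size=(100_000, 2))
hi, lo = np.maximum(x, y), np.minimum(x, y)    # componentwise join and meet
lhs = f(hi[:, 0], hi[:, 1]) * f(lo[:, 0], lo[:, 1])
rhs = f(x[:, 0], x[:, 1]) * f(y[:, 0], y[:, 1])
mtp2_ok = bool(np.all(lhs >= rhs - 1e-12))
print(mtp2_ok)  # True: the lattice condition f(x∨y)f(x∧y) >= f(x)f(y) holds
```

For this density the condition holds exactly because ∂² log f / ∂x1 ∂x2 = ρ/(1 − ρ²) ≥ 0, i.e., log f is supermodular.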
For two arbitrary densities f_g, f_h, the Kullback-Leibler information discrimination is defined by K(f_g, f_h) = ∫ f_g(x) log( f_g(x)/f_h(x) ) dx. From Equation (4) we have a representation in which E_{U1,...,Ui−1} denotes the expectation with respect to independent uniform densities of (U1, . . . , Ui−1).
Corollary 2. If F(x) ≤_concave G(y), then I_S(f) ≤ I_S(g).
As we have seen, the proof of Theorem 2 is surprisingly easy. However, characterizing the convex ordering of two multivariate distributions via the conditional approach is sometimes much more complicated than characterizing the MTP2 property directly. The convex ordering of two multivariate distributions is essentially equivalent to the notion of MTP2, as the concave ordering is to MRR2. Karlin and Rinott ([12,13]) studied several MTP2 and MRR2 distributions, respectively. Some other frequently encountered models related to Theorem 2 and Corollary 2 are illustrated in the following examples.

Example 1. Let y = Ax, where A is a nonsingular matrix. Then I_S(g) = log |A| + I_S(f), where |A| denotes the determinant of the matrix A (Zografos and Nadarajah [3]).

Example 2. It is easy to note that the generalized Dirichlet density function is MRR2 when b_i ≥ 1, ∀i = 1, . . . , p. In addition, Corollary 2 is applicable.

Example 3. Gupta and Richards [14] considered the multivariate Liouville distribution, whose density function involves a continuous function g : R+ → R+ and θ, x ∈ R+^p. The MRR2 property of some special cases of multivariate Liouville distributions, such as Dirichlet distributions and inverted Dirichlet distributions, was studied by Karlin and Rinott [13].
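For a concrete check of the MRR2 property in this family, the following sketch tests the reverse rule inequality f(x ∨ y) f(x ∧ y) ≤ f(x) f(y) for an ordinary Dirichlet density (the parameter values are our choices; the pairwise computation needs the "remainder" exponent a3 ≥ 1):

```python
import numpy as np

rng = np.random.default_rng(1)
a1, a2, a3 = 2.0, 1.5, 3.0  # illustrative parameters, a3 >= 1

def f(x1, x2):
    # Dirichlet density kernel on the open simplex, zero outside
    s = 1.0 - x1 - x2
    ok = (x1 > 0) & (x2 > 0) & (s > 0)
    out = np.zeros_like(x1)
    out[ok] = x1[ok]**(a1 - 1) * x2[ok]**(a2 - 1) * s[ok]**(a3 - 1)
    return out

# draw pairs of points in the open simplex
z = rng.dirichlet([1, 1, 1], size=(200_000, 2))
x, y = z[:, 0, :2], z[:, 1, :2]
hi, lo = np.maximum(x, y), np.minimum(x, y)
lhs = f(hi[:, 0], hi[:, 1]) * f(lo[:, 0], lo[:, 1])
rhs = f(x[:, 0], x[:, 1]) * f(y[:, 0], y[:, 1])
rr2_ok = bool(np.all(lhs <= rhs + 1e-12))
print(rr2_ok)  # True: the RR2 (reverse rule) inequality holds pairwise
```

Here the inequality follows from ∂² log f / ∂x1 ∂x2 = −(a3 − 1)/(1 − x1 − x2)² ≤ 0, so the simulation is only a sanity check of that computation.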

Example 4. In biostatistics, many studies are conducted under the setup of the well-known Cox proportional hazards model (Cox [15]), whose basic form assumes that the survival functions satisfy Ḡ(x) = (F̄(x))^λ for some λ > 0. Let G2|1(x2|x1) = u2; by arguments similar to those above, we have F2|1(x2|x1) ≤_c G2|1(x2|x1) when λ ≥ 1. Thus, by Proposition 2, we have F(x) ≤_c G(x) when λ ≥ 1. This procedure can be generalized directly to the general p case by mathematical induction. As such, for the positive disease gene case λ ≤ 1, Theorem 2 yields I_S(f) ≤ I_S(g); similarly, for the negative disease gene case λ ≥ 1, I_S(f) ≥ I_S(g).

Example 5. The density function of eigenvalues for a multivariate beta matrix is of the form given in [16] (see also Peddada and Richards [17]), where 0 < x_i < 1, i = 1, . . . , p, and n1, n2 > p. Please note that the Vandermonde determinant ∏_{1≤i<j≤p} |x_i − x_j| is MTP2; for the details see Dykstra and Hewett [18]. Using Proposition 2, it is easy to check the remaining factors; then we may conclude that the density function, which is the product of these functions, is MRR2. In addition, Corollary 2 is applicable.
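The convex-ordering claim in Example 4 can be probed numerically in a univariate margin. Under Ḡ = F̄^λ one has G^{-1}F(x) = Λ^{-1}(Λ(x)/λ), where Λ is the baseline cumulative hazard; the sketch below takes a Gompertz-type hazard Λ(x) = e^x − 1 (our illustrative choice, since power-law hazards make G^{-1}F trivially linear) and checks convexity for λ ≥ 1 and concavity for λ ≤ 1 via second differences:

```python
import numpy as np

# Baseline with Gompertz cumulative hazard Lambda(x) = e^x - 1 (illustrative choice)
Lam = lambda x: np.expm1(x)
Lam_inv = lambda t: np.log1p(t)

def psi(x, lam):
    # psi = G^{-1}F when bar-G = bar-F^lam, i.e. Lambda_G = lam * Lambda_F,
    # so psi(x) = Lambda^{-1}(Lambda(x) / lam)
    return Lam_inv(Lam(x) / lam)

x = np.linspace(0.01, 5.0, 2000)
convex_when_lam_big = bool(np.all(np.diff(psi(x, 2.0), n=2) >= -1e-12))
concave_when_lam_small = bool(np.all(np.diff(psi(x, 0.5), n=2) <= 1e-12))
print(convex_when_lam_big, concave_when_lam_small)  # True True
```

For this baseline, ψ'(x) = e^x/(λ + e^x − 1) is increasing for λ > 1 and decreasing for λ < 1, which the second differences confirm.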

The Central Wishart and Central MANOVA Models, and Noncentral Wishart and Noncentral MANOVA Models
Theorem 2 states that if the density function has the MTP2 property, then the corresponding ordering of Shannon entropies holds; namely, the MTP2 (or MRR2) property of the density function ensures the ordering of Shannon entropies. MTP2 also implies the power monotonicity property of MANOVA tests based on monotone functions of the eigenvalues. Eigenvalues play an important role in statistical inference. Example 5 deals with the density function of the eigenvalues of MANOVA models (or Jacobi ensembles) when the population covariance matrix is assumed to be an identity matrix (Σ = I). Two other frequently encountered models are (i) Gaussian (or Hermite) ensembles, and (ii) Wishart models (or Laguerre ensembles) [16]. By arguments similar to Example 5, it is easy to see that the density function f(x) is MTP2 either for (i) (the identity population covariance matrix is an M-matrix) or for (ii) (gamma-type density function); we omit the details. For statistical inference, the population covariance matrix is generally unknown. In this section, we further study the MTP2 property of the distribution functions of eigenvalues of the central Wishart and MANOVA models, and of the noncentral Wishart and MANOVA models (James [19]; Muirhead [20]), under the general population covariance matrix setup.

Type 0F0, Exponential: Eigenvalues of a Wishart Matrix
Suppose that the columns of a p × n matrix X are independently normally distributed with covariance matrix Σ and E(X) = 0. Let L = diag(l1, . . . , lp), where l_i is the ith largest eigenvalue of XX′. Similarly, let ∆ = diag(δ1, . . . , δp), where δ_i is the ith largest eigenvalue of Σ. Write Σ = Q∆Q′ and XX′ = ULU′, where Q, U ∈ O(p), with O(p) being the group of p × p orthogonal matrices. Also denote by (dU) the m-form of U with m = p(p + 1)/2, and let tr denote the trace operator; then, with H = Q′U ∈ O(p) and noting that (dU) = (dH), the invariance property shows that the density function of L depends only upon ∆, and it factors as the ∆-free term φ0(l1, . . . , lp) times a term depending on both L and ∆.
Proof. It is easy to see that the required TP2 conditions hold in each pair of arguments. Please note that ∆ is a positive diagonal matrix, which is clearly positive definite. Hence the theorem follows.
By the result of Dykstra and Hewett [18], it is easy to note that φ0(l1, . . . , lp) enjoys the MTP2 property. Hence, by Theorem 3, we may conclude that the density function φ∆(l1, . . . , lp) of the eigenvalues of XX′, being the product of two MTP2 functions (Karlin and Rinott [12]), is MTP2.
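The Dykstra-Hewett MTP2 property of the Vandermonde factor can be checked by simulation on the ordered cone l1 > · · · > lp, where the componentwise join and meet of two ordered vectors stay ordered (sample sizes are our choice):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(2)

def vandermonde(l):
    # prod_{i<j} (l_i - l_j), positive when l_1 > l_2 > ... > l_p
    return float(np.prod([l[i] - l[j] for i, j in combinations(range(len(l)), 2)]))

p, trials = 4, 20_000
ok = True
for _ in range(trials):
    x = np.sort(rng.uniform(0, 10, p))[::-1]     # ordered eigenvalue vector
    y = np.sort(rng.uniform(0, 10, p))[::-1]
    hi, lo = np.maximum(x, y), np.minimum(x, y)  # join and meet stay ordered
    ok = ok and (vandermonde(hi) * vandermonde(lo)
                 >= vandermonde(x) * vandermonde(y) * (1 - 1e-9))
print(ok)  # True: the MTP2 lattice condition holds on the ordered cone
```

The check passes because each pairwise term log(l_i − l_j) is supermodular on the cone, and a sum of supermodular pairwise terms is supermodular.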

Type 1F0, Binomial Series: Eigenvalues When
Suppose that the p × n1 matrix variate X has n1 independent columns distributed as N(0, Σ1), and the p × n2 matrix variate Y has n2 independent columns distributed as N(0, Σ2). Also let F = diag(f1, . . . , fp) and Ω = diag(ω1, . . . , ωp), where f_i is the ith largest eigenvalue of XX′(YY′)^{−1} and ω_i is the ith largest eigenvalue of Σ1Σ2^{−1}; the relevant representation is well defined for all S and T positive definite. Without loss of generality, we may take both S and T to be positive diagonal matrices. Then the joint density function of the eigenvalues f1, . . . , fp takes a product form, and the required inequality follows by the Schwarz inequality. Please note that φ*(f1, . . . , fp) is MTP2. Thus, by Theorem 4, we have that the density function φω(f1, . . . , fp) is MTP2.
(ii) For p ≥ n1, the joint density function of the eigenvalues f1, . . . , fp takes a similar form. By arguments similar to those in Theorem 6, we may conclude that the density function φ*2ω*(f1, . . . , fp; Ω*) is MRR2.

High-Dimensional Wishart Matrices
Suppose that the columns of a p × n matrix X are independently normally distributed with covariance matrix Σ and E(X) = 0. Let L = diag(l1, . . . , lp), where l_i is the ith largest eigenvalue of XX′. Let Σ = σ²I and c = lim_{n→∞} p/n, c ∈ (0, 1). When the dimension p is large, the limiting distribution of the sample spectral eigenvalues is the well-known Marčenko-Pastur distribution F_{c,σ²} (M-P law) with index c and scale parameter σ², whose density function is of the form
f_{c,σ²}(x) = √((bσ² − x)(x − aσ²)) / (2πcσ²x), aσ² ≤ x ≤ bσ²,
where a = (1 − √c)² and b = (1 + √c)² (Marčenko and Pastur [21]). After some algebraic manipulations, we can show that
∂² log f_{c,σ²}(x) / (∂x ∂σ²) = a / (2(x − aσ²)²) + b / (2(bσ² − x)²) > 0
on the support. Thus, the Marčenko-Pastur density function is TP2 in (x, σ²), and hence Theorem 1 is applicable.

Remarks
We adopted the notion of convex ordering of two distributions to prove the ordering of Shannon entropies in Theorem 2. Please note that the notions of convex ordering of two distributions, the monotone likelihood ratio, and total positivity of order 2 (MTP2) are essentially equivalent. For many density functions, such as those discussed in Sections 3 and 4, it seems easier to characterize the MTP2 property of the density functions than the convex ordering of the distribution functions. In practice, we suggest using the notion of MTP2 for the comparison of Shannon entropy measures.
The difference of two Shannon entropies is an intrinsic distribution measure, as we have emphasized in this paper. As a result, the Shannon entropy measures can be ordered when the two underlying distributions have the convex ordering (i.e., the MTP2) relationship. The monotonicity of the power function, which is basically discussed under the same distribution but with differently ordered parameter setups, turns out to be a special case of the comparison of two Shannon entropies, which can even be discussed under two totally different distribution functions.
For the problem of testing the hypothesis H0 : F(x) =_c G(x) against H1 : F(x) ≤_c G(x), the test statistic based on the sample version of the difference of two Shannon entropies enjoys an optimality property. The above hypothesis testing problem is equivalent to testing H0 : H_F(u) = u, u ∈ [0, 1], against H1 : H_F(u) is convex on [0, 1]. Let U_{1:n} < U_{2:n} < . . . < U_{n:n} be the order statistics from H_F; then the most powerful test is based on the statistic ∑_{i=1}^n log h_F(U_{i:n}), which is the empirical version of the difference of two Shannon entropies. Since h_F(u) is monotone increasing in u, the test has the power monotonicity property.
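A small simulation sketch of this test statistic (the alternative H_F(u) = u², hence h_F(u) = 2u, is our illustrative choice of a convex H_F): the statistic is invariant to the ordering of the sample, so we sum over the unordered draws; under H1 it concentrates around n(log 2 − 1/2), versus n(log 2 − 1) under H0:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5000

# Illustrative alternative: H_F(u) = u^2 is convex on [0, 1], so h_F(u) = 2u
h_F = lambda u: 2 * u
u_null = rng.uniform(size=n)           # H0: U_i ~ Uniform(0, 1)
u_alt = np.sqrt(rng.uniform(size=n))   # H1: U_i ~ H_F, via inverse transform
t_null = np.log(h_F(u_null)).sum()
t_alt = np.log(h_F(u_alt)).sum()
print(t_null / n, t_alt / n)  # approx log2 - 1 = -0.307 and log2 - 1/2 = 0.193
```

The clear separation of the two per-observation means illustrates why a one-sided rejection region in this statistic yields a powerful, monotone test.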
Noncentral chi-square distributions play an important role in statistical inference. In the univariate case, a noncentral chi-square or gamma density is typically a Poisson mixture of central chi-square or gamma-type densities. Beyond the monotonicity property, we may note the following. Let f_α(x, β) denote the univariate gamma density function with shape parameter α and scale parameter β, and let F_α(x, β) denote its distribution function. For gamma densities with a fixed scale parameter, van Zwet [6] established the convex ordering of gamma distributions in the shape parameter. These results tell us that the larger the degrees of freedom, the larger the Shannon entropy measure, when both noncentralities are kept the same. For any two test statistics (or estimators) that are noncentral chi-square distributed with the same noncentrality but different degrees of freedom, the Pitman efficiency (based on the Kullback-Leibler divergence) yields no difference in the comparison. However, our new measure of the difference of two Shannon entropies is nonzero, which means that the new measure does distinguish them. For the MANOVA models, the density functions of eigenvalues are of Bessel multivariate gamma type or confluent hypergeometric multivariate beta type, such as those discussed in Sections 4.3 and 4.4, respectively; our results can be applied directly to tests with monotone acceptance regions, which enjoy power monotonicity for those density functions with the MTP2 (MLR) property. The conceived mixture model also arises in the finite case: in a multi-sample model with G (G > 1) densities f_g(x), g = 1, . . . , G, a sample of size n_g is drawn from the density f_g(x), so that the pooled sample relates to the mixture density with w_g = n_g / ∑_{h=1}^G n_h.
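The claim that the Shannon entropy of a fixed-scale gamma density increases with the shape (degrees of freedom) parameter can be checked numerically; a sketch (grid and shape values are our choices):

```python
import math
import numpy as np

def gamma_entropy(alpha, beta=1.0, n=400_000, hi=200.0):
    # numerically integrate -f log f for the gamma(alpha, beta) density
    x = np.linspace(1e-8, hi, n)
    logf = (alpha - 1) * np.log(x) - x / beta - math.lgamma(alpha) - alpha * math.log(beta)
    f = np.exp(logf)
    integrand = -f * logf
    return float(np.sum(0.5 * (integrand[1:] + integrand[:-1])) * (x[1] - x[0]))

ents = [gamma_entropy(a) for a in (1.0, 2.0, 4.0, 8.0)]
print(ents)  # strictly increasing in the shape parameter
```

For α = 1 (the exponential density) the closed-form entropy is 1, which the quadrature reproduces; the increase with α matches the closed form α + log Γ(α) + (1 − α)ψ(α).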
For the finite mixture case, it is hard to find an explicit form for the Shannon entropy of a mixture distribution, even under the multinormal setup. Suppose that there are G groups and the random vector X has the mixture density function f*_G(x) = ∑_{g=1}^G w_g f_g(x). The w_g are nonnegative and they add up to 1, so the mixture density represents a convex combination of the component densities. We find that the decomposability of the Shannon entropy still holds, in a different way, for continuous random vectors.
Theorem 7. I_S(f*_G) = ∑_{g=1}^G w_g I_S(f_g) + ∑_{g=1}^G w_g K(f_g, f*_G).
Proof. Please note that I_S(f*_G) = −∫ f*_G(x) log f*_G(x) dx = −∑_{g=1}^G w_g ∫ f_g(x) log f*_G(x) dx = ∑_{g=1}^G w_g I_S(f_g) + ∑_{g=1}^G w_g K(f_g, f*_G).
It follows from Theorem 7 that I_S(f*_G) ≥ ∑_{g=1}^G w_g I_S(f_g). We further note that, for a mixture model, the Shannon entropy has a decomposability property similar to that of the well-known Fisher MANOVA model: the first term on the right-hand side represents an average of the individual entropies (analogous to the within-group sum of squares), while the second term is nonnegative and represents the between-group distances.
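The decomposition in Theorem 7 can be verified numerically for a simple Gaussian mixture (the weights and means are our choices); the identity holds pointwise, so the two sides agree up to quadrature rounding:

```python
import numpy as np

w = np.array([0.3, 0.7])
mu = np.array([-1.0, 2.0])

x = np.linspace(-15.0, 15.0, 400_001)
dx = x[1] - x[0]
comp = np.exp(-(x[:, None] - mu) ** 2 / 2) / np.sqrt(2 * np.pi)  # N(mu_g, 1) densities
mix = comp @ w                                                   # mixture density f*_G

def integral(v):
    # trapezoidal rule on the common grid
    return float(np.sum(0.5 * (v[1:] + v[:-1])) * dx)

I_mix = integral(-mix * np.log(mix))
I_g = [integral(-comp[:, g] * np.log(comp[:, g])) for g in range(2)]
KL_g = [integral(comp[:, g] * np.log(comp[:, g] / mix)) for g in range(2)]
lhs = I_mix
rhs = sum(w[g] * (I_g[g] + KL_g[g]) for g in range(2))
print(lhs, rhs)  # equal: I_S(f*_G) = sum w_g I_S(f_g) + sum w_g K(f_g, f*_G)
```

The corollary I_S(f*_G) ≥ ∑ w_g I_S(f_g) then follows from the nonnegativity of the KL terms.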
For a high-dimensional Wishart matrix with an unknown scale times the identity matrix as the population covariance matrix, the TP2 property (i.e., the MLR property) of the limiting empirical density function of eigenvalues is studied in Section 4.5. Although the Marčenko-Pastur equation (Silverstein [22]) links the limiting empirical spectral distribution F to the limiting behavior of the population spectral distribution H, so that one can expect to retrieve information about H from F, the difficulty lies in the fact that the relationship between H and F is entangled. Whether the result studied in Section 4.5 can be extended to the general population covariance matrix case remains to be clarified. Another focus of the random matrix literature is the Gaussian divisible ensembles. These are matrices of the form H_t = e^{−t/2}H_0 + (1 − e^{−t})^{1/2}H_G, where t > 0 is a parameter, H_0 is a Wigner matrix, and H_G is an independent Gaussian orthogonal ensemble matrix. The eigenvalue distribution of a Gaussian divisible ensemble is the same as that of the solution of a matrix-valued Ornstein-Uhlenbeck process H_t for any time t ≥ 0. The dynamics of the eigenvalues of H_t are given by the system of so-called Dyson Brownian motions (Dyson [23]). The treatment of the sample covariance matrix is analogous, but the formulas change slightly (see Erdős et al. [24] for the details). It would be interesting to study further whether, under the general model, the corresponding limiting spectral density function of the eigenvalues, obtained via the convolution of the limiting eigenvalue density of the initial matrix with the Marčenko-Pastur density, is TP2 or not. We pose these problems as projects for future study.