Article

Expected Logarithm of Central Quadratic Form and Its Use in KL-Divergence of Some Distributions

by Pourya Habib Zadeh 1 and Reshad Hosseini 1,2,*
1 School of Electrical and Computer Engineering, College of Engineering, University of Tehran, P.O. Box 14395-515, Tehran, Iran
2 School of Computer Science, Institute for Research in Fundamental Sciences (IPM), P.O. Box 19395-5746, Tehran, Iran
* Author to whom correspondence should be addressed.
Entropy 2016, 18(8), 278; https://doi.org/10.3390/e18080278
Submission received: 10 May 2016 / Revised: 13 July 2016 / Accepted: 21 July 2016 / Published: 28 July 2016
(This article belongs to the Section Information Theory, Probability and Statistics)

Abstract
In this paper, we develop three different methods for computing the expected logarithm of central quadratic forms: a series method, an integral method and a fast (but inexact) set of methods. The approach used for deriving the integral method is novel and can be used for computing the expected logarithm of other random variables. Furthermore, we derive expressions for the Kullback–Leibler (KL) divergence of elliptical gamma distributions and angular central Gaussian distributions, which turn out to be functions dependent on the expected logarithm of a central quadratic form. Through several experimental studies, we compare the performance of these methods.


1. Introduction

The expected logarithm of a random variable usually appears in the expressions of important quantities like entropy and Kullback–Leibler (KL) divergence [1,2,3]. Expected logarithms, known as second-kind moments, are important statistics used in estimation problems [4,5]. The expected logarithm also appears in an important class of inference algorithms called variational Bayesian inference [6,7]. Furthermore, the geometric mean of a random variable, which has been used in economics [8,9], is equal to the exponential of the expected logarithm of that random variable.
Central quadratic forms (CQFs) have many applications, and most of them stem from the fact that they are asymptotically equivalent to many statistics for testing null hypotheses. They are used for finding the number of components in mixtures of Gaussians [10], to test goodness-of-fit for some distributions [11] and as test statistics for dimensionality reduction in inverse regression [12].
In this paper, we develop three algorithms for computing the expected logarithm of CQFs. Special algorithms are needed because CQFs do not have a closed-form probability density function, which makes the computation of their expected logarithms difficult. Although there is a vast literature on many different ways of calculating probability distributions of CQFs (see [13,14,15,16]), we have not found any work on calculating their expected logarithms. It is worth noting that one of our three algorithms builds upon works for computing the probability density function of CQFs using a series of gamma random variables [13,14]. We also derive expressions for the KL-divergence of two distributions that are subclasses of generalized elliptical distributions. These are the zero-mean elliptical gamma (ZEG) distribution and the angular central Gaussian (ACG) distribution. The only term in their KL-divergences that cannot be computed in terms of elementary functions is an expected logarithm of a CQF, which can be computed by one of our developed algorithms.
The KL-divergence, or relative entropy, was first introduced in [17] as a generalization of Shannon's definition of information [18]. This divergence has been used extensively by statisticians and engineers. Many popular divergence classes like the f-divergence and alpha-divergence have been introduced as generalizations of this divergence [19]. It has several invariance properties, like scale invariance, that make it an interesting dissimilarity measure in statistical inference problems [20]. KL-divergence is also used as a criterion for model selection [21], hypothesis testing [22], and merging in mixture models [23,24]. Additionally, it can be used as a measure of dissimilarity in classification problems, for example, text classification [25], speech recognition [26], and texture classification [27,28].
The wide applicability of the KL-divergence as a useful dissimilarity measure persuaded us to derive the KL-divergence for two important distributions. One of them is the ZEG distribution [29], which has rich modeling power and allows heavy and light tails and different peak behaviors [30,31]. The other is the ACG distribution [32], a distribution on the unit sphere that has been used in many applications [33,34,35,36]. This distribution has many nice features; for example, its maximum likelihood estimator is asymptotically the most robust estimator of the scatter matrix of an elliptical distribution in the sense of minimizing the maximum asymptotic variance [37].

1.1. Contributions

To summarize, the key contributions of our paper are the following:
- Introducing three methods for computing the expected logarithm of a CQF.
- Proposing a procedure for computing the expected logarithm of an arbitrary positive random variable.
- Deriving expressions for the entropy and the KL-divergence of ZEG and ACG distributions (the form of the KL-divergence between ZEG distributions appeared in [38], but without its derivation).
The methods for computing the expected logarithm of a CQF differ in running time and accuracy. Two of them, namely the integral and series methods, are exact. The third is a set of fast but inexact methods. The integral method is a direct application of our proposed procedure for computing the expected logarithm of positive random variables. We propose two fast methods that are based on approximating the CQF with a gamma random variable. We show that these fast methods give upper and lower bounds on the true expected logarithm. This leads us to develop another fast method based on a convex combination of the other two fast methods. Whenever the weights of the CQF are the eigenvalues of a matrix, as in the case of the KL-divergences, the fast methods can be very efficient because they do not need eigenvalue computation.

1.2. Outline

The remainder of this paper is organized as follows. Section 2 proposes three different methods for computing the expected logarithm of a CQF. Furthermore, a theorem is stated at the beginning of this section that has a pivotal role in the first two methods. Then, we derive expressions for the KL-divergence and entropy of ZEG and ACG distributions in Section 3. Afterwards, in Section 4, multiple experiments are conducted to examine the performance of three methods for computing the aforementioned expected logarithm in terms of accuracy and computational time. Finally, Section 5 presents our conclusions. To improve the readability of the manuscript, the proofs of some theorems are presented in appendices.

2. Calculating the Expected Logarithm of a Central Quadratic Form

Suppose $N_i$ is the i-th random variable in a sequence of d independent standard normal random variables, i.e., normal random variables with zero mean and unit variance. Then the central (Gaussian) quadratic form is the following random variable:
U = \sum_{i=1}^{d} \lambda_i N_i^2,    (1)
where the $\lambda_i$ are non-negative real numbers. Note that the $N_i^2$ are chi-square random variables with one degree of freedom; therefore, this random variable is also called a weighted sum of chi-square random variables. To the best of our knowledge, the expected logarithm of the random variable U does not have a closed-form expression in terms of elementary mathematical functions. For its calculation, we propose three different approaches, namely an integral method, a series method and a set of fast methods. Each of them has its specific properties and does well in certain situations.
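As a point of reference for the methods developed below, the expected logarithm of U can always be estimated by plain Monte Carlo sampling. The following Matlab snippet is our own illustrative sketch (it is not part of the paper's implementation); `lam` denotes the vector of weights $\lambda_i$, chosen arbitrarily here.

```matlab
% Monte Carlo estimate of E[log U] for U = sum_i lambda_i * N_i^2.
% Slow but unbiased; useful only as a sanity check for the methods below.
lam = [0.2; 1.0; 3.5];                % example weights (lambda_i >= 0)
nSamples = 1e6;                       % number of Monte Carlo draws
N = randn(numel(lam), nSamples);      % independent standard normal samples
U = lam' * (N.^2);                    % realizations of U, 1 x nSamples
elogU_mc = mean(log(U));
fprintf('Monte Carlo estimate of E[log U]: %.4f\n', elogU_mc);
```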
In the following theorem, a relation between the expected logarithm of two positive random variables distributed according to arbitrary densities and the Laplace transform of these two densities is given. This theorem is used in the integral method and the fast method. Note that the assumptions of the following theorem are unrestrictive. Therefore, it can also be used for computing the expected logarithm of other positive random variables.
Theorem 1. 
Let X and Y be two positive random variables, F and G be their cumulative distribution functions, and f and g be their probability density functions. Furthermore, suppose that $\mathcal{F}$ and $\mathcal{G}$ are the Laplace transforms of f and g, respectively. If $\lim_{x\to\infty}\log(x)\,(G(x)-F(x)) = 0$, $\lim_{x\to 0^+}\log(x)\,(G(x)-F(x)) = 0$ and $\int_0^\infty \frac{\mathcal{G}(\sigma)-\mathcal{F}(\sigma)}{\sigma}\, d\sigma$ exists, then
\mathbb{E}[\log X] - \mathbb{E}[\log Y] = \int_0^\infty \frac{\mathcal{G}(\sigma)-\mathcal{F}(\sigma)}{\sigma}\, d\sigma.    (2)
Proof. 
Using the definition of the Laplace transform, we have
\mathcal{G}(s) - \mathcal{F}(s) = \int_0^\infty \big(g(x)-f(x)\big)\exp(-sx)\, dx.
Using the integration property of the Laplace transform, we have
\frac{\mathcal{G}(s)-\mathcal{F}(s)}{s} = \int_0^\infty \Big[\int_0^{x}\big(g(\tau)-f(\tau)\big)\, d\tau\Big]\exp(-sx)\, dx.
Using the frequency integration property of the Laplace transform, and the formulas $F(x) = \int_0^x f(\tau)\, d\tau$ and $G(x) = \int_0^x g(\tau)\, d\tau$, we further obtain
\int_s^\infty \frac{\mathcal{G}(\sigma)-\mathcal{F}(\sigma)}{\sigma}\, d\sigma = \int_0^\infty \frac{1}{x}\big(G(x)-F(x)\big)\exp(-sx)\, dx.
Letting s in the above equation go to zero, and since $\int_0^\infty \frac{\mathcal{G}(\sigma)-\mathcal{F}(\sigma)}{\sigma}\, d\sigma$ exists, we further obtain
\int_0^\infty \frac{\mathcal{G}(\sigma)-\mathcal{F}(\sigma)}{\sigma}\, d\sigma = \int_0^\infty \frac{1}{x}\big(G(x)-F(x)\big)\, dx.
Using the integration by parts formula with $\log(x)$ and $G(x)-F(x)$ as its parts, we have
\int_0^\infty \frac{\mathcal{G}(\sigma)-\mathcal{F}(\sigma)}{\sigma}\, d\sigma = \big[\log(x)\big(G(x)-F(x)\big)\big]_0^\infty - \int_0^\infty \log(x)\big(g(x)-f(x)\big)\, dx.    (4)
Since $\lim_{x\to\infty}\log(x)\,(G(x)-F(x)) = 0$ and $\lim_{x\to 0^+}\log(x)\,(G(x)-F(x)) = 0$, we obtain
\big[\log(x)\big(G(x)-F(x)\big)\big]_0^\infty = 0.    (5)
Hence, by using (5) in (4), we have
\int_0^\infty \log(x)\, f(x)\, dx - \int_0^\infty \log(x)\, g(x)\, dx = \int_0^\infty \frac{\mathcal{G}(\sigma)-\mathcal{F}(\sigma)}{\sigma}\, d\sigma.
From the definition of expectation, relation (2) is obtained. ☐
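As a quick numerical sanity check of Theorem 1, both sides of (2) can be evaluated for two gamma random variables, for which the expected logarithm is known in closed form. The Matlab sketch below is our own illustration (the parameter values are arbitrary), comparing the closed-form difference of expected logarithms with the integral of the Laplace-transform difference.

```matlab
% Check Theorem 1 for X ~ Gamma(kx, tx) and Y ~ Gamma(ky, ty).
% E[log X] = psi(kx) + log(tx); Laplace transform of the density: (1+t*s)^(-k).
kx = 3.0; tx = 0.5;                   % shape/scale of X (example values)
ky = 1.5; ty = 2.0;                   % shape/scale of Y (example values)
lhs = (psi(kx) + log(tx)) - (psi(ky) + log(ty));
integrand = @(s) ((1 + ty*s).^(-ky) - (1 + tx*s).^(-kx)) ./ max(s, realmin);
rhs = integral(integrand, 0, Inf, 'AbsTol', 1e-12);
fprintf('E[log X] - E[log Y] = %.10f, integral = %.10f\n', lhs, rhs);
```
The two printed numbers agree up to the integration tolerance, as the theorem predicts.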

2.1. Integral Method

In this part, we will use Theorem 1 for computing the expected logarithm of a CQF. To this end, we choose a random variable Y that has closed-form formulas for its expected logarithm and for the Laplace transform of its density. A possible candidate is the gamma random variable. The density of a gamma random variable has the following Laplace transform:
(1+\theta s)^{-k},
where k and θ are its shape and scale parameters, respectively. Also, the expected logarithm of this random variable is $\Psi(k) + \log(\theta)$, where $\Psi(\cdot)$ is the digamma function.
Using the convolution property of the Laplace transform, it is easy to see that the density function of the CQF given in (1) has the following closed-form Laplace transform:
\prod_{i=1}^{d}(1+2\lambda_i s)^{-\frac{1}{2}}.
Lemmas 2 and 3 show that a CQF and a gamma random variable satisfy the conditions of Theorem 1. For proving Lemma 2, we need Lemma 1. First of all, let us express the following trivial proposition.
Proposition 1. 
Let $X_1,\ldots,X_n$ be arbitrary real random variables. Suppose we have two many-to-one transformations $Y = h(X_1,\ldots,X_n)$ and $Z = g(X_1,\ldots,X_n)$. If the following inequality holds for all $x_i$ in the support of the random variables $X_i$:
h(x_1,\ldots,x_n) \le g(x_1,\ldots,x_n),
then we have the following inequality between the cumulative distribution functions of the random variables Y and Z:
F_Y(v) \ge F_Z(v), \quad \text{for all } v.
Lemma 1. 
Let F be the cumulative distribution function of a CQF, that is, of $\sum_{i=1}^{d}\lambda_i N_i^2$, where the $\lambda_i$ are positive real numbers and the $N_i$ are independent standard normal random variables. Also, let $G(x; k, \theta)$ be the cumulative distribution function of a gamma random variable with parameters k and θ; then the following inequalities hold:
G\Big(x; \frac{d}{2}, 2\lambda_{\max}\Big) \le F(x) \le G\Big(x; \frac{d}{2}, 2\lambda_{\min}\Big), \quad \text{for all } x \in \mathbb{R}^+,
where $\lambda_{\max} = \max\{\lambda_i\}_{i=1}^{d}$ and $\lambda_{\min} = \min\{\lambda_i\}_{i=1}^{d}$.
Proof. 
This lemma is an immediate consequence of Proposition 1 and the following relation, knowing that $\lambda\sum_{i=1}^{d} N_i^2$ is a gamma random variable with shape parameter $d/2$ and scale parameter $2\lambda$:
\lambda_{\min}\sum_{i=1}^{d} x_i^2 \le \sum_{i=1}^{d}\lambda_i x_i^2 \le \lambda_{\max}\sum_{i=1}^{d} x_i^2, \quad \text{for all } x_i \in \mathbb{R}.
 ☐
Lemma 2. 
Let G be the cumulative distribution function of an arbitrary gamma random variable and F be the cumulative distribution function of the random variable $\sum_{i=1}^{d}\lambda_i N_i^2$, where the $\lambda_i$ are positive real numbers and the $N_i$ are independent standard normal random variables; then $\lim_{x\to\infty}\log(x)\,(G(x)-F(x)) = 0$ and $\lim_{x\to 0^+}\log(x)\,(G(x)-F(x)) = 0$.
The proof of this lemma can be found in Appendix A.
Lemma 3. 
Let $\mathcal{G}$ be the Laplace transform of the probability density function of an arbitrary gamma random variable and $\mathcal{F}$ be the Laplace transform of the probability density function of $\sum_{i=1}^{d}\lambda_i N_i^2$, where the $\lambda_i$ are positive real numbers and the $N_i$ are independent standard normal random variables; then $\int_0^\infty \frac{\mathcal{G}(\sigma)-\mathcal{F}(\sigma)}{\sigma}\, d\sigma$ is convergent.
The proof of this lemma can be found in Appendix B.
According to Lemmas 2 and 3, the conditions of Theorem 1 hold by choosing X to be the CQF given in (1) and Y to be an arbitrary gamma random variable. Therefore, we can use (2) for calculating the expected logarithm of a CQF, and it is given by
\mathbb{E}\Big[\log\Big(\sum_{i=1}^{d}\lambda_i N_i^2\Big)\Big] = \Psi(k) + \log(\theta) + \int_0^\infty \frac{(1+\theta\sigma)^{-k} - \prod_{i=1}^{d}(1+2\lambda_i\sigma)^{-\frac{1}{2}}}{\sigma}\, d\sigma.    (13)
The above equation holds for any choice of positive scalars k and θ. To the best of our knowledge, the above integral does not have a closed-form solution, so it must be computed numerically. This can be done using a variety of techniques available for one-dimensional integrals (see, for example, [39]).
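As an illustration, the integral method can be implemented in a few lines of Matlab. The sketch below is our own hedged implementation (the function name is ours); the free parameters k and θ are set to the values that will also be used in the first fast approximation (14), which keeps the magnitude of the integral term small.

```matlab
function val = explog_cqf_integral(lam)
% Expected logarithm of a CQF via the integral formula (13).
% lam: vector of positive weights lambda_i (e.g., eigenvalues of Sigma).
    lam = lam(:);
    d = numel(lam);
    k = d/2;                              % shape of the reference gamma
    theta = 2*sum(lam)/d;                 % scale of the reference gamma
    % Integrand of (13); max(s, realmin) avoids 0/0 at s = 0, where the
    % numerator vanishes anyway.
    integrand = @(s) ((1 + theta*s).^(-k) - prod(1 + 2*lam*s, 1).^(-1/2)) ...
                     ./ max(s, realmin);
    val = psi(k) + log(theta) + integral(integrand, 0, Inf, 'AbsTol', 1e-10);
end
```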

2.2. Fast Methods

The integral method explained in the previous part can be computationally expensive for some applications. For such applications, we derive three approximations that can be calculated analytically and are therefore much faster.
Using a first- or higher-order Taylor expansion around $\mathbb{E}[U]$ to approximate the expected logarithm of U has already been proposed in the literature [6,40]. However, we observed that low-order Taylor expansions do not give very accurate approximations. Therefore, we use two different approximations, for which we show that they provide a lower and an upper bound for the true expected logarithm. Finally, a convex combination of these two is used to obtain the final approximation.
Two different gamma distributions have been used in [15,41,42] to approximate a CQF. Since the expected logarithm of a gamma random variable has a closed-form solution, we use the expected logarithm of these gamma random variables to approximate the expected logarithm of a CQF. A further justification for this idea can be given based on (13), by choosing the shape and the scale parameters of the gamma distribution such that the magnitude of the integral part in (13) becomes small.
Since the weights of the CQF in the KL-divergence formulas of Section 3 are the eigenvalues of a positive definite matrix Σ, we express the approximations in terms of this matrix. Expressing the approximations this way also makes clear that the eigenvalues do not need to be calculated, which is a further computational benefit of these approximations. The shape and scale parameters of the first approximating gamma random variable are $d/2$ and $2\operatorname{tr}(\Sigma)/d$, respectively. Therefore, for the first fast approximation we have
\mathbb{E}\Big[\log\Big(\sum_{i=1}^{d}\lambda_i N_i^2\Big)\Big] \approx \Psi\Big(\frac{d}{2}\Big) + \log\Big(\frac{2\operatorname{tr}(\Sigma)}{d}\Big).    (14)
The shape and scale parameters of the gamma random variable for the second approximation are $\operatorname{tr}(\Sigma)^2/(2\operatorname{tr}(\Sigma^2))$ and $2\operatorname{tr}(\Sigma^2)/\operatorname{tr}(\Sigma)$, respectively. Then, we obtain the following formula for the second fast approximation:
\mathbb{E}\Big[\log\Big(\sum_{i=1}^{d}\lambda_i N_i^2\Big)\Big] \approx \Psi\Big(\frac{\operatorname{tr}(\Sigma)^2}{2\operatorname{tr}(\Sigma^2)}\Big) + \log\Big(\frac{2\operatorname{tr}(\Sigma^2)}{\operatorname{tr}(\Sigma)}\Big).
The following theorem shows that these approximations are lower and upper bounds to the true expected logarithm.
Theorem 2. 
If $U = \sum_{i=1}^{d}\lambda_i N_i^2$, where the $\lambda_i$ are the eigenvalues of a positive definite matrix $\Sigma \in \mathbb{R}^{d\times d}$ and the $N_i$ are independent standard normal random variables, then
\Psi\Big(\frac{\operatorname{tr}(\Sigma)^2}{2\operatorname{tr}(\Sigma^2)}\Big) + \log\Big(\frac{2\operatorname{tr}(\Sigma^2)}{\operatorname{tr}(\Sigma)}\Big) \le \mathbb{E}[\log(U)] \le \Psi\Big(\frac{d}{2}\Big) + \log\Big(\frac{2\operatorname{tr}(\Sigma)}{d}\Big).
The proof of this theorem can be found in Appendix C.
From this theorem, we can conclude that there exist convex combinations of the two previously mentioned approximations that perform at least as well as either of them, in the sense that they are closer to the true expected logarithm. Therefore, we define the third fast approximation to be
\mathbb{E}\Big[\log\Big(\sum_{i=1}^{d}\lambda_i N_i^2\Big)\Big] \approx (1-l)\Big[\Psi\Big(\frac{d}{2}\Big) + \log\Big(\frac{2\operatorname{tr}(\Sigma)}{d}\Big)\Big] + l\Big[\Psi\Big(\frac{\operatorname{tr}(\Sigma)^2}{2\operatorname{tr}(\Sigma^2)}\Big) + \log\Big(\frac{2\operatorname{tr}(\Sigma^2)}{\operatorname{tr}(\Sigma)}\Big)\Big].
To determine the parameter $l \in [0,1]$ in the above equation, we used least squares fitting on thousands of positive definite matrices with different dimensions and unit trace, sampled uniformly according to an algorithm given in [43]. We observed that the fitted value is roughly $l = 0.7$ and that the dimensionality has a negligible effect on the best value of l. For the case of $d = 20$, the mean squared error for various values of l can be seen in Figure 1.
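A possible Matlab sketch of the three fast approximations is given below (our own illustration; the function name and interface are ours). Note that only the traces of Σ and Σ² are required, so no eigendecomposition is performed.

```matlab
function [approx3, lower, upper] = explog_cqf_fast(Sigma, l)
% Fast approximations of E[log(sum_i lambda_i N_i^2)], lambda_i = eig(Sigma).
% Only traces of Sigma and Sigma^2 are needed; no eigenvalue computation.
    if nargin < 2, l = 0.7; end                  % fitted combination weight
    d  = size(Sigma, 1);
    t1 = trace(Sigma);
    t2 = trace(Sigma*Sigma);                     % equals sum of lambda_i^2
    upper = psi(d/2) + log(2*t1/d);              % first fast method, Eq. (14)
    lower = psi(t1^2/(2*t2)) + log(2*t2/t1);     % second fast method
    approx3 = (1 - l)*upper + l*lower;           % third fast method
end
```
By Theorem 2, the second and first approximations returned here bracket the true expected logarithm from below and above, respectively.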

2.3. Series Method

One can represent the probability density function of a CQF given by (1) as an infinite weighted sum of gamma densities [13,14],
f_U(u) = \sum_{j=0}^{\infty} c_j\, g\Big(u; \frac{d}{2}+j, 2\beta\Big),
where $g(u; d/2+j, 2\beta)$ is the probability density function of a gamma random variable with parameters $d/2+j$ and $2\beta$, and
c_k = \frac{1}{k}\sum_{r=0}^{k-1} v_{k-r}\, c_r,
v_k = \frac{1}{2}\sum_{j=1}^{d}\big(1-\beta\lambda_j^{-1}\big)^{k},
c_0 = \prod_{j=1}^{d}\big(\beta\lambda_j^{-1}\big)^{\frac{1}{2}}.
This result can be used for deriving a series formula for the expected logarithm of U. Thus,
\mathbb{E}[\log U] = \sum_{j=0}^{\infty} c_j\,\Psi\Big(\frac{d}{2}+j\Big) + \log(2\beta).    (22)
Ruben [13] analyzed the effect of various βs on the behavior of the series expansion and proposed the following β as an appropriate one:
\beta = \frac{2\lambda_{\max}\lambda_{\min}}{\lambda_{\max}+\lambda_{\min}}.    (23)
By using this β, $\sum_{j=0}^{\infty} c_j = 1$ holds [13]; also knowing that the following relation holds for the digamma function:
\Psi(x+1) = \frac{1}{x} + \Psi(x),
(22) can be simplified to
\mathbb{E}[\log U] = \log(2\beta) + c_0\,\Psi\Big(\frac{d}{2}\Big) + \sum_{j=1}^{\infty} c_j\Big[\Psi\Big(\frac{d}{2}\Big) + \sum_{l=0}^{j-1}\frac{1}{\frac{d}{2}+l}\Big] = \log(2\beta) + \Psi\Big(\frac{d}{2}\Big) + \sum_{j=1}^{\infty} c_j\sum_{l=0}^{j-1}\frac{1}{\frac{d}{2}+l} = \log(2\beta) + \Psi\Big(\frac{d}{2}\Big) + \sum_{i=0}^{\infty}\frac{1}{\frac{d}{2}+i}\sum_{k=i+1}^{\infty} c_k = \log(2\beta) + \Psi\Big(\frac{d}{2}\Big) + \sum_{i=0}^{\infty}\frac{1}{\frac{d}{2}+i}\Big(1-\sum_{k=0}^{i} c_k\Big).
To approximate this formula, we truncate the series, which means we only use a finite number of terms to evaluate the expectation:
\hat{\mathbb{E}}[\log U] = \log(2\beta) + \Psi\Big(\frac{d}{2}\Big) + \sum_{i=0}^{L-1}\frac{1}{\frac{d}{2}+i}\Big(1-\sum_{k=0}^{i} c_k\Big).    (25)
For this approximation, it is possible to compute an error bound which is expressed by the following lemma.
Lemma 4. 
The error of the approximation (25), for $L > d\epsilon/(2-2\epsilon)$, is bounded as follows:
\mathbb{E}[\log U] - \hat{\mathbb{E}}[\log U] \le \frac{c_0\,\epsilon^{L+1}}{\big(1-\frac{d/2+L}{L}\,\epsilon\big)^{2}}\;\frac{\Gamma(\frac{d}{2}+L)}{\Gamma(\frac{d}{2}+1)\,(L+1)!},    (26)
where $\Gamma(\cdot)$ is the gamma function, and $\epsilon = (\lambda_{\max}-\lambda_{\min})/(\lambda_{\max}+\lambda_{\min})$.
The proof of this lemma is in Appendix D.
By using this bound, it is possible to calculate the expected logarithm to a given accuracy by selecting an appropriate L. Note that the upper bound given by (26) grows with ϵ, and ϵ in turn increases with the ratio $\lambda_{\max}/\lambda_{\min}$. As we will see in the simulation studies, when the ratio $\lambda_{\max}/\lambda_{\min}$ as well as the dimensionality d are small, this method performs better than the integral method.
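The series method is also straightforward to implement from (23), the recursion for the coefficients $c_k$, and the truncated sum (25). The Matlab sketch below is our own illustration (the function name is ours); the truncation length L is an input and can be chosen in practice using the bound (26).

```matlab
function val = explog_cqf_series(lam, L)
% Truncated series approximation (25) of E[log U] for U = sum_i lambda_i N_i^2.
% lam: vector of positive weights; L: truncation length (cf. the bound (26)).
    lam = lam(:);
    d = numel(lam);
    beta = 2*max(lam)*min(lam)/(max(lam) + min(lam));   % Eq. (23)
    c = zeros(L+1, 1);
    c(1) = prod(sqrt(beta ./ lam));                     % c_0
    v = zeros(L, 1);
    for k = 1:L
        v(k) = 0.5*sum((1 - beta ./ lam).^k);
        % c_k = (1/k) * sum_{r=0}^{k-1} v_{k-r} c_r (Matlab indices shifted by 1)
        c(k+1) = (1/k) * sum(v(k:-1:1) .* c(1:k));
    end
    partialSums = cumsum(c);                            % sum_{k=0}^{i} c_k
    i = (0:L-1)';
    val = log(2*beta) + psi(d/2) + sum((1 - partialSums(1:L)) ./ (d/2 + i));
end
```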

3. KL-Divergence of Two Generalized Elliptical Distributions

In this section, we derive expressions for the KL-divergence and the entropy of two subclasses of generalized elliptical distributions, namely ZEG and ACG distributions [44]. We first start by reviewing some related materials.

3.1. Some Background on the Elliptical Distributions

Suppose the d-dimensional random vector X is distributed according to a zero-mean elliptically contoured (ZEC) distribution with a positive definite scatter matrix $\Sigma \in \mathbb{R}^{d\times d}$, that is, $X \sim \mathrm{ZEC}(\Sigma, \varphi)$. The probability density function of X is given by
f_x(\boldsymbol{x}; \Sigma, \varphi) = |\Sigma|^{-\frac{1}{2}}\,\varphi\big(\boldsymbol{x}^{\top}\Sigma^{-1}\boldsymbol{x}\big),    (27)
for some density generator function $\varphi : \mathbb{R}^+ \to \mathbb{R}$. We know that we can decompose the vector X into a uniform hyper-spherical component and a scaled radial component, so that $X = \Sigma^{1/2} R\, U$, where U is uniformly distributed over the unit sphere $S^{d-1}$ and R is a univariate random variable given by $R = \|\Sigma^{-1/2}X\|_2$ [45]. Then, the random variable R has the density
f_R(r) = 2\pi^{\frac{d}{2}}\,\varphi(r^2)\, r^{d-1} \big/ \Gamma\big(\tfrac{d}{2}\big).
Therefore, the square radial component $\Upsilon = R^2$ has the following density:
f_\Upsilon(\upsilon) = \pi^{\frac{d}{2}}\,\varphi(\upsilon)\,\upsilon^{\frac{d}{2}-1} \big/ \Gamma\big(\tfrac{d}{2}\big).    (29)
A ZEG is a ZEC whose square radial component is distributed according to a gamma distribution, $\Upsilon \sim \mathrm{Gamma}(a, b)$. A gamma-distributed random variable has the density
f_\Upsilon(\upsilon) = \Gamma(a)^{-1}\, b^{-a}\, \upsilon^{a-1}\exp(-\upsilon/b),    (30)
where a is a shape parameter and b is a scale parameter. So the probability density function of a d-dimensional random variable $X \sim \mathrm{ZEG}(\Sigma, a, b)$ is given by
f_x(\boldsymbol{x}; \Sigma, a, b) = \frac{\Gamma(\frac{d}{2})}{\pi^{\frac{d}{2}}\,\Gamma(a)\, b^{a}}\, |\Sigma|^{-\frac{1}{2}}\, \big(\boldsymbol{x}^{\top}\Sigma^{-1}\boldsymbol{x}\big)^{a-\frac{d}{2}} \exp\big(-b^{-1}\boldsymbol{x}^{\top}\Sigma^{-1}\boldsymbol{x}\big),
where $\boldsymbol{x} \in \mathbb{R}^d$ and $\Sigma \succ 0$ is its scatter matrix; also, $a, b > 0$ are its shape and scale parameters [31].
When a ZEC random variable is projected onto the unit sphere, the resulting random variable is called ACG and denoted by $X \sim \mathrm{ACG}(\Sigma)$. Unlike many other distributions on the unit sphere, this distribution has a nice closed-form density given by
f_x(\boldsymbol{x}; \Sigma) = \frac{\Gamma(\frac{d}{2})}{2\pi^{\frac{d}{2}}}\, |\Sigma|^{-\frac{1}{2}}\, \big(\boldsymbol{x}^{\top}\Sigma^{-1}\boldsymbol{x}\big)^{-\frac{d}{2}},
where $\boldsymbol{x} \in S^{d-1}$ and $\Sigma \succ 0$ is its scatter matrix.
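For later reference, the two densities above translate directly into log-density evaluations. The Matlab sketch below is our own illustration (both function names are ours); it can be used, for instance, to verify the KL-divergence expressions of the next subsections by sampling.

```matlab
function lp = zeg_logpdf(x, Sigma, a, b)
% Log-density of ZEG(Sigma, a, b) at a column vector x in R^d.
    d = numel(x);
    q = x' * (Sigma \ x);                      % x' * inv(Sigma) * x
    lp = gammaln(d/2) - (d/2)*log(pi) - gammaln(a) - a*log(b) ...
         - 0.5*log(det(Sigma)) + (a - d/2)*log(q) - q/b;
end

function lp = acg_logpdf(x, Sigma)
% Log-density of ACG(Sigma) at a unit vector x on the sphere S^{d-1}.
    d = numel(x);
    q = x' * (Sigma \ x);
    lp = gammaln(d/2) - log(2) - (d/2)*log(pi) ...
         - 0.5*log(det(Sigma)) - (d/2)*log(q);
end
```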

3.2. KL-Divergence between ZEG Distributions

Suppose we have two probability distributions P and Q with probability density functions p and q. The KL-divergence between these two distributions is defined by
\mathrm{KL}(P\,\|\,Q) = \int \log(p(\boldsymbol{x}))\, p(\boldsymbol{x})\, d\boldsymbol{x} - \int \log(q(\boldsymbol{x}))\, p(\boldsymbol{x})\, d\boldsymbol{x}.    (33)
The negative of the first part, $H(X) = -\int \log(p(\boldsymbol{x}))\, p(\boldsymbol{x})\, d\boldsymbol{x}$, is the entropy, and the second part, $-\mathbb{E}[\log(q(X))] = -\int \log(q(\boldsymbol{x}))\, p(\boldsymbol{x})\, d\boldsymbol{x}$, is the averaged log-loss term, where X is a random variable distributed according to P.
The following lemma gives a general expression for the KL-divergence between two ZEC distributions. It is then used for deriving the KL-divergence between two ZEG distributions.
Lemma 5. 
Suppose we have two probability distributions for the random variable Y, $P_Y = \mathrm{ZEC}(\Sigma_1, \varphi)$ and $Q_Y = \mathrm{ZEC}(\Sigma_2, \varphi')$; then the KL-divergence between these two distributions is given by the following expression:
\mathrm{KL}(P\,\|\,Q) = -\frac{1}{2}\log(|\Sigma|) + \int_0^\infty \log\big(\upsilon^{1-\frac{d}{2}} f_\Upsilon(\upsilon)\big)\, f_\Upsilon(\upsilon)\, d\upsilon - \int_0^\infty\!\!\int_0^\infty \frac{1}{r}\, f_{\mathrm{wd}}\big(\tfrac{\upsilon}{r}\big)\, f_\Upsilon(r)\, \log\big(\upsilon^{1-\frac{d}{2}} f'_\Upsilon(\upsilon)\big)\, d\upsilon\, dr,    (34)
where $f_\Upsilon$ and $f'_\Upsilon$ are the densities of the square radial components of distributions P and Q, respectively. Also, $f_{\mathrm{wd}}$ is the density of $\sum_{i=1}^{d}\lambda_i N_i^2 \big/ \sum_{i=1}^{d} N_i^2$, where the $N_i$ are independent standard normal random variables, and $\lambda_1,\ldots,\lambda_d$ are the eigenvalues of the matrix $\Sigma = \Sigma_2^{-1/2}\Sigma_1\Sigma_2^{-1/2}$.
Proof. 
The KL-divergence is known to be invariant under invertible transformations of the random variable Y [46]. To simplify the derivations, we apply the linear transformation $X = \Sigma_2^{-1/2} Y$, which makes the scatter matrix of the second distribution the identity. By this change of variable, the problem becomes that of computing the KL-divergence between $P_X = \mathrm{ZEC}(\Sigma, \varphi)$ and $Q_X = \mathrm{ZEC}(I, \varphi')$, where $\Sigma = \Sigma_2^{-1/2}\Sigma_1\Sigma_2^{-1/2}$.
As expressed in (33), the KL-divergence is the subtraction of the entropy from the averaged log-loss. Firstly, let us derive the entropy of X having distribution $P_X$, that is, $H(X) = -\int \log(p(\boldsymbol{x}))\, p(\boldsymbol{x})\, d\boldsymbol{x}$.
Assume the change of integration variable $\boldsymbol{y} = \Sigma^{-1/2}\boldsymbol{x}$ and use (27); then we obtain the following expression for $H(X)$:
H(X) = \frac{1}{2}\log(|\Sigma|) - \int \log\big(\varphi(\boldsymbol{y}^{\top}\boldsymbol{y})\big)\, \varphi(\boldsymbol{y}^{\top}\boldsymbol{y})\, d\boldsymbol{y}.
Let $r = \|\boldsymbol{y}\|_2$ and recall that the area of a sphere in dimension d with radius r equals $2 r^{d-1}\pi^{d/2}/\Gamma(d/2)$; thus
H(X) = \frac{1}{2}\log(|\Sigma|) - \int_0^\infty \log\big(\varphi(r^2)\big)\, \frac{2\pi^{\frac{d}{2}}}{\Gamma(\frac{d}{2})}\, r^{d-1}\,\varphi(r^2)\, dr.
Using the change of variable $\upsilon = r^2$ and replacing φ by the square radial density $f_\Upsilon$ as expressed in (29), we obtain
H(X) = \frac{1}{2}\log(|\Sigma|) - \int_0^\infty \log\Big(\frac{\Gamma(\frac{d}{2})}{\pi^{\frac{d}{2}}}\,\upsilon^{1-\frac{d}{2}}\, f_\Upsilon(\upsilon)\Big)\, f_\Upsilon(\upsilon)\, d\upsilon.    (35)
Now, we derive an expression for the averaged log-loss given by $-\mathbb{E}[\log(q(X))] = -\mathbb{E}[\log\varphi'(X^{\top}X)]$. The argument of the function φ' is $X^{\top}X$; therefore, it is enough to compute the expectation of the function over the new random variable $Z = \|X\|_2^2$:
-\mathbb{E}[\log(q(X))] = -\mathbb{E}_Z[\log\varphi'(Z)] = -\int_0^\infty \log\big(\varphi'(\upsilon)\big)\, f_Z(\upsilon)\, d\upsilon,    (36)
where $f_Z$ is the density of $Z = \|X\|_2^2$, wherein $X \sim \mathrm{ZEC}(\Sigma, \varphi)$. It is easy to see that the random variable Z can equally be written as $Z = \bar{X}^{\top}\Sigma\bar{X}$, where $\bar{X} \sim \mathrm{ZEC}(I, \varphi)$. The density of Z with this representation has already been reported in [47]:
f_Z(z) = \int_0^\infty \frac{1}{r}\, f_{\mathrm{wd}}(z/r)\, f_\Upsilon(r)\, dr,    (37)
where $f_\Upsilon$ is the square radial density of $P_Y$, and $f_{\mathrm{wd}}$ is the density of a linear combination of the components of a Dirichlet random variable,
\Lambda = \sum_{j=1}^{s} l_j D_j,    (38)
where $D = (D_1,\ldots,D_s)$ is a Dirichlet random variable with parameters $(r_1/2,\ldots,r_s/2)$, and the $l_j$ are the s distinct eigenvalues of the positive definite matrix Σ with respective multiplicities $r_j$, for $j = 1,\ldots,s$.
It is known that if random variables $C_1,\ldots,C_s$ are independent chi-square random variables having $r_1,\ldots,r_s$ degrees of freedom, and $C = \sum_{j=1}^{s} C_j$, then $(C_1/C,\ldots,C_s/C)$ is a Dirichlet random variable with the parameters $(r_1/2,\ldots,r_s/2)$ [48]. Hence, the random variable Λ in (38) can be expressed as $\Lambda = \sum_{j=1}^{s} l_j C_j / C$. Equivalently, if $N_1,\ldots,N_d$ are independent standard normal random variables, then Λ can be written as
\Lambda = \frac{\sum_{i=1}^{d}\lambda_i N_i^2}{\sum_{i=1}^{d} N_i^2}.
Using (37) in (36) and replacing φ' by the square radial density $f'_\Upsilon$ as expressed in (29), we obtain the following expression for the averaged log-loss:
-\mathbb{E}[\log(q(X))] = -\int_0^\infty\!\!\int_0^\infty \frac{1}{r}\, f_{\mathrm{wd}}(\upsilon/r)\, f_\Upsilon(r)\, \log\Big(\frac{\Gamma(\frac{d}{2})}{\pi^{\frac{d}{2}}}\,\upsilon^{1-\frac{d}{2}}\, f'_\Upsilon(\upsilon)\Big)\, d\upsilon\, dr.    (40)
Subtracting (35) from (40), we obtain (34). ☐
So far, we have derived an expression for the KL-divergence between two ZEC distributions. For the case of ZEG distributions, the KL-divergence can be simplified further to avoid computing the double integral; the following theorem proves this.
Theorem 3. 
Suppose we have two distributions $P_Y = \mathrm{ZEG}(\Sigma_1, a_p, b_p)$ and $Q_Y = \mathrm{ZEG}(\Sigma_2, a_q, b_q)$; then the entropy of the random variable Y distributed according to $P_Y$ and the KL-divergence between these two distributions are given by the following expressions:
H(Y) = \frac{1}{2}\log(|\Sigma_1|) - \log\Big(\frac{\Gamma(\frac{d}{2})}{\pi^{\frac{d}{2}}\,\Gamma(a_p)\, b_p^{a_p}}\Big) - \Big(a_p-\frac{d}{2}\Big)\big(\Psi(a_p)+\log(b_p)\big) + a_p,    (41)
\mathrm{KL}(P\,\|\,Q) = -\frac{1}{2}\log(|\Sigma|) + \log\Big(\frac{\Gamma(a_q)\, b_q^{a_q}}{\Gamma(a_p)\, b_p^{a_q}}\Big) + (a_p-a_q)\,\Psi(a_p) - a_p + \frac{a_p b_p}{d\, b_q}\operatorname{tr}(\Sigma) + \Big(\frac{d}{2}-a_q\Big)\Big(\mathbb{E}\Big[\log\Big(\sum_{i=1}^{d}\lambda_i N_i^2\Big)\Big] - \Psi\Big(\frac{d}{2}\Big) - \log(2)\Big),    (42)
where $\Psi(\cdot)$ is the digamma function, and $\operatorname{tr}(\cdot)$ is the trace of a matrix. Also, the $N_i$ are independent standard normal random variables, and $\lambda_1,\ldots,\lambda_d$ are the eigenvalues of the matrix $\Sigma = \Sigma_2^{-1/2}\Sigma_1\Sigma_2^{-1/2}$.
Proof. 
As in the previous lemma, we apply the change of variable $X = \Sigma_2^{-1/2} Y$ and compute the KL-divergence between the transformed distributions. The expression for the entropy (35) in the case of ZEG distributions becomes
H(X) = \frac{1}{2}\log(|\Sigma|) - \int_0^\infty \log\Big(\frac{\Gamma(\frac{d}{2})}{\pi^{\frac{d}{2}}\,\Gamma(a_p)\, b_p^{a_p}}\,\upsilon^{a_p-\frac{d}{2}}\exp\Big(-\frac{\upsilon}{b_p}\Big)\Big)\, \frac{1}{\Gamma(a_p)\, b_p^{a_p}}\,\upsilon^{a_p-1}\exp\Big(-\frac{\upsilon}{b_p}\Big)\, d\upsilon.    (43)
Next, recall the following gamma-function identities [49]:
\frac{1}{\Gamma(a+1)\, b^{a+1}}\int_0^\infty r^{a}\exp\Big(-\frac{r}{b}\Big)\, dr = 1,    (44)
\frac{1}{\Gamma(a+1)\, b^{a+1}}\int_0^\infty \log(r)\, r^{a}\exp\Big(-\frac{r}{b}\Big)\, dr = \Psi(a+1) + \log(b).    (45)
Using (44) and (45), we can simplify (43) to obtain
H(X) = \frac{1}{2}\log(|\Sigma|) - \log\Big(\frac{\Gamma(\frac{d}{2})}{\pi^{\frac{d}{2}}\,\Gamma(a_p)\, b_p^{a_p}}\Big) + \Big(\frac{d}{2}-a_p\Big)\big(\Psi(a_p)+\log(b_p)\big) + a_p.
Since $Y = \Sigma_2^{1/2} X$, we can trivially derive the expression for $H(Y)$ given in (41).
For deriving the averaged log-loss term, we obtain the following expression by substituting the gamma square radial component (30) into (40):
-\mathbb{E}[\log(q(X))] = -\int_0^\infty\!\!\int_0^\infty \frac{1}{r}\, f_{\mathrm{wd}}(\upsilon/r)\, \frac{r^{a_p-1}}{\Gamma(a_p)\, b_p^{a_p}}\exp\Big(-\frac{r}{b_p}\Big)\, \log\Big(\frac{\Gamma(\frac{d}{2})}{\pi^{\frac{d}{2}}\,\Gamma(a_q)\, b_q^{a_q}}\,\upsilon^{a_q-\frac{d}{2}}\exp\Big(-\frac{\upsilon}{b_q}\Big)\Big)\, d\upsilon\, dr.
We apply the change of variable $\mu = \upsilon/r$ and express the integrals in terms of the new variables μ and r,
-\mathbb{E}[\log(q(X))] = -\log\Big(\Gamma\Big(\frac{d}{2}\Big)\Big) + \log\Big(\pi^{\frac{d}{2}}\,\Gamma(a_q)\, b_q^{a_q}\Big) + \frac{\frac{d}{2}-a_q}{\Gamma(a_p)\, b_p^{a_p}}\int_0^\infty f_{\mathrm{wd}}(\mu)\log(\mu)\, d\mu \int_0^\infty r^{a_p-1}\exp\Big(-\frac{r}{b_p}\Big)\, dr + \frac{\frac{d}{2}-a_q}{\Gamma(a_p)\, b_p^{a_p}}\int_0^\infty f_{\mathrm{wd}}(\mu)\, d\mu \int_0^\infty \log(r)\, r^{a_p-1}\exp\Big(-\frac{r}{b_p}\Big)\, dr + \frac{1}{b_q\,\Gamma(a_p)\, b_p^{a_p}}\int_0^\infty f_{\mathrm{wd}}(\mu)\,\mu\, d\mu \int_0^\infty r^{a_p}\exp\Big(-\frac{r}{b_p}\Big)\, dr.
Using the equalities (44) and (45), we obtain
-\mathbb{E}[\log(q(X))] = -\log\Big(\Gamma\Big(\frac{d}{2}\Big)\Big) + \log\Big(\pi^{\frac{d}{2}}\,\Gamma(a_q)\, b_q^{a_q}\Big) + \Big(\frac{d}{2}-a_q\Big)\big(\Psi(a_p)+\log(b_p)\big) + \frac{a_p b_p}{b_q}\int_0^\infty f_{\mathrm{wd}}(\mu)\,\mu\, d\mu + \Big(\frac{d}{2}-a_q\Big)\int_0^\infty f_{\mathrm{wd}}(\mu)\log(\mu)\, d\mu,
where, similar to the previous lemma, $f_{\mathrm{wd}}$ is the density of the random variable $\Lambda = \sum_{i=1}^{d}\lambda_i N_i^2 \big/ \sum_{i=1}^{d} N_i^2$, the $N_i$ are independent standard normal random variables, and $\lambda_1,\ldots,\lambda_d$ are the eigenvalues of the matrix Σ. Subtracting the entropy from the averaged log-loss and knowing that $\int_0^\infty f_{\mathrm{wd}}(\mu)\,\mu\, d\mu = \mathbb{E}[\Lambda]$ and $\int_0^\infty f_{\mathrm{wd}}(\mu)\log(\mu)\, d\mu = \mathbb{E}[\log(\Lambda)]$, we obtain
\mathrm{KL}(P\,\|\,Q) = -\frac{1}{2}\log(|\Sigma|) + \log\Big(\frac{\Gamma(a_q)\, b_q^{a_q}}{\Gamma(a_p)\, b_p^{a_q}}\Big) + (a_p-a_q)\,\Psi(a_p) - a_p + \frac{a_p b_p}{b_q}\,\mathbb{E}[\Lambda] + \Big(\frac{d}{2}-a_q\Big)\,\mathbb{E}[\log(\Lambda)].    (46)
The moments of Λ were computed in [47], but we give a simple derivation of the first moment below. It is known that the random variable $V_i = N_i^2 \big/ \sum_{j=1}^{d} N_j^2$ has the following beta distribution:
V_i \sim \mathrm{Beta}\Big(\frac{1}{2}, \frac{d-1}{2}\Big).
Since $\mathbb{E}[V_i] = 1/d$, then
\mathbb{E}[\Lambda] = \mathbb{E}\Big[\sum_{i=1}^{d}\lambda_i V_i\Big] = \frac{\operatorname{tr}(\Sigma)}{d}.    (48)
The expected logarithm of Λ can be expressed as a difference of two expectations:
\mathbb{E}[\log(\Lambda)] = \mathbb{E}\Big[\log\Big(\sum_{i=1}^{d}\lambda_i N_i^2\Big)\Big] - \mathbb{E}\Big[\log\Big(\sum_{i=1}^{d} N_i^2\Big)\Big].
Using the fact that the expected logarithm of a chi-square random variable with d degrees of freedom is equal to $\Psi(d/2)+\log(2)$, $\mathbb{E}[\log(\Lambda)]$ can be computed by the following equation:
\mathbb{E}[\log(\Lambda)] = \mathbb{E}\Big[\log\Big(\sum_{i=1}^{d}\lambda_i N_i^2\Big)\Big] - \Psi\Big(\frac{d}{2}\Big) - \log(2).    (50)
Substituting (48) and (50) into (46), we obtain (42). ☐
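Expression (42) translates directly into code once one of the methods of Section 2 is available for the expected-logarithm term. The Matlab sketch below is our own illustration; it assumes the integral-method helper `explog_cqf_integral` sketched earlier, but any of the three methods could be substituted.

```matlab
function kl = kl_zeg(Sigma1, ap, bp, Sigma2, aq, bq)
% KL-divergence between ZEG(Sigma1, ap, bp) and ZEG(Sigma2, aq, bq), Eq. (42).
    d    = size(Sigma1, 1);
    iS2h = inv(sqrtm(Sigma2));                 % Sigma2^(-1/2)
    S    = iS2h * Sigma1 * iS2h;               % the matrix Sigma in Eq. (42)
    lam  = eig(S);
    elog = explog_cqf_integral(lam);           % E[log(sum_i lambda_i N_i^2)]
    kl = -0.5*sum(log(lam)) ...
         + gammaln(aq) + aq*log(bq) - gammaln(ap) - aq*log(bp) ...
         + (ap - aq)*psi(ap) - ap ...
         + ap*bp/(d*bq)*trace(S) ...
         + (d/2 - aq)*(elog - psi(d/2) - log(2));
end
```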

3.3. KL-Divergence between ACG Distributions

The following theorem gives expressions for the KL-divergence between ACG distributions and the entropy of a single ACG distribution.
Theorem 4. 
Suppose we have two probability distributions $G_Y = \mathrm{ACG}(\Sigma_1)$ and $J_Y = \mathrm{ACG}(\Sigma_2)$; then the entropy of the random variable Y distributed according to $G_Y$ and the KL-divergence between these two distributions are given by the following expressions:
H(Y) = \log\Big(\frac{2\pi^{\frac{d}{2}}\,|\Sigma_1|^{1/2}}{\Gamma(\frac{d}{2})}\Big) - \frac{d}{2}\Big(\mathbb{E}\Big[\log\Big(\sum_{i=1}^{d}\sigma_i N_i^2\Big)\Big] - \Psi\Big(\frac{d}{2}\Big) - \log(2)\Big),    (51)
\mathrm{KL}(G\,\|\,J) = -\frac{1}{2}\log(|\Sigma|) + \frac{d}{2}\Big(\mathbb{E}\Big[\log\Big(\sum_{i=1}^{d}\lambda_i N_i^2\Big)\Big] - \Psi\Big(\frac{d}{2}\Big) - \log(2)\Big),    (52)
where the $N_i$ are independent standard normal random variables, $\lambda_1,\ldots,\lambda_d$ are the eigenvalues of the matrix $\Sigma = \Sigma_2^{-1/2}\Sigma_1\Sigma_2^{-1/2}$, and $\sigma_1,\ldots,\sigma_d$ are the eigenvalues of the matrix $\Sigma_1$.
Proof. 
Due to the invariance property of the KL-divergence under invertible changes of variables, we use the change of variable $\Omega = \Sigma_1^{-1/2} Y$. It is easy to verify that Ω is distributed according to a zero-mean generalized elliptical distribution with identity covariance [44]. From the definition of KL-divergence given by (33), we have
\mathrm{KL}(G\,\|\,J) = \mathbb{E}\Big[\log\Big(\frac{\Gamma(\frac{d}{2})}{2\pi^{\frac{d}{2}}}\,\big(\Omega^{\top}\Omega\big)^{-\frac{d}{2}}\Big)\Big] - \mathbb{E}\Big[\log\Big(\frac{\Gamma(\frac{d}{2})}{2\pi^{\frac{d}{2}}}\,|\tilde{\Sigma}|^{-1/2}\big(\Omega^{\top}\tilde{\Sigma}^{-1}\Omega\big)^{-\frac{d}{2}}\Big)\Big],
where $\tilde{\Sigma} = \Sigma_1^{-1/2}\Sigma_2\Sigma_1^{-1/2}$. After some simplification, it is immediate that
\mathrm{KL}(G\,\|\,J) = \frac{1}{2}\log(|\tilde{\Sigma}|) + \frac{d}{2}\,\mathbb{E}\Big[\log\Big(\frac{\Omega^{\top}\tilde{\Sigma}^{-1}\Omega}{\Omega^{\top}\Omega}\Big)\Big].
Since projecting any zero-mean generalized elliptical distribution (with identity covariance) onto the unit sphere gives an ACG random variable (with identity covariance) [50], we can substitute $\mathbb{E}[\log(\Omega^{\top}\tilde{\Sigma}^{-1}\Omega/\Omega^{\top}\Omega)]$ with $\mathbb{E}[\log(X^{\top}\tilde{\Sigma}^{-1}X/X^{\top}X)]$, where the random vector X is distributed according to a multivariate normal distribution with identity covariance and zero mean. Because $X^{\top}\tilde{\Sigma}^{-1}X$ is a CQF and $X^{\top}X$ is a chi-square random variable, we have
\mathrm{KL}(G\,\|\,J) = \frac{1}{2}\log(|\tilde{\Sigma}|) + \frac{d}{2}\Big(\mathbb{E}\Big[\log\Big(\sum_{i=1}^{d}\tilde{\lambda}_i^{-1} N_i^2\Big)\Big] - \Psi\Big(\frac{d}{2}\Big) - \log(2)\Big),
where the $\tilde{\lambda}_i$ are the eigenvalues of $\tilde{\Sigma}$. Additionally, it is easy to verify that $|\Sigma| = |\tilde{\Sigma}|^{-1}$ and $\lambda_i = \tilde{\lambda}_i^{-1}$; therefore, (52) holds.
Since one of the terms in the KL-divergence is equal to minus the entropy, we use our derived expression for the KL-divergence between ACG distributions to find a formula for the entropy of an ACG distribution. Define $S_Y = \mathrm{ACG}(I)$; then the KL-divergence between $G_Y$ and $S_Y$ can be easily derived from the main definition (33):
\mathrm{KL}(G\,\|\,S) = -H(Y) - \log\Big(\frac{\Gamma(\frac{d}{2})}{2\pi^{\frac{d}{2}}}\Big),    (55)
where $H(Y)$ is the entropy of the random variable Y. Now, we compute the above KL-divergence using (52), which gives
\mathrm{KL}(G\,\|\,S) = -\frac{1}{2}\log(|\Sigma_1|) + \frac{d}{2}\Big(\mathbb{E}\Big[\log\Big(\sum_{i=1}^{d}\sigma_i N_i^2\Big)\Big] - \Psi\Big(\frac{d}{2}\Big) - \log(2)\Big).    (56)
Equating the right-hand sides of (55) and (56) gives (51). ☐
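Similarly, (51) and (52) are immediate to evaluate once the expected logarithm of the CQF is available. The Matlab sketch below is our own illustration, again delegating that term to the integral-method helper assumed earlier.

```matlab
function [kl, H1] = kl_acg(Sigma1, Sigma2)
% KL-divergence between ACG(Sigma1) and ACG(Sigma2), Eq. (52), and the
% entropy of ACG(Sigma1), Eq. (51).
    d    = size(Sigma1, 1);
    iS2h = inv(sqrtm(Sigma2));                 % Sigma2^(-1/2)
    lam  = eig(iS2h * Sigma1 * iS2h);          % eigenvalues of Sigma
    sig  = eig(Sigma1);                        % eigenvalues of Sigma1
    kl = -0.5*sum(log(lam)) ...
         + (d/2)*(explog_cqf_integral(lam) - psi(d/2) - log(2));
    H1 = log(2) + (d/2)*log(pi) + 0.5*sum(log(sig)) - gammaln(d/2) ...
         - (d/2)*(explog_cqf_integral(sig) - psi(d/2) - log(2));
end
```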
The following corollary shows a relation between the KL-divergence of ACG distributions and the KL-divergence of ZEG distributions. It is an immediate consequence of Theorems 3 and 4.
Corollary 1. 
The KL-divergence between two ACG distributions $\mathrm{ACG}(\Sigma_1)$ and $\mathrm{ACG}(\Sigma_2)$ is equal to the KL-divergence between two ZEG distributions $\mathrm{ZEG}(\Sigma_1, a, b)$ and $\mathrm{ZEG}(\Sigma_2, a, b)$, when $a \to 0$.

4. Simulation Study

In Section 2, we proposed three different methods for computing the expected logarithm of the CQF given in (1). We assume that the weights of the CQFs used in the simulations are eigenvalues of random positive definite matrices. These random matrices are generated uniformly from the space of positive definite matrices with unit trace according to the procedure proposed in [43]. In this section, we numerically investigate the running time and accuracy of these approaches. All methods were implemented in Matlab (Version R2014a) (64-bit), and the simulations were run on a personal laptop with an Intel Core i5 (2.5 GHz) processor under the OS X Yosemite 10.10.3 operating system. Since the series method depends heavily on loops, which are slow in Matlab, we implemented this method in a Matlab Mex-file. For the integral method, the integral is numerically evaluated using the Gauss–Kronrod 7-15 rule [51,52]. The absolute error tolerance is given as an input parameter of the numerical integration. In the integral method, the value can be computed to any given accuracy by choosing the absolute error tolerance; therefore, we do not analyze the integral method in terms of calculation error.
Figure 2 investigates the effects of dimensionality on the average running time of the different methods for computing the expected logarithm of the CQF explained in Section 2. For the integral method (upper-left plot), two curves for two different absolute error tolerances are shown. The integral formula (13) has the parameters k and θ that can be chosen freely, and we choose those given in (14). Different curves for the series method (upper-right plot) correspond to different values of L, which is the truncation length of the series. The curve for the fast method (lower plot) corresponds to the computation time of the third fast method explained in Section 2. One reason for the lower computation time of the fast method is that it requires no eigenvalue computation. A curve in the upper-right plot shows the computational time of the eigenvalue computation.
The approximation error of all three fast methods for different dimensions can be seen in Figure 3. The plot on the right-hand side of this figure magnifies the curve for the mean error of the third fast method (the blue curve with dots). As can be observed in Figure 3, changing the dimensionality has a negligible effect on the mean and the standard deviation (SD) of the absolute approximation error for the fast methods. The small mean error and SD of the third method indicate its distinct advantage over the other two methods. This method uses a convex combination of the values of the other two approximations, as explained in Section 2.
Approximating the expected logarithm of the CQF using the fast methods induces an error in the KL-divergence between ACG distributions given by (52). Figure 4 shows the mean percentage of relative error and its standard deviation as a function of dimensionality. It can be observed that the relative error decreases as the dimensionality increases. The third fast method is clearly superior to the other two fast methods. The reason for such a small percentage of relative error is the observation that whenever the error is large, the KL-divergence is large too. We are not showing the results for dimensions less than ten because the error percentage is quite large in that regime.
The amount of error in the series method for a given truncation length L depends on the parameters d and ϵ, as can be observed in the upper bound error formula (26). In Figure 5, this error together with its upper bound is shown for ϵ = 0.6, ϵ = 0.8 and ϵ = 0.95 with d = 2. The curves are plotted for L > dϵ/(2 − 2ϵ), because the error upper bound is valid only in this region. It can be observed that by increasing ϵ, the slopes of the error and the error upper bound increase, and the distance between these errors slightly increases too. This behavior indicates that the series method is more suitable for smaller ϵ values. For the case of ϵ = 0.6, we can see that the measured error remains constant after about L = 60, whereas the error upper bound continues to decrease. This happens because the error is computed against the integral method, which has an error itself (in this case, about 10^{-15}).
Figure 6 shows how increasing the dimension affects the performance of the series method. The parameter ϵ is set to 0.9 by choosing the maximum and minimum weights in the CQF to be 1 and 1/19, respectively. The other weights of the CQF are sampled uniformly between the maximum and minimum weights. It can be seen that the dimensionality has a negligible effect on the slope of the curves. This can be predicted from the formula of the upper bound in (26), because the exponential term ϵ^L dominates the other terms in the equation, and the slopes of the curves are determined mainly by the parameter ϵ. In this figure, the standard deviations are due to the different distributions of the weights between the maximum and minimum weights. Figure 5 and Figure 6 demonstrate that the error upper bound is a relatively tight bound on the actual error.
In Figure 7, we investigate the effect of ϵ and d on the average L needed to achieve an acceptable upper bound on the error (here 10^{-8}). We can see that as ϵ increases, the slopes of the curves increase, and in the limit ϵ → 1 the required L goes to infinity. This figure justifies our previous claim that when ϵ and the dimensionality are small, the series method is very efficient due to the relatively small L needed to achieve an acceptable error.

5. Conclusions

In this paper, we developed three methods for calculating the expected logarithm of a central quadratic form. The integral method was a direct application of a more general result applicable for positive random variables. We then introduced three fast methods for approximating the expected logarithm. Finally, using an infinite series representation of central quadratic forms, we proposed a series method for computing the expected logarithm. By proving a bound for the approximation error, we investigated the performance of this method.
We also derived expressions for the entropy and the KL-divergence of zero-mean elliptical gamma and angular central Gaussian distributions. The expected logarithm of the central quadratic form appeared in the form of KL-divergences and entropy of the angular central Gaussian distribution.
By conducting multiple experiments, we observed that the three methods for computing the expected logarithm of a central quadratic form differ in running time and accuracy. Users can choose the most appropriate method based on their requirements.
The methodologies developed in this paper can be used in many applications. For example, one can use the result of Theorem 1 for computing the expected logarithm of other positive random variables like a non-central quadratic form. Another line of research would be to use the KL-divergence between angular central Gaussian distributions with the fast approximations in learning problems that have a divergence measure in their cost functions.

Acknowledgments

This research was in part supported by a grant from IPM (No. CS1395-4-42).

Author Contributions

The authors contributed equally to this work. Both authors have read and approved the final manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Proof of Lemma 2

From Lemma 1 and the form of the gamma cumulative distribution function, we have the following inequalities:
\frac{\gamma(\frac{d}{2},\frac{x}{2\lambda_{\min}})}{\Gamma(\frac{d}{2})} \ge F(x) \ge \frac{\gamma(\frac{d}{2},\frac{x}{2\lambda_{\max}})}{\Gamma(\frac{d}{2})},    (A1)
where $\lambda_{\max} = \max\{\lambda_i\}_{i=1}^{d}$, $\lambda_{\min} = \min\{\lambda_i\}_{i=1}^{d}$, and $\gamma(\cdot,\cdot)$ is the lower incomplete gamma function. Subtracting all sides of the above inequality from $G(x) = \gamma(k,\frac{x}{\theta})/\Gamma(k)$, we get
\frac{\gamma(k,\frac{x}{\theta})}{\Gamma(k)} - \frac{\gamma(\frac{d}{2},\frac{x}{2\lambda_{\min}})}{\Gamma(\frac{d}{2})} \le G(x)-F(x) \le \frac{\gamma(k,\frac{x}{\theta})}{\Gamma(k)} - \frac{\gamma(\frac{d}{2},\frac{x}{2\lambda_{\max}})}{\Gamma(\frac{d}{2})}.    (A2)
Since $\log(x)$ is positive for $x>1$, multiplying all sides of the above inequality by $\log(x)$ gives
\log(x)\Big(\frac{\gamma(k,\frac{x}{\theta})}{\Gamma(k)} - \frac{\gamma(\frac{d}{2},\frac{x}{2\lambda_{\min}})}{\Gamma(\frac{d}{2})}\Big) \le \log(x)\,\big(G(x)-F(x)\big) \le \log(x)\Big(\frac{\gamma(k,\frac{x}{\theta})}{\Gamma(k)} - \frac{\gamma(\frac{d}{2},\frac{x}{2\lambda_{\max}})}{\Gamma(\frac{d}{2})}\Big),    (A3)
which holds for all $x>1$. For proving the first part of this lemma, namely
\lim_{x\to\infty}\log(x)\,\big(G(x)-F(x)\big) = 0,    (A4)
it is enough to show that
\lim_{x\to\infty}\log(x)\Big(\frac{\gamma(k,\frac{x}{\theta})}{\Gamma(k)} - \frac{\gamma(\hat{k},\frac{x}{\hat{\theta}})}{\Gamma(\hat{k})}\Big) = 0    (A5)
holds for any positive choices of $k$, $\hat{k}$, $\theta$ and $\hat{\theta}$, and then to invoke the squeeze theorem by taking the limits of all sides of (A3). From the definition of the lower incomplete gamma function, the left-hand side of (A5) can be rewritten as
\lim_{x\to\infty}\frac{\frac{1}{\Gamma(k)}\int_0^{x/\theta} t^{k-1}\exp(-t)\,dt - \frac{1}{\Gamma(\hat{k})}\int_0^{x/\hat{\theta}} t^{\hat{k}-1}\exp(-t)\,dt}{\log(x)^{-1}}.
Using L'Hôpital's rule, it can be seen that the above limit is equivalent to
\lim_{x\to\infty} -x\log(x)^{2}\Big(\frac{1}{\theta\,\Gamma(k)}\Big(\frac{x}{\theta}\Big)^{k-1}\exp\Big(-\frac{x}{\theta}\Big) - \frac{1}{\hat{\theta}\,\Gamma(\hat{k})}\Big(\frac{x}{\hat{\theta}}\Big)^{\hat{k}-1}\exp\Big(-\frac{x}{\hat{\theta}}\Big)\Big).    (A6)
It is easy to see that (A6) is equal to zero and consequently (A5) and (A4) hold.
Now, we want to prove the second statement in the lemma, that is
\lim_{x\to 0^+}\log(x)\,\big(G(x)-F(x)\big) = 0.    (A7)
If we multiply all sides of (A2) by $\log(x)$, then for $0 < x < 1$ we have
\log(x)\Big(\frac{\gamma(k,\frac{x}{\theta})}{\Gamma(k)} - \frac{\gamma(\frac{d}{2},\frac{x}{2\lambda_{\max}})}{\Gamma(\frac{d}{2})}\Big) \le \log(x)\,\big(G(x)-F(x)\big) \le \log(x)\Big(\frac{\gamma(k,\frac{x}{\theta})}{\Gamma(k)} - \frac{\gamma(\frac{d}{2},\frac{x}{2\lambda_{\min}})}{\Gamma(\frac{d}{2})}\Big).    (A8)
Using the same strategy as above, we want to show that for any positive choices of $k$, $\hat{k}$, $\theta$ and $\hat{\theta}$, the following limit holds:
\lim_{x\to 0^+}\log(x)\Big(\frac{\gamma(k,\frac{x}{\theta})}{\Gamma(k)} - \frac{\gamma(\hat{k},\frac{x}{\hat{\theta}})}{\Gamma(\hat{k})}\Big) = 0.    (A9)
Using L'Hôpital's rule, it can be seen that
\lim_{x\to 0^+}\log(x)\,\frac{\gamma(k,\frac{x}{\theta})}{\Gamma(k)} \overset{\mathrm{H}}{=} \lim_{x\to 0^+} -x\log(x)^{2}\,\frac{1}{\theta\,\Gamma(k)}\Big(\frac{x}{\theta}\Big)^{k-1}\exp\Big(-\frac{x}{\theta}\Big) = 0.
Therefore, (A9) holds, and from (A8) we have
0 \le \lim_{x\to 0^+}\log(x)\,\big(G(x)-F(x)\big) \le 0.
By the squeeze theorem, we can conclude that (A7) holds.

Appendix B. Proof of Lemma 3

From the expressions for $\mathcal{F}$ and $\mathcal{G}$, we have
\int_0^\infty \frac{\mathcal{G}(\sigma)-\mathcal{F}(\sigma)}{\sigma}\, d\sigma = \int_0^\infty \frac{\prod_{i=1}^{d}(1+2\lambda_i\sigma)^{\frac{1}{2}} - (1+\theta\sigma)^{k}}{\sigma\,(1+\theta\sigma)^{k}\prod_{i=1}^{d}(1+2\lambda_i\sigma)^{\frac{1}{2}}}\, d\sigma.
In this proof, for simplicity of notation, we define $L(\sigma) = (\mathcal{G}(\sigma)-\mathcal{F}(\sigma))/\sigma$ and $V(\sigma) = \prod_{i=1}^{d}(1+2\lambda_i\sigma)^{1/2} - (1+\theta\sigma)^{k}$.
We give separate proofs for the cases $d > 2k$, $d < 2k$ and $d = 2k$. For the first case, $d > 2k$, we have
\lim_{\sigma\to\infty} V(\sigma) = \lim_{\sigma\to\infty}\Big[\prod_{i=1}^{d}(1+2\lambda_i\sigma)^{\frac{1}{2}} - (1+\theta\sigma)^{k}\Big] = +\infty.
Consequently, there exists a number $a > 0$ such that for all $\sigma \ge a$ the function $V(\sigma)$ is positive.
Therefore, the integrand of $\int_a^\infty L(\sigma)\, d\sigma$ is positive on its domain of integration. If we choose $1 < p < 1+k$, then
\lim_{\sigma\to\infty} \frac{L(\sigma)}{\sigma^{-p}} = 0.
Since $\int_a^\infty \sigma^{-p}\, d\sigma$ is convergent and its integrand is positive on its domain, it follows from the limit comparison test that the integral $\int_a^\infty L(\sigma)\, d\sigma$ is convergent.
Now, we want to show that the integral $\int_0^a L(\sigma)\, d\sigma$ is also convergent. Since $\mathcal{G}(\sigma)-\mathcal{F}(\sigma)$ is bounded for $\sigma \in \mathbb{R}^+$, and
\lim_{\sigma\to 0^+} \frac{\mathcal{G}(\sigma)-\mathcal{F}(\sigma)}{\sigma} = \sum_{i=1}^{d}\lambda_i - k\theta,
$L(\sigma)$ is bounded and consequently $\int_0^a L(\sigma)\, d\sigma$ is convergent.
Since $\int_0^a L(\sigma)\, d\sigma$ and $\int_a^\infty L(\sigma)\, d\sigma$ are convergent, the integral $\int_0^\infty L(\sigma)\, d\sigma$ is also convergent.
For the case of $2k > d$, we have
\lim_{\sigma\to\infty} \big(-V(\sigma)\big) = \lim_{\sigma\to\infty}\Big[(1+\theta\sigma)^{k} - \prod_{i=1}^{d}(1+2\lambda_i\sigma)^{\frac{1}{2}}\Big] = +\infty.
Therefore, there exists a number $a > 0$ such that for all $\sigma \ge a$ the function $-V(\sigma)$ is positive, and hence the integrand of $\int_a^\infty -L(\sigma)\, d\sigma$ is positive on its domain of integration. If we choose $1 < p < 1+d/2$, then
\lim_{\sigma\to\infty} \frac{-L(\sigma)}{\sigma^{-p}} = 0.
Knowing that $\int_a^\infty \sigma^{-p}\, d\sigma$ is bounded, using the limit comparison test we can conclude that $\int_a^\infty L(\sigma)\, d\sigma$ is convergent. Now, with the same strategy as in the previous case, we can show that the integral $\int_0^a L(\sigma)\, d\sigma$ is convergent, and it is easy to see that $\int_0^\infty L(\sigma)\, d\sigma$ is also convergent.
For $2k = d$, excluding the obvious case $\mathcal{G}(\sigma) = \mathcal{F}(\sigma)$, there exists a number $a > 0$ such that for all $\sigma \ge a$ the function $V(\sigma)$ is either positive or negative. If it is positive, then we use the proof strategy of the case $d > 2k$; otherwise, we exploit the strategy of the case $d < 2k$.

Appendix C. Proof of Theorem 2

We first give a proof of the inequality $\mathbb{E}[\log(U)] \le \Psi(d/2) + \log(2\operatorname{tr}(\Sigma)/d)$. Using the integral formula (13) with $k = d/2$ and $\theta = 2\operatorname{tr}(\Sigma)/d$ for $\mathbb{E}[\log(U)]$, we obtain that this inequality is equivalent to
\int_0^\infty \frac{\big(1+\frac{2\operatorname{tr}(\Sigma)}{d}\sigma\big)^{-\frac{d}{2}} - \prod_{i=1}^{d}(1+2\lambda_i\sigma)^{-\frac{1}{2}}}{\sigma}\, d\sigma \le 0.
For proving this inequality, it is enough to show
\Big(1+\frac{2\operatorname{tr}(\Sigma)}{d}\,\sigma\Big)^{\frac{d}{2}} - \prod_{i=1}^{d}(1+2\lambda_i\sigma)^{\frac{1}{2}} \ge 0, \quad \text{for all } \sigma \in \mathbb{R}^+.
Setting $a_i = 1+2\lambda_i\sigma$, we can rewrite the above inequality as
\Big(\prod_{i=1}^{d} a_i\Big)^{\frac{1}{d}} \le \frac{\sum_{i=1}^{d} a_i}{d}, \quad \text{for all } a_i \in [1,+\infty),
which is the well-known arithmetic mean–geometric mean inequality.
For proving the second inequality, $\mathbb{E}[\log(U)] \ge \Psi\big(\operatorname{tr}(\Sigma)^2/(2\operatorname{tr}(\Sigma^2))\big) + \log\big(2\operatorname{tr}(\Sigma^2)/\operatorname{tr}(\Sigma)\big)$, we use the integral formula (13) with $k = \operatorname{tr}(\Sigma)^2/(2\operatorname{tr}(\Sigma^2))$ and $\theta = 2\operatorname{tr}(\Sigma^2)/\operatorname{tr}(\Sigma)$ to obtain the equivalent inequality
\int_0^\infty \frac{\big(1+\frac{2\operatorname{tr}(\Sigma^2)}{\operatorname{tr}(\Sigma)}\sigma\big)^{-\frac{\operatorname{tr}(\Sigma)^2}{2\operatorname{tr}(\Sigma^2)}} - \prod_{i=1}^{d}(1+2\lambda_i\sigma)^{-\frac{1}{2}}}{\sigma}\, d\sigma \ge 0.
For proving the above inequality, it is enough to show
\prod_{i=1}^{d}(1+2\lambda_i\sigma) \ge \Big(1+\frac{2\operatorname{tr}(\Sigma^2)}{\operatorname{tr}(\Sigma)}\,\sigma\Big)^{\frac{\operatorname{tr}(\Sigma)^2}{\operatorname{tr}(\Sigma^2)}}, \quad \text{for all } \sigma \in \mathbb{R}^+.    (C4)
Setting $A(\sigma) = \prod_{i=1}^{d}(1+2\lambda_i\sigma)$, $B(\sigma) = \big(1+\frac{2\operatorname{tr}(\Sigma^2)}{\operatorname{tr}(\Sigma)}\sigma\big)^{\frac{\operatorname{tr}(\Sigma)^2}{\operatorname{tr}(\Sigma^2)}}$ and $P(\sigma) = A(\sigma)/B(\sigma)$, (C4) can be rewritten as
P(\sigma) \ge 1, \quad \text{for all } \sigma \in \mathbb{R}^+.    (C5)
It is easy to see that $P(0) = 1$. Therefore, for proving (C5), it is enough to show that the function $P(\sigma)$ is increasing for positive σ.
The derivative of the function $P(\sigma)$ is
P'(\sigma) = \frac{A'(\sigma)\,B(\sigma) - B'(\sigma)\,A(\sigma)}{B^{2}(\sigma)}.
We want to show that $P'(\sigma) \ge 0$ for all positive σ, which is equivalent to
\frac{A'(\sigma)}{A(\sigma)} \ge \frac{B'(\sigma)}{B(\sigma)}, \quad \text{for all } \sigma \in \mathbb{R}^+.
Computing the left- and right-hand sides of the above inequality, we obtain
\sum_{i=1}^{d} \frac{2\lambda_i}{1+2\lambda_i\sigma} \ge \frac{2\operatorname{tr}(\Sigma)^2}{\operatorname{tr}(\Sigma)+2\operatorname{tr}(\Sigma^2)\,\sigma}, \quad \text{for all } \sigma \in \mathbb{R}^+.
We can rewrite the above inequality as
\Big(\sum_{i=1}^{d}\frac{2\lambda_i}{1+2\lambda_i\sigma}\Big)\Big(\sum_{i=1}^{d}\lambda_i(1+2\lambda_i\sigma)\Big) \ge 2\Big(\sum_{i=1}^{d}\lambda_i\Big)^{2}, \quad \text{for all } \sigma \in \mathbb{R}^+.
If we let $x_i^2 = 2\lambda_i/(1+2\lambda_i\sigma)$ and $y_i^2 = \lambda_i(1+2\lambda_i\sigma)$, the above inequality follows from
\Big(\sum_{i=1}^{d}x_i^2\Big)\Big(\sum_{i=1}^{d}y_i^2\Big) \ge \Big(\sum_{i=1}^{d}x_i y_i\Big)^{2}, \quad \text{for all } x_i, y_i \in \mathbb{R}^+,
which is the Cauchy–Schwarz inequality, together with $\sum_{i=1}^{d} x_i y_i = \sqrt{2}\sum_{i=1}^{d}\lambda_i$. So the function $P(\sigma)$ is increasing and, consequently, the second inequality holds.

Appendix D. Proof of Lemma 4

As we can see in [14], the following bound holds for $c_i$:
|c_i| \le c_0\,\epsilon^{i}\,\frac{\Gamma(\frac{d}{2}+i)}{\Gamma(\frac{d}{2})\, i!},
where $\epsilon = \max_j |1-\beta\lambda_j^{-1}|$. Substituting our chosen β given by (23) into the formula for ϵ, we obtain
\epsilon = \frac{\lambda_{\max}-\lambda_{\min}}{\lambda_{\max}+\lambda_{\min}}.
Let $\hat{c}_i$ be the corresponding bound for $c_i$,
\hat{c}_i = c_0\,\epsilon^{i}\,\frac{\Gamma(\frac{d}{2}+i)}{\Gamma(\frac{d}{2})\, i!};
then we have
\frac{\hat{c}_{i+1}}{\hat{c}_i} \le \frac{\frac{d}{2}+i}{i}\,\epsilon.
Since $(\frac{d}{2}+i)/i$ decreases with increasing $i$, we have
\frac{\hat{c}_{i+k}}{\hat{c}_i} \le \Big(\frac{\frac{d}{2}+i}{i}\,\epsilon\Big)^{k} = \epsilon_i^{\,k}, \quad \text{where } \epsilon_i := \frac{\frac{d}{2}+i}{i}\,\epsilon.
By summing both sides of the above inequality and changing variables, we can conclude
\sum_{k=i+1}^{\infty}\hat{c}_k \le \hat{c}_{i+1}\sum_{k=0}^{\infty}\epsilon_{i+1}^{\,k} \le \hat{c}_{i+1}\sum_{k=0}^{\infty}\epsilon_{i}^{\,k} = \hat{c}_{i+1}\,\frac{1}{1-\epsilon_i},
which is true if $i$ is large enough that $\epsilon_i < 1$. Since $L > d\epsilon/(2-2\epsilon)$, it can be observed that $\epsilon_i < 1$ for $i \ge L$; hence, for the total approximation error, we obtain
\mathbb{E}[\log U] - \hat{\mathbb{E}}[\log U] = \sum_{i=L}^{\infty}\frac{1}{\frac{d}{2}+i}\sum_{k=i+1}^{\infty}c_k \le \sum_{i=L}^{\infty}\frac{1}{\frac{d}{2}+i}\,\hat{c}_{i+1}\,\frac{1}{1-\epsilon_i} \le \frac{1}{(\frac{d}{2}+L)(1-\epsilon_L)}\sum_{i=L}^{\infty}\hat{c}_{i+1} \le \frac{\hat{c}_{L+1}}{(\frac{d}{2}+L)(1-\epsilon_L)^{2}} \le \frac{c_0\,\epsilon^{L+1}}{\big(1-\frac{d/2+L}{L}\,\epsilon\big)^{2}}\,\frac{\Gamma(\frac{d}{2}+L)}{\Gamma(\frac{d}{2}+1)\,(L+1)!}.

References

  1. Lapidoth, A.; Moser, S.M. Capacity bounds via duality with applications to multiple-antenna systems on flat-fading channels. IEEE Trans. Inf. Theory 2003, 49, 2426–2467. [Google Scholar] [CrossRef]
  2. Khodabin, M.; Ahmadabadi, A. Some properties of generalized gamma distribution. Math. Sci. 2010, 4, 9–28. [Google Scholar]
  3. Eccardt, T.M. The use of the logarithm of the variate in the calculation of differential entropy among certain related statistical distributions. 2007; arXiv:0705.4045. [Google Scholar]
  4. Nicolas, J.M. Introduction to second kind statistics: Application of log-moments and log-cumulants to SAR image law analysis. Trait. Signal 2002, 19, 139–167. [Google Scholar]
  5. Nicolas, J.M.; Tupin, F. Gamma mixture modeled with “second kind statistics”: Application to SAR image processing. In Proceedings of the IEEE International Geoscience and Remote Sensing Symposium, Toronto, ON, Canada, 24–28 June 2002; Volume 4, pp. 2489–2491.
  6. Teh, Y.W.; Newman, D.; Welling, M. A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation. Adv. Neural Inf. Process. Syst. 2006, 19, 1353–1360. [Google Scholar]
  7. Bishop, C.M. Pattern Recognition and Machine Learning; Springer: Berlin/Heidelberg, Germany, 2006. [Google Scholar]
  8. Jean, W.H. The geometric mean and stochastic dominance. J. Financ. 1980, 35, 151–158. [Google Scholar] [CrossRef]
  9. Hakansson, N.H. Multi-period mean-variance analysis: Toward a general theory of portfolio choice. J. Financ. 1971, 26, 857–884. [Google Scholar] [CrossRef]
  10. Lo, Y.; Mendell, N.R.; Rubin, D.B. Testing the number of components in a normal mixture. Biometrika 2001, 88, 767–778. [Google Scholar] [CrossRef]
  11. Moore, D.S.; Spruill, M.C. Unified large-sample theory of general chi-squared statistics for tests of fit. Ann. Stat. 1975, 3, 599–616. [Google Scholar]
  12. Li, K.C. Sliced inverse regression for dimension reduction. J. Am. Stat. Assoc. 1991, 86, 316–327. [Google Scholar] [CrossRef]
  13. Ruben, H. Probability content of regions under spherical normal distributions, IV: The distribution of homogeneous and non-homogeneous quadratic functions of normal variables. Ann. Math. Stat. 1962, 33, 542–570. [Google Scholar] [CrossRef]
  14. Kotz, S.; Johnson, N.L.; Boyd, D.W. Series representations of distributions of quadratic forms in normal variables. I. Central case. Ann. Math. Stat. 1967, 38, 823–837. [Google Scholar] [CrossRef]
  15. Box, G.E. Some theorems on quadratic forms applied in the study of analysis of variance problems, I. Effect of inequality of variance in the one-way classification. Ann. Math. Stat. 1954, 25, 290–302. [Google Scholar] [CrossRef]
  16. Ha, H.T.; Provost, S.B. An accurate approximation to the distribution of a linear combination of non-central chi-square random variables. REVSTAT Stat. J. 2013, 11, 231–254. [Google Scholar]
  17. Kullback, S.; Leibler, R.A. On information and sufficiency. Ann. Math. Stat. 1951, 22, 79–86. [Google Scholar] [CrossRef]
  18. Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J. 1948, 27, 379–423. [Google Scholar] [CrossRef]
  19. Basseville, M. Divergence measures for statistical data processing—An annotated bibliography. Signal Process. 2013, 93, 621–633. [Google Scholar] [CrossRef]
  20. Kanamori, T. Scale-invariant divergences for density functions. Entropy 2014, 16, 2611–2628. [Google Scholar] [CrossRef]
  21. Burnham, K.P.; Anderson, D.R. Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach; Springer: New York, NY, USA, 2002. [Google Scholar]
  22. Pardo, L. Statistical Inference Based on Divergence Measures; CRC Press: London, UK, 2005. [Google Scholar]
  23. Blekas, K.; Lagaris, I.E. Split–Merge Incremental LEarning (SMILE) of mixture models. In Artificial Neural Networks–ICANN 2007; Springer: Berlin/Heidelberg, Germany, 2007; Volume 4669, pp. 291–300. [Google Scholar]
  24. Runnalls, A.R. Kullback–Leibler approach to Gaussian mixture reduction. IEEE Trans. Aerosp. Electron. Syst. 2007, 43, 989–999. [Google Scholar] [CrossRef]
  25. Dhillon, I.S.; Mallela, S.; Kumar, R. A divisive information theoretic feature clustering algorithm for text classification. J. Mach. Learning Res. 2003, 3, 1265–1287. [Google Scholar]
  26. Imseng, D.; Bourlard, H.; Garner, P.N. Using Kullback–Leibler divergence and multilingual information to improve ASR for under-resourced languages. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Kyoto, Japan, 25–30 March 2012; pp. 4869–4872.
  27. Do, M.N.; Vetterli, M. Wavelet-based texture retrieval using generalized Gaussian density and Kullback–Leibler distance. IEEE Trans. Image Process. 2002, 11, 146–158. [Google Scholar] [CrossRef] [PubMed]
  28. Mathiassen, J.R.; Skavhaug, A.; Bø, K. Texture Similarity Measure Using Kullback–Leibler Divergence Between Gamma Distributions. In Computer Vision—ECCV 2002; Springer: Berlin/Heidelberg, Germany, 2002; Volume 2352, pp. 133–147. [Google Scholar]
  29. Koutras, M. On the generalized noncentral chi-squared distribution induced by an elliptical gamma law. Biometrika 1986, 73, 528–532. [Google Scholar] [CrossRef]
  30. Fang, K.T.; Zhang, Y.T. Generalized Multivariate Analysis; Springer: Berlin/Heidelberg, Germany, 1990. [Google Scholar]
  31. Hosseini, R.; Sra, S.; Theis, L.; Bethge, M. Inference and mixture modeling with the elliptical gamma distribution. Comput. Stat. Data Anal. 2016, 101, 29–43. [Google Scholar] [CrossRef]
  32. Watson, G.S. Statistics on Spheres; Wiley: New York, NY, USA, 1983. [Google Scholar]
  33. Kent, J.T. The complex Bingham distribution and shape analysis. J. R. Stat. Soc. Ser. B 1994, 56, 285–299. [Google Scholar] [CrossRef]
  34. Bethge, M.; Hosseini, R. Method and Device for Image Compression. U.S. Patent 8,750,603, 10 June 2014. [Google Scholar]
  35. Zhang, T. Robust subspace recovery by Tyler’s M-estimator. Inf. Inference 2015, 5. [Google Scholar] [CrossRef]
  36. Franke, J.; Redenbach, C.; Zhang, N. On a mixture model for directional data on the sphere. Scand. J. Stat. 2015, 43, 139–155. [Google Scholar] [CrossRef]
  37. Tyler, D.E. A distribution-free M-estimator of multivariate scatter. Ann. Stat. 1987, 15, 234–251. [Google Scholar] [CrossRef]
  38. Sra, S.; Hosseini, R.; Theis, L.; Bethge, M. Data modeling with the elliptical gamma distribution. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, San Diego, CA, USA, 9–12 May 2015; pp. 903–911.
  39. Davis, P.J.; Rabinowitz, P. Methods of Numerical Integration; Dover: New York, NY, USA, 2007. [Google Scholar]
  40. Benaroya, H.; Han, S.M.; Nagurka, M. Probability Models in Engineering and Science; CRC Press: Boca Raton, FL, USA, 2005; Volume 193. [Google Scholar]
  41. Satterthwaite, F.E. Synthesis of variance. Psychometrika 1941, 6, 309–316. [Google Scholar] [CrossRef]
  42. Yuan, K.H.; Bentler, P.M. Two simple approximations to the distributions of quadratic forms. Br. J. Math. Stat. Psychol. 2010, 63, 273–291. [Google Scholar] [CrossRef] [PubMed]
  43. Martin, B.M.; Jorswieck, E. Sampling uniformly from the set of positive definite matrices with trace constraint. IEEE Trans. Signal Process. 2012, 60, 2167–2179. [Google Scholar]
  44. Frahm, G.; Jaekel, U. Tyler’s M-estimator, random matrix theory, and generalized elliptical distributions with applications to finance. Available online: http://ssrn.com/abstract=1287683 (accessed on 26 July 2016).
  45. Fang, K.T.; Kotz, S.; Ng, K.W. Symmetric Multivariate and Related Distributions; Chapman and Hall: London, UK, 1990. [Google Scholar]
  46. Chen, B.; Zhu, Y.; Hu, J.; Principe, J.C. System Parameter Identification: Information Criteria And Algorithms; Newnes: Oxford, UK, 2013. [Google Scholar]
  47. Provost, S.B.; Cheong, Y. The probability content of cones in isotropic random fields. J. Multivar. Anal. 1998, 66, 237–254. [Google Scholar] [CrossRef]
  48. Johnson, N.L.; Kotz, S. Distributions in Statistics: Continuous Univariate Distributions; Houghton Mifflin: Boston, MA, USA, 1970; Volume 1. [Google Scholar]
  49. Polyanin, A.D.; Manzhirov, A.V. Handbook of Integral Equations; CRC Press: Boca Raton, FL, USA, 1998. [Google Scholar]
  50. Soloveychik, I.; Wiesel, A. Tyler’s estimator performance analysis. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, Australia, 19–24 April 2015; pp. 5688–5692.
  51. Piessens, R.; de Doncker-Kapenga, E.; Überhuber, C.W. QUADPACK, A Subroutine Package for Automatic Integration; Springer: Berlin/Heidelberg, Germany, 1983. [Google Scholar]
  52. Shampine, L.F. Vectorized adaptive quadrature in Matlab. J. Comput. Appl. Math. 2008, 211, 131–140. [Google Scholar] [CrossRef]
Figure 1. The mean squared error of the third fast method for approximating the expected logarithm of a CQF as a function of parameter l.
Figure 2. The average running time (in milliseconds) of the integral method (a), the series method (b) and the third fast method (c) in different dimensions for computing expected logarithm of the CQF. The red curve in the upper-left plot shows the computational time for computing the eigenvalues of random positive-definite matrices (using eig function in Matlab) needed before applying the integral method or series method. Different curves for the upper-left plot correspond to the computational time of integral method for different absolute error tolerances including the time needed for computing the eigenvalues. Different curves for the series method correspond to the computational time for various values of truncation length of the series.
Figure 3. The absolute error for the approximation of the expected logarithm of the CQF by the fast methods explained in Section 2 for different dimensions. The third method uses a convex combination of the first two methods. The plot on the right shows the zoomed version of the error mean of the third method.
Figure 4. The relative error of the KL-divergence between ACG distributions in different dimensions when the fast methods are used for approximating the expected logarithm of the CQF.
Figure 5. The approximation error of the series method for computing the expected logarithm of the CQF together with its upper bound. The result for three different values of ϵ is shown with different colors. In this figure, the number of weights (d) is equal to two.
Figure 6. The relation between L and the error in the series method for ϵ = 0.9.
Figure 7. The average needed L computed by the error upper bound formula given by (26) for different values of d and ϵ.
