The Sampling Distribution of the Total Correlation for Multivariate Gaussian Random Variables

The sampling distribution of the total correlation (TC) for a d-dimensional standardized multivariate Gaussian random variable with an identity covariance matrix is derived. It is shown to be the distribution of a sum of generalized beta random variables. It is also shown that, for large dimension and sample size, a central limit theorem holds, providing a Gaussian approximation to the sampling distribution for high-dimensional data.


Introduction
Mutual information quantifies the information shared between two random variables [1][2][3]. This concept has been generalized to d variables in a variety of ways [4][5][6][7], with the most direct generalization being Watanabe's total correlation (TC),

T(X) = \sum_{i=1}^{d} h(X_i) - h(X),    (1)

where X is a vector whose components are the d random variables X_1, . . . , X_d and, for continuous random variables, h(X_i) is the differential entropy of X_i and h(X) is the joint differential entropy of X. Total correlation is also sometimes called multivariate mutual information; it is the Kullback-Leibler divergence between the joint density of X and the density obtained by taking the product of the marginal densities of the X_i. Thus, the total correlation T(X) quantifies, in a quite general sense, the information shared among all d random variables. The total correlation is non-negative, and in the case where all d random variables are mutually independent we have T(X) = 0 [7,8].

For the special case where X is multivariate Gaussian with arbitrary mean and covariance matrix Σ, the total correlation can be written explicitly as

T(X) = \frac{1}{2}\left( \sum_{i=1}^{d} \log \sigma^2_{ii} - \log |\Sigma| \right),    (2)

where σ^2_{ij} is the ijth entry of Σ. When the X_i are independent we have σ^2_{ij} = 0 for all i ≠ j, and so log |Σ| = log(σ^2_{11} σ^2_{22} · · · σ^2_{dd}), giving T(X) = 0 in Equation (2), as expected.

The total correlation provides a natural way to quantify dependencies among a set of random variables. For example, we often seek to determine whether a set of random variables is mutually independent, because dependency among variables can indicate interesting and meaningful relationships in nature. To do so, one can take a sample from the unknown distribution and compute the total correlation from this sample. Even if the random variables are mutually independent, however, the total correlation measured using such a finite sample will typically be positive (rather than zero) simply because of sampling variation.
Therefore, it is of interest to know the sampling distribution of the total correlation under independence. Once we have the sampling distribution we can then perform statistical tests of independence. Here we derive the sampling distribution of (2) in the case where the X i are standardized (i.e., zero mean, unit variance), independent, Gaussian random variables.
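To make the estimator concrete, the following sketch (our code, not from the paper; the function and variable names are illustrative) computes the plug-in total correlation of Equation (2) from a finite sample, using the uncentered sample covariance appropriate for zero-mean data:

```python
import numpy as np

def sample_total_correlation(x):
    """x: (n, d) array of n draws from a zero-mean d-variate distribution.
    Returns the plug-in estimate T = (1/2) * (sum_i log s2_ii - log|Sigma_hat|),
    with Sigma_hat = (1/n) * sum of outer products x_i x_i^T."""
    n, d = x.shape
    sigma_hat = (x.T @ x) / n          # uncentered sample covariance
    s2 = np.diag(sigma_hat)            # marginal sample variances
    sign, logdet = np.linalg.slogdet(sigma_hat)
    return 0.5 * (np.sum(np.log(s2)) - logdet)

rng = np.random.default_rng(0)
x = rng.standard_normal((1000, 5))     # independent standard Gaussians
t_hat = sample_total_correlation(x)    # small but positive under independence
```

Even though the five components are truly independent, `t_hat` comes out strictly positive; quantifying how large this sampling artifact typically is requires the sampling distribution derived below.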
Previous authors have proposed exact expressions for the mean and variance of the sample total correlation [9,10]. In fact, Guerrero (Section 2.1 of [9]) derived a moment generating function for the sample total correlation using the distribution of the log-determinant of a Wishart matrix (see Wilks [11,12]). Unfortunately, the asymptotic approximation of Guerrero's result does not match the results of Marrelec [10], suggesting that one of the two is incorrect. We resolve this discrepancy by deriving the moment generating function directly from our expression for the probability density function of the sample total correlation. In the limit of large sample size our results match those presented in Section 4.1 of Marrelec [10], suggesting that the moment generating function of [9] is incorrect.

Definitions and Preliminaries
Let X represent a d-variate zero-mean Gaussian random variable with covariance matrix Σ = I_d, where I_d is the d-dimensional identity matrix. Let {x_1, . . . , x_n} denote a sample of n draws from the distribution of X. We focus on the case where n ≥ d. The sample covariance matrix is \hat\Sigma = (1/n) \sum_{i=1}^{n} x_i x_i^\top = \{\hat\sigma_{ij}\}, and n\hat\Sigma is Wishart distributed with n degrees of freedom, which we denote as n\hat\Sigma \sim W(\Sigma, d, n). From Equation (2), the sample total correlation is then also a random variable and is computed as

T_{d,n}(X) = \frac{1}{2}\left( \sum_{i=1}^{d} \log \hat\sigma^2_{ii} - \log |\hat\Sigma| \right),    (3)

where the subscripts d and n indicate that T is a family of random variables indexed by dimension and sample size.

Odell and Feiveson's 1966 result [13] provides a convenient way to characterize a Wishart-distributed matrix. Suppose V^{(n)}_1, . . . , V^{(n)}_d are independent chi-square random variables, where V^{(n)}_i has n − i + 1 degrees of freedom, and suppose that N_{ij} are independent standardized normal random variables for 1 ≤ i < j ≤ d, also independent of every V^{(n)}_i. Now construct the random variables

b_{ii} = V^{(n)}_i + \sum_{j=1}^{i-1} N_{ji}^2, \qquad b_{ij} = N_{ij}\sqrt{V^{(n)}_i} + \sum_{k=1}^{i-1} N_{ki} N_{kj} \quad (i < j),    (4)

so that B = \{b_{ij}\} \sim W(I_d, d, n), and thus we have

b_{ii} = V^{(n)}_i + A_i,    (5)

where the A_i = \sum_{j=1}^{i-1} N_{ji}^2 are independent chi-square random variables with i − 1 degrees of freedom and we define A_1 = 0. Now, following [14], we can also define the lower-triangular matrix T = \{t_{ij}\} as

t_{ii} = \sqrt{V^{(n)}_i}, \qquad t_{ij} = N_{ji} \quad (j < i),    (6)

and thus B = TT^\top. Furthermore,

|B| = \prod_{i=1}^{d} t_{ii}^2 = \prod_{i=1}^{d} V^{(n)}_i.    (7)

Result (7) is a special case of results found in Wilks [11]. For analogous results involving complex matrices see Goodman [15].
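The Odell–Feiveson product representation of the Wishart determinant can be checked numerically. The sketch below (our own code; names are illustrative) compares a Monte Carlo estimate of E[log |W|] for W ∼ W(I_d, d, n) against the exact value implied by a product of independent chi-squares with n − i + 1 degrees of freedom, using the known identity E[log χ²_k] = ψ(k/2) + log 2:

```python
import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(1)
d, n, reps = 4, 20, 20000

# Monte Carlo: log-determinants of Wishart matrices W = G^T G, G ~ N(0,1)^{n x d}.
logdets = np.empty(reps)
for r in range(reps):
    g = rng.standard_normal((n, d))
    logdets[r] = np.linalg.slogdet(g.T @ g)[1]

# Exact: sum_i E[log chi2_{n-i+1}] = sum_i (psi((n-i+1)/2) + log 2).
exact = sum(digamma((n - i + 1) / 2) + np.log(2) for i in range(1, d + 1))
gap = abs(logdets.mean() - exact)      # should be small (Monte Carlo error only)
```

The agreement (up to Monte Carlo error) reflects that |W| has the same law as the product of the d independent chi-squares in the construction above.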

The Sampling Distribution of the Total Correlation
With the above preliminaries in place, we can now state the following theorem.
Theorem 1 (The Sampling Distribution of TC). Consider a sample of size n from a set of d independent, standardized, Gaussian random variables, with n ≥ d. The total correlation (TC) is distributed as

T_{d,n}(X) \sim \frac{1}{2} \sum_{i=1}^{d-1} \log\left(1 + \frac{i}{n-i} F_{i,n-i}\right),    (8)

where the F_{i,n−i} are independent F-distributed random variables with i and n − i degrees of freedom. Equivalently, (8) can be written as

T_{d,n}(X) \sim \sum_{i=1}^{d-1} Y_{i,n},    (9)

where Y_{i,n} is a beta-exponential random variable with probability density

f_{Y_{i,n}}(y) = \frac{2\, e^{-(n-i)y} \left(1 - e^{-2y}\right)^{i/2-1}}{B\left(\frac{n-i}{2}, \frac{i}{2}\right)}, \qquad y > 0.    (10)

Proof. Writing Equation (3) as

T_{d,n}(X) = \frac{1}{2} \sum_{i=1}^{d} \log\left(n\hat\sigma^2_{ii}\right) - \frac{1}{2} \log\left|n\hat\Sigma\right|

and using results (5) and (7), one obtains

T_{d,n}(X) \sim \frac{1}{2} \sum_{i=1}^{d} \log\left(V^{(n)}_i + A_i\right) - \frac{1}{2} \sum_{i=1}^{d} \log V^{(n)}_i = \frac{1}{2} \sum_{i=1}^{d} \log\left(1 + \frac{A_i}{V^{(n)}_i}\right).

Scaling each chi-square by its corresponding degrees of freedom, so that A_i / V^{(n)}_i = \frac{i-1}{n-i+1} F_{i-1,\,n-i+1}, and re-indexing (noting that A_1 = 0) yields (8).
Equivalently, if we define Y_{i,n} = \frac{1}{2} \log\left(1 + \frac{i}{n-i} F_{i,n-i}\right), then T_{d,n}(X) \sim \sum_{i=1}^{d-1} Y_{i,n}, and using standard techniques (a change of variables from the F density; equivalently, e^{-2Y_{i,n}} is beta distributed with parameters (n−i)/2 and i/2) it can be shown that the random variable Y_{i,n} has probability density

f_{Y_{i,n}}(y) = \frac{2\, e^{-(n-i)y} \left(1 - e^{-2y}\right)^{i/2-1}}{B\left(\frac{n-i}{2}, \frac{i}{2}\right)}, \qquad y > 0,

where B(x, y) is the beta function.
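Theorem 1 can be checked by simulation. The sketch below (our code; names are illustrative) compares the sample total correlation, simulated directly from independent Gaussian data, against draws from the sum-of-F representation, with each F variate built as a ratio of scaled chi-squares:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, reps = 4, 30, 20000

# Direct simulation of the sample total correlation T_{d,n}.
t_direct = np.empty(reps)
for r in range(reps):
    x = rng.standard_normal((n, d))
    s = (x.T @ x) / n
    t_direct[r] = 0.5 * (np.log(np.diag(s)).sum() - np.linalg.slogdet(s)[1])

# Simulation from the sum-of-F representation:
# T ~ (1/2) * sum_{i=1}^{d-1} log(1 + (i/(n-i)) * F_{i,n-i}).
i = np.arange(1, d)                                    # i = 1, ..., d-1
f = (rng.chisquare(i, (reps, d - 1)) / i) / (rng.chisquare(n - i, (reps, d - 1)) / (n - i))
t_rep = 0.5 * np.log1p(i / (n - i) * f).sum(axis=1)

gap = abs(t_direct.mean() - t_rep.mean())              # should be Monte Carlo small
```

The two samples should agree in distribution; comparing means (or quantiles) gives a quick sanity check of the re-indexed F representation.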

Corollary 1.
The moment generating function for T_{d,n}(X) is

M_{d,n}(t) = \prod_{i=1}^{d-1} \frac{\Gamma\left(\frac{n-i-t}{2}\right) \Gamma\left(\frac{n}{2}\right)}{\Gamma\left(\frac{n-t}{2}\right) \Gamma\left(\frac{n-i}{2}\right)},    (11)

where Γ(x) is the gamma function. The mean and variance of T_{d,n}(X) are therefore

\mu_{d,n} = \frac{1}{2} \sum_{i=1}^{d-1} \left[\psi\left(\tfrac{n}{2}\right) - \psi\left(\tfrac{n-i}{2}\right)\right], \qquad \sigma^2_{d,n} = \frac{1}{4} \sum_{i=1}^{d-1} \left[\psi^{(1)}\left(\tfrac{n-i}{2}\right) - \psi^{(1)}\left(\tfrac{n}{2}\right)\right],    (12)

where ψ(x) = Γ'(x)/Γ(x) is the digamma function and ψ^{(k)}(x) denotes its kth derivative.
Proof. Taking Y_{i,n} = \frac{1}{2} \log\left(1 + \frac{i}{n-i} F_{i,n-i}\right), the moment generating function for Y_{i,n} is

\varphi_{i,n}(t) = E\left[e^{t Y_{i,n}}\right] = \frac{B\left(\frac{n-i-t}{2}, \frac{i}{2}\right)}{B\left(\frac{n-i}{2}, \frac{i}{2}\right)}, \qquad t < n - i.

The random variables in the sum \sum_{i=1}^{d-1} Y_{i,n} are independent, and therefore the moment generating function M_{d,n}(t) for T_{d,n}(X) is the corresponding product of the functions φ_{i,n}(t). Equation (12) then follows directly from the properties of moment generating functions.
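The digamma/trigamma expressions for the mean and variance can be checked against simulation. The sketch below is our code (names illustrative), implementing the sums in Equations (12) and comparing them with a Monte Carlo estimate:

```python
import numpy as np
from scipy.special import digamma, polygamma

def tc_mean_var(d, n):
    # Mean and variance of T_{d,n} via digamma / trigamma sums, Equations (12).
    i = np.arange(1, d)                  # i = 1, ..., d-1
    mean = 0.5 * np.sum(digamma(n / 2) - digamma((n - i) / 2))
    var = 0.25 * np.sum(polygamma(1, (n - i) / 2) - polygamma(1, n / 2))
    return mean, var

# Monte Carlo comparison for a small case.
rng = np.random.default_rng(3)
d, n, reps = 5, 40, 20000
ts = np.empty(reps)
for r in range(reps):
    x = rng.standard_normal((n, d))
    s = (x.T @ x) / n
    ts[r] = 0.5 * (np.log(np.diag(s)).sum() - np.linalg.slogdet(s)[1])

mean_exact, var_exact = tc_mean_var(d, n)
```

With `reps` this large, the empirical mean and variance of `ts` should match `mean_exact` and `var_exact` to within Monte Carlo error.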
Guerrero [9] obtained a formula for the mean and variance of T_{d,n}(X) (up to a typo in the variance) using Wilks' [12] moment generating function for the generalized variance. These are remarkably close to (12), but the moment generating function for the sample total correlation proposed in Guerrero [9] appears to be incorrect.

A Central Limit Theorem for the Total Correlation
Girko's central limit theorem [16] implies asymptotic normality of the sample log-determinant, as seen in the work of Bao et al. and Cai et al. [17,18]. This suggests the existence of a central limit theorem for T_{d,n}(X) when the dimension d and sample size n are large. Here we provide such a result.
Define the mean and variance of Y_{i,n} as m_{i,n} = E[Y_{i,n}] and s^2_{i,n} = E[(Y_{i,n} − m_{i,n})^2], and define the mean-centered random variables

Y^*_{i,n} = Y_{i,n} - m_{i,n}.

Theorem 2 (Asymptotic normality of TC). Suppose n → ∞ and d → ∞ in such a way that n/d → k, where 1 ≤ k < ∞. Then

\frac{1}{\sigma_{d,n}} \sum_{i=1}^{d-1} Y^*_{i,n} \longrightarrow N(0, 1),    (13)

where convergence is in distribution and \sigma^2_{d,n} = \sum_{i=1}^{d-1} s^2_{i,n} is as in Equations (12). Thus, for large n and d (with n ≥ d), the total correlation T_{d,n}(X) is approximately normally distributed with mean and variance given by µ_{d,n} and σ^2_{d,n} in Equations (12).
Proof. The Y^*_{i,n} form a triangular array of random variables such that, for any fixed n, the Y^*_{i,n} (1 ≤ i ≤ d − 1) are independent. Thus, (13) will hold provided that the Lyapunov condition is satisfied [19]; namely, that there exists a δ > 0 such that

\lim_{n,d \to \infty} \frac{1}{\sigma_{d,n}^{2+\delta}} \sum_{i=1}^{d-1} E\left[\,|Y^*_{i,n}|^{2+\delta}\right] = 0.

For δ = 2 the entries in Lyapunov's summation are each Y_{i,n}'s fourth central moment, for which the generating function is C_{i,n}(t) = e^{−m_{i,n} t} φ_{i,n}(t). The summation therefore becomes

\sum_{i=1}^{d-1} E\left[(Y_{i,n} - m_{i,n})^4\right] = \sum_{i=1}^{d-1} C^{(4)}_{i,n}(0),

while the denominator in Lyapunov's condition is

\sigma_{d,n}^4 = \left(\sum_{i=1}^{d-1} s^2_{i,n}\right)^2.

In Appendix A we derive an upper bound on this numerator (Lemma A1) and, for any fixed 1 ≤ k < ∞ and for sufficiently large d and n with n/d sufficiently close to k, a lower bound on this denominator (Lemma A2). Therefore, for any fixed 1 ≤ k < ∞, and for sufficiently large d and n with n/d sufficiently close to k, we obtain an upper bound, (14), on the Lyapunov ratio. Now first consider the case where n = d (and therefore k = 1). Taking the limit n → ∞, the right-hand side of (14) tends to zero, verifying Lyapunov's condition for k = 1. Next, consider the case where n > d. Taking the limit in (14) as n → ∞ and d → ∞ in such a way that n/d → k where 1 < k < ∞, again the right-hand side tends to zero. This verifies Lyapunov's condition in the case where k > 1, thereby completing the proof.
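In practice, Theorem 2 justifies a simple z-statistic for testing mutual independence in high-dimensional Gaussian data. The sketch below is our code (the function name, threshold, and data are illustrative): it standardizes an observed sample TC by the exact mean and variance of Equations (12), so that under independence the result is approximately standard normal:

```python
import numpy as np
from scipy.special import digamma, polygamma

def tc_z_score(t_obs, d, n):
    # Standardize an observed sample TC by the exact mean and variance of Eq. (12).
    i = np.arange(1, d)
    mu = 0.5 * np.sum(digamma(n / 2) - digamma((n - i) / 2))
    s2 = 0.25 * np.sum(polygamma(1, (n - i) / 2) - polygamma(1, n / 2))
    return (t_obs - mu) / np.sqrt(s2)

rng = np.random.default_rng(4)
d, n = 50, 200                               # "large d, large n" regime
x = rng.standard_normal((n, d))              # truly independent components
s = (x.T @ x) / n
t_obs = 0.5 * (np.log(np.diag(s)).sum() - np.linalg.slogdet(s)[1])
z = tc_z_score(t_obs, d, n)                  # approx N(0, 1) under independence
```

A large |z| (relative to a standard normal quantile) would then be evidence against mutual independence; here, with independent data, z should be of moderate size.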

Conclusions
The total correlation of a multivariate random variable (sometimes called multivariate mutual information) is the Kullback-Leibler divergence between the joint density of the random variable and the product of its marginal densities. It therefore provides a natural measure of the degree of independence of a set of random variables. In this paper we derived the sampling distribution of the total correlation for a d-dimensional standardized multivariate Gaussian random variable with identity covariance matrix, and showed that it is the distribution of a sum of generalized beta random variables. We also proved that, for large dimension and sample size, a central limit theorem holds, providing a Gaussian approximation to the sampling distribution for high-dimensional data.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A
The proof of the central limit theorem result makes use of the following two lemmas. Both are based on an inequality for the polygamma functions found in [20]:

\frac{(m-1)!}{x^m} + \frac{m!}{2x^{m+1}} \le (-1)^{m+1} \psi^{(m)}(x) \le \frac{(m-1)!}{x^m} + \frac{m!}{x^{m+1}}, \qquad x > 0,

where m ≥ 1 is an integer.

Lemma A1. Suppose d ≤ n. Then the following inequality holds.

Proof. The left-hand inequality follows from the fact that ψ^{(1)}(x) and ψ^{(3)}(x) are both monotonically decreasing functions, and so ψ^{(1)}((n−i)/2) ≥ ψ^{(1)}(n/2) and ψ^{(3)}((n−i)/2) ≥ ψ^{(3)}(n/2) for all 1 ≤ i ≤ d − 1. The right-hand inequality follows by applying the polygamma inequality above.

Lemma A2. Suppose d ≤ n. Then, for any fixed 1 ≤ k < ∞, and for sufficiently large d and n with n/d sufficiently close to k, the following inequality holds.