Article

Insights into Entropy as a Measure of Multivariate Variability

Badong Chen 1,*, Jianji Wang 1, Haiquan Zhao 2 and Jose C. Principe 3
1 School of Electronic and Information Engineering, Xi’an Jiaotong University, Xi’an 710049, China
2 School of Electrical Engineering, Southwest Jiaotong University, Chengdu 610031, China
3 Department of Electrical and Computer Engineering, University of Florida, Gainesville, FL 32611, USA
* Author to whom correspondence should be addressed.
Entropy 2016, 18(5), 196; https://doi.org/10.3390/e18050196
Submission received: 29 February 2016 / Revised: 26 April 2016 / Accepted: 16 May 2016 / Published: 20 May 2016
(This article belongs to the Special Issue Information: Meanings and Interpretations)

Abstract

Entropy has been widely employed as a measure of variability in problems such as machine learning and signal processing. In this paper, we provide some new insights into the behavior of entropy as a measure of multivariate variability. The relationships between multivariate entropy (joint or total marginal) and traditional measures of multivariate variability, such as total dispersion and generalized variance, are investigated. It is shown that for the jointly Gaussian case, the joint entropy (or entropy power) is equivalent to the generalized variance, while total marginal entropy is equivalent to the geometric mean of the marginal variances and total marginal entropy power is equivalent to the total dispersion. The smoothed multivariate entropy (joint or total marginal) and the kernel density estimation (KDE)-based entropy estimator (with finite samples) are also studied, which, under certain conditions, will be approximately equivalent to the total dispersion (or a total dispersion estimator), regardless of the data distribution.

1. Introduction

The concept of entropy can be used to quantify uncertainty, complexity, randomness, and regularity [1,2,3,4]. In particular, entropy is also a measure of the variability (or dispersion) of the associated distribution [5]. The most popular entropy functional is the Shannon entropy, which is a central concept in information theory [1]. In addition to Shannon entropy, there are many other entropy definitions, such as the Renyi and Tsallis entropies [2,3]. Renyi entropy is a generalized entropy which depends on a parameter $\alpha$ and includes Shannon entropy as a limiting case ($\alpha \to 1$). In this work, to simplify the discussion, we focus mainly on the Shannon and Renyi entropies.
Entropy has found applications in many fields, such as statistics, physics, communication, and ecology. In the past decades, and especially in recent years, entropy and related information theoretic measures (e.g., mutual information) have also been successfully applied in machine learning and signal processing [4,6,7,8,9,10]. Information theoretic quantities can capture higher-order statistics and offer potentially significant performance improvements in machine learning applications. In information theoretic learning (ITL) [4], measures from information theory (entropy, mutual information, divergences, etc.) are often used as an optimization cost instead of the conventional second-order statistical measures such as variance and covariance. In particular, in many machine learning (supervised or unsupervised) problems the goal is to optimize (maximize or minimize) the variability of the data, and in these cases one can optimize the entropy of the data so as to capture its underlying structure. For example, in supervised learning such as regression, the problem can be formulated as minimizing the entropy of the error between the model output and the desired response [11,12,13,14,15,16,17]. In ITL, this optimization criterion is called the minimum error entropy (MEE) criterion [4,6].
In most practical applications, the data are multidimensional and multivariate. The total dispersion (i.e., the trace of the covariance matrix) and the generalized variance (i.e., the determinant of the covariance matrix) are two widely used measures of multivariate variability, although both have some limitations [18,19,20]. However, these measures involve only second-order statistics and cannot describe non-Gaussian distributions well. Entropy can be used as a descriptive and comprehensive measure of multivariate variability, especially when data are non-Gaussian, since it can capture higher-order statistics and the information content of the data rather than simply their energy [4]. There are strong relationships between entropy and traditional measures of multivariate variability (e.g., total dispersion and generalized variance). In the present work, we study this problem in detail and provide some new insights into the behavior of entropy as a measure of multivariate variability. We focus mainly on two types of multivariate entropy (or entropy power) measures, namely joint entropy and total marginal entropy. We show that for the jointly Gaussian case, the joint entropy and joint entropy power are equivalent to the generalized variance, while total marginal entropy is equivalent to the geometric mean of the marginal variances and total marginal entropy power is equivalent to the total dispersion. Further, we study the smoothed multivariate entropy measures and show that the smoothed joint entropy and smoothed total marginal entropy will be equivalent to a weighted version of the total dispersion when the smoothing vector has independent entries and the smoothing factor approaches infinity. In particular, if the smoothing vector has independent and identically distributed entries, the two smoothed entropy measures will be equivalent to the total dispersion as the smoothing factor approaches infinity. Finally, we also show that with a finite number of samples, the kernel density estimation (KDE) based entropy (joint or total marginal) estimator will be approximately equivalent to a total dispersion estimator if the kernel function is Gaussian with an identity covariance matrix and the smoothing factor is large enough.
The rest of the paper is organized as follows. In Section 2, we present some entropy measures of multivariate variability and discuss the relationships between entropy and traditional measures of multivariate variability. In Section 3, we study the smoothed multivariate entropy measures and gain insights into the links between the smoothed entropy and the total dispersion. In Section 4, we investigate the KDE based entropy estimator (with finite samples), and prove that under certain conditions the entropy estimator is approximately equivalent to a total dispersion estimator. Finally, Section 5 concludes the paper.

2. Entropy Measures for Multivariate Variability

2.1. Shannon’s Entropy

Entropy has long been employed as a measure of the variability (spread, dispersion, or scatter) of a distribution [5]. A common measure of multivariate variability is the joint entropy (JE). Given a $d$-dimensional random vector $X = [X_1, \ldots, X_d]^T \in \mathbb{R}^d$ with probability density function (PDF) $p_X(x)$, where $x = [x_1, \ldots, x_d]^T$, Shannon's joint entropy of $X$ is defined by [1]:
$$H(X) = -\int_{\mathbb{R}^d} p_X(x)\log p_X(x)\,dx$$
Another natural measure of multivariate variability is the Total Marginal Entropy (TME), defined as:
$$T(X) = \sum_{i=1}^d H(X_i) = -\sum_{i=1}^d \int_{\mathbb{R}} p_{X_i}(x_i)\log p_{X_i}(x_i)\,dx_i$$
where $p_{X_i}(x_i)$ denotes the marginal density and $H(X_i)$ the corresponding marginal entropy. We have $T(X) \geq H(X)$, with equality if and only if all elements of $X$ are independent. Further, the following theorem holds.
Theorem 1. 
If X is jointly Gaussian, with PDF:
$$p_X(x) = \frac{1}{(2\pi)^{d/2}\lvert\Sigma\rvert^{1/2}}\exp\!\left(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right)\tag{3}$$
where $\mu$ denotes the mean vector, $\Sigma$ stands for the covariance matrix, and $\lvert\Sigma\rvert$ denotes the determinant of $\Sigma$, then:
$$H(X) = \frac{d}{2}\log 2\pi + \frac{d}{2} + \frac{1}{2}\log\lvert\Sigma\rvert$$
$$T(X) = \frac{d}{2}\log 2\pi + \frac{d}{2} + \frac{1}{2}\log\prod_{i=1}^d \Sigma_{ii}$$
where $\Sigma_{ii}$ denotes the $i$-th diagonal element of $\Sigma$, i.e., the variance of $X_i$.
Proof. 
Using Equation (3), we derive:
$$\begin{aligned}
H(X) &= -\int_{\mathbb{R}^d} p_X(x)\log p_X(x)\,dx\\
&= -\int_{\mathbb{R}^d} p_X(x)\log\!\left(\frac{1}{(2\pi)^{d/2}\lvert\Sigma\rvert^{1/2}}\exp\!\left(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right)\right)dx\\
&= \log\!\left((2\pi)^{d/2}\lvert\Sigma\rvert^{1/2}\right)\int_{\mathbb{R}^d} p_X(x)\,dx + \frac{1}{2}\int_{\mathbb{R}^d}(x-\mu)^T\Sigma^{-1}(x-\mu)\,p_X(x)\,dx\\
&= \frac{d}{2}\log 2\pi + \frac{1}{2}\log\lvert\Sigma\rvert + \frac{1}{2}\mathrm{Tr}\!\left(\Sigma^{-1}\Sigma\right)\\
&= \frac{d}{2}\log 2\pi + \frac{d}{2} + \frac{1}{2}\log\lvert\Sigma\rvert
\end{aligned}$$
where $\mathrm{Tr}(\cdot)$ denotes the trace operator. In a similar way, we get:
$$T(X) = \sum_{i=1}^d H(X_i) = \sum_{i=1}^d\left(\frac{1}{2}\log 2\pi + \frac{1}{2} + \frac{1}{2}\log\Sigma_{ii}\right) = \frac{d}{2}\log 2\pi + \frac{d}{2} + \frac{1}{2}\log\prod_{i=1}^d \Sigma_{ii}$$
□
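As a quick numerical sanity check of Theorem 1, the following minimal Python/NumPy sketch compares the closed-form joint and total marginal entropies with a Monte Carlo estimate of $-E[\log p_X(X)]$; the covariance matrix is an arbitrary assumption chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
# An arbitrary positive-definite covariance matrix (illustrative assumption)
A = rng.standard_normal((d, d))
Sigma = A @ A.T + d * np.eye(d)
mu = np.zeros(d)

# Closed forms from Theorem 1
H_joint = 0.5*d*np.log(2*np.pi) + 0.5*d + 0.5*np.log(np.linalg.det(Sigma))
T_marg  = 0.5*d*np.log(2*np.pi) + 0.5*d + 0.5*np.log(np.prod(np.diag(Sigma)))

# Monte Carlo check of H(X) = -E[log p_X(X)]
x = rng.multivariate_normal(mu, Sigma, size=200_000)
Sinv = np.linalg.inv(Sigma)
logp = (-0.5*d*np.log(2*np.pi) - 0.5*np.log(np.linalg.det(Sigma))
        - 0.5*np.einsum('ij,jk,ik->i', x - mu, Sinv, x - mu))
print("closed-form H(X):", H_joint, " Monte Carlo estimate:", -logp.mean())
print("closed-form T(X):", T_marg, " (>= H(X) since the entries are correlated)")
```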
Remark 1. 
Since the logarithm is a monotonic function, for the jointly Gaussian case the joint entropy $H(X)$ is equivalent to the generalized variance (GV), namely the determinant $\lvert\Sigma\rvert$ [18,19,20], and the total marginal entropy $T(X)$ is equivalent to the geometric mean of the $d$ marginal variances, $\left(\prod_{i=1}^d \Sigma_{ii}\right)^{1/d}$. The concept of the generalized variance, which can be traced back to Wilks [21], was suggested by Sokal [22] to measure the overall variability in multivariate biometrical studies, was applied by Goodman [23] to obtain easily interpretable results on corn and cotton populations, and was recently applied by Barrett, Barnett, and Seth [24,25] to multivariate Granger causality analysis. The generalized variance plays an important role in Maximum Likelihood Estimation (MLE) and model selection. Some limitations of the generalized variance, however, were discussed in [18,19,20].
The covariance matrix $\Sigma$ can be expressed as:
$$\Sigma = \Delta P \Delta$$
where $P$ is the correlation matrix, and $\Delta$ is a diagonal matrix with the $d$ marginal standard deviations $\sqrt{\Sigma_{ii}}$ along the diagonal. Thus, the generalized variance and the geometric mean of the marginal variances have the following relationship:
$$\lvert\Sigma\rvert = \lvert P\rvert \prod_{i=1}^d \Sigma_{ii}\tag{7}$$
where $\lvert P\rvert$ is the determinant of $P$. From Equation (7), one can see that the generalized variance depends on both $\lvert P\rvert$ and the geometric mean of the marginal variances. If the correlation matrix $P$ is near-singular, however, the generalized variance will collapse to a very small value regardless of the values of the marginal variances. This is a significant disadvantage of the generalized variance [18].
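The decomposition $\lvert\Sigma\rvert = \lvert P\rvert\prod_{i=1}^d\Sigma_{ii}$ and the collapse of the generalized variance under a near-singular correlation matrix can be illustrated with a short sketch; the marginal standard deviations and the equicorrelation values below are arbitrary assumptions chosen for illustration.

```python
import numpy as np

# Illustrate |Sigma| = |P| * prod(Sigma_ii) and how the generalized variance
# collapses when the correlation matrix P becomes near-singular, while the
# total dispersion Tr(Sigma) is unaffected.
std = np.array([2.0, 5.0, 10.0])                   # assumed marginal standard deviations
for r in [0.0, 0.9, 0.999]:
    P = np.full((3, 3), r) + (1 - r) * np.eye(3)   # equicorrelation matrix
    Delta = np.diag(std)
    Sigma = Delta @ P @ Delta
    print(f"r={r:5.3f}  |P|={np.linalg.det(P):.3e}  "
          f"|Sigma|={np.linalg.det(Sigma):.3e}  "
          f"|P|*prod(var)={np.linalg.det(P)*np.prod(std**2):.3e}  "
          f"Tr(Sigma)={np.trace(Sigma):.1f}")
```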
Remark 2. 
Although for the jointly Gaussian case there is a simple relationship between the entropy based measures of variability and the traditional variance based measures, the two kinds of measures are quite different. The entropy may be related to higher-order moments of a distribution and can provide a much more comprehensive characterization of the distribution. The variance based measures are justifiable only when the distribution (e.g., Gaussian) can be well characterized by its first two moments, or when a quadratic approximation is satisfactory [4,6].

2.2. Renyi's Entropy

There are many extensions to Shannon's measure of entropy. Renyi's entropy of order $\alpha$ is a well-known generalization of Shannon entropy [2,4]. Based on Renyi's definition of entropy, the order-$\alpha$ joint entropy and total marginal entropy of $X$ are:
$$H_\alpha(X) = \frac{1}{1-\alpha}\log V_\alpha(X)\tag{8}$$
$$T_\alpha(X) = \sum_{i=1}^d H_\alpha(X_i) = \frac{1}{1-\alpha}\log\prod_{i=1}^d V_\alpha(X_i)\tag{9}$$
where $\alpha > 0$, $\alpha \neq 1$, and $V_\alpha(X)$ denotes the order-$\alpha$ Information Potential (IP) [4] of $X$:
$$V_\alpha(X) = \int_{\mathbb{R}^d} p_X^\alpha(x)\,dx$$
Remark 3. 
In recent years, Renyi's entropy of order $\alpha$ has been widely accepted as an optimality criterion in Information Theoretic Learning (ITL) [4]. The nonparametric kernel (Parzen window) estimator of Renyi entropy (especially for $\alpha = 2$) has been shown to be more computationally efficient than that of Shannon entropy [11,12].
Remark 4. 
The information potential is actually the Information Generating Function defined in [26]. It is called the information potential since each term in its kernel estimator can be interpreted as a potential between two particles [4]. As the logarithm is a monotonic function, minimizing Renyi entropy is equivalent to minimizing (when $\alpha < 1$) or maximizing (when $\alpha > 1$) the information potential. Thus, the information potential can be used as an alternative to Renyi entropy as a measure of variability.
It is easy to verify that Renyi's entropy approaches Shannon's entropy as $\alpha \to 1$. In addition, Theorem 1 can be extended to the Renyi entropy case.
Theorem 2. 
If X is jointly Gaussian, with PDF given by Equation (3), then:
$$H_\alpha(X) = \frac{d}{1-\alpha}\log\beta + \frac{1}{2}\log\lvert\Sigma\rvert\tag{11}$$
$$T_\alpha(X) = \frac{d}{1-\alpha}\log\beta + \frac{1}{2}\log\prod_{i=1}^d \Sigma_{ii}\tag{12}$$
where $\beta = (2\pi)^{(1-\alpha)/2}\,\alpha^{-1/2}$.
Proof. 
One can derive:
$$\begin{aligned}
V_\alpha(X) &= \int_{\mathbb{R}^d} p_X^\alpha(x)\,dx = \int_{\mathbb{R}^d}\left(\frac{1}{(2\pi)^{d/2}\lvert\Sigma\rvert^{1/2}}\exp\!\left(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right)\right)^{\!\alpha} dx\\
&= \frac{1}{(2\pi)^{\alpha d/2}\lvert\Sigma\rvert^{\alpha/2}}\int_{\mathbb{R}^d}\exp\!\left(-\frac{1}{2}(x-\mu)^T(\alpha^{-1}\Sigma)^{-1}(x-\mu)\right)dx\\
&= \frac{1}{(2\pi)^{\alpha d/2}\lvert\Sigma\rvert^{\alpha/2}}\times(2\pi)^{d/2}\lvert\alpha^{-1}\Sigma\rvert^{1/2} = \beta^d\,\lvert\Sigma\rvert^{(1-\alpha)/2}
\end{aligned}\tag{13}$$
Similarly, we have:
$$V_\alpha(X_i) = \beta\,\Sigma_{ii}^{(1-\alpha)/2}\tag{14}$$
Substituting Equations (13) and (14) into Equations (8) and (9), respectively, yields Equations (11) and (12). □
Remark 5. 
From Theorem 2 we find that, for the jointly Gaussian case, Renyi's joint entropy $H_\alpha(X)$ is also equivalent to the generalized variance $\lvert\Sigma\rvert$, and the order-$\alpha$ total marginal entropy $T_\alpha(X)$ is equivalent to the geometric mean of the $d$ marginal variances.
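A brief numerical check of Theorem 2 (here for the common choice $\alpha = 2$, with an assumed covariance matrix): the information potential $V_\alpha(X) = E\big[p_X^{\alpha-1}(X)\big]$ is estimated by Monte Carlo and compared with the closed form $\beta^d\lvert\Sigma\rvert^{(1-\alpha)/2}$.

```python
import numpy as np

rng = np.random.default_rng(1)
d, alpha = 2, 2.0
Sigma = np.array([[1.0, 0.6], [0.6, 2.0]])        # assumed covariance for illustration
Sinv, detS = np.linalg.inv(Sigma), np.linalg.det(Sigma)

def logpdf(x):
    # zero-mean Gaussian log-density with covariance Sigma
    q = np.einsum('ij,jk,ik->i', x, Sinv, x)
    return -0.5*d*np.log(2*np.pi) - 0.5*np.log(detS) - 0.5*q

# Monte Carlo: V_alpha(X) = E[p_X(X)^(alpha-1)]
x = rng.multivariate_normal(np.zeros(d), Sigma, size=500_000)
V_mc = np.mean(np.exp((alpha - 1.0) * logpdf(x)))

# Closed form from Theorem 2: V_alpha(X) = beta^d * |Sigma|^((1-alpha)/2)
beta = (2*np.pi)**((1 - alpha)/2) * alpha**(-0.5)
V_cf = beta**d * detS**((1 - alpha)/2)

print("V_alpha  Monte Carlo:", V_mc, " closed form:", V_cf)
print("Renyi entropy H_alpha:", np.log(V_cf) / (1 - alpha))
```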

2.3. Entropy Powers

In [5], the variability (or the extent) of a distribution was measured by the exponential entropy, or equivalently, the entropy power. Shannon and Renyi’s joint entropy powers (JEP) are defined by [27]:
$$N(X) = \exp\!\left[\frac{2}{d}H(X)\right]$$
$$N_\alpha(X) = \exp\!\left[\frac{2}{d}H_\alpha(X)\right]$$
Similarly, the total marginal entropy powers (TMEP) are:
$$M(X) = \sum_{i=1}^d N(X_i) = \sum_{i=1}^d \exp\left[2H(X_i)\right]$$
$$M_\alpha(X) = \sum_{i=1}^d N_\alpha(X_i) = \sum_{i=1}^d \exp\left[2H_\alpha(X_i)\right]$$
Clearly, we have $N(X) = \lim_{\alpha\to 1} N_\alpha(X)$ and $M(X) = \lim_{\alpha\to 1} M_\alpha(X)$. The following theorem holds.
Theorem 3. 
If X is jointly Gaussian, with PDF given by Equation (3), then:
$$N(X) = 2\pi e\,\lvert\Sigma\rvert^{1/d}$$
$$N_\alpha(X) = \beta^{\frac{2}{1-\alpha}}\lvert\Sigma\rvert^{1/d}$$
$$M(X) = 2\pi e\,\mathrm{Tr}(\Sigma)$$
$$M_\alpha(X) = \beta^{\frac{2}{1-\alpha}}\mathrm{Tr}(\Sigma)$$
Proof. 
Since $H(X) = \frac{d}{2}\log 2\pi + \frac{d}{2} + \frac{1}{2}\log\lvert\Sigma\rvert$, $H_\alpha(X) = \frac{d}{1-\alpha}\log\beta + \frac{1}{2}\log\lvert\Sigma\rvert$, $H(X_i) = \frac{1}{2}\log 2\pi + \frac{1}{2} + \frac{1}{2}\log\Sigma_{ii}$, and $H_\alpha(X_i) = \frac{1}{1-\alpha}\log\beta + \frac{1}{2}\log\Sigma_{ii}$, we have:
$$N(X) = \exp\!\left[\frac{2}{d}H(X)\right] = 2\pi e\,\lvert\Sigma\rvert^{1/d}$$
$$N_\alpha(X) = \exp\!\left[\frac{2}{d}H_\alpha(X)\right] = \beta^{\frac{2}{1-\alpha}}\lvert\Sigma\rvert^{1/d}$$
$$M(X) = \sum_{i=1}^d \exp\left[2H(X_i)\right] = 2\pi e\,\mathrm{Tr}(\Sigma)$$
$$M_\alpha(X) = \sum_{i=1}^d \exp\left[2H_\alpha(X_i)\right] = \beta^{\frac{2}{1-\alpha}}\mathrm{Tr}(\Sigma)$$
□
Remark 6. 
For the jointly Gaussian case, the joint entropy powers $N(X)$ and $N_\alpha(X)$ are equivalent to the generalized variance $\lvert\Sigma\rvert$, and the total marginal entropy powers $M(X)$ and $M_\alpha(X)$ are equivalent to the well-known total dispersion (TD) or total variation (TV), given by $\mathrm{Tr}(\Sigma) = \sum_{i=1}^d \Sigma_{ii}$ [19]. The total dispersion is widely accepted as a measure of variation in regression, clustering, and principal components analysis (PCA). Let $\lambda_1, \lambda_2, \ldots, \lambda_d$ be the eigenvalues of the covariance matrix $\Sigma$. Then the generalized variance and total dispersion can be expressed as:
$$\lvert\Sigma\rvert = \prod_{i=1}^d \lambda_i,\qquad \mathrm{Tr}(\Sigma) = \sum_{i=1}^d \lambda_i$$
Table 1 lists the Renyi entropy based measures of variability (which include the Shannon entropy based measures as the limiting case $\alpha \to 1$) and their equivalent variance based measures for the jointly Gaussian case.
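The entropy-power relations of Theorem 3 and the eigenvalue identities above can be verified numerically in a few lines; the covariance entries are illustrative assumptions.

```python
import numpy as np

Sigma = np.array([[3.0, 1.0, 0.5],
                  [1.0, 2.0, 0.3],
                  [0.5, 0.3, 1.5]])
d = Sigma.shape[0]
lam = np.linalg.eigvalsh(Sigma)

# Shannon joint entropy and entropy powers for the Gaussian case (Theorem 3)
H_joint = 0.5*d*np.log(2*np.pi*np.e) + 0.5*np.log(np.linalg.det(Sigma))
N_joint = np.exp(2.0*H_joint/d)                       # joint entropy power N(X)
H_marg  = 0.5*np.log(2*np.pi*np.e*np.diag(Sigma))     # marginal entropies H(X_i)
M_total = np.sum(np.exp(2.0*H_marg))                  # total marginal entropy power M(X)

print("N(X):", N_joint, " 2*pi*e*|Sigma|^(1/d):", 2*np.pi*np.e*np.linalg.det(Sigma)**(1/d))
print("M(X):", M_total, " 2*pi*e*Tr(Sigma):", 2*np.pi*np.e*np.trace(Sigma))
print("|Sigma| = prod(eig):", np.prod(lam), " Tr(Sigma) = sum(eig):", np.sum(lam))
```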

3. Smoothed Multivariate Entropy Measures

In most practical situations, the analytical evaluation of the entropy is not possible, and one has to estimate its value from samples. Many entropy estimators exist, among which the k-nearest-neighbor based estimators are important in a wide range of practical applications [28]. In ITL, however, the kernel density estimation (KDE) based estimators are perhaps the most popular due to their smoothness [4]. With the KDE approach [29] and a fixed kernel function, the estimated entropy converges asymptotically to the entropy of the underlying random variable plus an independent random variable whose PDF is the kernel function [4]. This asymptotic value of the entropy is called the smoothed entropy [16]. In this section, we investigate some interesting properties of the smoothed multivariate entropy (joint or total marginal) as a measure of variability. Unless mentioned otherwise, the smoothed entropy studied in the following is based on the Shannon entropy, but the obtained results can be extended to many other entropies.
Given a $d$-dimensional random vector $X = [X_1, \ldots, X_d]^T \in \mathbb{R}^d$ with PDF $p_X(x)$, and a smoothing vector $Z = [Z_1, \ldots, Z_d]^T \in \mathbb{R}^d$ that is independent of $X$ and has PDF $p_Z(x)$, the smoothed joint entropy of $X$, with smoothing factor $\lambda$ ($\lambda > 0$), is defined by [16]:
$$H_{\lambda Z}(X) = H(X+\lambda Z) = -\int_{\mathbb{R}^d} p_{X+\lambda Z}(x)\log p_{X+\lambda Z}(x)\,dx$$
where $p_{X+\lambda Z}(x)$ denotes the PDF of $X+\lambda Z$, which is:
$$p_{X+\lambda Z}(x) = p_X(x) * p_{\lambda Z}(x)$$
where "$*$" denotes the convolution operator, and $p_{\lambda Z}(x) = \frac{1}{\lambda^d}\,p_Z\!\left(\frac{x}{\lambda}\right)$ is the PDF of $\lambda Z$.
Let $\{x(1), x(2), \ldots, x(N)\}$ be $N$ independent, identically distributed (i.i.d.) samples drawn from $p_X(x)$. With the KDE approach, using a fixed kernel function $p_Z(\cdot)$, the estimated PDF of $X$ is [29]:
$$\hat{p}_X(x) = \frac{1}{N\lambda^d}\sum_{k=1}^N p_Z\!\left(\frac{x-x(k)}{\lambda}\right)\tag{26}$$
where $\lambda > 0$ is the smoothing factor (or kernel width). As the sample number $N \to \infty$, the estimated PDF converges uniformly (with probability 1) to the true PDF convolved with the kernel function, that is:
$$\hat{p}_X(x) \xrightarrow[N\to\infty]{} p_{X+\lambda Z}(x)$$
Plugging the above estimated PDF into the entropy definition, one obtains an estimated entropy of $X$ that converges, almost surely (a.s.), to the smoothed entropy $H_{\lambda Z}(X)$.
Remark 7. 
Theoretically, using a suitable annealing rate for the smoothing factor $\lambda$, the KDE based entropy estimator can be asymptotically unbiased and consistent [29]. In many machine learning applications, however, the smoothing factor is kept fixed, mainly for two reasons: (1) in practical situations, the training data are always finite; (2) learning generally seeks the extrema (minimum or maximum) of the cost function, independently of its actual value, so the dependence on the estimation bias is reduced. Therefore, studying the smoothed entropy helps us gain insights into the asymptotic behavior of entropy based learning.
Similarly, one can define the Smoothed Total Marginal Entropy of X :
$$T_{\lambda Z}(X) = \sum_{i=1}^d H_{\lambda Z_i}(X_i) = -\sum_{i=1}^d \int_{\mathbb{R}} p_{X_i+\lambda Z_i}(x)\log p_{X_i+\lambda Z_i}(x)\,dx$$
where $p_{X_i+\lambda Z_i}(x)$ denotes the smoothed marginal density, $p_{X_i+\lambda Z_i}(x) = p_{X_i}(x) * p_{\lambda Z_i}(x)$.
The smoothing factor $\lambda$ is a very important parameter in the smoothed entropy measures (joint or total marginal). As $\lambda \to 0$, the smoothed entropy measures reduce to the original entropy measures: $\lim_{\lambda\to 0} H_{\lambda Z}(X) = H(X)$ and $\lim_{\lambda\to 0} T_{\lambda Z}(X) = T(X)$. In the following, we study the case in which $\lambda$ is very large. Before presenting Theorem 4, we introduce an important lemma.
Lemma 1. 
(De Bruijn's Identity [30]): For any two independent random $d$-dimensional vectors $X$ and $Z$, with PDFs $p_X$ and $p_Z$, such that $J(X)$ exists and $Z$ has finite covariance, where $J(X)$ denotes the $d\times d$ Fisher Information Matrix (FIM):
$$J(X) = E\!\left[S(X)S(X)^T\right]$$
in which $S(X) = \frac{1}{p_X(X)}\nabla_X\, p_X(X)$ is the zero-mean score of $X$, then:
$$\left.\frac{d}{dt}H\!\left(X+\sqrt{t}\,Z\right)\right|_{t=0} = \frac{1}{2}\mathrm{Tr}\!\left(J(X)\Sigma_Z\right)$$
where $\Sigma_Z$ denotes the $d\times d$ covariance matrix of $Z$.
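For jointly Gaussian $X$ and $Z$, both sides of de Bruijn's identity have closed forms, $H(X+\sqrt{t}\,Z) = \frac{1}{2}\log\big((2\pi e)^d\lvert\Sigma_X + t\Sigma_Z\rvert\big)$ and $J(X) = \Sigma_X^{-1}$, so the lemma can be checked numerically; the covariance matrices below are assumed for illustration.

```python
import numpy as np

Sigma_X = np.array([[2.0, 0.5], [0.5, 1.0]])
Sigma_Z = np.array([[1.0, 0.3], [0.3, 0.5]])
d = 2

def H(t):
    # Gaussian closed form for H(X + sqrt(t) Z)
    return 0.5*np.log((2*np.pi*np.e)**d * np.linalg.det(Sigma_X + t*Sigma_Z))

eps = 1e-6
lhs = (H(eps) - H(0.0)) / eps                       # numerical derivative at t = 0
rhs = 0.5*np.trace(np.linalg.inv(Sigma_X) @ Sigma_Z)  # 0.5 * Tr(J(X) Sigma_Z)
print("numerical d/dt H(X + sqrt(t) Z) at t=0:", lhs, "  0.5*Tr(J(X) Sigma_Z):", rhs)
```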
Theorem 4. 
When $\lambda$ is large enough, we have:
$$H_{\lambda Z}(X) \approx H(Z) + \frac{1}{2}\mathrm{Tr}\!\left(J(Z)\Sigma_X\right)t - \frac{d}{2}\log t\tag{31}$$
$$T_{\lambda Z}(X) \approx \sum_{i=1}^d H(Z_i) + \frac{t}{2}\sum_{i=1}^d J(Z_i)\,\sigma_{X_i}^2 - \frac{d}{2}\log t\tag{32}$$
where $t = 1/\lambda^2$, and $\sigma_{X_i}^2$ denotes the variance of $X_i$.
Proof. 
The smoothed joint entropy $H_{\lambda Z}(X)$ can be rewritten as:
$$H_{\lambda Z}(X) = H(X+\lambda Z) = H\!\left(\lambda\left(\tfrac{1}{\lambda}X + Z\right)\right) = H\!\left(\tfrac{1}{\lambda}X + Z\right) + d\log\lambda$$
Let $t = 1/\lambda^2$; we have:
$$H_{\lambda Z}(X) = H\!\left(\sqrt{t}\,X + Z\right) - \frac{d}{2}\log t$$
Then, by De Bruijn’s Identity:
$$H_{\lambda Z}(X) = H(Z) + \left.\frac{d}{dt}H\!\left(\sqrt{t}\,X+Z\right)\right|_{t=0} t - \frac{d}{2}\log t + o(t) = H(Z) + \frac{1}{2}\mathrm{Tr}\!\left(J(Z)\Sigma_X\right)t - \frac{d}{2}\log t + o(t)$$
where $o(t)$ denotes the higher-order infinitesimal term of the Taylor expansion. Similarly, one can easily derive:
$$T_{\lambda Z}(X) = \sum_{i=1}^d H(Z_i) + \frac{t}{2}\sum_{i=1}^d J(Z_i)\,\sigma_{X_i}^2 - \frac{d}{2}\log t + o(t)$$
Thus, when $\lambda$ is large enough, $t$ will be very small, such that Equations (31) and (32) hold. □
Remark 8. 
In Equation (31), the term $H(Z) - \frac{d}{2}\log t$ does not depend on $X$. So, when the smoothing factor $\lambda$ is large enough, the smoothed joint entropy $H_{\lambda Z}(X)$ will be, approximately, equivalent to $\mathrm{Tr}(J(Z)\Sigma_X)$, denoted by:
$$H_{\lambda Z}(X)\;\overset{\lambda\to\infty}{\sim}\;\mathrm{Tr}\!\left(J(Z)\Sigma_X\right)$$
Similarly, we have:
$$T_{\lambda Z}(X)\;\overset{\lambda\to\infty}{\sim}\;\sum_{i=1}^d J(Z_i)\,\sigma_{X_i}^2$$
In the following, we consider three special cases of the smoothing vector Z .
Case 1. 
If $Z$ is a jointly Gaussian random vector, then $J(Z) = \Sigma_Z^{-1}$ and $J(Z_i) = 1/\sigma_{Z_i}^2$, where $\sigma_{Z_i}^2$ denotes the variance of $Z_i$. In this case, we have:
$$H_{\lambda Z}(X)\;\overset{\lambda\to\infty}{\sim}\;\mathrm{Tr}\!\left(\Sigma_Z^{-1}\Sigma_X\right)$$
$$T_{\lambda Z}(X)\;\overset{\lambda\to\infty}{\sim}\;\sum_{i=1}^d \sigma_{X_i}^2/\sigma_{Z_i}^2$$
Case 2. 
If $Z$ has independent entries, then $J(Z)$ is a diagonal matrix with the $J(Z_i)$ along the diagonal. It follows easily that:
$$H_{\lambda Z}(X)\;\overset{\lambda\to\infty}{\sim}\;T_{\lambda Z}(X)\;\overset{\lambda\to\infty}{\sim}\;\sum_{i=1}^d J(Z_i)\,\sigma_{X_i}^2$$
Case 3. 
If $Z$ has independent and identically distributed (i.i.d.) entries, then $J(Z) = J(Z_1)\,I$, where $I$ is the $d\times d$ identity matrix. Thus:
$$H_{\lambda Z}(X)\;\overset{\lambda\to\infty}{\sim}\;T_{\lambda Z}(X)\;\overset{\lambda\to\infty}{\sim}\;\sum_{i=1}^d \sigma_{X_i}^2$$
Remark 9. 
It is interesting to observe that, if the smoothing vector $Z$ has independent entries, the smoothed joint entropy and the smoothed total marginal entropy will be equivalent to each other as $\lambda \to \infty$. In this case, they are both equivalent to a weighted version of the total dispersion, with weights $J(Z_i)$. In particular, when $Z$ has i.i.d. entries, the two entropy measures will be equivalent (as $\lambda \to \infty$) to the ordinary total dispersion. Note that the above results hold even if $X$ is non-Gaussian distributed. The equivalent measures of the smoothed joint and total marginal entropies as $\lambda \to \infty$ are summarized in Table 2.
Example 1. 
According to Theorem 4, if $Z$ has independent entries, the smoothed joint entropy $H_{\lambda Z}(X)$ and the smoothed total marginal entropy $T_{\lambda Z}(X)$ will approach the same value as $\lambda$ increases. Below we present a simple example to confirm this fact.
Consider a two-dimensional case in which X is mixed-Gaussian with PDF:
$$p_X(x) = \frac{1}{4\pi\sqrt{1-\rho^2}}\left\{\exp\!\left(-\frac{x_2^2 - 2\rho x_2(x_1-\mu)+(x_1-\mu)^2}{2(1-\rho^2)}\right)+\exp\!\left(-\frac{x_2^2 + 2\rho x_2(x_1+\mu)+(x_1+\mu)^2}{2(1-\rho^2)}\right)\right\}$$
where $\mu = 0.5$, $\rho = 0.95$, and $Z$ is uniformly distributed over $[-1.0, 1.0]\times[-1.0, 1.0]$. Figure 1 illustrates the smoothed entropies (joint and total marginal) for different $\lambda$ values. As one can see clearly, when $\lambda$ is small (close to zero), the smoothed total marginal entropy is larger than the smoothed joint entropy and the difference is significant, whereas when $\lambda$ gets larger (say, larger than 2.0), the discrepancy between the two entropy measures disappears.
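A rough numerical sketch of this example (assuming NumPy and SciPy are available): the smoothed density $p_{X+\lambda Z}$ is obtained by discretizing $p_X$ and $p_{\lambda Z}$ on a grid and convolving them, and the joint and total marginal entropies are then evaluated by numerical integration. The grid size and the set of $\lambda$ values are assumptions chosen for illustration.

```python
import numpy as np
from scipy.signal import fftconvolve

mu, rho = 0.5, 0.95

def mixture_pdf(x1, x2):
    # two-component Gaussian mixture of the example
    c = 1.0 / (4.0 * np.pi * np.sqrt(1.0 - rho**2))
    q1 = (x2**2 - 2*rho*x2*(x1 - mu) + (x1 - mu)**2) / (2*(1 - rho**2))
    q2 = (x2**2 + 2*rho*x2*(x1 + mu) + (x1 + mu)**2) / (2*(1 - rho**2))
    return c * (np.exp(-q1) + np.exp(-q2))

def grid_entropy(p, cell):
    # discrete approximation of -\int p log p
    p = np.clip(p, 1e-300, None)
    return -np.sum(p * np.log(p)) * cell

L, n = 12.0, 801                      # grid half-width and resolution (assumed large enough)
x = np.linspace(-L, L, n)
dx = x[1] - x[0]
X1, X2 = np.meshgrid(x, x, indexing="ij")
pX = mixture_pdf(X1, X2)

for lam in [0.1, 0.5, 1.0, 2.0, 3.0]:
    # density of lam*Z, with Z uniform on [-1,1]^2
    pZ = ((np.abs(X1) <= lam) & (np.abs(X2) <= lam)) / (2.0*lam)**2
    p_s = fftconvolve(pX, pZ, mode="same") * dx * dx      # p_{X + lam Z}
    p_s /= p_s.sum() * dx * dx                            # renormalise numerical error
    H_joint = grid_entropy(p_s, dx*dx)
    p1 = p_s.sum(axis=1) * dx                             # smoothed marginal densities
    p2 = p_s.sum(axis=0) * dx
    T_marg = grid_entropy(p1, dx) + grid_entropy(p2, dx)
    print(f"lambda={lam:4.1f}  H={H_joint:7.4f}  T={T_marg:7.4f}  T-H={T_marg-H_joint:7.4f}")
```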

4. Multivariate Entropy Estimators with Finite Samples

The smoothed entropy is of theoretical interest only, since in practical applications the number of samples is always limited and the asymptotic value of the entropy estimator can never be reached. In the following, we show, however, that similar results hold in the finite-sample case. Consider again the kernel density estimator of Equation (26). For simplicity, we assume that the kernel function is Gaussian with covariance matrix $\Sigma_Z = I$, where $I$ is the $d\times d$ identity matrix. In this case, the estimated PDF of $X$ becomes:
$$\hat{p}_X(x) = \frac{1}{N}\left(\frac{1}{\sqrt{2\pi}\,\lambda}\right)^{\!d}\sum_{k=1}^N \exp\!\left(-\frac{\|x-x(k)\|^2}{2\lambda^2}\right) = \frac{1}{N}\left(\frac{1}{\sqrt{2\pi}\,\lambda}\right)^{\!d}\sum_{k=1}^N \prod_{i=1}^d \exp\!\left(-\frac{(x_i-x_i(k))^2}{2\lambda^2}\right)$$
where $x_i$ denotes the $i$-th element of the vector $x$. With the above estimated PDF, a sample-mean estimator of the joint entropy $H(X)$ is [4]:
$$\hat{H}(X) = -\frac{1}{N}\sum_{j=1}^N \log\hat{p}_X(x(j)) = -\frac{1}{N}\sum_{j=1}^N \log\left\{\frac{1}{N}\left(\frac{1}{\sqrt{2\pi}\,\lambda}\right)^{\!d}\sum_{k=1}^N \exp\!\left(-\frac{\|x(j)-x(k)\|^2}{2\lambda^2}\right)\right\}\tag{44}$$
Similarly, an estimator for the total marginal entropy can be obtained as follows:
$$\hat{T}(X) = \sum_{i=1}^d\left(-\frac{1}{N}\sum_{j=1}^N \log\left\{\frac{1}{N}\sum_{k=1}^N \frac{1}{\sqrt{2\pi}\,\lambda}\exp\!\left(-\frac{(x_i(j)-x_i(k))^2}{2\lambda^2}\right)\right\}\right)\tag{45}$$
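A minimal NumPy sketch of the two plug-in estimators in Equations (44) and (45); the function names are ours, chosen for illustration.

```python
import numpy as np

def joint_entropy_kde(x, lam):
    """Plug-in estimator of the joint entropy, Equation (44):
    Gaussian kernel with identity covariance and width lam."""
    N, d = x.shape
    sq = np.sum((x[:, None, :] - x[None, :, :])**2, axis=2)   # pairwise squared distances
    dens = np.mean(np.exp(-sq / (2.0*lam**2)), axis=1) / (np.sqrt(2*np.pi)*lam)**d
    return -np.mean(np.log(dens))

def total_marginal_entropy_kde(x, lam):
    """Plug-in estimator of the total marginal entropy, Equation (45)."""
    N, d = x.shape
    total = 0.0
    for i in range(d):
        diff = x[:, i][:, None] - x[:, i][None, :]
        dens = np.mean(np.exp(-diff**2 / (2.0*lam**2)), axis=1) / (np.sqrt(2*np.pi)*lam)
        total += -np.mean(np.log(dens))
    return total
```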
The following theorem holds.
Theorem 5. 
When $\lambda$ is large enough, we have:
$$\hat{H}(X)\;\approx\;\hat{T}(X)\;\approx\;d\log\!\left(\sqrt{2\pi}\,\lambda\right) + \frac{1}{\lambda^2}\sum_{i=1}^d \hat{\sigma}_{X_i}^2\tag{46}$$
where $\hat{\sigma}_{X_i}^2 = \frac{1}{N}\sum_{j=1}^N\left[x_i(j)-\frac{1}{N}\sum_{k=1}^N x_i(k)\right]^2$ is the estimated variance of $X_i$.
Proof. 
When $\lambda\to\infty$, we have $\frac{\|x(j)-x(k)\|^2}{2\lambda^2}\to 0$. It follows that:
$$\begin{aligned}
\hat{H}(X) &= d\log\!\left(\sqrt{2\pi}\,\lambda\right) - \frac{1}{N}\sum_{j=1}^N\log\left\{\frac{1}{N}\sum_{k=1}^N\exp\!\left(-\frac{\|x(j)-x(k)\|^2}{2\lambda^2}\right)\right\}\\
&\overset{(a)}{\approx} d\log\!\left(\sqrt{2\pi}\,\lambda\right) - \frac{1}{N}\sum_{j=1}^N\log\left(1-\frac{1}{N}\sum_{k=1}^N\frac{\|x(j)-x(k)\|^2}{2\lambda^2}\right)\\
&\overset{(b)}{\approx} d\log\!\left(\sqrt{2\pi}\,\lambda\right) + \frac{1}{N^2}\sum_{j=1}^N\sum_{k=1}^N\frac{\|x(j)-x(k)\|^2}{2\lambda^2}\\
&= d\log\!\left(\sqrt{2\pi}\,\lambda\right) + \frac{1}{2N^2\lambda^2}\sum_{j=1}^N\sum_{k=1}^N\sum_{i=1}^d \left(x_i(j)-x_i(k)\right)^2\\
&= d\log\!\left(\sqrt{2\pi}\,\lambda\right) + \frac{1}{2\lambda^2}\sum_{i=1}^d\left(\frac{1}{N^2}\sum_{j=1}^N\sum_{k=1}^N\left(x_i(j)-x_i(k)\right)^2\right)\\
&\overset{(c)}{=} d\log\!\left(\sqrt{2\pi}\,\lambda\right) + \frac{1}{\lambda^2}\sum_{i=1}^d\hat{\sigma}_{X_i}^2
\end{aligned}\tag{47}$$
where (a) follows from $\exp(x)\approx 1+x$ as $x\to 0$, (b) follows from $\log(1+x)\approx x$ as $x\to 0$, and (c) follows from:
$$\begin{aligned}
\frac{1}{N^2}\sum_{j=1}^N\sum_{k=1}^N \left(x_i(j)-x_i(k)\right)^2 &= \frac{1}{N^2}\sum_{j=1}^N\sum_{k=1}^N \left(x_i^2(j)+x_i^2(k)-2x_i(j)x_i(k)\right)\\
&= \frac{1}{N^2}\left[\sum_{j=1}^N\sum_{k=1}^N x_i^2(j)+\sum_{j=1}^N\sum_{k=1}^N x_i^2(k)-2\Big(\sum_{j=1}^N x_i(j)\Big)\Big(\sum_{k=1}^N x_i(k)\Big)\right]\\
&= \frac{2}{N^2}\left[N\sum_{j=1}^N x_i^2(j)-\Big(\sum_{j=1}^N x_i(j)\Big)\Big(\sum_{k=1}^N x_i(k)\Big)\right]\\
&= \frac{2}{N}\sum_{j=1}^N\left[x_i(j)-\frac{1}{N}\sum_{k=1}^N x_i(k)\right]^2 = 2\hat{\sigma}_{X_i}^2
\end{aligned}\tag{48}$$
In a similar way, we prove:
$$\begin{aligned}
\hat{T}(X) &= \sum_{i=1}^d\left(-\frac{1}{N}\sum_{j=1}^N\log\left\{\frac{1}{N}\sum_{k=1}^N\frac{1}{\sqrt{2\pi}\,\lambda}\exp\!\left(-\frac{(x_i(j)-x_i(k))^2}{2\lambda^2}\right)\right\}\right)\\
&\approx d\log\!\left(\sqrt{2\pi}\,\lambda\right) - \sum_{i=1}^d\left(\frac{1}{N}\sum_{j=1}^N\log\left\{1-\frac{1}{N}\sum_{k=1}^N\frac{(x_i(j)-x_i(k))^2}{2\lambda^2}\right\}\right)\\
&\approx d\log\!\left(\sqrt{2\pi}\,\lambda\right) + \sum_{i=1}^d\left(\frac{1}{N^2}\sum_{j=1}^N\sum_{k=1}^N\frac{(x_i(j)-x_i(k))^2}{2\lambda^2}\right)\\
&= d\log\!\left(\sqrt{2\pi}\,\lambda\right) + \frac{1}{\lambda^2}\sum_{i=1}^d\hat{\sigma}_{X_i}^2
\end{aligned}\tag{49}$$
Combining Equations (47) and (49) we obtain Equation (46). □
Remark 10. 
When the kernel function is Gaussian with an identity covariance matrix, the KDE based entropy estimators (joint or total marginal) will be, approximately, equivalent to the total dispersion estimator $\sum_{i=1}^d\hat{\sigma}_{X_i}^2$ when the smoothing factor $\lambda$ is very large. This result coincides with Theorem 4. For the case in which the Gaussian covariance matrix is diagonal, one can also prove that the KDE based entropy estimators (joint or total marginal) will be approximately equivalent to a weighted total dispersion estimator as $\lambda\to\infty$. Similar results hold for other entropies such as Renyi entropy.
Example 2. 
Consider 1000 samples drawn from a two-dimensional zero-mean Gaussian distribution with covariance matrix $\Sigma_X = \begin{bmatrix} 1 & 0.99\\ 0.99 & 1\end{bmatrix}$. Figure 2 shows the scatter plot of the samples. Based on these samples, we evaluate the joint entropy and total marginal entropy using Equations (44) and (45), respectively. The estimated entropy values for different $\lambda$ are illustrated in Figure 3, from which we observe that as $\lambda$ becomes larger, the difference between the two estimated entropies disappears. The results support Theorem 5.
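A sketch reproducing this experiment, reusing the estimator functions given after Equation (45); the grid of $\lambda$ values and the random seed are arbitrary assumptions. The last column evaluates the right-hand side of Equation (46) for comparison.

```python
import numpy as np
# assumes joint_entropy_kde and total_marginal_entropy_kde from the previous sketch are in scope

rng = np.random.default_rng(2016)
Sigma_X = np.array([[1.0, 0.99], [0.99, 1.0]])
x = rng.multivariate_normal(np.zeros(2), Sigma_X, size=1000)

for lam in [0.1, 0.5, 1.0, 2.0, 5.0, 10.0]:
    H_hat = joint_entropy_kde(x, lam)
    T_hat = total_marginal_entropy_kde(x, lam)
    # Equation (46): d*log(sqrt(2*pi)*lam) + (1/lam^2) * sum of estimated marginal variances
    approx = 2*np.log(np.sqrt(2*np.pi)*lam) + np.sum(np.var(x, axis=0)) / lam**2
    print(f"lam={lam:5.1f}  H^={H_hat:8.4f}  T^={T_hat:8.4f}  Eq.(46) approx={approx:8.4f}")
```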

5. Conclusions

Measures of the variability of data play significant roles in many machine learning and signal processing applications. Recent studies suggest that machine learning (supervised or unsupervised) can benefit greatly from the use of entropy as a measure of variability, especially when the data possess non-Gaussian distributions. In this paper, we have studied the behavior of entropy as a measure of multivariate variability. The relationships between multivariate entropy (joint or total marginal) and traditional second-order multivariate variability measures, such as total dispersion and generalized variance, have been investigated. For the jointly Gaussian case, the joint entropy (or entropy power) is shown to be equivalent to the generalized variance, while total marginal entropy is equivalent to the geometric mean of the marginal variances, and total marginal entropy power is equivalent to the total dispersion. We have also gained insights into the relationships between the smoothed multivariate entropy (joint or total marginal) and the total dispersion. Under certain conditions, the smoothed multivariate entropy will be, approximately, equivalent to the total dispersion. Similar results hold for the multivariate entropy estimators (with a finite number of samples) based on kernel density estimation (KDE). The results of this work can help us to understand the behavior of multidimensional information theoretic learning.

Acknowledgments

This work was supported by 973 Program (No. 2015CB351703) and National NSF of China (No. 61372152).

Author Contributions

Badong Chen proved the main results and wrote the draft; Jianji Wang and Haiquan Zhao provided the illustrative examples and polished the language; Jose C. Principe was in charge of technical checking. All authors have read and approved the final manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Cover, T.M.; Thomas, J.A. Elements of Information Theory; Wiley: Hoboken, NJ, USA, 2012.
2. Renyi, A. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA, USA, 20 June–30 July 1961; pp. 547–561.
3. Tsallis, C. Possible generalization of Boltzmann-Gibbs statistics. J. Stat. Phys. 1988, 52, 479–487.
4. Principe, J.C. Information Theoretic Learning: Rényi’s Entropy and Kernel Perspectives; Springer: New York, NY, USA, 2010.
5. Campbell, L.L. Exponential entropy as a measure of extent of a distribution. Probab. Theory Relat. Fields 1966, 53, 217–225.
6. Chen, B.; Zhu, Y.; Hu, J.; Principe, J.C. System Parameter Identification: Information Criteria and Algorithms; Elsevier: Amsterdam, The Netherlands, 2013.
7. Gokcay, E.; Principe, J.C. Information theoretic clustering. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 158–171.
8. Erdogmus, D.; Principe, J.C. From linear adaptive filtering to nonlinear information processing. IEEE Signal Process. Mag. 2006, 23, 15–33.
9. Brown, G.; Pocock, A.; Zhao, M.; Luján, M. Conditional likelihood maximization: A unifying framework for information theoretic feature selection. J. Mach. Learn. Res. 2012, 13, 27–66.
10. Chen, B.; Zhu, P.; Principe, J.C. Survival information potential: A new criterion for adaptive system training. IEEE Trans. Signal Process. 2012, 60, 1184–1194.
11. Erdogmus, D.; Principe, J.C. An error-entropy minimization algorithm for supervised training of nonlinear adaptive systems. IEEE Trans. Signal Process. 2002, 50, 1780–1786.
12. Erdogmus, D.; Principe, J.C. Generalized information potential criterion for adaptive system training. IEEE Trans. Neural Netw. 2002, 13, 1035–1044.
13. Chen, B.; Hu, J.; Pu, L.; Sun, Z. Stochastic gradient algorithm under (h, φ)-entropy criterion. Circuits Syst. Signal Process. 2007, 26, 941–960.
14. Chen, B.; Zhu, Y.; Hu, J. Mean-square convergence analysis of ADALINE training with minimum error entropy criterion. IEEE Trans. Neural Netw. 2010, 21, 1168–1179.
15. Chen, B.; Principe, J.C. Some further results on the minimum error entropy estimation. Entropy 2012, 14, 966–977.
16. Chen, B.; Principe, J.C. On the Smoothed Minimum Error Entropy Criterion. Entropy 2012, 14, 2311–2323.
17. Chen, B.; Yuan, Z.; Zheng, N.; Príncipe, J.C. Kernel minimum error entropy algorithm. Neurocomputing 2013, 121, 160–169.
18. Kowal, R.R. Note: Disadvantages of the Generalized Variance as a Measure of Variability. Biometrics 1971, 27, 213–216.
19. Mustonen, S. A measure for total variability in multivariate normal distribution. Comput. Stat. Data Anal. 1997, 23, 321–334.
20. Peña, D.; Rodríguez, J. Descriptive measures of multivariate scatter and linear dependence. J. Multivar. Anal. 2003, 85, 361–374.
21. Wilks, S.S. Certain generalizations in the analysis of variance. Biometrika 1932, 24, 471–494.
22. Sokal, R.R. Statistical methods in systematics. Biol. Rev. 1965, 40, 337–389.
23. Goodman, M. Note: A Measure of “Overall Variability” in Populations. Biometrics 1968, 24, 189–192.
24. Barnett, L.; Barrett, A.B.; Seth, A.K. Granger causality and transfer entropy are equivalent for Gaussian variables. Phys. Rev. Lett. 2009, 103, 238701.
25. Barrett, A.B.; Barnett, L.; Seth, A.K. Multivariate Granger causality and generalized variance. Phys. Rev. E 2010, 81, 041907.
26. Golomb, S. The information generating function of a probability distribution. IEEE Trans. Inf. Theory 1966, 12, 75–77.
27. Bobkov, S.; Chistyakov, G.P. Entropy power inequality for the Renyi entropy. IEEE Trans. Inf. Theory 2015, 61, 708–714.
28. Kraskov, A.; Stogbauer, H.; Grassberger, P. Estimating Mutual Information. Phys. Rev. E 2004, 69, 066138.
29. Silverman, B.W. Density Estimation for Statistics and Data Analysis; Chapman & Hall: New York, NY, USA, 1986.
30. Rioul, O. Information theoretic proofs of entropy power inequalities. IEEE Trans. Inf. Theory 2011, 57, 33–55.
Figure 1. Smoothed entropies with different smoothing factors.
Figure 2. Scatter plot of the two-dimensional Gaussian samples.
Figure 3. Estimated entropy values with different smoothing factors.
Table 1. Renyi’s entropy based measures of variability and their equivalent variance based measures.

| Entropy Based Measure | Equivalent Variance Based Measure |
| --- | --- |
| Renyi’s joint entropy $H_\alpha(X)$ | Generalized variance $\lvert\Sigma\rvert$ |
| Renyi’s total marginal entropy $T_\alpha(X)$ | Geometric mean of marginal variances $\left(\prod_{i=1}^d \Sigma_{ii}\right)^{1/d}$ |
| Renyi’s joint entropy power $N_\alpha(X)$ | Generalized variance $\lvert\Sigma\rvert$ |
| Renyi’s total marginal entropy power $M_\alpha(X)$ | Total dispersion $\mathrm{Tr}(\Sigma)$ |
Table 2. Equivalent measures of the smoothed joint and total marginal entropies as $\lambda\to\infty$.

| Case | Smoothed Joint Entropy | Smoothed Total Marginal Entropy |
| --- | --- | --- |
| General case | $\mathrm{Tr}\!\left(J(Z)\Sigma_X\right)$ | $\sum_{i=1}^d J(Z_i)\,\sigma_{X_i}^2$ |
| $Z$ jointly Gaussian | $\mathrm{Tr}\!\left(\Sigma_Z^{-1}\Sigma_X\right)$ | $\sum_{i=1}^d \sigma_{X_i}^2/\sigma_{Z_i}^2$ |
| $Z$ with independent entries | $\sum_{i=1}^d J(Z_i)\,\sigma_{X_i}^2$ | $\sum_{i=1}^d J(Z_i)\,\sigma_{X_i}^2$ |
| $Z$ with i.i.d. entries | $\sum_{i=1}^d \sigma_{X_i}^2$ | $\sum_{i=1}^d \sigma_{X_i}^2$ |
