Entropy 2013, 15(12), 5154-5177; doi:10.3390/e15125154
Article
Non–Parametric Estimation of Mutual Information through the Entropy of the Linkage
Department of Mathematics, University of Torino, Via Carlo Alberto 10, Torino 10123, Italy
* Authors to whom correspondence should be addressed.
Received: 25 September 2013; in revised form: 30 October 2013 / Accepted: 11 November 2013 / Published: 26 November 2013
Abstract
A new, non–parametric and binless estimator for the mutual information of a d–dimensional random vector is proposed. First of all, an equation that links the mutual information to the entropy of a suitable random vector with uniformly distributed components is deduced. When d = 2, this equation reduces to the well-known connection between mutual information and the entropy of the copula function associated with the original random variables. Hence, the problem of estimating the mutual information of the original random vector is reduced to the estimation of the entropy of a random vector obtained through a multidimensional transformation. The estimator we propose is a two–step method: first estimate the transformation and obtain the transformed sample, then estimate its entropy. The properties of the new estimator are discussed through simulation examples and its performance is compared to that of the best estimators in the literature. The precision of the estimator converges to values of the same order of magnitude as those of the best estimator tested. However, the new estimator is unbiased even for larger dimensions and smaller sample sizes, while the other tested estimators show a bias in these cases.
Keywords: information measures; mutual information; entropy; copula function; linkage function; kernel method; binless estimator
MSC Classification: 62G05; 94A17
1. Introduction
The measure of multivariate association, that is, of the association between groups of components of a general d-dimensional random vector X, is a topic of increasing interest in a series of application contexts. Among the information measures, the mutual information, a special case of relative entropy or Kullback-Leibler distance [1], is a quantity that measures the mutual dependence between the variables considered. Unlike measures of linear dependence between random variables, such as the correlation coefficient, the mutual information is particularly interesting as it is sensitive also to dependencies that are not captured by the covariance. Although it is one of the most popular dependence measures, it is only one among many; see, for example, [2].
The extension of mutual information to d–dimensional random vectors is not trivial [3,4,5]. Considering the distance between the joint distribution of the random vector and the joint distribution of the random vector with independent univariate components gives the so-called total mutual information. However, in many instances, it is more interesting to study the distance between groups of components of the random vector, which are again multivariate random vectors of different dimensions. Hence the mutual information, in its general d–dimensional definition, is a distance between the joint distribution of the random vector and the joint distribution obtained when groups of components are taken to be independent of each other. We consider here this general framework with the aim of introducing a good estimator for the multivariate mutual information.
The problem of statistical estimation of mutual information has been considered by various authors. Optimized estimators that use adaptive bin sizes have been developed in [6,7]. A very good estimator based on k–nearest neighbor statistics is proposed in [8]. A computationally efficient modification of this method appeared recently in [9]. The estimation of mutual information sits within the broader context of the estimation of information–type measures such as entropy, Kullback–Leibler distance, divergence functionals and Rényi entropy. The k–nearest neighbor statistics are used to build an estimator for entropy in [10] and for the more general Rényi entropy in [11,12]. An overview of nonparametric entropy estimators can be found in [13]. For the case of discrete random variables, see for example [14]. A variational characterization of divergences allows the estimation of the Kullback–Leibler divergence (and more generally any f–divergence) to be turned into a convex optimization problem [15,16,17]. An interesting application of information–type quantities to devise a measure of statistical dispersion can be found in [18].
We propose here a new and simple estimator for the mutual information in its general multidimensional definition. To accomplish this aim, we first deduce an equation that links the mutual information between groups of components of a d-dimensional random vector to the entropy of the so-called linkage function [19], which reduces to the copula function [20] in dimension d = 2. In this way the problem of estimating mutual information is reduced to the estimation of the entropy of a suitably transformed sample of the same dimension as the original random vector. This topic is hence closely related to the estimation of the Shannon entropy.
The structure of the paper is as follows: notation and mathematical background are introduced in Section 2, where a brief survey of copula functions and linkage functions is also provided. In Section 3 we present the proposed method to estimate the mutual information, together with some details on the resulting algorithm and on the care required in its implementation. Section 4 contains some examples where the estimator is applied to simulated data and its performance is compared with that of other estimators in the literature. Conclusions are drawn in Section 5.
2. Notations and Mathematical Background
Given a d-dimensional random vector $X = (X_1,\dots,X_d)$, let us denote by $F(x_1,\dots,x_d)$ the joint cumulative distribution function (c.d.f.) and by $F_i(x_i)$, for $i = 1,\dots,d$, the marginal c.d.f. of the $i$-th component. When X is absolutely continuous, $f(x_1,\dots,x_d)$ is the joint probability density function (p.d.f.) and $f_i(x_i)$, for $i = 1,\dots,d$, the corresponding marginal p.d.f. of the $i$-th component. Furthermore, $F_{i_{k+1}|i_1,\dots,i_k}(x_{i_{k+1}}\,|\,x_{i_1},\dots,x_{i_k})$ and $f_{i_{k+1}|i_1,\dots,i_k}(x_{i_{k+1}}\,|\,x_{i_1},\dots,x_{i_k})$, where each $i_j$ assumes a different value ranging from 1 to d and $k < d$, denote respectively the conditional c.d.f. and the conditional p.d.f. of the variable $X_{i_{k+1}}$ with respect to the variables $X_{i_1},\dots,X_{i_k}$. For the sake of brevity, when $\mathbf{i} = (i_1,\dots,i_n)$ is any multi-index of length n we use the following notation: $F_{\mathbf{i}}$ denotes the c.d.f. of the random vector $X_{\mathbf{i}} = (X_{i_1},\dots,X_{i_n})$ and $f_{\mathbf{i}}$ denotes the p.d.f. of $X_{\mathbf{i}}$. Furthermore, when not necessary, we omit the argument of the functions, i.e., we may write $F$ instead of $F(x_1,\dots,x_d)$.
The mutual information (MI) of a 2-dimensional random vector $X = (X_1, X_2)$ is given by
$$ MI(X_1, X_2) = \int\!\!\int f(x_1, x_2) \log \frac{f(x_1, x_2)}{f_1(x_1)\, f_2(x_2)}\, dx_1\, dx_2 \qquad (1) $$
If $X_1$ and $X_2$ are independent, $MI(X_1, X_2) = 0$.
The relative entropy (or Kullback-Leibler distance) $D$ between the p.d.f.'s $f$ and $g$ is defined by
$$ D(f \,\|\, g) = \int f(x) \log \frac{f(x)}{g(x)}\, dx $$
Hence, MI can be interpreted as the Kullback-Leibler distance of the joint p.d.f. $f$ from the p.d.f. $f_1 f_2$ of a vector with independent components, i.e., it is a measure of the distance between $X$ and the random vector with the same marginal distributions but independent components.
The differential entropy of the absolutely continuous random vector $X$ is defined as
$$ H(X) = -\int f(x) \log f(x)\, dx \qquad (3) $$
MI and entropy in the case $d = 2$ are related through the well-known equation [1]
$$ MI(X_1, X_2) = H(X_1) + H(X_2) - H(X_1, X_2) \qquad (4) $$
The generalization of MI to more than two random variables is not unique [3]. Indeed, different definitions can be given according to different groupings of the components of the random vector $X$. More precisely, for any couple of multi-indices $\mathbf{i} = (i_1,\dots,i_h)$ and $\mathbf{j} = (j_1,\dots,j_k)$ of dimensions h and k respectively, with $h + k = d$ and partitioning the set of indices $\{1,\dots,d\}$, the $MI_{\mathbf{i},\mathbf{j}}(X)$ can be defined as the $MI(X_{\mathbf{i}}, X_{\mathbf{j}})$, where $X_{\mathbf{i}} = (X_{i_1},\dots,X_{i_h})$ and $X_{\mathbf{j}} = (X_{j_1},\dots,X_{j_k})$. Therefore we have
$$ MI_{\mathbf{i},\mathbf{j}}(X) = \int f(x_1,\dots,x_d) \log \frac{f(x_1,\dots,x_d)}{f_{\mathbf{i}}(x_{\mathbf{i}})\, f_{\mathbf{j}}(x_{\mathbf{j}})}\, dx_1 \cdots dx_d $$
More generally, for any n multi-indices $\mathbf{i}_1,\dots,\mathbf{i}_n$ of dimensions $d_1,\dots,d_n$ respectively, such that $\sum_{j=1}^{n} d_j = d$ and partitioning the set of indices $\{1,\dots,d\}$, the following quantities
$$ MI_{\mathbf{i}_1,\dots,\mathbf{i}_n}(X) = \int f(x_1,\dots,x_d) \log \frac{f(x_1,\dots,x_d)}{\prod_{j=1}^{n} f_{\mathbf{i}_j}(x_{\mathbf{i}_j})}\, dx_1 \cdots dx_d \qquad (6) $$
are all d–dimensional extensions of Equation (1). In the particular case where all the multi-indices have dimension 1, Equation (6) gives the so-called total MI, a measure of the distance between the distribution of the given random vector and the one with the same marginal distributions but mutually independent components.
Equation (6) can be rewritten as
$$ MI_{\mathbf{i}_1,\dots,\mathbf{i}_n}(X) = \int f(x_1,\dots,x_d) \log f(x_1,\dots,x_d)\, dx_1 \cdots dx_d - \sum_{j=1}^{n} \int f(x_1,\dots,x_d) \log f_{\mathbf{i}_j}(x_{\mathbf{i}_j})\, dx_1 \cdots dx_d $$
and then, integrating each term, it is easy to prove the following generalization to the d–dimensional case of Equation (4)
$$ MI_{\mathbf{i}_1,\dots,\mathbf{i}_n}(X) = \sum_{j=1}^{n} H(X_{\mathbf{i}_j}) - H(X) \qquad (8) $$
Another approach to the study of dependencies between random variables is given by copula functions (see, for example, [20]). A d-dimensional copula (or d-copula) is a function $C : [0,1]^d \to [0,1]$ with the following properties:
- (1)
- for every $\mathbf{u} = (u_1,\dots,u_d) \in [0,1]^d$, $C(\mathbf{u}) = 0$ if at least one coordinate is null and $C(\mathbf{u}) = u_k$ if all coordinates are 1 except $u_k$;
- (2)
- for every $\mathbf{a}, \mathbf{b} \in [0,1]^d$ such that $\mathbf{a} \le \mathbf{b}$, $V_C([\mathbf{a},\mathbf{b}]) \ge 0$.
Here $V_C([\mathbf{a},\mathbf{b}])$ is the so-called C-volume of $[\mathbf{a},\mathbf{b}]$, i.e., the d-th order difference of C on $[\mathbf{a},\mathbf{b}]$
$$ V_C([\mathbf{a},\mathbf{b}]) = \Delta^{b_d}_{a_d} \Delta^{b_{d-1}}_{a_{d-1}} \cdots \Delta^{b_1}_{a_1}\, C(\mathbf{u}) $$
where
$$ \Delta^{b_k}_{a_k}\, C(\mathbf{u}) = C(u_1,\dots,u_{k-1}, b_k, u_{k+1},\dots,u_d) - C(u_1,\dots,u_{k-1}, a_k, u_{k+1},\dots,u_d) $$
Hence, a copula function is a non–decreasing function in each argument.
A central result in the theory of copulas is the so-called Sklar's theorem, see [20]. It captures the role that copulas play in the relationship between joint c.d.f.'s and their corresponding marginal univariate distributions. In particular, it states that for any d-dimensional c.d.f. F of the random vector $X = (X_1,\dots,X_d)$ there exists a copula C such that for all $(x_1,\dots,x_d)$
$$ F(x_1,\dots,x_d) = C\big(F_1(x_1),\dots,F_d(x_d)\big) \qquad (11) $$
where $F_1,\dots,F_d$ are the univariate margins. If the margins are continuous, then the copula C is uniquely determined. Otherwise, C is uniquely determined over $\mathrm{Ran}\,F_1 \times \cdots \times \mathrm{Ran}\,F_d$, where $\mathrm{Ran}\,F_i$ is the range of the function $F_i$. Conversely, if C is a copula and $F_1,\dots,F_d$ are one-dimensional distribution functions, then the function F defined in Equation (11) is a d-dimensional distribution function with margins $F_1,\dots,F_d$.
In particular, a copula C can be interpreted as the joint c.d.f. of the random vector $U = (U_1,\dots,U_d)$, where each component is obtained through the transformation
$$ U_i = F_i(X_i), \quad i = 1,\dots,d \qquad (12) $$
Indeed
$$ C(u_1,\dots,u_d) = P(U_1 \le u_1,\dots,U_d \le u_d) = F\big(F_1^{-1}(u_1),\dots,F_d^{-1}(u_d)\big) \qquad (13) $$
Moreover, as each component is obtained with Equation (12), the univariate marginal distributions of the random vector U are all uniform on $[0,1]$. An important property that is tightly bound to MI is that copula functions are invariant under strictly increasing transformations of the margins, see [20].
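As a concrete illustration (ours, not part of the original derivation), the following minimal Python sketch applies the transformation of Equation (12) to a sample, using the empirical c.d.f. of each margin as a stand-in for the unknown $F_i$; the function name `empirical_pit` is hypothetical.

```python
import numpy as np

def empirical_pit(x):
    """Probability-integral transform of a univariate sample, cf. Equation (12):
    each observation is replaced by its rescaled rank, an estimate of F_i(x)."""
    n = len(x)
    ranks = np.argsort(np.argsort(x)) + 1          # ranks 1..n
    return ranks / (n + 1.0)                       # keep values strictly inside (0, 1)

# toy usage: two dependent components; after the transform both margins are ~ Uniform(0,1)
rng = np.random.default_rng(0)
x1 = rng.normal(size=1000)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=1000)
u = np.column_stack([empirical_pit(x1), empirical_pit(x2)])
```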
If a copula C is differentiable, the joint p.d.f. of the random vector X can be written as
$$ f(x_1,\dots,x_d) = c\big(F_1(x_1),\dots,F_d(x_d)\big) \prod_{i=1}^{d} f_i(x_i) \qquad (14) $$
where
$$ c(u_1,\dots,u_d) = \frac{\partial^d C(u_1,\dots,u_d)}{\partial u_1 \cdots \partial u_d} \qquad (15) $$
is the so-called density of the copula C.
As illustrated in [21], it is not possible to use copula functions to handle multivariate distributions with given marginal distributions of general dimensions, since the only copula that is compatible with any assigned multidimensional marginal distributions is the independent one. To study dependencies between multidimensional random variables we resort to a generalization of the notion of copula. It relies on the use of the so-called linkage functions introduced in [19].
Let us consider the d-dimensional random vector X and any n multi-indices $\mathbf{i}_1,\dots,\mathbf{i}_n$ of dimensions $d_1,\dots,d_n$ respectively, such that $\sum_{j=1}^{n} d_j = d$ and partitioning the set $\{1,\dots,d\}$. For $j = 1,\dots,n$, let $F_{\mathbf{i}_j}$ be the $d_j$-dimensional c.d.f. of $X_{\mathbf{i}_j} = (X_{i_{j,1}},\dots,X_{i_{j,d_j}})$ and let F be the joint c.d.f. of $(X_{\mathbf{i}_1},\dots,X_{\mathbf{i}_n})$, which is of dimension d as the multi-indices are partitioning the set $\{1,\dots,d\}$.
Define the transformation $\Psi_j : \mathbb{R}^{d_j} \to [0,1]^{d_j}$ as
$$ \Psi_j(x_{i_{j,1}},\dots,x_{i_{j,d_j}}) = \Big( F_{i_{j,1}}(x_{i_{j,1}}),\; F_{i_{j,2}|i_{j,1}}(x_{i_{j,2}}\,|\,x_{i_{j,1}}),\; \dots,\; F_{i_{j,d_j}|i_{j,1},\dots,i_{j,d_j-1}}(x_{i_{j,d_j}}\,|\,x_{i_{j,1}},\dots,x_{i_{j,d_j-1}}) \Big) $$
for all $(x_{i_{j,1}},\dots,x_{i_{j,d_j}})$ in the range of $X_{\mathbf{i}_j}$.
Then the vectors
$$ U_{\mathbf{i}_j} = \Psi_j(X_{\mathbf{i}_j}), \quad j = 1,\dots,n \qquad (16) $$
are $d_j$-dimensional vectors of independent uniform random variables, see [19].
The linkage function corresponding to the d-dimensional random vector X is defined as the joint p.d.f. L of the vector $U = (U_{\mathbf{i}_1},\dots,U_{\mathbf{i}_n})$.
Notice that for $d_1 = \cdots = d_n = 1$ (in particular, for $d = 2$) the linkage function reduces to the copula function. Analogously to the two–dimensional case, linkages are invariant under strictly increasing functions, that is, under d–dimensional functions whose components are strictly increasing real univariate functions, see Theorem 3.4 in [19].
The information regarding the dependence properties between the multivariate components of the random vector is contained in the linkage function, which is independent of the marginal c.d.f.'s. Linkage functions can then be successfully employed when one is interested in studying the dependence properties between the random vectors while disregarding their single components. On the other hand, linkage functions allow one to study the interrelationships not only between all the components of a random vector, but also between given chosen sets of non-overlapping random components of the vector. They contain the information regarding the dependence structure among such marginal vectors, while the dependence structure within the marginal vectors is not considered explicitly. It must be underlined that different multivariate c.d.f.'s can have the same linkage function but different marginal distributions.
In the next Section we will discuss the use of linkage functions for the computation of the MI between random vectors.
3. The Method
We propose here a method to estimate the MI of a d-dimensional random vector, as defined in Equation (6), by means of a random sample drawn from the corresponding d-dimensional joint distribution. We assume that neither the joint c.d.f. nor the marginal c.d.f.'s of the components of the random vector are known. The estimation approach proposed is then completely non-parametric.
3.1. MI of a d–Dimensional Random Vector and Entropy of the Linkage
The method we are presenting is based on the equation, deduced in the next theorem, that links the MI of a d–dimensional random vector to the entropy of the associated linkage.
Theorem 1. Let $X = (X_1,\dots,X_d)$ be a d–dimensional random vector. For any n multi-indices $\mathbf{i}_1,\dots,\mathbf{i}_n$ of dimensions $d_1,\dots,d_n$ respectively, such that $\sum_{j=1}^{n} d_j = d$ and partitioning the set of indices $\{1,\dots,d\}$, it holds
$$ MI_{\mathbf{i}_1,\dots,\mathbf{i}_n}(X) = -H(U_{\mathbf{i}_1},\dots,U_{\mathbf{i}_n}) $$
where $U = (U_{\mathbf{i}_1},\dots,U_{\mathbf{i}_n})$ and the functions $\Psi_j$ are defined in Equation (16).
Proof. According to Equation (16), the random vector U is obtained from X by means of the following transformations:
where the elements of the vector are grouped according to the multi–indices . The corresponding Jacobian matrix is given as follows
It is evident that the Jacobian matrix is a block diagonal matrix having lower triangular matrices as its main diagonal blocks. Its determinant can therefore be calculated as the product of the determinants of the main diagonal blocks, and hence
Recalling the transformations given in Equation (20) we have
where the second-to-last equality is obtained by using iteratively the definition of conditional probability density function, i.e., $f_{Y,Z} = f_{Y|Z}\, f_Z$.
As the random variables $U_{\mathbf{i}_1},\dots,U_{\mathbf{i}_n}$ are obtained as the transformation of the vector X given in Equation (20), their joint p.d.f. can be deduced as
where $f$ is the joint p.d.f. of X and the Jacobian determinant is given in Equation (22). The last equation can be further simplified by renaming the transformed variables, and we get
From the d–dimensional definition of mutual information given in Equation (6) and applying the change of variables given in Equation (20) we have
Just rewriting the last line of the previous equation using Equation (3), we get the thesis
□
Theorem 1 proves that the MI of a random vector can be computed without resorting to the marginal distributions, but rather by transforming the components by means of Equation (20) into one-dimensional uniform random variables and then computing the Shannon entropy of the resulting vector.
Remark: Linkage functions do not change under monotone transformations. This property reflects the well-known invariance property of MI under reparametrization.
In the particular case $d = 2$ the definition of mutual information is simpler, see Equation (1), and the linkage function reduces to the well-known copula function, as shown in the following
Corollary 1. The MI of the 2-dimensional random vector $X = (X_1, X_2)$ can be obtained as
$$ MI(X_1, X_2) = -H(U_1, U_2) $$
where $U_1 = F_1(X_1)$ and $U_2 = F_2(X_2)$.
Proof. For $d = 2$ the only non-trivial choice of multi-indices partitioning the set $\{1,2\}$ is $\mathbf{i}_1 = (1)$ and $\mathbf{i}_2 = (2)$, hence $n = 2$ and $d_1 = d_2 = 1$. The transformations given in Equation (20) reduce to Equation (12) for $i = 1, 2$.
Let us notice that the joint c.d.f. of the couple $(U_1, U_2)$ is the copula function C, see Equation (13). Hence their joint p.d.f. is obtained as
where $c$ is the copula density defined in Equation (15). In this special case Equation (22) reduces to
and the definition given in Equation (1) can be rewritten as
Hence, the MI between two one-dimensional random variables can be computed without resorting to the marginal distributions of the two variables, but by transforming them into the corresponding one-dimensional uniform variables according to Equation (12) and then computing their Shannon entropy. Equation (26) can be equivalently rewritten as
where $E_C$ denotes the expectation with respect to the copula distribution. Note that the result concerning the two-dimensional case has already been obtained in [22,23], and one of the more impactful studies exploiting this equality is [24].
Example. In the particular case where $d = 3$ and the multi-indices are chosen such that $\mathbf{i}_1 = (1,2)$ and $\mathbf{i}_2 = (3)$ we have
where
Let us remark that the choice of the multi-indices grouping the elements of the random vector X uniquely determines the transformation given in Equation (32) required to properly obtain the random vector U such that Equation (31) holds true. Different choices of multi-indices give different transformations and different random vectors U.
3.2. The Estimation Procedure
Let $\mathbf{x}^{(1)},\dots,\mathbf{x}^{(N)}$ be a random sample of size N drawn from the multivariate distribution of the random vector X. Using Theorem 1, we propose here a method to estimate the mutual information of the random vector X in its general definition, i.e., the $MI_{\mathbf{i}_1,\dots,\mathbf{i}_n}(X)$ of Equation (6), for any n multi-indices $\mathbf{i}_1,\dots,\mathbf{i}_n$ of dimensions $d_1,\dots,d_n$ respectively, such that $\sum_{j=1}^{n} d_j = d$ and partitioning the set $\{1,\dots,d\}$.
Let us proceed as follows:
- 1. estimate from the sample the univariate and conditional c.d.f.'s appearing in the transformations of Equation (20);
- 2. apply the estimated transformations to the sample, obtaining a transformed sample whose components are (approximately) uniformly distributed on $[0,1]$;
- 3. estimate the differential entropy of the transformed sample; according to Theorem 1, the MI estimate is obtained by changing the sign of the estimated entropy.
For the particular case $d = 2$ the procedure becomes the following:
- 1. estimate from the sample the marginal c.d.f.'s $F_1$ and $F_2$;
- 2. transform the sample through Equation (12), obtaining a sample of the couple $(U_1, U_2)$;
- 3. estimate the differential entropy of the transformed sample and change its sign, according to Corollary 1.
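To fix ideas, here is a self-contained sketch (ours, with hypothetical function names) of the two-step procedure for $d = 2$: the margins are estimated by the empirical c.d.f., the entropy of the transformed sample by a nearest-neighbor estimator of the type discussed below (cf. Equation (38)), and the MI estimate is minus that entropy, as in Corollary 1.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import gamma as gamma_fn

def rank_transform(x):
    """Steps 1-2 for d = 2: empirical-c.d.f. transform of each column to (0, 1)."""
    n = len(x)
    return (np.argsort(np.argsort(x, axis=0), axis=0) + 1) / (n + 1.0)

def knn_entropy(sample):
    """Kozachenko-Leonenko nearest-neighbor entropy estimate (in nats), cf. Equation (38)."""
    n, d = sample.shape
    rho = cKDTree(sample).query(sample, k=2)[0][:, 1]   # distance to the nearest neighbor
    v_d = np.pi ** (d / 2) / gamma_fn(d / 2 + 1)        # volume of the unit d-ball
    return d * np.mean(np.log(rho)) + np.log(v_d) + np.log(n - 1) + np.euler_gamma

def mi_linkage_2d(x):
    """Step 3: estimate MI(X1, X2) as minus the entropy of the transformed sample."""
    return -knn_entropy(rank_transform(x))

# toy usage: bivariate gaussian with correlation 0.9; exact MI = -0.5*log(0.19), about 0.83 nats
rng = np.random.default_rng(1)
x1 = rng.normal(size=5000)
x2 = 0.9 * x1 + np.sqrt(1 - 0.81) * rng.normal(size=5000)
print(mi_linkage_2d(np.column_stack([x1, x2])))
```

In the paper the empirical c.d.f. is replaced by a kernel estimate and, for $d > 2$, conditional c.d.f.'s enter the transformation; the sketch only conveys the structure of the algorithm.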
Let us now go into the details of the procedure illustrated above. The performance of the proposed method strongly depends on the estimators used for the c.d.f.'s and conditional c.d.f.'s at step 1 and for the entropy at step 3 of the proposed algorithm.
Step 1 of the procedure: estimating c.d.f.'s. Let us compare the performance of the estimation algorithm for three different choices of the estimator for the c.d.f.'s, namely the empirical distribution function, the kernel density estimator and the k–nearest neighbor density estimator.
The most natural estimator for the c.d.f. is the empirical distribution function, which for a univariate random sample $x_1,\dots,x_N$ of size N is defined as
$$ \hat F_N(x) = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}_{\{x_i \le x\}} \qquad (33) $$
Despite its good asymptotic properties, the empirical distribution function exhibits a slow rate of convergence; see for example [25], where it is shown that the empirical distribution function is a deficient estimator with respect to a certain class of kernel type estimators. It could then be necessary to resort to more reliable techniques to estimate the c.d.f.'s.
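For completeness, a minimal implementation (ours) of the empirical distribution function of Equation (33), returned as a callable:

```python
import numpy as np

def ecdf(sample):
    """Empirical distribution function of Equation (33), returned as a function of x."""
    xs = np.sort(np.asarray(sample, dtype=float))
    n = len(xs)
    def F_hat(x):
        # proportion of sample points less than or equal to x
        return np.searchsorted(xs, x, side="right") / n
    return F_hat

# usage: for an Exponential(1) sample, F_hat(1.0) should be close to 1 - exp(-1), about 0.632
F_hat = ecdf(np.random.default_rng(2).exponential(size=500))
print(F_hat(1.0))
```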
As suggested in [25], we could consider kernel-based density estimators, both to estimate the univariate c.d.f.'s and the conditional c.d.f.'s, where the conditioning random variable could be multivariate, see Equation (20). Let us recall that, given a random sample $x_1,\dots,x_N$ of size N drawn from a law with p.d.f. f, the univariate kernel density estimator for f is obtained as
$$ \hat f_h(x) = \frac{1}{Nh} \sum_{i=1}^{N} K\!\left(\frac{x - x_i}{h}\right) \qquad (34) $$
where K is a suitable function called a kernel, which is required to be a normalized probability density, and h is a positive number called the bandwidth, see [26,27]. One of the most commonly used kernels is the Gaussian kernel
$$ K(u) = \frac{1}{\sqrt{2\pi}}\, e^{-u^2/2} $$
However, depending on the properties of the density to be estimated, it can be convenient to consider different kernels. For example, the rectangular kernel and the Gamma kernel can be recommended to estimate the uniform and the exponential distributions respectively. The bandwidth h has to be chosen carefully, as too small values can give rise to spurious fine structures in the estimate, while too large values can hide all the details of the estimated density curve. The choice of a possible optimal bandwidth usually falls on the one that minimizes the mean integrated square error. When it is assumed that the underlying distribution is Gaussian, the optimal bandwidth is given by
$$ h_{opt} = \left(\frac{4}{3N}\right)^{1/5} \sigma \approx 1.06\, \sigma\, N^{-1/5} $$
where σ denotes the standard deviation of the sample values, see [26]. Let us remark that, having no parametric hypothesis on the distribution of the random vector considered, in our method the choice of the kernel function and of the bandwidth should be made by visual inspection of the histograms of the involved distributions.
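One way (not necessarily the authors' exact implementation) to obtain a smooth c.d.f. estimate from Equation (34) is to integrate it; with the Gaussian kernel this amounts to averaging normal c.d.f.'s. A minimal sketch (ours), using the rule-of-thumb bandwidth quoted above:

```python
import numpy as np
from scipy.stats import norm

def kernel_cdf(sample, bandwidth=None):
    """Gaussian-kernel estimate of a univariate c.d.f.: the integral of Equation (34)."""
    x_i = np.asarray(sample, dtype=float)
    n = len(x_i)
    if bandwidth is None:
        # rule-of-thumb bandwidth, optimal under a gaussian reference distribution
        bandwidth = 1.06 * np.std(x_i, ddof=1) * n ** (-1 / 5)
    def F_hat(x):
        z = (np.asarray(x, dtype=float)[..., None] - x_i) / bandwidth
        return norm.cdf(z).mean(axis=-1)
    return F_hat

# usage: smoothed c.d.f. of a gaussian sample, evaluated on a small grid
F_hat = kernel_cdf(np.random.default_rng(3).normal(size=400))
print(F_hat([-1.0, 0.0, 1.0]))   # roughly [0.16, 0.5, 0.84]
```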
A third choice to estimate the c.d.f.'s could rely on the nearest neighbor density estimator, see [27,28]. Let us recall that, given a random sample of size N drawn from a law with p.d.f. f, the nearest neighbor density estimator for f is obtained as
$$ \hat f(x) = \frac{k}{N\, V} \qquad (37) $$
where k is fixed and V is the volume of the sphere centered at x that contains precisely the k data points nearest to x. A possible generalization of the nearest neighbor estimator yields again an estimator of the kernel type, as in Equation (34) but with bandwidth h equal to the distance between x and its k-th nearest neighbor. As the generalized nearest neighbor estimator is actually a kernel estimator with a suitable choice of the bandwidth parameter, we call nearest neighbor density estimator only the simple one given in Equation (37).
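A minimal sketch (ours) of the simple k-nearest-neighbor density estimator of Equation (37) for univariate data:

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_density(sample, k=10):
    """k-nearest-neighbor density estimate of Equation (37): f(x) ~ k / (N V),
    with V the length of the interval around x reaching its k-th nearest sample point."""
    pts = np.asarray(sample, dtype=float).reshape(-1, 1)
    n = len(pts)
    tree = cKDTree(pts)
    def f_hat(x):
        r = tree.query(np.atleast_1d(x).reshape(-1, 1), k=k)[0][:, -1]   # k-th NN distance
        return k / (n * 2.0 * r)                                         # in 1-d, V = 2r
    return f_hat

# usage: at the mode of a standard gaussian the density is 1/sqrt(2*pi), about 0.399
f_hat = knn_density(np.random.default_rng(4).normal(size=2000), k=30)
print(f_hat([0.0]))
```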
For the estimation of conditional densities, some papers propose methods that improve upon the naive kernel estimator obtained as the ratio of the multidimensional kernel estimators for the joint p.d.f. and the joint p.d.f. of the conditioning variables, see [29,30,31]. In this paper we use the estimator proposed in [30] and subsequently improved in [31], based on a locally polynomial regression.
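For reference only, here is a sketch (ours) of the naive kernel estimator of a conditional c.d.f. mentioned above, i.e., a kernel-weighted proportion of observations below the threshold; the locally polynomial estimator of [30,31] actually used in the paper refines this idea and is not reproduced here.

```python
import numpy as np

def naive_conditional_cdf(x_cond, y, bandwidth):
    """Naive kernel estimate of F(y0 | x0) = P(Y <= y0 | X = x0):
    a Gaussian-kernel weighted proportion of the observed y_i not exceeding y0."""
    x_cond = np.asarray(x_cond, dtype=float).reshape(len(y), -1)   # (N, p) conditioning values
    y = np.asarray(y, dtype=float)
    def F_hat(y0, x0):
        d2 = np.sum((x_cond - np.asarray(x0, dtype=float)) ** 2, axis=1)
        w = np.exp(-0.5 * d2 / bandwidth ** 2)                     # kernel weights around x0
        return np.sum(w * (y <= y0)) / np.sum(w)
    return F_hat

# usage: here X2 | X1 = 1 is N(0.7, 0.51), so F(0 | 1) should be close to 0.16
rng = np.random.default_rng(5)
x1 = rng.normal(size=2000)
x2 = 0.7 * x1 + np.sqrt(1 - 0.49) * rng.normal(size=2000)
F_hat = naive_conditional_cdf(x1, x2, bandwidth=0.3)
print(F_hat(0.0, 1.0))
```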
The performances of our estimator for the three different choices of c.d.f. estimation are compared in Figure 1. The samples are drawn from a bivariate gaussian distribution with standard marginal distributions and a fixed correlation coefficient, for different sample sizes N. The accuracy of the estimation increases with increasing sample size, but the best performance is definitely obtained with the kernel density estimator (black).
Figure 1.
95% confidence intervals of the estimated MI of a bivariate gaussian distribution as the sample size N increases. True value of MI = 1.1979 bits. Cumulative distribution functions are estimated by the empirical distribution function given in Equation (33) (blue), the kernel density estimator given in Equation (34) (black) and the nearest neighbor density estimator given in Equation (37) (red).
Step 3 of the procedure: estimating entropy. The second important point that requires some care is the estimation of the differential entropy of the transformed sample at step 3 of our method. In this case too, the estimation can be performed using a kernel type estimator or a nearest neighbor one.
The unbinned nearest-neighbor method proposed in [10,32] seems to have good properties. It is based on nearest-neighbor Euclidean distances and it has been shown to be asymptotically unbiased and consistent, provided that the underlying p.d.f. obeys certain conditions that control the convergence of the integrals for the differential entropy, see [10]. Given a random sample of size N from the d-dimensional random vector X, the estimator proposed for the differential entropy is given by
$$ \hat H = \frac{d}{N} \sum_{i=1}^{N} \ln \rho_i + \ln \frac{S_d\,(N-1)}{d} + \gamma \qquad (38) $$
where $\gamma \simeq 0.5772$ is the Euler-Mascheroni constant, $\rho_i$ is the Euclidean distance of each sample point to its nearest neighbor and
$$ S_d = \frac{2\,\pi^{d/2}}{\Gamma(d/2)} $$
with Γ the gamma function, is the area of a unit d-dimensional spherical surface (for example $S_2 = 2\pi$ and $S_3 = 4\pi$).
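A compact implementation (ours) of Equation (38), checked against the known entropy of a multivariate gaussian:

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import gamma as gamma_fn

def kl_entropy(sample):
    """Nearest-neighbor entropy estimate of Equation (38), in nats."""
    x = np.asarray(sample, dtype=float)
    n, d = x.shape
    rho = cKDTree(x).query(x, k=2)[0][:, 1]            # Euclidean distance to the nearest neighbor
    s_d = 2 * np.pi ** (d / 2) / gamma_fn(d / 2)       # area of the unit d-dimensional sphere
    return d * np.mean(np.log(rho)) + np.log(s_d * (n - 1) / d) + np.euler_gamma

# sanity check: a 3-d standard gaussian has H = 1.5 * log(2*pi*e), about 4.257 nats
z = np.random.default_rng(6).normal(size=(5000, 3))
print(kl_entropy(z))
```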
Another choice could be an estimator of the kernel type, see [13]. The so-called resubstitution estimate proposed in [13] is of the form
$$ \hat H = -\frac{1}{N} \sum_{i=1}^{N} \ln \hat f_N(x_i) \qquad (39) $$
where $\hat f_N$ is a kernel density estimate.
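A corresponding sketch (ours) of the resubstitution estimate of Equation (39), using the gaussian kernel density estimator of SciPy:

```python
import numpy as np
from scipy.stats import gaussian_kde

def resubstitution_entropy(sample):
    """Resubstitution entropy estimate of Equation (39): minus the average log of a
    kernel density estimate evaluated at the sample points themselves."""
    x = np.asarray(sample, dtype=float)
    kde = gaussian_kde(x.T)                  # gaussian_kde expects data with shape (d, N)
    return -np.mean(np.log(kde(x.T)))

# same sanity check as above: exact entropy of a 3-d standard gaussian is about 4.257 nats
z = np.random.default_rng(7).normal(size=(5000, 3))
print(resubstitution_entropy(z))
```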
Figure 2.
95% confidence intervals of the estimated MI of a bivariate gaussian distribution as the sample size N increases. True value of MI = 1.1979 bits. Entropy is estimated by the nearest neighbor given in Equation (38) (black) and the kernel type entropy estimator given in Equation (39) (green).
The performances of our estimator for the two different choices of entropy estimator are compared in Figure 2. Again, the samples are drawn from a bivariate gaussian distribution with standard marginal distributions and a fixed correlation coefficient, for different sample sizes N. The best performance is definitely obtained with the nearest neighbor entropy estimator (black).
Hence we come to the conclusion that the best results are obtained using a kernel type density estimator at step 1 of the method and a k–nearest neighbor entropy estimator at step 3 of the method. This is the procedure chosen for the examples proposed in the next Section.
4. Examples and Simulation Results
In this Section we show the results obtained with the estimator proposed in Section 3. The examples cover the cases $d = 2, 3, 4$. We choose cases for which the MI can be computed either in a closed form expression or numerically, so that we can easily compare our results with the exact ones.
Several methods have already been proposed in the literature to estimate the MI between multivariate random vectors from sample data. One of the most efficient seems to be the binless method presented in [8], based on estimates from k–nearest neighbor statistics. We refer from now on to this method as "KSG", from the names of the authors. On the other hand, a straightforward way to estimate MI could be to estimate separately the entropy terms in Equation (8) by means of a suitable estimator, such as the one proposed in [10,32], and sum them up. Let us notice that this procedure is not recommended, as the errors made in the individual estimates of each entropy term would presumably not cancel. We refer to this methodology as the "plain entropy" method. The results obtained with our estimator are here compared both to KSG and to "plain entropy".
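For concreteness, here is a sketch (ours) of the "plain entropy" baseline of Equation (8): each group entropy and the joint entropy are estimated separately with the nearest-neighbor estimator of Equation (38) and then combined.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import gamma as gamma_fn

def kl_entropy(x):
    """Nearest-neighbor (Kozachenko-Leonenko) entropy estimate in nats, cf. Equation (38)."""
    n, d = x.shape
    rho = cKDTree(x).query(x, k=2)[0][:, 1]
    v_d = np.pi ** (d / 2) / gamma_fn(d / 2 + 1)
    return d * np.mean(np.log(rho)) + np.log(v_d) + np.log(n - 1) + np.euler_gamma

def plain_entropy_mi(x, groups):
    """'Plain entropy' MI of Equation (8): sum of group entropies minus the joint entropy."""
    return sum(kl_entropy(x[:, g]) for g in groups) - kl_entropy(x)

# usage: MI between (X1, X2) and X3 for a correlated 3-d gaussian; exact value is about 0.095 nats
rng = np.random.default_rng(8)
cov = np.array([[1.0, 0.5, 0.3], [0.5, 1.0, 0.4], [0.3, 0.4, 1.0]])
x = rng.multivariate_normal(np.zeros(3), cov, size=5000)
print(plain_entropy_mi(x, groups=[[0, 1], [2]]))
```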
As mutual information is a particular case of Kullback-Leibler divergence, we could use the estimators proposed in [15,17]. However, let us remark that our estimator is based on one single sample of size N drawn from the multivariate joint law of the random vector X. On the other hand, the estimators proposed in [15,17], adapted to the case of mutual information, are based on two samples: one drawn from the multivariate joint law and the other from the product of the n marginal laws. We could build the missing sample from the given one, but it is not obvious that the estimators would keep their good properties. Indeed, in our setting the samples would no longer be independent, and some resampling procedure should be devised to make the method suitable for our particular case. A preliminary numerical study suggests that the performances of the estimators proposed in [15,17] are not comparable to the performance of our estimator. Hence we skip this comparison.
The examples illustrated in the next two Subsections are chosen to test our approach on distributions that could be troublesome for it. Following the discussion in the previous Section, the delicate point of the method is the estimation of the (conditional) cumulative distribution functions in Equation (20). Hence we consider an example where the density to be estimated has bounded support, as for the Uniform r.v. (Example 2), and an example where the density is positive valued and with an early steep peak, as for the Exponential r.v. (Example 3). To address the problem of tails in the estimation of information-theoretic quantities, we present two examples of heavy-tailed distributions: the lognormal (Example 4) and the Levy (Example 5). Note that the lognormal distribution is heavy tailed but with finite moments of all orders, while the Levy is an alpha-stable distribution with power law tail probability and no finite moments.
We rely on simulated data, while we reserve for future work the application of the estimator to data coming from real experiments. For each example we perform batches of simulations of the random vectors involved, for each of the sample sizes considered.
4.1. Two-Dimensional Vectors
Let us consider the following five cases of two-dimensional random vectors $X = (X_1, X_2)$.
- Example 1. Let $X = (X_1, X_2)$ be a gaussian random vector with standard normal components and correlation coefficient ρ. Denoting as σ the covariance matrix of X, the MI between $X_1$ and $X_2$ is given as
$$ MI(X_1, X_2) = -\frac{1}{2} \log \det \sigma = -\frac{1}{2} \log (1 - \rho^2) $$
The numerical results are illustrated in Figure 3. The horizontal line indicates the exact value of the MI. Both plots reveal that our estimator of the MI is centered around its true value. The interquartile range considerably decreases with increasing sample size, becoming minimal for the largest sample sizes. The comparison with the KSG and "plain entropy" estimators shows that our method performs comparably with KSG, except for very small sample sizes, where it has a larger dispersion, see the bottom right inset of the right panel of Figure 3. It even performs better for the highest sample sizes. Its variance quickly decreases to the values attained by the KSG estimator, which are considerably smaller than the variances of the "plain entropy" method. Indeed, the "plain entropy" estimator leads to poorer performances, such as a larger interquartile range and a less marked centering of the confidence interval around the true value.
Figure 3.
(Example 1) Left panel: Boxplots of the estimated MI grouped according to different sample sizes. Right panel: mean error as a function of (Sample Size). Insets to the right panel: 95% confidence interval for the estimated MI (bottom left) and standard deviations (bottom right) versus Sample Size. Color map: black and white for the estimator we propose in Section 3.2, red for KSG and blue for “plain entropy”.
- Example 2. Let $X_1$ and $X_2$ be linearly dependent random variables, related through the following equation
The numerical results are illustrated in Figure 4. Similar observations as for the previous example can be made in this case. The dispersion is clearly highly reduced as the sample size increases, and the estimated MI is always centered with respect to the true value. The "plain entropy" method again performs worse than the others, both in terms of dispersion and of centering on the true value. The standard deviations of the estimates obtained with our method are always lower than the ones obtained with the "plain entropy" method and comparable with those obtained by the KSG method, except for very small sample sizes.
Figure 4.
(Example 2) Left panel: Boxplots of the estimated MI grouped according to different sample sizes. Right panel: mean error as a function of (Sample Size). Insets to the right panel: 95% confidence interval for the estimated MI (bottom left) and standard deviations (bottom right) versus Sample Size. Color map: black and white for the estimator we propose in Section 3.2, red for KSG and blue for “plain entropy”.
- Example 3. Let $X_1$, $X_2$ have the joint c.d.f. given in the following equation
Both the boxplots and the confidence intervals in Figure 5 show that the values of MI obtained with our estimator are centered around the true value even for the smallest sample sizes. The comparison with the KSG and "plain entropy" methods in Figure 5, right panel, shows that the proposed method obtains results comparable with the former, apart only from the smallest sample sizes, and always performs better than the latter. The "plain entropy" method produces non-centered estimates for small sample sizes. The standard deviations behave as in the previous example.
Figure 5.
(Example 3) Left panel: Boxplots of the estimated MI grouped according to different sample sizes. Right panel: mean error as a function of (Sample Size). Insets to the right panel: 95% confidence interval for the estimated MI (bottom left) and standard deviations (bottom right) versus Sample Size. Color map: black and white for the estimator we propose in Section 3.2, red for KSG and blue for “plain entropy”.
- Example 4. Let $X = (X_1, X_2)$ be a bivariate lognormal random vector with zero mean and covariance matrix given by
The results are illustrated in Figure 6. The methods are applied to samples drawn from the joint distribution of X with no pre-processing. Our approach gives good results both in terms of centered confidence intervals and of standard deviations. Hence the method is robust to this kind of heavy-tailed distribution. The KSG estimator gives bad results for the smaller sample sizes, as the corresponding confidence intervals do not include the true value of MI. The "plain entropy" method seems to behave perfectly in this case.
Figure 6.
(Example 4) Left panel: Boxplots of the estimated MI grouped according to different sample sizes. Right panel: mean error as a function of (Sample Size). Insets to the right panel: 95% confidence interval for the estimated MI (bottom left) and standard deviations (bottom right) versus Sample Size. Color map: black and white for the estimator we propose in Section 3.2, red for KSG and blue for “plain entropy”.
- Example 5. Let $X = (X_1, X_2)$ be a random vector with Levy distributed margins, i.e., with marginal p.d.f. given by
All three methods fail to estimate the MI in this case. The results are totally unreliable and the errors are very large even for the larger sample sizes. Concerning our method, the problem lies in the estimation of the density of the margins when the tail has a strong power law behavior. In this very ill-suited example, the tails are so heavy that values very far from the bulk of the probability mass occur frequently, and the kernel estimate is poor as it would need too many points. The kernel method fails, and so does the estimation of MI.
However, we propose here to exploit the property of MI of being invariant under reparametrization of the margins. This trick allows us to get rid of the heavy-tail problems, at least in some cases. For example, when the random variables are positive valued, we can apply a suitable transformation of the values in order to improve the estimates. Such a transformation should be strictly monotone and differentiable and able to lighten and shorten the tails of the distribution, such as the logarithm function. In Figure 7 we illustrate the results obtained on samples drawn from the couple with Levy margins and pre-processed by applying a log-transformation to the values. Such a procedure does not affect the value to be estimated, as can be seen in the figure. All three methods now give correctly estimated values.
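A minimal sketch (ours) of this pre-processing step; the construction of the Levy-type margins as $1/Z^2$ with Z standard normal is our own illustration, not necessarily the simulation scheme used in the paper.

```python
import numpy as np

def preprocess_heavy_tails(x):
    """Log-transform each (positive-valued) margin before estimating MI.
    The transformation is strictly increasing, so the MI is left unchanged."""
    x = np.asarray(x, dtype=float)
    if np.any(x <= 0):
        raise ValueError("log pre-processing requires positive-valued margins")
    return np.log(x)

# usage: dependent components with Levy-type margins (1/Z^2 for standard normal Z)
rng = np.random.default_rng(9)
z1 = rng.normal(size=5000)
z2 = 0.8 * z1 + 0.6 * rng.normal(size=5000)
x = np.column_stack([1.0 / z1 ** 2, 1.0 / z2 ** 2])   # very heavy tails, no finite moments
x_light = preprocess_heavy_tails(x)                   # feed x_light (not x) to the MI estimators
```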
Figure 7.
(Example 5) Left panel: Boxplots of the estimated MI grouped according to different sample sizes. Right panel: mean error as a function of (Sample Size). Insets to the right panel: 95% confidence interval for the estimated MI (bottom left) and standard deviations (bottom right) versus Sample Size. Color map: black and white for the estimator we propose in Section 3.2, red for KSG and blue for “plain entropy”.
Hence the method is shown to be only partly robust to heavy-tailed distributions, as are KSG and "plain entropy". In particular, it succeeds when applied to distributions with tail behavior as in the lognormal case and fails on strongly power-law-tailed distributions such as the Levy one. However, in the latter case, our proposal is to take advantage of the invariance property of MI and to pre-process the data in order to reduce the importance of the tails. This comment holds for any dimension d, as for the examples shown in the next Sections.
4.2. Three-Dimensional Vectors
We tested our estimator also in the case $d = 3$, considering the multi-indices given in Equation (31). Let $X = (X_1, X_2, X_3)$ be a gaussian random vector with standard normal components and pairwise correlation coefficients $\rho_{ij}$. The exact MI can be evaluated by means of the following equation
$$ MI_{\mathbf{i}_1,\dots,\mathbf{i}_n}(X) = \frac{1}{2} \log \frac{\prod_{j=1}^{n} \det \Sigma_{X_{\mathbf{i}_j}}}{\det \Sigma_X} $$
where $\Sigma_Y$ denotes the covariance matrix of the random vector Y. In the specific case considered, this yields the exact value (in bits) against which the estimates in Figure 8 are compared.
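The benchmark value can be computed directly from the covariance matrix; below is a small sketch (ours), valid for any partition of a gaussian vector into groups. The correlation value 0.5 used in the example is arbitrary, not the one used in the paper.

```python
import numpy as np

def gaussian_group_mi(cov, groups, base=2):
    """Exact MI between groups of a gaussian vector:
    0.5 * log( prod_j det(Sigma_group_j) / det(Sigma) ), cf. Equation (8)."""
    cov = np.asarray(cov, dtype=float)
    log_num = sum(np.linalg.slogdet(cov[np.ix_(g, g)])[1] for g in groups)
    log_den = np.linalg.slogdet(cov)[1]
    return 0.5 * (log_num - log_den) / np.log(base)

# usage: 3-d gaussian with all pairwise correlations 0.5, grouped as ({1,2}, {3}); value in bits
cov = np.array([[1.0, 0.5, 0.5], [0.5, 1.0, 0.5], [0.5, 0.5, 1.0]])
print(gaussian_group_mi(cov, groups=[[0, 1], [2]]))
```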
In Figure 8, left panel, we show the boxplots of the estimated MI values as the sample size increases. From both plots the unbiasedness of the estimator, even for small sample sizes, can be clearly detected. Our estimator always leads to better results than the others, both in terms of unbiasedness and of dispersion, cf. Figure 8, right panel. Both the KSG and the "plain entropy" estimators fail to center the exact value of the MI even for the largest sample sizes, although their standard deviations are comparable with the ones obtained with our method.
Figure 8.
(Gaussian ) Left panel: Boxplots of the estimated MI grouped according to different sample sizes. Right panel: mean error as a function of (Sample Size). Insets to the right panel: 95% confidence interval for the estimated MI (bottom left) and standard deviations (bottom right) versus Sample Size. Color map: black and white for the estimator we propose in Section 3.2, red for KSG and blue for “plain entropy”.
4.3. Four-Dimensional Vectors
For the case $d = 4$ we considered a multivariate Gaussian random vector with mean and covariance matrix given as
We chose two multi-indices to group the components. The results are illustrated in Figure 9. As already noticed for the three-dimensional Gaussian case, the KSG and "plain entropy" estimators seem biased, while the estimator presented here remains centered on the true value for all the explored sample sizes.
Figure 9.
(Gaussian ) Left panel: Boxplots of the estimated MI grouped according to different sample sizes. Right panel: mean error as a function of (Sample Size). Insets to the right panel: 95% confidence interval for the estimated MI (bottom left) and standard deviations (bottom right) versus Sample Size. Color map: black and white for the estimator we propose in Section 3.2, red for KSG and blue for “plain entropy”.
5. Conclusions
In this paper, we have presented a new estimator for the mutual information between subsets of components of d-dimensional random vectors. It exploits the link, proved here, between the mutual information and the entropy of the linkage function. Hence the problem of estimating mutual information is reduced to the computation of the entropy of a suitably transformed sample of random variables uniformly distributed on the interval $[0,1]$, which can be easily performed by the k–nearest neighbor technique illustrated in [10].
The method performs very well both in terms of unbiasedness and in terms of variance of the estimates. For the 2–dimensional case, the results are comparable to the KSG and "plain entropy" estimates. However, for higher dimensions (we tested examples up to dimension 4), our method is preferable, as it remains centered while KSG and "plain entropy" both show a bias. All the tested estimators are shown to be robust to mildly heavy-tailed distributions, such as the lognormal distribution, but they fail on distributions with power law tail probabilities and no finite moments, such as the Levy distribution. However, we suggest overcoming the problem by using the invariance property of MI under reparametrization of the margins, and hence pre-processing the data with a suitable transformation of the univariate components before estimation.
The fact that the estimator gives unbiased results for small sample sizes and larger dimensions allows for a wide range of uses, including applications where large data sets from real experiments are rarely available.
Acknowledgments
The authors would like to thank the two referees and Prof. Ilya Nemenman for their useful comments that helped improve this paper. The authors are grateful to András Horváth for his precious support concerning some numerical issues. RS is grateful to Michael I. Jordan for a short but valuable comment on KL–divergence estimators. Work partially supported by the "AMALFI" project (ORTO119W8J) and by the "Stochastic Processes and Applications 2012" project.
Conflicts of Interest
The authors declare no conflict of interest.
References
- Cover, T.M.; Thomas, J.A. Elements of Information Theory, 2nd ed.; Wiley-Interscience (John Wiley & Sons): Hoboken, NJ, USA, 2006. [Google Scholar]
- Ghahramani, Z.; Póczos, B.; Schneider, J.G. Copula-Based Kernel Dependency Measures. In Proceedings of the 29th International Conference on Machine Learning, (ICML-12), Edinburgh, UK, 26 June–1 July 2012; pp. 775–782.
- McGill, W.J. Multivariate information transmission. Psychometrika 1954, 19, 97–116. [Google Scholar] [CrossRef]
- Schneidman, E.; Still, S.; Berry, M.J.; Bialek, W. Network information and connected correlations. Phys. Rev. Lett. 2003, 91, 238701. [Google Scholar] [CrossRef] [PubMed]
- Margolin, A.A.; Wang, K.; Califano, A.; Nemenman, I. Multivariate dependence and genetic networks inference. IET Syst. Biol. 2010, 4, 428–440. [Google Scholar] [CrossRef] [PubMed]
- Darbellay, G.A. An estimator of the mutual information based on a criterion for independence. Comput. Stat. Data Anal. 1999, 32, 1–17. [Google Scholar] [CrossRef]
- Darbellay, G.A.; Vajda, I. Estimation of the information by an adaptive partitioning of the observation space. IEEE Trans. Inform. Theory 1999, 45, 1315–1321. [Google Scholar] [CrossRef]
- Kraskov, A.; Stögbauer, H.; Grassberger, P. Estimating mutual information. Phys. Rev. E 2004, 69, 066138. [Google Scholar] [CrossRef] [PubMed]
- Evans, D. A computationally efficient estimator for mutual information. Proc. R. Soc. Lond. Ser. A Math. Phys. Eng. Sci. 2008, 464, 1203–1215. [Google Scholar] [CrossRef]
- Kozachenko, L.F.; Leonenko, N.N. A statistical estimate for the entropy of a random vector. Probl. Peredachi Inform. 1987, 23, 9–16. [Google Scholar]
- Leonenko, N.; Pronzato, L.; Savani, V. A class of Rényi information estimators for multidimensional densities. Ann. Stat. 2008, 36, 2153–2182. [Google Scholar] [CrossRef]
- Pál, D.; Póczos, B.; Szepesvári, C. Estimation of Rényi Entropy and Mutual Information Based On Generalized Nearest-Neighbor Graphs.
- Beirlant, J.; Dudewicz, E.J.; Györfi, L.; van der Meulen, E.C. Nonparametric entropy estimation: An overview. Int. J. Math. Stat. Sci. 1997, 6, 17–40. [Google Scholar]
- Nemenman, I. Coincidences and estimation of entropies of random variables with large cardinalities. Entropy 2011, 13, 2013–2023. [Google Scholar] [CrossRef]
- Nguyen, X.; Wainwright, M.J.; Jordan, M.I. Estimating Divergence Functionals and the Likelihood Ratio by Penalized Convex Risk Minimization. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2007; pp. 1089–1096. [Google Scholar]
- Sugiyama, M.; Nakajima, S.; Kashima, H.; Buenau, P.V.; Kawanabe, M. Direct Importance Estimation with Model Selection and Its Application to Covariate Shift Adaptation. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2007; pp. 1433–1440. [Google Scholar]
- Nguyen, X.; Wainwright, M.J.; Jordan, M.I. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Trans. Inf. Theory 2010, 56, 5847–5861. [Google Scholar] [CrossRef]
- Kostal, L.; Pokora, O. Nonparametric estimation of information-based measures of statistical dispersion. Entropy 2012, 14, 1221–1233. [Google Scholar] [CrossRef]
- Li, H.; Scarsini, M.; Shaked, M. Linkages: A tool for the construction of multivariate distributions with given nonoverlapping multivariate marginals. J. Multivar. Anal. 1996, 56, 20–41. [Google Scholar] [CrossRef]
- Nelsen, R.B. An Introduction to Copulas, Lecture Notes in Statistics; Springer-Verlag: New York, NY, USA, 1999; Volume 139. [Google Scholar]
- Genest, C.; Quesada Molina, J.J.; Rodríguez Lallena, J.A. De l’impossibilité de construire des lois à marges multidimensionnelles données à partir de copules. C. R. Acad. Sci. Paris Sér. I Math. 1995, 320, 723–726. [Google Scholar]
- Jenison, R.L.; Reale, R.A. The shape of neural dependence. Neural Comput. 2004, 16, 665–672. [Google Scholar] [CrossRef] [PubMed]
- Blumentritt, T.; Schmid, F. Mutual information as a measure of multivariate association: Analytical properties and statistical estimation. J. Stat. Comput. Simul. 2012, 82, 1257–1274. [Google Scholar] [CrossRef]
- Margolin, A.A.; Nemenman, I.; Basso, K.; Wiggins, C.; Stolovitzky, G.; Favera, R.D.; Califano, A. ARACNE: An algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinf. 2006, 7, S7. [Google Scholar] [CrossRef] [PubMed]
- Reiss, R.D. Nonparametric estimation of smooth distribution functions. Scand. J. Statist. 1981, 8, 116–119. [Google Scholar]
- Wand, M.P.; Jones, M.C. Kernel Smoothing; Chapman and Hall Ltd.: London, UK, 1995; Volume 60 Monographs on Statistics and Applied Probability. [Google Scholar]
- Silverman, B.W. Density Estimation for Statistics and Data Analysis; CRC Press: London, UK, 1986; Volume 26. [Google Scholar]
- Bishop, C.M. Neural Networks for Pattern Recognition; Oxford University Press: New York, NY, USA, 1995. [Google Scholar]
- Hyndman, R.J.; Bashtannyk, D.M.; Grunwald, G.K. Estimating and visualizing conditional densities. J. Comput. Graph. Statist. 1996, 5, 315–336. [Google Scholar]
- Fan, J.; Yao, Q.; Tong, H. Estimation of conditional densities and sensitivity measures in nonlinear dynamical systems. Biometrika 1996, 83, 189–206. [Google Scholar] [CrossRef]
- Hyndman, R.J.; Yao, Q. Nonparametric estimation and symmetry tests for conditional density functions. J. Nonparametr. Stat. 2002, 14, 259–278. [Google Scholar] [CrossRef]
- Victor, J.D. Binless strategies for estimation of information from neural data. Phys. Rev. E 2002, 66, 051903. [Google Scholar] [CrossRef] [PubMed]
- Marek, T.; Tichavsky, P. On the estimation of mutual information. In Proceedings of ROBUST 2008, Honolulu, USA, 2–5 November 2008; Antoch, J., Dohnal, G., Eds.; 2008. [Google Scholar]
- Platen, E.; Bruti-Liberati, N. Numerical Solution of Stochastic Differential Equations with Jumps in Finance; Springer: Berlin, Germany, 2010; Volume 64. [Google Scholar]
© 2013 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).