An Assessment of Hermite Function Based Approximations of Mutual Information Applied to Independent Component Analysis

At the heart of many ICA techniques is a nonparametric estimate of an information measure, usually via nonparametric density estimation, for example, kernel density estimation. While not as popular as kernel density estimators, orthogonal functions can be used for nonparametric density estimation (via a truncated series expansion whose coefficients are calculated from the observed data). While such estimators do not necessarily yield a valid density, which kernel density estimators do, they are faster to calculate than kernel density estimators, in particular for a modified version of Renyi's entropy of order 2. In this paper, we compare the performance of ICA using Hermite series based estimates of Shannon's and Renyi's mutual information, to that of Gaussian kernel based estimates. The comparisons also include ICA using the RADICAL estimate of Shannon's entropy and a FastICA estimate of neg-entropy.


Introduction
Many of the techniques used to perform Independent Component Analysis (ICA) minimize an "information" measure. One of the pioneering papers for ICA [4] uses the Gram-Charlier expansion to give an approximation of differential entropy in terms of third and fourth order cumulants; FastICA [7] approximates neg-entropy, which is a measure of non-Gaussianity, with a number of non-linear functions; RADICAL [10] uses order statistics to estimate differential entropy; in [3] kernel density estimators are used to estimate differential entropy; in [6] kernel density estimators are used to estimate Renyi's mutual information; in [5], nonparametric density estimation using Legendre polynomials is used to estimate Renyi's mutual information. At the heart of these works is a nonparametric density estimator, which is then used to approximate the information measure. In this paper we compare the performance of some existing ICA techniques, each based on a nonparametric estimator of Shannon's or Renyi's mutual information, with that of ICA based on Hermite function estimates of Shannon's and Renyi's mutual information.
This paper is organized as follows. Section 2 introduces the ICA problem and the usefulness of Shannon's and Renyi's mutual information for solving it; Section 2.1 outlines how orthogonal functions can be used for nonparametric density estimation, and defines the Hermite polynomials and functions; Section 2.2 outlines kernel density estimation, which will be used in the ICA comparisons; Section 2.3 outlines the justification for the differential entropy estimate used in the RADICAL algorithm, which will be used in the ICA comparisons; Section 2.4 outlines a neg-entropy estimate used by FastICA, which will be used in the ICA comparisons. Section 3 discusses the results of a simulation experiment comparing six approaches to ICA. Section 4 discusses how the ideas in this paper could be extended to the complex-valued case.

Independent Component Analysis and Mutual Information
The term Independent Component Analysis (ICA) refers to a range of techniques that have emerged since the mid-nineties to tackle the following blind signal separation problem: find the unknown mixing matrix $A$ in the following equation with only a knowledge of $y = [y_1\ y_2\ \ldots\ y_L]^T$:

$$y = Ax, \qquad (1)$$

where the components of $x$ are independent. In [4], it is shown that if the components of $x$ are independent and at most one is Gaussian, then the matrix $B$ minimizing Shannon's mutual information of $BY$ uniquely identifies $A^{-1}$ up to a permutation matrix (acting on rows). Shannon's mutual information is defined for a random vector $Y = (Y_1, \ldots, Y_L)^T$ as

$$I(Y) = \sum_{i=1}^{L} H(Y_i) - H(Y), \qquad (2)$$

where $H(\cdot)$ is called differential entropy. The joint entropy $H(BY)$ need not be calculated, due to the relation

$$I(BY) = \sum_{i=1}^{L} H(b_i Y) - H(Y) - \log|\det B|,$$

where $b_i$ is the $i$th row of $B$. Since $H(Y)$ does not depend on $B$, this observation implies that minimizing $I(BY)$ is equivalent to minimizing

$$\sum_{i=1}^{L} H(b_i Y) - \log|\det B|. \qquad (3)$$

This still requires that $\sum_{i=1}^{L} H(b_i Y)$ be estimated, but this is much easier than calculating the multivariate version of differential entropy. Most papers calculate $H(b_i Y)$ in some sort of nonparametric fashion, as the source distributions are generally unknown. For example, in [4], the Gram-Charlier expansion is used to estimate the densities, which with some approximations gives an approximation of differential entropy in terms of third and fourth order cumulants.
Most ICA methods that attempt to minimize mutual information start by whitening the data. Suppose $E$ is a matrix that whitens $Y$; let

$$Z = EY = CX,$$

where $X$ is the vector of sources and $C = EA$, which turns out to be orthogonal; see [12]. The orthogonality of $C$ is useful in two ways. Firstly, the search for the inverse mixing matrix is restricted to orthogonal matrices. Secondly, if $D$ is an orthogonal matrix, then $|\det D| = 1$ and $I(DZ)$ simplifies to

$$I(DZ) = \sum_{i=1}^{L} H(d_i Z) - H(Z), \qquad (4)$$

where $d_i$ is the $i$th row of $D$. This observation implies that minimizing $I(DZ)$ is equivalent to minimizing

$$\sum_{i=1}^{L} H(d_i Z). \qquad (5)$$

Shannon's mutual information is not the only information measure used for ICA. In [5] and [6], a modified version of Renyi's mutual information is used. Renyi's mutual information for a random vector $Z$ is given by

$$I_\alpha(Z) = \frac{1}{\alpha - 1} \log \int \frac{p_Z(z)^\alpha}{\left(\prod_{i=1}^{L} p_{Z_i}(z_i)\right)^{\alpha - 1}}\, dz. \qquad (6)$$

This quantity is minimized if the components of $Z$ are independent. Unfortunately, unlike Shannon's mutual information in equation (2), Renyi's mutual information does not split up into one part involving marginal distributions and another the joint distribution. To overcome this, they instead use

$$\sum_{i=1}^{L} H_\alpha(Z_i) - H_\alpha(Z), \qquad (7)$$

where $H_\alpha(Z) = \frac{1}{1 - \alpha} \log \int p_Z(z)^\alpha\, dz$ is Renyi's entropy of order $\alpha$. This quantity is minimized if the components of $Z$ are independent. Moreover, the property of Shannon's mutual information displayed in equation (4) is replicated in this measure. Hence, in [5] and [6], they use whitened data and the following quantity:

$$\sum_{i=1}^{L} H_\alpha(d_i Z). \qquad (8)$$

In the following subsections we will describe some ways of estimating the marginal entropies in equations (5) and (8).
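The whitening step described above can be illustrated with a short NumPy sketch. It is not taken from any of the cited implementations: the function name `whiten`, the symmetric inverse square root used for $E$, and the uniform test sources are all assumptions made for illustration.

```python
import numpy as np

def whiten(y):
    """Whiten an (L, N) array of mixed signals.

    Returns (z, E) with z = E @ (y - mean), where E is the symmetric
    inverse square root of the sample covariance (one of many valid choices).
    """
    y_centered = y - y.mean(axis=1, keepdims=True)
    cov = np.cov(y_centered)                    # (L, L) sample covariance
    eigvals, eigvecs = np.linalg.eigh(cov)      # cov = V diag(w) V^T
    E = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T
    return E @ y_centered, E

# Hypothetical example: two independent, unit-variance uniform sources.
rng = np.random.default_rng(0)
x = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, 5000))
A = rng.normal(size=(2, 2))                     # random mixing matrix
y = A @ x
z, E = whiten(y)
C = E @ A                                       # approximately orthogonal
```

With whitened data, `C @ C.T` is close to the identity up to sampling error, which is why the search for the unmixing matrix can be restricted to rotations.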

Orthogonal Series for Nonparametric Density Estimation
This material comes from [17]. Given a random sample $\{x_i\}_{i=1}^{N}$ for a random variable $X$ with density $p(x)$ and a set of basis functions $\{\xi_n(x)\}_{n=0}^{\infty}$ which are orthogonal with respect to a kernel $K(x)$, an estimate of $p(x)$ is

$$\hat{p}(x) = \sum_{n=0}^{M} \hat{b}_n \xi_n(x),$$

where

$$\hat{b}_n = \frac{1}{N \lambda_n} \sum_{i=1}^{N} K(x_i)\, \xi_n(x_i), \qquad \lambda_n = \int K(x)\, \xi_n(x)^2\, dx.$$

A variety of criteria exist for choosing $M$; see [9] for a review (we do not discuss this issue here as it turned out not to be important). Unfortunately, this basis expansion density estimate may not be a proper density in terms of non-negativity and integration to 1, unlike kernel density estimators. However, such estimates are much faster to calculate than kernel density estimates for large $N$.
Perhaps the most commonly used basis functions for nonparametric density estimation are the Hermite functions, which are defined as

$$h_n(x) = \frac{e^{-x^2/2} H_n(x)}{\sqrt{2^n n! \sqrt{\pi}}},$$

where the $H_n(x)$ are the Hermite polynomials, which are defined as

$$H_n(x) = (-1)^n e^{x^2} \frac{d^n}{dx^n} e^{-x^2}.$$

Note that the Hermite polynomials are orthogonal with respect to the kernel $e^{-x^2}$, with squared norms $2^n n! \sqrt{\pi}$, which means that the Hermite functions are orthogonal with respect to the kernel 1, with squared norms 1. To the best of our knowledge, it is not known whether it is better to use Hermite polynomials or Hermite functions for nonparametric density estimation. However, the Hermite functions are more commonly used. One reason may be the ease with which it is possible to prove results regarding the convergence of such an estimate; see [15] and [16], for example, with regard to mean square and integrated mean square rates of convergence. We used the Hermite functions for two reasons: firstly, the tails of the estimated density behaved much better using the Hermite functions; secondly, a simple expression for Renyi's entropy can be derived.
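Using the Hermite functions for nonparametric density estimation gives the following estimate of a density $p(x)$:

$$\hat{p}(x) = \sum_{n=0}^{M} \hat{b}_n h_n(x), \qquad \hat{b}_n = \frac{1}{N} \sum_{i=1}^{N} h_n(x_i).$$

In our ICA experiments, we estimate the quantity in equation (5) using the following estimate for the marginal entropies:

$$\hat{H}(X) = -\frac{1}{N} \sum_{i=1}^{N} \log \left| \sum_{n=0}^{M} \hat{b}_n h_n(x_i) \right|. \qquad (9)$$

The absolute value is taken to handle negative values of the estimated density. We took $M = 40$ because for $M \geq 40$ there was little difference in the performance of ICA based on mutual information using this approximation to the marginal entropies.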
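As previously mentioned, in [5], Legendre polynomials are used to perform ICA by minimizing an estimate of the quantity in equation (8). We will also perform ICA by minimizing an estimate of the quantity in equation (8) for $\alpha = 2$, using Hermite functions instead. In this case, the quantity is very fast to calculate, as the orthonormality of the Hermite functions gives the marginal Renyi entropy (for $\alpha = 2$, that is) the form

$$\hat{H}_2(X) = -\log \int \hat{p}(x)^2\, dx = -\log \sum_{n=0}^{M} \hat{b}_n^2, \qquad (10)$$

where $\hat{b}_n = \frac{1}{N} \sum_{i=1}^{N} h_n(x_i)$ are the estimated Hermite coefficients. Again, we took $M = 40$ because for $M \geq 40$ there was little difference in the performance of ICA based on mutual information using this approximation to the marginal entropies. We finish this subsection by presenting the following property of Hermite polynomials that was used to calculate the quantities in equations (9) and (10) efficiently:

$$H_{n+1}(x) = 2x H_n(x) - 2n H_{n-1}(x).$$

The last line of the following equation shows how the property was used to calculate the coefficients in equations (9) and (10):

$$\hat{b}_{n+1} = \frac{1}{N} \sum_{i=1}^{N} h_{n+1}(x_i) = \frac{1}{N} \sum_{i=1}^{N} \frac{e^{-x_i^2/2} H_{n+1}(x_i)}{\sqrt{2^{n+1} (n+1)! \sqrt{\pi}}} = \frac{1}{N} \sum_{i=1}^{N} \left[ \sqrt{\frac{2}{n+1}}\, x_i h_n(x_i) - \sqrt{\frac{n}{n+1}}\, h_{n-1}(x_i) \right].$$

The quantity in equation (9) is calculated in a similar fashion.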
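The Hermite function estimates of this subsection can be sketched in a few lines of NumPy. This is a minimal illustration rather than the code used for the experiments: the function names are hypothetical, and the three-term recurrence is applied directly to the normalized Hermite functions, $h_{n+1}(x) = \sqrt{2/(n+1)}\, x\, h_n(x) - \sqrt{n/(n+1)}\, h_{n-1}(x)$, which follows from the Hermite polynomial recurrence $H_{n+1}(x) = 2xH_n(x) - 2nH_{n-1}(x)$.

```python
import numpy as np

def hermite_functions(x, M):
    """Return an (M + 1, N) array whose n-th row is h_n evaluated at x."""
    x = np.asarray(x, dtype=float)
    h = np.empty((M + 1, x.size))
    h[0] = np.pi ** -0.25 * np.exp(-0.5 * x ** 2)
    if M >= 1:
        h[1] = np.sqrt(2.0) * x * h[0]
    for n in range(1, M):
        # Normalized three-term recurrence for the Hermite functions.
        h[n + 1] = (np.sqrt(2.0 / (n + 1)) * x * h[n]
                    - np.sqrt(n / (n + 1)) * h[n - 1])
    return h

def hermite_entropies(x, M=40):
    """Hermite series estimates of Shannon and order-2 Renyi entropy."""
    h = hermite_functions(x, M)
    b = h.mean(axis=1)                  # b_n = (1/N) sum_i h_n(x_i)
    p_at_samples = b @ h                # estimated density at each sample
    shannon = -np.mean(np.log(np.abs(p_at_samples)))
    renyi2 = -np.log(np.sum(b ** 2))    # uses orthonormality of the h_n
    return shannon, renyi2
```

For a standard normal sample the two values should be close to $\frac{1}{2}\log(2\pi e)$ and $\log(2\sqrt{\pi})$ respectively. The cost is linear in the sample size for fixed $M$.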

Kernel Density Estimation
A popular method for nonparametric density estimation, called kernel density estimation, was introduced by Parzen in [13]. Given a random sample $\{x_i\}_{i=1}^{N}$ for a univariate random variable $X$, the kernel density estimate of $p_X(x)$ is

$$\hat{p}_X(x) = \frac{1}{Nh} \sum_{i=1}^{N} K\left(\frac{x - x_i}{h}\right),$$

where $K(\cdot)$ is called the kernel and $h$ the smoothing parameter or bandwidth (note that for a multivariate random variable, almost the same is done, with the smoothing parameter being a matrix instead).
A variety of criteria exist for choosing the bandwidth $h$, see [14], but we do not discuss this issue here. Unsurprisingly, the Gaussian kernel is the most commonly used, which has the form

$$K(u) = \frac{1}{\sqrt{2\pi}} e^{-u^2/2}.$$

In [3], ICA is done by minimizing an estimate of the quantity in equation (5). They use Gaussian kernel density estimation, which leads them to the following marginal entropy estimate:

$$\hat{H}(X) = -\frac{1}{N} \sum_{i=1}^{N} \log \left[ \frac{1}{Nh} \sum_{j=1}^{N} K\left(\frac{x_i - x_j}{h}\right) \right]. \qquad (11)$$

This estimate involves a double summation over the sample, compared to a single summation over the sample in equation (9), making it much slower to calculate for large $N$.
In [6], ICA is performed by minimizing an estimate of the quantity in equation (8) for $\alpha = 2$. They use a Gaussian kernel to estimate the marginal densities; the marginal Renyi entropies of the resulting density estimates are given exactly by

$$\hat{H}_2(X) = -\log \left[ \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} \frac{1}{2h\sqrt{\pi}} \exp\left(-\frac{(x_i - x_j)^2}{4h^2}\right) \right]. \qquad (12)$$

Again, a double summation is involved in the calculation, compared to a single summation in equation (10).
In our comparisons we will include estimates of Shannon's and Renyi's mutual information using equations (11) and (12) respectively. As is done in [3], we use an "optimal" value for the kernel bandwidth, $h = 1.06 N^{-1/5}$.
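As a rough illustration of the two Gaussian kernel estimates discussed in this subsection, the sketch below evaluates a resubstitution estimate of Shannon's entropy and the closed-form order-2 Renyi entropy of the kernel density estimate. The function name is hypothetical, the default bandwidth assumes unit-variance (whitened) data, and the full pairwise-difference matrix is formed explicitly, which makes the quadratic cost in $N$ visible.

```python
import numpy as np

def kde_entropies(x, h=None):
    """Gaussian-kernel estimates of Shannon and order-2 Renyi entropy."""
    x = np.asarray(x, dtype=float)
    N = x.size
    if h is None:
        h = 1.06 * N ** (-0.2)          # bandwidth quoted in the text
    d = x[:, None] - x[None, :]         # (N, N) pairwise differences
    # Shannon: average of -log p_hat evaluated at the sample points.
    p_at_samples = (np.exp(-0.5 * (d / h) ** 2).sum(axis=1)
                    / (N * h * np.sqrt(2.0 * np.pi)))
    shannon = -np.mean(np.log(p_at_samples))
    # Renyi, order 2: the integral of p_hat^2 is a sum of Gaussians of
    # variance 2 h^2 evaluated at the pairwise differences.
    integral = (np.exp(-0.25 * (d / h) ** 2).sum()
                / (N ** 2 * 2.0 * h * np.sqrt(np.pi)))
    renyi2 = -np.log(integral)
    return shannon, renyi2
```

The $N \times N$ array `d` is the double summation referred to in the text; memory use grows quadratically as well, so in practice the sums would be accumulated in blocks for large $N$.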

RADICAL
The ICA algorithm RADICAL, proposed in [10], uses an estimate of differential entropy based on order statistics. In our comparisons, ICA will also be performed by using this estimate for the marginal entropies in equation (5). Their estimate of differential entropy is derived as follows. Given a random sample $Z_1, Z_2, \ldots, Z_N$ for a random variable $Z$, let $Z^{(1)} \leq Z^{(2)} \leq \ldots \leq Z^{(N)}$ denote the sample sorted in ascending order. Let $p(\cdot)$ and $P(\cdot)$ be the probability density function (what we have been referring to as density) and cumulative distribution function for $Z$ respectively. Then

$$E\left[P(Z^{(i+1)}) - P(Z^{(i)})\right] = \frac{1}{N+1}.$$

This leads to the following nonparametric estimate for $p(\cdot)$:

$$\hat{p}(z) = \frac{1/(N+1)}{Z^{(i+1)} - Z^{(i)}} \quad \text{for } Z^{(i)} \leq z < Z^{(i+1)},$$

where $Z^{(0)}$ and $Z^{(N+1)}$ are the infimum and supremum of the support of $p(z)$ respectively. A corresponding estimate of entropy based on this estimate of $p(z)$ is

$$\hat{H}(Z) = \frac{1}{N-1} \sum_{i=1}^{N-1} \log\left[(N+1)\left(Z^{(i+1)} - Z^{(i)}\right)\right].$$

This estimate has a high variance; the variance is lowered by using a larger spacing $m$, which leads to their estimate of differential entropy:

$$\hat{H}(Z) = \frac{1}{N-m} \sum_{i=1}^{N-m} \log\left[\frac{N+1}{m}\left(Z^{(i+m)} - Z^{(i)}\right)\right]. \qquad (13)$$

Furthermore, they apply this to an augmented data set, where each observed point is augmented by the addition of $R$ data points drawn from a $N(0, \sigma_r)$ distribution centered on it, with $R = 30$ and $\sigma_r = 0.1$ typically; the augmentation is done once to the mixed data, not to the "unmixed" data.
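The spacing-based entropy estimate can be written compactly with a sort and a vectorized difference. The sketch below is illustrative only: the function name is hypothetical, the default $m \approx \sqrt{N}$ is an assumption (it is the choice we understand to be suggested in [10]), and the data augmentation step is omitted.

```python
import numpy as np

def m_spacing_entropy(z, m=None):
    """m-spacing estimate of differential entropy from a 1-D sample."""
    z = np.sort(np.asarray(z, dtype=float))
    N = z.size
    if m is None:
        m = int(round(np.sqrt(N)))      # assumed default spacing
    spacings = z[m:] - z[:-m]           # Z_(i+m) - Z_(i), i = 1..N-m
    return np.mean(np.log((N + 1) / m * spacings))
```

Sorting dominates the cost. Tied sample values with $m = 1$ would give a zero spacing and a logarithm of zero, which is one practical motivation for larger spacings and for the noise augmentation described above.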

FastICA
One of the most popular methods for ICA is FastICA [7], for which there are a number of variants. In our comparisons we will include one of their measures of neg-entropy to estimate the marginal entropies in equation (5). The neg-entropy for a random variable $X$ is defined to be

$$J(X) = H(Z) - H(X),$$

where $H(Z)$ is the differential entropy of a Gaussian random variable $Z$ with the same mean and variance as $X$. Neg-entropy is useful for two reasons. The first reason is that it measures the "non-Gaussianity" of a random variable; the more non-Gaussian a random variable is, the more interesting. These "interesting" components can be found individually, rather than collectively, as is done when minimizing mutual information. Moreover, under certain conditions, the set of "interesting" components includes the independent components. The second reason neg-entropy is useful is that it is scale invariant, unlike differential entropy. Now, we will be taking a measure of neg-entropy used by FastICA out of context. Due to the whitening of the data, $H(Z)$ will remain constant, so that $H(X) = H(Z) - J(X)$ and neg-entropy can be used to estimate differential entropy up to an additive constant. We will use the following estimate for neg-entropy, which can be found in equation (5.47) of [8]:

$$J(X) \approx k_1 \left(E\left[X e^{-X^2/2}\right]\right)^2 + k_2 \left(E\left[|X|\right] - \sqrt{2/\pi}\right)^2,$$

where $k_1 = 36/(8\sqrt{3} - 9)$ and $k_2 = 1/(2 - 6/\pi)$.
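A sample version of this neg-entropy approximation replaces the expectations with sample means. The sketch below is a minimal illustration, not FastICA itself; the function name is hypothetical, and the explicit standardization is an assumption added so that the function can be called on data that have not already been whitened.

```python
import numpy as np

K1 = 36.0 / (8.0 * np.sqrt(3.0) - 9.0)
K2 = 1.0 / (2.0 - 6.0 / np.pi)

def negentropy_estimate(x):
    """Approximate neg-entropy of a 1-D sample (zero for a Gaussian)."""
    x = np.asarray(x, dtype=float)
    x = (x - x.mean()) / x.std()        # standardize: zero mean, unit variance
    odd_term = np.mean(x * np.exp(-0.5 * x ** 2)) ** 2          # asymmetry
    even_term = (np.mean(np.abs(x)) - np.sqrt(2.0 / np.pi)) ** 2
    return K1 * odd_term + K2 * even_term
```

Only two sample means are needed, which is consistent with this being the cheapest of the estimates compared in this paper. Since whitened data keep $H(Z)$ fixed, minimizing the sum of marginal entropies corresponds to maximizing the sum of these neg-entropy values.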

Simulations
The performance of six approaches to ICA was compared for eight different scenarios. For each scenario, two signals were generated from particular distributions and then mixed. The following six ICA approaches were compared:
1. Minimizing Shannon's mutual information via equation (5), using Hermite functions to estimate the marginal differential entropies, as described in equation (9).
2. Minimizing Renyi's mutual information via equation (8), using Hermite functions to estimate the marginal Renyi entropies, as described in equation (10).
3. Minimizing Shannon's mutual information via equation (5), using kernel density estimation to estimate the marginal differential entropies, as described in equation (11).
4. Minimizing Renyi's mutual information via equation (8), using kernel density estimation to estimate the marginal Renyi entropies, as described in equation (12).
5. Minimizing Shannon's mutual information via equation (5), using the RADICAL estimate for differential entropy, as described in equation (13).
6. Minimizing Shannon's mutual information via equation (5), using the FastICA neg-entropy estimate described in Section 2.4 to estimate the marginal differential entropies.
For each scenario, 200 simulations with 1000 data points were performed. For each simulation, the mixing matrix was also randomly generated so that the results would not depend on the mixing matrix. The data were also whitened, and a "full" search (rather than a gradient approach) for the inverse mixing matrix was done over the space of (orthogonal) rotation matrices. Each scenario used the same two distributions; Table 1 lists the distributions used, with Figure 1 displaying the distributions. Note that in the first scenario, the distributions are Gaussian mixtures, where $U$ is a uniformly distributed random variable on $[0, 1]$.

The performance of each algorithm was measured with the commonly used Amari's error [1]. Amari's error is particularly useful for ICA comparisons as it is permutation invariant. We used it in the following way. Suppose $W$ is an estimate of the inverse mixing matrix $A^{-1}$. Amari's error was then calculated between $W$ and $A^{-1}$, following [1].

The mean Amari errors for each scenario can be found in Table 2, with the last row containing the mean over all simulations. We can see that no one method is superior or inferior to the rest. However, if we compare the mean of Amari's error over all simulations, an estimate of Shannon's mutual information based on kernel density estimators performs best. Also, using Hermite functions, ICA using Renyi's mutual information works better than ICA using Shannon's mutual information. A possible explanation for this is displayed in Figure 2: the first graph displays the estimates of Shannon's mutual information over a range of rotation matrices; the second graph displays estimates of Renyi's mutual information for the same rotation matrices. The Hermite function estimates of Shannon's and Renyi's mutual information displayed different behaviors: Shannon's was generally less smooth than Renyi's, leading to "false" minima. One possible reason for this can be seen in equation (9), where the absolute value of the estimated density is taken; the lack of smoothness in the absolute value function may be transferred to the marginal entropy estimates.
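For readers who want to reproduce the error measure, the sketch below implements one commonly used normalized form of Amari's error. The normalization by $2L$ and the use of the product $WA$ are assumptions on our part for illustration; other papers use slightly different conventions, and the function name is hypothetical.

```python
import numpy as np

def amari_error(W, A):
    """Amari-type error between an unmixing estimate W and mixing matrix A.

    Zero exactly when W @ A is a scaled permutation matrix, i.e. when W
    recovers the inverse of A up to row order and row scaling.
    """
    b = np.abs(W @ A)
    L = b.shape[0]
    row_term = (b.sum(axis=1) / b.max(axis=1) - 1.0).sum()
    col_term = (b.sum(axis=0) / b.max(axis=0) - 1.0).sum()
    return (row_term + col_term) / (2.0 * L)

# Hypothetical check: permuting and rescaling the rows of inv(A) costs nothing.
A = np.array([[2.0, 1.0], [1.0, 3.0]])
W_perfect = np.array([[0.0, -4.0], [0.5, 0.0]]) @ np.linalg.inv(A)
```

Under this normalization the error lies between 0 and $L - 1$, so for the two-signal scenarios considered here it lies in $[0, 1]$.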
We finish this section by comparing the time needed to calculate the various entropy estimates. One would expect that the Hermite function based estimates should be of order $O(n)$ as they really only involve one major summation; the kernel density based estimates should be of order $O(n^2)$ due to the double summation involved; the RADICAL estimate should be of order $O(n \log n)$ as its major component is sorting. Observations bear out these hypotheses; see Table 3, which contains the time in milliseconds taken to calculate the various entropy estimates for 1000, 2000, 4000 and 8000 data points. Moreover, we see that the Hermite based approximations are significantly faster than all but the FastICA estimate, which is to be expected as the FastICA estimate does not involve many calculations.

Further Work
The complex-valued ICA case has not received the same attention as the real-valued case. Some ICA techniques have been developed for the case of sources with spherically symmetric distributions, that is, where the distribution of a source is unchanged by multiplication by $e^{i\theta}$. In this case, ICA techniques can be applied to the absolute value of the unmixed sources; both RADICAL and FastICA (see [2] and [11] respectively) have been extended to the complex case using this idea. Now, if the sources do not have spherically symmetric distributions, one cannot just consider the absolute value of the unmixed sources. However, there is no reason why one cannot use the complex-valued versions of the information measures in equations (5) and (8). For both information measures, estimates of the entropy quantities making them up will be based on bivariate density estimates; both Hermite series and kernel density estimates can give these. To illustrate, consider a bivariate random variable $(X, Y)$ with samples $\{(x_i, y_i)\}_{i=1}^{N}$. The Hermite series estimate of the bivariate density $p_{X,Y}(x, y)$ is

$$\hat{p}_{X,Y}(x, y) = \sum_{m=0}^{M} \sum_{n=0}^{M} \hat{b}_{n,m} h_m(x) h_n(y), \quad \text{where } \hat{b}_{n,m} = \frac{1}{N} \sum_{i=1}^{N} h_m(x_i) h_n(y_i).$$

The bivariate versions of the entropy estimates in equations (9) and (10) are then

$$\hat{H}(X, Y) = -\frac{1}{N} \sum_{i=1}^{N} \log \left| \sum_{m=0}^{M} \sum_{n=0}^{M} \hat{b}_{n,m} h_m(x_i) h_n(y_i) \right|$$

and

$$\hat{H}_2(X, Y) = -\log \sum_{m=0}^{M} \sum_{n=0}^{M} \hat{b}_{n,m}^2$$

respectively. Nonparametric density estimation using Hermite functions does have some advantages over kernel density estimation in higher dimensions. Firstly, as was noted in [15] and [16], the Hermite series estimate converges at the same rate (with respect to mean integrated square error and mean square error) regardless of dimension, whereas the rate of convergence of the kernel density based estimate degrades as the dimension increases. Secondly, unlike the univariate case, it is a non-trivial task to choose the smoothing matrix required by the kernel density estimate. Some further work would be to compare the performance of Hermite series and kernel density based estimates of the information measures in equations (5) and (8) for the case of complex data which are not spherically symmetric. In this case, the Hermite series approach may well prove to be superior, due to its better scaling with dimension compared with kernel density estimates.
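The bivariate Hermite series estimates can be sketched with a tensor-product basis. As before, this is an illustration under stated assumptions rather than a tested implementation for complex ICA: the function names are hypothetical, the real and imaginary parts of a complex signal are assumed to be passed as `x` and `y`, and the default truncation `M=20` is arbitrary.

```python
import numpy as np

def hermite_functions(x, M):
    """Return an (M + 1, N) array whose n-th row is h_n evaluated at x."""
    x = np.asarray(x, dtype=float)
    h = np.empty((M + 1, x.size))
    h[0] = np.pi ** -0.25 * np.exp(-0.5 * x ** 2)
    if M >= 1:
        h[1] = np.sqrt(2.0) * x * h[0]
    for n in range(1, M):
        h[n + 1] = (np.sqrt(2.0 / (n + 1)) * x * h[n]
                    - np.sqrt(n / (n + 1)) * h[n - 1])
    return h

def bivariate_hermite_entropies(x, y, M=20):
    """Hermite series estimates of joint Shannon and order-2 Renyi entropy."""
    hx = hermite_functions(x, M)        # rows indexed by m
    hy = hermite_functions(y, M)        # rows indexed by n
    N = hx.shape[1]
    b = hy @ hx.T / N                   # b[n, m] = (1/N) sum_i h_m(x_i) h_n(y_i)
    # Estimated joint density evaluated at each sample point.
    p_at_samples = np.einsum("nm,mi,ni->i", b, hx, hy)
    shannon = -np.mean(np.log(np.abs(p_at_samples)))
    renyi2 = -np.log(np.sum(b ** 2))    # orthonormality of the product basis
    return shannon, renyi2
```

The coefficient matrix costs $O(NM^2)$ to form, so the estimate remains linear in the sample size, whereas a bivariate kernel estimate would still require all pairwise differences.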

Conclusion
In this paper we have compared the performance of ICA using Hermite function based nonparametric estimates of Shannon's and Renyi's mutual information to that of kernel density based estimates, along with RADICAL and FastICA estimates of Shannon's mutual information. In the eight scenarios considered, the Hermite function based approach was competitive in terms of performance with the other ICA approaches, and much faster to calculate than the kernel density based approach.

Figure 1. Plots of Distributions Used

Figure 2. Behavior of Shannon and Renyi Mutual Information Estimates using Hermite Functions

Table 1. Distributions used in Scenarios

Table 3. Calculation times for entropy calculation (milliseconds)