1. Introduction
Differential entropy (DE) has wide applications in a range of fields including signal processing, machine learning, and feature selection [
1,
2,
3]. DE estimation is also related to dimension reduction through independent component analysis [
4], a method for separating data into additive components. Such algorithms typically look for linear combinations of different independent signals. Since two variables are independent if and only if their mutual information vanishes, accurate and efficient entropy estimation algorithms are highly advantageous [
Another important application of DE estimation is quantifying order in out-of-equilibrium physical systems [
6,
7]. In such systems, existing efficient methods for entropy approximation using thermodynamic integration fail, and more fundamental approaches for estimating the DE from independent samples are required.
The DE of a continuous multidimensional distribution with density
$p\left(x\right):{\mathbb{R}}^{D}\to \mathbb{R}$ is defined as $H=-{\int}_{{\mathbb{R}}^{D}}p\left(x\right)\mathrm{ln}p\left(x\right)dx$.
Despite a large number of suggested algorithms [
8,
9], the problem of estimating the DE from independent samples of distributions remains a challenge in high dimensions. Broadly speaking, algorithms can be classified as one of two approaches: binning and sample-spacing methods, or their multidimensional analogues, namely partitioning and nearest-neighbor (NN) methods. In 1D, the most straightforward method is to partition the support of the distribution into bins and either calculate the entropy of the histogram or use it for plug-in estimates [
8,
10,
11]. This amounts to approximating
$p\left(x\right)$ as a piecewise constant function (i.e., assuming that the distribution is uniform in each subset in the partition). This works well if the support of the underlying distribution is bounded and given. If the support is not known or is unbounded, it can be estimated as well, for example using the minimal and maximal observations. In such cases, sample-spacing methods [
8] that use the spacings between adjacent samples are advantageous. Overall, the literature provides a good arsenal of tools for estimating 1D entropy including rigorous bounds on convergence rates (given some further assumptions of
p). See [
8,
9] for reviews.
Estimating entropy in higher dimensions is significantly more challenging [
9,
12]. Binning methods become impractical as having
M bins in each dimension implies
${M}^{D}$ bins overall. Beyond the computational costs, most such bins will often have 1 or 0 samples, leading to significant underestimation of the entropy. In order to overcome this difficulty, Stowell and Plumbley [
13] suggested partitioning the data using a
kD partitioning tree hierarchy (
kDP). In each level of the tree, the data is divided into two parts with an equal number of samples. The splitting continues recursively across the different dimensions (see below for a discussion on the stopping criteria). The construction essentially partitions the support of
p into bins that are multidimensional rectangles whose sides are aligned with the principal axes. The DE is then calculated assuming a uniform distribution in each rectangle. As shown below, this strategy works well at low dimensions (typically 2–3) and only if the support is known. The method is highly efficient, as constructing the partition tree has an
$O(N\mathrm{log}N)$ cost. In particular, it has no explicit dependence on the dimension.
Spacing methods are generalized using the set of
k nearestneighbors to each sample (
kNN) [
11,
14,
15,
16,
17]. These are used to locally approximate the density, typically using kernels [
10,
18,
19,
20,
21,
22]. As shown below,
kNN schemes perform well at moderately high dimensions (up to 10–15) for distributions with unbounded support. However, they fail completely when
p has a compact support and become increasingly inefficient with the dimension. Broadly speaking, algorithms for approximating
kNN in
D dimensions have a cost of
$O({\epsilon}^{-D}N\mathrm{log}N)$, where
$\epsilon $ is the required accuracy [
23]. Other approaches for entropy estimation include variations and improvements of
kNN (e.g., [
4,
19,
22]), Voronoibased partitions [
24] (which are also prohibitively expensive at very high dimensions), Parzen windows [
1], and ensemble estimators [
25].
Here, we follow the approach of Stowell and Plumbley [
13], partitioning space using trees. However, we add an important modification that significantly enhances the accuracy of the method. The main idea is to decompose the density
$p\left(x\right)$ into a product of marginal (1D) densities and a copula. The copula is computed over the compact support of the one-dimensional cumulative distributions. As such, the multidimensional DE estimate becomes a combination of one-dimensional estimates and a multidimensional estimate on a compact support, even if the support of the original distribution is not compact. We term the proposed method the copula decomposition entropy estimate (CADEE).
Following Sklar’s theorem [
26,
27], any continuous multidimensional density
$p\left(x\right)$ can be written uniquely as $p\left(x\right)=c({F}_{1}({x}_{1}),\cdots ,{F}_{D}({x}_{D})){\prod}_{k=1}^{D}{p}_{k}({x}_{k})$,
where,
$x=({x}_{1},\cdots ,{x}_{D})$,
${p}_{k}(\xb7)$ denotes the marginal density of the
k’th dimension with the cumulative distribution function (CDF)
${F}_{k}\left(t\right)={\int}_{-\infty}^{t}{p}_{k}\left(x\right)dx$, and
$c({u}_{1},\cdots ,{u}_{D})$ is the density of the copula, i.e., a probability density on the hypercube
${[0,1]}^{D}$ whose marginals are all uniform on
$[0,1]$,
for all
k. Substituting Equation (
2) into Equation (
1) yields,
where
${H}_{k}$ is the entropy of the
k’th marginal, to be computed using appropriate 1D estimators, and
${H}_{c}$ is the entropy of the copula. Using Sklar’s theorem has been previously suggested as a method for calculating the mutual information between variables, which is identical to the copula entropy
${H}_{c}$ [
5,
28,
29,
30]. The new approach here is in showing that
${H}_{c}$ can be efficiently estimated recursively, similar to the
kDP approach.
Splitting the overall estimation into the marginal and copula contributions has several major advantages. First, the support of the copula is compact, which is exactly the premise for which partitioning methods are most adequate. Second, since the entropy of the copula is non-positive, adding up the marginal entropies across tree levels provides an improving approximation (from above) of the entropy. Finally, the decomposition brings forth a natural criterion for terminating the tree partitioning and for dimension reduction using pairwise independence.
The following sections are organized as follows.
Section 2 describes the outline of the CADEE algorithm. In order to demonstrate its wide applicability, several examples in which the DE can be calculated analytically are presented. In addition, our results are compared to previously suggested methods. Section 3 analyzes the convergence of the method.
Section 4 discusses implementation issues and the algorithm’s computational cost. We conclude in
Section 5.
2. CADEE Method
The main idea proposed here is to write the entropy
H as a sum of
D 1D marginal entropies, and the entropy of the copula. Analytically, the copula is obtained by the change of variables ${u}_{k}={F}_{k}({x}_{k})$, $k=1,\cdots ,D$.
Let
${x}^{i}=({x}_{1}^{i},\cdots ,{x}_{D}^{i})\in {\mathbb{R}}^{D}$,
$i=1\cdots N$ denote
N independent samples from a real
Ddimensional random variable (RV) with density
$p\left(x\right)$. We would like to use the samples
${x}^{i}$ in order to obtain samples from the copula density
$c({u}_{1},\cdots ,{u}_{D})$. From Equation (
5), this can be obtained by finding the rank (in increasing order) of samples along each dimension. In the following, this operation will be referred to as a rank transformation. This is the empirical analogue of the integral transform where one plugs the sample into the CDF. More formally, for each
$k=1\cdots D$, let
${\sigma}_{k}$ denote a permutation of
$\{1\cdots N\}$ that arranges
${x}_{k}^{1},\cdots {x}_{k}^{N}$ in increasing order, i.e.,
${x}_{k}^{{\sigma}_{k}^{i}}\le {x}_{k}^{{\sigma}_{k}^{j}}$ for
$i\le j$. Then, taking ${u}_{k}^{{\sigma}_{k}^{i}}=(i-1/2)/N$
yields
N samples
${u}^{i}=({u}_{1}^{i},\cdots ,{u}_{D}^{i})\in {[0,1]}^{D}$,
$i=1\cdots N$ from the distribution
$c({u}_{1},\cdots ,{u}_{D})$. Note that the samples are not independent. In other words, the rank is the empirical CDF, shifted by
$1/(2N)$. In particular, they correspond to
N distinct points on a uniform grid,
${u}^{i}\in {\{1/(2N),3/(2N),\cdots ,1-1/(2N)\}}^{D}$.
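The rank transformation can be illustrated with a short sketch (in Python, for concreteness; the function name is ours, and the paper's implementation is in Matlab):

```python
import numpy as np

def copula_samples(x):
    """Map an (N, D) array of samples to pseudo-copula samples in (0, 1)^D.

    Each coordinate is replaced by (rank - 1/2)/N, i.e., the empirical CDF
    shifted by 1/(2N), so every column becomes a permutation of the grid
    {1/(2N), 3/(2N), ..., 1 - 1/(2N)}.
    """
    n, d = x.shape
    u = np.empty((n, d))
    for k in range(d):
        order = np.argsort(x[:, k])      # indices that sort column k
        ranks = np.empty(n)
        ranks[order] = np.arange(n)      # rank 0 .. N-1 of each sample
        u[:, k] = (ranks + 0.5) / n
    return u
```

Note that the transformation preserves the ordering of samples along each dimension, which is all the copula retains from the marginals.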
1D entropies are estimated using either uniform binning or sample-spacing methods, depending on whether the support of the marginal is known to be compact (bins) or unbounded/unknown (spacing). The main challenge lies in evaluating the DE of high-dimensional copulas [
5,
31]. In order to overcome this difficulty, we compute it recursively, following the
kDP approach. Let
$k\in \{1,\cdots ,D\}$ be a splitting dimension, chosen in any given order. The copula samples
${u}^{i}$ are split into two equal parts (note that the median in each dimension is
$1/2$). Denote the two halves as
${v}_{j}^{i}=\{{u}_{j}^{i}\mid {u}_{k}^{i}\le 1/2\}$ and
${w}_{j}^{i}=\{{u}_{j}^{i}\mid {u}_{k}^{i}>1/2\}$. Scaling the halves as
$2{v}_{j}^{i}$ and
$2{w}_{j}^{i}-1$ produces two sample sets for two new copulas, each with
$N/2$ points. A simple calculation shows that ${H}_{c}=({H}_{2v}+{H}_{2w-1})/2$,
where
${H}_{2v}$ is the entropy estimate obtained using the set of points
$2{v}_{j}^{i}$ and
${H}_{2w-1}$ is the entropy estimate obtained using the set of points
$2{w}_{j}^{i}-1$. The marginals of each half may no longer be uniformly distributed in
$[0,1]$, which suggests continuing recursively, i.e., the entropy of each half is decomposed using Sklar’s theorem, etc. See
Figure 1 for a schematic sketch of the method.
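A single splitting step can be sketched as follows (a minimal Python illustration, ours; the rescaling is written along the split dimension only, as in the analysis of Section 3.2, which is what matters once each half is re-ranked):

```python
import numpy as np

def split_copula(u, k):
    """One recursion step: split copula samples `u` (an (N, D) array in
    [0,1]^D) along dimension k at the median 1/2, and map each half back
    onto the unit interval along that dimension (2v and 2w - 1)."""
    left = u[u[:, k] <= 0.5].copy()
    right = u[u[:, k] > 0.5].copy()
    left[:, k] = 2.0 * left[:, k]            # [0, 1/2] -> [0, 1]
    right[:, k] = 2.0 * right[:, k] - 1.0    # (1/2, 1] -> (0, 1]
    return left, right
```

The copula entropy is then estimated as the average of the entropy estimates of the two rescaled halves.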
A key question is finding a stopping condition for the recursion. In [
13], Stowell and Plumbley apply a statistical test for uniformity of
${x}_{k}$, the dimension used for splitting. This condition is meaningless for our method as copulas have uniform marginals by construction. In fact, this suggests that one reason for the relatively poor
kDP estimates at high
D is the rather simplistic stopping criterion, which requires only that one of the marginals be statistically similar to a uniform RV.
In principle, we would like to stop the recursion once the copula cannot be statistically distinguished from the uniform distribution on
${[0,1]}^{D}$. However, reliable statistical tests for uniformity at high
D are essentially equivalent to evaluating the copula entropy [
5,
18,
As a result, we relax the stopping condition to only test for pairwise dependence. The precise test is discussed below. Calculating pairwise dependencies also allows a dimension-reduction approach: if the matrix of pairwise-dependent dimensions can be split into blocks, then each block can be treated independently.
In order to demonstrate the applicability of the method described above, we study the results of our algorithm for several distributions for which the DE in Equation (
1) can be computed analytically.
Figure 2 and
Figure 3 show numerical results for
H and the running time as a function of the dimension using an implementation in Matlab. Five different distributions are studied. Three have a compact support in
${[0,1]}^{D}$ (
Figure 2):
C1: A uniform distribution;
C2: Dependent pairs. The dimensions are divided into pairs. The density in each pair is $p(x,y)=x+y$, supported on ${[0,1]}^{2}$. Different pairs are independent;
C3: Independent boxes. Uniform density in a set consisting of D small hypercubes, ${\cup}_{k=1}^{D}{[(k-1)/D,k/D]}^{D}$.
Two examples have an unbounded support (
Figure 3):
UB1: Gaussian distribution. The covariance is chosen to be a randomly rotated diagonal matrix with eigenvalues ${k}^{2}$, $k=1\cdots D$. Then, the samples are rotated to a random orthonormal basis in ${\mathbb{R}}^{D}$. The support of the distribution is ${\mathbb{R}}^{D}$;
UB2: Power-law distribution. Each dimension k is sampled independently from a density proportional to ${x}^{-2-2/k}$, $k=1\cdots D$ on $[1,\infty )$. Then, the samples are rotated to a random orthonormal basis in ${\mathbb{R}}^{D}$. The support of the distribution is a ${2}^{-D}$ fraction of ${\mathbb{R}}^{D}$ that is not aligned with the principal axes.
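To make the benchmark concrete, the following Python sketch draws one dependent pair of example C2 ($p(x,y)=x+y$ on the unit square) by inverse-CDF sampling; the inverse formulas are worked out by us for this illustration and are not part of the original benchmark code:

```python
import numpy as np

def sample_dependent_pair(n, seed=0):
    """Draw n samples from p(x, y) = x + y on [0,1]^2 (example C2).

    X has marginal CDF F_X(t) = (t^2 + t)/2, inverted in closed form;
    Y | X = x has CDF (x y + y^2/2)/(x + 1/2), also inverted in closed form.
    """
    rng = np.random.default_rng(seed)
    u, v = rng.random(n), rng.random(n)
    x = (-1.0 + np.sqrt(1.0 + 8.0 * u)) / 2.0       # F_X^{-1}(u)
    y = -x + np.sqrt(x * x + 2.0 * v * (x + 0.5))   # conditional inverse
    return x, y
```

Such exactly sampled pairs can be fed to any of the estimators under comparison.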
Results with our method are compared to three algorithms:
The
kDP algorithm [
20]. We use the C implementation available in [
32];
The
kNN algorithm based on the Kozachenko–Leonenko estimator [
14]. We use the C implementation available in [
33];
A lossless compression approach [
6,
7]. Following [
6], samples are binned into 256 equal bins in each dimension, and the data is converted into a
$N\times D$ matrix of 8bit unsigned integers. The matrix is compressed using the Lempel–Ziv–Welch (LZW) algorithm (implemented in Matlab’s imwrite function to a gif file). In order to estimate the entropy, the file size is interpolated linearly between a constant matrix (minimal entropy) and a random matrix with independent uniformly distributed values (maximal entropy), both of the same size.
Theoretically, in order to get rigorous convergence of estimators, the number of samples should grow exponentially with the dimension [
8]. Since this requirement is impractical at very high dimensions, we considered an undersampled case and only used
$N=\mathrm{10,000}{D}^{2}$ samples. Each method was tested at increasing dimensions until a running time of about 3 hours was reached (per run, on a standard PC) or the implementation ran out of memory. In such cases, no results are reported for this and following dimensions. See also
Table 1 and
Table 2 for numerical results for
$D=10$ and 20.
Note that, in principle, it may be advantageous to apply Principal Component Analysis (PCA) or a Singular Value Decomposition (SVD) of the sample covariance in order to decouple dependent directions. Such methods would be particularly advantageous for the unbounded problems. We do not apply such conventional preprocessing here, in order to make the task more challenging for the CADEE method: if the SVD converts the distribution into a product of independent 1D variables, the copula is close to 1 and the method becomes highly accurate after a single iteration.
For compact distributions, it is well known that kNN methods may fail completely. This can be seen even for the simplest examples, such as the uniform distribution (example C1). However, kNN worked well in example C3 because the density occupies a small fraction of the volume, which is favorable for kNN. kDP and compression methods are precise for the uniform distribution, which is a reference case for these methods. For examples C2 and C3, both were highly inaccurate at $D>5$. In comparison, CADEE showed very good accuracy up to D = 30–50, depending on the example.
For unbounded distributions,
kDP and compression methods did not provide meaningful results for
$D>3$. Both CADEE and
kNN provided good estimates up to
$D=20$ (
kNN was slightly better), but diverged slowly at higher dimensions (CADEE was better). Numerical tests suggest this was primarily due to the relatively small number of samples, which severely undersampled the distributions at high
D. Comparing running times, the recursive copula splitting method was significantly more efficient at high dimensions. Simulations suggest a polynomial running time (see
Section 4 for details), while
kNN was exponential in
D, becoming prohibitively inefficient at
$D>30$.
3. Convergence Analysis
In this section, we study the convergence properties of CADEE, i.e., the estimation error as N increases with fixed D. We proceed along three routes. First, we consider an example in which the first several copula splittings can be performed analytically. The example demonstrates how, ignoring statistical errors, recursive splitting of the copula and adding up the marginal entropies at the different recursion levels gets close to the exact entropy. Next, we provide a general analytical bound on the error of the method. Although the bound is not tight, it establishes that, in principle, the method provides a valid approximation of the entropy. Finally, we study the convergence of the method numerically for several low-dimensional examples, providing empirical evidence that the rate of convergence of the method (the average absolute value of the error) is $O\left({N}^{-\alpha}\right)$ for some $0<\alpha <0.5$.
3.1. Analytical Example
In order to demonstrate the main idea of why splitting the copula iteratively improves the entropy estimate, we worked out a simple example in which the splittings can be performed analytically. For the purpose of this example, sampling errors in the estimate of the 1D entropy are neglected.
Consider the dependent pairs example (C2) with
$D=2$. The two dimensional density of the sampled random variable is given by:
The exact entropy is
$H=-{\int}_{0}^{1}{\int}_{0}^{1}p\,\mathrm{ln}p\,dx\,dy=5/6-(4/3)\mathrm{ln}2\simeq -0.09086$. In order to obtain the copula, we first write the marginal densities and CDFs,
Since the CDFs are invertible (in
$[0,1]$), it can be equivalently written as,
We invert the CDF’s in Equation (
11),
${F}_{Y}^{-1}\left(t\right)={F}_{X}^{-1}\left(t\right)=(-1+\sqrt{1+8t})/2$. Then substitute into Equation (
11), hence,
Indeed, one verifies that the marginals are uniform,
Continuing the CADEE algorithm, we computed the entropy of marginals,
${H}_{X}={H}_{Y}=1/2-(9/8)\mathrm{ln}3+\mathrm{ln}2\simeq -0.04279$. This implies that the copula entropy is
$H-{H}_{X}-{H}_{Y}\simeq -0.00528$ (5.8% of
H). In order to approximate it, we split
$c(x,y)$ into two halves, for example along the
Y axis. Each density is shifted and stretched linearly to have support in
${[0,1]}^{2}$ again,
We continue recursively, computing the marginals for
${c}_{1}$ and
${c}_{2}$,
The marginal entropies are
${H}_{1X}=-0.00284$,
${H}_{1Y}=0$,
${H}_{2X}=-0.00267$, and
${H}_{2Y}=0$. Overall, summing up the marginal entropies of the two iterations, we have
${H}_{X}+{H}_{Y}+0.5({H}_{1X}+{H}_{1Y}+{H}_{2X}+{H}_{2Y})=-0.08834$ (error = 2.77%).
We similarly continue, calculating the copula of ${c}_{1}$ and ${c}_{2}$ and then the marginal distributions of their copulas. We found that the entropy after the third iteration is ${H}_{X}+{H}_{Y}+0.5({H}_{1X}+{H}_{1Y}+{H}_{2X}+{H}_{2Y}+0.5({H}_{11X}+{H}_{11Y}+{H}_{12X}+{H}_{12Y}+{H}_{21X}+{H}_{21Y}+{H}_{22X}+{H}_{22Y}))=-0.08993$ (error = 1.02%).
Indeed, we see that in the absence of statistical errors, the recursive splitting provides an improving upper bound for the entropy.
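The closed-form value used above can be checked by direct quadrature; the following Python sketch (ours) integrates $-p\,\mathrm{ln}\,p$ for $p(x,y)=x+y$ on a midpoint grid:

```python
import numpy as np

def entropy_c2_numeric(m=1000):
    """Midpoint-rule quadrature of H = -int int (x+y) ln(x+y) dx dy over
    the unit square, as a check on the closed form 5/6 - (4/3) ln 2."""
    t = (np.arange(m) + 0.5) / m           # midpoints of an m-point grid
    xx, yy = np.meshgrid(t, t)
    p = xx + yy
    # cell area is 1/m^2, so the mean over the grid equals the integral
    return -np.mean(p * np.log(p))
```

The quadrature agrees with the exact (negative) entropy to several digits.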
3.2. Analytical Bound
Here, we provide an analytical estimate of the bias and statistical error incurred by the algorithm. We derive a bound, which is not tight. Detailed analysis of the bias and error in some adequate norm is beyond the scope of the current paper.
The first part of the analysis estimates the worst-case accuracy of iteratively approximating the entropy using q repeated splittings of the copula. In the last iteration, the dimensions are assumed to be independent, i.e., the copula equals 1.
Consider the copula
$c({u}_{1},\cdots ,{u}_{D})$, which is split, e.g., along
${u}_{1}\in [0,1]$ into two halves corresponding to
${u}_{1}\in [0,1/2]$ and
${u}_{1}\in [1/2,1]$. Linearly scaling back into
$[0,1]$, we obtain two densities:
where
$(s,{u}_{2},\cdots ,{u}_{D})\in {[0,1]}^{D}$. It is easily seen that
${H}_{c}=({H}_{1c}+{H}_{2c})/2$, where
${H}_{1c}$ and
${H}_{2c}$ are the entropies of
${c}_{1}$ and
${c}_{2}$, respectively. We continue recursively, splitting the resulting copulas along some dimension. After
q iterations, we obtain an expression of the form,
where
${H}_{{i}_{1},\cdots ,{i}_{k},j}$ is the 1D entropy of the
j’th marginal and
${H}_{{i}_{1},\cdots ,{i}_{k},c}$ is the entropy of the copula, obtained after
k splittings along the dimensions
${i}_{1},\cdots ,{i}_{k}$. For simplicity, we assume that the dimensions are chosen sequentially and suppose that
$q=rD$, i.e., each dimension was split
r times.
Let
$\Delta ={2}^{-r}$ and suppose that the copula
$c\left(x\right)$ is constant on small hyperrectangles with sides $[{i}_{k}\Delta ,({i}_{k}+1)\Delta ]$,
where
${i}_{k}\in \{0,\cdots ,{2}^{r}-1\}$. This implies that within these rectangles all dimensions are independent. Then,
${H}_{{i}_{1},\cdots ,{i}_{D},c}=0$ and the last sum in Equation (
17) vanishes.
Next, we approximate
$c\left(x\right)$ in each small rectangle using Taylor. Without loss of generality, we focus on the case
${i}_{1}=\cdots ={i}_{D}=0$. To first order,
$c\left(x\right)=A+{B}_{1}{x}_{1}+\cdots +{B}_{D}{x}_{D}$, with
$A,{B}_{1},\cdots ,{B}_{D}$ are
$O\left(1\right)$. Scaling to
${[0,1]}^{D}$,
${c}_{\Delta}\left(x\right)={Z}^{-1}(A+{B}_{1}{F}_{1}^{-1}(\Delta ){x}_{1}+\cdots +{B}_{D}{F}_{D}^{-1}(\Delta ){x}_{D})$, where
Z is a normalization constant. Assuming that
${F}_{k}$ are continuously differentiable and strictly increasing,
${F}_{k}^{-1}$ are also continuously differentiable and
${F}_{k}^{-1}(\Delta )=O(\Delta )$. Then, since the total mass in each rectangle is exactly
${\Delta}^{D}$, we have that
$A/Z=1+O(\Delta )$. Finally, the entropy of the normalized density
${c}_{\Delta}\left(x\right)$ can be estimated. Expanding the log to order 1 in
$\Delta $,
From this, one needs to subtract
$D\mathrm{ln}\Delta $ to compensate for the scaling. Therefore, for any continuously differentiable, strictly positive (in its support) density,
${H}_{{i}_{1},\cdots ,{i}_{k},c}=O(\Delta )$. We conclude that the entire last sum in Equation (
17) sums to order
$\Delta $. The prefactor is typically proportional to
D.
Next, we consider statistical errors. Using the Kolmogorov–Smirnov statistics, the distance between the empirical CDF and the exact one is of order
${N}^{1/2}$. Suppose 1D entropy estimates use a method with accuracy (absolute error) of order
${N}^{-\alpha}$,
$\alpha \le 1/2$. Then, in the worst case, if all errors are additive, then each estimate in the
k’th iterate has an error (in absolute value) of order
${(N/{2}^{k})}^{-\alpha}$. Overall, we have,
For fixed
q, the statistical error decreases like
${N}^{-\alpha}$. Typically, for an unbiased 1D estimator in which the variance of the estimator is of order
${N}^{-2\alpha}$, the variance of the overall estimation using CADEE is,
However, the prefactor depends linearly on the dimension D and exponentially on the number of iterations q. Recall that the bias decreases exponentially with $q/D$. Hence, the two sources of errors should be balanced in order to obtain a convergent approximation.
3.3. Numerical Examples
In order to demonstrate the convergence of the method, we test the error of the estimate obtained using CADEE for small
D examples.
Figure 4 shows numerical results with four types of distributions (dependent pairs, independent boxes, Gaussian, and powerlaw) with
$D=2$ and
$D=5$ and
${10}^{3}$–
${10}^{8}$ samples. As discussed above, larger dimensions require significantly more samples in order to guarantee that the entire support is sampled at appropriate frequencies. We see that for all examples, the method indeed converged. For unbounded distributions, the rate decreased with dimension.
4. Implementation Details
The following is a pseudocode implementation of the algorithm described above (Algorithm 1). Several aspects of the code, such as the choice of constants, the stopping criterion, and the estimation of pairwise independence, are heuristic; they were found to improve the accuracy and efficiency of our method. See
Appendix A for details. Recall that for every
i,
$({x}_{1}^{i},\cdots ,{x}_{D}^{i})\in {\mathbb{R}}^{D}$ is an independent sample.
Algorithm 1 Recursive entropy estimator

function copulaH($\left\{{x}_{k}^{i}\right\}$, D, N, $level=0$)
  $H\leftarrow 0$
  for $k=1$ to D do
    ${u}_{k}\leftarrow $ rank(${x}_{k}$)/N ▹ calculate rank (by sorting)
    $H\leftarrow H+$ H1D($\left\{{x}_{k}^{i}\right\}$, N, $level$) ▹ entropy of marginal k
  end for
  if $D=1$ or $N\le $ min #samples then
    return H
  end if
  ▹ A is the matrix of pairwise independence
  ${A}_{kl}\leftarrow $ true if marginals k and l are statistically independent
  ${n}_{\mathrm{blocks}}\leftarrow $ number of blocks in A
  if ${n}_{\mathrm{blocks}}>1$ then ▹ split dimensions
    for $j=1$ to ${n}_{\mathrm{blocks}}$ do
      $v\leftarrow $ dimensions in block j
      $H\leftarrow H+$ copulaH(${\left\{{u}_{k}^{i}\right\}}_{k\in v}$, dim(v), N, $level$)
    end for
    return H
  else ▹ no independent blocks
    $k\leftarrow $ choose a dimension for splitting
    $L=\{i\mid {u}_{k}^{i}\le 1/2\}$
    $\left\{{v}_{j}^{i}\right\}=\{2{u}_{j}^{i}\mid i\in L,j=1\cdots D\}$
    $H\leftarrow H+$ copulaH($\left\{{v}_{j}^{i}\right\}$, D, $N/2$, $level+1$)/2
    $R=\{i\mid {u}_{k}^{i}>1/2\}$
    $\left\{{w}_{j}^{i}\right\}=\{2{u}_{j}^{i}-1\mid i\in R,j=1\cdots D\}$
    $H\leftarrow H+$ copulaH($\left\{{w}_{j}^{i}\right\}$, D, $N/2$, $level+1$)/2
  end if
  return H
end function

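A compact, simplified sketch of the estimator in Python may clarify the control flow. It follows the structure of Algorithm 1, but omits the pairwise-independence block detection and uses our own constants and helper names; it is an illustration, not the paper's Matlab implementation:

```python
import numpy as np

def _rank01(x):
    """(rank - 1/2)/N in every column: the empirical copula transform."""
    n, d = x.shape
    u = np.empty((n, d))
    for k in range(d):
        order = np.argsort(x[:, k], kind="mergesort")
        r = np.empty(n)
        r[order] = np.arange(n)
        u[:, k] = (r + 0.5) / n
    return u

def _h1d_spacing(x, m=None):
    """m_N-spacing estimate for a 1D sample with unknown support."""
    x = np.sort(x)
    n = len(x)
    m = m or max(1, int(round(n ** (1.0 / 3.0))))
    gaps = x[m:] - x[:-m]
    gaps = gaps[gaps > 0]
    return float(np.mean(np.log(n / m * gaps))) if len(gaps) else 0.0

def _h1d_bins(u):
    """Histogram (plug-in) estimate for a 1D sample supported on [0, 1]."""
    n = len(u)
    b = max(2, int(round(n ** 0.4)))
    counts, _ = np.histogram(u, bins=b, range=(0.0, 1.0))
    p = counts[counts > 0] / n
    return float(-np.sum(p * np.log(p)) - np.log(b))

def copula_h(x, min_samples=200, level=0):
    """Simplified CADEE-style estimate of the differential entropy."""
    n, d = x.shape
    # marginal entropies: spacings on raw data, bins on [0,1] afterwards
    if level == 0:
        h = sum(_h1d_spacing(x[:, k]) for k in range(d))
    else:
        h = sum(_h1d_bins(x[:, k]) for k in range(d))
    if d == 1 or n < min_samples:
        return h
    u = _rank01(x)                       # copula (pseudo-)samples
    # split along the dimension most correlated with the others
    c = np.corrcoef(u, rowvar=False)
    k = int(np.argmax(np.sum(c * c, axis=0)))
    left, right = u[u[:, k] <= 0.5], u[u[:, k] > 0.5]
    left[:, k] *= 2.0                    # rescale each half to [0, 1]
    right[:, k] = 2.0 * right[:, k] - 1.0
    return h + 0.5 * (copula_h(left, min_samples, level + 1)
                      + copula_h(right, min_samples, level + 1))
```

The stopping rule here (a minimum sample count) replaces the statistical tests of the full algorithm, which is the component that most affects accuracy at high D.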
Several steps in the above algorithm should be addressed.
The rank of an array x is the order in which its values appear. Since the support of all marginals in the copula is $[0,1]$, we take $\mathrm{rank}\left(x\right)=\{1/2,3/2,\cdots ,N-1/2\}$ (divided by N in the algorithm). For example, $\mathrm{rank}([2,3,0])=\{3/2,5/2,1/2\}$. This implies that the minimal and maximal samples are not mapped into $\{0,1\}$, which would artificially change the support of the distribution. The rank transformation is easily done using sorting;
1D entropy: One-dimensional entropy of compact distributions (whose support is
$[0,1]$) is estimated using a histogram with uniformly spaced bins. The number of bins can be taken to depend on
N, and order
${N}^{1/3}$ is typically used (we used
${N}^{1/3}$ or
${N}^{0.4}$ for spacing- or bin-based methods, respectively). For additional considerations and methods for choosing the number of bins, see [
34]. At the first iteration, the distribution may not be compact, and the entropy is estimated using
${m}_{N}$-spacings (see [
8], Equation (
16));
Finding blocks in the adjacency matrix
A: Let
A be a matrix whose entries are 0 and 1, where
${A}_{kl}=1$ implies that
${u}^{k}$ and
${u}^{l}$ are independent. By construction,
A is symmetric. Let
D denote the diagonal matrix whose diagonal elements are the sums of rows of
A. Then,
$L=A-D$ is the Laplacian associated with the graph described by
A. In particular, the sum of all rows of
L is zero. We seek a rational basis for the kernel of a matrix
L: Let ker(
L) denote the kernel of a matrix
L. By a rational basis we mean an orthogonal basis (for ker(
L)), in which all the coordinates are either 0 or 1 and the number of 1’s is minimal. In each vector in the basis, components with 1’s form a cluster (or block), which is pairwise independent of all other marginals. In Matlab, this can be obtained using the command null(
L,’r’). For example, consider the adjacency matrix:
whose graph Laplacian is:
A rational basis for the kernel of
L (which is 2D) is:
which corresponds to two blocks: Components 1+3 and component 2.
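Equivalently, the blocks are the connected components of the dependence graph (the complement of the independence pattern in A). A Python sketch using union-find, rather than the rational-basis computation above, is (function name ours):

```python
def dependence_blocks(A):
    """Group dimensions into blocks that are pairwise independent of
    each other. A[k][l] is True when marginals k and l test as
    independent; blocks are the connected components of the dependence
    graph (edges where A is False), found here by union-find."""
    d = len(A)
    parent = list(range(d))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for k in range(d):
        for l in range(k + 1, d):
            if not A[k][l]:                 # k and l are dependent
                parent[find(k)] = find(l)
    blocks = {}
    for k in range(d):
        blocks.setdefault(find(k), []).append(k)
    return sorted(blocks.values())
```

This yields the same partition as the rational basis for ker(L) while avoiding any linear algebra.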
Pairwise independence is determined as follows:
Calculate the Spearman correlation matrix of the samples $\left\{{x}_{k}\right\}$, denoted R. Note that this is the same as the Pearson correlation matrix of the ranked data $\left\{{u}_{k}\right\}$;
Assuming normality and independence (which does not hold), the distribution of elements in
R is asymptotically given by the t-distribution with
$N-2$ degrees of freedom. Denoting the CDF of the t-distribution with
n degrees of freedom by
${T}_{n}\left(z\right)$, two marginals
$(k,l)$ are considered uncorrelated if
$|{R}_{kl}|<{T}_{N-2}^{-1}(1-\alpha /2)$, where
$\alpha $ is the acceptance threshold. We take the standard
$\alpha =0.05$. Note that because we do
$D(D-1)/2$ tests, the probability of observing independent vectors by chance grows with
D. This can be corrected by looking at the statistics of the maximal value for
R (in absolute value), which tends to a Gumbel distribution [
35]. This approach (using Gumbel) is not used because below we also consider independence between blocks;
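A sketch of this screening step in Python (the helper name is ours, and we use the large-N normal approximation to the t-based critical value instead of inverting the t CDF):

```python
import numpy as np
from statistics import NormalDist

def uncorrelated_pairs(u, alpha=0.05):
    """Flag pairs of marginals whose Spearman correlation is statistically
    insignificant. `u` is the (N, D) array of ranked samples, so the
    Pearson correlation of `u` equals the Spearman correlation of the
    data. For large N the critical value is ~ z_{1-alpha/2}/sqrt(N-2)."""
    n, _ = u.shape
    r = np.corrcoef(u, rowvar=False)
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    out = np.abs(r) < z / np.sqrt(n - 2)
    np.fill_diagonal(out, False)
    return out
```

The resulting boolean matrix plays the role of A in the block-finding step.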
Pairwise independence using mutual information: Two 1D RVs
X and
Y are independent if and only if their mutual information vanishes,
$I(X,Y)=H\left(X\right)+H\left(Y\right)-H(X,Y)=0$ [
10]. In our case, the marginals are
$U(0,1)$ and
$H\left(X\right)=H\left(Y\right)=0$, hence
$I(X,Y)=-H(X,Y)$. This suggests a statistical test for the hypothesis that
X and
Y are independent as follows. Suppose
X and
Y are independent. Draw
N independent samples and plot the density of the 2D entropy
$H(X,Y)$. For a given acceptance threshold
$\alpha $, find the cutoff value
${H}_{2,c}$ such that
$P(H(X,Y)<{H}_{2,c})=1-\alpha $.
Figure A1 shows the distribution for different values of
N. With
$\alpha =0.05$, the cutoff can be approximated by
${H}_{2,c}=-0.75{N}^{-0.62}$. Accordingly, any pair of marginals found to be statistically uncorrelated is also tested for independence using their mutual information (see below);
2D entropy: Two-dimensional entropy (which, in our case, always has compact support ${[0,1]}^{2}$) is estimated using a 2D histogram with uniformly spaced bins in each dimension.
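A minimal Python sketch of this plug-in 2D estimator (the bin-count choice is ours):

```python
import numpy as np

def h2d(u, v, n_bins=None):
    """Plug-in entropy of a pair of copula marginals on [0,1]^2 from a
    uniform 2D histogram (~N^{0.2} bins per axis, i.e. ~N^{0.4} cells)."""
    n = len(u)
    b = n_bins or max(2, int(round(n ** 0.2)))
    counts, _, _ = np.histogram2d(u, v, bins=b,
                                  range=[[0.0, 1.0], [0.0, 1.0]])
    p = counts[counts > 0] / n
    # cell density is p * b^2, hence the -2 ln b correction
    return float(-np.sum(p * np.log(p)) - 2.0 * np.log(b))
```

For independent uniform marginals the estimate is close to zero (the copula entropy of independence), and it becomes increasingly negative with dependence.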
As a final note, we address the choice of which dimension should be used for splitting in the recursion step. We suggest splitting the dimension which shows the strongest correlations with other marginals. To this end, we square the elements in the correlation matrix R and sum the rows. We pick the column with the largest sum (or the first of them if several are equal).
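This choice of splitting dimension can be sketched as (Python, function name ours):

```python
import numpy as np

def choose_split_dim(u):
    """Pick the splitting dimension as described above: square the
    entries of the correlation matrix of the (ranked) samples, sum each
    column, and take the first column with the largest sum."""
    r = np.corrcoef(u, rowvar=False)
    return int(np.argmax(np.sum(r * r, axis=0)))
```

Splitting the most strongly coupled dimension tends to remove the largest share of the remaining dependence per recursion step.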
Lastly, we consider the computational cost of the algorithm, which has four components whose efficiency requires consideration:
Sorting of 1D samples: In the first level, samples may be unbounded and sorting can cost $O(N\mathrm{log}N)$. However, for the next levels, the samples are approximately uniformly distributed in $[0,1]$ and bucket sort works with an average cost of $O\left(N\right)$. This is multiplied by the number of levels, which is $O(\mathrm{log}N)$. As all D dimensions need to be sorted, the overall cost of sorting is $O(DN\mathrm{log}N)$;
Calculating 1D entropies: Since the data is already sorted, calculating the entropy using either binning or spacing has a cost of $O\left(N\right)$ per dimension, per level. Overall, $O(DN\mathrm{log}N)$;
Pairwise correlations: $D(D-1)/2$ presorted pairs, each costing $O\left(N\right)$ per level. Overall, $O({D}^{2}N\mathrm{log}N)$;
Pairwise entropy: The worst case is that all pairs are uncorrelated but dependent, which implies that all pairwise mutual informations need to be calculated at all levels. However, presorting again reduces the cost of calculating histograms to $O\left(N\right)$ per level. With $O(\mathrm{log}N)$ levels, the cost is $O({D}^{2}N\mathrm{log}N)$.
Overall, the cost of the algorithm is $O({D}^{2}N\mathrm{log}N)$. The bottleneck is due to the stopping criterion for the recursion. A simpler test may reduce the cost by a factor of D. However, in addition to the added accuracy, checking for pairwise independence allows, for some distributions, splitting the samples into several lower-dimensional estimates, which is both more efficient and more accurate.
5. Summary
We presented a new algorithm for estimating the differential entropy of high-dimensional distributions using independent samples. The method applies the idea of decoupling the entropy into a sum of 1D contributions, corresponding to the entropies of the marginals, and the entropy of the copula, which describes the dependence between the variables. Marginal densities are estimated using known methods for scalar distributions. The entropy of the copula is estimated recursively, similar to the kD partitioning tree method. Our numerical examples demonstrated the applicability of the method up to a dimension of 50, showing improved accuracy and efficiency compared to previously suggested schemes. The main disadvantage of the algorithm is the assumption that pairwise independent components of the data are truly independent. This approximation may clearly fail for particularly chosen setups. Rigorous proofs of consistency and analysis of convergence rates are beyond the scope of the present manuscript.
Our tests demonstrated that compression-based methods did not provide accurate estimates of the entropy, at least for the synthetic examples tested. Nonetheless, it was surprising that some quantitative estimate of entropy could be obtained using such a simple-to-implement method. Moreover, this approach can be easily applied to very high-dimensional distributions, e.g., dimension 100 or higher (a 50-particle system in 2D), for which all the direct estimation methods (kDP, kNN, and CADEE) become prohibitively expensive. Under some ergodic or mixing properties, independent sampling could also be replaced by larger ensembles.
To conclude, our numerical experiments suggest that
kNN methods were favorable for unbounded distributions up to about dimension 20. At higher dimensions,
kNN may become inaccurate, in particular for distributions with compact support (e.g., examples C1 and C2 in
Figure 2). In addition, we found that
kNN methods become inefficient at dimensions higher than 30 (e.g., examples UB1 and UB2 in
Figure 3). For distributions with compact support, or when the support is mixed or unknown, the proposed CADEE method was significantly more robust. Our simple numerical examples suggest that the CADEE method may provide reliable estimates at relatively high dimensions (up to 100), even under severe undersampling and at a reasonable computational cost. Here, we focused on the presentation of the algorithm and demonstrated its advantages for relatively simple, analytically tractable examples. Applications to more realistic problems, for example estimating the entropy of physical systems out of equilibrium, will be presented in a future publication. We suggest using the recursive copula splitting scheme for other applications requiring estimation of copulas and evaluation of mutual dependencies between RVs, for example, in financial applications and neural signal processing algorithms.
A Matlab code is available in Matlab’s File Exchange.