An Estimator of Mutual Information and its Application to Independence Testing

Abstract: This paper proposes a novel estimator of mutual information for discrete and continuous variables. The main feature of this estimator is that it is zero for a large sample size n if and only if the two variables are independent. The estimator constructs several histograms, computes an estimation of mutual information for each, and chooses the maximum value. We prove that the number of histograms constructed has an upper bound of O(log n) and apply this fact to the search. We compare the performance of the proposed estimator with an estimator of the Hilbert-Schmidt independence criterion (HSIC), although the proposed method is based on the minimum description length (MDL) principle and the HSIC provides a statistical test. The proposed method completes the estimation in O(n log n) time, whereas the HSIC kernel computation requires O(n³) time. We also present examples in which the HSIC fails to detect the dependency but the proposed method successfully detects it.


Introduction
Shannon's information theory [1] has contributed to the development of communication and storage systems in which sequences can be compressed up to the entropy of the source, assuming that the sender and receiver know the probability of each sequence. In the 30 years since its birth, information theory has developed such that sequences can be compressed without sharing the associated probability (universal coding): the probability of each future sequence can be learned from the past sequence such that the compression ratio of the total sequence converges to its entropy.
Mutual information is a quantity that can be used to analyze the performances of encoding and decoding in information theory. Its value expresses the dependency of two random variables: it is nonnegative, and it is zero if and only if the variables are independent. Mutual information can be estimated from actual sequences. In this paper, we construct an estimator of the mutual information based on the minimum description length (MDL) principle [2] such that, for long sequences, the estimator is zero if and only if the two variables are independent.
In any science, a law is determined based on experiments: the law should be simple and should explain the experiments. Suppose that we generate pairs of a rule and its exceptions for the experiments and describe the pairs using universal coding. Then, the MDL principle chooses, as the scientific law, the rule of the pair that has the shortest description length (number of bits): the simpler the rule is, the more exceptions there are. In our situation, two variables may be either independent or dependent, and we compute the values of the corresponding description lengths to choose the alternative with the shorter length. We estimate mutual information from the difference between the description length values obtained by assuming that the two variables are independent and dependent, respectively, divided by the original sequence length n.
Let X and Y be discrete random variables. Suppose that we have examples (X = x_1, Y = y_1), ..., (X = x_n, Y = y_n) and that we wish to know whether X and Y are independent, denoted as X ⊥⊥ Y, not knowing the distributions P_X, P_Y, and P_XY of X, Y, and (X, Y), respectively.
One way of approaching this problem would be to estimate the correlation coefficient ρ(X, Y) of X and Y to determine whether it is close to zero. Although the notions of independence and correlation are close, ρ(X, Y) = 0 alone does not mean that X and Y are independent. For example, let X and U be mutually independent variables, where X follows a standard Gaussian distribution and U takes the values {−1, 1} with probability 0.5 each, and let Y = XU. Apparently, X and Y are not independent, but ρ(X, Y) = 0 because E[XY] = E[X²]E[U] = 0. For this problem, we know that the mutual information defined by

I(X, Y) := ∑_x ∑_y P_XY(x, y) log [ P_XY(x, y) / (P_X(x) P_Y(y)) ]

is nonnegative and is zero if and only if X ⊥⊥ Y. Thus, it is sufficient to estimate I(X, Y) to determine whether it is positive. Given x^n = (x_1, ..., x_n) and y^n = (y_1, ..., y_n), one might estimate I(X, Y) by plugging the frequencies c_X(x), c_Y(y), and c_XY(x, y) of X = x, Y = y, and (X, Y) = (x, y) divided by n into P_X, P_Y, and P_XY, respectively, to obtain the quantity

I_n := ∑_x ∑_y (c_XY(x, y)/n) log [ (c_XY(x, y)/n) / ((c_X(x)/n)(c_Y(y)/n)) ] .   (1)

However, we observe that I_n > 0 even when X ⊥⊥ Y for large values of n. In fact, since Equation (1) is the Kullback-Leibler divergence between the empirical joint distribution {c_XY(x, y)/n} and the product of the empirical marginals {(c_X(x)/n)(c_Y(y)/n)}, we have I_n ≥ 0, and I_n = 0 requires c_XY(x, y)/n = (c_X(x)/n)(c_Y(y)/n) for all x, y, which fails to hold for infinitely many n with a positive probability, even when X ⊥⊥ Y. Thus, we need to guess X ⊥⊥ Y when I_n is small, say when I_n < δ(n) for some appropriate function of n. Nobody was certain that such a function δ(n) of the sample size n exists.
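To make Equation (1) concrete, the following is a minimal sketch in R of the plug-in estimator I_n computed from the joint and marginal frequencies (the function name is ours, not from the paper):

```r
# Plug-in estimator I_n of Equation (1): empirical mutual information
# computed from joint and marginal frequencies (base-2 logarithm).
# x and y are vectors of discrete observations of the same length n.
plugin_mi <- function(x, y) {
  n <- length(x)
  joint <- table(x, y) / n          # empirical joint probabilities c_XY / n
  px <- rowSums(joint)              # empirical marginal of X
  py <- colSums(joint)              # empirical marginal of Y
  prod_marg <- outer(px, py)        # product of the empirical marginals
  nz <- joint > 0                   # avoid 0 * log 0 terms
  sum(joint[nz] * log2(joint[nz] / prod_marg[nz]))
}
```

As noted above, plugin_mi() is strictly positive for finite n even when the two sequences are generated independently, which is why a threshold such as δ(n) is needed.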
In 1993, Suzuki [3] identified such a function δ and proposed a new mutual information estimator for large n based on the minimum description length (MDL) principle. The exact form of the function δ is presented in Section 2. In this paper, we consider an extension of the estimator J_n of mutual information I(X, Y) to the case where X and Y may be continuous.
There are many ways of estimating mutual information for continuous variables. If we assume that X and Y are Gaussian, then the mutual information is expressed by

I(X, Y) = −(1/2) log(1 − ρ(X, Y)²) ,

and we can show that the estimator obtained by plugging the sample correlation coefficient into this formula converges to I(X, Y) for large n, where that estimator is the maximum likelihood estimator of I(X, Y). However, the equivalence only holds for Gaussian variables.
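The Gaussian method referred to later (cf. Equation (4)) can be sketched in a few lines of R; this is a sketch under the assumption that (X, Y) is bivariate Gaussian, and the function name is ours:

```r
# Gaussian estimator of mutual information: I(X, Y) = -0.5 * log2(1 - rho^2),
# with the sample correlation coefficient plugged in for rho.
gaussian_mi <- function(x, y) {
  rho <- cor(x, y)
  -0.5 * log2(1 - rho^2)
}
```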
For general settings, several mutual information estimators are available, such as kernel density-based estimators [4], k-nearest neighbor estimators [5,6], and other estimators based on quantizers [7]. In general, the kernel-based method requires an extraordinarily large computational effort to test for independence. To overcome this problem, efficient estimators have been proposed, such as one that completes the test in O(n log n) time [5]. However, correctness, such as consistency, is required and has a higher priority than efficiency. Although some of these methods converge to the correct value I(X, Y) for large n in O(n log n) time [7], the estimated values are positive with nonzero probability for large n when X and Y are independent (I(X, Y) = 0).
Currently, the construction of nonlinear alternatives to cov(X, Y) for testing independence between X and Y using positive definite kernels is becoming popular. In particular, a quantity known as the Hilbert-Schmidt independence criterion (HSIC) [8], which is defined in Section 2, is extensively used for independence testing. The HSIC value HSIC(X, Y, k, l) depends on the kernels k and l defined on the ranges of X and Y, and it is known that HSIC(X, Y, k, l) = 0 if and only if X ⊥⊥ Y, provided that the kernel pair (k, l) is chosen properly. In this paper, we assume that we always use such a kernel pair and denote HSIC(X, Y, k, l) simply by H(X, Y). For the estimation of H(X, Y) given x^n and y^n, the most popular estimator H_n of H(X, Y), which is defined in Section 2, always takes positive values, and given a significance level 0 < α < 1 (typically, α = 0.05), we need to obtain a threshold ε(α) such that the decision is as accurate as possible.
In this paper, we propose a new estimator J_n of mutual information. This new estimator quantizes the two-dimensional Euclidean space R² of X and Y into 2^u × 2^u bins for u = 1, 2, .... For each value of u, which indicates a histogram, we obtain the estimation J_n^(u) of mutual information for discrete variables. The maximum value of J_n^(u) over u = 1, 2, ... is the final estimation. We prove that the optimal value of u is at most O(log n). In particular, the proposed method divides R² without distinguishing between discrete and continuous data, and it satisfies Equation (3).
Then, we experimentally compare the proposed estimator J_n of I(X, Y) with the estimator H_n of the HSIC H(X, Y) in terms of independence testing. Although we obtained several insights, we could not confirm that either estimator outperforms the other. However, we found that the HSIC only considers the magnitude of the data and can fail to detect relations among the data that cannot be identified by simply observing changes in magnitude. We present two examples for which the HSIC fails to detect the dependencies between x^n and y^n due to this limitation. The proposed estimation procedure completes the computation in O(n log n) time, whereas the HSIC requires O(n³) time. In this sense, the proposed method based on mutual information would be useful in many situations.
The remainder of this paper is organized as follows. Section 2 provides the background for the work presented in this paper, and Sections 2.1 and 2.2 explain the mutual information and HSIC estimations, respectively. Section 3 presents the contributions of this paper: Section 3.1 proposes the new algorithm for estimating mutual information, Section 3.2 mathematically proves the merits of the proposed method, and Section 3.3 presents the results of the preliminary experiments. Section 4 presents the results of the experiments using the R language to compare the performance of the proposed estimator of mutual information and its HSIC counterpart in terms of independence testing. Section 5 summarizes the contributions and discusses opportunities for future work.
Throughout the paper, the base two logarithm is assumed unless specified otherwise.

Background
This section describes the basic properties of the estimators of mutual information I(X, Y) for discrete variables and of the HSIC H(X, Y).

Mutual Information for Discrete Variables
In 1993, Suzuki [3] proposed an estimator of mutual information based on the minimum description length (MDL) principle [2]. Given examples, the MDL principle chooses the rule that minimizes the total description length when the examples are described in terms of a rule and its exceptions. In this case, there are two candidate rules: X and Y are either independent or not. When they are independent, for each of X and Y, we first describe the probability values and then, using them, describe the examples. The total length will be

n H_n(X) + ((α − 1)/2) log n   (6)

plus

n H_n(Y) + ((β − 1)/2) log n   (7)

up to constant values, where H_n(X) and H_n(Y) are the empirical entropies of x^n and y^n, and α and β are the cardinalities of the ranges of X and Y, respectively. When they are not independent, we describe the examples in length

n H_n(X, Y) + ((αβ − 1)/2) log n   (8)

up to constant values, where H_n(X, Y) is the empirical entropy of the pair sequence. Hence, the difference Equation (6) + Equation (7) − Equation (8) divided by n is

I_n − ((α − 1)(β − 1)/(2n)) log n ,   (9)

where I_n = H_n(X) + H_n(Y) − H_n(X, Y) is the empirical mutual information of Equation (1). According to the MDL principle, we judge that X and Y are dependent if and only if Equation (9) is positive. It is known that

I_n converges to I(X, Y) with probability one as n grows   (10)

[9], which means that δ(n) := ((α − 1)(β − 1)/(2n)) log n serves as the function δ of Section 1. Although the estimator of mutual information is defined by Equation (9) in the original paper by Suzuki [3], in this paper, we define

J_n := max{ I_n − ((α − 1)(β − 1)/(2n)) log n, 0 }

instead, such that Equation (10) holds with J_n in place of I_n and J_n is never negative.
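The following is a minimal R sketch of this Suzuki-style estimator for discrete sequences, assuming the plugin_mi() function from the earlier sketch (the function name and the truncation at zero follow our reading of Equation (9)):

```r
# MDL-based estimator for discrete X and Y (cf. Equation (9)):
# empirical mutual information minus the penalty
# (alpha - 1)(beta - 1) log2(n) / (2n), truncated at zero.
mdl_mi_discrete <- function(x, y) {
  n <- length(x)
  alpha <- length(unique(x))        # cardinality of the range of X
  beta  <- length(unique(y))        # cardinality of the range of Y
  penalty <- (alpha - 1) * (beta - 1) * log2(n) / (2 * n)
  max(plugin_mi(x, y) - penalty, 0)
}
```

A positive value of mdl_mi_discrete() corresponds to the MDL decision that X and Y are dependent.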

Maximizing the Posterior Probability
Note that this paper seeks to determine whether X ⊥⊥ Y or not rather than the mutual information value itself.
We claim that the decision based on Equation (9) asymptotically maximizes the posterior probability of X ⊥⊥ Y given x^n and y^n. Let

Q^n(X) := ∫ ∏_x θ_x^{n_x} w(θ|a) dθ ,

where n_x is the number of occurrences of X = x in x^n and w(θ|a) ∝ ∏_x θ_x^{a_x − 1} is the prior probability of the probability vector θ = (θ_x) of X, with hyper-parameters a = (a_x), a_x > 0. Suppose that we similarly construct Q^n(Y) and Q^n(X, Y) with a_y > 0 and a_xy > 0. It is known that if we choose a_x = 0.5, a_y = 0.5, and a_xy = 0.5, then the differences between − log Q^n(X), − log Q^n(Y), and − log Q^n(X, Y) and Equations (6), (7), and (8), respectively, are bounded by constants [10]. Hence, for large n, comparing the description lengths is equivalent to comparing p Q^n(X) Q^n(Y) and (1 − p) Q^n(X, Y), where the prior probability p of X ⊥⊥ Y is a constant and is negligible for large n.
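For a Dirichlet prior, Q^n(X) has a closed form (the Dirichlet-multinomial marginal likelihood), so − log Q^n(X) can be computed directly; the following R sketch assumes a symmetric hyper-parameter a for all symbols (a = 0.5 gives the Krichevsky-Trofimov assignment mentioned above), and the function name is ours:

```r
# Negative base-2 logarithm of Q^n(X) = integral of prod_x theta_x^{n_x}
# under the Dirichlet prior w(theta | a) with a symmetric hyper-parameter a.
# counts is the vector of occurrences n_x over the range of X.
neg_log_Qn <- function(counts, a = 0.5) {
  k <- length(counts)
  n <- sum(counts)
  loglik <- lgamma(k * a) - k * lgamma(a) +
    sum(lgamma(counts + a)) - lgamma(n + k * a)   # natural log of Q^n(X)
  -loglik / log(2)                                # convert to bits
}
```

By the bound cited above, neg_log_Qn() differs from the corresponding two-part code length in Equation (6) only by a constant.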
On the other hand, Nemenman, Shafee, and Bialek [11] proposed a Bayesian estimator of the entropy H(X), namely its expectation H_n(x^n) w.r.t. a prior over the hyper-parameter a = (a_x), where θ_x is the probability of the event X = x. If we similarly construct Bayesian estimators H_n(y^n) and H_n(x^n, y^n) of the entropies H(Y) and H(X, Y), respectively, then we also obtain a Bayesian estimator I_n^NSB of the mutual information I(X, Y) [12]. M. Hutter [13] proposed another estimator, I_n^H(x^n, y^n), as the expectation of the mutual information w.r.t. a prior over the hyper-parameters, where θ_y and θ_xy are the probabilities of the events Y = y and (X = x, Y = y), respectively.
The main drawback of the estimators I_n^NSB and I_n^H is that both fail for large n. For example, we have I_n(x^n, y^n, a) > 0 unless the posterior w(θ|x^n, y^n, a) concentrates on the case θ_xy = θ_x θ_y for all x, y, which occurs with probability zero even when X ⊥⊥ Y. Note that these estimators seek the mutual information value itself rather than whether X ⊥⊥ Y or not.

HSIC
The HSIC is formally defined by using positive definite kernels k : X² → R and l : Y² → R, where X and Y denote the ranges of X and Y, respectively, and X ⊥⊥ Y means P_XY = P_X × P_Y:

H(X, Y) = E[k(X, X') l(Y, Y')] + E[k(X, X')] E[l(Y, Y')] − 2 E_XY[ E_X'[k(X, X')] E_Y'[l(Y, Y')] ] ,   (12)

where (X', Y') denotes an independent copy of (X, Y). The most common estimator of H(X, Y), given x^n and y^n, would be the biased estimator

H_n := (1/n²) tr(KHLH) ,   (13)

where K and L are the n × n Gram matrices with K_ij = k(x_i, x_j) and L_ij = l(y_i, y_j), and H = I − (1/n) 1 1^T is the centering matrix. We prepare 0 < α < 1 (for example, α = 0.05). Then, there exists a threshold ε(α) such that, if the null hypothesis is true, the value of H_n is less than ε(α) with probability 1 − α. The decision is based on whether H_n exceeds ε(α): we conclude X ⊥⊥ Y if and only if H_n ≤ ε(α). Applying the HSIC to independence testing is widely accepted in the machine learning community; in addition to the equivalence of Equation (5) with independence, (weak) consistency has been shown in the sense that the difference between Equations (12) and (13) is at most O(1/√n) in probability [8]. Furthermore, the HSIC exhibits satisfactory performance in actual situations and is currently considered to be the de facto method for independence testing.
However, we still encounter serious problems when applying the HSIC. The most significant problem is that the HSIC requires O(n³) computational time, so n must be small if the test has to be completed within a predetermined time. Moreover, the calculation of the correct value of ε(α) requires many hours of simulating the null hypothesis. Given x^n and y^n, we randomly reorder y^n to obtain independent pairs of examples and compute H_n many times to obtain the (1 − α) × 100 percentile point ε(α). If α = 0.05, we execute the computation more than 200 times so that at least 10 samples fall above the (1 − α) × 100 percentile point, which ensures that the value of ε(α) is reliable.
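The following R sketch computes the biased HSIC estimator of Equation (13) with Gaussian kernels and estimates the threshold ε(α) by permutation; the median-heuristic bandwidth and the function names are our assumptions, not specifications from the paper:

```r
# Biased HSIC estimator H_n = tr(K H L H) / n^2 with Gaussian kernels,
# and a permutation estimate of the threshold epsilon(alpha).
gaussian_gram <- function(z) {
  d2 <- as.matrix(dist(z))^2                  # squared pairwise distances
  sigma2 <- median(d2[d2 > 0])                # median-heuristic bandwidth
  exp(-d2 / (2 * sigma2))
}

hsic <- function(x, y) {
  n <- length(x)
  K <- gaussian_gram(x); L <- gaussian_gram(y)
  H <- diag(n) - matrix(1 / n, n, n)          # centering matrix
  sum(diag(K %*% H %*% L %*% H)) / n^2        # trace(KHLH) / n^2
}

hsic_threshold <- function(x, y, alpha = 0.05, B = 200) {
  stats <- replicate(B, hsic(x, sample(y)))   # permute y to simulate the null
  quantile(stats, 1 - alpha)                  # (1 - alpha) x 100 percentile
}
```

The decision would then be X ⊥⊥ Y if and only if hsic(x, y) ≤ hsic_threshold(x, y); each call to hsic() costs O(n³) time, which is the cost discussed above.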

Estimation of Mutual Information for both Discrete and Continuous Variables
This paper proposes a new estimator of mutual information that is able to address both discrete and continuous variables and that, for large n, becomes zero if and only if X and Y are independent.

Proposed Algorithm
The proposed estimation consists of three steps: 1. prepare nested histograms [14], 2. compute the estimations J_n^(u) of mutual information for the histograms u = 1, 2, ..., and 3. choose the maximum among the estimations J_n^(u) w.r.t. the histograms u = 1, 2, .... Suppose that we are given examples x^n = (x_1, ..., x_n) and y^n = (y_1, ..., y_n) and that they have been sorted as

x_(1) < x_(2) < ... < x_(n) and y_(1) < y_(2) < ... < y_(n).   (14)

First, we assume that no two consecutive values are equal in either of the two sorted sequences in Equation (14), which is true with probability one when the density function exists. Let s ≥ 1 be an integer, and for each u = 1, ..., s, we prepare histograms with 2^u bins for X, Y, and (X, Y). Let t := n/2^u. The sorted sequences in Equation (14) are divided into 2^u clusters of t consecutive values each. Thus, we obtain quantized sequences (a^n_u, b^n_u) for u = 1, ..., s, where each element is the index of the cluster to which the corresponding x_i or y_i belongs. For example, suppose that we generate n = 1000 standard Gaussian random sequences x^n and y^n with a correlation coefficient of 0.8. In the frequency distribution tables of a^n and b^n for u = 3, each of the 2³ = 8 clusters contains approximately 125 samples; thus, the distributions of a^n and b^n are nearly uniform. Because a sufficient number of samples is allocated to each cluster, at least for one-dimensional X and Y, if n is large, the estimations are more robust than those of other histogram-based methods [9].
Because the obtained sequences a^n_u and b^n_u are discrete, we can compute

J_n^(u) := max{ I_n^(u) − ((2^u − 1)²/(2n)) log n, 0 } ,   (15)

where I_n^(u) is the empirical mutual information w.r.t. the histogram u (the quantity in Equation (1) computed from a^n_u and b^n_u) and (2^u − 1)² is the number of independent parameters. The derivation of Equation (15) is similar to that of Equation (9).
Let (X_u, Y_u) and (X_v, Y_v) be the random variables for histograms u and v such that u ≤ v. Suppose that the examples a^n_v and b^n_v have been emitted from (X_v, Y_v); we wish to know, based on the MDL principle, whether (X_v, Y_v) are conditionally independent given (X_u, Y_u). We can answer the question affirmatively if we compare the description length values and find that J_n^(v) ≤ J_n^(u). This means that, according to the MDL principle, we decide that (X_v, Y_v) are conditionally independent given (X_u, Y_u) if and only if J_n^(v) ≤ J_n^(u). Hence, if u provides the maximum value of J_n^(u), then we choose the histogram u. Thus, we propose the estimation given by J_n := max_{1 ≤ u ≤ s} J_n^(u), and we prove in Section 3.2 (Theorem 1) that the optimal value of u is at most s = ⌊0.5 log n⌋.
Another interpretation is that if the sample size in each bin is smaller, then the estimation is less robust.However, if the number of bins is smaller, then the approximation of the histogram is less appropriate.These two factors are balanced by the MDL principle.
For example, suppose that n = 1000; thus, s = ⌊0.5 log 1000⌋ = 4. If we have the following four values:

u    J_n^(u)
1    0.2664842
2    0.5077115
3    0.5731657
4    0.4601272

then the final estimation will be 0.5731657 (u = 3). Note that there are other methods for finding the maximum mutual information. For example, s = ⌊0.5 log_a n⌋ with a^u clusters for each of X and Y works for any a > 1 (the smaller a is, the larger s is). For a = 1.5, we experimentally find (Figure 2) that the value of J_n^(u) depicts a concave curve, i.e., the maximum value is obtained at the point u = 5 at which the sample size of each bin (robustness of the estimation) and the number of bins (approximation of the histogram) are balanced. Next, we consider the case in which two consecutive values are the same in one of the two sequences in Equation (14). In general, we divide each cluster in half at each stage u = 1, 2, ..., s.
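The whole procedure can be sketched in R as follows, reusing the plugin_mi() function from Section 1; the function name is ours, and ties are handled here simply via rank() rather than by the cluster-splitting rule described in the text:

```r
# Proposed estimator J_n: quantize x and y by empirical quantiles into 2^u bins
# for u = 1, ..., floor(0.5 * log2(n)), compute
#   J_n^(u) = max{ I_n^(u) - (2^u - 1)^2 * log2(n) / (2n), 0 },
# and return the maximum over u.
proposed_mi <- function(x, y) {
  n <- length(x)
  s <- floor(0.5 * log2(n))
  best <- 0
  for (u in 1:s) {
    bins <- 2^u
    a <- ceiling(rank(x, ties.method = "first") * bins / n)  # quantile bins of x
    b <- ceiling(rank(y, ties.method = "first") * bins / n)  # quantile bins of y
    penalty <- (bins - 1)^2 * log2(n) / (2 * n)
    best <- max(best, plugin_mi(a, b) - penalty, 0)
  }
  best
}
```

Each stage costs O(n log n) for the ranking and O(n) for the counting, which is consistent with the overall O(n log n) complexity claimed for the method.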

Properties
In this subsection, we prove three fundamental claims: 1. the optimal u that maximizes J_n^(u) is no larger than s := ⌊0.5 log n⌋; 2. for large n, the mutual information estimation of each histogram converges to the correct approximated value; and 3. for large n, the estimation is zero if and only if X and Y are independent.
First, we have the following lemma from the law of large numbers: Lemma 1. The 2^u − 1 breaking points of histogram u = 1, 2, ... converge to the correct values (the 100 × j/2^u percentile points, j = 1, ..., 2^u − 1) with probability one as the sample size n (and hence the maximum depth s) increases. The value of a is assumed to be two for simplicity.
Let I(X_u, Y_u) be the true mutual information w.r.t. the correct breaking points of histogram u = 1, ..., s.

Theorem 1. For n ≥ 4, the optimal u is no larger than s = ⌊0.5 log n⌋.

Proof. We observe that, for all 1 ≤ u < s,

I_n^(u+1) ≤ I_n^(u) + 2.   (16)

In fact, from u to u + 1, the increases in the empirical entropies of X and Y are at most one bit each, and the empirical entropy of (X, Y) does not decrease. If we have the inequality J_n^(u+1) ≤ J_n^(u) for some 1 ≤ u ≤ s, then we cannot expect u + 1 to be the optimal value. However, when u = s = ⌊0.5 log n⌋, under Equation (16), Equation (17) implies that J_n^(u+1) ≤ J_n^(u). This completes the proof.
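The per-level bound used in the proof can be written explicitly as follows; this is only a restatement in display form of the observation above, under the assumption that refining from histogram u to u + 1 splits each bin in two:

```latex
% Entropy bounds behind Equation (16): each bin of histogram u is split in two,
% so the marginal empirical entropies gain at most one bit and the joint
% empirical entropy cannot decrease under refinement.
\begin{align*}
H_n(X_{u+1}) &\le H_n(X_u) + 1, \qquad H_n(Y_{u+1}) \le H_n(Y_u) + 1,\\
H_n(X_{u+1}, Y_{u+1}) &\ge H_n(X_u, Y_u),\\
\text{hence}\quad I_n^{(u+1)} &= H_n(X_{u+1}) + H_n(Y_{u+1}) - H_n(X_{u+1}, Y_{u+1})
\le I_n^{(u)} + 2 .
\end{align*}
```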
Theorem 2. For large n, the estimation of the mutual information of each histogram converges to the correct value.
Proof. Each boundary converges to the true value for each histogram (Lemma 1), and the number of samples in each bin increases as n becomes larger; therefore, the estimation in histogram u = 1, 2, ... converges to the correct mutual information value I(X_u, Y_u).

Theorem 3. With probability one as n → ∞, J_n = 0 if and only if X and Y are independent.
Proof. Suppose that X and Y are not independent. Because I(X, Y) > 0, we have I(X_u, Y_u) > 0 for some value of u. Thus, J_n^(u) for that u converges to the positive value I(X_u, Y_u) with probability one as n → ∞ (Theorem 2), and hence J_n > 0. For the proof of the other direction, see the Appendix.

Preliminary Experiments
If the random variables are known a priori to be Gaussian, it may be easier to estimate the correlation coefficient and compute the estimation based on Equation (4) (the Gaussian method) than to use the proposed method, which does not require the variables to be Gaussian. We compared the proposed algorithm with the Gaussian method.
For the first experiment, because X and Y are discrete, we expect the proposed method to successfully compute the mutual information values even though neither of the ranges of X and Y is bounded. The Gaussian method only considers the correlation between the two variables, whereas the proposed method counts the occurrences of the pairs. Consequently, particularly for large n, the proposed method outperformed the Gaussian method and tended to converge to the true mutual information value as n increased (Figure 3). For the second experiment, although X and Y are continuous, the difference is discrete, as is the probabilistic relation. The proposed method can count the differences (integers) and the quantized values. Consequently, the proposed method estimated the mutual information values more accurately than the Gaussian method (Figure 4a).
Even for the ANOVA case (Experiment 3), the mutual information values obtained using the proposed method are closer to the true value than those obtained using the Gaussian method (Figure 4b). We expected that the Gaussian method would outperform the proposed method, but in this case, X is discrete and the mutual information is at most the entropy of X; thus, the proposed method shows a slightly better performance than the Gaussian method. However, the difference between the two methods is not as large as that in Experiment 2, because the noise is Gaussian and the Gaussian method is designed to address Gaussian noise.

Application to Independence Tests
We conducted experiments using the R language, and we obtained evidence that supports the "No Free Lunch" theorem [15] for independence tests: no single independence test is capable of outperforming all the other tests. The proposed and HSIC methods require O(n log n) and O(n³) time, respectively, to perform the computation; thus, the former is considerably faster than the latter, particularly for large values of n.
For the HSIC method, we used the Gaussian kernel [8], k(x, x') = exp(−(x − x')²/(2σ²)), and similarly for l. We set the significance level α to 0.05. Computing the threshold ε(α) such that we decide X ⊥⊥ Y if and only if H_n ≤ ε(α) requires, because only x^n and y^n are available, repeatedly and randomly reordering y^n to generate mutually independent x^n and y^n so as to simulate the null hypothesis. However, this process is time-consuming for our experiments, so we instead generate mutually independent pairs x^n and y^n and compute H_n 200 times to estimate the distribution of H_n under the null hypothesis and its 95th percentile point ε(0.05).
For the proposed method, we set the prior probability of X ⊥ ⊥ Y to be 0.5.

Binary and Gaussian Sequences
First, we generated mutually independent binary X and U, with the probabilities of X = 1 and U = 1 being 0.5 and p = 0.1, 0.2, 0.3, 0.4, 0.5, respectively, to obtain Y = (X + U) mod 2. When we simulated the null hypothesis, we generated y^n in the same way as that used for generating x^n. We computed J_n and H_n 100 times for n = 100 and n = 200.
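One replication of this data-generating process can be sketched in R as follows; the seed and the specific values of n and p are ours, chosen only for illustration:

```r
# Binary experiment: X and U independent binary with P(X = 1) = 0.5 and
# P(U = 1) = p, and Y = (X + U) mod 2. Both tests are then applied to (x, y).
set.seed(1)
n <- 200; p <- 0.4
x <- rbinom(n, 1, 0.5)
u <- rbinom(n, 1, p)
y <- (x + u) %% 2
# proposed_mi(x, y) > 0 suggests dependence;
# hsic(x, y) > hsic_threshold(x, y) rejects independence at level alpha.
```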
The obtained results are presented in Figure 5. For each p and n = 200, we depict the distributions of H_n and J_n in the plots on the left and right, respectively. If the data occur to the left of the red vertical line, then the tests consider x^n and y^n to be independent. In particular, for p = 0.5 (X and Y independent) and p = 0.4 (X and Y dependent), we counted how many times the two tests chose independence and dependence (see Table 1).
We could not find any significant difference in the correctness of the two tests.

Table 1. Experiments for binary sequences: the figures show how many times (out of 100) the HSIC and the proposed method regarded the two sequences as independent (⊥⊥) and dependent (not ⊥⊥) for p = 0.5 and 0.4.

Next, we generated mutually independent Gaussian X and U with mean zero and variance one, and Y = qX + √(1 − q²) U for q = 0, 0.2, 0.4, 0.6, 0.8. When we simulated the null hypothesis, we generated y^n in the same way as that used for generating x^n. We computed J_n and H_n 100 times for n = 100 and n = 200. The obtained results are presented in Figure 6. For each q and n = 200, we depict the distributions of H_n and J_n on the left and right, respectively. If the data occur to the left of the red vertical line, then the tests consider x^n and y^n to be independent. In particular, for the independent case (q = 0) and a dependent case, we counted how many times the two tests chose independence and dependence (see Table 2).
We refer to the two problems as ROUNDING and INTEGER, respectively. Apparently, the answers to both of these are that X and Y are not independent, although the correlation coefficient ρ(X, Y) is zero.
Table 3 shows the number of times the tests chose independence and dependence for these experiments. We observed that the HSIC failed to detect the dependencies for both problems, whereas the proposed method successfully detected them. The obvious reason appears to be that the HSIC simply considers the magnitudes of X and Y. For the ROUNDING problem, the integer parts of X and Y are independent, but the fractional parts are related. However, because the integer part contributes to the HSIC score considerably more than the fractional part, the HSIC cannot detect the relation between the whole values.
The same reasoning can be applied to the INTEGER problem. In fact, the values of ⌊X/2⌋ and ⌊Y/2⌋ are independent, where ⌊x⌋ denotes the largest integer not exceeding x. However, the relation X ≡ Y (mod 2) always holds, and this cannot be detected by the HSIC.
Note that we do not claim that the proposed method is always superior to the HSIC. Admittedly, for many problems, the HSIC performs better. For example, for typical problems such as the one in which X and U follow the standard Gaussian and binary (probability 0.5) distributions, respectively, and Y = XU (ZERO-COV), we find that the HSIC offers more advantages (see Table 4). We rather claim that no single independence test outperforms all the others.

Execution Time
We compare the execution times for the Gaussian sequences. Table 5 lists the average execution times for n = 100, 500, 1000, and 2000 and q = 0.2 (the results were almost identical for the other values of q).
We find that the proposed method is considerably faster than the HSIC, particularly for large n. This result occurs because the proposed method requires O(n log n) time for the computation, whereas the HSIC requires O(n³) time. Although the HSIC estimator might detect independence correctly for large n because of its (weak) consistency, it appears that the HSIC is not efficient for large n. Because the HSIC requires the null hypothesis to be simulated, a considerable amount of additional computation is required.

Concluding Remarks
We proposed an estimator of mutual information and demonstrated the effectiveness of the algorithm in solving the independence testing problem.
Although estimating the mutual information of continuous variables was considered to be difficult, the proposed estimator was shown to be zero for a large sample size if and only if the two variables are independent. The estimator constructs many histograms of size 2^u × 2^u, estimates their mutual information J_n^(u), and chooses the maximum J_n^(u) value over u = 1, 2, .... We find that the optimal u has an upper bound of ⌊0.5 log n⌋. The proposed algorithm requires O(n log n) time to perform the computation.
Then, we compared the performance of our proposed estimator with that of the HSIC estimator, the de facto standard for independence testing. The two methods differ in that the proposed method is based on the MDL principle given the data x^n and y^n, whereas the HSIC detects deviations from the null hypothesis given the data. We could not obtain a definite answer as to which method is superior in general settings; rather, we obtained evidence that no single statistical test outperforms all the others for all problems. In fact, although the HSIC will clearly be superior when certain specific dependency structures form the alternative hypothesis, the proposed estimator is more universal.
One meaningful insight obtained is that the HSIC only considers the magnitude of the data and can fail to detect relations that cannot be identified by simply considering changes in magnitude.
The most notable merit of the proposed algorithm compared to the HSIC is its efficiency. The HSIC requires O(n³) computational time for one test. However, prior to the test, it is necessary to simulate the null hypothesis and set the threshold such that the algorithm determines that the data are independent if and only if the HSIC values do not exceed the threshold. In this sense, executing the HSIC would be time-consuming, and it would be safe to say that the proposed algorithm is useful for designing intelligent machines, whereas the HSIC is appropriate for scientific discovery.
In future work, we will consider exactly when the proposed method exhibits a particularly good performance.
Moreover, we should address the question of how generalizations to three variables might work. At this point, it is not clear whether one would want to estimate some form of total independence, such as E[log( p(X, Y, Z) / (p(X)p(Y)p(Z)) )], or conditional mutual information, such as E[log( p(X, Y, Z)p(X) / (p(X, Z)p(X, Y)) )]. In fact, for Bayesian network structure learning (BNSL), we need to compute Bayesian scores of conditional mutual information from the data to apply a standard scheme of BNSL based on dynamic programming [16].
Currently, a constraint-based approach for estimating conditional mutual information values using positive definite kernels is available [17], but no theoretical guarantee, such as consistency, has been obtained for that method.

Figure 1 presents a box plot of 1000 trials of the two estimations for n = 100 and α = β = 2, where X and Y are independent and dependent, respectively.

Figure 1. Estimating mutual information: the minimum description length (MDL) computes the correct values, whereas the maximum likelihood yields values that are larger than the correct values.


Figure 2. Values of J_n^(u) with 1 ≤ u ≤ s: the maximum value is obtained at the point where the sample size of each bin and the number of bins are balanced.

Figure 3. Experiment 1: for large n, the proposed method outperforms the Gaussian method.

Figure 4. Experiments 2 and 3: the proposed method outperformed the Gaussian method in both experiments. The difference is smaller in Experiment 3 than in Experiment 2.

Appendix (fragment of the proof of the remaining direction of Theorem 3): ... only for finitely many n (n such that n/log n ≤ e²), where we write a quantity f(n) such that n f(n) → 0 as n → ∞ as o(n⁻¹). Hence, from the Borel-Cantelli lemma, max_{u ≥ 2} J_n^(u) = 0 with probability one as n → ∞. This completes the proof.