Approximations of Shannon Mutual Information for Discrete Variables with Applications to Neural Population Coding

Although Shannon mutual information has been widely used, its effective calculation is often difficult for many practical problems, including those in neural population coding. Asymptotic formulas based on Fisher information sometimes provide accurate approximations to the mutual information but this approach is restricted to continuous variables because the calculation of Fisher information requires derivatives with respect to the encoded variables. In this paper, we consider information-theoretic bounds and approximations of the mutual information based on Kullback-Leibler divergence and Rényi divergence. We propose several information metrics to approximate Shannon mutual information in the context of neural population coding. While our asymptotic formulas all work for discrete variables, one of them has consistent performance and high accuracy regardless of whether the encoded variables are discrete or continuous. We performed numerical simulations and confirmed that our approximation formulas were highly accurate for approximating the mutual information between the stimuli and the responses of a large neural population. These approximation formulas may potentially bring convenience to the applications of information theory to many practical and theoretical problems.


Introduction
Information theory is a powerful tool widely used in many disciplines, including, for example, neuroscience, machine learning, and communication technology [1][2][3][4][5][6][7]. As it is often notoriously difficult to effectively calculate Shannon mutual information in many practical applications [8], various approximation methods have been proposed to estimate the mutual information, such as those based on asymptotic expansion [9][10][11][12][13], k-nearest neighbor [14], and minimal spanning trees [15]. Recently, Safaai et al. proposed a copula method for estimation of mutual information, which can be nonparametric and potentially robust [16]. Another approach for estimating the mutual information is to simplify the calculations by approximations based on information-theoretic bounds, such as the Cramér-Rao lower bound [17] and the van Trees' Bayesian Cramér-Rao bound [18].
Suppose the input x is a K-dimensional vector, x = (x 1 , · · · , x K ) T , which could be interpreted as the parameters that specifies a stimulus for a sensory system, and the outputs is an N-dimensional vector, r = (r 1 , · · · , r N ) T , which could be interpreted as the responses of N neurons. We assume N is large, generally N K. We denote random variables by upper case letters, e.g., random variables X and R, in contrast to their vector values x and r. The mutual information between X and R is defined by where x ∈ X ⊆ R K , r ∈ R ⊆ R N , and · r,x denotes the expectation with respect to the probability density function p(r, x). Similarly, in the following, we use · r|x and · x to denote expectations with respect to p(r|x) and p(x), respectively. If p(x) and p(r|x) are twice continuously differentiable for almost every x ∈ X , then for large N we can use an asymptotic formula to approximate the true value of I with high accuracy [24]: which is sometimes reduced to where det (·) denotes the matrix determinant, H(X) = − ln p(x) x is the stimulus entropy, and is the Fisher information matrix. We denote the Kullback-Leibler divergence as and denote Rényi divergence [29] of order β + 1 as Here, βD β (x||x) is equivalent to Chernoff divergence of order β + 1 [30]. It is well known that D β (x||x) → D (x||x) in the limit β → 0.
We define where in I β,α we have β ∈ (0, 1) and α ∈ (0, ∞) and assume p (x) > 0 for all x ∈ X . In the following, we suppose x takes M discrete values, x m , m ∈ M = {1, 2, · · · , M}, and p(x m ) > 0 for all m. Now, the definitions in Equations (9)-(11) become Furthermore, we define Here, notice that, if x is uniformly distributed, then by definition I d and I D become identical. The elements in setM β m are those that make D β (x m ||xm) take the minimum value, excluding any element that satisfies the condition D β (x m ||xm) = 0. Similarly, the elements in setM u m are those that minimize D (x m ||xm) excluding the ones that satisfy the condition D (x m ||xm) = 0.

Theorems
In the following, we state several conclusions as theorems and prove them in Appendix A.

Remark 1.
We see from Theorems 1 and 2 that the true mutual information I and the approximation I e both lie between I β 1 ,1 and I u , which implies that their values may be close to each other. For discrete variable x, Theorem 3 tells us that I e and I d are asymptotically equivalent (i.e., their difference vanishes) in the limit of large N. For continuous variable x, Theorem 4 tells us that I e and I G are asymptotically equivalent in the limit of large N, which means that I e and I are also asymptotically equivalent because I G and I are known to be asymptotically equivalent [24].

Approximations for Mutual Information
In this section, we use the relationships described above to find effective approximations to true mutual information I in the case of large but finite N. First, Theorems 1 and 2 tell us that the true mutual information I and its approximation I e lie between lower and upper bounds given by: I β,α ≤ I ≤ I u and I β 1 ,1 ≤ I e ≤ I u . As a special case, I is also bounded by I β 1 ,1 ≤ I ≤ I u . Furthermore, from Equations (2) and (36) we can obtain the following asymptotic equality under suitable conditions: Hence, for continuous stimuli, we have the following approximate relationship for large N: For discrete stimuli, by Equation (31) for large but finite N, we have Consider the special case p(xm) p(x m ) form ∈ M u m . With the help of Equation (18), substitution of p(xm) p(x m ) into Equation (52) yields where I 0 D ≤ I D and the second approximation follows from the first-order Taylor expansion assuming that the term ∑ The theoretical discussion above suggests that I e and I d are effective approximations to true mutual information I in the limit of large N. Moreover, we find that they are often good approximations of mutual information I even for relatively small N, as illustrated in the following section.

Results of Numerical Simulations
Consider Poisson model neuron whose responses (i.e., numbers of spikes within a given time window) follow a Poisson distribution [24]. The mean response of neuron n, with n ∈ {1, 2, · · · , N}, is described by the tuning function f (x; θ n ), which takes the form of a Heaviside step function: where the stimulus x ∈ [−T, T] with T = 10, A = 10, and the centers θ 1 , for N ≥ 2, and θ n = 0 for N = 1. We suppose that the discrete stimulus x has M = 21 possible values that are evenly spaced from −T to T, namely, M}. Now, the Kullback-Leibler divergence can be written as More generally, this equality holds true whenever the tuning function has binary values.
In the first example, as illustrated in Figure 1, we suppose the stimulus has a uniform distribution, so that the probability is given by p(x m ) = 1/M. Figure 1a shows graphs of the input distribution p(x) and a representative tuning function f (x; θ) with the center θ = 0.
To assess the accuracy of the approximation formulas, we employed Monte Carlo (MC) simulation to evaluate the mutual information I [24]. In our MC simulation, we first sampled an input x j ∈ X from the uniform distribution p(x j ) = 1/M, then generated the neural responses r j by the conditional distribution p(r j |x j ) based on the Poisson model, where j = 1, 2, · · · , j max . The value of mutual information by MC simulation was calculated by where To assess the precision of our MC simulation, we computed the standard deviation of repeated trials by bootstrapping: where and Γ j,i ∈ {1, 2, · · · , j max } is the (j, i)-th entry of the matrix Γ ∈ N j max ×i max with samples taken randomly from the integer set {1, 2, · · · , j max } by a uniform distribution. Here, we set j max = 5 × 10 5 , i max = 100 and M = 10 3 . For different N ∈ {1, 2, 3, 4, 6, 10, 14, 20, 30, 50, 100, 200, 400, 700, 1000}, we compared I MC with I e , I d and I D , as illustrated in Figure 1b-d. Here, we define the relative error of approximation, e.g., for I e , as and the relative standard deviation DI std = I std I MC .
(63) For the second example, we only changed the probability distribution of stimulus p(x m ) while keeping all other conditions unchanged. Now, p(x m ) is a discrete sample from a Gaussian function: where Z is the normalization constant and σ = T/2. The results are illustrated in Figure 2. Next, we changed each tuning function f (x; θ n ) to a rectified linear function:  Finally, we let the tuning function f (x; θ n ) have a random form: where the stimulus x ∈ X = {1, 2, · · · , 999, 1000}, B = 10, the values of θ 1 n , θ 2 n , · · · , θ K n are distinct and randomly selected from the set X with K = 10. In this example, we may regard X as a list of natural objects (stimuli), and there are a total of N sensory neurons, each of which responds only to K randomly selected objects. Figure 5 shows the results under the condition that p(x) is a uniform distribution. In Figure 6, we assume that p(x) is not flat but a half Gaussian given by Equation (64) with σ = 500.  Figures 1 and 5, all three approximations were practically indistinguishable. In general, all these approximations were extremely accurate when N > 100.
In all our simulations, the mutual information tended to increase with the population size N, eventually reaching a plateau for large enough N. The saturation of information for large N is due to the fact that it requires at most log 2 M bits of information to completely distinguish all M stimuli. It is impossible to gain more information than this maximum amount regardless of how many neurons are used in the population. In Figure 1, for instance, this maximum is log 2 21 = 4.39 bits, and in Figure 5, this maximum is log 2 1000 = 9.97 bits. For relatively small values of N, we found that I D tended to be less accurate than I e or I d (see Figures 5 and 6). Our simulations also confirmed two analytical results. The first one is that I d = I D when the stimulus distribution is uniform; this result follows directly from the definitions of I d and I D and is confirmed by the simulations in Figures 1, 3, and 5. The second result is that I d = I e (Equation (56)) when the tuning function is binary, as confirmed by the simulations in Figures 1, 2, 5, and 6. When the tuning function allows many different values, I e can be much more accurate than I d and I D , as shown by the simulations in Figures 3 and 4. To summarize, our best approximation formula is I e because it is more accurate than I d and I D , and, unlike I d and I D , it applies to both discrete and continuous stimuli (Equations (10) and (13)).

Discussion
We have derived several asymptotic bounds and effective approximations of mutual information for discrete variables and established several relationships among different approximations. Our final approximation formulas involve only Kullback-Leibler divergence, which is often easier to evaluate than Shannon mutual information in practical applications. Although in this paper our theory is developed in the framework of neural population coding with concrete examples, our mathematical results are generic and should hold true in many related situations beyond the original context.
We propose to approximate the mutual information with several asymptotic formulas, including I e in Equation (10) or Equation (13), I d in Equation (15) and I D in Equation (18). Our numerical experimental results show that the three approximations I e , I d and I D were very accurate for large population size N, and sometimes even for relatively small N. Among the three approximations, I D tended to be the least accurate, although, as a special case of I d , it is slightly easier to evaluate than I d . For a comparison of I e and I d , we note that I e is the universal formula, whereas I d is restricted only to discrete variables. The two formulas I e and I d become identical when the responses or the tuning functions have only two values. For more general tuning functions, the performance of I e was better than I d in our simulations.
As mentioned before, an advantage of of I e is that it works not only for discrete stimuli but also for continuous stimuli. Theoretically speaking, the formula for I e is well justified, and we have proven that it approaches the true mutual information I in the limit of large population. In our numerical simulations, the performance of I e was excellent and better than that of I d and I D . Overall, I e is our most accurate and versatile approximation formula, although, in some cases, I d and I D are slightly more convenient to calculate.
The numerical examples considered in this paper were based on an independent population of neurons whose responses have Poisson statistics. Although such models are widely used, they are appropriate only if the neural responses can be well characterized by the spike counts within a fixed time window. To study the temporal patterns of spike trains, one has to consider more complicated models. Estimation of mutual information from neural spike trains is a difficult computational problem [25][26][27][28]. In future work, it would be interesting to apply the asymptotic formulas such as I e to spike trains with small time bins each containing either one spike or nothing. A potential advantage of the asymptotic formula is that it might help reduce the bias caused by small samples in the calculation of the response marginal distribution p(r) = ∑ x p(r|x)p(x) or the response entropy H(R) because here one only needs to calculate the Kullback-Leibler divergence D (x||x), which may have a smaller estimation error.
Finding effective approximation methods for computing mutual information is a key step for many practical applications of the information theory. Generally speaking, Kullback-Leibler divergence (Equation (7)) is often easier to evaluate and approximate than either Chernoff information (Equation (44)) or Shannon mutual information (Equation (1)). In situations where this is indeed the case, our approximation formulas are potentially useful. Besides applications in numerical simulations, the availability of a set of approximation formulas may also provide helpful theoretical tools in future analytical studies of information coding and representations.
As mentioned in the Introduction, various methods have been proposed to approximate the mutual information [9][10][11][12][13][14][15][16]. In future work, it would be useful to compare different methods rigorously under identical conditions in order to asses their relative merits. The approximation formulas developed in this paper are relatively easy to compute for practical problems. They are especially suitable for analytical purposes; for example, they could be used explicitly as objective functions for optimization or learning algorithms. Although the examples used in our simulations in this paper are parametric, it should be possible to extend the formulas to nonparametric problem, possibly with help of the copula method to take advantage of its robustness in nonparametric estimations [16].
In this section, we use integral for variable x, although our argument is valid for both continuous variables and discrete variables. For discrete variables, we just need to replace each integral by a summation, and our argument remains valid without other modification. The same is true for the response variable r.
To prove the upper bound, let where q(x) satisfies By Jensen's inequality, we get To find a function q(x) that minimizes Φ [q(x)], we apply the variational principle as follows: where λ is the Lagrange multiplier and ∂q(x) = 0 and using the constraint in Equation (A4), we find the optimal solution Thus, the variational lower bound of Φ [q(x)] is given by Therefore, from Equations (1), (A5) and (A9), we get the upper bound in Equation (25). This completes the proof of Theorem 1.

Appendix A.4. Proof of Theorem 4
The upper bound I u for mutual information I in Equation (25) can be written as where L (r|x) = ln (p (r|x) p (x)) and L (r|x) = ln (p (r|x) p (x)). Consider the Taylor expansion for L(r|x) around x. Assuming that L(r|x) is twice continuously differentiable for anyx ∈ X ω (x), we get For later use, we also define where l (r|x) = ln p (r|x) .