Article

Approximations of Shannon Mutual Information for Discrete Variables with Applications to Neural Population Coding

by Wentao Huang and Kechen Zhang
1 Key Laboratory of Cognition and Intelligence and Information Science Academy of China Electronics Technology Group Corporation, Beijing 100086, China
2 Department of Biomedical Engineering, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA
* Authors to whom correspondence should be addressed.
Entropy 2019, 21(3), 243; https://doi.org/10.3390/e21030243
Submission received: 15 December 2018 / Revised: 11 February 2019 / Accepted: 28 February 2019 / Published: 4 March 2019

Abstract:
Although Shannon mutual information has been widely used, its effective calculation is often difficult for many practical problems, including those in neural population coding. Asymptotic formulas based on Fisher information sometimes provide accurate approximations to the mutual information but this approach is restricted to continuous variables because the calculation of Fisher information requires derivatives with respect to the encoded variables. In this paper, we consider information-theoretic bounds and approximations of the mutual information based on Kullback-Leibler divergence and Rényi divergence. We propose several information metrics to approximate Shannon mutual information in the context of neural population coding. While our asymptotic formulas all work for discrete variables, one of them has consistent performance and high accuracy regardless of whether the encoded variables are discrete or continuous. We performed numerical simulations and confirmed that our approximation formulas were highly accurate for approximating the mutual information between the stimuli and the responses of a large neural population. These approximation formulas may potentially bring convenience to the applications of information theory to many practical and theoretical problems.

1. Introduction

Information theory is a powerful tool widely used in many disciplines, including, for example, neuroscience, machine learning, and communication technology [1,2,3,4,5,6,7]. As it is often notoriously difficult to effectively calculate Shannon mutual information in many practical applications [8], various approximation methods have been proposed to estimate the mutual information, such as those based on asymptotic expansion [9,10,11,12,13], k-nearest neighbor [14], and minimal spanning trees [15]. Recently, Safaai et al. proposed a copula method for estimation of mutual information, which can be nonparametric and potentially robust [16]. Another approach for estimating the mutual information is to simplify the calculations by approximations based on information-theoretic bounds, such as the Cramér–Rao lower bound [17] and the van Trees Bayesian Cramér–Rao bound [18].
In this paper, we focus on mutual information estimation based on asymptotic approximations [19,20,21,22,23,24]. For encoding of continuous variables, asymptotic relations between mutual information and Fisher information have been presented by several researchers [19,20,21,22]. Recently, Huang and Zhang [24] proposed an improved approximation formula, which remains accurate for high-dimensional variables. A significant advantage of this approach is that asymptotic approximations are sometimes very useful in analytical studies. For instance, asymptotic approximations allow us to prove that the optimal neural population distribution that maximizes the mutual information between stimulus and response can be solved by convex optimization [24]. Unfortunately, this approach does not generalize to discrete variables, since the calculation of Fisher information requires partial derivatives of the likelihood function with respect to the encoded variables. For encoding of discrete variables, Kang and Sompolinsky [23] presented an asymptotic relationship between mutual information and Chernoff information for statistically independent neurons in a large population. However, Chernoff information is still hard to calculate in many practical applications.
Discrete stimuli or variables occur naturally in sensory coding. While some stimuli are continuous (e.g., the direction of movement and the pitch of a tone), others are discrete (e.g., the identities of faces and the words in human speech). For definiteness, in this paper, we frame our questions in the context of neural population coding; that is, we assume that the stimuli or the input variables are encoded by the pattern of responses elicited from a large population of neurons. The concrete examples used in our numerical simulations were based on a Poisson spike model, where the response of each neuron is taken as the spike count within a given time window. While this simple Poisson model allowed us to consider a large neural population, it only captured the spike rate but not any temporal structure of the spike trains [25,26,27,28]. Nonetheless, our mathematical results are quite general and should be applicable to other input–output systems under suitable conditions to be discussed later.
In the following, we first derive several upper and lower bounds on Shannon mutual information using Kullback-Leibler divergence and Rényi divergence. Next, we derive several new approximation formulas for Shannon mutual information in the limit of large population size. These formulas are more convenient to calculate than the mutual information in our examples. Finally, we confirm the validity of our approximation formulas using the true mutual information as evaluated by Monte Carlo simulations.

2. Theory and Methods

2.1. Notations and Definitions

Suppose the input $\mathbf{x}$ is a $K$-dimensional vector, $\mathbf{x} = (x_1, \ldots, x_K)^T$, which could be interpreted as the parameters that specify a stimulus for a sensory system, and the output is an $N$-dimensional vector, $\mathbf{r} = (r_1, \ldots, r_N)^T$, which could be interpreted as the responses of $N$ neurons. We assume $N$ is large, generally $N \gg K$. We denote random variables by upper case letters, e.g., random variables $X$ and $R$, in contrast to their vector values $\mathbf{x}$ and $\mathbf{r}$. The mutual information between $X$ and $R$ is defined by
$$I = I(X; R) = \left\langle \ln \frac{p(\mathbf{r}|\mathbf{x})}{p(\mathbf{r})} \right\rangle_{\mathbf{r}, \mathbf{x}}, \tag{1}$$
where $\mathbf{x} \in \mathcal{X} \subseteq \mathbb{R}^K$, $\mathbf{r} \in \mathcal{R} \subseteq \mathbb{R}^N$, and $\langle \cdot \rangle_{\mathbf{r}, \mathbf{x}}$ denotes the expectation with respect to the probability density function $p(\mathbf{r}, \mathbf{x})$. Similarly, in the following, we use $\langle \cdot \rangle_{\mathbf{r}|\mathbf{x}}$ and $\langle \cdot \rangle_{\mathbf{x}}$ to denote expectations with respect to $p(\mathbf{r}|\mathbf{x})$ and $p(\mathbf{x})$, respectively.
If $p(\mathbf{x})$ and $p(\mathbf{r}|\mathbf{x})$ are twice continuously differentiable for almost every $\mathbf{x} \in \mathcal{X}$, then for large $N$ we can use an asymptotic formula to approximate the true value of $I$ with high accuracy [24]:
$$I \approx I_G = \frac{1}{2} \left\langle \ln \det \frac{G(\mathbf{x})}{2\pi e} \right\rangle_{\mathbf{x}} + H(X), \tag{2}$$
which is sometimes reduced to
$$I \approx I_F = \frac{1}{2} \left\langle \ln \det \frac{J(\mathbf{x})}{2\pi e} \right\rangle_{\mathbf{x}} + H(X), \tag{3}$$
where $\det(\cdot)$ denotes the matrix determinant, $H(X) = -\langle \ln p(\mathbf{x}) \rangle_{\mathbf{x}}$ is the stimulus entropy,
$$G(\mathbf{x}) = J(\mathbf{x}) + P(\mathbf{x}), \tag{4}$$
$$P(\mathbf{x}) = -\frac{\partial^2 \ln p(\mathbf{x})}{\partial \mathbf{x}\, \partial \mathbf{x}^T}, \tag{5}$$
and
$$J(\mathbf{x}) = \left\langle -\frac{\partial^2 \ln p(\mathbf{r}|\mathbf{x})}{\partial \mathbf{x}\, \partial \mathbf{x}^T} \right\rangle_{\mathbf{r}|\mathbf{x}} = \left\langle \frac{\partial \ln p(\mathbf{r}|\mathbf{x})}{\partial \mathbf{x}} \frac{\partial \ln p(\mathbf{r}|\mathbf{x})}{\partial \mathbf{x}^T} \right\rangle_{\mathbf{r}|\mathbf{x}} \tag{6}$$
is the Fisher information matrix.
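As a concrete illustration of Equations (3)–(6), the following sketch evaluates the Fisher information and the approximation $I_F$ for a hypothetical one-dimensional continuous stimulus encoded by independent Poisson neurons with Gaussian tuning curves. The tuning shape and all parameter values are our own assumptions rather than an example from the paper, and we use the standard fact that for independent Poisson neurons $J(x) = \sum_n f_n'(x)^2 / f_n(x)$; with a uniform prior, $P(x) = 0$ and $G(x) = J(x)$, so $I_F$ coincides with $I_G$ here.

```python
import numpy as np

# Hypothetical example (not from the paper): N independent Poisson neurons with
# Gaussian tuning curves f_n(x) = A * exp(-(x - c_n)^2 / (2 w^2)) and a uniform
# stimulus prior on [-T, T], so that H(X) = ln(2T) and P(x) = 0.
N, A, w, T = 200, 10.0, 1.0, 10.0
centers = np.linspace(-T, T, N)

def tuning(x):
    """Mean spike counts f_n(x) of all N neurons for a stimulus x."""
    return A * np.exp(-(x - centers) ** 2 / (2 * w ** 2))

def fisher(x):
    """Fisher information J(x) = sum_n f_n'(x)^2 / f_n(x) for Poisson spiking."""
    f = tuning(x)
    df = f * (centers - x) / w ** 2          # derivative of the Gaussian tuning curve
    return np.sum(df ** 2 / f)

# Approximation I_F of Equation (3), averaging over the uniform prior.
xs = np.linspace(-T, T, 1001)
H_X = np.log(2 * T)
I_F = np.mean([0.5 * np.log(fisher(x) / (2 * np.pi * np.e)) for x in xs]) + H_X
print(f"I_F ~ {I_F:.3f} nats")
```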
We denote the Kullback-Leibler divergence as
$$D(\mathbf{x} \| \hat{\mathbf{x}}) = \left\langle \ln \frac{p(\mathbf{r}|\mathbf{x})}{p(\mathbf{r}|\hat{\mathbf{x}})} \right\rangle_{\mathbf{r}|\mathbf{x}}, \tag{7}$$
and denote the Rényi divergence [29] of order $\beta + 1$ as
$$D_\beta(\mathbf{x} \| \hat{\mathbf{x}}) = -\frac{1}{\beta} \ln \left\langle \left( \frac{p(\mathbf{r}|\mathbf{x})}{p(\mathbf{r}|\hat{\mathbf{x}})} \right)^{-\beta} \right\rangle_{\mathbf{r}|\mathbf{x}}. \tag{8}$$
Here, $\beta D_\beta(\mathbf{x} \| \hat{\mathbf{x}})$ is equivalent to the Chernoff divergence of order $\beta + 1$ [30]. It is well known that $D_\beta(\mathbf{x} \| \hat{\mathbf{x}}) \to D(\mathbf{x} \| \hat{\mathbf{x}})$ in the limit $\beta \to 0$.
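The limit $D_\beta(\mathbf{x}\|\hat{\mathbf{x}}) \to D(\mathbf{x}\|\hat{\mathbf{x}})$ as $\beta \to 0$, together with the bound $D_\beta \leq D$ used later in Equation (43), can be checked numerically. The sketch below does this for two Poisson spike-count distributions with arbitrarily chosen rates; it evaluates $D$ and $D_\beta$ directly from the definitions as reconstructed above and is meant only as a sanity check, not as code from the paper.

```python
import numpy as np
from scipy.stats import poisson

# Two hypothetical Poisson response distributions p(r|x) and p(r|x_hat).
lam, lam_hat = 8.0, 5.0
r = np.arange(200)                      # truncation covers essentially all probability mass
p, q = poisson.pmf(r, lam), poisson.pmf(r, lam_hat)

def kl(p, q):
    """Kullback-Leibler divergence D(x||x_hat), Equation (7)."""
    m = p > 0
    return np.sum(p[m] * np.log(p[m] / q[m]))

def renyi(p, q, beta):
    """D_beta(x||x_hat) = -(1/beta) ln sum_r p^(1-beta) q^beta, as in Equation (8)."""
    return -np.log(np.sum(p ** (1 - beta) * q ** beta)) / beta

D = kl(p, q)
for beta in (0.5, 0.1, 0.01, 0.001):
    D_beta = renyi(p, q, beta)
    assert 0.0 <= D_beta <= D + 1e-9    # Equation (43)
    print(f"beta = {beta:5.3f}:  D_beta = {D_beta:.4f}   (D = {D:.4f})")
```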
We define
$$I_u = -\left\langle \ln \left\langle \exp\left( -D(\mathbf{x} \| \hat{\mathbf{x}}) \right) \right\rangle_{\hat{\mathbf{x}}} \right\rangle_{\mathbf{x}}, \tag{9}$$
$$I_e = -\left\langle \ln \left\langle \exp\left( -e^{-1} D(\mathbf{x} \| \hat{\mathbf{x}}) \right) \right\rangle_{\hat{\mathbf{x}}} \right\rangle_{\mathbf{x}}, \tag{10}$$
$$I_{\beta,\alpha} = -\left\langle \ln \left\langle \exp\left( -\beta D_\beta(\mathbf{x} \| \hat{\mathbf{x}}) + (1-\alpha) \ln \frac{p(\mathbf{x})}{p(\hat{\mathbf{x}})} \right) \right\rangle_{\hat{\mathbf{x}}} \right\rangle_{\mathbf{x}}, \tag{11}$$
where in $I_{\beta,\alpha}$ we have $\beta \in (0,1)$ and $\alpha \geq 0$, and we assume $p(\mathbf{x}) > 0$ for all $\mathbf{x} \in \mathcal{X}$.
In the following, we suppose $\mathbf{x}$ takes $M$ discrete values $\mathbf{x}_m$, $m \in \mathcal{M} = \{1, 2, \ldots, M\}$, and $p(\mathbf{x}_m) > 0$ for all $m$. Now, the definitions in Equations (9)–(11) become
$$I_u = -\sum_{m=1}^{M} p(\mathbf{x}_m) \ln \left( \sum_{\hat{m}=1}^{M} \frac{p(\mathbf{x}_{\hat{m}})}{p(\mathbf{x}_m)} \exp\left( -D(\mathbf{x}_m \| \mathbf{x}_{\hat{m}}) \right) \right) + H(X), \tag{12}$$
$$I_e = -\sum_{m=1}^{M} p(\mathbf{x}_m) \ln \left( \sum_{\hat{m}=1}^{M} \frac{p(\mathbf{x}_{\hat{m}})}{p(\mathbf{x}_m)} \exp\left( -e^{-1} D(\mathbf{x}_m \| \mathbf{x}_{\hat{m}}) \right) \right) + H(X), \tag{13}$$
$$I_{\beta,\alpha} = -\sum_{m=1}^{M} p(\mathbf{x}_m) \ln \left( \sum_{\hat{m}=1}^{M} \left( \frac{p(\mathbf{x}_{\hat{m}})}{p(\mathbf{x}_m)} \right)^{\alpha} \exp\left( -\beta D_\beta(\mathbf{x}_m \| \mathbf{x}_{\hat{m}}) \right) \right) + H(X). \tag{14}$$
Furthermore, we define
$$I_d = -\sum_{m=1}^{M} p(\mathbf{x}_m) \ln \left( 1 + \sum_{\hat{m} \in \mathcal{M}_m^u} \frac{p(\mathbf{x}_{\hat{m}})}{p(\mathbf{x}_m)} \exp\left( -e^{-1} D(\mathbf{x}_m \| \mathbf{x}_{\hat{m}}) \right) \right) + H(X), \tag{15}$$
$$I_u^d = -\sum_{m=1}^{M} p(\mathbf{x}_m) \ln \left( 1 + \sum_{\hat{m} \in \mathcal{M}_m^u} \frac{p(\mathbf{x}_{\hat{m}})}{p(\mathbf{x}_m)} \exp\left( -D(\mathbf{x}_m \| \mathbf{x}_{\hat{m}}) \right) \right) + H(X), \tag{16}$$
$$I_{\beta,\alpha}^d = -\sum_{m=1}^{M} p(\mathbf{x}_m) \ln \left( 1 + \sum_{\hat{m} \in \mathcal{M}_m^\beta} \left( \frac{p(\mathbf{x}_{\hat{m}})}{p(\mathbf{x}_m)} \right)^{\alpha} \exp\left( -\beta D_\beta(\mathbf{x}_m \| \mathbf{x}_{\hat{m}}) \right) \right) + H(X), \tag{17}$$
$$I_D = -\sum_{m=1}^{M} p(\mathbf{x}_m) \ln \left( 1 + \sum_{\hat{m} \in \mathcal{M}_m^u} \exp\left( -e^{-1} D(\mathbf{x}_m \| \mathbf{x}_{\hat{m}}) \right) \right) + H(X), \tag{18}$$
where
$$\check{\mathcal{M}}_m^\beta = \left\{ \hat{m} : \hat{m} = \arg\min_{\check{m} \in \mathcal{M} \setminus \hat{\mathcal{M}}_m^\beta} D_\beta(\mathbf{x}_m \| \mathbf{x}_{\check{m}}) \right\}, \tag{19}$$
$$\check{\mathcal{M}}_m^u = \left\{ \hat{m} : \hat{m} = \arg\min_{\check{m} \in \mathcal{M} \setminus \hat{\mathcal{M}}_m^u} D(\mathbf{x}_m \| \mathbf{x}_{\check{m}}) \right\}, \tag{20}$$
$$\hat{\mathcal{M}}_m^\beta = \left\{ \hat{m} : D_\beta(\mathbf{x}_m \| \mathbf{x}_{\hat{m}}) = 0 \right\}, \tag{21}$$
$$\hat{\mathcal{M}}_m^u = \left\{ \hat{m} : D(\mathbf{x}_m \| \mathbf{x}_{\hat{m}}) = 0 \right\}, \tag{22}$$
$$\mathcal{M}_m^\beta = \check{\mathcal{M}}_m^\beta \cup \hat{\mathcal{M}}_m^\beta \setminus \{m\}, \tag{23}$$
$$\mathcal{M}_m^u = \check{\mathcal{M}}_m^u \cup \hat{\mathcal{M}}_m^u \setminus \{m\}. \tag{24}$$
Here, notice that, if $\mathbf{x}$ is uniformly distributed, then by definition $I_d$ and $I_D$ become identical. The elements in the set $\check{\mathcal{M}}_m^\beta$ are those that make $D_\beta(\mathbf{x}_m \| \mathbf{x}_{\check{m}})$ take the minimum value, excluding any element that satisfies the condition $D_\beta(\mathbf{x}_m \| \mathbf{x}_{\hat{m}}) = 0$. Similarly, the elements in the set $\check{\mathcal{M}}_m^u$ are those that minimize $D(\mathbf{x}_m \| \mathbf{x}_{\check{m}})$, excluding the ones that satisfy the condition $D(\mathbf{x}_m \| \mathbf{x}_{\hat{m}}) = 0$.
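To make these discrete definitions concrete, the sketch below evaluates Equations (12), (13), (15) and (18) from a precomputed matrix of Kullback-Leibler divergences $D(\mathbf{x}_m\|\mathbf{x}_{\hat m})$ and a prior $p(\mathbf{x}_m)$. The divergence matrix and prior used at the end are arbitrary placeholders; in an application they would come from the encoding model $p(\mathbf{r}|\mathbf{x})$.

```python
import numpy as np

def entropy(p):
    """Stimulus entropy H(X) = -sum_m p_m ln p_m."""
    return -np.sum(p * np.log(p))

def I_u(D, p):
    """Upper bound I_u, Equation (12)."""
    W = (p[None, :] / p[:, None]) * np.exp(-D)            # W[m, m_hat]
    return -np.sum(p * np.log(W.sum(axis=1))) + entropy(p)

def I_e(D, p):
    """Approximation I_e, Equation (13): I_u with the divergence scaled by 1/e."""
    return I_u(D / np.e, p)

def _sum_over_Mmu(D, p, m, weight, tol=1e-12):
    """Sum over the set M_m^u: zero-divergence entries plus the nearest remaining
    neighbors of stimulus m, always excluding m itself."""
    d, others = D[m], np.arange(len(p)) != m
    zero = (d < tol) & others
    rest = (~zero) & others
    nearest = (rest & np.isclose(d, d[rest].min())) if rest.any() else rest
    idx = zero | nearest
    return np.sum(weight[idx] * np.exp(-d[idx] / np.e))

def I_d(D, p):
    """Approximation I_d, Equation (15)."""
    return -sum(p[m] * np.log1p(_sum_over_Mmu(D, p, m, p / p[m]))
                for m in range(len(p))) + entropy(p)

def I_D(D, p):
    """Approximation I_D, Equation (18): I_d with the ratio p(x_mh)/p(x_m) set to 1."""
    return -sum(p[m] * np.log1p(_sum_over_Mmu(D, p, m, np.ones_like(p)))
                for m in range(len(p))) + entropy(p)

# Placeholder example: M = 4 stimuli, a made-up symmetric divergence matrix, flat prior.
D = np.array([[0., 3., 9., 27.],
              [3., 0., 3., 9.],
              [9., 3., 0., 3.],
              [27., 9., 3., 0.]])
p = np.full(4, 0.25)
print(I_u(D, p), I_e(D, p), I_d(D, p), I_D(D, p))         # I_d = I_D for the flat prior
```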

2.2. Theorems

In the following, we state several conclusions as theorems and prove them in Appendix A.
Theorem 1.
The mutual information $I$ is bounded as follows:
$$I_{\beta,\alpha} \leq I \leq I_u. \tag{25}$$
Theorem 2.
The following inequalities are satisfied:
$$I_{\beta_1,1} \leq I_e \leq I_u, \tag{26}$$
where $I_{\beta_1,1}$ is a special case of $I_{\beta,\alpha}$ in Equation (11) with $\beta_1 = e^{-1}$, so that
$$I_{\beta_1,1} = -\left\langle \ln \left\langle \exp\left( -\beta_1 D_{\beta_1}(\mathbf{x} \| \hat{\mathbf{x}}) \right) \right\rangle_{\hat{\mathbf{x}}} \right\rangle_{\mathbf{x}}. \tag{27}$$
Theorem 3.
If there exist $\gamma_1 > 0$ and $\gamma_2 > 0$ such that
$$\beta D_\beta(\mathbf{x}_m \| \mathbf{x}_{m_1}) \geq \gamma_1 \ln N, \tag{28}$$
$$D(\mathbf{x}_m \| \mathbf{x}_{m_2}) \geq \gamma_2 \ln N, \tag{29}$$
for discrete stimuli $\mathbf{x}_m$, where $m \in \mathcal{M}$, $m_1 \in \mathcal{M} \setminus (\mathcal{M}_m^\beta \cup \{m\})$ and $m_2 \in \mathcal{M} \setminus (\mathcal{M}_m^u \cup \{m\})$, then we have the following asymptotic relationships:
$$I_{\beta,\alpha} = I_{\beta,\alpha}^d + O\left( N^{-\gamma_1} \right) \leq I \leq I_u = I_u^d + O\left( N^{-\gamma_2} \right) \tag{30}$$
and
$$I_e = I_d + O\left( N^{-\gamma_2 / e} \right). \tag{31}$$
Theorem 4.
Suppose $p(\mathbf{x})$ and $p(\mathbf{r}|\mathbf{x})$ are twice continuously differentiable for $\mathbf{x} \in \mathcal{X}$, $\|q'(\mathbf{x})\| < \infty$ and $\|q''(\mathbf{x})\| < \infty$, where $q(\mathbf{x}) = \ln p(\mathbf{x})$ and $'$ and $''$ denote the partial derivatives $\partial / \partial \mathbf{x}$ and $\partial^2 / \partial \mathbf{x}\, \partial \mathbf{x}^T$, and $G_\gamma(\mathbf{x})$ is positive definite with $N \| G_\gamma^{-1}(\mathbf{x}) \| = O(1)$, where $\| \cdot \|$ denotes the matrix Frobenius norm,
$$G_\gamma(\mathbf{x}) = \gamma J(\mathbf{x}) + P(\mathbf{x}), \tag{32}$$
$\gamma = \beta(1-\beta)$ and $\beta \in (0,1)$. If there exists an $\omega = \omega(\mathbf{x}) > 0$ such that
$$\det\left( G(\mathbf{x}) \right)^{1/2} \int_{\bar{\mathcal{X}}_\varepsilon(\mathbf{x})} p(\hat{\mathbf{x}}) \exp\left( -D(\mathbf{x} \| \hat{\mathbf{x}}) \right) d\hat{\mathbf{x}} = O\left( N^{-1} \right), \tag{33}$$
$$\det\left( G_\gamma(\mathbf{x}) \right)^{1/2} \int_{\bar{\mathcal{X}}_\varepsilon(\mathbf{x})} p(\hat{\mathbf{x}}) \exp\left( -\beta D_\beta(\mathbf{x} \| \hat{\mathbf{x}}) \right) d\hat{\mathbf{x}} = O\left( N^{-1} \right), \tag{34}$$
for all $\mathbf{x} \in \mathcal{X}$ and $\varepsilon \in (0, \omega)$, where $\bar{\mathcal{X}}_\omega(\mathbf{x}) = \mathcal{X} \setminus \mathcal{X}_\omega(\mathbf{x})$ is the complementary set of $\mathcal{X}_\omega(\mathbf{x}) = \left\{ \breve{\mathbf{x}} \in \mathbb{R}^K : (\breve{\mathbf{x}} - \mathbf{x})^T G(\mathbf{x}) (\breve{\mathbf{x}} - \mathbf{x}) < N \omega^2 \right\}$, then we have the following asymptotic relationships:
$$I_{\beta,\alpha} \leq I_{\gamma_0} + O\left( N^{-1} \right) \leq I \leq I_u = I_G + K/2 + O\left( N^{-1} \right), \tag{35}$$
$$I_e = I_G + O\left( N^{-1} \right), \tag{36}$$
$$I_{\beta,\alpha} = I_\gamma + O\left( N^{-1} \right), \tag{37}$$
where
$$I_\gamma = \frac{1}{2} \int_{\mathcal{X}} p(\mathbf{x}) \ln \det \frac{G_\gamma(\mathbf{x})}{2\pi}\, d\mathbf{x} + H(X) \tag{38}$$
and $\gamma_0 = \beta_0 (1 - \beta_0) = 1/4$ with $\beta_0 = 1/2$.
Remark 1.
We see from Theorems 1 and 2 that the true mutual information $I$ and the approximation $I_e$ both lie between $I_{\beta_1,1}$ and $I_u$, which implies that their values may be close to each other. For a discrete variable $\mathbf{x}$, Theorem 3 tells us that $I_e$ and $I_d$ are asymptotically equivalent (i.e., their difference vanishes) in the limit of large $N$. For a continuous variable $\mathbf{x}$, Theorem 4 tells us that $I_e$ and $I_G$ are asymptotically equivalent in the limit of large $N$, which means that $I_e$ and $I$ are also asymptotically equivalent because $I_G$ and $I$ are known to be asymptotically equivalent [24].
Remark 2.
To see how the condition in Equation (33) could be satisfied, consider the case where $D(\mathbf{x} \| \hat{\mathbf{x}})$ has only one extreme point at $\hat{\mathbf{x}} = \mathbf{x}$ for $\hat{\mathbf{x}} \in \mathcal{X}_\omega(\mathbf{x})$ and there exists an $\eta > 0$ such that $N^{-1} D(\mathbf{x} \| \hat{\mathbf{x}}) \geq \eta$ for $\hat{\mathbf{x}} \in \bar{\mathcal{X}}_\omega(\mathbf{x})$. Now, the condition in Equation (33) is satisfied because
$$\det\left( G(\mathbf{x}) \right)^{1/2} \int_{\bar{\mathcal{X}}_\varepsilon(\mathbf{x})} p(\hat{\mathbf{x}}) \exp\left( -D(\mathbf{x} \| \hat{\mathbf{x}}) \right) d\hat{\mathbf{x}} \leq \det\left( G(\mathbf{x}) \right)^{1/2} \int_{\bar{\mathcal{X}}_\varepsilon(\mathbf{x})} p(\hat{\mathbf{x}}) \exp\left( -\hat{\eta}_\varepsilon N \right) d\hat{\mathbf{x}} = O\left( N^{K/2} e^{-\hat{\eta}_\varepsilon N} \right), \tag{39}$$
where by assumption we can find an $\hat{\eta}_\varepsilon > 0$ for any given $\varepsilon \in (0, \omega)$. The condition in Equation (34) can be satisfied in a similar way. When $\beta_0 = 1/2$, $\beta_0 D_{\beta_0}(\mathbf{x} \| \hat{\mathbf{x}})$ is the Bhattacharyya distance [31]:
$$\beta_0 D_{\beta_0}(\mathbf{x} \| \hat{\mathbf{x}}) = -\ln \int_{\mathcal{R}} \sqrt{p(\mathbf{r}|\mathbf{x})\, p(\mathbf{r}|\hat{\mathbf{x}})}\, d\mathbf{r}, \tag{40}$$
and we have
$$J(\mathbf{x}) = \left. \frac{\partial^2 D(\mathbf{x} \| \hat{\mathbf{x}})}{\partial \hat{\mathbf{x}}\, \partial \hat{\mathbf{x}}^T} \right|_{\hat{\mathbf{x}} = \mathbf{x}} = \left. 4\, \frac{\partial^2 \beta_0 D_{\beta_0}(\mathbf{x} \| \hat{\mathbf{x}})}{\partial \hat{\mathbf{x}}\, \partial \hat{\mathbf{x}}^T} \right|_{\hat{\mathbf{x}} = \mathbf{x}} = \left. 8\, \frac{\partial^2 H_l^2(\mathbf{x} \| \hat{\mathbf{x}})}{\partial \hat{\mathbf{x}}\, \partial \hat{\mathbf{x}}^T} \right|_{\hat{\mathbf{x}} = \mathbf{x}}, \tag{41}$$
where $H_l(\mathbf{x} \| \hat{\mathbf{x}})$ is the Hellinger distance [32] between $p(\mathbf{r}|\mathbf{x})$ and $p(\mathbf{r}|\hat{\mathbf{x}})$:
$$H_l^2(\mathbf{x} \| \hat{\mathbf{x}}) = \frac{1}{2} \int_{\mathcal{R}} \left( \sqrt{p(\mathbf{r}|\mathbf{x})} - \sqrt{p(\mathbf{r}|\hat{\mathbf{x}})} \right)^2 d\mathbf{r}. \tag{42}$$
By Jensen's inequality, for $\beta \in (0,1)$ we get
$$0 \leq D_\beta(\mathbf{x} \| \hat{\mathbf{x}}) \leq D(\mathbf{x} \| \hat{\mathbf{x}}). \tag{43}$$
Denoting the Chernoff information [8] as
$$C(\mathbf{x} \| \hat{\mathbf{x}}) = \max_{\beta \in (0,1)} \beta D_\beta(\mathbf{x} \| \hat{\mathbf{x}}) = \beta_m D_{\beta_m}(\mathbf{x} \| \hat{\mathbf{x}}), \tag{44}$$
where $\beta D_\beta(\mathbf{x} \| \hat{\mathbf{x}})$ achieves its maximum at $\beta_m$, we have
$$I_{\beta,\alpha} - H(X) \leq h_c = -\sum_{m=1}^{M} p(\mathbf{x}_m) \ln \sum_{\hat{m}=1}^{M} \frac{p(\mathbf{x}_{\hat{m}})}{p(\mathbf{x}_m)} \exp\left( -C(\mathbf{x}_m \| \mathbf{x}_{\hat{m}}) \right) \tag{45}$$
$$\leq h_d = -\sum_{m=1}^{M} p(\mathbf{x}_m) \ln \sum_{\hat{m}=1}^{M} \frac{p(\mathbf{x}_{\hat{m}})}{p(\mathbf{x}_m)} \exp\left( -\beta_m D(\mathbf{x}_m \| \mathbf{x}_{\hat{m}}) \right). \tag{46}$$
By Theorem 4,
$$\max_{\beta \in (0,1)} I_{\beta,\alpha} = I_{\gamma_0} + O\left( N^{-1} \right), \tag{47}$$
$$I_{\gamma_0} = I_G - \frac{K}{2} \ln \frac{4}{e}. \tag{48}$$
If $\beta_m = 1/2$, then, by Equations (46)–(48) and (50), we have
$$\max_{\beta \in (0,1)} I_{\beta,\alpha} + \frac{K}{2} \ln \frac{4}{e} + O\left( N^{-1} \right) \leq I_e = I + O\left( N^{-1} \right) \leq h_d + H(X) \leq I_u. \tag{49}$$
Therefore, from Equations (45), (46) and (49), we can see that $I_e$ and $I$ are close to $h_c + H(X)$.

2.3. Approximations for Mutual Information

In this section, we use the relationships described above to find effective approximations to the true mutual information $I$ in the case of large but finite $N$. First, Theorems 1 and 2 tell us that the true mutual information $I$ and its approximation $I_e$ lie between lower and upper bounds given by $I_{\beta,\alpha} \leq I \leq I_u$ and $I_{\beta_1,1} \leq I_e \leq I_u$. As a special case, $I$ is also bounded by $I_{\beta_1,1} \leq I \leq I_u$. Furthermore, from Equations (2) and (36) we can obtain the following asymptotic equality under suitable conditions:
$$I = I_e + O\left( N^{-1} \right). \tag{50}$$
Hence, for continuous stimuli, we have the following approximate relationship for large $N$:
$$I \approx I_e \approx I_G. \tag{51}$$
For discrete stimuli, by Equation (31), for large but finite $N$ we have
$$I \approx I_e \approx I_d = -\sum_{m=1}^{M} p(\mathbf{x}_m) \ln \left( 1 + \sum_{\hat{m} \in \mathcal{M}_m^u} \frac{p(\mathbf{x}_{\hat{m}})}{p(\mathbf{x}_m)} \exp\left( -e^{-1} D(\mathbf{x}_m \| \mathbf{x}_{\hat{m}}) \right) \right) + H(X). \tag{52}$$
Consider the special case $p(\mathbf{x}_{\hat{m}}) \approx p(\mathbf{x}_m)$ for $\hat{m} \in \mathcal{M}_m^u$. With the help of Equation (18), substitution of $p(\mathbf{x}_{\hat{m}}) \approx p(\mathbf{x}_m)$ into Equation (52) yields
$$I \approx I_D = -\sum_{m=1}^{M} p(\mathbf{x}_m) \ln \left( 1 + \sum_{\hat{m} \in \mathcal{M}_m^u} \exp\left( -e^{-1} D(\mathbf{x}_m \| \mathbf{x}_{\hat{m}}) \right) \right) + H(X) \approx -\sum_{m=1}^{M} p(\mathbf{x}_m) \sum_{\hat{m} \in \mathcal{M}_m^u} \exp\left( -e^{-1} D(\mathbf{x}_m \| \mathbf{x}_{\hat{m}}) \right) + H(X) = I_{D_0}, \tag{53}$$
where $I_{D_0} \leq I_D$ and the second approximation follows from the first-order Taylor expansion $\ln(1+z) \approx z$, assuming that the term $\sum_{\hat{m} \in \mathcal{M}_m^u} \exp\left( -e^{-1} D(\mathbf{x}_m \| \mathbf{x}_{\hat{m}}) \right)$ is sufficiently small.
The theoretical discussion above suggests that I e and I d are effective approximations to true mutual information I in the limit of large N. Moreover, we find that they are often good approximations of mutual information I even for relatively small N, as illustrated in the following section.

3. Results of Numerical Simulations

Consider a population of Poisson model neurons whose responses (i.e., numbers of spikes within a given time window) follow a Poisson distribution [24]. The mean response of neuron $n$, with $n \in \{1, 2, \ldots, N\}$, is described by the tuning function $f(x; \theta_n)$, which takes the form of a Heaviside step function:
$$f(x; \theta_n) = \begin{cases} A, & \text{if } x \geq \theta_n, \\ 0, & \text{if } x < \theta_n, \end{cases} \tag{54}$$
where the stimulus $x \in [-T, T]$ with $T = 10$, $A = 10$, and the centers $\theta_1, \theta_2, \ldots, \theta_N$ of the $N$ neurons are uniformly spaced in the interval $[-T, T]$, namely, $\theta_n = (n-1)d - T$ with $d = 2T/(N-1)$ for $N \geq 2$, and $\theta_n = 0$ for $N = 1$. We suppose that the discrete stimulus $x$ has $M = 21$ possible values that are evenly spaced from $-T$ to $T$, namely, $x \in \mathcal{X} = \{ x_m : x_m = 2(m-1)T/(M-1) - T,\; m = 1, 2, \ldots, M \}$. Now, the Kullback-Leibler divergence can be written as
$$D(x_m \| x_{\hat{m}}) = \sum_{n=1}^{N} \left( f(x_m; \theta_n) \ln \frac{f(x_m; \theta_n)}{f(x_{\hat{m}}; \theta_n)} + f(x_{\hat{m}}; \theta_n) - f(x_m; \theta_n) \right). \tag{55}$$
Thus, $\exp\left( -e^{-1} D(x_m \| x_{\hat{m}}) \right)$ factorizes over the neurons, and the factor contributed by neuron $n$ equals $1$ when $f(x_m; \theta_n) = f(x_{\hat{m}}; \theta_n)$, equals $\exp\left( -e^{-1} A \right)$ when $f(x_m; \theta_n) = 0$ and $f(x_{\hat{m}}; \theta_n) = A$, and equals $0$ when $f(x_m; \theta_n) = A$ and $f(x_{\hat{m}}; \theta_n) = 0$. Therefore, in this case, we have
$$I_e = I_d. \tag{56}$$
More generally, this equality holds true whenever the tuning function has binary values.
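A minimal sketch of this first example is given below. It builds the Heaviside step tuning curves, accumulates the pairwise Kullback-Leibler divergences of Equation (55) over the independent Poisson neurons (treating $0 \ln 0$ as $0$ and assigning an infinite divergence whenever $f(x_m;\theta_n) = A$ but $f(x_{\hat m};\theta_n) = 0$), and then evaluates $I_e$ from Equation (13). The values $T = A = 10$ and $M = 21$ follow the text; $N = 200$ is one of the population sizes used in the figures, and the variable names are ours.

```python
import numpy as np

T, A, M, N = 10.0, 10.0, 21, 200
x = np.linspace(-T, T, M)                              # discrete stimulus values
theta = np.linspace(-T, T, N)                          # step-function thresholds
F = np.where(x[:, None] >= theta[None, :], A, 0.0)     # F[m, n] = f(x_m; theta_n)

# Pairwise KL divergences of Equation (55), summed over the independent neurons.
D = np.zeros((M, M))
for m in range(M):
    for mh in range(M):
        f, g = F[m], F[mh]
        if np.any((f > 0) & (g == 0)):
            D[m, mh] = np.inf                          # exp(-D/e) vanishes for such pairs
        else:
            pos = f > 0
            D[m, mh] = np.sum(f[pos] * np.log(f[pos] / g[pos])) + np.sum(g - f)

# I_e from Equation (13) with the uniform prior p(x_m) = 1/M of the first example.
p = np.full(M, 1.0 / M)
H_X = -np.sum(p * np.log(p))
inner = np.sum((p[None, :] / p[:, None]) * np.exp(-D / np.e), axis=1)
I_e = -np.sum(p * np.log(inner)) + H_X
print(f"I_e ~ {I_e:.3f} nats = {I_e / np.log(2):.3f} bits")
```

With these parameters the value comes out near the saturation level $\log_2 21 \approx 4.39$ bits discussed below, since $N = 200$ neurons are more than enough to distinguish the $M = 21$ stimuli.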
In the first example, as illustrated in Figure 1, we suppose the stimulus has a uniform distribution, so that the probability is given by $p(x_m) = 1/M$. Figure 1a shows graphs of the input distribution $p(x)$ and a representative tuning function $f(x; \theta)$ with the center $\theta = 0$.
To assess the accuracy of the approximation formulas, we employed Monte Carlo (MC) simulation to evaluate the mutual information $I$ [24]. In our MC simulation, we first sampled an input $x_j \in \mathcal{X}$ from the uniform distribution $p(x_j) = 1/M$, then generated the neural responses $\mathbf{r}_j$ by the conditional distribution $p(\mathbf{r}_j | x_j)$ based on the Poisson model, where $j = 1, 2, \ldots, j_{\max}$. The value of mutual information by MC simulation was calculated by
$$I_{MC} = \frac{1}{j_{\max}} \sum_{j=1}^{j_{\max}} \ln \frac{p(\mathbf{r}_j | x_j)}{p(\mathbf{r}_j)}, \tag{57}$$
where
$$p(\mathbf{r}_j) = \sum_{m=1}^{M} p(\mathbf{r}_j | x_m)\, p(x_m). \tag{58}$$
To assess the precision of our MC simulation, we computed the standard deviation of repeated trials by bootstrapping:
$$I_{std} = \sqrt{ \frac{1}{i_{\max}} \sum_{i=1}^{i_{\max}} \left( I_{MC}^i - \bar{I}_{MC} \right)^2 }, \tag{59}$$
where
$$I_{MC}^i = \frac{1}{j_{\max}} \sum_{j=1}^{j_{\max}} \ln \frac{p(\mathbf{r}_{\Gamma_{j,i}} | x_{\Gamma_{j,i}})}{p(\mathbf{r}_{\Gamma_{j,i}})}, \tag{60}$$
$$\bar{I}_{MC} = \frac{1}{i_{\max}} \sum_{i=1}^{i_{\max}} I_{MC}^i, \tag{61}$$
and $\Gamma_{j,i} \in \{1, 2, \ldots, j_{\max}\}$ is the $(j,i)$-th entry of the matrix $\Gamma \in \mathbb{N}^{j_{\max} \times i_{\max}}$, with samples taken randomly from the integer set $\{1, 2, \ldots, j_{\max}\}$ by a uniform distribution. Here, we set $j_{\max} = 5 \times 10^5$, $i_{\max} = 100$ and $M = 10^3$.
For different $N \in \{1, 2, 3, 4, 6, 10, 14, 20, 30, 50, 100, 200, 400, 700, 1000\}$, we compared $I_{MC}$ with $I_e$, $I_d$ and $I_D$, as illustrated in Figure 1b–d. Here, we define the relative error of approximation, e.g., for $I_e$, as
$$D_{I_e} = \frac{I_e - I_{MC}}{I_{MC}}, \tag{62}$$
and the relative standard deviation
$$D_{I_{std}} = \frac{I_{std}}{I_{MC}}. \tag{63}$$
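The Monte Carlo estimator of Equations (57)–(61) can be sketched as follows for the independent-Poisson model. The tuning setup repeats the step-function example above, and the sample sizes are deliberately much smaller than the $j_{\max} = 5 \times 10^5$ and $i_{\max} = 100$ used in the paper so that the example runs quickly; a log-sum-exp is used for $p(\mathbf{r}_j)$ to avoid numerical underflow.

```python
import numpy as np
from scipy.stats import poisson
from scipy.special import logsumexp

rng = np.random.default_rng(0)

# Step-tuning setup as before (T = A = 10, M = 21 stimuli, N = 200 neurons).
T, A, M, N = 10.0, 10.0, 21, 200
x, theta = np.linspace(-T, T, M), np.linspace(-T, T, N)
F = np.where(x[:, None] >= theta[None, :], A, 0.0)      # F[m, n] = f(x_m; theta_n)
p = np.full(M, 1.0 / M)

def mc_mutual_information(F, p, j_max=5000, i_max=50):
    """Monte Carlo estimate I_MC (Equation (57)) and a bootstrap standard
    deviation I_std (Equation (59)) for independent Poisson neurons."""
    M = F.shape[0]
    m_idx = rng.choice(M, size=j_max, p=p)              # stimuli x_j ~ p(x)
    R = rng.poisson(F[m_idx])                           # responses r_j ~ p(r | x_j)

    # log p(r_j | x_m) for every sample j and every candidate stimulus m
    logL = np.stack([poisson.logpmf(R, F[m]).sum(axis=1) for m in range(M)], axis=1)
    log_joint = logL[np.arange(j_max), m_idx]           # ln p(r_j | x_j)
    log_marg = logsumexp(logL + np.log(p)[None, :], axis=1)   # ln p(r_j), Equation (58)
    terms = log_joint - log_marg
    I_mc = terms.mean()

    # Bootstrap resampling over the j_max samples (Equations (59)-(61))
    boot = np.array([terms[rng.integers(0, j_max, j_max)].mean() for _ in range(i_max)])
    return I_mc, boot.std()

I_mc, I_std = mc_mutual_information(F, p)
print(f"I_MC = {I_mc:.3f} +/- {I_std:.3f} nats")
```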
For the second example, we only changed the probability distribution of the stimulus $p(x_m)$ while keeping all other conditions unchanged. Now, $p(x_m)$ is a discrete sample from a Gaussian function:
$$p(x_m) = Z^{-1} \exp\left( -\frac{x_m^2}{2 \sigma^2} \right), \quad m = 1, 2, \ldots, M, \tag{64}$$
where $Z$ is the normalization constant and $\sigma = T/2$. The results are illustrated in Figure 2.
Next, we changed each tuning function $f(x; \theta_n)$ to a rectified linear function:
$$f(x; \theta_n) = \max\left( 0, x - \theta_n \right). \tag{65}$$
Figure 3 and Figure 4 show the results under the same conditions as Figure 1 and Figure 2, respectively, except for the shape of the tuning functions.
Finally, we let the tuning function $f(x; \boldsymbol{\theta}_n)$ have a random form:
$$f(x; \boldsymbol{\theta}_n) = \begin{cases} B, & \text{if } x \in \boldsymbol{\theta}_n = \{ \theta_{n1}, \theta_{n2}, \ldots, \theta_{nK} \}, \\ 0, & \text{otherwise}, \end{cases} \tag{66}$$
where the stimulus $x \in \mathcal{X} = \{1, 2, \ldots, 999, 1000\}$, $B = 10$, and the values of $\theta_{n1}, \theta_{n2}, \ldots, \theta_{nK}$ are distinct and randomly selected from the set $\mathcal{X}$ with $K = 10$. In this example, we may regard $\mathcal{X}$ as a list of natural objects (stimuli), and there are a total of $N$ sensory neurons, each of which responds only to $K$ randomly selected objects. Figure 5 shows the results under the condition that $p(x)$ is a uniform distribution. In Figure 6, we assume that $p(x)$ is not flat but a half Gaussian given by Equation (64) with $\sigma = 500$.
In all these examples, we found that the three formulas, namely $I_e$, $I_d$ and $I_D$, provided excellent approximations to the true values of mutual information as evaluated by the Monte Carlo method. For instance, in Figure 1 and Figure 5, all three approximations were practically indistinguishable. In general, all these approximations were extremely accurate when $N > 100$.
In all our simulations, the mutual information tended to increase with the population size $N$, eventually reaching a plateau for large enough $N$. The saturation of information for large $N$ is due to the fact that it requires at most $\log_2 M$ bits of information to completely distinguish all $M$ stimuli. It is impossible to gain more information than this maximum amount regardless of how many neurons are used in the population. In Figure 1, for instance, this maximum is $\log_2 21 \approx 4.39$ bits, and in Figure 5, this maximum is $\log_2 1000 \approx 9.97$ bits.
For relatively small values of $N$, we found that $I_D$ tended to be less accurate than $I_e$ or $I_d$ (see Figure 5 and Figure 6). Our simulations also confirmed two analytical results. The first one is that $I_d = I_D$ when the stimulus distribution is uniform; this result follows directly from the definitions of $I_d$ and $I_D$ and is confirmed by the simulations in Figure 1, Figure 3, and Figure 5. The second result is that $I_d = I_e$ (Equation (56)) when the tuning function is binary, as confirmed by the simulations in Figure 1, Figure 2, Figure 5, and Figure 6. When the tuning function allows many different values, $I_e$ can be much more accurate than $I_d$ and $I_D$, as shown by the simulations in Figure 3 and Figure 4. To summarize, our best approximation formula is $I_e$ because it is more accurate than $I_d$ and $I_D$, and, unlike $I_d$ and $I_D$, it applies to both discrete and continuous stimuli (Equations (10) and (13)).

4. Discussion

We have derived several asymptotic bounds and effective approximations of mutual information for discrete variables and established several relationships among different approximations. Our final approximation formulas involve only Kullback-Leibler divergence, which is often easier to evaluate than Shannon mutual information in practical applications. Although in this paper our theory is developed in the framework of neural population coding with concrete examples, our mathematical results are generic and should hold true in many related situations beyond the original context.
We propose to approximate the mutual information with several asymptotic formulas, including $I_e$ in Equation (10) or Equation (13), $I_d$ in Equation (15) and $I_D$ in Equation (18). Our numerical results show that the three approximations $I_e$, $I_d$ and $I_D$ were very accurate for large population size $N$, and sometimes even for relatively small $N$. Among the three approximations, $I_D$ tended to be the least accurate, although, as a special case of $I_d$, it is slightly easier to evaluate than $I_d$. For a comparison of $I_e$ and $I_d$, we note that $I_e$ is the universal formula, whereas $I_d$ is restricted to discrete variables. The two formulas $I_e$ and $I_d$ become identical when the responses or the tuning functions have only two values. For more general tuning functions, the performance of $I_e$ was better than that of $I_d$ in our simulations.
As mentioned before, an advantage of $I_e$ is that it works not only for discrete stimuli but also for continuous stimuli. Theoretically speaking, the formula for $I_e$ is well justified, and we have proven that it approaches the true mutual information $I$ in the limit of a large population. In our numerical simulations, the performance of $I_e$ was excellent and better than that of $I_d$ and $I_D$. Overall, $I_e$ is our most accurate and versatile approximation formula, although, in some cases, $I_d$ and $I_D$ are slightly more convenient to calculate.
The numerical examples considered in this paper were based on an independent population of neurons whose responses have Poisson statistics. Although such models are widely used, they are appropriate only if the neural responses can be well characterized by the spike counts within a fixed time window. To study the temporal patterns of spike trains, one has to consider more complicated models. Estimation of mutual information from neural spike trains is a difficult computational problem [25,26,27,28]. In future work, it would be interesting to apply the asymptotic formulas such as $I_e$ to spike trains with small time bins each containing either one spike or nothing. A potential advantage of the asymptotic formula is that it might help reduce the bias caused by small samples in the calculation of the response marginal distribution $p(\mathbf{r}) = \sum_{\mathbf{x}} p(\mathbf{r}|\mathbf{x})\, p(\mathbf{x})$ or the response entropy $H(R)$, because here one only needs to calculate the Kullback-Leibler divergence $D(\mathbf{x} \| \hat{\mathbf{x}})$, which may have a smaller estimation error.
Finding effective approximation methods for computing mutual information is a key step for many practical applications of information theory. Generally speaking, the Kullback-Leibler divergence (Equation (7)) is often easier to evaluate and approximate than either the Chernoff information (Equation (44)) or the Shannon mutual information (Equation (1)). In situations where this is indeed the case, our approximation formulas are potentially useful. Besides applications in numerical simulations, the availability of a set of approximation formulas may also provide helpful theoretical tools in future analytical studies of information coding and representations.
As mentioned in the Introduction, various methods have been proposed to approximate the mutual information [9,10,11,12,13,14,15,16]. In future work, it would be useful to compare different methods rigorously under identical conditions in order to assess their relative merits. The approximation formulas developed in this paper are relatively easy to compute for practical problems. They are especially suitable for analytical purposes; for example, they could be used explicitly as objective functions for optimization or learning algorithms. Although the examples used in our simulations in this paper are parametric, it should be possible to extend the formulas to nonparametric problems, possibly with the help of the copula method to take advantage of its robustness in nonparametric estimations [16].

Author Contributions

W.H. developed and proved the theorems, programmed the numerical experiments and wrote the manuscript. K.Z. verified the proofs and revised the manuscript.

Funding

This research was supported by an NIH grant R01 DC013698.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. The Proofs

Appendix A.1. Proof of Theorem 1

By Jensen’s inequality, we have
$$I_{\beta,\alpha} = -\left\langle \ln \int_{\mathcal{X}} \left\langle \frac{p^\beta(\mathbf{r}|\hat{\mathbf{x}})\, p^\alpha(\hat{\mathbf{x}})}{p^\beta(\mathbf{r}|\mathbf{x})\, p^\alpha(\mathbf{x})} \right\rangle_{\mathbf{r}|\mathbf{x}} d\hat{\mathbf{x}} \right\rangle_{\mathbf{x}} + H(X) \leq -\left\langle \left\langle \ln \int_{\mathcal{X}} \frac{p^\beta(\mathbf{r}|\hat{\mathbf{x}})\, p^\alpha(\hat{\mathbf{x}})}{p^\beta(\mathbf{r}|\mathbf{x})\, p^\alpha(\mathbf{x})}\, d\hat{\mathbf{x}} \right\rangle_{\mathbf{r}|\mathbf{x}} \right\rangle_{\mathbf{x}} + H(X) \tag{A1}$$
and
$$-\left\langle \left\langle \ln \int_{\mathcal{X}} \frac{p^\beta(\mathbf{r}|\hat{\mathbf{x}})\, p^\alpha(\hat{\mathbf{x}})}{p^\beta(\mathbf{r}|\mathbf{x})\, p^\alpha(\mathbf{x})}\, d\hat{\mathbf{x}} \right\rangle_{\mathbf{r}|\mathbf{x}} \right\rangle_{\mathbf{x}} + H(X) - I = \left\langle \left\langle \ln \left( \frac{\int_{\mathcal{X}} p(\mathbf{r}, \hat{\mathbf{x}})\, d\hat{\mathbf{x}}}{p(\mathbf{r}, \mathbf{x})} \left( \int_{\mathcal{X}} \frac{p^\beta(\mathbf{r}|\hat{\mathbf{x}})\, p^\alpha(\hat{\mathbf{x}})}{p^\beta(\mathbf{r}|\mathbf{x})\, p^\alpha(\mathbf{x})}\, d\hat{\mathbf{x}} \right)^{-1} \right) \right\rangle_{\mathbf{r}|\mathbf{x}} \right\rangle_{\mathbf{x}} \leq \ln \int_{\mathcal{R}} p(\mathbf{r})\, \frac{\int_{\mathcal{X}} p^\beta(\mathbf{r}|\mathbf{x})\, p^\alpha(\mathbf{x})\, d\mathbf{x}}{\int_{\mathcal{X}} p^\beta(\mathbf{r}|\hat{\mathbf{x}})\, p^\alpha(\hat{\mathbf{x}})\, d\hat{\mathbf{x}}}\, d\mathbf{r} = 0. \tag{A2}$$
Combining Equations (A1) and (A2), we immediately get the lower bound in Equation (25).
In this section, we write the expressions for the variable $\mathbf{x}$ using integrals, although our argument is valid for both continuous and discrete variables. For discrete variables, we just need to replace each integral by a summation, and our argument remains valid without other modification. The same is true for the response variable $\mathbf{r}$.
To prove the upper bound, let
$$\Phi\left( q(\hat{\mathbf{x}}) \right) = \int_{\mathcal{R}} p(\mathbf{r}|\mathbf{x}) \int_{\mathcal{X}} q(\hat{\mathbf{x}}) \ln \frac{p(\mathbf{r}|\mathbf{x})\, q(\hat{\mathbf{x}})}{p(\mathbf{r}|\hat{\mathbf{x}})\, p(\hat{\mathbf{x}})}\, d\hat{\mathbf{x}}\, d\mathbf{r}, \tag{A3}$$
where $q(\hat{\mathbf{x}})$ satisfies
$$\int_{\mathcal{X}} q(\hat{\mathbf{x}})\, d\hat{\mathbf{x}} = 1, \qquad q(\hat{\mathbf{x}}) \geq 0. \tag{A4}$$
By Jensen’s inequality, we get
$$\Phi\left( q(\hat{\mathbf{x}}) \right) \geq \int_{\mathcal{R}} p(\mathbf{r}|\mathbf{x}) \ln \frac{p(\mathbf{r}|\mathbf{x})}{p(\mathbf{r})}\, d\mathbf{r}. \tag{A5}$$
To find a function q ( x ^ ) that minimizes Φ q ( x ^ ) , we apply the variational principle as follows:
$$\frac{\partial \tilde{\Phi}\left( q(\hat{\mathbf{x}}) \right)}{\partial q(\hat{\mathbf{x}})} = \int_{\mathcal{R}} p(\mathbf{r}|\mathbf{x}) \ln \frac{p(\mathbf{r}|\mathbf{x})\, q(\hat{\mathbf{x}})}{p(\mathbf{r}|\hat{\mathbf{x}})\, p(\hat{\mathbf{x}})}\, d\mathbf{r} + 1 + \lambda, \tag{A6}$$
where $\lambda$ is the Lagrange multiplier and
$$\tilde{\Phi}\left( q(\hat{\mathbf{x}}) \right) = \Phi\left( q(\hat{\mathbf{x}}) \right) + \lambda \left( \int_{\mathcal{X}} q(\hat{\mathbf{x}})\, d\hat{\mathbf{x}} - 1 \right). \tag{A7}$$
Setting $\partial \tilde{\Phi}\left( q(\hat{\mathbf{x}}) \right) / \partial q(\hat{\mathbf{x}}) = 0$ and using the constraint in Equation (A4), we find the optimal solution
$$q(\hat{\mathbf{x}}) = \frac{p(\hat{\mathbf{x}}) \exp\left( -D(\mathbf{x} \| \hat{\mathbf{x}}) \right)}{\int_{\mathcal{X}} p(\check{\mathbf{x}}) \exp\left( -D(\mathbf{x} \| \check{\mathbf{x}}) \right) d\check{\mathbf{x}}}. \tag{A8}$$
Thus, the variational lower bound of $\Phi\left( q(\hat{\mathbf{x}}) \right)$ is given by
$$\min_{q(\hat{\mathbf{x}})} \Phi\left( q(\hat{\mathbf{x}}) \right) = -\ln \int_{\mathcal{X}} p(\hat{\mathbf{x}}) \exp\left( -D(\mathbf{x} \| \hat{\mathbf{x}}) \right) d\hat{\mathbf{x}}. \tag{A9}$$
Therefore, from Equations (1), (A5) and (A9), we get the upper bound in Equation (25). This completes the proof of Theorem 1.

Appendix A.2. Proof of Theorem 2

It follows from Equation (43) that
$$I_{\beta_1,\alpha_1} = -\left\langle \ln \left\langle \exp\left( -\beta_1 D_{\beta_1}(\mathbf{x} \| \hat{\mathbf{x}}) + (1-\alpha_1) \ln \frac{p(\mathbf{x})}{p(\hat{\mathbf{x}})} \right) \right\rangle_{\hat{\mathbf{x}}} \right\rangle_{\mathbf{x}} \leq -\left\langle \ln \left\langle \exp\left( -e^{-1} D(\mathbf{x} \| \hat{\mathbf{x}}) \right) \right\rangle_{\hat{\mathbf{x}}} \right\rangle_{\mathbf{x}} = I_e \leq -\left\langle \ln \left\langle \exp\left( -D(\mathbf{x} \| \hat{\mathbf{x}}) \right) \right\rangle_{\hat{\mathbf{x}}} \right\rangle_{\mathbf{x}} = I_u, \tag{A10}$$
where $\beta_1 = e^{-1}$ and $\alpha_1 = 1$. We immediately get Equation (26). This completes the proof of Theorem 2.

Appendix A.3. Proof of Theorem 3

For the lower bound $I_{\beta,\alpha}$, we have
$$I_{\beta,\alpha} = -\sum_{m=1}^{M} p(\mathbf{x}_m) \ln \sum_{\check{m}=1}^{M} \left( \frac{p(\mathbf{x}_{\check{m}})}{p(\mathbf{x}_m)} \right)^{\alpha} \exp\left( -\beta D_\beta(\mathbf{x}_m \| \mathbf{x}_{\check{m}}) \right) + H(X) = -\sum_{m=1}^{M} p(\mathbf{x}_m) \ln \left( 1 + d(\mathbf{x}_m) \right) + H(X), \tag{A11}$$
where
$$d(\mathbf{x}_m) = \sum_{\check{m} \in \mathcal{M} \setminus \{m\}} \left( \frac{p(\mathbf{x}_{\check{m}})}{p(\mathbf{x}_m)} \right)^{\alpha} \exp\left( -\beta D_\beta(\mathbf{x}_m \| \mathbf{x}_{\check{m}}) \right). \tag{A12}$$
Now, consider
$$\ln\left( 1 + d(\mathbf{x}_m) \right) = \ln\left( 1 + a(\mathbf{x}_m) + b(\mathbf{x}_m) \right) = \ln\left( 1 + a(\mathbf{x}_m) \right) + \ln\left( 1 + \frac{b(\mathbf{x}_m)}{1 + a(\mathbf{x}_m)} \right) = \ln\left( 1 + a(\mathbf{x}_m) \right) + O\left( N^{-\gamma_1} \right), \tag{A13}$$
where
$$a(\mathbf{x}_m) = \sum_{\hat{m} \in \mathcal{M}_m^\beta} \left( \frac{p(\mathbf{x}_{\hat{m}})}{p(\mathbf{x}_m)} \right)^{\alpha} \exp\left( -\beta D_\beta(\mathbf{x}_m \| \mathbf{x}_{\hat{m}}) \right), \tag{A14}$$
$$b(\mathbf{x}_m) = \sum_{\check{m} \in \mathcal{M} \setminus (\mathcal{M}_m^\beta \cup \{m\})} \left( \frac{p(\mathbf{x}_{\check{m}})}{p(\mathbf{x}_m)} \right)^{\alpha} \exp\left( -\beta D_\beta(\mathbf{x}_m \| \mathbf{x}_{\check{m}}) \right) \leq N^{-\gamma_1} \sum_{\check{m} \in \mathcal{M} \setminus (\mathcal{M}_m^\beta \cup \{m\})} \left( \frac{p(\mathbf{x}_{\check{m}})}{p(\mathbf{x}_m)} \right)^{\alpha} = O\left( N^{-\gamma_1} \right).$$
Combining Equations (A11) and (A13) and Theorem 1, we get the lower bound in Equation (30). In a manner similar to the above, we can get the upper bound in Equations (30) and (31). This completes the proof of Theorem 3.

Appendix A.4. Proof of Theorem 4

The upper bound $I_u$ for the mutual information $I$ in Equation (25) can be written as
$$I_u = -\int_{\mathcal{X}} \ln \left( \int_{\mathcal{X}} p(\hat{\mathbf{x}}) \exp\left( -D(\mathbf{x} \| \hat{\mathbf{x}}) \right) d\hat{\mathbf{x}} \right) p(\mathbf{x})\, d\mathbf{x} = -\left\langle \ln \int_{\mathcal{X}} \exp\left( \left\langle L(\mathbf{r}|\hat{\mathbf{x}}) - L(\mathbf{r}|\mathbf{x}) \right\rangle_{\mathbf{r}|\mathbf{x}} \right) d\hat{\mathbf{x}} \right\rangle_{\mathbf{x}} + H(X), \tag{A15}$$
where $L(\mathbf{r}|\hat{\mathbf{x}}) = \ln\left( p(\mathbf{r}|\hat{\mathbf{x}})\, p(\hat{\mathbf{x}}) \right)$ and $L(\mathbf{r}|\mathbf{x}) = \ln\left( p(\mathbf{r}|\mathbf{x})\, p(\mathbf{x}) \right)$.
Consider the Taylor expansion of $L(\mathbf{r}|\hat{\mathbf{x}})$ around $\mathbf{x}$. Assuming that $L(\mathbf{r}|\hat{\mathbf{x}})$ is twice continuously differentiable for any $\hat{\mathbf{x}} \in \mathcal{X}_\omega(\mathbf{x})$, we get
$$\left\langle L(\mathbf{r}|\hat{\mathbf{x}}) - L(\mathbf{r}|\mathbf{x}) \right\rangle_{\mathbf{r}|\mathbf{x}} = \mathbf{y}^T \mathbf{v}_1 - \frac{1}{2} \mathbf{y}^T \mathbf{y} - \frac{1}{2} \mathbf{y}^T G^{-1/2}(\mathbf{x}) \left( G(\breve{\mathbf{x}}) - G(\mathbf{x}) \right) G^{-1/2}(\mathbf{x})\, \mathbf{y}, \tag{A16}$$
where
$$\mathbf{y} = G^{1/2}(\mathbf{x}) (\hat{\mathbf{x}} - \mathbf{x}), \tag{A17}$$
$$\mathbf{v}_1 = G^{-1/2}(\mathbf{x})\, q'(\mathbf{x}), \tag{A18}$$
and
$$\breve{\mathbf{x}} = \mathbf{x} + t (\hat{\mathbf{x}} - \mathbf{x}) \in \mathcal{X}_\omega(\mathbf{x}), \quad t \in [0, 1]. \tag{A19}$$
For later use, we also define
$$\mathbf{v} = G^{-1/2}(\mathbf{x})\, l'(\mathbf{r}|\mathbf{x}), \tag{A20}$$
where
$$l(\mathbf{r}|\mathbf{x}) = \ln p(\mathbf{r}|\mathbf{x}). \tag{A21}$$
Since $G(\breve{\mathbf{x}})$ is continuous and symmetric for $\breve{\mathbf{x}} \in \mathcal{X}$, for any $\epsilon \in (0, 1)$ there is an $\varepsilon \in (0, \omega)$ such that
$$\left| \mathbf{y}^T G^{-1/2}(\mathbf{x}) \left( G(\breve{\mathbf{x}}) - G(\mathbf{x}) \right) G^{-1/2}(\mathbf{x})\, \mathbf{y} \right| < \epsilon \| \mathbf{y} \|^2 \tag{A22}$$
for all $\mathbf{y} \in \mathcal{Y}_\varepsilon$, where $\mathcal{Y}_\varepsilon = \left\{ \mathbf{y} \in \mathbb{R}^K : \| \mathbf{y} \| < \varepsilon \sqrt{N} \right\}$. Then, we get
$$-\ln \int_{\mathcal{X}} \exp\left( \left\langle L(\mathbf{r}|\hat{\mathbf{x}}) - L(\mathbf{r}|\mathbf{x}) \right\rangle_{\mathbf{r}|\mathbf{x}} \right) d\hat{\mathbf{x}} \leq \frac{1}{2} \ln \det G(\mathbf{x}) - \ln \int_{\mathcal{Y}_\varepsilon} \exp\left( \mathbf{y}^T \mathbf{v}_1 - \frac{1}{2}(1+\epsilon)\, \mathbf{y}^T \mathbf{y} \right) d\mathbf{y} \tag{A23}$$
and, with Jensen's inequality,
$$\ln \int_{\mathcal{Y}_\varepsilon} \exp\left( \mathbf{y}^T \mathbf{v}_1 - \frac{1}{2}(1+\epsilon)\, \mathbf{y}^T \mathbf{y} \right) d\mathbf{y} \geq \ln \Psi_\varepsilon + \int_{\hat{\mathcal{Y}}_\varepsilon} \mathbf{y}^T \mathbf{v}_1\, \phi_\varepsilon(\mathbf{y})\, d\mathbf{y} = \frac{K}{2} \ln \frac{2\pi}{1+\epsilon} + O\left( N^{-K/2} e^{-N\delta} \right), \tag{A24}$$
where $\delta$ is a positive constant, $\int_{\hat{\mathcal{Y}}_\varepsilon} \mathbf{y}^T \mathbf{v}_1\, \phi_\varepsilon(\mathbf{y})\, d\mathbf{y} = 0$,
$$\phi_\varepsilon(\mathbf{y}) = \Psi_\varepsilon^{-1} \exp\left( -\frac{1}{2}(1+\epsilon)\, \mathbf{y}^T \mathbf{y} \right), \qquad \Psi_\varepsilon = \int_{\hat{\mathcal{Y}}_\varepsilon} \exp\left( -\frac{1}{2}(1+\epsilon)\, \mathbf{y}^T \mathbf{y} \right) d\mathbf{y}, \tag{A25}$$
and
$$\hat{\mathcal{Y}}_\varepsilon = \left\{ \mathbf{y} \in \mathbb{R}^K : |y_k| < \varepsilon \sqrt{N/K},\; k = 1, 2, \ldots, K \right\} \subseteq \mathcal{Y}_\varepsilon. \tag{A26}$$
Now, we evaluate
$$\Psi_\varepsilon = \int_{\mathbb{R}^K} \exp\left( -\frac{1}{2}(1+\epsilon)\, \mathbf{y}^T \mathbf{y} \right) d\mathbf{y} - \int_{\mathbb{R}^K \setminus \hat{\mathcal{Y}}_\varepsilon} \exp\left( -\frac{1}{2}(1+\epsilon)\, \mathbf{y}^T \mathbf{y} \right) d\mathbf{y} = \left( \frac{2\pi}{1+\epsilon} \right)^{K/2} - \int_{\mathbb{R}^K \setminus \hat{\mathcal{Y}}_\varepsilon} \exp\left( -\frac{1}{2}(1+\epsilon)\, \mathbf{y}^T \mathbf{y} \right) d\mathbf{y}. \tag{A27}$$
Performing integration by parts with $\int_a^\infty e^{-t^2/2}\, dt = \frac{e^{-a^2/2}}{a} - \int_a^\infty \frac{e^{-t^2/2}}{t^2}\, dt$, we find
$$\int_{\mathbb{R}^K \setminus \hat{\mathcal{Y}}_\varepsilon} \exp\left( -\frac{1}{2}(1+\epsilon)\, \mathbf{y}^T \mathbf{y} \right) d\mathbf{y} = O\left( N^{-K/2} e^{-N\delta} \right), \tag{A28}$$
for some constant δ > 0 .
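As a quick sanity check, the integration-by-parts identity used in this step can be verified numerically; the value of $a$ below is an arbitrary choice and the snippet is not part of the original derivation.

```python
import numpy as np
from scipy.integrate import quad

a = 1.7
lhs, _ = quad(lambda t: np.exp(-t ** 2 / 2), a, np.inf)
tail, _ = quad(lambda t: np.exp(-t ** 2 / 2) / t ** 2, a, np.inf)
rhs = np.exp(-a ** 2 / 2) / a - tail
print(lhs, rhs)          # the two values agree to numerical precision
```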
Combining Equations (A15), (A23) and (A24), we get
$$I_u \leq \frac{1}{2} \left\langle \ln \det\left( \frac{1+\epsilon}{2\pi} G(\mathbf{x}) \right) \right\rangle_{\mathbf{x}} + H(X) + O\left( N^{-K/2} e^{-N\delta} \right). \tag{A29}$$
On the other hand, from Equation (A22) and the condition in Equation (33), we obtain
$$\int_{\mathcal{X}_\varepsilon(\mathbf{x})} \exp\left( \left\langle L(\mathbf{r}|\hat{\mathbf{x}}) - L(\mathbf{r}|\mathbf{x}) \right\rangle_{\mathbf{r}|\mathbf{x}} \right) d\hat{\mathbf{x}} \leq \det\left( G(\mathbf{x}) \right)^{-1/2} \int_{\mathbb{R}^K} \exp\left( \mathbf{y}^T \mathbf{v}_1 - \frac{1}{2}(1-\epsilon)\, \mathbf{y}^T \mathbf{y} \right) d\mathbf{y} = \det\left( \frac{1-\epsilon}{2\pi} G(\mathbf{x}) \right)^{-1/2} \exp\left( \frac{1}{2}(1-\epsilon)^{-1} \mathbf{v}_1^T \mathbf{v}_1 \right) \tag{A30}$$
and
$$\int_{\mathcal{X}} \exp\left( \left\langle L(\mathbf{r}|\hat{\mathbf{x}}) - L(\mathbf{r}|\mathbf{x}) \right\rangle_{\mathbf{r}|\mathbf{x}} \right) d\hat{\mathbf{x}} = \int_{\mathcal{X}_\varepsilon(\mathbf{x})} \exp\left( \left\langle L(\mathbf{r}|\hat{\mathbf{x}}) - L(\mathbf{r}|\mathbf{x}) \right\rangle_{\mathbf{r}|\mathbf{x}} \right) d\hat{\mathbf{x}} + \int_{\mathcal{X} \setminus \mathcal{X}_\varepsilon(\mathbf{x})} \exp\left( \left\langle L(\mathbf{r}|\hat{\mathbf{x}}) - L(\mathbf{r}|\mathbf{x}) \right\rangle_{\mathbf{r}|\mathbf{x}} \right) d\hat{\mathbf{x}} \leq \det\left( \frac{1-\epsilon}{2\pi} G(\mathbf{x}) \right)^{-1/2} \left( \exp\left( \frac{\mathbf{v}_1^T \mathbf{v}_1}{2(1-\epsilon)} \right) + O\left( N^{-1} \right) \right). \tag{A31}$$
It follows from Equations (A15) and (A31) that
$$-\left\langle \ln \int_{\mathcal{X}} \exp\left( \left\langle L(\mathbf{r}|\hat{\mathbf{x}}) - L(\mathbf{r}|\mathbf{x}) \right\rangle_{\mathbf{r}|\mathbf{x}} \right) d\hat{\mathbf{x}} \right\rangle_{\mathbf{x}} \geq \frac{1}{2} \left\langle \ln \det\left( \frac{1-\epsilon}{2\pi} G(\mathbf{x}) \right) \right\rangle_{\mathbf{x}} - \frac{1}{2}(1-\epsilon)^{-1} \left\langle \mathbf{v}_1^T \mathbf{v}_1 \right\rangle_{\mathbf{x}} + O\left( N^{-1} \right). \tag{A32}$$
Note that
$$\left\langle \mathbf{v}_1^T \mathbf{v}_1 \right\rangle_{\mathbf{x}} = O\left( N^{-1} \right). \tag{A33}$$
Now, we have
$$I_u \geq \frac{1}{2} \left\langle \ln \det\left( \frac{1-\epsilon}{2\pi} G(\mathbf{x}) \right) \right\rangle_{\mathbf{x}} + H(X) + O\left( N^{-1} \right). \tag{A34}$$
Since ϵ is arbitrary, we can let it go to zero. Therefore, from Equations (25), (A29) and (A34), we obtain the upper bound in Equation (35).
The Taylor expansion of $h(\hat{\mathbf{x}}, \mathbf{x}) = \left\langle \left( \frac{p(\mathbf{r}|\hat{\mathbf{x}})}{p(\mathbf{r}|\mathbf{x})} \right)^\beta \right\rangle_{\mathbf{r}|\mathbf{x}}$ around $\mathbf{x}$ is
$$h(\hat{\mathbf{x}}, \mathbf{x}) = 1 + \beta \left\langle \frac{1}{p(\mathbf{r}|\mathbf{x})} \frac{\partial p(\mathbf{r}|\mathbf{x})}{\partial \mathbf{x}^T} \right\rangle_{\mathbf{r}|\mathbf{x}} (\hat{\mathbf{x}} - \mathbf{x}) + (\hat{\mathbf{x}} - \mathbf{x})^T\, \frac{\beta}{2} \left\langle \frac{\beta - 1}{p^2(\mathbf{r}|\mathbf{x})} \frac{\partial p(\mathbf{r}|\mathbf{x})}{\partial \mathbf{x}} \frac{\partial p(\mathbf{r}|\mathbf{x})}{\partial \mathbf{x}^T} + \frac{1}{p(\mathbf{r}|\mathbf{x})} \frac{\partial^2 p(\mathbf{r}|\mathbf{x})}{\partial \mathbf{x}\, \partial \mathbf{x}^T} \right\rangle_{\mathbf{r}|\mathbf{x}} (\hat{\mathbf{x}} - \mathbf{x}) + \cdots = 1 - \frac{\beta(1-\beta)}{2} (\hat{\mathbf{x}} - \mathbf{x})^T J(\mathbf{x}) (\hat{\mathbf{x}} - \mathbf{x}) + \cdots. \tag{A35}$$
In a similar manner as described above, we obtain the asymptotic relationship (37):
$$I_{\beta,\alpha} = I_\gamma + O\left( N^{-1} \right) = \frac{1}{2} \int_{\mathcal{X}} p(\mathbf{x}) \ln \det \frac{G_\gamma(\mathbf{x})}{2\pi}\, d\mathbf{x} + H(X) + O\left( N^{-1} \right). \tag{A36}$$
Notice that $0 < \gamma = \beta(1-\beta) \leq 1/4$, and the equality holds when $\beta = \beta_0 = 1/2$. Thus, we have
$$\det G_\gamma(\mathbf{x}) \leq \det G_{\gamma_0}(\mathbf{x}). \tag{A37}$$
Combining Equations (25), (A36) and (A37) yields the lower bound in Equation (35).
The proof of Equation (36) is similar. This completes the proof of Theorem 4.

References

  1. Borst, A.; Theunissen, F.E. Information theory and neural coding. Nat. Neurosci. 1999, 2, 947–957. [Google Scholar] [CrossRef] [PubMed]
  2. Pouget, A.; Dayan, P.; Zemel, R. Information processing with population codes. Nat. Rev. Neurosci. 2000, 1, 125–132. [Google Scholar] [CrossRef] [PubMed]
  3. Laughlin, S.B.; Sejnowski, T.J. Communication in neuronal networks. Science 2003, 301, 1870–1874. [Google Scholar] [CrossRef] [PubMed]
  4. Brown, E.N.; Kass, R.E.; Mitra, P.P. Multiple neural spike train data analysis: State-of-the-art and future challenges. Nat. Neurosci. 2004, 7, 456–461. [Google Scholar] [CrossRef] [PubMed]
  5. Bell, A.J.; Sejnowski, T.J. The “independent components” of natural scenes are edge filters. Vision Res. 1997, 37, 3327–3338. [Google Scholar] [CrossRef] [Green Version]
  6. Huang, W.; Zhang, K. An Information-Theoretic Framework for Fast and Robust Unsupervised Learning via Neural Population Infomax. In Proceedings of the 5th International Conference on Learning Representations (ICLR), Toulon, France, 24–26 April 2017. [Google Scholar]
  7. Huang, W.; Huang, X.; Zhang, K. Information-theoretic interpretation of tuning curves for multiple motion directions. In Proceedings of the 2017 51st Annual Conference on Information Sciences and Systems (CISS), Baltimore, MD, USA, 22–24 March 2017. [Google Scholar]
  8. Cover, T.M.; Thomas, J.A. Elements of Information Theory, 2nd ed.; Wiley-Interscience: New York, NY, USA, 2006. [Google Scholar]
  9. Miller, G.A. Note on the bias of information estimates. Inf. Theory Psychol. Probl. Methods 1955, 2, 100. [Google Scholar]
  10. Carlton, A. On the bias of information estimates. Psychol. Bull. 1969, 71, 108. [Google Scholar] [CrossRef]
  11. Treves, A.; Panzeri, S. The upward bias in measures of information derived from limited data samples. Neural Comput. 1995, 7, 399–407. [Google Scholar] [CrossRef]
  12. Victor, J.D. Asymptotic bias in information estimates and the exponential (Bell) polynomials. Neural Comput. 2000, 12, 2797–2804. [Google Scholar] [CrossRef] [PubMed]
  13. Paninski, L. Estimation of entropy and mutual information. Neural Comput. 2003, 15, 1191–1253. [Google Scholar] [CrossRef]
  14. Kraskov, A.; Stögbauer, H.; Grassberger, P. Estimating mutual information. Phys. Rev. E 2004, 69, 066138. [Google Scholar] [CrossRef] [PubMed]
  15. Khan, S.; Bandyopadhyay, S.; Ganguly, A.R.; Saigal, S.; Erickson, D.J., III; Protopopescu, V.; Ostrouchov, G. Relative performance of mutual information estimation methods for quantifying the dependence among short and noisy data. Phys. Rev. E 2007, 76, 026209. [Google Scholar] [CrossRef] [PubMed]
  16. Safaai, H.; Onken, A.; Harvey, C.D.; Panzeri, S. Information estimation using nonparametric copulas. Phys. Rev. E 2018, 98, 053302. [Google Scholar] [CrossRef]
  17. Rao, C.R. Information and accuracy attainable in the estimation of statistical parameters. Bull. Calcutta Math. Soc. 1945, 37, 81–91. [Google Scholar]
  18. Van Trees, H.L.; Bell, K.L. Bayesian Bounds for Parameter Estimation and Nonlinear Filtering/Tracking; John Wiley: Piscataway, NJ, USA, 2007. [Google Scholar]
  19. Clarke, B.S.; Barron, A.R. Information-theoretic asymptotics of Bayes methods. IEEE Trans. Inform. Theory 1990, 36, 453–471. [Google Scholar] [CrossRef] [Green Version]
  20. Rissanen, J.J. Fisher information and stochastic complexity. IEEE Trans. Inform. Theory 1996, 42, 40–47. [Google Scholar] [CrossRef]
  21. Brunel, N.; Nadal, J.P. Mutual information, Fisher information, and population coding. Neural Comput. 1998, 10, 1731–1757. [Google Scholar] [CrossRef] [PubMed]
  22. Sompolinsky, H.; Yoon, H.; Kang, K.J.; Shamir, M. Population coding in neuronal systems with correlated noise. Phys. Rev. E 2001, 64, 051904. [Google Scholar] [CrossRef] [PubMed]
  23. Kang, K.; Sompolinsky, H. Mutual information of population codes and distance measures in probability space. Phys. Rev. Lett. 2001, 86, 4958–4961. [Google Scholar] [CrossRef] [PubMed]
  24. Huang, W.; Zhang, K. Information-theoretic bounds and approximations in neural population coding. Neural Comput. 2018, 30, 885–944. [Google Scholar] [CrossRef] [PubMed]
  25. Strong, S.P.; Koberle, R.; van Steveninck, R.R.D.R.; Bialek, W. Entropy and information in neural spike trains. Phys. Rev. Lett. 1998, 80, 197. [Google Scholar] [CrossRef]
  26. Nemenman, I.; Bialek, W.; van Steveninck, R.D.R. Entropy and information in neural spike trains: Progress on the sampling problem. Phys. Rev. E 2004, 69, 056111. [Google Scholar] [CrossRef] [PubMed]
  27. Panzeri, S.; Senatore, R.; Montemurro, M.A.; Petersen, R.S. Correcting for the sampling bias problem in spike train information measures. J. Neurophysiol. 2007, 98, 1064–1072. [Google Scholar] [CrossRef] [PubMed]
  28. Houghton, C. Calculating the Mutual Information Between Two Spike Trains. Neural Comput. 2019, 31, 330–343. [Google Scholar] [CrossRef] [PubMed]
  29. Rényi, A. On measures of entropy and information. In Fourth Berkeley Symposium on Mathematical Statistics and Probability; The Regents of the University of California, University of California Press: Berkeley, CA, USA, 1961; pp. 547–561. [Google Scholar]
  30. Chernoff, H. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. Ann. Math. Stat. 1952, 23, 493–507. [Google Scholar] [CrossRef]
  31. Bhattacharyya, A. On a measure of divergence between two statistical populations defined by their probability distributions. Bull. Calcutta Math. Soc. 1943, 35, 99–109. [Google Scholar]
  32. Beran, R. Minimum Hellinger distance estimates for parametric models. Ann. Stat. 1977, 5, 445–463. [Google Scholar] [CrossRef]
Figure 1. A comparison of approximations $I_e$, $I_d$ and $I_D$ against $I_{MC}$ obtained by the Monte Carlo method for one-dimensional discrete stimuli. (a) Discrete uniform distribution of the stimulus $p(x)$ (black dots) and the Heaviside step tuning function $f(x;\theta)$ with center $\theta = 0$ (blue dashed lines); (b) The values of $I_{MC}$, $I_e$, $I_d$ and $I_D$ depend on the population size or total number of neurons $N$; (c) The relative errors $D_{I_e}$, $D_{I_d}$ and $D_{I_D}$ for the results in (b); (d) The absolute values of the relative errors $|D_{I_e}|$, $|D_{I_d}|$ and $|D_{I_D}|$ as in (c), with error bars showing standard deviations of repeated trials.
Figure 2. A comparison of approximations $I_e$, $I_d$ and $I_D$ against $I_{MC}$. The situation is identical to that in Figure 1 except that the stimulus distribution $p(x)$ is peaked rather than flat (black dots in (a)). (a) Discrete Gaussian-like distribution of the stimulus $p(x)$ (black dots) and the Heaviside step tuning function $f(x;\theta)$ with center $\theta = 0$ (blue dashed lines); (b) The values of $I_{MC}$, $I_e$, $I_d$ and $I_D$ depend on the population size or total number of neurons $N$; (c) The relative errors $D_{I_e}$, $D_{I_d}$ and $D_{I_D}$ for the results in (b); (d) The absolute values of the relative errors $|D_{I_e}|$, $|D_{I_d}|$ and $|D_{I_D}|$ as in (c), with error bars showing standard deviations of repeated trials.
Figure 3. A comparison of approximations $I_e$, $I_d$ and $I_D$ against $I_{MC}$. The situation is identical to that in Figure 1 except for the shape of the tuning function (blue dashed lines in (a)). (a) Discrete uniform distribution of the stimulus $p(x)$ (black dots) and the rectified linear tuning function $f(x;\theta)$ with center $\theta = 0$ (blue dashed lines); (b) The values of $I_{MC}$, $I_e$, $I_d$ and $I_D$ depend on the population size or total number of neurons $N$; (c) The relative errors $D_{I_e}$, $D_{I_d}$ and $D_{I_D}$ for the results in (b); (d) The absolute values of the relative errors $|D_{I_e}|$, $|D_{I_d}|$ and $|D_{I_D}|$ as in (c), with error bars showing standard deviations of repeated trials.
Figure 4. A comparison of approximations $I_e$, $I_d$ and $I_D$ against $I_{MC}$. The situation is identical to that in Figure 3 except that the stimulus distribution $p(x)$ is peaked rather than flat (black dots in (a)). (a) Discrete Gaussian-like distribution of the stimulus $p(x)$ (black dots) and the rectified linear tuning function $f(x;\theta)$ with center $\theta = 0$ (blue dashed lines); (b) The values of $I_{MC}$, $I_e$, $I_d$ and $I_D$ depend on the population size or total number of neurons $N$; (c) The relative errors $D_{I_e}$, $D_{I_d}$ and $D_{I_D}$ for the results in (b); (d) The absolute values of the relative errors $|D_{I_e}|$, $|D_{I_d}|$ and $|D_{I_D}|$ as in (c), with error bars showing standard deviations of repeated trials.
Figure 5. A comparison of approximations $I_e$, $I_d$ and $I_D$ against $I_{MC}$. The situation is similar to that in Figure 1 except that the tuning function is random (blue dashed lines in (a)); see Equation (66). (a) Discrete uniform distribution of the stimulus $p(x)$ (black dots) and the random tuning function $f(x;\boldsymbol{\theta})$; (b) The values of $I_{MC}$, $I_e$, $I_d$ and $I_D$ depend on the population size or total number of neurons $N$; (c) The relative errors $D_{I_e}$, $D_{I_d}$ and $D_{I_D}$ for the results in (b); (d) The absolute values of the relative errors $|D_{I_e}|$, $|D_{I_d}|$ and $|D_{I_D}|$ as in (c), with error bars showing standard deviations of repeated trials.
Figure 6. A comparison of approximations $I_e$, $I_d$ and $I_D$ against $I_{MC}$. The situation is identical to that in Figure 5 except that the stimulus distribution $p(x)$ is not flat (black dots in (a)). (a) Discrete Gaussian-like distribution of the stimulus $p(x)$ (black dots) and the random tuning function $f(x;\boldsymbol{\theta})$; (b) The values of $I_{MC}$, $I_e$, $I_d$ and $I_D$ depend on the population size or total number of neurons $N$; (c) The relative errors $D_{I_e}$, $D_{I_d}$ and $D_{I_D}$ for the results in (b); (d) The absolute values of the relative errors $|D_{I_e}|$, $|D_{I_d}|$ and $|D_{I_D}|$ as in (c), with error bars showing standard deviations of repeated trials.
