Neural Population Coding and Approximations of Mutual Information for Discrete Variables

Wentao Huang 1,2,* and Kechen Zhang 2,*
1 Key Laboratory of Cognition and Intelligence and Information Science Academy of China Electronics Technology Group Corporation, Beijing 100846, China; whuang21@jhmi.edu
2 Department of Biomedical Engineering, Johns Hopkins University School of Medicine, Baltimore, MD 21205, U.S.A.; kzhang4@jhmi.edu
* Correspondence: whuang21@jhmi.edu, kzhang4@jhmi.edu; Tel.: +1-443-287-5080
Version December 18, 2018, submitted to Preprints


Introduction
Information theory is a powerful tool with widespread applications in diverse fields such as neuroscience, machine learning, and information and communication technology [1-7]. However, in many applications it is notoriously difficult to calculate Shannon mutual information effectively [8]. Various methods have been proposed to approximate the mutual information, such as those based on asymptotic expansions [9-13], k-nearest neighbors [14], and minimal spanning trees [15]. Another approach is to simplify the calculations with approximations based on information-theoretic bounds, such as the Cramér-Rao lower bound [16] and van Trees' Bayesian Cramér-Rao bound [17].
Recently, Huang and Zhang [23] proposed a more precise approximation formula which remains very accurate for high-dimensional variables. Unfortunately, this approach does not generalize to discrete variables because the calculation of Fisher information requires partial derivatives of the likelihood function with respect to the encoded variables. For the encoding of discrete variables, Kang and Sompolinsky [22] presented an asymptotic relationship between mutual information and Chernoff information for statistically independent neurons in a large population. However, Chernoff information is still hard to calculate in many practical applications.
In this paper, we present several information metrics to approximate the mutual information for discrete variables. While some input variables or stimuli are naturally continuous, such as movement direction, luminance level, and the pitch of a tone, other stimuli are naturally discrete, such as the types of molecules in olfaction, distinct visual objects, the identity of faces, and words in human speech. For definiteness, we frame our questions in the context of neural population coding, where physical variables are encoded in the responses of a population of neurons. Nonetheless, our mathematical results are quite general and should be applicable to any large input-output system that satisfies suitable conditions as it scales up. For example, independent sampling from repeated trials can be mathematically equivalent to having a population of neurons with independent responses.
In the following, we first derive several upper and lower bounds on Shannon mutual information using the Kullback-Leibler divergence and the Rényi divergence. Next, we derive several new approximation formulas for the mutual information in the limit of a large population or a large sample size. In particular, our approximation formula I_e (see Eq. 10) is valid for both discrete and continuous variables. The approximation formulas I_d and I_D (see Eqs. 15 and 18) are applicable only to discrete variables, and they are more convenient to calculate than the mutual information in some situations. Finally, we use numerical simulations to confirm the validity of our approximation formulas by comparing them against values of mutual information obtained from Monte Carlo simulations.

Notations and Definitions
Suppose the input x is a K-dimensional vector, x = (x_1, ..., x_K)^T, which may be interpreted as the parameters that specify a stimulus for a sensory system, and the output r is an N-dimensional vector, r = (r_1, ..., r_N)^T, which may be interpreted as the responses of N neurons. We assume N is large, generally N ≫ K. We denote random variables by upper-case letters, e.g., random variables X and R, in contrast to their vector values x and r. The mutual information between X and R is defined by

I(X; R) = ⟨ ln( p(r|x) / p(r) ) ⟩_{r,x},    (1)

where x ∈ X ⊆ R^K, r ∈ R ⊆ R^N, and ⟨•⟩_{r,x} denotes the expectation with respect to the probability density function p(r, x). Similarly, in the following we use ⟨•⟩_{r|x} and ⟨•⟩_x to denote expectations with respect to p(r|x) and p(x), respectively.
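As a concrete illustration, the expectation in Eq. 1 can be estimated directly by Monte Carlo sampling. The Python sketch below assumes a toy model of independent Poisson neurons with an arbitrary rate matrix; the model and all parameter values are illustrative assumptions rather than the setup used later in the paper.

```python
# Minimal sketch: estimate I(X; R) = < ln[ p(r|x) / p(r) ] >_{r,x} by direct
# Monte Carlo sampling, for a toy population of independent Poisson neurons
# responding to M discrete stimuli.  The rate matrix is an arbitrary choice.
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(0)
M, N = 5, 20                                  # number of stimuli and neurons (assumed)
p_x = np.full(M, 1.0 / M)                     # stimulus prior p(x_m), here uniform
rates = rng.uniform(0.5, 5.0, size=(M, N))    # mean responses f_n(x_m) (assumed)

def mc_mutual_information(n_samples=50_000):
    """Average of ln p(r|x) - ln p(r) over samples (x, r) ~ p(x) p(r|x), in bits."""
    total = 0.0
    for _ in range(n_samples):
        m = rng.choice(M, p=p_x)              # draw a stimulus x_m ~ p(x)
        r = rng.poisson(rates[m])             # draw a response r ~ p(r | x_m)
        log_lik = poisson.logpmf(r, rates).sum(axis=1)        # ln p(r | x_m') for every m'
        log_p_r = np.logaddexp.reduce(log_lik + np.log(p_x))  # ln p(r) = ln sum_m p(x_m) p(r|x_m)
        total += log_lik[m] - log_p_r
    return total / n_samples / np.log(2.0)    # convert nats to bits

print("I_MC ~= %.3f bits (cannot exceed log2 M = %.3f bits)"
      % (mc_mutual_information(), np.log2(M)))
```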
If p(x) and p(r|x) are twice continuously differentiable for almost every x ∈ X, then for large N we can use an asymptotic formula to approximate the true value of I with high accuracy [23]:

I ≈ I_G = (1/2) ⟨ ln det( G(x) / (2πe) ) ⟩_x + H(X),    (2)

which is sometimes reduced to

I_F = (1/2) ⟨ ln det( J(x) / (2πe) ) ⟩_x + H(X),    (3)

where det(•) denotes the matrix determinant, H(X) = −⟨ln p(x)⟩_x is the stimulus entropy, G(x) is defined in [23], and

J(x) = ⟨ (∂ ln p(r|x)/∂x) (∂ ln p(r|x)/∂x)^T ⟩_{r|x}

is the Fisher information matrix. We denote the Kullback-Leibler divergence as

D(x ‖ x̂) = ⟨ ln( p(r|x) / p(r|x̂) ) ⟩_{r|x}.    (7)
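For the Poisson population model used later in the simulations, both the Fisher information and the Kullback-Leibler divergence have simple closed forms, which makes the quantities above easy to compute. The sketch below uses Gaussian tuning curves as a hypothetical example; the closed-form expressions are standard properties of independent Poisson responses and are not specific to this paper.

```python
# A small sketch of two quantities used throughout the text, specialized to a
# population of independent Poisson neurons.  The tuning functions and their
# parameters are hypothetical placeholders.
import numpy as np

def kl_divergence_poisson(f_x, f_xhat):
    """D(x || xhat) = sum_n [ f_n(x) ln(f_n(x)/f_n(xhat)) - f_n(x) + f_n(xhat) ]."""
    return np.sum(f_x * np.log(f_x / f_xhat) - f_x + f_xhat)

def fisher_information_1d(f_x, df_x):
    """J(x) = sum_n f_n'(x)^2 / f_n(x) for a one-dimensional stimulus x."""
    return np.sum(df_x ** 2 / f_x)

# Example with Gaussian tuning curves f_n(x) = a * exp(-(x - theta_n)^2 / (2 w^2)).
theta = np.linspace(-2.0, 2.0, 50)            # preferred stimuli (assumed)
a, w = 10.0, 0.5                              # peak rate and tuning width (assumed)
f  = lambda x: a * np.exp(-(x - theta) ** 2 / (2 * w ** 2))
df = lambda x: f(x) * (theta - x) / w ** 2    # derivative of f with respect to x

x, xhat = 0.0, 0.3
print("D(x||xhat) =", kl_divergence_poisson(f(x), f(xhat)))
print("J(x)       =", fisher_information_1d(f(x), df(x)))
```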
In the following we suppose x takes M discrete values, x_m, m ∈ M = {1, 2, ..., M}, and p(x_m) > 0 for all m. By the definitions above in (9)-(11), we obtain the corresponding quantities for discrete x. Furthermore, we define the index sets M_β^m and M_u^m that appear in the theorems below.

Theorems
In the following we state several conclusions as theorems and prove them in the Appendix.

Theorem 1. The mutual information I has the following bounds:

Theorem 2. The following inequalities are satisfied, where β_1 = e^{-1} and, according to Eq. 11,

Theorem 3. If there exist γ_1 > 0 and γ_2 > 0 such that, for discrete stimuli x_m, where m ∈ M, m_1 ∈ M − M_β^m and m_2 ∈ M − M_u^m, then we have the following asymptotic relationships:

Theorem 4. Suppose p(x) and p(r|x) are twice continuously differentiable for x ∈ X, with ‖q′(x)‖ < ∞ and ‖q″(x)‖ < ∞, where q(x) = ln p(x) and ′ and ″ denote the partial derivatives ∂/∂x and ∂²/∂x∂x^T, and suppose the conditions below hold for all x ∈ X and ε ∈ (0, ω), where ‖•‖ denotes the matrix Frobenius norm and X_ω(x) denotes the corresponding neighborhood of x; then we have the following asymptotic relationships:

Remark 1. To see how condition (33) could be satisfied, consider the case where D(x ‖ x̂) has only one extreme point at x̂ = x for x̂ ∈ X_ω(x) and there exists an η > 0 such that the required inequality holds, where by assumption we can find an η(ε) > 0 for any given ε ∈ (0, ω). Condition (34) can be satisfied in a similar way.

When β = 1/2, D_β(x ‖ x̂) is related to the Bhattacharyya distance [26]:

and we have

where H_l(x ‖ x̂) is the Hellinger distance [27] between p(r|x) and p(r|x̂):

By Jensen's inequality, for β ∈ (0, 1),

Denoting the Chernoff information [8] as

where βD_β(x ‖ x̂) achieves its maximum at β_m, we have

By Theorem 4, the maximum over β ∈ (0, 1) obeys the asymptotic relationship given there. If β_m = 1/2, then by (50), (46), (47) and (48) this maximum can be evaluated explicitly. Therefore, from (45), (46) and (49), we can see that I_e and I are close to h_c + H(X).
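For reference, the standard textbook forms of the divergences named in the remark above are summarized below; the paper's own normalization of D_β(x ‖ x̂) may differ from these by constant factors, so the stated relations should be read as illustrative rather than as the paper's exact conventions.

```latex
% Standard textbook definitions (normalizations may differ from the paper's).
\begin{align}
D_{\beta}(x\,\|\,\hat{x}) &= \frac{1}{\beta-1}\,
  \ln\!\int p(\mathbf{r}|x)^{\beta}\,p(\mathbf{r}|\hat{x})^{1-\beta}\,d\mathbf{r}
  &&\text{(R\'enyi divergence of order $\beta$)}\\[2pt]
D_{B}(x\,\|\,\hat{x}) &= -\ln\!\int\!\sqrt{p(\mathbf{r}|x)\,p(\mathbf{r}|\hat{x})}\,d\mathbf{r}
  \;=\;\tfrac{1}{2}\,D_{1/2}(x\,\|\,\hat{x})
  &&\text{(Bhattacharyya distance)}\\[2pt]
H_{l}^{2}(x\,\|\,\hat{x}) &= 1-\int\!\sqrt{p(\mathbf{r}|x)\,p(\mathbf{r}|\hat{x})}\,d\mathbf{r}
  \;=\;1-e^{-D_{B}(x\,\|\,\hat{x})}
  &&\text{(squared Hellinger distance)}\\[2pt]
C(x\,\|\,\hat{x}) &= -\min_{\beta\in(0,1)}\,
  \ln\!\int p(\mathbf{r}|x)^{\beta}\,p(\mathbf{r}|\hat{x})^{1-\beta}\,d\mathbf{r}
  &&\text{(Chernoff information)}
\end{align}
```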

Approximations for Mutual Information
In this section, we use the relationships described above to find effective approximations to the true mutual information I in the case of finite N. First of all, Theorems 1 and 2 tell us that the true mutual information I and I_e lie between I_{β,α} and I_u, or between I_{β_1,1} and I_u; that is, I_{β,α} ≤ I ≤ I_u and I_{β_1,1} ≤ I_e ≤ I_u. On the other hand, from (2) and (36) we can obtain the following asymptotic equality under suitable conditions:

Hence, for continuous stimuli, we have the following approximate relationship for large N:

For discrete stimuli, by (31), for large but finite N we have

Consider the special case p(x_m̂) ≈ p(x_m) for m̂ ∈ M_u^m. With the help of definition (18), substitution of p(x_m̂) ≈ p(x_m) into (52) yields I_D^0 ≤ I_D, where the second approximation follows from a first-order Taylor expansion when the relevant term is small. The theoretical discussion above tells us that I_e and I_d are effective approximations to the true mutual information I in the case of large N. Moreover, we find that they are often good approximations of the mutual information I even for relatively small N, as illustrated in the following section.

Results of Numerical Simulations
Consider Poisson model neurons whose responses follow a Poisson distribution [23]. The mean response or tuning curve of neuron n, n ∈ {1, 2, ..., N}, is described by the tuning function f(x; θ_n), which takes the form of a rectified linear function:

In the first example, we suppose the stimulus has a uniform distribution, so that the probability is given by p(x_m) = 1/M. Figure 1a shows graphs of the input distribution p(x) and a representative tuning curve f(x; θ) with the center θ = 0.
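Since the exact tuning function used here is not reproduced above, the following sketch sets up one plausible version of this model: M equally probable stimuli on a one-dimensional grid and N Poisson neurons whose mean responses are a rectified linear function of the stimulus. The gain, offset, grid range, and tuning-curve centers are illustrative assumptions only.

```python
# Sketch of a discrete-stimulus Poisson population with rectified linear tuning.
# All constants are hypothetical and not taken from the paper.
import numpy as np

rng = np.random.default_rng(1)

M = 5                                        # number of discrete stimuli (assumed)
x_grid = np.linspace(-1.0, 1.0, M)           # stimulus values x_m
p_x = np.full(M, 1.0 / M)                    # uniform prior p(x_m) = 1/M

N = 100                                      # population size
theta = rng.uniform(-1.0, 1.0, size=N)       # tuning-curve centers theta_n (assumed)

def tuning_rectified_linear(x, theta, gain=5.0, offset=0.5):
    """f(x; theta) = [gain * (x - theta) + offset]_+  (assumed functional form)."""
    return np.maximum(gain * (x - theta) + offset, 0.0)

def sample_response(m):
    """Draw one population response r ~ p(r | x_m) from the Poisson model."""
    return rng.poisson(tuning_rectified_linear(x_grid[m], theta))

rates = np.array([tuning_rectified_linear(x, theta) for x in x_grid])  # shape (M, N)
print("mean rate matrix shape:", rates.shape)
print("example response to stimulus x_1:", sample_response(0)[:10])
```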
To evaluate the precision of the approximation formulas, we employed Monte Carlo (MC) simulation to approximate the mutual information I [23]. In our MC simulation, we first sampled an input x_j ∈ X from the uniform distribution p(x_j) = 1/M, and then generated the neural responses r_j from the conditional distribution p(r_j | x_j) based on the Poisson model, where j = 1, 2, ..., j_max. The value of mutual information by MC simulation was calculated by

I_MC = (1/j_max) Σ_{j=1}^{j_max} ln( p(r_j | x_j) / p(r_j) ),

where p(r_j) = Σ_{m=1}^{M} p(x_m) p(r_j | x_m). To evaluate the accuracy of our MC simulation, we computed the standard deviation of repeated trials by bootstrapping:

where Γ_{j,i} ∈ {1, 2, ..., j_max} is the (j, i)-th entry of the matrix Γ ∈ ℕ^{j_max × i_max}, with samples taken randomly from the integer set {1, 2, ..., j_max} according to a uniform distribution. Here we set j_max = 5 × 10^5, i_max = 100, and M = 5.
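The procedure described above can be written compactly as follows; the sketch assumes the same hypothetical Poisson model with rectified linear tuning, and uses much smaller j_max and i_max than in the text so that it runs quickly. The bootstrap resampling matrix Γ is realized implicitly by drawing the resampled index columns one at a time.

```python
# Sketch of the Monte Carlo estimator I_MC and its bootstrap standard deviation.
# The model (independent Poisson neurons, rectified linear tuning, uniform prior)
# and all constants are illustrative assumptions.
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(2)

M, N = 5, 50
x_grid = np.linspace(-1.0, 1.0, M)
p_x = np.full(M, 1.0 / M)
theta = rng.uniform(-1.0, 1.0, size=N)
rates = np.maximum(5.0 * (x_grid[:, None] - theta[None, :]) + 0.5, 0.0)  # f_n(x_m), shape (M, N)

j_max, i_max = 20_000, 100

# Sample (x_j, r_j) pairs and evaluate ln p(r_j | x_m) for every stimulus m.
m_samples = rng.choice(M, size=j_max, p=p_x)
r_samples = rng.poisson(rates[m_samples])                                       # shape (j_max, N)
log_lik = poisson.logpmf(r_samples[:, None, :], rates[None, :, :]).sum(axis=2)  # shape (j_max, M)

# Per-sample terms ln p(r_j | x_j) - ln p(r_j), with p(r_j) = sum_m p(x_m) p(r_j | x_m).
log_p_r = np.logaddexp.reduce(log_lik + np.log(p_x), axis=1)
info_terms = log_lik[np.arange(j_max), m_samples] - log_p_r

I_MC = info_terms.mean() / np.log(2.0)                                          # in bits

# Bootstrap: resample the j_max terms i_max times and take the std of the estimates.
boot = np.array([info_terms[rng.integers(0, j_max, size=j_max)].mean()
                 for _ in range(i_max)]) / np.log(2.0)
print("I_MC = %.3f bits, bootstrap std = %.4f bits" % (I_MC, boot.std()))
```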
For different N ∈ {1, 2, 3, 4, 6, 10, 14, 20, 30, 50, 100, 200, 400, 700, 1000}, we compared I_MC with I_e, I_d and I_D, as illustrated in Figure 1b-d. Here we define the relative error of approximation, e.g., for I_e, as

and the relative standard deviation

For the second example, we kept all other conditions unchanged and only changed the probability distribution of the stimulus, p(x_m). Now p(x_m) is a discrete sample from a Gaussian function:

where Z is the normalization constant. The results are illustrated in Figure 2.
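A sketch of the second example's stimulus prior, together with one plausible definition of the relative error, is given below. The Gaussian mean and width, and the exact form of DI_e, are assumptions, since the corresponding equations are not reproduced above; the same relative error would be computed for I_d and I_D.

```python
# Sketch: discretized Gaussian stimulus prior and a plausible relative-error measure.
# The Gaussian parameters and the definition DI_e = (I_e - I_MC) / I_MC are assumed.
import numpy as np

M = 5
x_grid = np.linspace(-1.0, 1.0, M)
mu, sigma = 0.0, 0.5                                   # assumed Gaussian parameters
unnormalized = np.exp(-(x_grid - mu) ** 2 / (2.0 * sigma ** 2))
p_x = unnormalized / unnormalized.sum()                # divide by Z, the normalization constant

def relative_error(I_approx, I_mc):
    """Relative error of an approximation with respect to the Monte Carlo value (assumed form)."""
    return (I_approx - I_mc) / I_mc

print("p(x_m) =", np.round(p_x, 3))
print("example relative error:", relative_error(2.25, 2.30))
```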
Next, we changed each tuning curve f(x; θ_n) to a step function or Heaviside function:

Figures 3 and 4 show the results under the same conditions as Figures 1 and 2, except for the shape of the tuning curves.
In all these examples, we found that the three formulas I_e, I_d and I_D provided excellent approximations to the true values of mutual information obtained by the Monte Carlo method. All these approximations were extremely accurate when N > 100. The saturation of information at large N is due to the fact that completely distinguishing all M = 5 stimuli requires at most log2 5 ≈ 2.32 bits of information. It is not possible to gain more information than this amount regardless of how many neurons are used in the population.
For relatively small values of N, we found that I_D tended to be less accurate than I_d and I_e. In these simulations, I_d and I_e were about equally good, although I_e is more versatile and can also be used for continuous stimuli.

Conclusions
We have derived several asymptotic bounds and effective approximations of mutual information for discrete variables and found some relationships among different approximations. Our methods are based on the Kullback-Leibler divergence and the Rényi divergence, and our final approximation formulas involve only the Kullback-Leibler divergence, which is often easier to evaluate than Shannon mutual information in many practical applications. Although in this paper our theory is developed in the framework of neural population coding, our mathematical results are generic and should hold in many related situations beyond the original context. The formula I_e applies not only to discrete stimuli but also to continuous stimuli; it is well justified theoretically, and its performance was excellent in our numerical simulations. Overall, I_e is our most accurate and versatile approximation formula, although in some cases I_d and I_D are more convenient to calculate.
The results of our numerical experiments show that the three approximations I_e, I_d and I_D were very accurate even when the population size N was relatively small. Among the three approximations, I_D tended to be the least accurate, although as a special case of I_d it is slightly easier to evaluate than I_d. In conclusion, we recommend using I_e and I_d as approximations for Shannon mutual information. Of these two, I_e is the universal formula, whereas I_d is restricted to discrete variables.
Finding effective approximation methods for computing mutual information is a key step for many practical applications of information theory. Generally speaking, the Kullback-Leibler divergence (Eq. 7) is often easier to evaluate or approximate than the Chernoff information (Eq. 44) and Shannon mutual information (Eq. 1). In situations where this is indeed the case, our approximation formulas are potentially useful. Besides their advantages in numerical simulations, the availability of a diverse set of approximation formulas may also provide helpful theoretical insights in analytical studies of information coding and representations.

Figure 1. A comparison of the approximations I_e, I_d and I_D against I_MC obtained by the Monte Carlo method for one-dimensional discrete stimuli. (a) Discrete uniform distribution of the stimulus p(x) (black dots) and the rectified linear tuning curve f(x; θ) with center θ = 0 (blue dashed lines). (b) The values of I_MC, I_e, I_d and I_D as functions of the population size or total number of neurons N. (c) The relative errors DI_e, DI_d and DI_D for the results in panel b. (d) The absolute values of the relative errors |DI_e|, |DI_d| and |DI_D| as in panel c, with error bars showing standard deviations of repeated trials.