Article

Approximations of Shannon Mutual Information for Discrete Variables with Applications to Neural Population Coding

by Wentao Huang and Kechen Zhang
1 Key Laboratory of Cognition and Intelligence and Information Science Academy of China Electronics Technology Group Corporation, Beijing 100086, China
2 Department of Biomedical Engineering, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA
* Authors to whom correspondence should be addressed.
Entropy 2019, 21(3), 243; https://doi.org/10.3390/e21030243
Submission received: 15 December 2018 / Revised: 11 February 2019 / Accepted: 28 February 2019 / Published: 4 March 2019

Abstract:
Although Shannon mutual information has been widely used, its effective calculation is often difficult for many practical problems, including those in neural population coding. Asymptotic formulas based on Fisher information sometimes provide accurate approximations to the mutual information but this approach is restricted to continuous variables because the calculation of Fisher information requires derivatives with respect to the encoded variables. In this paper, we consider information-theoretic bounds and approximations of the mutual information based on Kullback-Leibler divergence and Rényi divergence. We propose several information metrics to approximate Shannon mutual information in the context of neural population coding. While our asymptotic formulas all work for discrete variables, one of them has consistent performance and high accuracy regardless of whether the encoded variables are discrete or continuous. We performed numerical simulations and confirmed that our approximation formulas were highly accurate for approximating the mutual information between the stimuli and the responses of a large neural population. These approximation formulas may potentially bring convenience to the applications of information theory to many practical and theoretical problems.

1. Introduction

Information theory is a powerful tool widely used in many disciplines, including, for example, neuroscience, machine learning, and communication technology [1,2,3,4,5,6,7]. As it is often notoriously difficult to effectively calculate Shannon mutual information in many practical applications [8], various approximation methods have been proposed to estimate the mutual information, such as those based on asymptotic expansion [9,10,11,12,13], k-nearest neighbor [14], and minimal spanning trees [15]. Recently, Safaai et al. proposed a copula method for estimation of mutual information, which can be nonparametric and potentially robust [16]. Another approach for estimating the mutual information is to simplify the calculations by approximations based on information-theoretic bounds, such as the Cramér–Rao lower bound [17] and the van Trees Bayesian Cramér–Rao bound [18].
In this paper, we focus on mutual information estimation based on asymptotic approximations [19,20,21,22,23,24]. For encoding of continuous variables, asymptotic relations between mutual information and Fisher information have been presented by several researchers [19,20,21,22]. Recently, Huang and Zhang [24] proposed an improved approximation formula, which remains accurate for high-dimensional variables. A significant advantage of this approach is that asymptotic approximations are sometimes very useful in analytical studies. For instance, asymptotic approximations allow us to prove that the optimal neural population distribution that maximizes the mutual information between stimulus and response can be solved by convex optimization [24]. Unfortunately, this approach does not generalize to discrete variables, since the calculation of Fisher information requires partial derivatives of the likelihood function with respect to the encoded variables. For encoding of discrete variables, Kang and Sompolinsky [23] presented an asymptotic relationship between mutual information and Chernoff information for statistically independent neurons in a large population. However, Chernoff information is still hard to calculate in many practical applications.
Discrete stimuli or variables occur naturally in sensory coding. While some stimuli are continuous (e.g., the direction of movement and the pitch of a tone), others are discrete (e.g., the identities of faces and the words in human speech). For definiteness, in this paper, we frame our questions in the context of neural population coding; that is, we assume that the stimuli or the input variables are encoded by the pattern of responses elicited from a large population of neurons. The concrete examples used in our numerical simulations were based on a Poisson spike model, where the response of each neuron is taken as the spike count within a given time window. While this simple Poisson model allowed us to consider a large neural population, it only captured the spike rate but not any temporal structure of the spike trains [25,26,27,28]. Nonetheless, our mathematical results are quite general and should be applicable to other input–output systems under suitable conditions to be discussed later.
In the following, we first derive several upper and lower bounds on Shannon mutual information using Kullback-Leibler divergence and Rényi divergence. Next, we derive several new approximation formulas for Shannon mutual information in the limit of large population size. These formulas are more convenient to calculate than the mutual information in our examples. Finally, we confirm the validity of our approximation formulas using the true mutual information as evaluated by Monte Carlo simulations.

2. Theory and Methods

2.1. Notations and Definitions

Suppose the input $\mathbf{x}$ is a $K$-dimensional vector, $\mathbf{x} = (x_1, \ldots, x_K)^T$, which could be interpreted as the parameters that specify a stimulus for a sensory system, and the output is an $N$-dimensional vector, $\mathbf{r} = (r_1, \ldots, r_N)^T$, which could be interpreted as the responses of $N$ neurons. We assume $N$ is large, generally $N \gg K$. We denote random variables by upper case letters, e.g., random variables $X$ and $R$, in contrast to their vector values $\mathbf{x}$ and $\mathbf{r}$. The mutual information between $X$ and $R$ is defined by
$$I = I(X; R) = \left\langle \ln \frac{p(\mathbf{r}|\mathbf{x})}{p(\mathbf{r})} \right\rangle_{\mathbf{r}, \mathbf{x}}, \tag{1}$$
where $\mathbf{x} \in \mathcal{X} \subseteq \mathbb{R}^K$, $\mathbf{r} \in \mathcal{R} \subseteq \mathbb{R}^N$, and $\langle \cdot \rangle_{\mathbf{r}, \mathbf{x}}$ denotes the expectation with respect to the probability density function $p(\mathbf{r}, \mathbf{x})$. Similarly, in the following, we use $\langle \cdot \rangle_{\mathbf{r}|\mathbf{x}}$ and $\langle \cdot \rangle_{\mathbf{x}}$ to denote expectations with respect to $p(\mathbf{r}|\mathbf{x})$ and $p(\mathbf{x})$, respectively.
If $p(\mathbf{x})$ and $p(\mathbf{r}|\mathbf{x})$ are twice continuously differentiable for almost every $\mathbf{x} \in \mathcal{X}$, then for large $N$ we can use an asymptotic formula to approximate the true value of $I$ with high accuracy [24]:
$$I \approx I_G = \frac{1}{2} \left\langle \ln \det \frac{G(\mathbf{x})}{2\pi e} \right\rangle_{\mathbf{x}} + H(X), \tag{2}$$
which is sometimes reduced to
$$I \approx I_F = \frac{1}{2} \left\langle \ln \det \frac{J(\mathbf{x})}{2\pi e} \right\rangle_{\mathbf{x}} + H(X), \tag{3}$$
where $\det(\cdot)$ denotes the matrix determinant, $H(X) = -\langle \ln p(\mathbf{x}) \rangle_{\mathbf{x}}$ is the stimulus entropy,
$$G(\mathbf{x}) = J(\mathbf{x}) + P(\mathbf{x}), \tag{4}$$
$$P(\mathbf{x}) = -\frac{\partial^2 \ln p(\mathbf{x})}{\partial \mathbf{x}\, \partial \mathbf{x}^T}, \tag{5}$$
and
$$J(\mathbf{x}) = \left\langle -\frac{\partial^2 \ln p(\mathbf{r}|\mathbf{x})}{\partial \mathbf{x}\, \partial \mathbf{x}^T} \right\rangle_{\mathbf{r}|\mathbf{x}} = \left\langle \frac{\partial \ln p(\mathbf{r}|\mathbf{x})}{\partial \mathbf{x}} \frac{\partial \ln p(\mathbf{r}|\mathbf{x})}{\partial \mathbf{x}^T} \right\rangle_{\mathbf{r}|\mathbf{x}} \tag{6}$$
is the Fisher information matrix.
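As a concrete illustration of Equations (3)–(6), the following sketch evaluates the Fisher information and the approximation $I_F$ for a hypothetical one-dimensional continuous stimulus encoded by independent Poisson neurons with Gaussian tuning curves. The tuning shape and all parameter values are our own assumptions rather than an example from the paper, and we use the standard fact that for independent Poisson neurons $J(x) = \sum_n f_n'(x)^2 / f_n(x)$; with a uniform prior, $P(x) = 0$ and $G(x) = J(x)$, so $I_F$ coincides with $I_G$ here.

```python
import numpy as np

# Hypothetical example (not from the paper): N independent Poisson neurons with
# Gaussian tuning curves f_n(x) = A * exp(-(x - c_n)^2 / (2 w^2)) and a uniform
# stimulus prior on [-T, T], so that H(X) = ln(2T) and P(x) = 0.
N, A, w, T = 200, 10.0, 1.0, 10.0
centers = np.linspace(-T, T, N)

def tuning(x):
    """Mean spike counts f_n(x) of all N neurons for a stimulus x."""
    return A * np.exp(-(x - centers) ** 2 / (2 * w ** 2))

def fisher(x):
    """Fisher information J(x) = sum_n f_n'(x)^2 / f_n(x) for Poisson spiking."""
    f = tuning(x)
    df = f * (centers - x) / w ** 2          # derivative of the Gaussian tuning curve
    return np.sum(df ** 2 / f)

# Approximation I_F of Equation (3), averaging over the uniform prior.
xs = np.linspace(-T, T, 1001)
H_X = np.log(2 * T)
I_F = np.mean([0.5 * np.log(fisher(x) / (2 * np.pi * np.e)) for x in xs]) + H_X
print(f"I_F ~ {I_F:.3f} nats")
```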
We denote the Kullback-Leibler divergence as
$$D(\mathbf{x} \| \hat{\mathbf{x}}) = \left\langle \ln \frac{p(\mathbf{r}|\mathbf{x})}{p(\mathbf{r}|\hat{\mathbf{x}})} \right\rangle_{\mathbf{r}|\mathbf{x}}, \tag{7}$$
and denote the Rényi divergence [29] of order $\beta + 1$ as
$$D_\beta(\mathbf{x} \| \hat{\mathbf{x}}) = -\frac{1}{\beta} \ln \left\langle \left( \frac{p(\mathbf{r}|\mathbf{x})}{p(\mathbf{r}|\hat{\mathbf{x}})} \right)^{-\beta} \right\rangle_{\mathbf{r}|\mathbf{x}}. \tag{8}$$
Here, $\beta D_\beta(\mathbf{x} \| \hat{\mathbf{x}})$ is equivalent to the Chernoff divergence of order $\beta + 1$ [30]. It is well known that $D_\beta(\mathbf{x} \| \hat{\mathbf{x}}) \to D(\mathbf{x} \| \hat{\mathbf{x}})$ in the limit $\beta \to 0$.
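The limit $D_\beta(\mathbf{x}\|\hat{\mathbf{x}}) \to D(\mathbf{x}\|\hat{\mathbf{x}})$ as $\beta \to 0$, together with the bound $D_\beta \leq D$ used later in Equation (43), can be checked numerically. The sketch below does this for two Poisson spike-count distributions with arbitrarily chosen rates; it evaluates $D$ and $D_\beta$ directly from the definitions as reconstructed above and is meant only as a sanity check, not as code from the paper.

```python
import numpy as np
from scipy.stats import poisson

# Two hypothetical Poisson response distributions p(r|x) and p(r|x_hat).
lam, lam_hat = 8.0, 5.0
r = np.arange(200)                      # truncation covers essentially all probability mass
p, q = poisson.pmf(r, lam), poisson.pmf(r, lam_hat)

def kl(p, q):
    """Kullback-Leibler divergence D(x||x_hat), Equation (7)."""
    m = p > 0
    return np.sum(p[m] * np.log(p[m] / q[m]))

def renyi(p, q, beta):
    """D_beta(x||x_hat) = -(1/beta) ln sum_r p^(1-beta) q^beta, as in Equation (8)."""
    return -np.log(np.sum(p ** (1 - beta) * q ** beta)) / beta

D = kl(p, q)
for beta in (0.5, 0.1, 0.01, 0.001):
    D_beta = renyi(p, q, beta)
    assert 0.0 <= D_beta <= D + 1e-9    # Equation (43)
    print(f"beta = {beta:5.3f}:  D_beta = {D_beta:.4f}   (D = {D:.4f})")
```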
We define
$$I_u = -\left\langle \ln \left\langle \exp\left( -D(\mathbf{x} \| \hat{\mathbf{x}}) \right) \right\rangle_{\hat{\mathbf{x}}} \right\rangle_{\mathbf{x}}, \tag{9}$$
$$I_e = -\left\langle \ln \left\langle \exp\left( -e^{-1} D(\mathbf{x} \| \hat{\mathbf{x}}) \right) \right\rangle_{\hat{\mathbf{x}}} \right\rangle_{\mathbf{x}}, \tag{10}$$
$$I_{\beta,\alpha} = -\left\langle \ln \left\langle \exp\left( -\beta D_\beta(\mathbf{x} \| \hat{\mathbf{x}}) + (1-\alpha) \ln \frac{p(\mathbf{x})}{p(\hat{\mathbf{x}})} \right) \right\rangle_{\hat{\mathbf{x}}} \right\rangle_{\mathbf{x}}, \tag{11}$$
where in $I_{\beta,\alpha}$ we have $\beta \in (0,1)$ and $\alpha \geq 0$, and we assume $p(\mathbf{x}) > 0$ for all $\mathbf{x} \in \mathcal{X}$.
In the following, we suppose $\mathbf{x}$ takes $M$ discrete values $\mathbf{x}_m$, $m \in \mathcal{M} = \{1, 2, \ldots, M\}$, and $p(\mathbf{x}_m) > 0$ for all $m$. Now, the definitions in Equations (9)–(11) become
$$I_u = -\sum_{m=1}^{M} p(\mathbf{x}_m) \ln \left( \sum_{\hat{m}=1}^{M} \frac{p(\mathbf{x}_{\hat{m}})}{p(\mathbf{x}_m)} \exp\left( -D(\mathbf{x}_m \| \mathbf{x}_{\hat{m}}) \right) \right) + H(X), \tag{12}$$
$$I_e = -\sum_{m=1}^{M} p(\mathbf{x}_m) \ln \left( \sum_{\hat{m}=1}^{M} \frac{p(\mathbf{x}_{\hat{m}})}{p(\mathbf{x}_m)} \exp\left( -e^{-1} D(\mathbf{x}_m \| \mathbf{x}_{\hat{m}}) \right) \right) + H(X), \tag{13}$$
$$I_{\beta,\alpha} = -\sum_{m=1}^{M} p(\mathbf{x}_m) \ln \left( \sum_{\hat{m}=1}^{M} \left( \frac{p(\mathbf{x}_{\hat{m}})}{p(\mathbf{x}_m)} \right)^{\alpha} \exp\left( -\beta D_\beta(\mathbf{x}_m \| \mathbf{x}_{\hat{m}}) \right) \right) + H(X). \tag{14}$$
Furthermore, we define
$$I_d = -\sum_{m=1}^{M} p(\mathbf{x}_m) \ln \left( 1 + \sum_{\hat{m} \in \mathcal{M}_m^u} \frac{p(\mathbf{x}_{\hat{m}})}{p(\mathbf{x}_m)} \exp\left( -e^{-1} D(\mathbf{x}_m \| \mathbf{x}_{\hat{m}}) \right) \right) + H(X), \tag{15}$$
$$I_u^d = -\sum_{m=1}^{M} p(\mathbf{x}_m) \ln \left( 1 + \sum_{\hat{m} \in \mathcal{M}_m^u} \frac{p(\mathbf{x}_{\hat{m}})}{p(\mathbf{x}_m)} \exp\left( -D(\mathbf{x}_m \| \mathbf{x}_{\hat{m}}) \right) \right) + H(X), \tag{16}$$
$$I_{\beta,\alpha}^d = -\sum_{m=1}^{M} p(\mathbf{x}_m) \ln \left( 1 + \sum_{\hat{m} \in \mathcal{M}_m^\beta} \left( \frac{p(\mathbf{x}_{\hat{m}})}{p(\mathbf{x}_m)} \right)^{\alpha} \exp\left( -\beta D_\beta(\mathbf{x}_m \| \mathbf{x}_{\hat{m}}) \right) \right) + H(X), \tag{17}$$
$$I_D = -\sum_{m=1}^{M} p(\mathbf{x}_m) \ln \left( 1 + \sum_{\hat{m} \in \mathcal{M}_m^u} \exp\left( -e^{-1} D(\mathbf{x}_m \| \mathbf{x}_{\hat{m}}) \right) \right) + H(X), \tag{18}$$
where
$$\check{\mathcal{M}}_m^\beta = \left\{ \hat{m} : \hat{m} = \arg\min_{\check{m} \in \mathcal{M} \setminus \hat{\mathcal{M}}_m^\beta} D_\beta(\mathbf{x}_m \| \mathbf{x}_{\check{m}}) \right\}, \tag{19}$$
$$\check{\mathcal{M}}_m^u = \left\{ \hat{m} : \hat{m} = \arg\min_{\check{m} \in \mathcal{M} \setminus \hat{\mathcal{M}}_m^u} D(\mathbf{x}_m \| \mathbf{x}_{\check{m}}) \right\}, \tag{20}$$
$$\hat{\mathcal{M}}_m^\beta = \left\{ \hat{m} : D_\beta(\mathbf{x}_m \| \mathbf{x}_{\hat{m}}) = 0 \right\}, \tag{21}$$
$$\hat{\mathcal{M}}_m^u = \left\{ \hat{m} : D(\mathbf{x}_m \| \mathbf{x}_{\hat{m}}) = 0 \right\}, \tag{22}$$
$$\mathcal{M}_m^\beta = \check{\mathcal{M}}_m^\beta \cup \hat{\mathcal{M}}_m^\beta \setminus \{m\}, \tag{23}$$
$$\mathcal{M}_m^u = \check{\mathcal{M}}_m^u \cup \hat{\mathcal{M}}_m^u \setminus \{m\}. \tag{24}$$
Here, notice that, if $\mathbf{x}$ is uniformly distributed, then by definition $I_d$ and $I_D$ become identical. The elements in the set $\check{\mathcal{M}}_m^\beta$ are those that make $D_\beta(\mathbf{x}_m \| \mathbf{x}_{\check{m}})$ take the minimum value, excluding any element that satisfies the condition $D_\beta(\mathbf{x}_m \| \mathbf{x}_{\hat{m}}) = 0$. Similarly, the elements in the set $\check{\mathcal{M}}_m^u$ are those that minimize $D(\mathbf{x}_m \| \mathbf{x}_{\check{m}})$, excluding the ones that satisfy the condition $D(\mathbf{x}_m \| \mathbf{x}_{\hat{m}}) = 0$.
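To make these discrete definitions concrete, the sketch below evaluates Equations (12), (13), (15) and (18) from a precomputed matrix of Kullback-Leibler divergences $D(\mathbf{x}_m\|\mathbf{x}_{\hat m})$ and a prior $p(\mathbf{x}_m)$. The divergence matrix and prior used at the end are arbitrary placeholders; in an application they would come from the encoding model $p(\mathbf{r}|\mathbf{x})$.

```python
import numpy as np

def entropy(p):
    """Stimulus entropy H(X) = -sum_m p_m ln p_m."""
    return -np.sum(p * np.log(p))

def I_u(D, p):
    """Upper bound I_u, Equation (12)."""
    W = (p[None, :] / p[:, None]) * np.exp(-D)            # W[m, m_hat]
    return -np.sum(p * np.log(W.sum(axis=1))) + entropy(p)

def I_e(D, p):
    """Approximation I_e, Equation (13): I_u with the divergence scaled by 1/e."""
    return I_u(D / np.e, p)

def _sum_over_Mmu(D, p, m, weight, tol=1e-12):
    """Sum over the set M_m^u: zero-divergence entries plus the nearest remaining
    neighbors of stimulus m, always excluding m itself."""
    d, others = D[m], np.arange(len(p)) != m
    zero = (d < tol) & others
    rest = (~zero) & others
    nearest = (rest & np.isclose(d, d[rest].min())) if rest.any() else rest
    idx = zero | nearest
    return np.sum(weight[idx] * np.exp(-d[idx] / np.e))

def I_d(D, p):
    """Approximation I_d, Equation (15)."""
    return -sum(p[m] * np.log1p(_sum_over_Mmu(D, p, m, p / p[m]))
                for m in range(len(p))) + entropy(p)

def I_D(D, p):
    """Approximation I_D, Equation (18): I_d with the ratio p(x_mh)/p(x_m) set to 1."""
    return -sum(p[m] * np.log1p(_sum_over_Mmu(D, p, m, np.ones_like(p)))
                for m in range(len(p))) + entropy(p)

# Placeholder example: M = 4 stimuli, a made-up symmetric divergence matrix, flat prior.
D = np.array([[0., 3., 9., 27.],
              [3., 0., 3., 9.],
              [9., 3., 0., 3.],
              [27., 9., 3., 0.]])
p = np.full(4, 0.25)
print(I_u(D, p), I_e(D, p), I_d(D, p), I_D(D, p))         # I_d = I_D for the flat prior
```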

2.2. Theorems

In the following, we state several conclusions as theorems and prove them in Appendix A.
Theorem 1.
The mutual information $I$ is bounded as follows:
$$I_{\beta,\alpha} \leq I \leq I_u. \tag{25}$$
Theorem 2.
The following inequalities are satisfied:
$$I_{\beta_1,1} \leq I_e \leq I_u, \tag{26}$$
where $I_{\beta_1,1}$ is a special case of $I_{\beta,\alpha}$ in Equation (11) with $\beta_1 = e^{-1}$, so that
$$I_{\beta_1,1} = -\left\langle \ln \left\langle \exp\left( -\beta_1 D_{\beta_1}(\mathbf{x} \| \hat{\mathbf{x}}) \right) \right\rangle_{\hat{\mathbf{x}}} \right\rangle_{\mathbf{x}}. \tag{27}$$
Theorem 3.
If there exist $\gamma_1 > 0$ and $\gamma_2 > 0$ such that
$$\beta D_\beta(\mathbf{x}_m \| \mathbf{x}_{m_1}) \geq \gamma_1 \ln N, \tag{28}$$
$$D(\mathbf{x}_m \| \mathbf{x}_{m_2}) \geq \gamma_2 \ln N, \tag{29}$$
for discrete stimuli $\mathbf{x}_m$, where $m \in \mathcal{M}$, $m_1 \in \mathcal{M} \setminus (\mathcal{M}_m^\beta \cup \{m\})$ and $m_2 \in \mathcal{M} \setminus (\mathcal{M}_m^u \cup \{m\})$, then we have the following asymptotic relationships:
$$I_{\beta,\alpha} = I_{\beta,\alpha}^d + O\left( N^{-\gamma_1} \right) \leq I \leq I_u = I_u^d + O\left( N^{-\gamma_2} \right) \tag{30}$$
and
$$I_e = I_d + O\left( N^{-\gamma_2 / e} \right). \tag{31}$$
Theorem 4.
Suppose $p(\mathbf{x})$ and $p(\mathbf{r}|\mathbf{x})$ are twice continuously differentiable for $\mathbf{x} \in \mathcal{X}$, $\|q'(\mathbf{x})\| < \infty$ and $\|q''(\mathbf{x})\| < \infty$, where $q(\mathbf{x}) = \ln p(\mathbf{x})$ and $'$ and $''$ denote the partial derivatives $\partial / \partial \mathbf{x}$ and $\partial^2 / \partial \mathbf{x}\, \partial \mathbf{x}^T$, and $G_\gamma(\mathbf{x})$ is positive definite with $N \| G_\gamma^{-1}(\mathbf{x}) \| = O(1)$, where $\| \cdot \|$ denotes the matrix Frobenius norm,
$$G_\gamma(\mathbf{x}) = \gamma J(\mathbf{x}) + P(\mathbf{x}), \tag{32}$$
$\gamma = \beta(1-\beta)$ and $\beta \in (0,1)$. If there exists an $\omega = \omega(\mathbf{x}) > 0$ such that
$$\det\left( G(\mathbf{x}) \right)^{1/2} \int_{\bar{\mathcal{X}}_\varepsilon(\mathbf{x})} p(\hat{\mathbf{x}}) \exp\left( -D(\mathbf{x} \| \hat{\mathbf{x}}) \right) d\hat{\mathbf{x}} = O\left( N^{-1} \right), \tag{33}$$
$$\det\left( G_\gamma(\mathbf{x}) \right)^{1/2} \int_{\bar{\mathcal{X}}_\varepsilon(\mathbf{x})} p(\hat{\mathbf{x}}) \exp\left( -\beta D_\beta(\mathbf{x} \| \hat{\mathbf{x}}) \right) d\hat{\mathbf{x}} = O\left( N^{-1} \right), \tag{34}$$
for all $\mathbf{x} \in \mathcal{X}$ and $\varepsilon \in (0, \omega)$, where $\bar{\mathcal{X}}_\omega(\mathbf{x}) = \mathcal{X} \setminus \mathcal{X}_\omega(\mathbf{x})$ is the complementary set of $\mathcal{X}_\omega(\mathbf{x}) = \left\{ \breve{\mathbf{x}} \in \mathbb{R}^K : (\breve{\mathbf{x}} - \mathbf{x})^T G(\mathbf{x}) (\breve{\mathbf{x}} - \mathbf{x}) < N \omega^2 \right\}$, then we have the following asymptotic relationships:
$$I_{\beta,\alpha} \leq I_{\gamma_0} + O\left( N^{-1} \right) \leq I \leq I_u = I_G + K/2 + O\left( N^{-1} \right), \tag{35}$$
$$I_e = I_G + O\left( N^{-1} \right), \tag{36}$$
$$I_{\beta,\alpha} = I_\gamma + O\left( N^{-1} \right), \tag{37}$$
where
$$I_\gamma = \frac{1}{2} \int_{\mathcal{X}} p(\mathbf{x}) \ln \det \frac{G_\gamma(\mathbf{x})}{2\pi}\, d\mathbf{x} + H(X) \tag{38}$$
and $\gamma_0 = \beta_0 (1 - \beta_0) = 1/4$ with $\beta_0 = 1/2$.
Remark 1.
We see from Theorems 1 and 2 that the true mutual information $I$ and the approximation $I_e$ both lie between $I_{\beta_1,1}$ and $I_u$, which implies that their values may be close to each other. For a discrete variable $\mathbf{x}$, Theorem 3 tells us that $I_e$ and $I_d$ are asymptotically equivalent (i.e., their difference vanishes) in the limit of large $N$. For a continuous variable $\mathbf{x}$, Theorem 4 tells us that $I_e$ and $I_G$ are asymptotically equivalent in the limit of large $N$, which means that $I_e$ and $I$ are also asymptotically equivalent because $I_G$ and $I$ are known to be asymptotically equivalent [24].
Remark 2.
To see how the condition in Equation (33) could be satisfied, consider the case where $D(\mathbf{x} \| \hat{\mathbf{x}})$ has only one extreme point at $\hat{\mathbf{x}} = \mathbf{x}$ for $\hat{\mathbf{x}} \in \mathcal{X}_\omega(\mathbf{x})$ and there exists an $\eta > 0$ such that $N^{-1} D(\mathbf{x} \| \hat{\mathbf{x}}) \geq \eta$ for $\hat{\mathbf{x}} \in \bar{\mathcal{X}}_\omega(\mathbf{x})$. Now, the condition in Equation (33) is satisfied because
$$\det\left( G(\mathbf{x}) \right)^{1/2} \int_{\bar{\mathcal{X}}_\varepsilon(\mathbf{x})} p(\hat{\mathbf{x}}) \exp\left( -D(\mathbf{x} \| \hat{\mathbf{x}}) \right) d\hat{\mathbf{x}} \leq \det\left( G(\mathbf{x}) \right)^{1/2} \int_{\bar{\mathcal{X}}_\varepsilon(\mathbf{x})} p(\hat{\mathbf{x}}) \exp\left( -\hat{\eta}_\varepsilon N \right) d\hat{\mathbf{x}} = O\left( N^{K/2} e^{-\hat{\eta}_\varepsilon N} \right), \tag{39}$$
where by assumption we can find an $\hat{\eta}_\varepsilon > 0$ for any given $\varepsilon \in (0, \omega)$. The condition in Equation (34) can be satisfied in a similar way. When $\beta_0 = 1/2$, $\beta_0 D_{\beta_0}(\mathbf{x} \| \hat{\mathbf{x}})$ is the Bhattacharyya distance [31]:
$$\beta_0 D_{\beta_0}(\mathbf{x} \| \hat{\mathbf{x}}) = -\ln \int_{\mathcal{R}} \sqrt{p(\mathbf{r}|\mathbf{x})\, p(\mathbf{r}|\hat{\mathbf{x}})}\, d\mathbf{r}, \tag{40}$$
and we have
$$J(\mathbf{x}) = \left. \frac{\partial^2 D(\mathbf{x} \| \hat{\mathbf{x}})}{\partial \hat{\mathbf{x}}\, \partial \hat{\mathbf{x}}^T} \right|_{\hat{\mathbf{x}} = \mathbf{x}} = \left. 4\, \frac{\partial^2 \beta_0 D_{\beta_0}(\mathbf{x} \| \hat{\mathbf{x}})}{\partial \hat{\mathbf{x}}\, \partial \hat{\mathbf{x}}^T} \right|_{\hat{\mathbf{x}} = \mathbf{x}} = \left. 8\, \frac{\partial^2 H_l^2(\mathbf{x} \| \hat{\mathbf{x}})}{\partial \hat{\mathbf{x}}\, \partial \hat{\mathbf{x}}^T} \right|_{\hat{\mathbf{x}} = \mathbf{x}}, \tag{41}$$
where $H_l(\mathbf{x} \| \hat{\mathbf{x}})$ is the Hellinger distance [32] between $p(\mathbf{r}|\mathbf{x})$ and $p(\mathbf{r}|\hat{\mathbf{x}})$:
$$H_l^2(\mathbf{x} \| \hat{\mathbf{x}}) = \frac{1}{2} \int_{\mathcal{R}} \left( \sqrt{p(\mathbf{r}|\mathbf{x})} - \sqrt{p(\mathbf{r}|\hat{\mathbf{x}})} \right)^2 d\mathbf{r}. \tag{42}$$
By Jensen's inequality, for $\beta \in (0,1)$ we get
$$0 \leq D_\beta(\mathbf{x} \| \hat{\mathbf{x}}) \leq D(\mathbf{x} \| \hat{\mathbf{x}}). \tag{43}$$
Denoting the Chernoff information [8] as
$$C(\mathbf{x} \| \hat{\mathbf{x}}) = \max_{\beta \in (0,1)} \beta D_\beta(\mathbf{x} \| \hat{\mathbf{x}}) = \beta_m D_{\beta_m}(\mathbf{x} \| \hat{\mathbf{x}}), \tag{44}$$
where $\beta D_\beta(\mathbf{x} \| \hat{\mathbf{x}})$ achieves its maximum at $\beta_m$, we have
$$I_{\beta,\alpha} - H(X) \leq h_c = -\sum_{m=1}^{M} p(\mathbf{x}_m) \ln \sum_{\hat{m}=1}^{M} \frac{p(\mathbf{x}_{\hat{m}})}{p(\mathbf{x}_m)} \exp\left( -C(\mathbf{x}_m \| \mathbf{x}_{\hat{m}}) \right) \tag{45}$$
$$\leq h_d = -\sum_{m=1}^{M} p(\mathbf{x}_m) \ln \sum_{\hat{m}=1}^{M} \frac{p(\mathbf{x}_{\hat{m}})}{p(\mathbf{x}_m)} \exp\left( -\beta_m D(\mathbf{x}_m \| \mathbf{x}_{\hat{m}}) \right). \tag{46}$$
By Theorem 4,
$$\max_{\beta \in (0,1)} I_{\beta,\alpha} = I_{\gamma_0} + O\left( N^{-1} \right), \tag{47}$$
$$I_{\gamma_0} = I_G - \frac{K}{2} \ln \frac{4}{e}. \tag{48}$$
If $\beta_m = 1/2$, then, by Equations (46)–(48) and (50), we have
$$\max_{\beta \in (0,1)} I_{\beta,\alpha} + \frac{K}{2} \ln \frac{4}{e} + O\left( N^{-1} \right) \leq I_e = I + O\left( N^{-1} \right) \leq h_d + H(X) \leq I_u. \tag{49}$$
Therefore, from Equations (45), (46) and (49), we can see that $I_e$ and $I$ are close to $h_c + H(X)$.

2.3. Approximations for Mutual Information

In this section, we use the relationships described above to find effective approximations to the true mutual information $I$ in the case of large but finite $N$. First, Theorems 1 and 2 tell us that the true mutual information $I$ and its approximation $I_e$ lie between lower and upper bounds given by $I_{\beta,\alpha} \leq I \leq I_u$ and $I_{\beta_1,1} \leq I_e \leq I_u$. As a special case, $I$ is also bounded by $I_{\beta_1,1} \leq I \leq I_u$. Furthermore, from Equations (2) and (36) we can obtain the following asymptotic equality under suitable conditions:
$$I = I_e + O\left( N^{-1} \right). \tag{50}$$
Hence, for continuous stimuli, we have the following approximate relationship for large $N$:
$$I \approx I_e \approx I_G. \tag{51}$$
For discrete stimuli, by Equation (31), for large but finite $N$ we have
$$I \approx I_e \approx I_d = -\sum_{m=1}^{M} p(\mathbf{x}_m) \ln \left( 1 + \sum_{\hat{m} \in \mathcal{M}_m^u} \frac{p(\mathbf{x}_{\hat{m}})}{p(\mathbf{x}_m)} \exp\left( -e^{-1} D(\mathbf{x}_m \| \mathbf{x}_{\hat{m}}) \right) \right) + H(X). \tag{52}$$
Consider the special case $p(\mathbf{x}_{\hat{m}}) \approx p(\mathbf{x}_m)$ for $\hat{m} \in \mathcal{M}_m^u$. With the help of Equation (18), substitution of $p(\mathbf{x}_{\hat{m}}) \approx p(\mathbf{x}_m)$ into Equation (52) yields
$$I \approx I_D = -\sum_{m=1}^{M} p(\mathbf{x}_m) \ln \left( 1 + \sum_{\hat{m} \in \mathcal{M}_m^u} \exp\left( -e^{-1} D(\mathbf{x}_m \| \mathbf{x}_{\hat{m}}) \right) \right) + H(X) \approx -\sum_{m=1}^{M} p(\mathbf{x}_m) \sum_{\hat{m} \in \mathcal{M}_m^u} \exp\left( -e^{-1} D(\mathbf{x}_m \| \mathbf{x}_{\hat{m}}) \right) + H(X) = I_{D_0}, \tag{53}$$
where $I_{D_0} \leq I_D$ and the second approximation follows from the first-order Taylor expansion $\ln(1+z) \approx z$, assuming that the term $\sum_{\hat{m} \in \mathcal{M}_m^u} \exp\left( -e^{-1} D(\mathbf{x}_m \| \mathbf{x}_{\hat{m}}) \right)$ is sufficiently small.
The theoretical discussion above suggests that I e and I d are effective approximations to true mutual information I in the limit of large N. Moreover, we find that they are often good approximations of mutual information I even for relatively small N, as illustrated in the following section.

3. Results of Numerical Simulations

Consider a population of Poisson model neurons whose responses (i.e., numbers of spikes within a given time window) follow a Poisson distribution [24]. The mean response of neuron $n$, with $n \in \{1, 2, \ldots, N\}$, is described by the tuning function $f(x; \theta_n)$, which takes the form of a Heaviside step function:
$$f(x; \theta_n) = \begin{cases} A, & \text{if } x \geq \theta_n, \\ 0, & \text{if } x < \theta_n, \end{cases} \tag{54}$$
where the stimulus $x \in [-T, T]$ with $T = 10$, $A = 10$, and the centers $\theta_1, \theta_2, \ldots, \theta_N$ of the $N$ neurons are uniformly spaced in the interval $[-T, T]$, namely, $\theta_n = (n-1)d - T$ with $d = 2T/(N-1)$ for $N \geq 2$, and $\theta_n = 0$ for $N = 1$. We suppose that the discrete stimulus $x$ has $M = 21$ possible values that are evenly spaced from $-T$ to $T$, namely, $x \in \mathcal{X} = \{ x_m : x_m = 2(m-1)T/(M-1) - T,\; m = 1, 2, \ldots, M \}$. Now, the Kullback-Leibler divergence can be written as
$$D(x_m \| x_{\hat{m}}) = \sum_{n=1}^{N} \left( f(x_m; \theta_n) \ln \frac{f(x_m; \theta_n)}{f(x_{\hat{m}}; \theta_n)} + f(x_{\hat{m}}; \theta_n) - f(x_m; \theta_n) \right). \tag{55}$$
Thus, $\exp\left( -e^{-1} D(x_m \| x_{\hat{m}}) \right)$ factorizes over the neurons, and the factor contributed by neuron $n$ equals $1$ when $f(x_m; \theta_n) = f(x_{\hat{m}}; \theta_n)$, equals $\exp\left( -e^{-1} A \right)$ when $f(x_m; \theta_n) = 0$ and $f(x_{\hat{m}}; \theta_n) = A$, and equals $0$ when $f(x_m; \theta_n) = A$ and $f(x_{\hat{m}}; \theta_n) = 0$. Therefore, in this case, we have
$$I_e = I_d. \tag{56}$$
More generally, this equality holds true whenever the tuning function has binary values.
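A minimal sketch of this first example is given below. It builds the Heaviside step tuning curves, accumulates the pairwise Kullback-Leibler divergences of Equation (55) over the independent Poisson neurons (treating $0 \ln 0$ as $0$ and assigning an infinite divergence whenever $f(x_m;\theta_n) = A$ but $f(x_{\hat m};\theta_n) = 0$), and then evaluates $I_e$ from Equation (13). The values $T = A = 10$ and $M = 21$ follow the text; $N = 200$ is one of the population sizes used in the figures, and the variable names are ours.

```python
import numpy as np

T, A, M, N = 10.0, 10.0, 21, 200
x = np.linspace(-T, T, M)                              # discrete stimulus values
theta = np.linspace(-T, T, N)                          # step-function thresholds
F = np.where(x[:, None] >= theta[None, :], A, 0.0)     # F[m, n] = f(x_m; theta_n)

# Pairwise KL divergences of Equation (55), summed over the independent neurons.
D = np.zeros((M, M))
for m in range(M):
    for mh in range(M):
        f, g = F[m], F[mh]
        if np.any((f > 0) & (g == 0)):
            D[m, mh] = np.inf                          # exp(-D/e) vanishes for such pairs
        else:
            pos = f > 0
            D[m, mh] = np.sum(f[pos] * np.log(f[pos] / g[pos])) + np.sum(g - f)

# I_e from Equation (13) with the uniform prior p(x_m) = 1/M of the first example.
p = np.full(M, 1.0 / M)
H_X = -np.sum(p * np.log(p))
inner = np.sum((p[None, :] / p[:, None]) * np.exp(-D / np.e), axis=1)
I_e = -np.sum(p * np.log(inner)) + H_X
print(f"I_e ~ {I_e:.3f} nats = {I_e / np.log(2):.3f} bits")
```

With these parameters the value comes out near the saturation level $\log_2 21 \approx 4.39$ bits discussed below, since $N = 200$ neurons are more than enough to distinguish the $M = 21$ stimuli.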
In the first example, as illustrated in Figure 1, we suppose the stimulus has a uniform distribution, so that the probability is given by $p(x_m) = 1/M$. Figure 1a shows graphs of the input distribution $p(x)$ and a representative tuning function $f(x; \theta)$ with the center $\theta = 0$.
To assess the accuracy of the approximation formulas, we employed Monte Carlo (MC) simulation to evaluate the mutual information $I$ [24]. In our MC simulation, we first sampled an input $x_j \in \mathcal{X}$ from the uniform distribution $p(x_j) = 1/M$, then generated the neural responses $\mathbf{r}_j$ by the conditional distribution $p(\mathbf{r}_j | x_j)$ based on the Poisson model, where $j = 1, 2, \ldots, j_{\max}$. The value of mutual information by MC simulation was calculated by
$$I_{MC} = \frac{1}{j_{\max}} \sum_{j=1}^{j_{\max}} \ln \frac{p(\mathbf{r}_j | x_j)}{p(\mathbf{r}_j)}, \tag{57}$$
where
$$p(\mathbf{r}_j) = \sum_{m=1}^{M} p(\mathbf{r}_j | x_m)\, p(x_m). \tag{58}$$
To assess the precision of our MC simulation, we computed the standard deviation of repeated trials by bootstrapping:
$$I_{std} = \sqrt{ \frac{1}{i_{\max}} \sum_{i=1}^{i_{\max}} \left( I_{MC}^i - \bar{I}_{MC} \right)^2 }, \tag{59}$$
where
$$I_{MC}^i = \frac{1}{j_{\max}} \sum_{j=1}^{j_{\max}} \ln \frac{p(\mathbf{r}_{\Gamma_{j,i}} | x_{\Gamma_{j,i}})}{p(\mathbf{r}_{\Gamma_{j,i}})}, \tag{60}$$
$$\bar{I}_{MC} = \frac{1}{i_{\max}} \sum_{i=1}^{i_{\max}} I_{MC}^i, \tag{61}$$
and $\Gamma_{j,i} \in \{1, 2, \ldots, j_{\max}\}$ is the $(j,i)$-th entry of the matrix $\Gamma \in \mathbb{N}^{j_{\max} \times i_{\max}}$, with samples taken randomly from the integer set $\{1, 2, \ldots, j_{\max}\}$ by a uniform distribution. Here, we set $j_{\max} = 5 \times 10^5$, $i_{\max} = 100$ and $M = 10^3$.
For different $N \in \{1, 2, 3, 4, 6, 10, 14, 20, 30, 50, 100, 200, 400, 700, 1000\}$, we compared $I_{MC}$ with $I_e$, $I_d$ and $I_D$, as illustrated in Figure 1b–d. Here, we define the relative error of approximation, e.g., for $I_e$, as
$$D_{I_e} = \frac{I_e - I_{MC}}{I_{MC}}, \tag{62}$$
and the relative standard deviation
$$D_{I_{std}} = \frac{I_{std}}{I_{MC}}. \tag{63}$$
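The Monte Carlo estimator of Equations (57)–(61) can be sketched as follows for the independent-Poisson model. The tuning setup repeats the step-function example above, and the sample sizes are deliberately much smaller than the $j_{\max} = 5 \times 10^5$ and $i_{\max} = 100$ used in the paper so that the example runs quickly; a log-sum-exp is used for $p(\mathbf{r}_j)$ to avoid numerical underflow.

```python
import numpy as np
from scipy.stats import poisson
from scipy.special import logsumexp

rng = np.random.default_rng(0)

# Step-tuning setup as before (T = A = 10, M = 21 stimuli, N = 200 neurons).
T, A, M, N = 10.0, 10.0, 21, 200
x, theta = np.linspace(-T, T, M), np.linspace(-T, T, N)
F = np.where(x[:, None] >= theta[None, :], A, 0.0)      # F[m, n] = f(x_m; theta_n)
p = np.full(M, 1.0 / M)

def mc_mutual_information(F, p, j_max=5000, i_max=50):
    """Monte Carlo estimate I_MC (Equation (57)) and a bootstrap standard
    deviation I_std (Equation (59)) for independent Poisson neurons."""
    M = F.shape[0]
    m_idx = rng.choice(M, size=j_max, p=p)              # stimuli x_j ~ p(x)
    R = rng.poisson(F[m_idx])                           # responses r_j ~ p(r | x_j)

    # log p(r_j | x_m) for every sample j and every candidate stimulus m
    logL = np.stack([poisson.logpmf(R, F[m]).sum(axis=1) for m in range(M)], axis=1)
    log_joint = logL[np.arange(j_max), m_idx]           # ln p(r_j | x_j)
    log_marg = logsumexp(logL + np.log(p)[None, :], axis=1)   # ln p(r_j), Equation (58)
    terms = log_joint - log_marg
    I_mc = terms.mean()

    # Bootstrap resampling over the j_max samples (Equations (59)-(61))
    boot = np.array([terms[rng.integers(0, j_max, j_max)].mean() for _ in range(i_max)])
    return I_mc, boot.std()

I_mc, I_std = mc_mutual_information(F, p)
print(f"I_MC = {I_mc:.3f} +/- {I_std:.3f} nats")
```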
For the second example, we only changed the probability distribution of the stimulus $p(x_m)$ while keeping all other conditions unchanged. Now, $p(x_m)$ is a discrete sample from a Gaussian function:
$$p(x_m) = Z^{-1} \exp\left( -\frac{x_m^2}{2 \sigma^2} \right), \quad m = 1, 2, \ldots, M, \tag{64}$$
where $Z$ is the normalization constant and $\sigma = T/2$. The results are illustrated in Figure 2.
Next, we changed each tuning function $f(x; \theta_n)$ to a rectified linear function:
$$f(x; \theta_n) = \max\left( 0, x - \theta_n \right). \tag{65}$$
Figure 3 and Figure 4 show the results under the same conditions as Figure 1 and Figure 2, respectively, except for the shape of the tuning functions.
Finally, we let the tuning function $f(x; \boldsymbol{\theta}_n)$ have a random form:
$$f(x; \boldsymbol{\theta}_n) = \begin{cases} B, & \text{if } x \in \boldsymbol{\theta}_n = \{ \theta_{n1}, \theta_{n2}, \ldots, \theta_{nK} \}, \\ 0, & \text{otherwise}, \end{cases} \tag{66}$$
where the stimulus $x \in \mathcal{X} = \{1, 2, \ldots, 999, 1000\}$, $B = 10$, and the values of $\theta_{n1}, \theta_{n2}, \ldots, \theta_{nK}$ are distinct and randomly selected from the set $\mathcal{X}$ with $K = 10$. In this example, we may regard $\mathcal{X}$ as a list of natural objects (stimuli), and there are a total of $N$ sensory neurons, each of which responds only to $K$ randomly selected objects. Figure 5 shows the results under the condition that $p(x)$ is a uniform distribution. In Figure 6, we assume that $p(x)$ is not flat but a half Gaussian given by Equation (64) with $\sigma = 500$.
In all these examples, we found that the three formulas, namely $I_e$, $I_d$ and $I_D$, provided excellent approximations to the true values of mutual information as evaluated by the Monte Carlo method. For instance, in Figure 1 and Figure 5, all three approximations were practically indistinguishable. In general, all these approximations were extremely accurate when $N > 100$.
In all our simulations, the mutual information tended to increase with the population size $N$, eventually reaching a plateau for large enough $N$. The saturation of information for large $N$ is due to the fact that it requires at most $\log_2 M$ bits of information to completely distinguish all $M$ stimuli. It is impossible to gain more information than this maximum amount regardless of how many neurons are used in the population. In Figure 1, for instance, this maximum is $\log_2 21 \approx 4.39$ bits, and in Figure 5, this maximum is $\log_2 1000 \approx 9.97$ bits.
For relatively small values of $N$, we found that $I_D$ tended to be less accurate than $I_e$ or $I_d$ (see Figure 5 and Figure 6). Our simulations also confirmed two analytical results. The first one is that $I_d = I_D$ when the stimulus distribution is uniform; this result follows directly from the definitions of $I_d$ and $I_D$ and is confirmed by the simulations in Figure 1, Figure 3, and Figure 5. The second result is that $I_d = I_e$ (Equation (56)) when the tuning function is binary, as confirmed by the simulations in Figure 1, Figure 2, Figure 5, and Figure 6. When the tuning function allows many different values, $I_e$ can be much more accurate than $I_d$ and $I_D$, as shown by the simulations in Figure 3 and Figure 4. To summarize, our best approximation formula is $I_e$ because it is more accurate than $I_d$ and $I_D$, and, unlike $I_d$ and $I_D$, it applies to both discrete and continuous stimuli (Equations (10) and (13)).

4. Discussion

We have derived several asymptotic bounds and effective approximations of mutual information for discrete variables and established several relationships among different approximations. Our final approximation formulas involve only Kullback-Leibler divergence, which is often easier to evaluate than Shannon mutual information in practical applications. Although in this paper our theory is developed in the framework of neural population coding with concrete examples, our mathematical results are generic and should hold true in many related situations beyond the original context.
We propose to approximate the mutual information with several asymptotic formulas, including $I_e$ in Equation (10) or Equation (13), $I_d$ in Equation (15) and $I_D$ in Equation (18). Our numerical results show that the three approximations $I_e$, $I_d$ and $I_D$ were very accurate for large population size $N$, and sometimes even for relatively small $N$. Among the three approximations, $I_D$ tended to be the least accurate, although, as a special case of $I_d$, it is slightly easier to evaluate than $I_d$. For a comparison of $I_e$ and $I_d$, we note that $I_e$ is the universal formula, whereas $I_d$ is restricted to discrete variables. The two formulas $I_e$ and $I_d$ become identical when the responses or the tuning functions have only two values. For more general tuning functions, the performance of $I_e$ was better than that of $I_d$ in our simulations.
As mentioned before, an advantage of $I_e$ is that it works not only for discrete stimuli but also for continuous stimuli. Theoretically speaking, the formula for $I_e$ is well justified, and we have proven that it approaches the true mutual information $I$ in the limit of a large population. In our numerical simulations, the performance of $I_e$ was excellent and better than that of $I_d$ and $I_D$. Overall, $I_e$ is our most accurate and versatile approximation formula, although, in some cases, $I_d$ and $I_D$ are slightly more convenient to calculate.
The numerical examples considered in this paper were based on an independent population of neurons whose responses have Poisson statistics. Although such models are widely used, they are appropriate only if the neural responses can be well characterized by the spike counts within a fixed time window. To study the temporal patterns of spike trains, one has to consider more complicated models. Estimation of mutual information from neural spike trains is a difficult computational problem [25,26,27,28]. In future work, it would be interesting to apply the asymptotic formulas such as $I_e$ to spike trains with small time bins each containing either one spike or nothing. A potential advantage of the asymptotic formula is that it might help reduce the bias caused by small samples in the calculation of the response marginal distribution $p(\mathbf{r}) = \sum_{\mathbf{x}} p(\mathbf{r}|\mathbf{x})\, p(\mathbf{x})$ or the response entropy $H(R)$, because here one only needs to calculate the Kullback-Leibler divergence $D(\mathbf{x} \| \hat{\mathbf{x}})$, which may have a smaller estimation error.
Finding effective approximation methods for computing mutual information is a key step for many practical applications of information theory. Generally speaking, the Kullback-Leibler divergence (Equation (7)) is often easier to evaluate and approximate than either the Chernoff information (Equation (44)) or the Shannon mutual information (Equation (1)). In situations where this is indeed the case, our approximation formulas are potentially useful. Besides applications in numerical simulations, the availability of a set of approximation formulas may also provide helpful theoretical tools in future analytical studies of information coding and representations.
As mentioned in the Introduction, various methods have been proposed to approximate the mutual information [9,10,11,12,13,14,15,16]. In future work, it would be useful to compare different methods rigorously under identical conditions in order to assess their relative merits. The approximation formulas developed in this paper are relatively easy to compute for practical problems. They are especially suitable for analytical purposes; for example, they could be used explicitly as objective functions for optimization or learning algorithms. Although the examples used in our simulations in this paper are parametric, it should be possible to extend the formulas to nonparametric problems, possibly with the help of the copula method to take advantage of its robustness in nonparametric estimations [16].

Author Contributions

W.H. developed and proved the theorems, programmed the numerical experiments and wrote the manuscript. K.Z. verified the proofs and revised the manuscript.

Funding

This research was supported by an NIH grant R01 DC013698.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. The Proofs

Appendix A.1. Proof of Theorem 1

By Jensen’s inequality, we have
$$I_{\beta,\alpha} = -\left\langle \ln \int_{\mathcal{X}} \left\langle \frac{p^\beta(\mathbf{r}|\hat{\mathbf{x}})\, p^\alpha(\hat{\mathbf{x}})}{p^\beta(\mathbf{r}|\mathbf{x})\, p^\alpha(\mathbf{x})} \right\rangle_{\mathbf{r}|\mathbf{x}} d\hat{\mathbf{x}} \right\rangle_{\mathbf{x}} + H(X) \leq -\left\langle \left\langle \ln \int_{\mathcal{X}} \frac{p^\beta(\mathbf{r}|\hat{\mathbf{x}})\, p^\alpha(\hat{\mathbf{x}})}{p^\beta(\mathbf{r}|\mathbf{x})\, p^\alpha(\mathbf{x})}\, d\hat{\mathbf{x}} \right\rangle_{\mathbf{r}|\mathbf{x}} \right\rangle_{\mathbf{x}} + H(X) \tag{A1}$$
and
$$-\left\langle \left\langle \ln \int_{\mathcal{X}} \frac{p^\beta(\mathbf{r}|\hat{\mathbf{x}})\, p^\alpha(\hat{\mathbf{x}})}{p^\beta(\mathbf{r}|\mathbf{x})\, p^\alpha(\mathbf{x})}\, d\hat{\mathbf{x}} \right\rangle_{\mathbf{r}|\mathbf{x}} \right\rangle_{\mathbf{x}} + H(X) - I = \left\langle \left\langle \ln \left( \frac{\int_{\mathcal{X}} p(\mathbf{r}, \hat{\mathbf{x}})\, d\hat{\mathbf{x}}}{p(\mathbf{r}, \mathbf{x})} \left( \int_{\mathcal{X}} \frac{p^\beta(\mathbf{r}|\hat{\mathbf{x}})\, p^\alpha(\hat{\mathbf{x}})}{p^\beta(\mathbf{r}|\mathbf{x})\, p^\alpha(\mathbf{x})}\, d\hat{\mathbf{x}} \right)^{-1} \right) \right\rangle_{\mathbf{r}|\mathbf{x}} \right\rangle_{\mathbf{x}} \leq \ln \int_{\mathcal{R}} p(\mathbf{r})\, \frac{\int_{\mathcal{X}} p^\beta(\mathbf{r}|\mathbf{x})\, p^\alpha(\mathbf{x})\, d\mathbf{x}}{\int_{\mathcal{X}} p^\beta(\mathbf{r}|\hat{\mathbf{x}})\, p^\alpha(\hat{\mathbf{x}})\, d\hat{\mathbf{x}}}\, d\mathbf{r} = 0. \tag{A2}$$
Combining Equations (A1) and (A2), we immediately get the lower bound in Equation (25).
In this section, we write the expressions for the variable $\mathbf{x}$ using integrals, although our argument is valid for both continuous and discrete variables. For discrete variables, we just need to replace each integral by a summation, and our argument remains valid without other modification. The same is true for the response variable $\mathbf{r}$.
To prove the upper bound, let
$$\Phi\left( q(\hat{\mathbf{x}}) \right) = \int_{\mathcal{R}} p(\mathbf{r}|\mathbf{x}) \int_{\mathcal{X}} q(\hat{\mathbf{x}}) \ln \frac{p(\mathbf{r}|\mathbf{x})\, q(\hat{\mathbf{x}})}{p(\mathbf{r}|\hat{\mathbf{x}})\, p(\hat{\mathbf{x}})}\, d\hat{\mathbf{x}}\, d\mathbf{r}, \tag{A3}$$
where $q(\hat{\mathbf{x}})$ satisfies
$$\int_{\mathcal{X}} q(\hat{\mathbf{x}})\, d\hat{\mathbf{x}} = 1, \qquad q(\hat{\mathbf{x}}) \geq 0. \tag{A4}$$
By Jensen’s inequality, we get
$$\Phi\left( q(\hat{\mathbf{x}}) \right) \geq \int_{\mathcal{R}} p(\mathbf{r}|\mathbf{x}) \ln \frac{p(\mathbf{r}|\mathbf{x})}{p(\mathbf{r})}\, d\mathbf{r}. \tag{A5}$$
To find a function q ( x ^ ) that minimizes Φ q ( x ^ ) , we apply the variational principle as follows:
$$\frac{\partial \tilde{\Phi}\left( q(\hat{\mathbf{x}}) \right)}{\partial q(\hat{\mathbf{x}})} = \int_{\mathcal{R}} p(\mathbf{r}|\mathbf{x}) \ln \frac{p(\mathbf{r}|\mathbf{x})\, q(\hat{\mathbf{x}})}{p(\mathbf{r}|\hat{\mathbf{x}})\, p(\hat{\mathbf{x}})}\, d\mathbf{r} + 1 + \lambda, \tag{A6}$$
where $\lambda$ is the Lagrange multiplier and
$$\tilde{\Phi}\left( q(\hat{\mathbf{x}}) \right) = \Phi\left( q(\hat{\mathbf{x}}) \right) + \lambda \left( \int_{\mathcal{X}} q(\hat{\mathbf{x}})\, d\hat{\mathbf{x}} - 1 \right). \tag{A7}$$
Setting $\partial \tilde{\Phi}\left( q(\hat{\mathbf{x}}) \right) / \partial q(\hat{\mathbf{x}}) = 0$ and using the constraint in Equation (A4), we find the optimal solution
$$q(\hat{\mathbf{x}}) = \frac{p(\hat{\mathbf{x}}) \exp\left( -D(\mathbf{x} \| \hat{\mathbf{x}}) \right)}{\int_{\mathcal{X}} p(\check{\mathbf{x}}) \exp\left( -D(\mathbf{x} \| \check{\mathbf{x}}) \right) d\check{\mathbf{x}}}. \tag{A8}$$
Thus, the variational lower bound of $\Phi\left( q(\hat{\mathbf{x}}) \right)$ is given by
$$\min_{q(\hat{\mathbf{x}})} \Phi\left( q(\hat{\mathbf{x}}) \right) = -\ln \int_{\mathcal{X}} p(\hat{\mathbf{x}}) \exp\left( -D(\mathbf{x} \| \hat{\mathbf{x}}) \right) d\hat{\mathbf{x}}. \tag{A9}$$
Therefore, from Equations (1), (A5) and (A9), we get the upper bound in Equation (25). This completes the proof of Theorem 1.

Appendix A.2. Proof of Theorem 2

It follows from Equation (43) that
$$I_{\beta_1,\alpha_1} = -\left\langle \ln \left\langle \exp\left( -\beta_1 D_{\beta_1}(\mathbf{x} \| \hat{\mathbf{x}}) + (1-\alpha_1) \ln \frac{p(\mathbf{x})}{p(\hat{\mathbf{x}})} \right) \right\rangle_{\hat{\mathbf{x}}} \right\rangle_{\mathbf{x}} \leq -\left\langle \ln \left\langle \exp\left( -e^{-1} D(\mathbf{x} \| \hat{\mathbf{x}}) \right) \right\rangle_{\hat{\mathbf{x}}} \right\rangle_{\mathbf{x}} = I_e \leq -\left\langle \ln \left\langle \exp\left( -D(\mathbf{x} \| \hat{\mathbf{x}}) \right) \right\rangle_{\hat{\mathbf{x}}} \right\rangle_{\mathbf{x}} = I_u, \tag{A10}$$
where $\beta_1 = e^{-1}$ and $\alpha_1 = 1$. We immediately get Equation (26). This completes the proof of Theorem 2.

Appendix A.3. Proof of Theorem 3

For the lower bound $I_{\beta,\alpha}$, we have
$$I_{\beta,\alpha} = -\sum_{m=1}^{M} p(\mathbf{x}_m) \ln \sum_{\check{m}=1}^{M} \left( \frac{p(\mathbf{x}_{\check{m}})}{p(\mathbf{x}_m)} \right)^{\alpha} \exp\left( -\beta D_\beta(\mathbf{x}_m \| \mathbf{x}_{\check{m}}) \right) + H(X) = -\sum_{m=1}^{M} p(\mathbf{x}_m) \ln \left( 1 + d(\mathbf{x}_m) \right) + H(X), \tag{A11}$$
where
$$d(\mathbf{x}_m) = \sum_{\check{m} \in \mathcal{M} \setminus \{m\}} \left( \frac{p(\mathbf{x}_{\check{m}})}{p(\mathbf{x}_m)} \right)^{\alpha} \exp\left( -\beta D_\beta(\mathbf{x}_m \| \mathbf{x}_{\check{m}}) \right). \tag{A12}$$
Now, consider
$$\ln\left( 1 + d(\mathbf{x}_m) \right) = \ln\left( 1 + a(\mathbf{x}_m) + b(\mathbf{x}_m) \right) = \ln\left( 1 + a(\mathbf{x}_m) \right) + \ln\left( 1 + \frac{b(\mathbf{x}_m)}{1 + a(\mathbf{x}_m)} \right) = \ln\left( 1 + a(\mathbf{x}_m) \right) + O\left( N^{-\gamma_1} \right), \tag{A13}$$
where
$$a(\mathbf{x}_m) = \sum_{\hat{m} \in \mathcal{M}_m^\beta} \left( \frac{p(\mathbf{x}_{\hat{m}})}{p(\mathbf{x}_m)} \right)^{\alpha} \exp\left( -\beta D_\beta(\mathbf{x}_m \| \mathbf{x}_{\hat{m}}) \right), \tag{A14}$$
$$b(\mathbf{x}_m) = \sum_{\check{m} \in \mathcal{M} \setminus (\mathcal{M}_m^\beta \cup \{m\})} \left( \frac{p(\mathbf{x}_{\check{m}})}{p(\mathbf{x}_m)} \right)^{\alpha} \exp\left( -\beta D_\beta(\mathbf{x}_m \| \mathbf{x}_{\check{m}}) \right) \leq N^{-\gamma_1} \sum_{\check{m} \in \mathcal{M} \setminus (\mathcal{M}_m^\beta \cup \{m\})} \left( \frac{p(\mathbf{x}_{\check{m}})}{p(\mathbf{x}_m)} \right)^{\alpha} = O\left( N^{-\gamma_1} \right).$$
Combining Equations (A11) and (A13) and Theorem 1, we get the lower bound in Equation (30). In a manner similar to the above, we can get the upper bound in Equations (30) and (31). This completes the proof of Theorem 3.

Appendix A.4. Proof of Theorem 4

The upper bound $I_u$ for the mutual information $I$ in Equation (25) can be written as
$$I_u = -\int_{\mathcal{X}} \ln \left( \int_{\mathcal{X}} p(\hat{\mathbf{x}}) \exp\left( -D(\mathbf{x} \| \hat{\mathbf{x}}) \right) d\hat{\mathbf{x}} \right) p(\mathbf{x})\, d\mathbf{x} = -\left\langle \ln \int_{\mathcal{X}} \exp\left( \left\langle L(\mathbf{r}|\hat{\mathbf{x}}) - L(\mathbf{r}|\mathbf{x}) \right\rangle_{\mathbf{r}|\mathbf{x}} \right) d\hat{\mathbf{x}} \right\rangle_{\mathbf{x}} + H(X), \tag{A15}$$
where $L(\mathbf{r}|\hat{\mathbf{x}}) = \ln\left( p(\mathbf{r}|\hat{\mathbf{x}})\, p(\hat{\mathbf{x}}) \right)$ and $L(\mathbf{r}|\mathbf{x}) = \ln\left( p(\mathbf{r}|\mathbf{x})\, p(\mathbf{x}) \right)$.
Consider the Taylor expansion of $L(\mathbf{r}|\hat{\mathbf{x}})$ around $\mathbf{x}$. Assuming that $L(\mathbf{r}|\hat{\mathbf{x}})$ is twice continuously differentiable for any $\hat{\mathbf{x}} \in \mathcal{X}_\omega(\mathbf{x})$, we get
$$\left\langle L(\mathbf{r}|\hat{\mathbf{x}}) - L(\mathbf{r}|\mathbf{x}) \right\rangle_{\mathbf{r}|\mathbf{x}} = \mathbf{y}^T \mathbf{v}_1 - \frac{1}{2} \mathbf{y}^T \mathbf{y} - \frac{1}{2} \mathbf{y}^T G^{-1/2}(\mathbf{x}) \left( G(\breve{\mathbf{x}}) - G(\mathbf{x}) \right) G^{-1/2}(\mathbf{x})\, \mathbf{y}, \tag{A16}$$
where
$$\mathbf{y} = G^{1/2}(\mathbf{x}) (\hat{\mathbf{x}} - \mathbf{x}), \tag{A17}$$
$$\mathbf{v}_1 = G^{-1/2}(\mathbf{x})\, q'(\mathbf{x}), \tag{A18}$$
and
$$\breve{\mathbf{x}} = \mathbf{x} + t (\hat{\mathbf{x}} - \mathbf{x}) \in \mathcal{X}_\omega(\mathbf{x}), \quad t \in [0, 1]. \tag{A19}$$
For later use, we also define
$$\mathbf{v} = G^{-1/2}(\mathbf{x})\, l'(\mathbf{r}|\mathbf{x}), \tag{A20}$$
where
$$l(\mathbf{r}|\mathbf{x}) = \ln p(\mathbf{r}|\mathbf{x}). \tag{A21}$$
Since $G(\breve{\mathbf{x}})$ is continuous and symmetric for $\breve{\mathbf{x}} \in \mathcal{X}$, for any $\epsilon \in (0, 1)$ there is an $\varepsilon \in (0, \omega)$ such that
$$\left| \mathbf{y}^T G^{-1/2}(\mathbf{x}) \left( G(\breve{\mathbf{x}}) - G(\mathbf{x}) \right) G^{-1/2}(\mathbf{x})\, \mathbf{y} \right| < \epsilon \| \mathbf{y} \|^2 \tag{A22}$$
for all $\mathbf{y} \in \mathcal{Y}_\varepsilon$, where $\mathcal{Y}_\varepsilon = \left\{ \mathbf{y} \in \mathbb{R}^K : \| \mathbf{y} \| < \varepsilon \sqrt{N} \right\}$. Then, we get
$$-\ln \int_{\mathcal{X}} \exp\left( \left\langle L(\mathbf{r}|\hat{\mathbf{x}}) - L(\mathbf{r}|\mathbf{x}) \right\rangle_{\mathbf{r}|\mathbf{x}} \right) d\hat{\mathbf{x}} \leq \frac{1}{2} \ln \det G(\mathbf{x}) - \ln \int_{\mathcal{Y}_\varepsilon} \exp\left( \mathbf{y}^T \mathbf{v}_1 - \frac{1}{2}(1+\epsilon)\, \mathbf{y}^T \mathbf{y} \right) d\mathbf{y} \tag{A23}$$
and, with Jensen's inequality,
$$\ln \int_{\mathcal{Y}_\varepsilon} \exp\left( \mathbf{y}^T \mathbf{v}_1 - \frac{1}{2}(1+\epsilon)\, \mathbf{y}^T \mathbf{y} \right) d\mathbf{y} \geq \ln \Psi_\varepsilon + \int_{\hat{\mathcal{Y}}_\varepsilon} \mathbf{y}^T \mathbf{v}_1\, \phi_\varepsilon(\mathbf{y})\, d\mathbf{y} = \frac{K}{2} \ln \frac{2\pi}{1+\epsilon} + O\left( N^{-K/2} e^{-N\delta} \right), \tag{A24}$$
where $\delta$ is a positive constant, $\int_{\hat{\mathcal{Y}}_\varepsilon} \mathbf{y}^T \mathbf{v}_1\, \phi_\varepsilon(\mathbf{y})\, d\mathbf{y} = 0$,
$$\phi_\varepsilon(\mathbf{y}) = \Psi_\varepsilon^{-1} \exp\left( -\frac{1}{2}(1+\epsilon)\, \mathbf{y}^T \mathbf{y} \right), \qquad \Psi_\varepsilon = \int_{\hat{\mathcal{Y}}_\varepsilon} \exp\left( -\frac{1}{2}(1+\epsilon)\, \mathbf{y}^T \mathbf{y} \right) d\mathbf{y}, \tag{A25}$$
and
$$\hat{\mathcal{Y}}_\varepsilon = \left\{ \mathbf{y} \in \mathbb{R}^K : |y_k| < \varepsilon \sqrt{N/K},\; k = 1, 2, \ldots, K \right\} \subseteq \mathcal{Y}_\varepsilon. \tag{A26}$$
Now, we evaluate
$$\Psi_\varepsilon = \int_{\mathbb{R}^K} \exp\left( -\frac{1}{2}(1+\epsilon)\, \mathbf{y}^T \mathbf{y} \right) d\mathbf{y} - \int_{\mathbb{R}^K \setminus \hat{\mathcal{Y}}_\varepsilon} \exp\left( -\frac{1}{2}(1+\epsilon)\, \mathbf{y}^T \mathbf{y} \right) d\mathbf{y} = \left( \frac{2\pi}{1+\epsilon} \right)^{K/2} - \int_{\mathbb{R}^K \setminus \hat{\mathcal{Y}}_\varepsilon} \exp\left( -\frac{1}{2}(1+\epsilon)\, \mathbf{y}^T \mathbf{y} \right) d\mathbf{y}. \tag{A27}$$
Performing integration by parts with $\int_a^\infty e^{-t^2/2}\, dt = \frac{e^{-a^2/2}}{a} - \int_a^\infty \frac{e^{-t^2/2}}{t^2}\, dt$, we find
$$\int_{\mathbb{R}^K \setminus \hat{\mathcal{Y}}_\varepsilon} \exp\left( -\frac{1}{2}(1+\epsilon)\, \mathbf{y}^T \mathbf{y} \right) d\mathbf{y} = O\left( N^{-K/2} e^{-N\delta} \right), \tag{A28}$$
for some constant δ > 0 .
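As a quick sanity check, the integration-by-parts identity used in this step can be verified numerically; the value of $a$ below is an arbitrary choice and the snippet is not part of the original derivation.

```python
import numpy as np
from scipy.integrate import quad

a = 1.7
lhs, _ = quad(lambda t: np.exp(-t ** 2 / 2), a, np.inf)
tail, _ = quad(lambda t: np.exp(-t ** 2 / 2) / t ** 2, a, np.inf)
rhs = np.exp(-a ** 2 / 2) / a - tail
print(lhs, rhs)          # the two values agree to numerical precision
```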
Combining Equations (A15), (A23) and (A24), we get
$$I_u \leq \frac{1}{2} \left\langle \ln \det\left( \frac{1+\epsilon}{2\pi} G(\mathbf{x}) \right) \right\rangle_{\mathbf{x}} + H(X) + O\left( N^{-K/2} e^{-N\delta} \right). \tag{A29}$$
On the other hand, from Equation (A22) and the condition in Equation (33), we obtain
$$\int_{\mathcal{X}_\varepsilon(\mathbf{x})} \exp\left( \left\langle L(\mathbf{r}|\hat{\mathbf{x}}) - L(\mathbf{r}|\mathbf{x}) \right\rangle_{\mathbf{r}|\mathbf{x}} \right) d\hat{\mathbf{x}} \leq \det\left( G(\mathbf{x}) \right)^{-1/2} \int_{\mathbb{R}^K} \exp\left( \mathbf{y}^T \mathbf{v}_1 - \frac{1}{2}(1-\epsilon)\, \mathbf{y}^T \mathbf{y} \right) d\mathbf{y} = \det\left( \frac{1-\epsilon}{2\pi} G(\mathbf{x}) \right)^{-1/2} \exp\left( \frac{1}{2}(1-\epsilon)^{-1} \mathbf{v}_1^T \mathbf{v}_1 \right) \tag{A30}$$
and
$$\int_{\mathcal{X}} \exp\left( \left\langle L(\mathbf{r}|\hat{\mathbf{x}}) - L(\mathbf{r}|\mathbf{x}) \right\rangle_{\mathbf{r}|\mathbf{x}} \right) d\hat{\mathbf{x}} = \int_{\mathcal{X}_\varepsilon(\mathbf{x})} \exp\left( \left\langle L(\mathbf{r}|\hat{\mathbf{x}}) - L(\mathbf{r}|\mathbf{x}) \right\rangle_{\mathbf{r}|\mathbf{x}} \right) d\hat{\mathbf{x}} + \int_{\mathcal{X} \setminus \mathcal{X}_\varepsilon(\mathbf{x})} \exp\left( \left\langle L(\mathbf{r}|\hat{\mathbf{x}}) - L(\mathbf{r}|\mathbf{x}) \right\rangle_{\mathbf{r}|\mathbf{x}} \right) d\hat{\mathbf{x}} \leq \det\left( \frac{1-\epsilon}{2\pi} G(\mathbf{x}) \right)^{-1/2} \left( \exp\left( \frac{\mathbf{v}_1^T \mathbf{v}_1}{2(1-\epsilon)} \right) + O\left( N^{-1} \right) \right). \tag{A31}$$
It follows from Equations (A15) and (A31) that
$$-\left\langle \ln \int_{\mathcal{X}} \exp\left( \left\langle L(\mathbf{r}|\hat{\mathbf{x}}) - L(\mathbf{r}|\mathbf{x}) \right\rangle_{\mathbf{r}|\mathbf{x}} \right) d\hat{\mathbf{x}} \right\rangle_{\mathbf{x}} \geq \frac{1}{2} \left\langle \ln \det\left( \frac{1-\epsilon}{2\pi} G(\mathbf{x}) \right) \right\rangle_{\mathbf{x}} - \frac{1}{2}(1-\epsilon)^{-1} \left\langle \mathbf{v}_1^T \mathbf{v}_1 \right\rangle_{\mathbf{x}} + O\left( N^{-1} \right). \tag{A32}$$
Note that
$$\left\langle \mathbf{v}_1^T \mathbf{v}_1 \right\rangle_{\mathbf{x}} = O\left( N^{-1} \right). \tag{A33}$$
Now, we have
$$I_u \geq \frac{1}{2} \left\langle \ln \det\left( \frac{1-\epsilon}{2\pi} G(\mathbf{x}) \right) \right\rangle_{\mathbf{x}} + H(X) + O\left( N^{-1} \right). \tag{A34}$$
Since ϵ is arbitrary, we can let it go to zero. Therefore, from Equations (25), (A29) and (A34), we obtain the upper bound in Equation (35).
The Taylor expansion of $h(\hat{\mathbf{x}}, \mathbf{x}) = \left\langle \left( \frac{p(\mathbf{r}|\hat{\mathbf{x}})}{p(\mathbf{r}|\mathbf{x})} \right)^\beta \right\rangle_{\mathbf{r}|\mathbf{x}}$ around $\mathbf{x}$ is
$$h(\hat{\mathbf{x}}, \mathbf{x}) = 1 + \beta \left\langle \frac{1}{p(\mathbf{r}|\mathbf{x})} \frac{\partial p(\mathbf{r}|\mathbf{x})}{\partial \mathbf{x}^T} \right\rangle_{\mathbf{r}|\mathbf{x}} (\hat{\mathbf{x}} - \mathbf{x}) + (\hat{\mathbf{x}} - \mathbf{x})^T\, \frac{\beta}{2} \left\langle \frac{\beta - 1}{p^2(\mathbf{r}|\mathbf{x})} \frac{\partial p(\mathbf{r}|\mathbf{x})}{\partial \mathbf{x}} \frac{\partial p(\mathbf{r}|\mathbf{x})}{\partial \mathbf{x}^T} + \frac{1}{p(\mathbf{r}|\mathbf{x})} \frac{\partial^2 p(\mathbf{r}|\mathbf{x})}{\partial \mathbf{x}\, \partial \mathbf{x}^T} \right\rangle_{\mathbf{r}|\mathbf{x}} (\hat{\mathbf{x}} - \mathbf{x}) + \cdots = 1 - \frac{\beta(1-\beta)}{2} (\hat{\mathbf{x}} - \mathbf{x})^T J(\mathbf{x}) (\hat{\mathbf{x}} - \mathbf{x}) + \cdots. \tag{A35}$$
In a similar manner as described above, we obtain the asymptotic relationship (37):
$$I_{\beta,\alpha} = I_\gamma + O\left( N^{-1} \right) = \frac{1}{2} \int_{\mathcal{X}} p(\mathbf{x}) \ln \det \frac{G_\gamma(\mathbf{x})}{2\pi}\, d\mathbf{x} + H(X) + O\left( N^{-1} \right). \tag{A36}$$
Notice that $0 < \gamma = \beta(1-\beta) \leq 1/4$, and the equality holds when $\beta = \beta_0 = 1/2$. Thus, we have
$$\det G_\gamma(\mathbf{x}) \leq \det G_{\gamma_0}(\mathbf{x}). \tag{A37}$$
Combining Equations (25), (A36) and (A37) yields the lower bound in Equation (35).
The proof of Equation (36) is similar. This completes the proof of Theorem 4.

References

  1. Borst, A.; Theunissen, F.E. Information theory and neural coding. Nat. Neurosci. 1999, 2, 947–957. [Google Scholar] [CrossRef] [PubMed]
  2. Pouget, A.; Dayan, P.; Zemel, R. Information processing with population codes. Nat. Rev. Neurosci. 2000, 1, 125–132. [Google Scholar] [CrossRef] [PubMed]
  3. Laughlin, S.B.; Sejnowski, T.J. Communication in neuronal networks. Science 2003, 301, 1870–1874. [Google Scholar] [CrossRef] [PubMed]
  4. Brown, E.N.; Kass, R.E.; Mitra, P.P. Multiple neural spike train data analysis: State-of-the-art and future challenges. Nat. Neurosci. 2004, 7, 456–461. [Google Scholar] [CrossRef] [PubMed]
  5. Bell, A.J.; Sejnowski, T.J. The “independent components” of natural scenes are edge filters. Vision Res. 1997, 37, 3327–3338. [Google Scholar] [CrossRef] [Green Version]
  6. Huang, W.; Zhang, K. An Information-Theoretic Framework for Fast and Robust Unsupervised Learning via Neural Population Infomax. In Proceedings of the 5th International Conference on Learning Representations (ICLR), Toulon, France, 24–26 April 2017. [Google Scholar]
  7. Huang, W.; Huang, X.; Zhang, K. Information-theoretic interpretation of tuning curves for multiple motion directions. In Proceedings of the 2017 51st Annual Conference on Information Sciences and Systems (CISS), Baltimore, MD, USA, 22–24 March 2017. [Google Scholar]
  8. Cover, T.M.; Thomas, J.A. Elements of Information Theory, 2nd ed.; Wiley-Interscience: New York, NY, USA, 2006. [Google Scholar]
  9. Miller, G.A. Note on the bias of information estimates. Inf. Theory Psychol. Probl. Methods 1955, 2, 100. [Google Scholar]
  10. Carlton, A. On the bias of information estimates. Psychol. Bull. 1969, 71, 108. [Google Scholar] [CrossRef]
  11. Treves, A.; Panzeri, S. The upward bias in measures of information derived from limited data samples. Neural Comput. 1995, 7, 399–407. [Google Scholar] [CrossRef]
  12. Victor, J.D. Asymptotic bias in information estimates and the exponential (Bell) polynomials. Neural Comput. 2000, 12, 2797–2804. [Google Scholar] [CrossRef] [PubMed]
  13. Paninski, L. Estimation of entropy and mutual information. Neural Comput. 2003, 15, 1191–1253. [Google Scholar] [CrossRef]
  14. Kraskov, A.; Stögbauer, H.; Grassberger, P. Estimating mutual information. Phys. Rev. E 2004, 69, 066138. [Google Scholar] [CrossRef] [PubMed]
  15. Khan, S.; Bandyopadhyay, S.; Ganguly, A.R.; Saigal, S.; Erickson, D.J., III; Protopopescu, V.; Ostrouchov, G. Relative performance of mutual information estimation methods for quantifying the dependence among short and noisy data. Phys. Rev. E 2007, 76, 026209. [Google Scholar] [CrossRef] [PubMed]
  16. Safaai, H.; Onken, A.; Harvey, C.D.; Panzeri, S. Information estimation using nonparametric copulas. Phys. Rev. E 2018, 98, 053302. [Google Scholar] [CrossRef]
  17. Rao, C.R. Information and accuracy attainable in the estimation of statistical parameters. Bull. Calcutta Math. Soc. 1945, 37, 81–91. [Google Scholar]
  18. Van Trees, H.L.; Bell, K.L. Bayesian Bounds for Parameter Estimation and Nonlinear Filtering/Tracking; John Wiley: Piscataway, NJ, USA, 2007. [Google Scholar]
  19. Clarke, B.S.; Barron, A.R. Information-theoretic asymptotics of Bayes methods. IEEE Trans. Inform. Theory 1990, 36, 453–471. [Google Scholar] [CrossRef] [Green Version]
  20. Rissanen, J.J. Fisher information and stochastic complexity. IEEE Trans. Inform. Theory 1996, 42, 40–47. [Google Scholar] [CrossRef]
  21. Brunel, N.; Nadal, J.P. Mutual information, Fisher information, and population coding. Neural Comput. 1998, 10, 1731–1757. [Google Scholar] [CrossRef] [PubMed]
  22. Sompolinsky, H.; Yoon, H.; Kang, K.J.; Shamir, M. Population coding in neuronal systems with correlated noise. Phys. Rev. E 2001, 64, 051904. [Google Scholar] [CrossRef] [PubMed]
  23. Kang, K.; Sompolinsky, H. Mutual information of population codes and distance measures in probability space. Phys. Rev. Lett. 2001, 86, 4958–4961. [Google Scholar] [CrossRef] [PubMed]
  24. Huang, W.; Zhang, K. Information-theoretic bounds and approximations in neural population coding. Neural Comput. 2018, 30, 885–944. [Google Scholar] [CrossRef] [PubMed]
  25. Strong, S.P.; Koberle, R.; van Steveninck, R.R.D.R.; Bialek, W. Entropy and information in neural spike trains. Phys. Rev. Lett. 1998, 80, 197. [Google Scholar] [CrossRef]
  26. Nemenman, I.; Bialek, W.; van Steveninck, R.D.R. Entropy and information in neural spike trains: Progress on the sampling problem. Phys. Rev. E 2004, 69, 056111. [Google Scholar] [CrossRef] [PubMed]
  27. Panzeri, S.; Senatore, R.; Montemurro, M.A.; Petersen, R.S. Correcting for the sampling bias problem in spike train information measures. J. Neurophysiol. 2007, 98, 1064–1072. [Google Scholar] [CrossRef] [PubMed]
  28. Houghton, C. Calculating the Mutual Information Between Two Spike Trains. Neural Comput. 2019, 31, 330–343. [Google Scholar] [CrossRef] [PubMed]
  29. Rényi, A. On measures of entropy and information. In Fourth Berkeley Symposium on Mathematical Statistics and Probability; The Regents of the University of California, University of California Press: Berkeley, CA, USA, 1961; pp. 547–561. [Google Scholar]
  30. Chernoff, H. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. Ann. Math. Stat. 1952, 23, 493–507. [Google Scholar] [CrossRef]
  31. Bhattacharyya, A. On a measure of divergence between two statistical populations defined by their probability distributions. Bull. Calcutta Math. Soc. 1943, 35, 99–109. [Google Scholar]
  32. Beran, R. Minimum Hellinger distance estimates for parametric models. Ann. Stat. 1977, 5, 445–463. [Google Scholar] [CrossRef]
Figure 1. A comparison of approximations $I_e$, $I_d$ and $I_D$ against $I_{MC}$ obtained by the Monte Carlo method for one-dimensional discrete stimuli. (a) Discrete uniform distribution of the stimulus $p(x)$ (black dots) and the Heaviside step tuning function $f(x;\theta)$ with center $\theta = 0$ (blue dashed lines); (b) The values of $I_{MC}$, $I_e$, $I_d$ and $I_D$ depend on the population size or total number of neurons $N$; (c) The relative errors $D_{I_e}$, $D_{I_d}$ and $D_{I_D}$ for the results in (b); (d) The absolute values of the relative errors $|D_{I_e}|$, $|D_{I_d}|$ and $|D_{I_D}|$ as in (c), with error bars showing standard deviations of repeated trials.
Figure 2. A comparison of approximations $I_e$, $I_d$ and $I_D$ against $I_{MC}$. The situation is identical to that in Figure 1 except that the stimulus distribution $p(x)$ is peaked rather than flat (black dots in (a)). (a) Discrete Gaussian-like distribution of the stimulus $p(x)$ (black dots) and the Heaviside step tuning function $f(x;\theta)$ with center $\theta = 0$ (blue dashed lines); (b) The values of $I_{MC}$, $I_e$, $I_d$ and $I_D$ depend on the population size or total number of neurons $N$; (c) The relative errors $D_{I_e}$, $D_{I_d}$ and $D_{I_D}$ for the results in (b); (d) The absolute values of the relative errors $|D_{I_e}|$, $|D_{I_d}|$ and $|D_{I_D}|$ as in (c), with error bars showing standard deviations of repeated trials.
Figure 3. A comparison of approximations $I_e$, $I_d$ and $I_D$ against $I_{MC}$. The situation is identical to that in Figure 1 except for the shape of the tuning function (blue dashed lines in (a)). (a) Discrete uniform distribution of the stimulus $p(x)$ (black dots) and the rectified linear tuning function $f(x;\theta)$ with center $\theta = 0$ (blue dashed lines); (b) The values of $I_{MC}$, $I_e$, $I_d$ and $I_D$ depend on the population size or total number of neurons $N$; (c) The relative errors $D_{I_e}$, $D_{I_d}$ and $D_{I_D}$ for the results in (b); (d) The absolute values of the relative errors $|D_{I_e}|$, $|D_{I_d}|$ and $|D_{I_D}|$ as in (c), with error bars showing standard deviations of repeated trials.
Figure 4. A comparison of approximations $I_e$, $I_d$ and $I_D$ against $I_{MC}$. The situation is identical to that in Figure 3 except that the stimulus distribution $p(x)$ is peaked rather than flat (black dots in (a)). (a) Discrete Gaussian-like distribution of the stimulus $p(x)$ (black dots) and the rectified linear tuning function $f(x;\theta)$ with center $\theta = 0$ (blue dashed lines); (b) The values of $I_{MC}$, $I_e$, $I_d$ and $I_D$ depend on the population size or total number of neurons $N$; (c) The relative errors $D_{I_e}$, $D_{I_d}$ and $D_{I_D}$ for the results in (b); (d) The absolute values of the relative errors $|D_{I_e}|$, $|D_{I_d}|$ and $|D_{I_D}|$ as in (c), with error bars showing standard deviations of repeated trials.
Figure 5. A comparison of approximations $I_e$, $I_d$ and $I_D$ against $I_{MC}$. The situation is similar to that in Figure 1 except that the tuning function is random (blue dashed lines in (a)); see Equation (66). (a) Discrete uniform distribution of the stimulus $p(x)$ (black dots) and the random tuning function $f(x;\boldsymbol{\theta})$; (b) The values of $I_{MC}$, $I_e$, $I_d$ and $I_D$ depend on the population size or total number of neurons $N$; (c) The relative errors $D_{I_e}$, $D_{I_d}$ and $D_{I_D}$ for the results in (b); (d) The absolute values of the relative errors $|D_{I_e}|$, $|D_{I_d}|$ and $|D_{I_D}|$ as in (c), with error bars showing standard deviations of repeated trials.
Figure 6. A comparison of approximations $I_e$, $I_d$ and $I_D$ against $I_{MC}$. The situation is identical to that in Figure 5 except that the stimulus distribution $p(x)$ is not flat (black dots in (a)). (a) Discrete Gaussian-like distribution of the stimulus $p(x)$ (black dots) and the random tuning function $f(x;\boldsymbol{\theta})$; (b) The values of $I_{MC}$, $I_e$, $I_d$ and $I_D$ depend on the population size or total number of neurons $N$; (c) The relative errors $D_{I_e}$, $D_{I_d}$ and $D_{I_D}$ for the results in (b); (d) The absolute values of the relative errors $|D_{I_e}|$, $|D_{I_d}|$ and $|D_{I_D}|$ as in (c), with error bars showing standard deviations of repeated trials.
