Limitations to Estimating Mutual Information in Large Neural Populations

Information theory provides a powerful framework to analyse the representation of sensory stimuli in neural population activity. However, estimating the quantities involved such as entropy and mutual information from finite samples is notoriously hard and any direct estimate is known to be heavily biased. This is especially true when considering large neural populations. We study a simple model of sensory processing and show through a combinatorial argument that, with high probability, for large neural populations any finite number of samples of neural activity in response to a set of stimuli is mutually distinct. As a consequence, the mutual information when estimated directly from empirical histograms will be equal to the stimulus entropy. Importantly, this is the case irrespective of the precise relation between stimulus and neural activity and corresponds to a maximal bias. This argument is general and applies to any application of information theory, where the state space is large and one relies on empirical histograms. Overall, this work highlights the need for alternative approaches for an information theoretic analysis when dealing with large neural populations.


Introduction
The neural code is inherently stochastic, so that even when the same stimulus is repeated, we observe a considerable variability in the neural response. Thus, an important question when considering the interplay between sensory stimuli and neural activity, and when trying to understand how information is represented and processed in neural systems, is how much an observer can actually infer about the stimulus from only looking at its representation in the neural activity. This problem can be formulated and quantitatively studied in the general framework of information theory.
With neural information processing happening at the level of populations, any such analysis necessarily has to consider the activity of an entire population [1,2]. The number of neurons which can be recorded simultaneously has increased in recent years, driven by technologies such as calcium fluorescence imaging [3] and high-density electrode arrays [4]. These methods allow recording of population activity from thousands of neurons simultaneously. However, at these scales, information theoretic analyses of neural activity become considerably harder. Critically, any information theoretic analysis depends at some point on the precise knowledge of the joint probability distribution of the states of the stimuli and the neural population.
However, estimating this distribution from data, and thus entropy and mutual information, is notoriously difficult. This problem is well known, in particular, when the state space for the

Results
In a typical experiment investigating sensory processing, one studies the activity from a neural population of interest in response to a defined set of stimuli. Due to the stochastic nature of neural activity, every stimulus is presented separately multiple times, yielding a finite set of samples of neural activity for every stimulus [25]. From these samples across all stimuli, one can then estimate the relative frequency of patterns of neural activity in response to different stimuli in order to, for example, calculate the dependence in terms of mutual information between the stimuli and the neural activity [26,27].
To make this paradigm more concrete, suppose we are interested in the representation of the visual field in the primary visual cortex. In order to probe this representation, we simultaneously record the spiking activity from a neural population in the respective brain area when exposed to a visual stimulus whose position we vary in the course of the experiment. We can sample the neural activity for every presentation of the stimulus, by measuring the neural activity, for example, in terms of the number of spikes. In order then to quantify the representation of the different stimulus values in their entirety, we can collect the samples of neural activity that we observed, compute empirical histograms and perform an information theoretic analysis.
The central objects of interest in information theory are random variables and their distributions. At its core one defines a measure of their uncertainty, the entropy [28]. If X is a discrete random variable taking values in an at most countable set, its entropy H(X) is defined as where P [X = · ] denotes the probability distribution of X, and therefore P [X = x] the probability the X attains the value x. In this definition, we adopt the customary convention that 0 log 0 := 0. For the purpose of this paper, we leave the base of the logarithm unspecified; however, we note that in the context of information theory the base is generally chosen to be 2 so that entropy will be measured in units of bits.
If X is a second random variable, the conditional entropy of X given X , H(X|X ), is defined as where P [X = · ∧ X = · ] and P [ X = · | X = · ] denote the joint probability distribution of X and X and conditional probability distribution of X given X , respectively. The conditional entropy is a measure for the residual uncertainty in a random variable given observation of another random variable. Specifically, we have that H(X|X ) ≤ H(X) with equality if X and X are independent. An important quantity in information theory is the mutual information between two random variables, X and X , which is defined as MI X; X = H(X) − H X|X .
Mutual information quantifies the amount of entropy of X that knowledge of X annihilates or, in other words, the information that X holds about X. Furthermore, as the mutual information is symmetric in its arguments it likewise also quantifies the information X holds about X . Mutual information is non-negative and vanishes if and only if X and X are independent. Because of this, mutual information is frequently used as a measure of the independence of two random variables. The mutual information between X and X can be written in terms of the relative entropy (Kullback-Leibler divergence) between their joint probability distributions and the product of the corresponding marginal distributions, which defines a premetric on the probability distribution.
Importantly, we note that, in contrast to what the notation suggests, both the entropy and the mutual information do not depend on the random variables and their values themselves, but are rather functionals of their distributions [29]. As a consequence of that, we refrain from specifying the random variables' codomains in the following and if a random variable X attains the value x, x may simply be regarded as a label in an abstract alphabet.

Analysis of a Computational Model of Sensory Processing Regarding Estimating Information Theoretic Quantities
Building on the experimental paradigm we outlined above, we devise and analyse a simple computational model. We assume the stimulus to be modelled by a discrete, almost surely non-constant random variable S and the neural population's activity by the discrete vector-valued random variable X ≡ N n=1 X (n) , where N is the size of the population. As the stimulus is determined by the experimental protocol, we assume perfect knowledge of its statistics. In addition, we assume that for every stimulus value s there exist a subset of the whole neural population U ⊥ ⊥s ⊆ {1, . . . N} such that N s,s := |U ⊥ ⊥s ∩ U ⊥ ⊥s | = ω(ln N) as N → ∞, i.e., N s,s asymptotically grows faster than the logarithm, for all stimulus values s and s such that the components X (n) | S ∈ {s, s } for n ∈ U ⊥ ⊥s ∩ U ⊥ ⊥s are independent and identically distributed and almost surely non-constant ( Figure 1). In an experimental setting, one might think of U ⊥ ⊥s as the set of neurons that are not receptive to a stimulus value s and therefore activated independently according to a common noise profile. Importantly, these sets differ in general from stimulus to stimulus. However for every two stimulus values, they overlap on a sufficiently large common set. Moreover, in contrast to what one might initially think, these sets cannot be excluded since one can imagine a scenario where every neuron in the population is receptive to at least one of the stimulus values and is simultaneously an element of at least one independent set for other stimulus values.
Next, suppose that for every stimulus value s we are given K s independent samples from for any sample x and stimulus value s. The Kronecker delta here attains the value 1 whenever its arguments coincide, and otherwise vanishes. Again, in the experimental setting, one might think of x (s) as the samples of neural activity that was recorded in response to a stimulus value s. Consistent with the intuition of Rhee et al. [7], with high probability, the samples for every stimulus value are mutually different, as we are considering larger and larger neural populations and, moreover, these samples even become uniquely associated with one of the stimulus values, i.e., x (s) ∩ x (s ) = ∅ for s = s . This simplifies the empirical estimate for P [ X = · | S = · ] from above so that, for a sample x (s) k and a stimulus value s ,P X = x (s) k S = s = 1 K s δ s,s , using that every sample is uniquely associated with a stimulus value and occured only once for those stimulus value. As we will show, the probability for this event is at least This not only implies that in the limit of an infinitely large population, the probability is 1, but also makes a statement about the dependence on the size of population. In fact, the probability approaches 1 eventually faster than any polynomial, so that it will be close to 1 even for moderately sized populations.
. . . Figure 1. A computational model of sensory processing. We consider a population of N neurons, illustrated irrespective of the physical location as points in the plane on the left-hand side that is exposed to the presentation of a stimulus, which is modelled by a random variable S. The (measured) activity of this population, illustrated as spiking activity on the right-hand side over the presentation of two different values of the stimulus, is modelled by a vector-valued random variable X ≡ N n=1 X (n) . For every stimulus value s, we assume there exists a subset of the population, U s , depicted as circular regions on the plane, such that, intuitively, the neurons within this subpopulation are receptive to the particular stimulus value. Conversely, the neurons in the complement of that set in {1, . . . N}, U ⊥ ⊥s , are assumed to be not receptive to that stimulus value and to activate independently according to a common noise profile. As an example, the sets for stimulus values s and s are highlighted. Neurons from any of the two sets are shown to have an increased spiking activity during the presentation of the corresponding stimulus value. In contrast to this activity, others are shown to be rather sporadically active.

X
For any two stimuli there exists by assumption a subset of the components of X which is independent and identically distributed when conditioned on either of the two stimuli. As it suffices that the samples differ in these components for them to be mutually different, the probability for this event is a lower bound on the probability that the samples are mutually different. Therefore, the probability for all samples corresponding to stimulus s and s to be mutually different is at least 1 − O(N −∞ ), for N large. As we show in the next section, this is the probability that a finite number of independent samples from a random vector of length N s,s with N s,s = ω(ln N) are mutually different, provided the components are independent and identically distributed and almost surely non-constant, which is the case by assumption.
From that the probability that this event occurs for all stimulus values can be estimated to be at least In that event, when all the samples are mutually different, we compute for the entropy and conditional entropy,Ĥ(X) andĤ(X|S), using the law of total probability to express both in terms of the distribution of S and the empirical estimate of the distribution of X given S, the fact that the latter vanishes away from the samples and that, as we have seen above, for a sample x Thus,MI(X; S) =Ĥ(X) −Ĥ(X|S) = H(S), and in addition we also get from the above thatĤ(X, S) = H(X) and thatĤ(S|X) = 0. As for mutual information the classical bound MI(X; S) ≤ min(H(X) , H(S)) holds, mutual information is in fact estimated to be maximal.
Importantly, in this analysis we did not make any assumptions about the precise dependence between S and X apart from the existence of a sufficiently large subpopulation U ⊥ ⊥s for every stimulus value s. Therefore, the conclusion holds also if S and X were, in fact, independent and mutual information should have vanished.

Computing the Probability of a Finite Number of Independent and Identically Distributed Random Variables being Mutually Different
In the previous section, we were interested in the probability that a finite set of independent and identically distributed (discrete) random variables yields mutually different realisations. This probability is trivially 0 by the pigeonhole principle if there are more random variables in any such set than the number of values these random variables attain with non-vanishing probability. However, once the number of values that are attained exceeds the number of random variables this is not the case anymore, and for any arbitrary distribution of the random variable it is not obvious how to obtain the exact probabilities.
We first take a step back and consider generally series of the form ∑ n 1 ,...n K ∈N n 1 =n 2 =...n K a n 1 · · · a n K for sequences (a n ) n∈N such that the corresponding series is absolutely convergent. This specifically guarantees that the series in question will also converge. We will show how to evaluate such a series and from a closed-form expression derive the probabilities for any finite set of random variables to yield mutually different realisations. From this it will follow, in particular, that large random vectors yield mutually different values with high probability.
Let (a n ) n∈N be a sequence with a n ∈ R for every n ∈ N such that ∑ n∈N a n is absolutely convergent and define Q m := ∑ n∈N a m n for m ∈ N. Then, for any K ∈ N we have that ∑ n 1 ,...n K ∈N n 1 =n 2 =...n K K ∏ k=1 a n k = Here and in the following we use the conventional notation for multi-indices α ∈ N K 0 , so that, e.g., |α| = ∑ K k=1 α k and α! = ∏ K k=1 α k !. In addition, we set α k = ∑ k m=1 α m m for 1 ≤ k ≤ K. We will prove this assertion in two steps. We will first derive a recursion relation for the series in question and then in a second step conclude with an induction argument.
In particular, we have for every 1 ≤ k ≤ k ≤ K and m ∈ N 0 that (Γ 1 • Γ 2 • · · · Γ k ) (a m n k ) = ∑ n 1 ,...n k ∈N n 1 =n 2 =...n k a n 1 · · · a m+1 n k · · · a n k = ∑ n 1 ,...n k ∈N n 1 =n 2 =...n k a n 1 · · · a m+1 n k = (Γ 1 • Γ 2 • · · · Γ k ) (a m n k ), where we have relabeled the indices and the interchangability of the sums is guaranteed by the absolute convergence of every individual series. Therefore, (Γ 1 • Γ 2 • · · · Γ k ) (a m n k ) = (Γ 1 • Γ 2 • · · · Γ k ) (a m n k ). Now, using the linearity of the operators Γ 1 , . . . Γ K we have that In the last step we explicitly resolved the recursion, which can be verified by a simple inductive argument. In particular, for m = 0 we obtain an expression for A K so that we conclude after reordering the sum and rearranging the terms that As A K is precisely the sum we intend to compute, the last expression yields a recursion relation for it with initial datum A 0 = 1. In particular, we note that this equation is reminiscent of a discrete Volterra equation of convolution type. Thus, this concludes the first step of the argument and in the second step we will show that the expression stated in the beginning actually solves this recursion relation.
For the induction argument we assume now that the assertion holds for 1 ≤ k ≤ K − 1 in order to perform the step K − 1 → K and we use the recursion relation derived above to relate A K to A 0 , A 1 , . . . A K−1 . In addition, we denote with e k the multi-index that is 0 in every component except the kth one, where it is 1.
In the last step, we used that, as we will show, for any multi-index functional Φ, Indeed, we first observe that the terms of Φ that appear in both sums are identical. Therefore, we only have to carefully evaluate the coefficients.
This concludes the argument, so that as claimed In order to apply this result now in a probabilistic context to answer the question about the probability that any finite set of (discrete) random variables yields mutually different realisations, consider a discrete random variable X that takes values in the set {x 1 , x 2 , . . .}, which we assume without loss of generality to be countably infinite. Then, applying the above result to the sequence (P [X = x n ]) n∈N we can compute the probability that K independent copies of this random variable, X 1 , . . . X K , all yield mutually different realisations. Specifically, we have that Now, consider random vectors of increasing length, X ≡ f(N) n=1 X (n) , with independent components and assume that there exists > 0 such that Q (n) For the latter a necessary condition is that X (n) is almost surely non-constant and in particular it is satisfied if the components are identically distributed and almost surely non-constant. Then, if f (N) = ω(ln N) increasing, we will show that this implies that in the limit N → ∞ where we recall that for some sequence (z N ) N∈N we have that z N = O(N −∞ ) if for every m ≥ 0 lim N→∞ N m z N = 0.
. This implies the claim since P [X 1 = X 2 = · · · X K ] is a polynomial in terms of Q 1 , . . . Q K with the only asymptotically non-vanishing term being Q K 1 = 1. Besides the immediate application to random vectors that we required in the last section, we remark that the expression for the probability of random variables being mutually different can also be used to derive a closed-form expression for the Stirling numbers of the first kind. Briefly, let X be a uniform random variable on {1, . . . L} for some L ∈ N, so that Q m = L 1−m in the above statement. The probability that K ≤ L independent copies of this random variable are all mutually different is then given by the above expression. On the other hand, this probability can also be computed purely combinatorially to be (L) K L K , where ( · ) K denotes the falling factorial [30]. By definition, (L) K = ∑ K n=0 s(K, n) L n with s(K, n) the (signed) Stirling numbers of the first kind. Therefore, comparing the two expressions we find the Stirling numbers to be given as (cf. [31])

Discussion
In this work, we have analysed a concrete, but general, model of sensory processing and demonstrated the extent of the sampling bias when directly estimating information theoretic quantities from experimental histograms. We have shown that as we consider larger and larger neural populations, with high probability any estimate of entropy or mutual information will only depend on the stimulus entropy.
We found that the issue lies in the fact that the samples of neural activity in response to the presentation of different stimulus values turn out to be mutually distinct. One way this can happen is through the existence of a subpopulation of neurons that only contributes independent noise to the population's activity. Importantly, the composition of the subpopulation can be stimulus-dependent, so that it is impossible to exclude this subpopulation from the beginning in the analysis. A plausible origin for these noisy subpopulations are sufficiently localised, compactly supported receptive fields. In a simplified scenario, imagine a neural population, whose neurons are receptive to any one of three stimuli. For each of the stimuli, the neurons receptive to one of the other two stimuli constitute such a noisy subpopulation. Now, as we have mentioned also earlier, none of the three subpopulations can be excluded on the grounds that it contributes only noise across all stimuli, yet their activity lets samples recorded from the population appear more distinctive than they are. While for small neural populations the effect of noisy subpopulations can be counteracted by increasing the number of samples that one considers, this becomes infeasible for even moderately sized populations due to the combinatorially large repertoire of population activity. As we have shown, as we consider larger and larger populations and therefore also larger noisy subpopulations the possibility to sample even one activity pattern twice is exponentially suppressed. In reality, receptive fields are localised, yet at least in computational models their range often extends indefinitely. This is the case, for example, for Gaussian receptive fields and the activity of a neuron then depends strongly on some stimulus values while for others the dependence in vanishingly small. While the assumptions of our model are clearly not met, we argue that for all practical purposes the consequence is the same, although weakened. This is because one will only ever consider a finite number of samples. The smaller the dependence between the stimulus and the activity, the weaker it manifests itself in particular when only considering those finitely many samples, so that corresponding neurons appear essentially independent.
In the model we have proposed, we assumed that for any two stimulus values there exist sufficiently large subpopulations such that their components contribute independent and identically distributed noise when exposed to either of the two stimulus values. Now, this is certainly a simplification and a more realistic assumption would be that in the absence of sensory stimulation the activity in this subpopulations is governed by low-dimensional dynamics in addition to some individual noise [32,33]. In principle though, our conclusions should hold true irrespectively following the same general argument that we formalised in this work: In a large neural population, stochastic variability, even if it is only in a relatively small subpopulation, is sufficient to produce unique samples of neural activity in an experiment. In turn, this then manifests itself in a maximal bias when estimating mutual information and other information theoretic quantities.
Without any further assumptions on the statistical relation between the different neurons within the population, this work shows that, despite being a powerful framework, in general, information theoretic analyses become essentially intractable when considering larger neural populations, because of the difficulties of accurately estimating the joint probability distribution between activity and stimulus states from experimental histograms. Therefore, this is where modelling approaches come into play. These approaches frequently employ maximum entropy Ising models that were fit initially only to low-order [34] and later also higher-order statistics of the data [35][36][37]. Alternative approaches include the cascaded logistic model which utilises Dirichlet processes at its core [38] or more recently the population tracking model [24]. However, for an information theoretic analysis it is in addition important to be able to compute the involved quantities in an efficient way. While most models include the possibility to sample states, this is not an option for large populations again because of the issues presented in this work. In the context of maximum entropy Ising models, thermodynamic considerations turned out to be useful to efficiently compute quantities such as the entropy [37,39,40]. In the population tracking model, on the other hand, one derives a reduced, low-dimensional model, from which those quantities can be computed likewise [24].
Altogether, while for many statistics of neural activity it might be sufficient to consider experimental histograms, this is not the case for information theoretic quantities when considering large neural populations. Given the many insights that information theoretic analyses can bring [41][42][43], this motivates new approaches for studying experimental recordings from ever larger neural populations. Furthermore, it also inspires consideration of how the brain handles the informational constraints we have identified.