Estimation Bias in Maximum Entropy Models

Maximum entropy models have become popular statistical models in neuroscience and other areas in biology and can be useful tools for obtaining estimates of mutual information in biological systems. However, maximum entropy models fit to small data sets can be subject to sampling bias; i.e., the true entropy of the data can be severely underestimated. Here, we study the sampling properties of estimates of the entropy obtained from maximum entropy models. We focus on pairwise binary models, which are used extensively to model neural population activity. We show that if the data is well described by a pairwise model, the bias is equal to the number of parameters divided by twice the number of observations. If, however, the higher order correlations in the data deviate from those predicted by the model, the bias can be larger. Using a phenomenological model of neural population recordings, we find that this additional bias is highest for small firing probabilities, strong correlations and large population sizes—for the parameters we tested, a factor of about four higher. We derive guidelines for how long a neurophysiological experiment needs to be in order to ensure that the bias is less than a specified criterion. Finally, we show how a modified plug-in estimate of the entropy can be used for bias correction.


Introduction
Understanding how neural populations encode information about external stimuli is one of the central problems in computational neuroscience [1,2].Information theory [3,4] has played a major role in our effort to address this problem, but its usefulness is limited by the fact that the information-theoretic quantity of interest, mutual information (usually between stimuli and neuronal responses), is hard to compute from data [5].That is because the key ingredient of mutual information is the entropy, and estimation of entropy from finite data suffers from a severe downward bias when the data is undersampled [6,7].While a number of improved estimators have been developed (see [5,8] for an overview), the amount of data one needs is, ultimately, exponential in the number of neurons, so even modest populations (tens of neurons) are out of reach-this is the so-called curse of dimensionality.Consequently, although information theory has led to a relatively deep understanding of neural coding in single neurons [2], it has told us far less about populations [9,10].In essence, the brute force approach to measuring mutual information that has worked so well on single spike trains simply does not work on populations.
One way around this problem is to use parametric models in which the number of parameters grows (relatively) slowly with the number of neurons [11,12]; and, indeed, parametric models have been used to bound entropy [13].For such models, estimating entropy requires far less data than for brute force methods.However, the amount of data required is still nontrivial, leading to bias in naive estimators of entropy.Even small biases can result in large inaccuracies when estimating entropy differences, as is necessary for computing mutual information or comparing maximum entropy models of different orders (e.g., independent versus pairwise).Additionally, when one is interested only in the total entropy (a quantity that is useful, because it provides an upper bound on the coding capacity [1]), bias can be an issue.That's because, as we will see below, the bias typically scales at least quadratically with the number of neurons.Since entropy generally scales linearly, the quadratic contribution associated with the bias eventually swamps the linear contribution, yielding results that tell one far more about the amount of data than the entropy.
Here, we estimate the bias for a popular class of parametric models, maximum entropy models.We show that if the true distribution of the data lies in the model class (that is, it comes from the maximum entropy model that is used to fit the data), then the bias can be found analytically, at least in the limit of a large number of samples.In this case, the bias is equal to the number of parameters divided by twice the number of observations.When the true distribution is outside the model class, the bias can be larger.
To illustrate these points, we consider the Ising model [14], which is the second-order (or pairwise) maximum entropy distribution on binary data.This model has been used extensively in a wide range of applications, including recordings in the retina [15][16][17], cortical slices [18] and anesthetized animals [19,20].In addition, several recent studies [13,21,22] have used numerical simulations of large Ising models to understand the scaling of the entropy of the model with population size.Ising models have also been used in other fields of biology; for example, to model gene regulation networks [23,24] and protein folding [25].
We show that if the data is within the model class (i.e., the data is well described by an Ising model), the bias grows quadratically with the number of neurons.To study bias out of model class, we use a phenomenological model of neural population activity, the Dichotomized Gaussian [26][27][28].This model has higher-order correlations which deviate from those of an Ising model, and the structure of those deviations has been shown to be in good agreement with those found in cortical recordings [27,29].These higher order correlations do affect bias-they can increase it by as much as a factor of four.We provide worst-case estimates of the bias, as well as an effective numerical technique for reducing it.
Non-parametric estimation of entropy is a well studied subject, and a number of very sophisticated estimators have been proposed [5,30,31].In addition, several studies have looked at bias in the estimation of entropy for parametric models.However, those studies focused on Gaussian distributions and considered only the within model class case (that is, they assumed that the data really did come from a Gaussian distribution) [32][33][34][35].To our knowledge, the entropy bias of maximum entropy models when the data is out of model class has not be studied.

Bias in Maximum Entropy Models
Our starting point is a fairly standard one in statistical modeling: having drawn K samples of some variable, here denoted, x, from an unknown distribution, we would like to construct an estimate of the distribution that is somehow consistent with those samples.To do that, we use the so-called maximum entropy approach: we compute, based on the samples, empirical averages over a set of functions and construct a distribution that exactly matches those averages, but otherwise has maximum entropy.
Let us use g i (x), i = 1, . . ., m to denote the set of functions and μi to denote their empirical averages.Assuming we draw K samples, these averages are given by: where x (k) is the k th sample.Given the μi , we would like to construct a distribution that is constrained to have the same averages as in Equation ( 1) and also has maximum entropy.Using q(x| μ) to denote this distribution (with μ ≡ (μ 1 , μ2 , ..., μm )), the former condition implies that: The entropy of this distribution, denoted S q ( μ), is given by the usual expression: where log denotes the natural logarithm.Maximizing the entropy with respect to q(x| μ) subject to the constraints given in Equation ( 2) yields (see, e.g., [4]): The λ i (the Lagrange multipliers of the optimization problem) are chosen, such that the constraints in Equation ( 2) are satisfied, and Z( μ), the partition function, ensures that the probabilities normalize to one: Given the λ i ( μ), the expression for the entropy of q(x| μ) is found by inserting Equation (4) into Equation ( 3).The resulting expression: depends only on μ, either directly or via the functions, λ i ( μ).
Because of sampling error, the μi are not equal to their true values, and S q ( μ) is not equal to the true maximum entropy.Consequently, different sets of x (k) lead to different entropies and, because the entropy is concave, to bias (see Figure 1).Our focus here is on the bias.To determine it, we need to compute the true parameters.Those parameters, which we denote µ i , are given by the K → ∞ limit of Equation ( 1); alternatively, we can think of them as coming from the true distribution, denoted p(x): Associated with the true parameters is the true maximum entropy, S q (µ).The bias is the difference between the average value of S q ( μ) and S q (µ); that is, the bias is equal to S q ( μ) − S q (µ), where the angle brackets indicate an ensemble average-an average over an infinite number of data sets (with, of course, each data set containing K samples).Assuming that μ is close to µ, we can Taylor expand the bias around the true parameters, leading to: where: Because δµ i is zero on average [see Equations (7) and (9)], the first term on the right-hand side of Equation ( 8) is zero.The second term is, therefore, the lowest order contribution to the bias, and it is what we work with here.For convenience, we multiply the second term by −2K, which gives us the normalized bias: In the Methods, we explicitly compute b, and we find that: where: and: Here, C q −1 ij denotes the ij th entry of C q −1 .Because C p and C q are both covariance matrices, it follows that b is positive.In Figure 2, we plot −2K times the true bias (−2K times the left-hand side of Equation ( 8), which we compute from finite samples), and b [via Equation (11)] versus the number of samples, K.When K is about 30, the two are close, and when K > 500, they are virtually indistinguishable.Thus, although Equation (11) represents an approximation to the bias, it is a very good approximation for realistic data sets.(11).Colored curves: −2K times the bias, computed numerically using the expression on the left-hand side of Equation (8).We used a homogeneous Dichotomized Gaussian distribution with n = 10 and a mean of 0.1.Different curves correspond to different correlation coefficients [see Equation (19) below], as indicated in the legend.Evaluating the bias is, typically, hard.However, when the true distribution lies in the model class, so that p(x) = q(x|µ), we can write down an explicit expression for it.That is because, in this case, C q = C p , so the normalized bias [Equation (11)] is just the trace of the identity matrix, and we have b = m (recall that m is the number of constraints); alternatively, the actual bias is −m/2K.An important within model-class case arises when the parametrized model is a histogram of the data.If x can take on M values, then there are M −1 parameters (the "−1" comes from the fact that p(x) must sum to one) and the bias is −(M − 1)/2K.We thus recover a general version of the Miller-Madow [6] or Panzeri-Treves bias correction [7], which were derived for a multinomial distribution.

Is Bias Correction Important?
The fact that the bias falls off as 1/K means that we can correct for it simply by drawing a large number of samples.However, how large is "large"?For definitiveness, suppose we want to draw enough samples that the absolute value of the bias is less than times the true entropy, denoted S p .Quantitatively, this means we want to choose K, so that |Bias| < S p .Using Equation (10) to relate the true bias to b, assuming that K is large enough that −b/2K provides a good approximation to the true bias, and making use of the fact that b is positive, the condition |Bias| < S p implies that K must be greater than K min , where K min is given by: Let us take x to be a vector with n components: x ≡ (x 1 , x 2 , ..., x n ).The average entropy of the components, denoted S 1 , is given by S 1 = (1/n) i S p (x i ), where S p (x i ) is the true entropy of x i .Since nS 1 , the "independent" entropy of x, is greater than or equal to the true entropy, S p , it follows that K min obeys the inequality: Not surprisingly, the minimum number of samples scales with the number of constraints, m (assuming b/m does not have a strong m-dependence; something we show below).Often, m is at least quadratic in n; in that case, the minimum number of samples increases with the dimensionality of x.
To obtain an explicit expression for m in terms of n, we consider a common class of maximum entropy models: second order models on binary variables.For these models, the functions g i (x) constrain the mean and covariance of the x i , so there are m = n(n + 1)/2 parameters: n parameters for the mean and n(n − 1)/2 for the covariance (because the x i are binary, the variances are functions of the means, which is why there are n(n − 1)/2 parameters for the covariances rather than n(n + 1)/2).Consequently, m/n = (n + 1)/2 and, dropping the "+1" (which makes the inequality stronger), we have: How big is K min in practice?To answer that, we need estimates of S 1 and b/m.Let us focus first on S 1 .For definiteness, here (and throughout the paper), we consider maximum entropy models that describe neural spike trains [15,16].In that case, x i is one if there are one or more spikes in a time bin of size δt and zero, otherwise.Assuming a population average firing rate of ν, and using the fact that entropy is concave, we have S 1 ≤ h(νδt), where h(p) is the entropy of a Bernoulli variable with probability p: h(p) = −p log p − (1 − p) log(1 − p).Using also the fact that h(p) ≤ p log(e/p), we see that S 1 ≤ νδt log(e/(νδt)), and so: Exactly how to interpret K min depends on whether we are interested in the total entropy or the conditional entropy.For the total entropy, every data point is a sample, so the number of samples in an experiment that runs for time T is T /δt.The minimum time to run an experiment, denoted T min , is, then, given by: Ignoring for the moment the factor b/m and the logarithmic term, the minimum experimental time scales as n/4 ν.If one is willing to tolerate a bias of 10% of the true maximum entropy ( = 0.1) and the mean firing rate is not so low (say 10 Hz), then T min ∼ n/4 s.Unless n is in the hundreds of thousands, running experiments long enough to ensure an acceptably small bias is relatively easy.However, if the tolerance and firing rates drop, say to = 0.01 and ν = 1 Hz, respectively, then T min ∼ 25n s, and experimental times are reasonable until n gets into the thousands.Such population sizes are not feasible with current technology, but they are likely to be in the not so distant future.
The situation is less favorable if one is interested in the mutual information.That is because to compute the mutual information, it is necessary to repeat the stimulus multiple times.Consequently, K min [Equation (17)] is the number of repeats, with the repeats typically lasting 1-10 s.Again, ignoring the factor b/m and the logarithmic term, assuming, as above, that = 0.1 and the mean firing rate is 10 Hz and taking the bin size to be (a rather typical) 10 ms, then K min ∼ 25n.For n = 10, K min = 250, a number that is within experimental reach.When n = 100; however, K min = 2500.For a one second stimulus, this is about 40 min, still easily within experimental reach.However, for a ten second stimulus, recording times approach seven hours, and experiments become much more demanding.Moreover, if the firing rate is 1 Hz and a tighter tolerance, say = 0.01, is required, then K min = 2500n.Here, even if the stimulus lasts only one second, one must record for about 40 min per neuron-or almost seven hours for a population of 10 neurons.This would place severe constraints on experiments.
So far, we have ignored the factor (b/m)/ log(e/(νδt)) that appears in Equations ( 17) and (18).Is this reasonable in practice, when the data is not necessarily well described by a maximum entropy model?We address this question in two ways: we compute it for a particular distribution, and we compute its maximum and minimum.The distribution we use is the Dichotomized Gaussian model [26,36], chosen because it is a good model of the higher-order correlations found in cortical recordings [27,29].
To access the large n regime, we consider a homogeneous model-one in which all neurons have the same firing rate, denoted ν, and all pairs have the same correlation coefficient, denoted ρ.In general, the correlation coefficient between neuron i and j is given by: In the homogeneous model, all the ρ ij are the same and equal to ρ. Assuming, as above, a bin size of δt, the two relevant parameters are the probability of firing in a bin, νδt, and the correlation coefficient, ρ.In Figure 3a,b, we plot b/m (left axis) and (b/m)/ log(e/(νδt)) (right axis) versus n for a range of values for νδt and ρ.There are two salient features to these plots.First, b/m increases as νδt decreases and as ρ increases, suggesting that bias correction is more difficult at low firing rates and high correlations.Second, the factor (b/m)/ log(e/(νδt)) that affects the minimum experimental time, T min , has, for large n, a small range: a low of about 0.3 and a high of about one.Consequently, this factor has only a modest effect on the minimum number of trials one needs to avoid bias.
Figure 3 gives us b/m for the homogeneous Dichotomized Gaussian model.Does the general picture-that b/m is largest at low firing rates and high correlations, but never more than about four-hold for an inhomogeneous population, one in which different neurons have different firing rates and different pairs of neurons have different correlation coefficients?To address this question, in Figure 4, we compare a heterogeneous Dichotomized Gaussian model with n = 5 neurons to a homogeneous model: in Figure 4a, we plot b/m for a range of median firing rates and correlations coefficients in an inhomogeneous model, and in Figure 4b, we do the same for a homogeneous model.At very low firing rates, the homogeneous model is slightly more biased than the heterogeneous one, while at very low correlations, it is the other way around.Overall, though, the two models have very similar biases.Although not proof that the lack of difference will remain at large n, the results are at least not discouraging.Above, we saw that the factor (b/m)/ log(e/(νδt)) was about one for the Dichotomized Gaussian model.That is encouraging, but not definitive.In particular, we are left with the question: Is it possible for the bias to be much smaller or much larger than what we saw in Figures 3 and 4? To answer that, we write the true distribution, p(x), in the form: and ask how the bias depends on δp(x); that is, how the bias changes as p(x) moves out of model class.
To ensure that δp(x) represents only a move out of model class, and not a shift in the constraints (the µ i ), we choose it, so that p(x) satisfies the same constraints as q(x|µ): We cannot say anything definitive about the normalized bias in general, but what we can do is compute its maximum and minimum as a function of the distance between p(x) and q(x|µ).For "distance", we use the Kullback-Leibler divergence, denoted ∆S, which is given by: where S p is the entropy of p(x).The second equality follows from the definition of q(x|µ), Equation ( 4) and the fact that g i (x) p(x) = g i (x) q(x|µ) , which comes from Equation (21).
Rather than maximizing the normalized bias at fixed ∆S, we take a complementary approach and minimize ∆S at fixed bias.Since S q (µ) is independent of p(x), minimizing ∆S is equivalent to maximizing S p (see Equation (22)).Thus, again, we have a maximum entropy problem.Now, though, we have an additional constraint on the normalized bias.To determine exactly what that constraint is, we use Equations ( 11) and ( 12) to write: where: and, importantly, C q ij depends on q(x, µ), but not on δp(x) [see Equation (12a)].Fixed normalized bias, b, thus corresponds to fixed B(x) p(x) .This additional constraint introduces an additional Lagrange multiplier besides the λ i , which we denote β.Taking into account the additional constraint, and using the same analysis that led to Equation (4), we find that the distribution with the smallest difference in entropy, ∆S, at fixed b, which we denote q(x|µ, β), is given by: where Z(µ, β) is the partition function, the λ i (µ, β) are chosen to satisfy Equation ( 7), but with p(x) replaced by q(x|µ, β), and β is chosen to satisfy Equation ( 23), but with, again, p(x) replaced by q(x|µ, β).Note that we have slightly abused notation; whereas, in the previous sections, λ i and Z depended only on µ, now they depend on both µ and β.However, the previous variables are closely related to the new ones: when β = 0, the constraint associated with b disappears, and we recover q(x|µ); that is, q(x|µ, 0) = q(x|µ).Consequently, λ i (µ, 0) = λ i (µ), and Z(µ, 0) = Z(µ).The procedure for determining the relationship between ∆S and the normalized bias, b, involves two steps: first, for a particular bias, b, choose the λ i and β in Equation (25) to satisfy the constraints given in Equation ( 2) and the condition B(x) q(x|µ,β) = b; second, compute ∆S from Equation (22).Repeating those steps for a large number of biases will produce curves like the ones shown in Figure 5a,b.
Since the true entropy, S p , is maximized (subject to constraints) when β = 0, it follows that ∆S is zero when β = 0 and nonzero, otherwise.In fact, in the Methods, we show that ∆S has a single global minimum at β = 0; we also show that the normalized bias, b, is a monotonic increasing function of β.Consequently, there are two normalized biases that have the same ∆S, one larger than m and the other smaller.This is shown in Figure 5a,b, where we plot b/m versus ∆S/S p for the homogeneous Dichotomized Gaussian model.It turns out that this model has near maximum normalized bias, as shown in Figure 5c.Consistent with that, the Dichotomized Gaussian model has about the same distribution of spike counts as the maximally biased models, but a very different distribution from the maximum entropy model (Figure 5d).
The fact that the Dichotomized Gaussian model has near maximum normalized bias is important, because it tells us that the bias we found in Figures 3 and 4 is about as large as one could expect.In those figures, we found that b/m had a relatively small range-from about one to four.Although too large to be used for bias correction, this range is small enough that one could use it to get a conservative estimate of the minimum number of trials [Equation (17)] or minimum run time [Equation ( 18)] it would take to reduce bias to an acceptable level.

Using a Plug-in Estimator to Reduce Bias
Given that we have an expression for the asymptotic normalized bias, b [Equation (11)], it is, in principle, possible to correct for it (assuming that K is large enough for the asymptotic bias to be valid).If the effect of model-misspecification is negligible (which is typically the case if the neurons are sufficiently weakly correlated), then the normalized bias is just m, the number of constraints, and bias correction is easy: simply subtract m/2K from our estimate of the entropy.If, however, there is substantial model-misspecification, we need to estimate covariance matrices under the true and maximum entropy models.Of course, these estimates are subject to their own bias, but we can ignore that and use a "plug-in" estimator; an estimator computed from the covariance matrices in Equation ( 11), C p and C q , which, in turn, are computed from data.Specifically, we estimate C p and C q using: Such an estimator is plotted in Figure 6a for a homogeneous Dichotomized Gaussian with n = 10 and νδt = 0.1.Although the plug-in estimator converges to the correct value for sample sizes above about 500, it underestimates b by a large amount, even when K ∼ 100.To reduce this effect, we considered a thresholded estimator, denoted b thresh , which is given by: where b plugin comes from Equation (11).This estimator is motivated by the fact that we found, empirically, that the additional bias due to model misspecification was almost always greater than m.Correlations are color coded as in Figure 5. Gray lines indicate the true normalized bias as a function of sample size, computed numerically as for Figure 2. (b) Relative error without bias correction, (S q ( μ) − S q (µ))/S q (µ), with the plug-in correction, (S q ( μ) + 2Kb plugin − S q (µ))/S q (µ), and with the thresholded estimator, (S q ( μ) + 2Kb thresh − S q (µ))/S q (µ).As shown in Figure 6a, b thresh is closer to the true normalized bias than b plugin .In Figure 6b, we plot the relative error of the uncorrected estimate of the maximum entropy, S q ( μ) − S q (µ) /S q (µ), and the same quantity, but with two corrections: the plug-in correction, S q ( μ) + 2Kb plugin − S q (µ) /S q (µ), and the thresholded correction, S q ( μ) + 2Kb thresh − S q (µ) /S q (µ).Using the plug-in estimator, accurate estimates of the maximum entropy can be achieved with about 100 samples; using the threshold estimators, as few as 30 samples are needed.This suggests that our formalism can be used to perform bias correction for maximum entropy models, even in the presence of model-misspecification.

Discussion
In recent years, there has been a resurgence of interest in maximum entropy models, both in neuroscience [13,15,16,[18][19][20][21][22] and in related fields [23][24][25].In neuroscience, these models have been used to estimate entropy.Although maximum entropy based estimators do much better than brute-force estimators, especially when multiple neurons are involved, they are still subject to bias.In this paper, we studied, both analytically and through simulations, just how big the bias is.We focused on the commonly used "naive" estimator, i.e., an estimator of the entropy, which is calculated directly from the empirical estimates of the probabilities of the model.
Our main result is that we found a simple expression for the bias in terms of covariance matrices under the true and maximum entropy distributions.Based on this, we showed that if the true model is in the model class, the (downward) bias in the estimate of the maximum entropy is proportional to the ratio of the number of parameters to the number of observations, a relationship that is identical to that of naive histogram estimators [6,7].This bias grows quadratically with population size for second-order binary maximum entropy models (also known as Ising models).
What happens when the model is misspecified; that is, when the true data does not come from a maximum entropy distribution?We investigated this question for, again, second-order binary maximum entropy models.We found that model misspecification generally increases bias, but the increase is modest-for a population of 100 neurons and strong correlations (ρ = 0.1), model misspecification increased bias by a factor of at most four (see Figure 3b).Experimentally, correlation coefficients are usually substantially below 0.1 [15,16,18,20], so a conservative estimate of the minimum number of samples one needs in an experiment can be obtained from Equation ( 17) with b/m = 4.However, this estimate assumes that our results for homogeneous populations apply to heterogeneous populations, something we showed only for relatively small populations (five neurons).
Has bias been a problem in experiments so far?The answer is largely no.Many of the experimental studies focused on the total entropy, using either no more than 10 neurons [15,16,18,20] or using populations as large as 100, but with a heavily regularized model [17].In both cases, recordings were sufficiently long that bias was not an issue.There have been several studies that computed mutual information, which generally takes more data than entropy [19,[37][38][39].Two were very well sampled: Ohiorhenuan et al. [19] used about 30,000 trials per stimulus, and Granot-Atedgi et al. [39] used effectively 720,000 trials per stimulus (primarily because they used the whole data set to compute the pairwise interactions).Two other studies, which used a data set with an average of about 700 samples per trial [40], were slightly less well sampled.The first used a homogeneous model (the firing rates and all higher order correlations were assumed to be neuron-independent) [37].This reduced the number of constraints, m, to at most five.Since there were about 700 trials per stimulus, the approximate downward bias, m/2K with K = 700, was 0.004 bits.This was a factor of 125 smaller than the information in the population, which was on the order of 0.5 bits, so the maximum error due to bias would have been less than 1%.The other study, based on the same data, considered a third order maximum entropy model and up to eight units [38].For such a model, the number of constraints, m, is n(n 2 + 5)/6 (=n+n(n−1)/2+n(n−1)(n−2)/6).Consequently, the approximate downward bias was n(n 2 +5)/12K.With n = 8 and K = 700, the bias is 0.07 bits.This is more than 10% of information, which was about 0.6 bits.However, the authors applied a sophisticated bias correction to the mutual information [38] and checked to see that splitting the data in half did not change the information, so their estimates are likely to be correct.Nevertheless, this illustrates that bias in maximum entropy models can be important, even with current data sets.Furthermore, given the exponential rate at which recorded population sizes are growing [41], bias is likely to become an increasingly major concern.
Because we could estimate bias, via Equations ( 10) and (11), we could use that estimate to correct for it.Of course, any correction comes with further estimation problems: in the case of model misspecification, the bias has to be estimated from data, so it too may be biased; even if not, it could introduce additional variance.This is, potentially, especially problematic for our bias correction, as it depends on two covariance matrices, one of which has to be inverted.Nevertheless, in the interests of simplicity, we used a plug-in estimator of the bias (with a minor correction associated with matrix singularities).In spite of the potential drawbacks of our simple estimator (including the fact that it assumes Equation ( 11) is valid even for very small sample sizes, K), it worked well: for the models we studied, it reduced the required number of samples by a factor of about three-from 100 to about 30.While this is a modest reduction, it could be important in electrophysiological experiments, where it is often difficult to achieve long recording times.Furthermore, it is likely that it could be improved: more sophisticated estimation techniques, such as modified entropy estimates [5], techniques for bias-reduction for mutual information [8,42] or Bayesian priors [30,31,43], could provide additional or more effective bias reduction.

Numerical Methods
For all of our analysis, we use the Dichotomized Gaussian distribution [26,27,36], a distribution over binary variables, x = (x 1 , x 2 , ..., x n ), where x i can take on the values, 0 or 1.The distribution can be defined by a sampling procedure: first, sample a vector, z, from a Gaussian distribution with mean γ and covariance Λ: then, set x i to +1 if z i > 0 and 0, otherwise.Alternatively, we may write: where p(z) is given by Equation (28).Given this distribution, our first step is to sample from it and, then, using those samples, fit a pairwise maximum entropy model.More concretely, for any mean and covariance, γ and Λ, we generate samples (x (k) , k = 1, ..., K); from those, we compute the μi [Equation (1)]; and from the μi , we compute the parameters of the maximum entropy model, λ i [Equation (4)].Once we have those parameters, we can compute the normalized bias via Equation (11).
While this is straightforward in principle, it quickly becomes unwieldy as the number of neurons increases (scaling is exponential in n).Therefore, to access the large n regime, we use the homogeneous Dichotomized Gaussian distribution, for which all the means, variances and covariances are the same: The symmetries of this model make it possible to compute all quantities of interest (entropy of the Dichotomized Gaussian distribution model, entropy of the maximum entropy model, normalized bias via Equation (11) and minimum and maximum normalized bias given distance from the model class) without the need for numerical approximations [27].

Parameters of the Heterogeneous Dichotomized Gaussian Distribution
For the heterogeneous Dichotomized Gaussian distribution, we need to specify γ and Λ.Both were sampled from random distributions.The mean, γ i , came from a zero mean, unit variance Gaussian distribution, but truncated at zero, so that only negative γ i s were generated.We used negative γ i s, because for neural data, the probability of not observing a spike (i.e.x i = 0) is generally larger than that of observing a spike (i.e.x i = 1).Generation of the covariance matrix, Λ, was more complicated and proceeded in several steps.The first was to construct a covariance matrix corresponding to a homogeneous population model; that is, a covariance matrix with 1 along the diagonal and ρ an all the off-diagonal elements.The next step was to take the Cholesky decomposition of the homogeneous matrix; this resulted in a matrix, G, with zeros in the upper triangular entries.We then added Gaussian noise to the lower-triangular entries of G (the non-zero entries).Finally, the last step was to set Λ to (G + N) • (G + N) , where N is the noise added to the lower-triangular entries and denotes the transpose.Because we wanted covariance matrices with a range of median correlation coefficients, the value of ρ was sampled uniformly from the range, [0.0, 0.8].Similarly, to get models that ranged from weakly to strongly heterogeneous, the standard deviation of the noise we added to the Cholesky-decomposition was sampled uniformly from {0.1, 0.25, 0.5, 1, 2}.

Fitting Maximum Entropy Models
To find the parameters of the maximum entropy models (the λ i ), we numerically maximized the log-likelihood using a quasi-Newton implementation [44].The log likelihood, denoted L, is given by: with the second equality following from Equation (4).Optimization was stopped when successive iterations increased the log-likelihood by less than 10 −20 .This generally led to a model for which all moments were within 10 −8 of the desired empirical moments.

Bias Correction
To compute the normalized bias, b [Equation ( 11)], we first estimate the covariance matrices under the true and maximum entropy distributions via Equation (26); these are denoted Ĉp and Ĉq , respectively.We then have to invert Ĉq .However, given finite data, there is no guarantee that Ĉq will be full rank.For example, if two neurons have low firing rates, synchronous spikes occur very infrequently, and the empirical estimate of x i x j can be zero for some i and j; say i 0 and j 0 .In this case, the corresponding λ will be infinity [see Equation ( 4)]; and so, Ĉq i j ,i 0 j 0 will be zero for all (i , j ) pairs, and Ĉq i ,i 0 j 0 will be zero for all i .In other words, Ĉq will have a row and column of zeros (and if the estimate of x i x j is zero for several (i, j) pairs, there will be several rows and columns of zeros).The corresponding rows and columns of Ĉp will also be zero, making b plugin (= tr[C q−1 • C p ]) well behaved in principle.In practice, however, this quantity is not well behaved, and b plugin will, in most cases, be close to m − m 0 , where m 0 is the number of all-zero rows and columns.In some cases, however, b plugin took on very large values (or was not defined).We attribute these cases to numerical problems arising from the inversion of the covariance matrix.We therefore rejected any data sets for which b plugin > 10m.It is likely that this problem could be fixed with a more sophisticated (e.g., Bayesian) estimator of empirical averages, or by simply dropping the rows and columns that are zero, computing b plugin , and then adding back in the number of dropped rows.

Sample Size
When estimating entropy for a particular model (a particular value of γ and Λ), for each K we averaged over 10, 000 data sets.Error bars denote the standard error of the mean across data sets.

The Normalized Bias Depends Only on the Covariance Matrices
The normalized bias, b, given in Equation (10), depends on two quantities: ∂ 2 S q (µ)/∂µ i ∂µ j and δµ i δµ j .Here, we derive explicit expressions for both.The second is the easier of the two: noting that δµ i is the mean of K uncorrelated, zero mean random variables [see Equation ( 9)], we see that: where the last equality follows from the definition given in Equation (12a).
For the first, we have: where we used Equation ( 6) for the entropy.From the definition of Z(µ), Equation ( 5), it is straightforward to show that Inserting Equation (33) into Equation (32), the first and third terms cancel, and we are left with: This quantity is hard to compute directly, so instead, we compute its inverse, ∂µ i /∂λ j .Using the definition of µ i : differentiating both sides with respect to λ j and applying Equation (33), we find that: ∂µ i ∂λ j = g j (x)g i (x) q(x|µ) − g j (x) q(x|µ) g i (x) q(x|µ) = C q ji (36) The right-hand side is the covariance matrix within the model class.

Figure 1 .
Figure 1.Sampling bias in maximum entropy models.The equilateral triangle represents a D-dimensional probability space (for the binary model considered here, D = 2 n − 1, where n is the dimensionality of x).The cyan lines are contour plots of entropy; the red lines represent the m linear constraints and, thus, lie in a D − m dimensional linear manifold.(a) Maximum entropy occurs at the tangential intersection of the constraints with the entropy contours.(b) The light red region indicates the range of constraints arising from multiple experiments in which a finite number of samples is drawn in each.Maximum entropy estimates from multiple experiments would lie along the green line.(c) As the entropy is concave, averaging the maximum entropy over experiments leads to an estimate that is lower than the true maximum entropy-estimating maximum entropy is subject to downward bias.

Figure 2 .
Figure 2. Normalized bias, b, versus number of samples, K. Grey lines: b, computed from Equation(11).Colored curves: −2K times the bias, computed numerically using the expression on the left-hand side of Equation(8).We used a homogeneous Dichotomized Gaussian distribution with n = 10 and a mean of 0.1.Different curves correspond to different correlation coefficients [see Equation(19) below], as indicated in the legend.

Figure 3 .
Figure 3. Scaling of the bias with population size for a homogeneous Dichotomized Gaussian model.(a) Bias, b, for νδt = 0.1 and a range of correlation coefficients, ρ.The bias is biggest for strong correlations and large population sizes; (b) νδt = 0.02 and a range of (smaller) correlation coefficients.In both panels, the left axis is b/m, and the right axis is (b/m)/ log(e/(νδt)).The latter quantity is important for determining the minimum number of trials [Equation (17)] or the minimum runtime [Equation(18)] needed to reduce bias to an acceptable level.

Figure 4 . 3 .
Figure 4. Effect of heterogeneity on the normalized bias in a small population.(a) Normalized bias relative to the within model class case, b/m, of a heterogeneous Dichotomized Gaussian model with n = 5 as a function of the median mean, νdt, and correlation coefficient, ρ.As with the homogeneous model, bias is largest for small means and strong correlations.(b) The same plot, but for a homogeneous Dichotomized Gaussian.The difference in bias between the heterogeneous and homogeneous models is largest for small means and small correlations, but overall, the two plots are very similar.

Figure 5 .
Figure 5. Relationship between ∆S and bias.(a) Maximum and minimum normalized bias relative to m versus ∆S/S p (recall that S p is the entropy of p(x)) in a homogeneous population with size n = 5, νδt = 0.1, and correlation coefficients indicated by color.The crosses correspond to a set of homogeneous Dichotomized Gaussian models with νδt = 0.1.(b)Same as a, but for n = 100.For ρ = 0.5, the bias of the Dichotomized Gaussian model is off the right-hand side of the plot, at (0.17, 3.4); for comparison, the maximum bias at ρ = 0.5 and ∆S/S p = 0.17 is 3.8.(c) Comparison between the normalized bias of the Dichotomized Gaussian model and the maximum normalized bias.As in panels a and b, we used νδt = 0.1.Because the ratio of the biases is trivially near one when b is near m, we plot (b DG − m)/(b max − m), where b DG and b max are the normalized bias of the Dichotomized Gaussian and the maximum bias, respectively; this is the ratio of the "additional" bias.(d) Distribution of total spike count (= i x i ) for the Dichotomized Gaussian, maximum entropy (MaxEnt) and maximally biased (MaxBias) models with n = 100, νδt = 0.1 and ρ = 0.05.The similarity between the distributions of the Dichotomized Gaussian and maximally biased models is consistent with the similarity in normalized biases shown in panel c.

4. 3 .
Dependence of ∆S and the Normalized Bias, b, on β