Parameters estimation for spatio-temporal maximum entropy distributions: application to neural spike trains

We propose a numerical method to learn Maximum Entropy (MaxEnt) distributions with spatio-temporal constraints from experimental spike trains. This is an extension of two papers [10] and [4] who proposed the estimation of parameters where only spatial constraints were taken into account. The extension we propose allows to properly handle memory effects in spike statistics, for large sized neural networks.


Introduction
With the evolution of Multi-Electrode Arrays (MEA) acquisition techniques, it is currently possible to simultaneously record the activity of a few hundred of neurons up to a few thousand [13]. Stevenson et al [42] reported that the number of recorded neurons doubles approximately every 8 years. However, beyond the mere recording of an increasing number of neurons, there is a need to extract relevant information from data in order to understand the underlying dynamics of the studied network, how it responds to stimuli and how spike train response encodes these stimuli. In the realm of spike trains analysis this means having efficient spike sorting techniques [29,19,28,36], but also efficient methods to analyze spike statistics. The second aspect requires using canonical statistical models whose parameters have to be tuned ("learned") from data.
The Maximum Entropy method (MaxEnt) offers a way to selecting canonical statistical models from first principles. Having its root in statistical physics, MaxEnt consists of fixing a set of constraints, determined as the empirical average of features measured from the spiking activity. Maximizing the statistical entropy given those constraints provides a unique probability, called a Gibbs distribution, which approaches at best data statistics in the following sense: among all probability distributions which match the constraints this is the one which has the smallest Kullback-Leibler divergence with the data ( [10]). Equivalently, it satisfies the constraints without adding additional assumption on statistics [22].
Most studies have focused on describing properly the statistics of spatially synchronized patterns of neuronal activity without considering time-dependent patterns and memory effects. In this setting pairwise models [41,35] or extensions with triplets and quadruplets interactions [15], [14], [46] were claimed to correctly fit ≈ 90 to 99% of the information. However, considering now the capacity of these models to correctly reproduce spatio-temporal spike patterns, the performances drop-off dramatically, especially in the cortex [44,30] or in the retina [48].
Taking into account spatio-temporal patterns requires to introduce memory in statistics, described as a Markov process. MaxEnt extends easily to this case (see section 2.2 and references therein for a short description) producing Gibbs distributions in the spatio-temporal domain. Moreover, rigorous mathematical methods are available to fit the parameters of the Gibbs distribution [48]. However, the main drawback of these methods is the huge computer memory they require, preventing their applications to large scale neural networks. Considering a model with memory depth D (namely, the probability of a spike pattern at time t depends on the spike activity in the interval [t − D, t − 1]), there are 2 N (D+1) possible patterns. The method developed in [48] requires to handle a matrix of size 2 N (D+1) × 2 N (D+1) . So, it becomes intractable for N (D + 1) > 20.
In this paper, we propose an alternative method to fit the parameters of a spatio-temporal Gibbs distribution with larger values of the product N (D + 1). We have been able to go up to N (D + 1) (∼ 120) on a small cluster (64 processors AMD Opteron(tm) 2300 MHz). The method is based on [11] and [4] who proposed the estimation of parameters in spatial Gibbs distributions. The extension in the spatio-temporal domain is not straightforward, as we show, but it carries over to the price of some modifications. Combined with parallel Montecarlo computing developed in [33] this provides a numerical method allowing to handle Markovian spike statistics with spatio-temporal constraints.
The paper is organized as follow. In section 2, we recall the theoretical background for spike train with Gibbs distribution. We discuss both spatial and spatio-temporal case. In the next section, 3, we explain the method to fit the parameters of MaxEnt distributions. As we mathematically show, the convex criterion used by [11] still applies for spatio-temporal constraints. However, the method used by [4] to avoid recomputing the Gibbs distribution at each parameters change cannot be directly used and has to be adapted using a Linear Response scheme. In the last section, 4, we show benchmarks evaluating the performance of this method and discuss the computational obstacles that we encountered. We made tests with both synthetic and real data. Synthetic data were generated from known probability distributions using a Montecarlo method. Real data corresponds to spike trains obtained from retinal ganglion cells activity (courtesy of M.J. Berry and O. Marre). The method shows a satisfying performance in the case of synthetic data. Real data analysis is not systematic but instead used as an illustration and comparison with the paper of Schneidman et al. 2006 ([41]). As we could see in the example, the performance on real data, although satisfying, is affected by the large number of parameters in the distribution, consequence of the choice to work with canonical models (Ising, pairwise with memory). This effect is presumably not related to our method but to a standard problem in statistics.
Some of our notations might be not usual to some readers. Therefore, we added a list of symbols at the end of the paper. We consider the joint activity of N neurons, characterized by the emission of action potentials ("spikes"). We assume that there is a minimal time scale, δ, set to 1 without loss of generality such that a neuron can at most fire a spike within a time window of size δ. This provides a time discretization labeled with an integer time n. Each neuron activity is then characterized by a binary variable 1 ω k (n) = 1 if neuron k fires at time n and ω k (n) = 0 otherwise. The state of the entire network in time bin n is thus described by a vector , called a spiking pattern. A spike block is a consecutive sequence of spike patterns ω n2 n1 , representing the activity of the whole network between two instants n 1 and n 2 .
The time-range (or "range") of a block ω n2 n1 is n 2 − n 1 + 1, the number of time steps from n 1 to n 2 . Here is an example of a spike block with N = 4 neurons and range R = 3: A spike train or raster is a spike block ω T 0 from some initial time 0 to some final time T . To alleviate notations we simply write ω for a spike train. We note Ω the set of spike trains.

Observables
An observable is a function O which associates a real number O(ω) to a spike train. In the realm of statistical physics common examples of observables are the energy or the number of particles (where ω would correspond e.g. to a spin configuration). In the context of neural networks examples are the number of neuron firing at a given time n, N k=1 ω k (n), or the function ω k1 (n 1 )ω k2 (n 2 ) which is 1 if neuron k 1 fires at time n 1 and neuron k 2 fires at time n 2 and is 0 otherwise.
Typically, an observable does not depend on the full raster, but only on a sub-block of it. The time-range (or "range") of an observable is the minimal integer R > 0 such that, for any raster ω, From now on, we restrict to observables of range R, fixed and finite. We set D = R − 1.
An observable is time-translation invariant if, for any time n > 0 we have The two examples above are time-translation invariant. The observable λ(n 1 )ω k1 (n 1 )ω k2 (n 2 ), where λ is a real function of time, is not time-translation invariant. Basically, time-translation invariance means that O does not depend explicitly on time. We focus on such observables from now on.

Monomials
Prominent examples of time-translation invariant observables with range R are products of the form: where p u , u = 1 . . . r are pairs of spike-time events (k u , n u ), k u = 1 . . . N being the neuron index, and n u = 0 . . . D being the time index. Such an observable, called monomial, takes therefore values in { 0, 1 } and is 1 if and only if ω ku (n u ) = 1, u = 1 . . . r (neuron k 1 fires at time n 1 , . . . , neuron k r fires at time n r ). A monomial is therefore a binary observable that represents the logic-AND operator applied to a prescribed set of neuron spikes events. We allow the extension of the definition (1) to the case where the set of pairs p 1 , . . . , p r is empty and we set m ∅ = 1. For a number N of neurons and a time range R there are thus 2 N R such possible products. Any observable of range R can be represented as a linear combination of products (1). Monomials constitute therefore a canonical basis for observable representation. To alleviate notations, instead of labeling monomials by a list of pairs, as in (1), we shall label them by an integer index l.

Potential
Another prominent example of observable is the function called "energy" or potential in the realm of the MaxEnt. Any potential of range R can be written as a linear combination of the 2 N R possible monomials (1): where some coefficients λ l in the expansion may be zero. Therefore, by analogy with spin systems, monomials somewhat constitute spatio-temporal interactions between neurons: the monomial r u=1 ω ku (n u ) contributes to the total energy H λ (ω) of the raster ω if and only if neuron k 1 fires at time n 1 , . . . , neuron k r fires at time n r in the raster ω. The number of pairs in a monomial (1) defines the degree of an interaction: degree 1 corresponds to "self-interactions", degree 2 to pairwise, and so on. Typical examples of such potentials are the Ising model [41,35,40]: where considered events are individual spikes and pairs of simultaneous spikes. Another example is the Ganmor-Schneidman-Segev (GSS) model [14], [15] where additionally to 3, simultaneous triplets of spikes are considered (We restrict the form (4) to triplet although Ganmor et al were also considering quadruplets). In these two examples the potential is a function of the spike pattern at a given time. Here, we choose this time equal to 0, without loss of generality, since we are considering time-translation invariant potentials. More generally, the form (2) affords the consideration of spatio-temporal neurons interactions: this allows us to introduce delays, memory and causality in spike statistics estimation. A simple example is a pairwise model with delays such as: where 'PR' stands for 'Pairwise with range R', takes into account the events where neuron i fires s time steps after a neuron j with s = 0 . . . D.

The Maximum Entropy Principle
Assigning equal probabilities (uniform probability distribution) to possible outcomes goes back to Laplace and Bernoulli ( [16]) ("principle of insufficient reason"). Maximizing the statistical entropy without constraints is equivalent to this principle. In general, however, one has some knowledge about data, typically characterized by empirical average of prescribed observables (e.g. for spike trains, firing rates, probability that a fixed group of neurons fire at the same time, probability that K neurons fire at the same time [45]): this constitutes a set of constraints. The Maximum Entropy Principle (MaxEnt) is a method to obtain, from the observation of a statistical sample, a probability distribution that approaches at best the statistics of the sample, taking into account these constraints without additional assumptions [22]. Maximizing the statistical entropy given those constraints provides a distribution as far as possible from the uniform and as close as possible to the empirical distribution. For instance, considering the empirical mean and variance of the sample of a random variable as constraints results in a Gaussian distribution. Although some attempts have been made to extend MaxEnt to non stationary data [20,21,23,34] it is mostly applied in the context of stationary statistics: the average of an observable does not depend explicitly on time. We shall work with this hypothesis. In its simplest form, the MaxEnt also assumes that the sample has no memory: the probability of an outcome at time t does not depend on the past. We first discuss the MaxEnt in this context in the next section, before considering the case of processes with memory in the section 2.2.2.

Spatial constraints
In our case, the natural constraints are represented by the empirical probability of occurrence of characteristic spike events in the spike train, or, equivalently, by the average of specific monomials. Classical examples of constraints are the probability that a neuron fires at a given time (firing rate) or the probability that two neurons fire at the same time. For a raster ω of length T we note π n=0 ω i (n), the empirical probability that two neurons i, j fire at the same time is π n=0 ω i (n)ω j (n) and so on. Given a set of L monomials m l , their empirical average, π (T ) ω [ m l ], measured in the raster ω, constitute a set of constraints shaping the sought probability distribution. We consider here monomials corresponding to events occurring at the same time, i.e. m l (ω) ≡ m l ( ω(0) ) postponing to section 2.2.2 the general case of events occurring at distinct times.
In this context, the MaxEnt problems is stated as follows. Find a probability distribution µ that maximizes the entropy: (where the sum holds on the 2 N possible spike patterns ω(0)), given the constraints: The average of monomials, predicted by the statistical model µ (noted here µ [ m l ]), must be equal to the average π (T ) ω [ m l ] measured in the sample. There is, additionally, the probability normalization constraint: This provides a variational problem where M is the set of (stationary) probabilities on spike trains. One searches, among all stationary probabilities ν ∈ M, the one which maximizes the rhs of (9). There is a unique such probability, µ = µ λ , provided N is finite and λ l > −∞. This probability depends on the parameters λ.
Stated in this form the MaxEnt is a Lagrange multipliers problem. The sought probability distribution is the classical Gibbs distribution: where ]. Note that the time index (here 0) does not play a role since we have assumed µ λ to be stationary (time-translation invariant).
The value of λ l s is fixed by the relation: Additionally, note that the matrix ∂ 2 log Z λ ∂λ l ∂λ l is positive. This ensures the convexity of the problem and the uniqueness of the solution of the variational problem.
Note that we do not expect in general µ λ to be equal to the (hidden) probability shaping the observed sample. It is only the closest one satisfying the constraints (7) [10]. The notion of closeness is related to the Kullback-Leibler divergence, defined in the next section.
It is easy to check that the Gibbs distribution (10) obeys: for any spike block ω n2 n1 . Indeed, the potential of the spike block ω n2 . Equation (12) expresses that spiking pattern occurring at different times are independent under the Gibbs distribution (10). This is expected: since the constraints shaping µ λ take only into account spiking events occurring at the same time, we have no information on causality between spikes generation or on memory effects. The Gibbs distributions obtained when constructing constraints only with spatial events leads to statistical models where spike patterns are renewed at each time step, without reference to the past activity.

Spatio-temporal constraints
On the opposite, one expects that spike trains generation involves causal interactions between neurons and memory effects. We would therefore like to construct Gibbs distributions taking into account information on spatio-temporal interactions between neurons and leading to a statistical model not assuming anymore that successive spikes patterns are independent. Although the notion of Gibbs distribution extends to processes with infinite memory [12] we shall concentrate here to Gibbs distributions associated with Markov processes with finite memory depth D. That is, the probability to have a spike pattern ω(n) at time n, given the past history of spikes reads P ω(n) ω n−1 n−D . Note that those transition probabilities are assumed not to depend explicitly on time (stationarity assumption).
Such a family of transition probabilities P ω(n) ω n−1 n−D define an homogeneous Markov chain. Provided 2 P ω(n) ω n−1 n−D > 0 for all ω n n−D , there is a unique probability µ, called the invariant probability of the Markov chain such that: In a Markov process the probability of a block ω n2 n1 , for n 2 − n 1 + 1 > D, is: the Chapman-Kolmogorov relation [18]. To determine the probability of ω n2 n1 , one has to know the transition probabilities and the probability µ ω n1+D−1 n1 . When attempting to construct a Gibbs distribution obeying (14) from a set of spatio-temporal constraints one has therefore to determine simultaneously the family of transition probabilities and the invariant probability. Remark that setting: we may write (14) in the form: The probability of observing the spike pattern ω n2 n1 given the past ω n1+D−1 n1 of depth D has an exponential form, similar to (10). Actually, the invariant probability of a Markov chain is a Gibbs distribution in the following sense.
In view of (14), probabilities must be defined whatever even if n 2 − n 1 is arbitrary large. In this setting, the right objects are probabilities on infinite rasters [18]. Then, the entropy rate (or Kolmogorov-Sinai entropy) of µ is: where the sum holds over all possible blocks ω n 0 . This reduces to (6) when µ obeys (12).
The MaxEnt takes now the following form. We consider a set of L spatiotemporal spike events (monomials) whose empirical average value π (T ) ω [ m l ] has been computed. We only restrict to monomials with a range at most equal to R = D + 1, for some D > 0. This provide us a set of constraints of the form (7). To maximize the entropy rate (17) under the constraints (7) we construct a range-R potential H λ = L l=1 λ l m l . The generalized form of the MaxEnt states that there is a unique probability measure µ λ ∈ M such that [6]: This is the extension of the variational principle (9) to Markov chains. It selects, among all possible probability ν, a unique probability µ λ which realizes the supremum. µ λ is called the Gibbs distribution with potential H λ . The quantity P [ λ ] is called topological pressure or free energy density. For a potential of the form (2) [38,25]: This is the analog of (11) which allows to tune the parameters λ l . Thus, P [ λ ] plays the role of log Z λ in (10). Actually, it is equal to log Z λ when restricting to the memory less case 3 . P [ λ ] is strictly convex 4 which guarantees the uniqueness of µ λ . Note that µ λ has not the form (10) for D > 0. Indeed a probability distribution e.g. of the form µ λ (ω n−1 the potential of the block ω n−1 0 , and: the "n-time steps" partition function does not obey the Chapman-Kolmogorov relation (14). However, the following holds [39,3,17,6].
1. There exist A, B > 0 such that, for any block ω n−1 2. We have: In the spatial case, (22). Although (23) is defined by a limit, it is possible to compute P [ λ ] as the log of the largest eigenvalue of a transition matrix constructed from H λ (Perron-Frobenius matrix) [49]. Unfortunately, this method does not apply numerically as soon as N R > 20.
These relations are crucial for the developments made in the next section.
To recap, a Gibbs distribution in the sense of [18] is the invariant probability distribution of a Markov chain. The link between the potential H λ and the transition probabilities P ω(D) ω D−1 0 (respectively the potential [15]) is given by: To finish this section let us introduce the Kullback-Leibler divergence d KL (ν, µ) which provides a notion of similarity between two probabilities ν, µ. We have d KL (ν, µ) ≥ 0 with equality if and only if µ = ν. The Kullback-Leibler divergence between an invariant probability ν ∈ M and the Gibbs distribution µ λ with potential H λ is given by [6]. When ν = π (T ) ω , we obtain the divergence between the "model (µ λ )" and the "empirical probability (π 3 Inferring the coefficients of a potential from data Equations (11) or (19) provide an analytical way to compute the coefficients of the Gibbs distribution from data. However, they require the computation of the partition function or of the topological pressure which becomes rapidly intractable as the number of neurons increases. Thus, researchers have attempted to find alternative methods to compute reliably and efficiently the λ l s. An efficient method has been introduced in [11] and applied to spike trains in [4]. Although these papers are restricted to Gibbs distributions of the form (10) (models without memory) we show in this section how their method can be extended to general Gibbs distributions.

The spatial case
The method developed in [11] by Dudik et al is based on the so-called convex duality principle, used in mathematical optimization theory. Due the difficulty in maximizing the entropy (which is a concave function), one looks for a convex function easier to investigate. Dudik et al showed that, for spatially constrained Maxent distributions, finding the Gibbs distribution amounts to finding the minimum of the negative log likelihood 5 : Indeed, in the spatial case, the Kullback-Leibler divergence between the empirical measure π (T ) ω and the Gibbs distribution at µ λ is: so that, from (24): where we used S π Its unique minimum is given by (11).
Moreover, we have: (10): and since P [λ] = log Z[λ] in the spatial case: 5 We have adapted [11] to our notations. Moreover, in our case π (T ) ω corresponds to the empirical average on a raster ω whereas π in [11] corresponds to an average over independent samples. Therefore: The idea proposed by Dudik et al is then to bound this difference by an easier-to-compute convex quantity, with the same minimum as L π (T ) ω (λ), and to reach this minimum by iterations on λ. They proposed a sequential and a parallel method. Let us summarize first the sequential method. The goal here is not to rewrite their paper [11] but to explain some crucial elements that are not directly appliable to the spatio-temporal case.
In the sequential case one updates λ as λ = λ + δe l , for some l, where e l is the canonical basis vector in direction l, so that ∆H λ = δm l , and Using the following property: for x ∈ [0, 1] and since m l ∈ {0, 1}, we have: This bound, proposed by Dudik et al, is remarkably clever. Indeed, it replaces the computation of the average µ λ e δm l , which is computationally hard, by the computation of µ λ [ m l ], which is computationally easy. Finally, In the parallel case, the computation and results differ. One now updates λ as λ = λ + L l=1 δ l e l . Moreover, one has to renormalize the m l s in m l = m l L in order that eq. (34) below holds. We have therefore ∆H λ = L l=1 δ l m l . Thus, Using the following property [9]: for δ l ∈ R and m l ≥ 0, L l=1 m l ≤ 1, we have: To be complete, let us mention that Dudik et al consider the case where some error l is allowed in the estimation of the coefficient λ l . This relaxation on the parameters alleviates the overfitting.
In this case, the bound on the right hand side in (33) (sequential case) becomes: whereas the right hand side in (35) becomes L l=1 G l (λ, δ) with: The minimum of these functions is easy to find and one obtains, for a given λ the variation δ required to lower bound the log-likelihood variation. The authors have shown that both sequential and parallel method produce a sequence λ (k) which converges to the minimum of L π (T ) ω as k → +∞. Note however that one strong condition in their convergence theorem is l > 0. This requires a sharp estimate of the error l , which cannot be solely based on the central limit theorem or on Hoeffding inequality in our case, because when the empirical average π (T ) ω (m l ) is too small, the minima of F , computed in [4] may not be defined.

Extension to the spatio-temporal case
We now show how to extend these computations to the spatio-temporal case, provided one replaces the log-likelihood L π (T ) ω by the Kullback-Leibler divergence (24). The main obstacle is that the Gibbs distribution does not have the form e H Z . We obtain thus a convex criterion to minimize Kullback-Leibler divergence variation, hence reaching it minimum, π (T ) ω . Replacing ν in eq. (24) by π (T ) ω , the empirical measure, one has: because the entropy S π (T ) ω cancels. This is the analog of (27). The main problem now is to compute P From (22) we have: Since H λ (ω n−1 (23): Therefore: This is the extension of (29) to the spatio temporal case. In the spatial case it reduces to (29) from (12). This equation is obviously numerically intractable, but it has two advantages: on one hand it allows to extend the bounds (33) (sequential case) and (35) So, compared to the spatial we have to replace m l by m l n−D in ∆H λ (ω n−1 0 ). We have therefore: From the time translation invariance of µ λ we have: so that: At first glance this bound is not really useful. Indeed, from (40) we obtain: Since this holds for any δ this implies P [ λ ] = P [ λ ]. The reason for this is evident. Renormalizing m l as we did to match the condition imposed by bound (31) is equivalent to renormalizing δ by δ n−D . As n → +∞ this perturbation tends to 0 and λ = λ. Therefore, the clever bound (31) would here be of no interest if we were seeking exact results. However, the goal here is to propose a numerical scheme, where, obvioulsy n is finite. We replace therefore the limit n → +∞ by a fixed n in the computation of P [ λ ] − P [ λ ]. Keeping in mind that m l must also be renormalized in π (T ) ω [ ∆H λ ] and using 1 n < 1 n−D the Kullback-Leibler divergence (38) obeys: the analog of (33).
In the parallel case, similar remarks holds. In order to apply the bound (34) we have to renormalize the m l s in m l = 1 L(n−D) . As for the spatial case we also need to check that L l=1 e δ l − 1 µ λ [m l ] > −1. (This constraint is not guarantee and has to be checked during iterations). One obtains finally: the analog of (35).
Compared with the spatial case, we see therefore that n mustn't be too large to have a reasonable Kullback-Leibler divergence variation. It mustn't be too small, however, to get a good approximation of the empirical averages.

Updating the target distribution when the parameters change
When updating the parameters λ, one has to compute again the average values µ λ [ m l ] since the probability µ λ has changed. This has a huge computational cost. The exact computation (e.g. from (11,19)) is not tractable for large N so approximate methods have to be used, like Montecarlo [33]. Again, this is also CPU time consuming especially if one recomputes it again at each iteration, but at least it is tractable. In this spirit, Broderick et al [4] propose to generate a Montecarlo raster distributed according to µ λ and to use it to compute µ λ when λ − λ is sufficiently small. We explain their method, limited to the spatial case, in the next section, and we explain why it is not applicable in the spatio-temporal case. We then propose an alternative method.

The spatial case
The average of m l is obtained by the derivative of the topological pressure P [ λ ]. In the spatial case, where P(λ) = log Z λ , we have: Using (28), one finally obtains: which is eq. (18) in [4]. Using this formula one is able to compute the average of m l with respect to the new probability µ λ only using the old one, µ λ .

Extension to the spatio-temporal case
We now explain why the Broderick et al method does not extend to the spatiotemporal case. The main problem is that if one tries to obtain the analog of the equality (45) one obtains in fact an inequality: where A, B are the constants in (22). They are not known in general (they depend on the potential) and they are different. However, in the spatial case because the potential has range 1. Then, one recovers (45). Let us now explain how we obtain (46).

Taylor expansion of the pressure
The idea is here to use a Taylor expansion of the topological pressure. This approach is very much in the spirit of [24], but extended here to the spatiotemporal case. Since λ = λ + δ, we have: The second derivative of the pressure is given by [39,3,17,6]: where: is the correlation function between m l , m k at time n, computed with respect to µ λ . (51) is a version of the fluctuation-dissipation theorem in the spatiotemporal case. σ n is the time shift applied n times. The third derivatives can be computed as well by taking the derivative (51) and using (47). This generates terms with third order correlations and so on [31]. Up to second order we have: Since the observable are monomials they only take the values 0 or 1 and the computation of χ jl is straightforward, reducing to counting the occurrence of time pairs t, t + n such that m j (t) = 1 and m l (t + n) = 1.
On practical grounds we introduce a parameter ∆ = λ −λ which measures the variation in the parameters after update. If ∆ is small enough (smaller than some ∆ c ), the terms of order 3 in the Tayor expansion are negligible, then we can use (53). Otherwise, if ∆ is big, we compute a new Montecarlo estimation of µ λ (as described in [33]). We explain in section 4.2 how ∆ c was chosen in our data. Then, we use the following trick. If δ > ∆ c we compute the new value µ λ [ m j ]. If ∆ c > δ > ∆c 10 , we use the linear response approximation (53) of µ λ . Finally, if δ < ∆c 10 we use µ λ [ m l ] instead of µ λ [ m l ] in the next iteration of the method . Thus, in the case, δ < ∆ c , we use the Gibbs distribution computed at some time step, say n, to infer the values at the next iteration. If we do that several successive time steps the distance to the original value λ n of the parameters increases. So we compute the norm λ n − λ n+k at each time step k, and we do not compute a new raster until this norm is larger than ∆ c .

The algorithms
We have two algorithms, sequential and parallel, which are very similar to Dudik el al. Especially, the convergence of their algorithms, proved in their paper, extends to our case since it only depends on the shape of the cost functions (36,37). We describe here the algorithms coming out from the presented mathematical framework, in a sequential and parallel version. We iterate the algorithms until the distance η = d µ λ , π (T ) ω is smaller than some η c . We use the Hellinger distance: Compute a new Gibbs sample using Montecarlo method [33] else Compute the new features probabilities using Taylor expansion (Equation 53) end end . Algorithm 1: Sequential algorithm. δ is the learning rate by which we change the value of a parameter λ l . η is the convergence criterion (54)). ∆ is the parameter allowing us to decide whether we update the parameters change by computing a new Gibbs sample or by the Taylor expansion. F l is given by eq. (36)

Parallel algorithm
Input: The features empirical probabilities π (T ) ω [ m l ] Output: parameters λ l initialization: λ l = 0 for every l, Compute a new Gibbs sample using Montecarlo method [33] else Compute the new features probabilities using Taylor expansion (Equation 53) end end Algorithm 2: The parallel algorithm. G l is given by (37). The implementation of those algorithms consists on an important part in a software developed at INRIA and called EnaS (Event Neural Assembly Simulation). The executable is freely available at http://enas.gforge.inria.fr/ v3/download.html.

Results
In this section we perform several tests on our method. We first consider synthetic data generated with a known Gibbs potential and recover its parameters. This step also allows us to tune the parameter ∆ c in the algorithms. Then, we consider real data analysis where the Gibbs potential form is unknown. This last step is not a systematic study that would be out of the scope of this paper, but simply provided as an illustration and comparison with the paper of Schneidman et al. 2006 [41].

Synthetic data
Synthetic data are obtained by generating a raster distributed according to a Gibbs distribution whose potential (2) is known. We consider two families of Gibbs potentials. For each family there are L > N monomials whose range belongs to { 1, . . . , R }. Among them, there are N "rate monomials" ω i (D), i = 1 . . . N , whose average gives the firing rate of neuron i, denoted r i ; the L − N other monomials, with degree k > 1, are chosen at random with a probability law ∼ e −k which favors therefore pairwise interactions. The difference between the two families comes from the distribution of coefficients λ l .
1. "Dense" rasters family. The coefficients are drawn with a Gaussian distribution with mean 0 and variance 1 L to ensure a correct scaling of the coefficients dispersion as L increases (Figure 1(a)). This produces typically a dense raster (Figure 1(b)) with strong multiple correlations.  where r i ∈ [0 : 0.01] with a uniform probability distribution. Other coefficients are drawn with a Gaussian distribution with mean 0.8 and variance 1 (Figure 2(a)). This produces a sparse raster (Figure 2(b)) with strong multiple correlations.

Tuning ∆ c
For small N, R (N R ≤ 20) it is possible to exactly compute the topological pressure using the transfer matrix technique [48]. We have therefore a way to compare the Taylor expansion (51) and the exact value.
If we perturb λ by an amount δ in the direction l, this induces a variation on µ λ [ m l ], l = 1 . . . L, given by the Taylor expansion (53). To the lowest order (1) , so that: is a measure of the relative error when considering the lowest order expansion.
In the same way, to the second order: so that: is a measure of the relative error when considering the next order expansion. In Figure 3 we show the relative errors (1) , (2) (in %), as a function of δ. For each point we generate 25 potentials, with N = 5, R = 3, L = 12. For each of these potentials we randomly perturb the λ j s, with a random sign, so that the norm of the perturbation δ is fixed. The linear response χ is computed from a raster of length T = 100000.  (1) and second order to (2) (see text). The curves correspond to N = 5, R = 3, L = 12. Left: Dense case; Right: Sparse case.
These curves show a big difference between the dense and sparse case. In the dense case, the second order error is about 5% for ∆ c = 1 whereas we need a ∆ c ∼ 0.03 to get the same 5% in the sparse case. We choose to align on the sparse case and in typical experiments we take ∆ c = 0.1 corresponding to about 10% of error on the second order.

Computation of the Kullback-Leibler divergence
To compute the Kullback-Leibler divergence between the empirical distribution π (T ) ω and the fitted predicted distribution µ λ , we need to know the value of the pressure P [ λ ], the empirical probability of the potential π . For small networks, we can compute the pressure using the Perron-Frobenius theorem ( [48]). However, for large scales, since we cannot compute the pressure, computing the Kullback-Leibler divergence is not direct and exact. We compute an approximation using the following technique. From Eq. (18) and (24), we can write: From the parameters λ, we compute a spike train distributed as µ λ using the Montecarlo method ( [33]). From this spike train, we compute the monomials averages µ λ [m l ] and the entropy S [ µ λ ] using the method of Strong et al. ( [43]). π

Performances on synthetic data
Here, we test the method on synthetic data where the shape of the sought potential is known: only the λ l s have to be estimated. Experiments were designed according to the following steps: • We start from a potential H λ * = l∈L λ * l m l . The goal is to estimate the coefficient values λ * l knowing the set L of monomials spanning the potential.
• We generate a synthetic spike train (ω s ) distributed according to the Gibbs distribution of H λ * .
• We take a potential H λ = l∈L λ l m l with random initial coefficients λ l . Then we fit the parameters λ l to the synthetic spike train ω (T ) s .
• We evaluate the goodness of fit.
For the last step (goodness of fit) we have used three criteria. The first one simply consists of computing the L 1 error is the final estimated value. d 1 is then averaged on 10 random potentials. Fig. 4 shows the committed error in the case of sparse and dense potentials. The method showed a good performance, both in dense and sparse case, for large N × R ∼ 60.
The main advantage of this criterion is to provide an exact estimation of the error made on coefficients estimation. Its drawback is that we have to know the shape of the potential which generated the raster: this is not the case anymore for real neural networks data. We therefore used a second criterion: confidence plots. For each spike block ω D 0 appearing in the raster ω s we draw a point in a two dimensional diagram with, on abscissa, the observed empirical probability π The probability that π (T ) ωs is therefore of about 99, 6%. This interval is represented by confidence lines spreading

The performance on real data
Here we show the inferring of MaxEnt distribution on real spike trains. We analyzed a set of 20 and 40 neurons 6 (courtesy of M. J. Berry and O. Marre) with spatial and spatio temporal constraints. Data are binned at 20 ms. We show the confidence plots and an example of convergence curves using the Hellinger Distance. The goal here is to check the goodness of fit not only for spatial patterns (as done in [41,35,15,14]), but also for spatio-temporal patterns. Figure 7 show the evolution of the Hellinger distance during parameters update both in parallel and sequential update process.
After estimating the parameters of an Ising and pairwise model of range R = 2 on a set of 20 neurons, we evaluate the confidence plots. Figures 8  and 9 show respectively the confidence plots for patterns of range 1,2 and 3 after fitting with an Ising model and Pairwise model of range R = 2. Our results on 20 neurons confirm the observations made in [48] for N = 5, R = 2 : a pairwise model with memory performs quite better than an Ising model to explain spatio-temporal patterns.
We then made the same analysis for 40 neuron. Figures 10 and 11 show respectively the confidence plots for patterns of range 1,2 and 3 after fitting with an Ising model and Pairwise model of range R = 2. In this case, we were not able to obtain a good convergence for N = 40, R = 2. This is presumably due to the insufficient length of the data set which does not allow us to estimate accurately the probability of some monomials. This aspect is discussed in the next section.

Discussion and conclusion
The method shows better performances for synthetic data than for real data although we did not make extensive studies for real data. The main reason, we believe, is that in the second case we don't know the form of the potential. As a consequence, we stick at existing canonical forms of potentials e.g. Ising and pairwise. The main problem with this approach is that the number of parameters to estimate dramatically growths with N R. The increase is moderate for the Ising model (N rates + N (N −1) 2 symmetric pairwise couplings) but it becomes prohibitively large even for pairwise range R models. On the opposite, our analysis of synthetic data used a relatively small number of parameters to fit.
The large number of parameters has 2 drawbacks: the increasing of computation time and errors in the estimation. Let us comment on the second problem. It is not intrinsic to our method; it is neither intrinsic to MaxEnt; this is a well known problem which arises already when doing linear regression analysis. Increasing the number of parameters may eventually lead to catastrophic estimations where the addition of degree of freedom can seriously hinder the resolution.
Since χ is symmetric we have E [ .˜ ] = χ −1 .E β.β .χ −1 = 1 T χ −1 . We arrive therefore at the conclusion that the fluctuations on the estimated coefficients λ are highly constrained by the convexity of the pressure, as expected. Mathematically, everything goes nicely since P is convex. However, it may happen that P is quite flat in some directions/monomials. Then small errors will be largely amplified. Therefore, when considering potentials of the form (2) it is expected that some terms (monomials) not only are irrelevant, but also dramatically deteriorate the estimation problem, introducing almost zero eigenvalues in χ. This is presumably what happened in Figure 11 where we were not able to obtain a good convergence for monomials averages.
At this stage, the main question is therefore: Can we have an idea of the potential shape from data before fitting the parameters? This question is not only related to the goodness of fit but, it is also a question of concept. Is is useful to represent a pairwise distribution for 40 neurons with nearly 2000 parameters? The idea would then be to filter irrelevant monomials. For that a feature selection method is useful and should complement this work. There are many directions we can take in the favor of the features selection. For instance, selecting the features on threshold ( [37,26]), using a χ 2 method ( [7]) as well as incremental feature selection algorithm ( [2], [50]). Other methods based on periodic orbit sampling ( [5]) and information geometry ( [32,1]) are under current investigation.
We have presented a method to fit the parameters of MaxEnt distribution with spatio-temporal constraints. In the process of exploring the dynamics of neural data, we hypothesize the model, fit it and finally judge the quality of the suggested model. Hence, this work is positioned as an important intermediate step in the neural coding using the MaxEnt framework, opening the door for analyzing the dynamics of large networks being not limited to spatial and/or traditional MaxEnt models.
Finally, we would like to highlight two points that should be investigated in further studies: • The effect of binning. In many experimental studies data is binned. Basically, binning was used in order to account for time spiking sensitivity, which is not the same for all the biological neural networks. For instance, [41] used 20 ms of binning for retinal spike trains. In the present paper, we have used the same as these authors but we have not considered the effect of binning on our statistical estimations. This is certainly a matter of further investigations, especially because, to our best knowledge no systematic study on binning effects on statistics has been done. In particular, three distinct dimensions should be considered: -The statistical dimension: How does binning biases statistics ? Could binning introduce spurious effects such as e.g. creating fallacious long range correlations?
-The computational dimension: how does the performance of the algorithm change with the bin size?
-The biological dimension: cross-correlograms are not the same in all brain areas. So optimal bin size is expected to depend on the investigated area.
• Maximum Entropy: There are several methods now in use to model the spatio-temporal correlations in ensembles of neurons. The generalized linear model (GLM) approach uses maximum likelihood and point-process to assess connectivity (e.g., [35]). Reverse correlation methods can also work well (e.g., [8]). Finally, there are causality metrics like Granger causality or transfer entropy ( [27]). Some of these methods have been compared in [47], but further investigations should be helpful, starting from synthetic data where statistics is under good control. Especially, how does Maximum entropy perform compared to these others methods?
Our method allow to investigate these two questions on numerical grounds although such an investigation should be completed by mathematical insights, using the properties of spatio-temporal Gibbs distributions. Parameters vector H Gibbs potential Z λ Partition function S Entropy P Topological pressure π (T ) ω Empirical probability measured on the spike train ω of length T µ λ Gibbs density with parameters λ M Set of invariant probabilities δ l = λ l − λ l Learning rate or the value by which we update the parameters λ l δ Vector of learning rates d KL Kullback-Leibler divergence C jk Correlation between two monomials j and k χ Hessian matrix (second derivative of the pressure) ∆ Root sum square of the learning rates β Fluctuations on the monomials averages Fluctuations on the parameters (relaxation)