Large Deviations Properties of Maximum Entropy Markov Chains from Spike Trains

We consider the maximum entropy Markov chain inference approach to characterize the collective statistics of neuronal spike trains, focusing on the statistical properties of the inferred model. To find the maximum entropy Markov chain, we use the thermodynamic formalism, which provides insightful connections with statistical physics and thermodynamics from which large deviations properties arise naturally. We provide an accessible introduction to the maximum entropy Markov chain inference problem and large deviations theory for the computational neuroscience community, avoiding some technicalities while preserving the core ideas and intuitions. We review large deviations techniques useful in spike train statistics to describe properties of accuracy and convergence in terms of sampling size. We use these results to study the statistical fluctuations of correlations, distinguishability, and irreversibility of maximum entropy Markov chains. We illustrate these applications using simple examples where the large deviation rate function is explicitly obtained for maximum entropy models of relevance in this field.


Introduction
Spiking neuronal networks are perhaps the most sophisticated information processing devices that are available for scientific inquiry. There exists already an understanding of their basic mechanisms and functionality: they are composed of interconnected neurons which fire action potentials (also known as "spikes") collectively in order to accomplish specific tasks, e.g., sensory information processing or motor control [1]. However, the interdependencies in the spiking activity of populations of neurons can be extremely complex. In effect, these interdependencies can involve neighboring as well as distant cells, being established either via structural connections, i.e., physical media such as synapses, or by functional connections reflected through spike correlations [2].
Understanding the way in which neuronal networks process information requires disentangling structural and functional connections while clarifying their interplay, which is a challenging but critical issue [3,4]. To this end, networks of spiking neurons are usually measured using in vitro or in vivo multi-electrode arrays, which connect neurons to electronic sensors specially designed for spike detection. Recent progress in acquisition techniques allows the simultaneous measurement of the spiking activity from increasingly large populations of neurons, enabling the collection of large amounts of experimental data [5]. Prominent examples of spike train recordings have been obtained from vertebrate retina (salamander, rabbit, and degus) [6][7][8] and cat cortex [9].
However, despite the progress in multi-electrode and neuroimaging recording techniques, modeling the collective spike train statistics is still one of the key open challenges in computational neuroscience. Analysis of recorded data has shown that, although the neuronal activity is highly variable (even when the same stimulus is presented repeatedly), the statistics of the response are highly structured [10,11]. Therefore, it seems that much of the inner dynamics of neuronal networks is encoded in the statistical structure of the spikes. Unfortunately, traditional methods of estimation, inference, and model selection are not well-suited for this scenario since the number of possible binary patterns that a neuronal network can adopt grows exponentially with the size of the population. In fact, even long experimental recordings usually contain only a small subset of the possible spiking patterns, which makes the empirical frequencies poor estimators for the underlying probability distribution. For practical purposes, this induces dramatic limitations, as standard inference tools become unreliable as soon as the number of considered neurons grows beyond 10 [6].
Given the binary nature of the spiking data, it is natural to relate neuronal networks to digital communication systems via Shannon's information theory. A possibly more subtle way of establishing this link is provided by the physics literature that studies stochastic spin systems. In fact, a succession of research efforts has helped develop a framework to study the spike train statistics based on tools of statistical physics, namely the maximum entropy principle (MEP), which provides an intuitive and tractable procedure to build a statistical model for the whole neuronal network. In 2006, Schneidman et al. [6] and Pillow et al. [12] used the MEP to characterize the spike train statistics of the vertebrate retina responding to natural stimuli, constraining only range-one features, namely firing rates and instantaneous pairwise interactions. Since then, the MEP approach has become a standard tool to build probability measures in the field of spike train statistics [6,8,12,13]. This approach has triggered fruitful analyses of the neural code, including works about criticality [14] and redundancy and error correction [7], among other intriguing and promising topics.
Although relatively successful, this approach for linking neuronal populations and statistical mechanics is based on assumptions that go against fundamental biological knowledge. Firstly, these works assume that the spike patterns are statistically independent of past and future activities of the network. In fact, and not surprisingly, there exists strong evidence supporting the fact that memory effects play a major role in spike train statistics [8,9,15]. Secondly, most studies that apply statistical mechanics to analyze neuronal data use tools that assume that the underlying system is in thermodynamic equilibrium. However, it has been recognized that being out-of-equilibrium is one of the distinctive properties of living systems [16][17][18]. Consequently, any statistical description that is consistent with the out-of-equilibrium condition of living neuronal networks should reflect some degree of time asymmetry (i.e., time irreversibility), which can be characterized using Markov chains [19][20][21][22][23][24].
As a way of addressing the above observations, some recent publications study maximum entropy Markov chains (MEMC) based on a variational principle from the thermodynamic formalism of dynamical systems (see for instance [8,24,25]). This framework is an extension of the classic approach based on the MEP that admits as constraints correlations of spikes among neurons both simultaneous and at different time delays, being able in this way to account for various memory effects.
Most of the literature in spike train statistics via the MEP pays little attention to the fact that model estimation is done based on finite data (errors due to statistical fluctuations are likely to occur in this context). As the MEP can be seen as a statistical inference procedure, it is natural to inquire about the uncertainty (i.e., fluctuations and convergence properties) related to the inferred MEMC, or, in other words, ask for the robustness of the inference as a function of the sampling size of the underlying data set. Quantifying this error is particularly relevant in the light of recent results that suggest that the parameters inferred by the MEP approach in the context of experimental biological recordings are sharply poised at criticality [7,26]. On the other hand, once the MEMC has been inferred, it is also important to quantify how well a sample of the MEMC reproduces the average values of features of interest and how likely it is that a sample of the MEMC produces a "rare" or unlikely event.
To provide some first steps in addressing the above issues, this paper studies the MEMC framework using tools from large deviation theory (LDT) [27,28]. We exploit the fact that the average values of features obtained from samples of the MEMC satisfy a large deviation property, and use LDT techniques to estimate their fluctuations in terms of the sampling size. We also show how to compute the rate functions using the tilted transition matrix technique and the Gärtner-Ellis theorem. It is to be noted that there is a large body of theoretical work linking the maximum entropy principle and large deviations [27,29]. However, these techniques have been scarcely used in spike train analysis (only to study the i.i.d. case [30][31][32][33]), most likely because of the lack of a suitable introduction of these concepts within the neuroscientific literature. It is our hope that these applications might trigger the interest of the computational neuroscience community in the large deviation literature. Consequently, another goal of this paper is to provide an accessible introduction to the MEMC and LDT formalisms for the computational neuroscience community, avoiding some technicalities while preserving the core ideas and intuitions. To the best of our knowledge, this manuscript presents the first attempt to bring these two topics together in the context of spike train statistics. This article is part of a more ambitious program that attempts to build a more unified theoretical structure and a complete toolbox helpful to approach spike train statistics using the thermodynamic formalism [24,25].
The rest of this paper is organized as follows. Section 2 presents the basic definitions and tools needed to apply large deviations techniques further in the paper. In particular, we present the maximum entropy principle framed in the thermodynamic formalism as a variational principle. In Section 3, we introduce some basic aspects of the theory of large deviations. In Section 4, we focus on the empirical averages of features. We present some examples of relevance in spike train statistics, where we are able to compute explicitly the rate function for each feature in the maximum entropy potential. In Section 5, we present further applications of the theory of large deviations in this field with a list of illustrative examples and finally we present our conclusions in Section 6.

Preliminaries
This section introduces the general definitions, notations and conventions that are used throughout the paper, providing in turn the necessary background for the unfamiliar reader.

Data Binarization and Spike Trains
Let us consider a set of measurements from a network of N interacting neurons. The "raw data" consist of N continuous signals containing the extra-cellular potential (electrical potential measured outside of the cell) of each of the neurons recorded over the length of the experiment. These data are processed by spike sorting algorithms [34,35], which are signal processing techniques designed to extract the spiking activity of each neuron.
Neurons have a minimal characteristic time in which no two spikes can occur, called "refractory period" [36], which provides a natural time-scale that can be used for "binning" (i.e., for discretizing) the time index of the measurements, denoted by ∆t_b (when binning, sometimes it can be useful to go beyond the refractory period; in those cases, two spikes may occur within the same time bin, and the convention is to consider this event equivalent to just one spike). Denoting the time index by the integer variable t, one can say that x_t^k = 1 whenever the k-th neuron emits a spike during the t-th time bin, while x_t^k = 0 otherwise. This standard procedure transforms experimental data into sequences of binary patterns (see Figure 1).
A spike pattern is the spike-state of all the measured neurons at time bin t, which is denoted by x_t := (x_t^1, …, x_t^N). A spike block is a consecutive sequence of spike patterns, denoted by x_{t,r} := (x_s)_{s=t}^{r}. While the length of the spike block x_{t,r} is r − t + 1, it is also useful to consider spike blocks of infinite length starting from time t = 0, which are denoted by x. Finally, in this paper, we consider that a spike train is either a spike block of finite length or an infinite sequence of spike patterns, which will be useful later when discussing asymptotic properties. The set of all possible spike blocks of length R corresponding to a network of N neurons is denoted by A_N^R. The set of all spike blocks of infinite length is denoted by Ω ≡ A_N^ℕ, which is a useful mathematical object as clarified below. Let us define the natural projection proj_R : Ω → A_N^R given by proj_R(x) = x_{0,R−1}.
Figure 1. After binning with ∆t_b, the spiking activity is transformed into binary patterns in discrete time, illustrating the notation used throughout this paper.

Features
Following the machine-learning nomenclature, a feature is a function that extracts a property of interest from the data. Formally, a feature is defined as a function f : Ω → R that associates a real number to each x ∈ Ω. The feature f is said to have a temporal range (or simply a range) R if for every x, y ∈ Ω such that x_{0,R−1} = y_{0,R−1}, one has that f(x) = f(y), that is, if f only depends on the first R spike patterns of the spike train. A special class of features, on which this work focuses, are binary functions consisting of finite products of spike states, i.e.,

f_l(x) = ∏_{k=1}^{q} x_{t_k}^{i_k}.

Above, l is a shorthand notation for the set {(t_k, i_k)}_{k=1}^{q}, where (t_k)_{k=1}^{q} and (i_k)_{k=1}^{q} are collections of time and neuron indexes, respectively, and q is the number of spikes considered by the feature. Correspondingly, for a given index l, one has f_l(x) = 1 if and only if the i_k-th neuron spikes at time t_k, for all k ∈ {1, …, q}, in the spike train x, while f_l(x) = 0 otherwise. Note that, when considering features of range R ≥ 1, the firing times t_k are constrained within the interval {0, …, R − 1}. We define the reduced feature f̄ : A_N^R → R as the restriction of f to the first R spike patterns, i.e., f̄(x_{0,R−1}) = f(x) for any x ∈ Ω extending x_{0,R−1}.
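As a concrete computational illustration, the following minimal Python sketch encodes a binary feature f_l as a product of spike states over a spike block; the function name and index convention are ours, not part of the formalism above.

```python
import numpy as np

def feature(block, indexes):
    """Binary feature f_l: the product of spike states x_{t_k}^{i_k}.

    block   : array of shape (R, N); block[t, i] is the spike state of
              neuron i at time bin t within the block.
    indexes : list of (t_k, i_k) pairs defining the feature l.
    Returns 1 if all indexed spike states equal 1, and 0 otherwise.
    """
    return int(all(block[t, i] == 1 for t, i in indexes))

# A range-2 spike block for N = 3 neurons: rows are time bins, columns neurons.
block = np.array([[1, 0, 1],
                  [0, 1, 1]])

rate_f = feature(block, [(0, 0)])           # firing of neuron 0 (range one)
sync_f = feature(block, [(0, 0), (0, 2)])   # synchronous pair (range one)
delay_f = feature(block, [(0, 0), (1, 1)])  # delayed pair (range two)
```

Range-one features only inspect row 0 of the block, while features mixing different rows capture the non-synchronous interactions discussed later.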

Statistical Structure
For a given spiking neuronal network involved in a particular experimental protocol, the measured activity usually contains a significant amount of stochasticity that is characteristic of measurements at this spatiotemporal scale. This randomness is caused mostly by: (i) the random variation in the ionic flux of charges crossing the cellular membrane per unit time at the postsynaptic bouton due to the binding of neurotransmitters; (ii) the fluctuations in the current resulting from the large number of openings and closings of ion channels [37,38]; (iii) noise coming from electrical synapses [39].
To capture this stochasticity within our modeling, it is natural to endow Ω with a probabilistic structure. For this, we assume there exists a probability distribution p{·} over Ω that quantifies the intrinsic randomness associated to the spiking phenomena. From this point of view, all A ⊂ Ω are events that might take place with probability p{A}. Following a standard practice in computational neuroscience, we assume that the stochastic process generating the spikes is stationary, i.e., that its statistics do not change in time. As we discuss below, this assumption is crucial for the maximum entropy inference. Although an extension of our approach to a non-stationary scenario is possible, we focus here on the stationary case as it greatly simplifies the presentation. Using the stationarity assumption, given the probability distribution of the whole process p{·}, one can define a unique corresponding probability distribution p_R{·} over A_N^R following the natural projection, given by:

p_R{x_{0,R−1}} = p{proj_R^{−1}(x_{0,R−1})}.    (1)

As a consequence of assuming a stochastic process guiding the neuronal activity, a feature f : Ω → R becomes a random variable, whose statistics are inherited from p{·}. In particular, considering the feature f(x) = x_t^k, one can note that individual spike-states (as well as spike patterns and spike blocks) become discrete random variables. As a convention, we denote by X_t^k a random spike-state that follows an implicit underlying probability distribution p{·}, while lower-case expressions (e.g., x_t^k) are used to denote concrete realizations of these random variables. The mean value of a feature f with respect to the probability p{·} is denoted by E_p{f}. For the case of features of range R, the mean value can be expressed alternatively as:

E_p{f} = ∑_{x_{0,R−1} ∈ A_N^R} f̄(x_{0,R−1}) p_R{x_{0,R−1}},    (2)

which is a finite sum. Above, f̄ is the reduced feature, as defined in Section 2.2.

Empirical Averages
Let us consider spiking data of the form x_{0,T−1}, where T is the sample length. Although in general the underlying probability measure p{·} that governs the spiking activity is unknown, it is useful to use the available data to estimate the mean values of specific features. If f is a feature of range R, the empirical average value of f from the sample x_{0,T−1} is

A_T(f) := (1/(T − R + 1)) ∑_{t=0}^{T−R} f̄(x_{t,t+R−1}).

In particular, for features of range one, the previous expression becomes

A_T(f) = (1/T) ∑_{t=0}^{T−1} f̄(x_t).

An interesting question is under which conditions the empirical average converges to the mean value E_p{f} as the sample length T grows.
This, and other convergence issues, are explored in Section 4.
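To make the empirical averaging concrete, the following Python sketch (with hypothetical names; a minimal illustration, not from the references) computes the empirical average of a range-R feature over a simulated binary spike train, here an i.i.d. fair-coin train for which the expected values are known.

```python
import numpy as np

def empirical_average(spikes, f, R):
    """Empirical average A_T(f) of a range-R feature over a spike train.

    spikes : array of shape (T, N) with binary spike states.
    f      : function mapping a block of shape (R, N) to a real number.
    """
    T = spikes.shape[0]
    return np.mean([f(spikes[t:t + R]) for t in range(T - R + 1)])

rng = np.random.default_rng(0)
spikes = rng.integers(0, 2, size=(10000, 2))  # i.i.d. fair-coin spikes, N = 2

# Range-one feature: firing rate of neuron 0; expected value 0.5.
rate = empirical_average(spikes, lambda b: b[0, 0], R=1)

# Range-two feature: neuron 0 fires and neuron 1 fires one bin later;
# expected value 0.25 under independence.
pair = empirical_average(spikes, lambda b: b[0, 0] * b[1, 1], R=2)
```

With T = 10000 samples, both estimates fall close to their expected values, illustrating the convergence discussed above.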

Inference of the Statistical Model with the MEP
Following Section 2.3, the probability measure p{·} represents the inherent stochasticity of the spiking neural population under consideration. As p{·} is unknown, one would like to infer it from data. In the sequel, we first introduce the general MEP as a method for inferring p{·}. Then, we show this principle for the case where only synchronous constraints are considered. Finally, we present the case where non-synchronous correlations are included to constrain the maximization problem.

Fundamentals of the MEP
The MEP was first proposed by E. T. Jaynes as a way of estimating probability distributions when the information for performing the inference is scarce [40]. Rooted in principles of statistical physics, this approach selects a probability measure that satisfies the evidence supported by the available information while leaving all other aspects as random as possible. For quantifying the corresponding randomness, Jaynes showed that the most natural metric is the Shannon entropy [41]. The probability measure found by this procedure is known as the maximum entropy distribution.
Formally, the MEP is a concave constrained maximization problem, where the constraints that define the optimization space correspond to the available information that guide the inference process. Accordingly, if additional constraints are introduced then the optimization space is reduced; this corresponds to the informative power of new information, which restricts the space of models that are consistent with it.
The inference procedure based on the MEP follows these steps:
I. Select a set of features f_1, …, f_K considered relevant for describing the system.
II. Compute the empirical average values c_k = A_T(f_k) of each feature from the data x_{0,T−1}, for k = 1, …, K.
III. Consider the family of candidate models

M(c_1, …, c_K) := { q ∈ M : E_q{f_k} = c_k, k = 1, …, K },

where M is the set of probability measures and M(c_1, …, c_K) is the family of probability measures that are consistent with the empirical mean values c_1, …, c_K obtained in Step II.
IV. Defining the entropy rate of the stochastic process as

S{q} := lim_{t→∞} −(1/t) ∑_{x_{0,t−1} ∈ A_N^t} q{x_{0,t−1}} ln q{x_{0,t−1}},    (3)

find the maximum entropy process, characterized by the probability measure

p̂ := argmax_{q ∈ M(c_1,…,c_K)} S{q}.    (4)

Some small remarks should be made about this procedure. One can think of it as a data-driven algorithm, whose input is the data x_{0,T−1} and whose output is the maximum entropy measure p̂. The first two steps of the process are known in the machine learning literature as "feature selection" and "feature extraction", respectively (see, e.g., [42,43]). The goal of these steps is to reduce the dimensionality of the input for the subsequent stages, preventing the selected model from overfitting the data (i.e., from including in the model effects of noise and biases due to the finiteness of the data). Hence, what drives the model selection stages is not the whole data but the quantities c_1, …, c_K, which define the space to be explored in Step IV.
Steps III and IV are known as "model selection". In machine learning jargon, these steps deliver a generative model, in the sense that the obtained model can later be used to generate new data. In this sense, it is interesting that, although the data are finite, the entropy rate calculated in Step IV is computed over spike blocks of all lengths t, which is possible due to the generative nature of the candidate models. The inputs for the model selection stages are not the entire data x_{0,T−1} but only the values c_1, …, c_K, which represent the knowledge obtained from the data that guides the search in the space of candidate models. Moreover, as these quantities represent all the available knowledge one has about the underlying stochastic process generating the spikes, one would like to select a model that reflects that information while making no further assumptions. Recalling the work of Claude Shannon on the analysis of information sources (cf. [44] and references therein), one can interpret the entropy rate as a measure of how hard it is to predict the realization of a stochastic process. This implies, in turn, that the maximum entropy measure within M(c_1, …, c_K) is the most random, i.e., unstructured, among those which satisfy the constraints.
Although the framework presented above is general enough to encompass the cases considering synchronous and non-synchronous constraints [21], when considering features of range R > 2 the problem goes beyond the realm of equilibrium statistical mechanics. We therefore present an alternative approach general enough to deal with systems with non-synchronous constraints. In Section 3.2, we present the method for finding the maximum entropy measure when only synchronous features are selected, leaving for Section 3.3 the more general situation including non-synchronous constraints.

Time-Independent Constraints
Assuming only synchronous interactions is equivalent to considering only features of range one (i.e., features that consider neurons at the same time index, cf. Section 2.2), which leads to restricting the candidate models to those where the present state is statistically independent of past and future states. Moreover, by the assumption of stationarity, this leads to modeling the collective spiking activity as a sequence of i.i.d. random variables. The challenge, in this case, is to estimate the corresponding distribution as reliably as possible. A large portion of the literature on maximum entropy spike train statistics focuses on synchronous interactions between neurons, implicitly neglecting interactions across time. Although this assumption induces a strong simplification, the resulting models have proven to be rich in structure and can provide interesting results and insights about the neural code [6,12]. In the following, we recall how this problem can be addressed using the MEP.
As a consequence of the assumptions of temporal independence and stationarity, it can be shown that Equation (4) reduces to

p̂_1 = argmax_{q_1 ∈ M_1(c_1,…,c_K)} { −∑_{x_0 ∈ A_N^1} q_1{x_0} ln q_1{x_0} },    (5)

where M_1(c_1, …, c_K) corresponds to the set of distributions q_1 over A_N^1 (cf. range-one projections in Equation (1)) such that the constraints E_{q_1}{f_k} = c_k are satisfied for each k = 1, …, K. Note that the above sum is over the 2^N possible spike patterns, being a simpler condition than Equation (4). In fact, following a simple argument based on Lagrange multipliers and the concavity of the entropy, it can be shown that the distribution p̂_1 that solves Equation (5) is unique. Moreover, it is a Boltzmann-Gibbs distribution [41]:

p̂_1{x_0} = e^{H_β(x_0)} / Z(β),    (6)

where H_β is referred to as the energy or potential function

H_β := ∑_{k=1}^{K} β_k f_k,    (7)

β ∈ R^K is the vector of Lagrange multipliers, and Z(β) := ∑_{x_0 ∈ A_N^1} e^{H_β(x_0)}. Following the statistical physics literature, Z(β) is called the partition function, whose logarithm is referred to as the free energy.
Conversely, from the uniqueness property of the maximum entropy distribution, one can conclude that there is only one Boltzmann-Gibbs distribution p̂_1 that belongs to M_1(c_1, …, c_K), being the only solution of Equation (5). Interestingly, this alternative formulation is much easier to solve than the original optimization problem (in particular, M_1(c_1, …, c_K) is not easy to parameterize, and hence the application of standard techniques of convex optimization (e.g., gradient descent) is not straightforward). In effect, one only needs to find the values of the parameter vector β that reproduce the empirical average values c_1, …, c_K. Moreover, it is known that, for any Boltzmann-Gibbs distribution p_1, the following holds:

∂ ln Z(β) / ∂β_k = E_{p_1}{f_k},  k = 1, …, K.    (8)

Therefore, using Equation (8), one can find the appropriate values of β for which E_{p̂_1}{f_k} = c_k are satisfied (however, for practical purposes, this problem cannot be solved for systems with N > 20 [8], so alternative procedures are needed; we refer the interested reader to [7,14,45,46]).
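For small networks, the parameter-fitting problem just described can be solved by brute force. The following Python sketch (a toy illustration with hypothetical target values, feasible only for small N) enumerates all 2^N patterns and performs gradient ascent on the dual problem, using the fact that the gradient mismatch is c_k − E_p{f_k}, cf. Equation (8).

```python
import itertools
import numpy as np

def model_averages(beta, F):
    """E_p{f_k} under p(x) proportional to exp(sum_k beta_k f_k(x)),
    computed by exhaustive enumeration (cf. Equation (8))."""
    w = np.exp(F @ beta)
    return (w / w.sum()) @ F

def fit_maxent(F, targets, steps=5000, lr=0.5):
    """Gradient ascent on the dual: adjust beta until E_p{f_k} = c_k."""
    beta = np.zeros(F.shape[1])
    for _ in range(steps):
        beta += lr * (targets - model_averages(beta, F))
    return beta

# All 2^N spike patterns for N = 3 neurons.
patterns = np.array(list(itertools.product([0, 1], repeat=3)))
# Features: three firing rates plus one synchronous pairwise product.
F = np.column_stack([patterns[:, 0], patterns[:, 1], patterns[:, 2],
                     patterns[:, 0] * patterns[:, 1]])
targets = np.array([0.2, 0.3, 0.5, 0.1])   # hypothetical empirical averages c_k
beta = fit_maxent(F, targets)
fitted = model_averages(beta, F)
```

Since the pairwise target 0.1 exceeds the product of the individual rates (0.06), the fitted interaction parameter comes out positive, as expected for positively correlated neurons.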

Non-Synchronous Constraints
A generalization of the previous approach is to include as constraints the average values of features corresponding to interactions in the spiking activity across time. This is a natural assumption in biological spiking networks, as it is expected that the spike of one neuron influences subsequent spikes.
Statistical models with time-dependencies can be generated with the MEP by introducing features that involve different time indexes. In effect, selecting features of range R induces interdependencies and a corresponding "memory" of length R − 1 in the model, and thus it is natural to look for the best suited Markov chain over the state space A_N^R. A Markov chain model allows expressing the probability of a spike train x_{0,T}, for T > R, as

p{x_{0,T}} = π{x_{0,R−1}} ∏_{t=0}^{T−R} P{x_{t+1,t+R} | x_{t,t+R−1}},

where P is a transition probability matrix (note that P{·|·} has a consistency requirement: for w, y ∈ A_N^R, P{w|y} > 0 only when y_{1,R−1} = w_{0,R−2}) and π is a corresponding invariant probability distribution (which is unique due to the ergodicity assumption, cf. Section 3.3.1) associated to P. Note that, due to the stationarity condition, the transition probabilities P{·|·} are constant over the realization of the whole process (see Appendix A for more details).
Following the MEP, as described in Section 3.1, we look for a procedure for finding a Markov transition matrix P that maximizes its entropy rate while satisfying constraints given by the empirical averages of features f_1, …, f_K. For ergodic Markov chains, a well-known calculation (which can be found, e.g., in [44]) shows that the entropy rate given by Equation (3) is equivalent to the following simple expression:

S = −∑_{i ∈ A_N^R} π_i ∑_{j ∈ A_N^R} P_ij ln P_ij,    (9)

where π_i = π{x_{0,R−1} = i} and P_ij = P{j|i} for i, j ∈ A_N^R. It is important to notice that Equation (9) corresponds to the Kolmogorov-Sinai entropy (KSE) in the dynamical systems literature [47]. In general, Equation (9) is larger when, for a fixed i, the conditional probabilities P_ij are closer to a uniform distribution, i.e., when knowing the current state gives little certainty about the next step.
It can be shown that, if the considered features do not involve correlations across time (i.e., they are features of range one, cf. Section 2.2), then the resulting transition probabilities are such that the corresponding stochastic process is i.i.d. (i.e., P_ij = π_j). Interestingly, in this scenario, Equation (9) reduces to the Shannon entropy of π. This clarifies that this approach based on Markov chains extends the classical MEP and the results presented in Section 3.2.
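The entropy rate of Equation (9) and its reduction to the Shannon entropy in the memoryless case can be checked numerically. The following Python sketch (a minimal two-state illustration with arbitrary transition probabilities of our choosing) computes the KSE for a chain with memory and for its memoryless counterpart P_ij = π_j.

```python
import numpy as np

def entropy_rate(P, pi):
    """Kolmogorov-Sinai entropy of a stationary Markov chain,
    Equation (9): S = -sum_i pi_i sum_j P_ij ln P_ij (in nats)."""
    terms = np.where(P > 0, P * np.log(np.where(P > 0, P, 1.0)), 0.0)
    return -pi @ terms.sum(axis=1)

# A two-state chain with memory and its stationary distribution.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
pi = np.array([2/3, 1/3])            # satisfies pi @ P == pi
s_markov = entropy_rate(P, pi)

# The memoryless case P_ij = pi_j recovers the Shannon entropy of pi.
P_iid = np.tile(pi, (2, 1))
s_iid = entropy_rate(P_iid, pi)
shannon = -np.sum(pi * np.log(pi))
```

The chain with memory has a strictly smaller entropy rate than the i.i.d. chain with the same marginals, since knowing the current state makes the next one more predictable.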
Finding the MEMC raises, however, some extra technicalities with respect to the time-independent case. Recall that the goal is no longer to estimate a probability distribution, but to reconstruct from data a transition matrix P and a corresponding invariant measure π. On the one hand, P and π are not independent parameters of the process (π has to be the left eigenvector of P associated with the unit eigenvalue [48]); on the other hand, although the method of Lagrange multipliers can still be applied for constraints of range two or more, the extension to non-synchronous constraints is not straightforward. For this reason, in the sequel, we explore an alternative route to build the MEMC based on the transfer matrix technique. This technique also provides further insightful connections with statistical physics and thermodynamics, from which its large deviation properties arise naturally.

Transfer Matrix Method
To find the MEMC associated with non-synchronous constraints, we follow the same ideas presented above for the time-independent case, but using different tools, which we present next.
Let us consider the set of features chosen to constrain the maximization of the entropy rate (Step I in Section 3.1), and assume that these features have a finite maximum range R. From these features, one can build an energy function H_β of range R as a linear combination of them, as in Equation (7). Using this energy function, we build a matrix denoted by L_{H_β}, so that for every y, w ∈ A_N^R its entries are given as follows:

L_{H_β}(y, w) = e^{H_β(yw_{R−1})} if y_{1,R−1} = w_{0,R−2}, and 0 otherwise.    (10)

By yw_{R−1} we mean the word obtained by the concatenation of y and the last pattern w_{R−1} of w. In the particular case of energy functions associated to range-one features, the above matrix is defined as L_{H_β}(y, w) = e^{H_β(y)}. Assuming H_β > −∞, the allowed entries of the matrix L_{H_β} are positive, which in turn implies ergodicity. Moreover, the matrix is primitive by construction; thus, it satisfies the Perron-Frobenius theorem [49]. Hereafter, L_{H_β} will be referred to as the Ruelle-Perron-Frobenius (RPF) matrix. Let us denote by ρ the largest eigenvalue of L_{H_β}, which, because it satisfies the Perron-Frobenius theorem, is an eigenvalue of multiplicity one and strictly larger in modulus than the rest of the eigenvalues [49]. We denote by U and V the left and right eigenvectors of L_{H_β} corresponding to the eigenvalue ρ, noticing that U_i > 0 and V_i > 0 for all i ∈ A_N^R. The RPF matrix is not the Markov matrix we are looking for; indeed, it is not even a stochastic matrix, but it can be converted into one. Recall that, for an irreducible matrix M with spectral radius λ and positive right eigenvector v associated to λ, the stochasticization of M is the following stochastic matrix:

M_s := (1/λ) D^{−1} M D,    (11)

where D is the diagonal matrix with diagonal entries D(i, i) = v_i. Thus, in our context, the MEMC transition matrix is given as follows:

P_ij = L_{H_β}(i, j) V_j / (ρ V_i),    (12)

which is the main output of the MEMC approach (see Figure 2).
The unique stationary probability measure π associated to P is explicitly given by

π_i = U_i V_i / ⟨U, V⟩,    (13)

where ⟨U, V⟩ = ∑_{i ∈ A_N^R} U_i V_i is the standard inner product in R^{2^{NR}} (we refer the reader to [49] for details and proofs).
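The RPF construction above can be sketched numerically. The following Python code (our own minimal illustration, restricted to the range-one case where L_{H_β}(y, w) = e^{H_β(y)} and toy energy values chosen by us) builds the RPF matrix, extracts ρ, U, and V, and forms the stochasticized transition matrix and its stationary measure; for range one, the construction should recover the i.i.d. Boltzmann-Gibbs distribution with ρ = Z(β).

```python
import numpy as np

def memc_from_energy(H):
    """MEMC from a range-one energy vector H over the spike patterns.
    RPF matrix: L(y, w) = exp(H(y)); P is its stochasticization and
    pi the stationary measure pi_i = U_i V_i / <U, V>."""
    n = len(H)
    L = np.outer(np.exp(H), np.ones(n))
    # Leading eigenvalue rho with right (V) and left (U) eigenvectors.
    evals, right = np.linalg.eig(L)
    k = np.argmax(evals.real)
    rho = evals.real[k]
    V = np.abs(right[:, k].real)
    evals_l, left = np.linalg.eig(L.T)
    U = np.abs(left[:, np.argmax(evals_l.real)].real)
    P = L * V[None, :] / (rho * V[:, None])   # P_ij = L_ij V_j / (rho V_i)
    pi = U * V / (U @ V)
    return P, pi, rho

H = np.array([0.5, -0.2, 0.1, 0.0])   # toy energies for the 4 patterns (N = 2)
P, pi, rho = memc_from_energy(H)
```

In this degenerate range-one case the rows of P are all equal to π, confirming that the transfer matrix method reduces to the time-independent MEP of Section 3.2.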

Thermodynamic Formalism
In the previous section, we have shown how to obtain the transition matrix and the invariant measure of a Markov chain. However, we have not yet included the constraints (we have just used the features to build the energy function); in other words, we have not yet fit the parameters of the MEMC. To fit the maximum entropy parameters, let us introduce the following quantity,

P[H_β] := sup_{q ∈ M_st} { S{q} + E_q{H_β} },    (14)

where M_st is the set of all stationary probability measures on A_N^R and E_q{H_β} = ∑_{k=1}^{K} β_k E_q{f_k} is the average value of H_β with respect to q. Solving the optimization problem in Equation (14), one gets the Markov measure we are looking for. Indeed, one knows from the thermodynamic formalism (see [50]) that, for our energy function H_β of range R ≥ 2, there exists a unique translation-invariant (stationary) Markov measure p associated to H_β, for which there is a constant M > 1 such that

M^{−1} ≤ p{x_{0,n−1}} / exp( −n P[H_β] + ∑_{t=0}^{n−R} H_β(x_{t,t+R−1}) ) ≤ M,    (15)

and which attains the supremum in Equation (14), that is, P[H_β] = S{p} + E_p{H_β}. The quantity P[H_β] is called the topological pressure, which plays the role of the free energy in statistical mechanics. The measure p, as characterized by Equation (15), is known as a Gibbs measure in the sense of Bowen. Note that one can show that MEMCs are particular cases of these measures, associated to finite-range energy functions. Moreover, Equation (6) is a particular case of Equation (15), when M = 1 and H_β is an energy function of range one.
The average values of the features, their correlations, as well as their higher cumulants can be obtained by taking successive derivatives of the topological pressure with respect to their conjugate parameters β_k. This explains the important role played by the topological pressure in this framework. In general,

∂^n P[H_β] / ∂β_k^n = κ_n(f_k),    (16)

where κ_n is the cumulant of order n (see Appendix B).
In particular, taking the first derivative:

∂P[H_β] / ∂β_k = E_p{f_k} = c_k,  k = 1, …, K,    (17)

where E_p{f_k} is the average of f_k with respect to the maximum entropy measure p, which is equal (by construction) to the empirical average value c_k, the k-th constraint of the maximization problem. These equations suggest a relationship with the free energy, i.e., the logarithm of the partition function of the Boltzmann-Gibbs distribution. Indeed, for range-one potentials (time-independent maximum entropy distributions), ρ(β) = Z(β) and P[H_β] = ln Z(β), which relates Equation (8) with Equation (17) (for a detailed example, see Section 5.2). The problem of estimating the MEMC parameters becomes computationally expensive for large matrices; however, there exist efficient algorithms in the literature to estimate the parameters of the Markov maximum entropy problem [45].
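The identity in Equation (17) can be verified numerically in the range-one case, where the pressure equals ln Z(β). The following Python sketch (toy parameters of our choosing, derivatives by central finite differences) checks that the gradient of the pressure reproduces the model averages of the features.

```python
import itertools
import numpy as np

def pressure(beta, F):
    """Topological pressure for a range-one potential: ln Z(beta),
    with Z(beta) = sum_x exp(sum_k beta_k f_k(x))."""
    return np.log(np.sum(np.exp(F @ beta)))

# Features over the 2^N patterns for N = 2: two rates and one pairwise term.
patterns = np.array(list(itertools.product([0, 1], repeat=2)))
F = np.column_stack([patterns[:, 0], patterns[:, 1],
                     patterns[:, 0] * patterns[:, 1]])
beta = np.array([0.4, -0.3, 0.8])     # arbitrary toy parameters

# Model averages E_p{f_k} under the Boltzmann-Gibbs distribution.
w = np.exp(F @ beta)
model_avg = (w / w.sum()) @ F

# Gradient of the pressure by central differences (cf. Equation (17)).
eps = 1e-6
grad = np.array([(pressure(beta + eps * e, F) - pressure(beta - eps * e, F))
                 / (2 * eps) for e in np.eye(3)])
```

Higher derivatives estimated the same way would yield the variances and higher cumulants of the features, as stated in Equation (16).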

Preliminary Considerations
This subsection reviews two elementary tools for studying the convergence of random variables, providing corresponding references. The central limit theorem is first introduced in Section 4.1.1, and then large deviations theory is discussed in Section 4.1.2.

Central Limit Theorem
Let us first assume that one has access to arbitrarily large data sequences. Consider $t \in \mathbb{N}$, let $x_{0,t-1}$ be the spike-block of length $t$ (which is allowed to grow), and let $f : \Omega \to \mathbb{R}$ be an arbitrary feature (not necessarily belonging to the set of features chosen to fit the MEMC). In this section, we establish asymptotic properties of $A_t(f)$ sampled with respect to the MEMC characterized by $p\{\cdot\}$.
Throughout this work, we assume that $p\{\cdot\}$ is an ergodic Markov probability measure, meaning that every spiking block in $\mathcal{A}^N_R$ is attainable from every other block of the Markov chain within $R$ time steps, as discussed in Section 3. Thanks to the ergodicity assumption, the empirical averages are guaranteed to become statistically more accurate as the sampling size grows [51], i.e.,

$$ \lim_{t \to \infty} A_t(f) = E_p\{f\}, \quad p\text{-almost surely}. $$

However, the above result does not clarify the rate at which the accuracy of the estimate improves. To answer this question, one can use the central limit theorem (CLT) for ergodic Markov chains (see [52]). This theorem states that there exists a constant $\sigma > 0$ such that, for large values of $t$, the random variable $\frac{\sqrt{t}}{\sigma}\left( A_t(f) - E_p\{f\} \right)$ is distributed as a standard Gaussian random variable (technically, the CLT states that

$$ \frac{\sqrt{t}}{\sigma}\left( A_t(f) - E_p\{f\} \right) \xrightarrow{\;d\;} \mathcal{N}(0,1), $$

where the convergence is in distribution), with $\sigma$ being the square root of the summed auto-covariance function of $f$ [52]. This, in turn, implies that "typical" fluctuations of $A_t(f)$ around its mean value $E_p\{f\}$ are of the order of $\sigma/\sqrt{t}$.
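The $1/\sqrt{t}$ scaling can be checked numerically. The following sketch assumes a toy two-state chain as a stand-in for a MEMC (the matrix P and feature f below are illustrative, not from the paper): it estimates the standard deviation of $A_t(f)$ at two trajectory lengths and compares their ratio with $\sqrt{4} = 2$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative two-state ergodic chain and the binary feature f(state) = state.
P = np.array([[0.9, 0.1], [0.2, 0.8]])
f = np.array([0.0, 1.0])

def empirical_average(t):
    # Sample a trajectory of length t and return the empirical average A_t(f).
    x, total = 0, 0.0
    for _ in range(t):
        x = 1 if rng.random() < P[x, 1] else 0
        total += f[x]
    return total / t

# CLT scaling: the std of A_t(f) shrinks like sigma / sqrt(t),
# so quadrupling t should roughly halve the spread.
std_t = np.std([empirical_average(400) for _ in range(300)])
std_4t = np.std([empirical_average(1600) for _ in range(300)])
print(std_t / std_4t)  # close to 2
```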

Large Deviations
Although the central limit theorem for ergodic Markov chains accurately describes typical events (fluctuations around the mean value), it does not say anything about the likelihood of larger fluctuations. Even though it is clear that the probability of such large fluctuations goes to zero as the sample size increases, it is valuable to characterize the corresponding rate of decrease. In particular, one says that an empirical average $A_t(f)$ satisfies a large deviation principle (LDP) with rate function $I_f$, defined as

$$ I_f(s) = -\lim_{t \to \infty} \frac{1}{t} \ln p\{A_t(f) > s\}, \qquad (18) $$

if the above limit exists. Intuitively, for large $t$, this condition implies that $p\{A_t(f) > s\} \approx e^{-t I_f(s)}$. In particular, if $s > E_p\{f\}$, the law of large numbers (LLN) guarantees that $p\{A_t(f) > s\}$ tends to zero as $t$ grows; the rate function quantifies the speed at which this happens.
Calculating $I_f$ directly, i.e., using the definition in Equation (18), can be a formidable task. However, the Gärtner-Ellis theorem provides a convenient shortcut [27]. To this end, let us introduce the scaled cumulant generating function (SCGF) associated with the random variable $f$ (the name comes from the fact that the $n$-th cumulant of $A_t(f)$, scaled by $t$, can be obtained by successive differentiation of $\lambda_f(k)$ with respect to $k$, evaluating the result at $k = 0$):

$$ \lambda_f(k) = \lim_{t \to \infty} \frac{1}{t} \ln E_p\left\{ e^{t k A_t(f)} \right\}, \qquad (19) $$

when the limit exists (further details about cumulant generating functions are given in Appendix B). Note that, while $A_t(f)$ is an empirical average taken over a sample, the expectation in Equation (19) is taken over the probability distribution given by the corresponding model $p\{\cdot\}$. If $\lambda_f$ is differentiable, then the Gärtner-Ellis theorem ensures that the average $A_t(f)$ satisfies a LDP with rate function given by the Legendre transform of $\lambda_f$, that is,

$$ I_f(s) = \sup_{k \in \mathbb{R}} \left\{ k s - \lambda_f(k) \right\}. \qquad (20) $$

In summary, one can study the large deviations of the empirical averages $A_t(f)$ by first computing their SCGF from the selected model and then finding its Legendre transform.
One of the most useful applications of the LDP is to estimate the likelihood that $A_t(f)$ adopts a value far from its expected value. To illustrate this, let us assume that $I_f(s)$ is a positive, differentiable, convex function (a classical result in large deviations theory states that $I_f(s)$ is convex if $\lambda_f(k)$ is differentiable [28]; for a discussion about the differentiability of $\lambda_f(k)$, see [53]). Then, by the properties of convex functions, $I_f(s)$ has a unique global minimum. Denoting this minimum by $s^*$, it follows from the differentiability of $I_f(s)$ that $I_f(s^*) = 0$ and $I'_f(s^*) = 0$, and, using properties of the Legendre transform, $s^* = \lambda'_f(0) = \lim_{t\to\infty} E_p\{A_t(f)\}$. This is the LLN, i.e., $A_t(f)$ concentrates around $s^*$. Consider a value $s \neq s^*$ and assume that $I_f(s)$ admits a Taylor series around $s^*$,

$$ I_f(s) = I_f(s^*) + I'_f(s^*)(s - s^*) + \frac{I''_f(s^*)}{2}(s - s^*)^2 + \cdots $$

Since $s^*$ corresponds to a zero and a minimum of $I_f(s)$, the first two terms in this series vanish, and, since $I_f(s)$ is convex, $I''_f(s^*) > 0$. For large values of $t$, we obtain from Equation (18)

$$ p\{A_t(f) > s\} \approx \exp\left( -t\, \frac{I''_f(s^*)}{2}\, (s - s^*)^2 \right), $$

so the small deviations of $A_t(f)$ around $s^*$ are Gaussian-distributed and, as for i.i.d. sums, $1/I''_f(s^*) = \lambda''_f(0) = \sigma^2$. In this sense, large deviations theory can be seen as an extension of the CLT, because it gives information not only about the small (Gaussian) deviations around $s^*$ but also about the large (non-Gaussian) deviations of $A_t(f)$.
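As a concrete check of these relations, consider i.i.d. Bernoulli spikes with firing probability q (a range-one feature; q = 0.3 is an arbitrary illustrative value). The sketch below evaluates the Legendre transform of the SCGF on a grid of k values and compares it with the known closed-form Bernoulli rate function (the relative entropy between Bernoulli(s) and Bernoulli(q)).

```python
import numpy as np

q = 0.3  # firing probability of an i.i.d. (range-one) binary feature

def scgf(k):
    # SCGF of a Bernoulli(q) variable: lambda(k) = ln E[e^{k f}]
    return np.log(1 - q + q * np.exp(k))

def rate_numeric(s, ks=np.linspace(-30, 30, 200001)):
    # Legendre transform I(s) = sup_k { k s - lambda(k) }, taken on a grid of k
    return np.max(ks * s - scgf(ks))

def rate_exact(s):
    # Known closed form: relative entropy of Bernoulli(s) w.r.t. Bernoulli(q)
    return s * np.log(s / q) + (1 - s) * np.log((1 - s) / (1 - q))

print(rate_numeric(q))                    # ≈ 0 at the LLN value s* = E{f} = q
print(rate_numeric(0.5), rate_exact(0.5))
print(1 / (q * (1 - q)))                  # curvature 1/sigma^2 predicted by the CLT link
```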

Large Deviations for Features of MEMC
In this section, we focus on the statistical properties of features sampled from the inferred MEMC. For example, one may be interested in measuring the probability of obtaining "rare" average values of features such as firing rates, pairwise correlations, triplets, or spatiotemporal events. This characterization is relevant, as these features are likely to play an important role in neuronal information processing, and rare values may hinder the whole enterprise of conveying information. We show how to obtain the large deviations rate functions of arbitrary features through the Gärtner-Ellis theorem via the SCGF. In particular, we show that the SCGF can be directly obtained from the inferred Markov transition matrix P.
Consider a MEMC taking values on the state space $\mathcal{A}^N_R$ with transition matrix $P$. Let $f$ be a feature of range $R$, which depends only on the transition between blocks. For $k \in \mathbb{R}$, we introduce $\tilde{P}^{(f)}(k)$, the transition matrix $P$ tilted by $f$ and parameterized by $k$, whose elements are given by:

$$ \tilde{P}^{(f)}_{ij}(k) = P_{ij}\, e^{k f(i,j)}. \qquad (21) $$

For transition matrices $P$ inferred from the MEP, the tilted transition matrix can be built directly from the spectral properties of the transfer matrix in Equation (10) as follows:

$$ \tilde{P}^{(f)}_{ij}(k) = \frac{e^{H_\beta(i,j) + k f(i,j)}\, V_j}{\rho\, V_i}. \qquad (22) $$

Recall that $V$ is the right eigenvector of the transfer matrix $L$ and $\rho$ its maximum eigenvalue. Here, we have also used the shortcut notation $H_\beta(i,j)$ to indicate that the energy function takes its contributions from the blocks $i$ and $j$.
Remarkably, the feature $f$ does not need to belong to the set of features chosen to fit the MEMC. We can now take advantage of the structure of the given process to obtain more explicit expressions for the SCGF $\lambda_f(k)$. For instance, for i.i.d. random variables $X$ (which is the case of range-one features), one obtains directly from the definition that $\lambda_f(k) = \ln E\{ e^{k f(X)} \}$. For a MEMC, using Equation (22), the maximum eigenvalue of the tilted matrix, denoted $\rho(\tilde{P}^{(f)}(k))$, is

$$ \rho\big(\tilde{P}^{(f)}(k)\big) = \frac{\rho(H_\beta + k f)}{\rho(H_\beta)}, \qquad (23) $$

where $\rho(H_\beta + k f)$ denotes the maximum eigenvalue of the transfer matrix associated with the energy function $H_\beta + k f$. Since $\tilde{P}^{(f)}(k)$ is a positive matrix, the Perron-Frobenius theorem ensures the uniqueness of $\rho$. For additive features of a positive, ergodic Markov chain, a straightforward calculation (see, for instance, [54]) leads to

$$ \lambda_f(k) = \ln \rho\big( \tilde{P}^{(f)}(k) \big). \qquad (24) $$

It can also be proven that $\lambda_f(k)$ is differentiable in this case [54], setting the scene for the Gärtner-Ellis theorem, which bypasses the direct calculation of $p\{A_t(f) > s\}$ in Equation (18): given $\lambda_f(k)$, its Legendre transform yields the rate function of $f$, as shown in Figure 3.
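The recipe of Equations (21)-(24) can be sketched in a few lines. The chain below is an illustrative stand-in for an inferred MEMC (the matrix P and the transition feature f(i, j) are assumptions, not from the paper): the SCGF is the log of the top eigenvalue of the tilted matrix, and the rate function follows from a grid Legendre transform, vanishing at the stationary mean as the LLN requires.

```python
import numpy as np

# Illustrative stand-in for an inferred MEMC and a feature on transitions.
P = np.array([[0.9, 0.1], [0.2, 0.8]])
f = np.array([[0.0, 1.0], [0.0, 1.0]])   # f(i, j) = 1 when the arrival state is 1

def scgf(k):
    # lambda_f(k) = ln of the largest eigenvalue of the tilted matrix P_ij e^{k f(i,j)}
    tilted = P * np.exp(k * f)
    return np.log(np.max(np.abs(np.linalg.eigvals(tilted))))

def rate(s, ks=np.linspace(-20, 20, 2001)):
    # Gaertner-Ellis: I_f(s) is the Legendre transform of lambda_f, here on a grid
    return np.max([k * s - scgf(k) for k in ks])

pi = np.array([2/3, 1/3])                # stationary distribution: pi P = pi
mean_f = pi @ (P * f).sum(axis=1)        # E_p{f} = sum_i pi_i sum_j P_ij f(i, j)
print(mean_f, rate(mean_f))              # the rate function vanishes at s* = E_p{f}
```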

Large Deviations for the Entropy Production
A stochastic process is said to be in equilibrium if the effect of time on it cannot be noticed. Non-equilibrium stochastic processes are natural candidates to model spike train statistics, as time plays a non-symmetrical role [25]. One of the consequences of including features of range $R > 1$ as constraints in the maximum entropy problem is that it opens the possibility of breaking the time-reversal symmetry present in time-independent models. This captures the irreversible character of the underlying biological process and thus allows fitting statistical models that are more realistic from the biological point of view [55]. For a homogeneous, discrete-time, ergodic Markov chain characterized by the Markov measure $p(\pi, P)$ taking values in $\mathcal{A}^N_R$, being in equilibrium is equivalent to satisfying the detailed balance conditions, given by the following set of equalities:

$$ \pi_i P_{ij} = \pi_j P_{ji}, \quad \text{for all } i, j \in \mathcal{A}^N_R. $$

Conversely, when these conditions are not satisfied, the statistical model of the spiking activity is said to be a non-equilibrium system. Since non-equilibrium behavior is expected to occur generically in neuronal network models, one would like to quantify how far from equilibrium the inferred MEMC is. For this purpose, one can define the information entropy production (IEP) for $p$:

$$ \mathrm{IEP} = \lim_{t \to \infty} \frac{1}{t}\, E_p\left\{ \ln \frac{p\{x_{0,t-1}\}}{p\{x_{t-1,0}\}} \right\}, \qquad (25) $$

when the limit exists, where $x_{t-1,0}$ denotes the time-reversed block. For the maximum entropy Markov measure $p(\pi, P)$, the IEP is explicitly given by

$$ \mathrm{IEP}(\pi, P) = \frac{1}{2} \sum_{i,j} \left( \pi_i P_{ij} - \pi_j P_{ji} \right) \ln \frac{\pi_i P_{ij}}{\pi_j P_{ji}} $$

(see [56] for the calculation). We remark that it is possible to obtain the information entropy production rate also in the non-stationary case. Clearly, for features of range one, $\mathrm{IEP} = 0$ always, meaning that the process is time-reversible and, therefore, that every path and its corresponding time-reversed path have equal probability. For features of range $R > 1$, $\mathrm{IEP} > 0$ generically (we refer the interested reader to [25] for details and examples).
However, since in practice one only has access to a limited amount of data, a natural question is to ask for the entropy production of the system up to a finite time. It turns out that this characterization can be obtained through a LDP. With this in mind, one may define the following feature:

$$ W_t(x_{0,t-1}) = \frac{1}{t} \ln \frac{p\{x_{0,t-1}\}}{p\{x_{t-1,0}\}}. $$
Since we have assumed that samples are produced by a stationary ergodic Markov chain characterized by $p(\pi, P)$, the ergodic theorem ensures that, for $p$-almost every sample, the quantity $W_t$ converges as $t$ goes to infinity, and its limit is by definition the IEP: $\lim_{t\to\infty} W_t(x_{0,t-1}) = \mathrm{IEP}(\pi, P)$.
Once we have the convergence of $W_t$, we may ask for its large deviations properties. Following the same idea as above, and following [57], we introduce the matrix $F(k)$, with elements

$$ F_{ij}(k) = P_{ij} \left( \frac{P_{ij}}{P_{ji}} \right)^{k}. $$

This matrix helps us build the SCGF associated with $W_t$ through the logarithm of its maximum eigenvalue, $\rho_F(k)$. Using the Gärtner-Ellis theorem, one then obtains the rate function $I_W(s)$ of the IEP.
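A small numerical sketch ties these formulas together. The three-state chain below, biased along a cycle, is an illustrative stand-in for an irreversible MEMC (it is doubly stochastic, so its stationary distribution is uniform — an assumption of this toy, not of the paper): the closed-form IEP is recovered as the derivative of the SCGF at k = 0, and the Gallavotti-Cohen-type symmetry λ_W(k) = λ_W(−1 − k) holds by construction of F(k).

```python
import numpy as np

# Toy 3-state chain biased along the cycle 0 -> 1 -> 2 -> 0 (doubly stochastic).
P = np.array([[0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8],
              [0.8, 0.1, 0.1]])
pi = np.array([1/3, 1/3, 1/3])           # stationary distribution: pi P = pi

# IEP from the closed-form expression for ergodic Markov chains
iep = 0.5 * sum((pi[i] * P[i, j] - pi[j] * P[j, i])
                * np.log((pi[i] * P[i, j]) / (pi[j] * P[j, i]))
                for i in range(3) for j in range(3))

def scgf_W(k):
    # Tilted matrix for the entropy-production feature W (following [57]):
    # F(k)_ij = P_ij (P_ij / P_ji)^k; the SCGF is the log of its top eigenvalue.
    F = P * (P / P.T) ** k
    return np.log(np.max(np.abs(np.linalg.eigvals(F))))

h = 1e-5
deriv = (scgf_W(h) - scgf_W(-h)) / (2 * h)
print(iep, deriv)                        # the derivative at 0 recovers the IEP
print(scgf_W(0.3) - scgf_W(-1.3))        # symmetry lambda(k) = lambda(-1-k): ≈ 0
```

The symmetry check works because $F(-1-k)$ is the transpose of $F(k)$ and therefore has the same spectrum.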

Large Deviations and MEMC Distinguishability
It is clear that there exists a relationship between the accuracy of the estimation and the sample size: the larger the sample, the more information is available and the more the uncertainty diminishes. In the context of maximum entropy models, this idea has been well conceptualized using tools from information geometry [30,58]. The main idea of this approach is that the maximum entropy models form a manifold of probability measures whose coordinates are the parameters β. Consider a spike train dataset $x_{0,T-1}$ consisting of $T$ spike patterns obtained from a spiking neuronal network. Given a set of features $\{f_k\}_{k=1}^K$ and their empirical averages, one may infer the parameters $\beta = (\beta_1, \ldots, \beta_K)$ characterizing the MEMC $p(\pi, P)$. We may use the inferred MEMC to generate a sample $x'_{0,T-1}$ of the same size as the original dataset. Considering the same set of features, one may apply the MEP again to infer a new set of parameters $\beta'$ from $x'_{0,T-1}$, which is, in general, expected to differ from β. These maximum entropy models belong to a certain volume in the manifold, which decreases as the sample size increases [30]. On the other hand, increasing the sample size of $x'_{0,T-1}$, one expects the Markov chain $p'(\pi', P')$ specified by $\beta'$ to get "closer" to the one characterized by β. The idea of relating a distance in the parameter space to a distance in the space of probability measures can be rigorously formulated using large deviations techniques. Let us start by defining the relative entropy between these two MEMCs (Gibbs measures in the sense of Bowen in Equation (15)), which provides a notion of "distance" (the relative entropy is not a metric, as it is not symmetric and does not satisfy the triangle inequality). To do so in the context of MEMCs, consider a Gibbs measure $p$ associated with the energy function $H_\beta$, and let $q$ be another Gibbs measure.
The Ruelle-Föllmer theorem gives an expression for the relative entropy density between two Gibbs measures in terms of the pressure, the entropy rate, and the expected value of the energy function with respect to $q$ (see [29]):

$$ d(q \mid p) = P[H_\beta] - S\{q\} - E_q H_\beta. \qquad (26) $$

Observe that if $d(q \mid p) = 0$, we recover the variational characterization of Gibbs measures in Equation (14).
Consider the potential $H_\beta = \sum_{k=1}^K \beta_k f_k$ associated with a MEMC $p(\pi, P)$. Given a set of empirical averages $A_t(f_k)$ generated by a sample of $p(\pi, P)$, we obtain new maximum entropy parameters $\beta'$. The probability that the maximum entropy parameters $\beta'$, associated with an ergodic Markov chain $p'(\pi', P')$, get close to β follows a large deviation principle [28]:

$$ p\left\{ \beta' \in \beta + \Delta_\delta \right\} \asymp e^{-T\, d(p' \mid p)}, $$

where $\Delta_\delta = [-\delta, \delta]^K$. Writing $\delta\beta = \beta' - \beta$ and choosing $\Delta_\delta$ close to 0, we can informally rewrite the above statement as $p\{\beta' = \beta + \delta\beta\} \approx e^{-T\, d(p' \mid p)}$. Thus, for large $T$, close-by parameters are associated with close-by probability measures [30]. Consider now two MEMCs $p(\pi, P)$ and $p'(\pi', P')$ specified by $H_\beta$ and $H_{\beta'}$, respectively, with the same family of features. We say that the MEMCs are $\epsilon$-indistinguishable if:

$$ d(p \mid p') \leq \epsilon. $$

As both MEMCs satisfy the variational principle (Equation (14)), the relative entropy between $p$ and $p'$ (Equation (26)) reads:

$$ d(p \mid p') = P[H_{\beta'}] - S\{p\} - E_p H_{\beta'}. \qquad (30) $$

Taking the Taylor expansion of $d(p \mid p')$ around $\beta' = \beta$, and noting that $d(p \mid p')$ is minimized (and vanishes) at $\beta' = \beta$, the zeroth- and first-order terms vanish. Taking the second derivative of $d(p \mid p')$ from Equation (30), one also has that

$$ \frac{\partial^2 d(p \mid p')}{\partial \beta'_k\, \partial \beta'_j} = \frac{\partial^2 P[H_{\beta'}]}{\partial \beta'_k\, \partial \beta'_j}. $$

The second partial derivatives of the topological pressure with respect to the parameters $\beta_k$ and $\beta_j$ can be conveniently arranged in a matrix $L$ with components $L_{kj}$. Given two MEMCs specified by $H_\beta$ and $H_{\beta'}$, in the limit of large $t$ they are $\epsilon$-indistinguishable if:

$$ \frac{1}{2}\, \delta\beta^{\mathsf{T}} L\, \delta\beta \leq \epsilon, \qquad (32) $$

where $\mathsf{T}$ denotes transpose. The matrix $L$ can be obtained from data without the need to fit the parameters. Equation (32) characterizes a region of indistinguishable models in the space of MEMCs, whose volume can be calculated in the large-$t$ limit using spectral properties of the matrix $L$ [30]. This result generalizes a previous result for maximum entropy distributions with range-one energy functions in [31].
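A minimal numerical sketch of the susceptibility matrix L, assuming a toy two-state transfer matrix with two hypothetical transition features f1 and f2 (neither is from the paper): the pressure is the log of the top eigenvalue, L is estimated by central finite differences, and the quadratic form of Equation (32) is evaluated for a small parameter displacement δβ.

```python
import numpy as np

# Hypothetical features on transitions of a 2-state toy model.
f1 = np.array([[0., 0.], [0., 1.]])
f2 = np.array([[0., 1.], [1., 0.]])

def pressure(b):
    # Topological pressure: log of the top eigenvalue of the transfer matrix.
    L = np.exp(b[0] * f1 + b[1] * f2)
    return np.log(np.max(np.abs(np.linalg.eigvals(L))))

def susceptibility(b, h=1e-4):
    # L_kj = d^2 P / d beta_k d beta_j via central finite differences.
    K = len(b)
    M = np.zeros((K, K))
    for k in range(K):
        for j in range(K):
            e_k, e_j = np.eye(K)[k] * h, np.eye(K)[j] * h
            M[k, j] = (pressure(b + e_k + e_j) - pressure(b + e_k - e_j)
                       - pressure(b - e_k + e_j) + pressure(b - e_k - e_j)) / (4 * h * h)
    return M

beta = np.array([0.5, -0.2])
L_mat = susceptibility(beta)
dbeta = np.array([0.01, -0.02])
# Indistinguishability criterion: two models are close when this quadratic form is small.
q_form = 0.5 * dbeta @ L_mat @ dbeta
print(L_mat)
print(q_form)
```

Since L is a covariance matrix of the features, it is symmetric and positive semi-definite, so the quadratic form is nonnegative.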

Illustrative Examples
In this section, we illustrate the presented methods in some simple scenarios.
In these examples, we follow a set of steps:
1. Choose the features and build the energy function (Equation (7)).
2. Build the transfer matrix (Equation (10)).
3. Compute the free energy and find the maximum entropy parameters using Equation (17).
4. Build the Markov transition matrix using Equation (12).
5. Choose the observable to examine and build the tilted transition matrix using Equation (22).
6. Compute the Legendre transform of the log maximum eigenvalue of the tilted transition matrix to obtain the rate function (Equation (24)).
For the sake of clarity, in this section, we focus on small neuronal networks. It is clear, however, that the extension of these techniques to larger neural populations is straightforward.

First Example: Maximum Entropy Model of a Range Two Feature
Consider spiking data from two interacting neurons. We measure only the average value of a range-two feature from the spiking data to fit a MEMC. The feature, denoted $\tilde{f}_1$, is given by $\tilde{f}_1(x_{0,1}) = x^2_0 \cdot x^1_1$, which detects when a spike of the second neuron is followed by a spike of the first one. The system can be described with the help of the energy function $H(x_{0,1}) = \beta_1 \tilde{f}_1(x_{0,1})$.
For a given dataset of $T$ spike blocks of range two, the empirical average reads

$$ A_T(\tilde{f}_1) = c, \qquad (33) $$

meaning that this event appears in a fraction $c$ of the data. The transfer matrix $L_H$ (cf. Equation (10)) is primitive by construction and satisfies the hypotheses of the Perron-Frobenius theorem. In fact, its unique maximum eigenvalue is $\rho(\beta_1) = e^{\beta_1} + 3$. Given the restriction in Equation (33), using Equation (17) we obtain the following relationship between the parameter $\beta_1$ and the value $c$ of the restriction:

$$ \frac{e^{\beta_1}}{e^{\beta_1} + 3} = c. $$

This equation can be solved numerically. Using the obtained value of $\beta_1$ in Equation (12), one can find the corresponding Markov transition matrix. Note that, among all the Markov chains matching the restriction exactly, the selected one maximizes the KSE. Moreover, it is direct to check that the variational principle in Equation (14) is satisfied. Examples of values of $\beta_1$ for different values of $c$, together with the IEP (Equation (25)) for each value of $\beta_1$, are given in Table 1. Having the MEMC, we are now interested in analyzing the statistical fluctuations of the feature $\tilde{f}_1$.
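The steps above can be reproduced numerically. The sketch below builds the 4 × 4 transfer matrix for $\tilde{f}_1$, solves the constraint equation in closed form (the illustrative choice c = 0.4 gives $\beta_1 = \ln 2$, an assumption of this sketch rather than a value from Table 1), verifies $\rho(\beta_1) = e^{\beta_1} + 3$, and assembles the transition matrix via Equation (12).

```python
import numpy as np

# States are the four spike patterns (x^1, x^2) of the two neurons.
states = [(0, 0), (1, 0), (0, 1), (1, 1)]

def transfer(beta):
    # L_ij = exp(beta * f1(i, j)), with f1(x_{0,1}) = x^2_0 * x^1_1:
    # weight e^beta when neuron 2 spikes in block i and neuron 1 spikes in block j.
    return np.array([[np.exp(beta * i[1] * j[0]) for j in states] for i in states])

c = 0.4                                  # illustrative empirical average of f1
beta1 = np.log(3 * c / (1 - c))          # solves e^b / (e^b + 3) = c; here b = ln 2
L = transfer(beta1)
vals, vecs = np.linalg.eig(L)
rho = np.max(np.abs(vals))
print(rho, np.exp(beta1) + 3)            # unique maximum eigenvalue e^beta + 3 = 5

# Markov transition matrix via Equation (12): P_ij = L_ij V_j / (rho V_i)
V = np.real(vecs[:, np.argmax(np.abs(vals))])
V = V / V[0]                             # the Perron eigenvector has entries of one sign
P = L * V[None, :] / (rho * V[:, None])
print(P.sum(axis=1))                     # stochastic: each row sums to 1
```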
Using Equation (22), we obtain the tilted transition matrix $\tilde{P}_{ij}(k)$ for each of the values in Table 1. In Figure 4, for each value of $\beta_1$, we plot the SCGF $\lambda_{\tilde{f}_1}(k)$ (Equation (24)) and its Legendre transform, the rate function $I_{\tilde{f}_1}(s)$ associated with the feature. In Figure 5, for each value of the IEP in Table 1, we plot the rate function and illustrate, for this example, the symmetry relationship in Equation (A9); Figure 5 shows, on the left, the SCGF $\lambda_W(k)$ and, on the right, the rate function $I_W(s)$ of the IEP feature $W$.
The right eigenvector associated with this eigenvalue has all components equal to 1. We obtain the topological pressure $P[H] = \log \rho(\beta)$. To find the MEMC parameters, we solve the set of equations in Equation (34). From Equation (34), given constraints on the average values of the features, we can solve the maximum entropy problem (see Table 2, which lists the set of values used in Figure 6).
From Equation (12), one can find the Markov transition matrix. To compute the rate function of each feature in this model, we take the logarithm of the maximum eigenvalue of the corresponding tilted matrix, obtaining the SCGF $\lambda_f(k)$. In Figure 6, we illustrate the rate functions for each feature in the model; panel (B) shows the rate functions for the pairwise interactions computed from the parameters in Table 2.

Third Example: Past Independent and Markov Maximum Entropy Measures
To illustrate the difference between synchronous and non-synchronous maximum entropy models, we study a simple model composed of two interacting neurons, with energy function $H$ given in Equation (35). We build a Markov chain by fixing the parameters of $H$ at $\beta_1 = -3$, $\beta_2 = 3$, $\beta_3 = 0.5$ on the state space $\mathcal{A}^2_1$; the corresponding transition matrix follows from Equation (12). We focus on the synchronous feature $f = x^1_0 \cdot x^2_0$, whose average value with respect to the Markov measure $p$ fixed by the parameters $\beta_1, \beta_2, \beta_3$ is $E_p\{x^1_0 \cdot x^2_0\} = 0.292611$. Using this particular Markov chain, we generate a sample of size $T = 20{,}000$. Then, we treat these data as a spike train of two neurons about which we have no other information. Starting from these data, we find the maximum entropy distribution that considers only the empirical average of the synchronous feature, $A_T(x^1_0 \cdot x^2_0) = 0.2926$, as a constraint. Therefore, we build a second model that uses the energy function

$$ \tilde{H}(x_0) = \tilde{\beta}\, x^1_0 \cdot x^2_0. \qquad (36) $$

Using the constraint, we obtain from Equation (8) $\tilde{\beta} = 0.215874$, fixing the maximum entropy distribution $\tilde{p}$. Note that, by construction, $E_{\tilde{p}}\{x^1_0 \cdot x^2_0\} = A_T(x^1_0 \cdot x^2_0) = 0.2926$. For both energy functions (Equations (35) and (36)), with the parameters mentioned above, we compute the rate functions of the synchronous feature. Additionally, from the sample of the Markov chain, we compute the empirical averages of the synchronous feature using sliding windows of 50 samples. As expected, these empirical averages fluctuate around the overall average, as shown in Figure 7. To test the relevance of including memory in the model (and assess the performance of memoryless features), we compare the statistics of the fluctuations seen in the empirical averages with the predictions given by the rate functions of the two models.
Figure 8 shows the histogram of empirical fluctuations, together with the theoretical estimates of the fluctuations given by $K \exp\{-W I_f(s)\}$, where $W = 50$ is the window size, $s$ is the fluctuation size, $I_f(s)$ is the rate function, and $K$ is a constant adapted for visualization purposes. The results show that the rate function of the model with memory accurately fits the relative frequencies of the empirical large fluctuations. In contrast, the model without memory overestimates the fluctuation frequencies, exhibiting a much larger variance than the data, and underestimates them near the expected value.

Conclusions
In the past few years, new experimental techniques combined with clever ideas from statistical mechanics have made it possible to infer maximum entropy models of spike trains directly from experimental recordings. However, a very important issue, namely quantifying the accuracy of the estimation obtained from a finite empirical sample, is usually ignored in this field. This is probably because the maximum entropy approach has a dual nature: on one side, it is a convex optimization problem, which provides a unique solution independent of the sampling size; on the other, it is a Bayesian inference procedure, from which this question arises more naturally. As discussed in Section 1, this characterization is relevant in computational neuroscience because, in practice, experimental recordings are performed during a finite amount of time, which causes fluctuations in the estimated quantities.
A fundamental goal of spike train analysis over networks of sensory neurons is building accurate statistical models that predict the response of the network to a stimulus of interest. In particular, the aim of statistical inference for spiking neurons using the MEP is that the fitted parameters shed light on aspects of the neuronal code; it is therefore extremely important to quantify the accuracy of the statistical procedure. Additionally, one may be interested in measuring properties of the inferred statistical model characterizing the spiking neuronal network, for example, the convergence rate of a sample, or the probability of rare values of features such as firing rates, pairwise correlations, triplets, or spatiotemporal events, mainly because these features are likely to play an important role in neuronal information processing. It is possible that rare and unlikely events are generated by internal states of the neuronal tissue rather than driven by the external stimulus. Events that are unlikely to occur deserve a better understanding, as they may carry important information about the internal structure of the network and may play a role in organizing a coherent dynamic to convey sensory information to the cerebral cortex.
The present contribution addressed this issue using tools from large deviations theory in the context of the MEMC. In particular, we showed that the transfer matrix technique used to build the MEMC is well adapted to compute large deviation rate functions using the Gärtner-Ellis theorem.
We also provide tools to investigate how sharply determined the parameters of a MEMC are with respect to the amount of empirical data, using the concept of distinguishability. Additionally, we present a non-trivial relation between the distance in parameter space and the distance in the manifold of maximum entropy probability measures using a LDP.
We have illustrated our method using simple examples. However, these examples might give a false impression that large deviations rate functions can always be calculated explicitly. In fact, exact and explicit expressions can be found only in small simple cases; fortunately, there exist numerical methods to evaluate rate functions [53].
Here, we have focused our attention on large deviations properties on maximum entropy models arising from spike train statistics, however, these results can be used in other fields of applications of maximum entropy models.
Author Contributions: All authors conceived the algorithm, and wrote and revised the manuscript. All authors have read and approved the final manuscript. We thank J.-P.E., A.P. and R.H. for helpful discussions.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

MEP: Maximum Entropy Principle
MEMC: Maximum Entropy Markov Chain
CLT: Central Limit Theorem
LLN: Law of Large Numbers
LDP: Large Deviation Principle
SCGF: Scaled Cumulant Generating Function
IEP: Information Entropy Production
KSE: Kolmogorov-Sinai Entropy

Appendix A. Discrete-Time Markov Chains and Spike Train Statistics
Consider the random process $\{X_n : n \geq 0\}$ taking values on $\mathcal{A}^N_R$. One can assume that the spiking activity of the neuronal network is modeled by some discrete-time Markov process whose transition probabilities are obtained by means of the maximum entropy method described in Section 3. In this setting, $\mathcal{A}^N_R$ is the state space of the Markov chain and, thus, if $X_n = x_{n,n+R-1}$, we say that the process is in state $x_{n,n+R-1}$ at time $n$. The transition probabilities satisfy

$$ P\left( X_n = x^{(n)} \mid X_{n-1} = x^{(n-1)}, \ldots, X_0 = x^{(0)} \right) = P\left( X_n = x^{(n)} \mid X_{n-1} = x^{(n-1)} \right), \qquad (A1) $$

where we use the shorthand notation $x^{(n)} := x_{n,n+R-1}$. We emphasize that, in this paper, the states are spike blocks of finite length $R$. Throughout this paper, we have only considered homogeneous Markov chains, that is, Equation (A1) is independent of $n$.
Since transitions are considered between blocks of the form $x_{n-R,n-1} \to x_{n-R+1,n}$, the sub-block $x_{n-R+1,n-1}$ must be common for the transition to be possible. Consider two spike blocks $i, j \in \mathcal{A}^N_R$ of range $R \geq 2$. We say that the transition from state $i$ to state $j$ is allowed if $i$ and $j$ share a common sub-block, i.e., $x_{1,R-1} = \tilde{x}_{0,R-2}$, where $\tilde{x}_{0,R-2}$ are the first $R-1$ spike patterns of $j$. We then define the transition matrix $P : \mathcal{A}^N_R \times \mathcal{A}^N_R \to \mathbb{R}$, whose entries are given by the transition probabilities:

$$ P_{ij} = P\left( X_n = j \mid X_{n-1} = i \right) \text{ if the transition } i \to j \text{ is allowed, and } P_{ij} = 0 \text{ otherwise.} \qquad (A2) $$

Note that $P$ has $2^{NR} \times 2^{NR}$ entries, but it is a sparse matrix, since each row has at most $2^N$ non-zero entries. The stochastic matrix $P$ defined from the transition probabilities in Equation (A2) satisfies

$$ P_{ij} \geq 0, \qquad \sum_{j} P_{ij} = 1, $$

for all states $i, j \in \mathcal{A}^N_R$. Moreover, by construction, for any pair of states there exists a path of length at most $R$ in the graph of transition probabilities going from one to the other, which means that the Markov chain is primitive.

Appendix B. Cumulant Generating Function
In general, to obtain the scaled cumulant generating function (as considered in Section 4.1.2), one deals with the moments of a real-valued random variable $f$, given by $m_r = E(f^r)$ for $r \in \mathbb{N}$. Provided it has a Taylor expansion about the origin, the moment generating function (or Fourier-Laplace transform) is

$$ M(k) = E(e^{k f}) = E\left( 1 + k f + \cdots + \frac{k^r f^r}{r!} + \cdots \right) = \sum_{r=0}^{\infty} \frac{m_r k^r}{r!}. $$

The cumulants $\kappa_r$ are the coefficients in the Taylor expansion of the cumulant generating function, defined as the logarithm of the moment generating function, that is,

$$ \log M(k) = \sum_{r} \frac{\kappa_r k^r}{r!}. $$

The relationship between the first moments and cumulants can be obtained by extracting coefficients from the expansion, as follows:

$$ \kappa_1 = m_1, \qquad \kappa_2 = m_2 - m_1^2, \qquad \kappa_3 = m_3 - 3 m_2 m_1 + 2 m_1^3, $$

and so on. In particular, $\kappa_1$ is the mean of $f$, $\kappa_2$ is the variance, and $\kappa_3$ and $\kappa_4$ are related to the skewness and the kurtosis, respectively.
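As a minimal check of these relations, the first two cumulants of a Bernoulli(q) variable can be recovered numerically from derivatives of log M(k); q = 0.3 is an arbitrary illustrative value.

```python
import numpy as np

# Cumulants as Taylor coefficients of log M(k), checked numerically
# for a Bernoulli(q) random variable (all moments m_r equal q).
q = 0.3

def logM(k):
    # Log moment generating function of Bernoulli(q)
    return np.log(1 - q + q * np.exp(k))

h = 1e-3
kappa1 = (logM(h) - logM(-h)) / (2 * h)             # first cumulant: the mean
kappa2 = (logM(h) - 2 * logM(0) + logM(-h)) / h**2  # second cumulant: the variance
print(kappa1, kappa2)  # close to q = 0.3 and q(1-q) = 0.21
```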
for $v, w \in \mathcal{A}^N_R$. For a MEMC fitted through range-one energy functions, $\{f(x_n);\ n \geq 0\}$ is an i.i.d. process, and the variance of $f$ is simply $C_f(0)$, where $C_f$ denotes the auto-covariance function of $f$. These terms are the linear response coefficients. For a MEMC associated with an energy function formed by $K$ features, they can be conveniently arranged in a $K \times K$ symmetric matrix $L$; this symmetry is known as the Onsager reciprocity relations [59].