An Introduction to the Non-Equilibrium Steady States of Maximum Entropy Spike Trains

Cofré, Rodrigo; Videla, Leonardo; Rosas, Fernando

doi:10.3390/e21090884

Open Access Tutorial

by Rodrigo Cofré ¹,*, Leonardo Videla ¹ and Fernando Rosas ²,³,⁴

¹ Centro de Investigación y Modelamiento de Fenómenos Aleatorios CIMFAV-Ingemat, Facultad de Ingeniería, Universidad de Valparaíso, Valparaíso 2340000, Chile

² Centre for Psychedelic Research, Department of Medicine, Imperial College London, London SW7 2DD, UK

³ Centre for Complexity Science and Department of Mathematics, Imperial College London, London SW7 2AZ, UK

⁴ Data Science Institute, Imperial College London, London SW7 2AZ, UK

* Author to whom correspondence should be addressed.

Entropy 2019, 21(9), 884; https://doi.org/10.3390/e21090884

Submission received: 12 July 2019 / Revised: 23 August 2019 / Accepted: 7 September 2019 / Published: 11 September 2019

(This article belongs to the Special Issue Entropy Production and Its Applications: From Cosmology to Biology)

Abstract

Although most biological processes are characterized by a strong temporal asymmetry, several popular mathematical models neglect this issue. Maximum entropy methods provide a principled way of addressing time irreversibility, which leverages powerful results and ideas from the literature of non-equilibrium statistical mechanics. This tutorial provides a comprehensive overview of these issues, with a focus on the case of spike train statistics. We provide a detailed account of the mathematical foundations and work out examples to illustrate the key concepts and results from non-equilibrium statistical mechanics.

Keywords:

non-equilibrium steady states; maximum entropy principle; spike train statistics; entropy production

1. Introduction

Since the brain is one of the most complex systems in the observable universe, it is not surprising that a large number of questions about its structure and function remain unanswered. With the aim of developing new ways of addressing such questions, there is a growing consensus among neuroscientists that interdisciplinary approaches are promising. As a prominent example, computational neuroscience has been greatly enriched over the last decades by tools, ideas and methods coming from statistical physics [1,2]. Moreover, these methods are being revisited with renewed interest due to the arrival of experimental techniques that generate huge volumes of data. In particular, neuroscientists have become progressively aware of the powerful computational techniques used by statistical physicists to analyze experimental data and large scale simulations.

When studying the firing patterns of collections of neurons, one of the most popular principles from statistical mechanics is the maximum entropy principle (MEP), which builds the least structured model that is consistent with average values measured from experimental data. These average values are usually restricted to firing rates and synchronous pairwise correlations, which gives rise to models composed of random variables that are independent and identically distributed (i.i.d.) in time, i.e., stochastic processes without temporal structure [3,4,5]. However, there is strong evidence that memory effects play a major role in spike train statistics, and in biological processes in general [6,7,8,9]. Following this evidence, in recent years the study of complex biological systems has started to consider time-dependent processes where the past has an influence on future behavior [10,11,12]. The corresponding asymmetry between past and future is called the “arrow of time”, which is the unique direction associated with the irreversible flow of time that is noticeable in most biological systems.

Interestingly, the statistical physics literature has a fertile toolkit for studying time asymmetric processes [13]. First, one introduces the distinction between steady states that imply thermal equilibrium, and steady states that still carry fluxes—being called non-equilibrium steady states (NESS). Additionally, the extent to which a steady-state is not in equilibrium (i.e., the strength of its associated currents) can be quantified by the entropy-production rate [14], which is associated with the degree of time-irreversibility in the corresponding process [14]. Several studies have pointed out that being out-of-equilibrium is an important characteristic of biological systems [15,16,17]. Therefore, statistical characterizations consistent with the out-of-equilibrium condition should reproduce some degree of time irreversibility. One popular method that is suitable for studying these issues is Markov chain modeling [11,18,19,20,21,22].

Despite the potential for interdisciplinary cross-pollination around these fascinating issues, many scientists find it hard to explore these topics because of major entry barriers, including differences in jargon, conventions, and notations across the various fields. To bridge this gap, this tutorial intends to provide an accessible introduction to the non-equilibrium properties of maximum entropy Markov chains, with an emphasis on spike train statistics. While not introducing novel material, the main added value of this tutorial is to present results from the field of non-equilibrium statistical mechanics in a pedagogical manner based on examples. These results have direct application to maximum entropy Markov chains, and may shed new light on the study of spike train statistics. This tutorial is suitable for researchers in the fields of physics or mathematics who are curious about the interesting questions and possibilities that computational neuroscience offers. The focus on this audience is motivated by the growing community of mathematical physicists interested in computational neuroscience.

The rest of this tutorial is structured as follows. First, Section 2 introduces basic concepts of neural spike trains and Markov processes. Then, Section 3 introduces the notion of observable, and explores their fundamental properties. Section 4 introduces the core ideas of MEP, proposing the formal question and exploring methods for solving it. Section 5 studies various properties of interest of MEP models, including fluctuation-dissipation relationships, and their entropy production. Finally, Section 6 summarizes our conclusions.

2. Preliminary Considerations

This section introduces definitions, notations, and conventions that are used throughout the tutorial in order to give the necessary toolkit of ideas and notions to the unfamiliar reader.

2.1. Binning and Spike Trains

Consider a network of N spiking neurons, where time has been binned (i.e., discretized) in such a way that each neuron can exhibit no more than one action potential within one time bin

Δ t_{b}

. Action potentials, or “spikes”, are “all-or-none” events, and hence, spike data can be encoded using sequences of zeros and ones. A spiking state is denoted by

x_{t}^{k} = 1

, and corresponds to the event in which the k-th neuron spikes during the t-th time bin, while

x_{t}^{k} = 0

implies that it remains silent.

A spike pattern is defined as the spike-state of all neurons at time bin t, and is denoted by

x_{t} : = {[x_{t}^{k}]}_{k = 1}^{N}

. A spike block is a consecutive sequence of spike patterns, denoted by

x_{t, r} : = {[x_{s}]}_{s = t}^{r}

(see Figure 1). While the length of the spike block

x_{t, r}

is

r - t + 1

, it is useful to consider spike blocks of infinite length starting from time

t = 0

, which are denoted by

x

. Finally, in this tutorial we consider that a spike train is an infinite sequence of spiking patterns. This assumption turns out to be useful because it allows us to put our analysis in the framework of stochastic processes, and because it also allows us to characterize asymptotic statistical properties.

The set of all possible spike patterns (or state space) in a network of N neurons is denoted by

S

, and the set of all spike blocks of length R in a network of N neurons is denoted by

S^{R}

.
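The binning step described above can be sketched in a few lines of Python. This is only an illustration: the function names `bin_spike_times` and `spike_block`, the use of integer millisecond time stamps, and the toy data are assumptions made here, not part of the original tutorial.

```python
from typing import List, Sequence, Tuple

Pattern = Tuple[int, ...]  # one spike pattern x_t: the 0/1 state of all N neurons


def bin_spike_times(spike_times_ms: Sequence[Sequence[int]],
                    duration_ms: int,
                    bin_ms: int) -> List[Pattern]:
    """Turn per-neuron spike times (in ms) into a list of binary spike patterns."""
    n_neurons = len(spike_times_ms)
    n_bins = duration_ms // bin_ms
    raster = [[0] * n_neurons for _ in range(n_bins)]
    for k, times in enumerate(spike_times_ms):
        for s in times:
            t = s // bin_ms  # index of the time bin containing the spike
            if 0 <= t < n_bins:
                raster[t][k] = 1  # several spikes in one bin still give a single 1
    return [tuple(row) for row in raster]


def spike_block(patterns: List[Pattern], t: int, r: int) -> List[Pattern]:
    """Return the spike block x_{t,r}, which has length r - t + 1."""
    return patterns[t:r + 1]


# Toy data (hypothetical): two neurons recorded for 100 ms, binned at 20 ms.
spikes = [[3, 45, 47], [25, 90]]
patterns = bin_spike_times(spikes, duration_ms=100, bin_ms=20)
block = spike_block(patterns, 1, 3)
```

Note that the two spikes of the first neuron at 45 ms and 47 ms fall in the same bin and are merged into a single 1; in practice the bin size is chosen so that this does not happen.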

Even at a single neuron level, for repetitions of the same stimulus, neurons respond randomly, but with a certain statistical structure. This is the main reason to look for statistical characterizations of spike trains. When trying to find a statistical representation considering a whole population of neurons responding simultaneously to a given stimulus, the problem is the following. Consider an experimental spike train from a network of N neurons where sequences of spike patterns are considered time-independent. The spike patterns can take

2^{N}

values (state space). For

N > 10

it is not possible to observe all possible states in real experimental data or in computer simulations (2 h of recordings binned at 20 ms produce fewer than

2^{19}

spike patterns). For

N = 100

the state space is

2^{100}

, so the frequentist approach is useless for estimating the invariant measure. Can we learn something about the statistics of spike patterns from data when we have access to only a very small fraction of the state space? The maximum entropy principle provides an answer to this question. This principle has been used in the context of spike train statistics mainly with firing rates and synchronous pairwise correlations as constraints, which gives rise to trivial stochastic processes composed of i.i.d. random variables [3,4,5]. However, as mentioned in the introduction, there is strong evidence that past events play a role in spike train statistics, and in biological processes in general [6,7,8,9,11]. The principle can be generalized by considering non-synchronous correlations, which makes it possible to build Markov chains from data. This approach opens the way to a richer modeling framework that can capture time irreversibility (highly expected in biological systems), and to a remarkable mathematical machinery based on non-equilibrium statistical mechanics, which can be used to characterize collective behavior and to explore the capabilities of the system. We focus our tutorial on non-equilibrium steady states in the context of maximum entropy spike trains. In the next section we present the elementary properties of Markov chains (our main object of analysis), which will be used in the following sections to extract relevant information about the underlying neuronal network generating the data.
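The sampling argument above is simple arithmetic, and it can be checked directly. The sketch below uses the figures quoted in the text (2 h of recording, 20 ms bins); the helper name `sampled_fraction_upper_bound` is introduced here only for illustration.

```python
recording_ms = 2 * 60 * 60 * 1000   # 2 h of recording, in milliseconds
bin_ms = 20                         # bin size quoted in the text
n_samples = recording_ms // bin_ms  # number of observed spike patterns


def sampled_fraction_upper_bound(n_neurons: int) -> float:
    """Upper bound on the fraction of the 2^N states that can be observed.

    It is only a bound: repeated patterns make the true fraction smaller.
    """
    return min(1.0, n_samples / 2 ** n_neurons)


fraction_10 = sampled_fraction_upper_bound(10)
fraction_100 = sampled_fraction_upper_bound(100)
```

With these numbers the recording contains 360,000 patterns, which is indeed below 2^19 = 524,288, and for N = 100 the bound on the visited fraction of the state space is of order 10^-25.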

2.2. Elementary Properties of Markov Chains

A stochastic process is a collection of random variables

X_{t} \in S

indexed by

t \in T

that often refers to time. The set

S

represents the phase-space of the process; in the case of stochastic processes representing spike trains, one usually takes

S = {0, 1}^{N}

. Moreover, considering the temporal binning discussed in Section 2.1, usually

T = N

(the set of natural numbers), which corresponds to the so-called discrete-parameter stochastic processes.

While spike trains can be characterized by stochastic processes dependent on an infinite past [23,24], Markov chains are particularly well-suited for modeling data sequences with finite temporal dependencies. In the next paragraphs we give the precise definition of a Markov process.

A stochastic process

(X_{t} : t \in N)

defined on a measure space

Ω

is said to be a

P

–Markov chain if it satisfies the Markov property (with respect to the probability measure

P

), that is, if for every

t \in N

and for each sequence of states

x_{0}, x_{1}, \dots, x_{t + 1} \in S

, the following relationship holds:

\begin{matrix} P (X_{t + 1} = x_{t + 1} | X_{0} = x_{0}, X_{1} = x_{1}, \dots, X_{t - 1} = x_{t - 1}, X_{t} = x_{t}) = P (X_{t + 1} = x_{t + 1} | X_{t} = x_{t}) . \end{matrix}

(1)

This property is usually paraphrased as: the conditional distribution of the future given the current state and all past events depends exclusively on the current state of the process. It is straightforward to show that the Markov property is equivalent to the following condition: for every increasing sequence of indices

(i_{1} < i_{2} < \dots < i_{n})

in

N

, and for arbitrary states

x_{i_{1}}, x_{i_{2}}, \dots, x_{i_{n}}

in

S

, we have:

\begin{matrix} P (X_{i_{n}} = x_{i_{n}} | X_{i_{n - 1}} = x_{i_{n - 1}}, \dots, X_{i_{1}} = x_{i_{1}}) = P (X_{i_{n}} = x_{i_{n}} | X_{i_{n - 1}} = x_{i_{n - 1}}) . \end{matrix}

To characterize the transition probabilities, define a

S

-indexed stochastic matrix to be a doubly indexed array of non-negative real numbers

P = (p (i, j) : i, j \in S)

such that

\sum_{j \in S} p (i, j) = 1

for every

i \in S

. It can be shown that a Markov chain is well-defined if the following is provided:

(i): An initial probability distribution, encoded by a vector $μ : = (μ_{i} : i \in S)$ .
(ii): A collection of $S$ -indexed stochastic matrices $\{P_{t} : = {(p_{t} (i, j))}_{i, j \in S} : t \in N\}$ .

Using these two elements, one can build probability measures

P^{n}

on

S^{n}

as follows,

\begin{matrix} P^{n} (i_{0}, i_{1}, \dots, i_{n - 1}) = μ (i_{0}) \prod_{j = 0}^{n - 2} P_{j} (i_{j}, i_{j + 1}) . \end{matrix}

Furthermore, the Kolmogorov extension theorem [25] guarantees the existence of a unique probability measure

P_{μ}

on

S^{N}

such that the coordinate process satisfies:

\begin{matrix} P_{μ} (X_{0} = i_{0}, X_{1} = i_{1}, \dots, X_{n} = i_{n}) = P^{n + 1} (i_{0}, i_{1}, \dots, i_{n}), \end{matrix}

and with respect to which

(X_{t} : t \in T)

is a Markov chain. In this case

P_{μ}

is said to be the probability law of the Markov chain

(X_{t} : t \in N)

. This notation also emphasizes that

P_{μ}

is the law with initial distribution

μ

.
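The construction above (an initial distribution plus a family of stochastic matrices) translates directly into code. The following is a minimal sketch, assuming states are labelled 0, 1, ..., |S| - 1 and matrices are stored as nested lists; the names `path_probability` and `sample_path` and the two-state example are hypothetical.

```python
import random
from typing import List, Sequence

Matrix = List[List[float]]


def path_probability(mu: Sequence[float],
                     transition_matrices: Sequence[Matrix],
                     path: Sequence[int]) -> float:
    """Finite-dimensional law: mu(i_0) * prod_j P_j(i_j, i_{j+1})."""
    prob = mu[path[0]]
    for j in range(len(path) - 1):
        prob *= transition_matrices[j][path[j]][path[j + 1]]
    return prob


def sample_path(mu: Sequence[float],
                transition_matrices: Sequence[Matrix],
                length: int,
                rng: random.Random) -> List[int]:
    """Draw X_0 from mu, then X_{j+1} from row X_j of the j-th matrix."""
    states = list(range(len(mu)))
    path = [rng.choices(states, weights=mu)[0]]
    for j in range(length - 1):
        row = transition_matrices[j][path[-1]]
        path.append(rng.choices(states, weights=row)[0])
    return path


# Hypothetical two-state example with the same matrix at every step.
mu = [0.5, 0.5]
P = [[0.9, 0.1], [0.4, 0.6]]
prob = path_probability(mu, [P, P, P], [0, 0, 1, 1])  # 0.5 * 0.9 * 0.1 * 0.6
path = sample_path(mu, [P] * 9, 10, random.Random(0))
```

Summing `path_probability` over all paths of a fixed length gives 1, which is the consistency property that the Kolmogorov extension theorem relies on.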

2.3. Homogeneity, Ergodicity and Stationarity

A Markov chain is said to be homogeneous if the transition matrices do not depend on the time parameter t, i.e., if there exists a

S

-indexed stochastic matrix P such that

P_{t} = P

for every

t \in T

. Note that if

(X_{t} : t \in T)

is a

P

–homogeneous Markov chain, then for every

t \in T

:

\begin{matrix} P (X_{t + 1} = j | X_{t} = i) = p (i, j) : = p_{i j} . \end{matrix}

(2)

In the rest of this paper we focus exclusively on homogeneous Markov chains, since this is the model assumed in the maximum entropy framework.

Consider now a homogeneous Markov chain

(X_{t} : t \in T)

with initial distribution

μ

and transition matrix P. Moreover, consider

p_{i j}^{(m)}

to be the

(i, j)

-th entry of the product matrix

P^{m} = P \cdot P \cdot \dots \cdot P

. These quantities correspond to the m–steps transition probabilities. Equation (2) can be generalized to

\begin{matrix} P (X_{t + m} = j | X_{t} = i) = p_{i j}^{(m)} . \end{matrix}

A stochastic matrix P is said to be ergodic if there exists

k \in N

such that all the k–step transition probabilities are positive—i.e., there is a non-zero probability to go between any two states in k steps. A homogeneous Markov chain is ergodic if it can be defined by an initial distribution

μ

and an ergodic matrix.

Finally, a probability distribution

π

on

S

is called a stationary distribution for the Markov chain specified by P if

\begin{matrix} π P = π . \end{matrix}

(3)

Equivalently,

π

is stationary for P if

π

is a left eigenvector of the transition matrix corresponding to the eigenvalue

λ = 1

, and is a probability distribution on

S

. While it is true that 1 is always an eigenvalue of P, it may be the case that no eigenvector associated to it can be normalized to a probability distribution. Further conditions for existence and uniqueness will be given in the next paragraph. Finally, if a

S

–indexed stochastic matrix P admits a stationary probability distribution

π

and

(X_{t} : t \in N)

is a Markov chain with initial distribution

π

and transition matrix P, then for every

t \in N

and

i \in S

:

\begin{matrix} P_{π} (X_{t} = i) = π_{i} . \end{matrix}

In this case

(X_{t} : t \in N)

is said to be a stationary Markov chain, or that the Markov chain is started from stationarity.

The notion of homogeneous ergodic Markov chains is relevant in the context of spike train statistics because of the Ergodic Theorem for finite-state Markov chains, which states that for every finite-state, homogeneous, ergodic Markov chain

(X_{t} : t \geq 0)

with transition matrix P the following hold:

(a): There exists a unique stationary distribution $π$ for P that satisfies that $π_{i} > 0$ for every $i \in S$ .
(b): For every $i, j \in S$ ,

$\begin{matrix} \lim_{m \to + \infty} p_{i j}^{(m)} = π_{j} . \end{matrix}$

Equivalently, for every distribution $ν$ , ${lim}_{t \to \infty} P_{ν} (X_{t} = j) = π_{j} .$ This property guarantees the uniqueness of the maximum entropy Markov chain.
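Both statements of the theorem can be checked numerically on a small example. The sketch below is an illustration under stated assumptions: it tests ergodicity by looking for a power of P with all entries positive (using Wielandt's bound (n - 1)^2 + 1 on the exponent for an n-state chain, a fact not stated in the text), and it approximates the stationary distribution by repeatedly applying P to the uniform distribution. The function names and the two-state matrix are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def mat_mul(A: Matrix, B: Matrix) -> Matrix:
    """Product of two square matrices stored as nested lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def is_ergodic(P: Matrix) -> bool:
    """True if some power P^k has only positive entries."""
    n = len(P)
    Q = [row[:] for row in P]          # Q holds P^k, starting at k = 1
    for _ in range((n - 1) ** 2 + 1):  # Wielandt's bound on the exponent k
        if all(q > 0 for row in Q for q in row):
            return True
        Q = mat_mul(Q, P)
    return False


def stationary_distribution(P: Matrix, n_iter: int = 200) -> List[float]:
    """Approximate pi by iterating nu -> nu P from the uniform distribution."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(n_iter):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi


P = [[0.9, 0.1], [0.4, 0.6]]
pi = stationary_distribution(P)  # analytically pi = (0.8, 0.2)

P_m = P
for _ in range(49):
    P_m = mat_mul(P_m, P)        # P^50: every row should be close to pi
```

The iteration converges here because the chain is ergodic; for a periodic chain such as the two-state flip `[[0, 1], [1, 0]]`, `is_ergodic` returns `False` and the powers of P do not converge.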

2.4. The Reversed Markov Chain

Given a discrete ergodic Markov chain, it is mathematically possible to define its associated time reversed Markov chain. Some Markov chains in the steady-state yield the same Markov chain (in distribution) if the time course is inverted and others do not. It has been argued multiple times that those Markov chains that are different from their time inverted version are better suited to represent biological stochastic processes [6,7,9,11,12].

Let

\vec{P}

be a stochastic matrix, and assume that it admits a stationary probability measure

π

. Assume too that

π_{i} > 0

for every

i \in S

(according to (a) in the Ergodic Theorem from the previous section, this is the case when

\vec{P}

is ergodic). Define the

S

–indexed matrix

\overset{\leftarrow}{P}

with entries:

\begin{matrix} {\overset{\leftarrow}{P}}_{i j} = \frac{π_{j}}{π_{i}} {\vec{P}}_{j i} . \end{matrix}

A direct calculation shows that

\overset{\leftarrow}{P}

is also a stochastic matrix. Moreover, if

π

is stationary for

\vec{P}

, then it is for

\overset{\leftarrow}{P}

as well.

Using the above facts, let

P_{π}^{\to}

and

P_{π}^{\leftarrow}

be the laws of two stationary Markov chains, denoted by

X_{t}

and

Y_{t}

, whose stationary distribution is

π

and transition probabilities are

\vec{P}

and

\overset{\leftarrow}{P}

, respectively. The following holds

\begin{matrix} P_{π}^{\leftarrow} (Y_{0} = i_{0}, Y_{1} = i_{1}, \dots, Y_{n} = i_{n}) & = π_{i_{0}} {\overset{\leftarrow}{P}}_{i_{0} i_{1}} {\overset{\leftarrow}{P}}_{i_{1} i_{2}} \dots {\overset{\leftarrow}{P}}_{i_{n - 1} i_{n}} \\ & = π_{i_{0}} \frac{π_{i_{1}}}{π_{i_{0}}} {\vec{P}}_{i_{1} i_{0}} \frac{π_{i_{2}}}{π_{i_{1}}} {\vec{P}}_{i_{2} i_{1}} \dots \frac{π_{i_{n}}}{π_{i_{n - 1}}} {\vec{P}}_{i_{n} i_{n - 1}} \\ & = π_{i_{n}} {\vec{P}}_{i_{n} i_{n - 1}} {\vec{P}}_{i_{n - 1} i_{n - 2}} \dots {\vec{P}}_{i_{1} i_{0}} \\ & = P_{π}^{\to} (X_{0} = i_{n}, X_{1} = i_{n - 1}, \dots, X_{n} = i_{0}) . \end{matrix}

By virtue of this result, it is natural to call the chain

(Y_{t} : t \geq 0)

the reversed chain associated to

(X_{t} : t \geq 0)

.
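As a concrete check of the construction above, the following sketch builds the reversed matrix for a three-state chain with a preferred direction of rotation. The example matrix is hypothetical; it is doubly stochastic, so the uniform distribution is stationary, which keeps the example self-contained.

```python
from typing import List

Matrix = List[List[float]]


def reversed_matrix(P: Matrix, pi: List[float]) -> Matrix:
    """Entries of the reversed chain: (pi_j / pi_i) * P_{ji}."""
    n = len(P)
    return [[pi[j] * P[j][i] / pi[i] for j in range(n)] for i in range(n)]


# Hypothetical chain that mostly rotates 0 -> 1 -> 2 -> 0.
P_forward = [[0.1, 0.8, 0.1],
             [0.1, 0.1, 0.8],
             [0.8, 0.1, 0.1]]
pi = [1 / 3, 1 / 3, 1 / 3]  # stationary because P_forward is doubly stochastic

P_backward = reversed_matrix(P_forward, pi)

# Path identity from the text for the path (i_0, i_1, i_2) = (0, 1, 2):
lhs = pi[0] * P_backward[0][1] * P_backward[1][2]  # reversed chain visits 0, 1, 2
rhs = pi[2] * P_forward[2][1] * P_forward[1][0]    # forward chain visits 2, 1, 0
```

In this example the reversed matrix is the transpose of the forward one, so the reversed chain mostly rotates in the opposite direction, 0 -> 2 -> 1 -> 0, and the two chains differ in law.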

2.5. Reversibility and Detailed Balance

A transition matrix P is reversible with respect to

π

if the associated Markov chain started from

π

has the same law as the reversed chain started from the same distribution. The reversibility of P with respect to

π

is equivalent to the condition of detailed balance, given by

\begin{matrix} π_{i} P_{i j} = π_{j} P_{j i} \forall i, j \in S . \end{matrix}

(4)

Note that any probability measure

π

that satisfies detailed balance with respect to P is necessarily stationary, since

\begin{matrix} \sum_{i \in S} π_{i} P_{i j} = \sum_{i \in S} π_{j} P_{j i} = π_{j} \sum_{i \in S} P_{j i} = π_{j} \quad \text{for every } j \in S . \end{matrix}

The converse is, however, not true in general: a stationary distribution may not satisfy Equation (4).

Intuitively, Equation (4) states that, in the stationary state, the fluxes between each pair of states balance each other. In contrast, detailed balance is broken when there is a cycle of three or more states in the state space supporting a net probability current, even in the steady state. Detailed balance is also interpreted as “time reversibility”, as one cannot distinguish the steady state dynamics of the system going forward in time from those going backward. Certainly, this property is not expected in stochastic processes generated by biological systems. Several disciplines use the term “equilibrium” to refer to long-term behaviour, i.e., what is not transient. In this tutorial we use the term equilibrium state exclusively to refer to probability vectors that satisfy the detailed balance conditions given in Equation (4). Steady states of Markov chains that satisfy the detailed balance condition are referred to as equilibrium steady states and, conversely, steady states that do not satisfy the detailed balance conditions are called Non-Equilibrium Steady States (NESS).

How can (finite state, homogeneous) reversible Markov chains be characterized? Following [26], consider any finite graph

(S, {(c_{i j})}_{i, j \in S})

, with vertex set

S

and with the edge between vertices i and j labelled by the non-negative conductance

c_{i j} = c_{j i}

. The graph can be visualized as a system of points labelled by

S

, and with a line segment between points whenever the corresponding conductance is positive. Define

c_{i} = \sum_{j \in S} c_{i j}

and the

S

–indexed stochastic matrix given by

\begin{matrix} p_{i j} = \frac{c_{i j}}{c_{i}} . \end{matrix}

Now define

C = \sum_{i \in S} c_{i}

. It is straightforward to prove that P is reversible with respect to the probability measure given by

\begin{matrix} π_{i} = \frac{c_{i}}{C}, \end{matrix}

and thus it is stationary for P. The unique Markov chain started from

π

and with transition matrix P is called the stationary random walk on the network

(S, {(c_{i j})}_{i, j \in S})

. Conversely, any reversible

S

–valued Markov chain can be identified with the random walk on the graph with vertex set

S

and edges given by

c_{i j} = c_{j i} = π_{i} p_{i j}

.
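The correspondence between reversible chains and random walks on weighted graphs can be illustrated with a small sketch. The symmetric conductance matrix and the three-state cycle below are hypothetical examples, and the function names are introduced here only for illustration.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def satisfies_detailed_balance(P: Matrix, pi: List[float],
                               tol: float = 1e-12) -> bool:
    """Check pi_i P_ij = pi_j P_ji for every pair of states (Equation (4))."""
    n = len(P)
    return all(abs(pi[i] * P[i][j] - pi[j] * P[j][i]) <= tol
               for i in range(n) for j in range(n))


def random_walk_from_conductances(c: Matrix) -> Tuple[Matrix, List[float]]:
    """Build p_ij = c_ij / c_i and pi_i = c_i / C from symmetric conductances."""
    n = len(c)
    c_i = [sum(row) for row in c]
    total = sum(c_i)
    P = [[c[i][j] / c_i[i] for j in range(n)] for i in range(n)]
    pi = [ci / total for ci in c_i]
    return P, pi


# Symmetric conductances on a triangle: an equilibrium steady state.
conductances = [[0.0, 1.0, 2.0],
                [1.0, 0.0, 3.0],
                [2.0, 3.0, 0.0]]
P_rev, pi_rev = random_walk_from_conductances(conductances)
reversible = satisfies_detailed_balance(P_rev, pi_rev)

# A biased cycle 0 -> 1 -> 2 -> 0 with uniform stationary distribution: a NESS.
P_cycle = [[0.1, 0.8, 0.1],
           [0.1, 0.1, 0.8],
           [0.8, 0.1, 0.1]]
pi_cycle = [1 / 3, 1 / 3, 1 / 3]
cycle_reversible = satisfies_detailed_balance(P_cycle, pi_cycle)
```

In the first case the flux between states i and j equals c_ij / C in both directions, so detailed balance holds by construction; in the second case the flux from 0 to 1 is eight times the flux from 1 to 0, which is the signature of a net probability current around the cycle.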

2.6. Law of Large Numbers for Ergodic Markov Chains

The Law of Large Numbers (LLN) that applies to independent and identically distributed random variables (i.i.d.) can be extended to the realm of ergodic Markov chains. In effect, for a given ergodic Markov chain

(X_{t} : t \geq 0)

with stationary distribution

π

and transition matrix P, define the random variables

N_{i}^{(T)}

equal to the number of occurrences of the state i up to time

T - 1

, i.e.,

\begin{matrix} N_{i}^{(T)} = \sum_{t = 0}^{T - 1} 1_{{X_{t} = i}}, \end{matrix}

where

1_{{\cdot}}

is an indicator function. Similarly, define the random variables

N_{i j}^{(T)}

as the number of occurrences of the consecutive pair of states

(i, j) \in S^{2}

—in that order—up to time

T - 1

, i.e.,

\begin{matrix} N_{i j}^{(T)} = \sum_{t = 1}^{T - 1} 1_{{X_{t - 1} = i, X_{t} = j}} . \end{matrix}

With this, the Strong Law of Large Numbers for Markov chains can be stated as follows: if

(X_{t} : t \geq 0)

is ergodic and

π

is its unique stationary distribution, then

\begin{matrix} P_{μ} (\lim_{T \to + \infty} \frac{N_{i}^{(T)}}{T} = π_{i}) = 1 \quad \text{and} \quad P_{μ} (\lim_{T \to + \infty} \frac{N_{i j}^{(T)}}{T} = π_{i} p_{i j}) = 1, \end{matrix}

holds for any initial distribution

μ

. The result, in turn, implies the Weak Law of Large Numbers for Markov chains, which states that, following the above notation, for every

ε > 0

and for every starting distribution

μ

:

\begin{matrix} \lim_{T \to + \infty} P_{μ} (|\frac{N_{i}^{(T)}}{T} - π_{i}| > ε) = 0 \quad \text{and} \quad \lim_{T \to + \infty} P_{μ} (|\frac{N_{i j}^{(T)}}{T} - π_{i} p_{i j}| > ε) = 0 . \end{matrix}

Let us denote by

C (S)

the space of real-valued functions on

S

. Clearly, any function of

C (S)

can be written as

f (x) = \sum_{i \in S} a_{i} 1_{i} (x)

for certain constants

a_{i}

,

i \in S

. Then, the above result generalizes as: for every

f \in C (S)

, ergodic chain

X_{t}

, and probability distribution

μ

, the following holds:

P_{μ} (lim_{T \to + \infty} \frac{1}{T} \sum_{t = 0}^{T - 1} f (X_{t}) = E_{π} (f (X_{0}))) = 1 .

This corresponds to a particular form of the Birkhoff Ergodic Theorem, which is briefly outlined in the next section. It is relevant for characterizing observables of spike trains, since average values of firing rates and correlations can be measured accurately from data. For an ergodic stationary Markov chain whose state space is small relative to the sample size, this theorem guarantees that the transition probabilities and the invariant measure can be recovered from a large sample. This is not the case in spike train statistics at the population level, as only a very small fraction of the state space is sampled in experimental spike trains. However, some features of the spike trains can be sampled very accurately from experimental data. In the next section, we present the basic elements needed to build a statistical model of the entire population from these features.
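The Law of Large Numbers above can be observed in a short simulation. The sketch below is only illustrative: the two-state matrix, the seed and the sample size are arbitrary choices, and the stationary distribution (0.8, 0.2) quoted in the comments was obtained by solving πP = π by hand for this particular matrix.

```python
import random
from typing import List, Tuple

Matrix = List[List[float]]


def simulate_chain(P: Matrix, x0: int, T: int, rng: random.Random) -> List[int]:
    """Sample X_0, ..., X_{T-1} from a homogeneous chain started at x0."""
    states = list(range(len(P)))
    path = [x0]
    for _ in range(T - 1):
        path.append(rng.choices(states, weights=P[path[-1]])[0])
    return path


def empirical_frequencies(path: List[int],
                          n_states: int) -> Tuple[List[float], Matrix]:
    """Return N_i / T and N_ij / T for a sample path of length T."""
    T = len(path)
    n_i = [0] * n_states
    n_ij = [[0] * n_states for _ in range(n_states)]
    for t, x in enumerate(path):
        n_i[x] += 1
        if t >= 1:
            n_ij[path[t - 1]][x] += 1  # consecutive pair (X_{t-1}, X_t)
    return [c / T for c in n_i], [[c / T for c in row] for row in n_ij]


P = [[0.9, 0.1], [0.4, 0.6]]  # stationary distribution pi = (0.8, 0.2)
path = simulate_chain(P, x0=1, T=100_000, rng=random.Random(1))
freq_i, freq_ij = empirical_frequencies(path, 2)
# freq_i[0] should be near pi_0 = 0.8, and freq_ij[0][1] near pi_0 * p_01 = 0.08.
```

The chain is deliberately started from state 1 rather than from stationarity, to illustrate that the limit does not depend on the initial distribution.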

3. Observables of Markov Chains and Their Properties

The notion of observable plays a central role in the study of maximum entropy spike trains. This section discusses their nature and fundamental properties.

3.1. Observables and Their Empirical Averages

Suppose a spiking neuronal network of N neurons is provided. Suppose too that measurements of spike patterns for T time bins have been performed. The observables are real-valued functions over the possible spike blocks, denoted here by

B : = S^{T}

. Let

C (B)

be the space of such observables, i.e., the linear space of real-valued functions

f : B \mapsto R

. Recall the space

C (S)

of observables of range 1, discussed at the end of the above section. This space can be naturally embedded into

C (B)

; thus, it can be considered as a linear subspace of the latter. More generally, the space of observables of range R for

R \leq T

, denoted

C (S^{R})

, is just the space of real-valued functions on

S^{R}

, that we identify with its image through the natural embedding into

C (B)

.

We are interested in the average of observables with respect to several probability measures. If

μ

is a probability measure on

B

(i.e.,

μ (ω) \geq 0

and

\sum_{ω \in B} μ (ω) = 1

) and f an observable of range

R \leq T

i.e.,

f \in C (S^{R})

, we define its expectation with respect to

μ

as

\begin{matrix} μ (f) = E_{μ} {f} : = \sum_{ω \in B} f (ω) μ (ω) . \end{matrix}

Since the space of blocks of length T is finite, the above sum is always finite, and thus our definition makes sense for every probability measure on

B

.

In the context of spike-trains, an important class of observable is made up of

{0, 1}

-valued functions. It can be proved that any finite-range binary observable can be written as a finite sum of finite products of functions of the form

1_{{X_{i}^{(j)} = 1}}

that represents the event that the j-th neuron fires during the i-th bin. The average value of this observable is known as the firing rate of neuron j. This quantity has been proposed as one of the major neural coding strategies used by the brain [27].

Consider a spike block

x_{0, T - 1}

, where T is the sample length. Although in most cases the probability measure

μ

that characterizes the spiking activity is not known, it is meaningful to use the experimental data to estimate the mean values of specific observables. The range of the validity of this procedure is usually based on prior assumptions about the nature of the source that originates the sample. For example, it can be assumed that the sample is a short piece of an infinite path that comes running from the far past, and so it can be assumed that this piece exhibits a behavior that is close to the stationary distribution. In this case, one can consider for any number

R \leq T

the quantity:

\begin{matrix} Q (y_{0}, y_{1}, \dots, y_{R - 1}) = \sum_{j = 0}^{T - R} 1_{{x_{j, j + R - 1} = (y_{0}, y_{1}, \dots, y_{R - 1})}}, \end{matrix}

that counts the number of appearances of the sequence

(y_{0}, \dots, y_{R - 1})

as a consecutive sub-sequence of

x_{0, T - 1}

. Now, for any set

A \subseteq S^{R}

, define:

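\begin{matrix} μ_{x_{0, T - 1}} (A) : = \frac{1}{T - R + 1} \sum_{y \in A} Q (y), \end{matrix}

which is the empirical distribution of blocks of length R in the sample. Accordingly, the empirical average of an observable f of range R is

μ_{x_{0, T - 1}} (f) = A_{T} (f) = \frac{1}{T - R + 1} \sum_{i = 0}^{T - R} f (x_{i, R - 1 + i}) .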

When the empirical distribution is not explicitly stated, it is customary to write

〈 f 〉

to denote the average of the observable f with respect to this probability measure.
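The empirical average of an observable of range R amounts to a sliding-window mean over the spike block. A minimal sketch follows, assuming a spike block is stored as a list of tuples (one tuple of 0/1 values per time bin); the function name `empirical_average`, the toy block and the two observables are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Pattern = Tuple[int, ...]


def empirical_average(f: Callable[[Sequence[Pattern]], float],
                      block: List[Pattern],
                      R: int) -> float:
    """Average f over the T - R + 1 windows of length R in the spike block."""
    T = len(block)
    if not 1 <= R <= T:
        raise ValueError("the range R must satisfy 1 <= R <= T")
    n_windows = T - R + 1
    return sum(f(block[i:i + R]) for i in range(n_windows)) / n_windows


# Toy spike block (hypothetical): N = 2 neurons, T = 5 time bins.
block = [(1, 0), (0, 1), (1, 0), (0, 0), (0, 1)]


def fires_0(w: Sequence[Pattern]) -> int:
    """Range-1 observable: neuron 0 fires in the current bin."""
    return w[0][0]


def lagged_pair(w: Sequence[Pattern]) -> int:
    """Range-2 observable: neuron 0 fires, then neuron 1 fires one bin later."""
    return w[0][0] * w[1][1]


rate_0 = empirical_average(fires_0, block, R=1)          # 2 spikes in 5 bins
lagged_corr = empirical_average(lagged_pair, block, R=2)  # 1 hit in 4 windows
```

Observables of range 2 or more, such as `lagged_pair`, are the ones that carry information about temporal order, which is what distinguishes the Markov setting from the i.i.d. one.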

3.2. Moments and Cumulants

Observables are random variables whose average values can be determined from experimental data or from the explicit representation of the underlying measure characterizing the stochastic process generating the data. Important statistical properties of random variables are encoded in the cumulants. We will use the cumulants in Section 5 of this tutorial to characterize and infer properties of maximum entropy Markov chains. Let us now introduce them.

The moment of order r of a real-valued random variable X is given by

m_{r} = E (X^{r})

, for

r \in N

(here we freely use the notation

E

to denote the expectation with respect to a probability measure that should be inferred from the context). The moment generating function (or Laplace transform) of a random variable is defined by:

\begin{matrix} M (t) = E (e^{t X}), \end{matrix}

and provided it is a function of t with continuous derivatives of arbitrary order at 0, we have that:

\begin{matrix} m_{r} = {(\frac{d^{r}}{d t^{r}} M)}_{t = 0} . \end{matrix}

The cumulants

κ_{r}

are the coefficients in the Taylor expansion of the cumulant generating function, which is defined as the logarithm of the moment generating function, namely,

\begin{matrix} ln M (t) = \sum_{r = 1}^{\infty} κ_{r} \frac{t^{r}}{r!} . \end{matrix}

The relation between the moments and cumulants is obtained extracting coefficients from the Taylor expansion, i.e.,

\begin{matrix} κ_{r} = {(\frac{d^{r}}{d t^{r}} ln (M (t)))}_{t = 0} \end{matrix}

(5)

which yields the first values:

\begin{matrix} κ_{1} & = & m_{1}, \\ κ_{2} & = & m_{2} - m_{1}^{2}, \\ κ_{3} & = & m_{3} - 3 m_{2} m_{1} + 2 m_{1}^{3}, \\ κ_{4} & = & m_{4} - 4 m_{3} m_{1} - 3 m_{2}^{2} + 12 m_{2} m_{1}^{2} - 6 m_{1}^{4}, \end{matrix}

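and so on. In particular, the first two cumulants are the mean and the variance, while the third and fourth cumulants, normalized by $κ_{2}^{3 / 2}$ and $κ_{2}^{2}$ respectively, give the skewness and the excess kurtosis.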
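The moment-to-cumulant relations above are easy to verify on a distribution whose cumulants are known in closed form. The sketch below uses a Bernoulli variable with parameter p (a natural choice for a single binary spike variable), for which every moment equals p; the function name and the choice p = 0.5 are assumptions made for illustration.

```python
from typing import Tuple


def cumulants_from_moments(m1: float, m2: float, m3: float,
                           m4: float) -> Tuple[float, float, float, float]:
    """First four cumulants in terms of the first four raw moments."""
    k1 = m1
    k2 = m2 - m1 ** 2
    k3 = m3 - 3 * m2 * m1 + 2 * m1 ** 3
    k4 = m4 - 4 * m3 * m1 - 3 * m2 ** 2 + 12 * m2 * m1 ** 2 - 6 * m1 ** 4
    return k1, k2, k3, k4


# Bernoulli(p): X is 0 or 1, so X^r = X and every raw moment equals p.
p = 0.5
k1, k2, k3, k4 = cumulants_from_moments(p, p, p, p)

# Closed-form Bernoulli cumulants, for comparison.
expected = (p,
            p * (1 - p),
            p * (1 - p) * (1 - 2 * p),
            p * (1 - p) * (1 - 6 * p * (1 - p)))
```

For p = 0.5 the third cumulant vanishes, reflecting the symmetry of the distribution, while the fourth is negative, as expected for a two-point distribution.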

3.3. Observables and Ergodicity

Let

θ : Ω \mapsto Ω

be the shift operator that acts on a sequence

ω \in Ω

as:

\begin{matrix} {(θ (ω))}_{i} = ω_{i + 1}, \end{matrix}

i.e.,

θ

shifts the sequence one position to the left. Now, assume that the Markov chain

(X_{t} : t \geq 0)

is ergodic. Let

π

be its unique stationary probability distribution. The Birkhoff Ergodic Theorem states that under the above assumptions, for every

f \in C (B)

:

\begin{matrix} P_{μ} (lim_{N \to + \infty} \frac{1}{N} \sum_{n = 0}^{N - 1} f \circ θ^{n} = E_{π} (f)) = 1, \end{matrix}

for every initial measure

μ

. This equation means that under the ergodic hypothesis, the temporal averages converge to the spatial averages. The importance of this fundamental result should not be underestimated since this result supports the practice of regarding averages of (hopefully) large samples of experimental data as faithful approximations of the true values of the expectations of the observables.

3.4. Central Limit Theorem for Observables

Consider an arbitrarily large sequence of spike patterns of N neurons. Consider

t \in N

and let

x_{0, t - 1}

be the spike-block of length t. Also, let f be an arbitrary observable of fixed range R. The asymptotic properties of

A_{t} (f)

are established in the following context: the finite sample is drawn from an ergodic Markov chain, i.e.,

x \sim P_{ν}

, where

P_{ν}

is the Markov probability measure of an ergodic chain

(X_{t} : t \geq 0)

started from an arbitrary initial distribution. Let

π

be the unique stationary measure for the Markov chain. Observe that by virtue of the ergodic assumption, the empirical averages of observables become more accurate as the sampling size grows, i.e.,

\begin{matrix} P_{ν} (A_{t} (f) \to E_{π} {f}) = 1 . \end{matrix}

for any starting condition

ν

. However, the above result does not clarify the rate at which the accuracy improves. The central limit theorem (CLT) for ergodic Markov chains addresses this issue (for details see [28]).

Theorem 1 (Central limit theorem for ergodic Markov chains).

Under the above assumptions, and keeping notation, define:

\begin{matrix} σ = \sqrt{E_{π} [{(f (X_{0}, \dots, X_{R - 1}) - E_{π} [f (X_{0}, \dots, X_{R - 1})])}^{2}]} . \end{matrix}

Let

L_{t}

be the law of the random variable

\frac{\sqrt{t}}{σ} [A_{t} (f) - E_{π} {f}]

under the measure

P_{ν}

of an ergodic Markov chain started from an arbitrary distribution. Let L be the law of a standard normal random variable. Then

L_{t} \to L

in the sense of weak convergence (convergence in distribution). This is usually written as:

\begin{matrix} P_{ν} \{\frac{\sqrt{t}}{σ} [A_{t} (f) - E_{π} {f}] \leq x\} \to \frac{1}{\sqrt{2 π}} \int_{- \infty}^{x} e^{- \frac{s^{2}}{2}} d s . \end{matrix}

This theorem implies that “typical” fluctuations of

A_{t} (f)

around its long term average

E_{π} {f}

are of the order of

σ / \sqrt{t}

. For spike trains, this theorem quantifies the expected Gaussian fluctuations of observables in terms of the sample size of the experimental data.
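The scaling of the fluctuations can be checked in the simplest possible setting. The sketch below assumes i.i.d. Bernoulli spikes (a trivial ergodic Markov chain whose transition matrix has identical rows), so that σ is just the standard deviation of a single spike variable; the parameters p, t and the number of trials are illustrative assumptions.

```python
import math
import random

def normalized_fluctuations(p, t, n_trials, seed=1):
    """Samples of sqrt(t) / sigma * (A_t(f) - E f) for i.i.d. Bernoulli(p) spikes."""
    rng = random.Random(seed)
    sigma = math.sqrt(p * (1.0 - p))      # standard deviation of f(X) = X
    samples = []
    for _ in range(n_trials):
        spikes = sum(1 for _ in range(t) if rng.random() < p)
        a_t = spikes / t                  # empirical average A_t(f)
        samples.append(math.sqrt(t) * (a_t - p) / sigma)
    return samples

z = normalized_fluctuations(p=0.2, t=400, n_trials=2000)
mean = sum(z) / len(z)
std = math.sqrt(sum((v - mean) ** 2 for v in z) / len(z))
# Fraction of samples within one unit of zero (roughly 0.68 for a Gaussian,
# up to discreteness effects of the spike count).
within_one = sum(1 for v in z if abs(v) <= 1.0) / len(z)
print(mean, std, within_one)
```

If the rescaled fluctuations are approximately standard normal, the printed mean should be close to zero and the standard deviation close to one.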

3.5. Large Deviations of Average Values of Observables

Although the CLT for ergodic Markov chains is precise in describing the typical fluctuations around the mean, it does not characterize the probabilities of large fluctuations. While it is clear that the probability of large fluctuations of average values vanishes as the sample size increases, it is often relevant to characterize the rate at which this probability decreases. That is what the large deviation principle (LDP) does.

Let f be a function of finite range defined on the space of sequences. In many situations, f will be a

{0, 1}

–valued function. Let

P_{π}

be the probability measure on the space of sequences induced by an ergodic Markov chain with stationary probability

π

. The empirical average

A_{t} (f)

satisfies a large deviation principle (LDP) with rate function

I_{f}

, defined as

I_{f} (s) : = - lim_{t \to \infty} \frac{1}{t} log P_{π} {A_{t} (f) > s},

(6)

if the above limit exists. The above condition implies for large t that

P_{π} {A_{t} (f) > s} \approx e^{- t I_{f} (s)}

. In particular, if

s > E_{π} {f}

the Law of Large Numbers (LLN) ensures that

P_{π} {A_{t} (f) > s}

goes to zero as t increases, but the rate function quantifies the speed at which this occurs.

Calculating

I_{f}

using the definition (Equation (6)), is usually impractical. However, the Gärtner-Ellis theorem provides a clever alternative to circumvent this problem [29]. Let us introduce the scaled cumulant generating function (SCGF) associated to the random variable f by

λ_{f} (k) : = lim_{t \to \infty} \frac{1}{t} ln E_{π} [e^{t k A_{t} (f)}], k \in R,

(7)

when the limit exists (details about cumulant generating functions are found in [30]). While the empirical average

A_{t} (f)

is taken over a sample (empirical measure), the expectation in (7) is computed over the probability distribution given by

P_{π} {\cdot}

.

Theorem 2 (Gärtner-Ellis theorem).

If

λ_{f}

is differentiable, then the average

A_{t} (f)

satisfies a LDP with rate function given by the Legendre transform of

λ_{f}

, that is

I_{f} (s) = max_{k \in R} {k s - λ_{f} (k)} .

(8)

Therefore, the large deviations of empirical averages

A_{t} (f)

can be characterized by first computing their SCGF and then finding their Legendre transform.
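The two-step recipe can be sketched for a case where the answer is known in closed form. The example below assumes i.i.d. Bernoulli(p) spikes, for which the SCGF is ln(1 - p + p e^k) and the rate function is the relative entropy between Bernoulli(s) and Bernoulli(p). The ternary search used for the maximization and all function names are illustrative choices, not taken from the text.

```python
import math

def scgf(k, p):
    """SCGF of an i.i.d. Bernoulli(p) observable: ln E[exp(k X)]."""
    return math.log(1.0 - p + p * math.exp(k))

def rate_function(s, p, k_lo=-50.0, k_hi=50.0, iters=200):
    """Legendre transform max_k {k s - scgf(k)} by ternary search.

    The objective is concave in k, so ternary search locates its maximum.
    """
    g = lambda k: k * s - scgf(k, p)
    for _ in range(iters):
        m1 = k_lo + (k_hi - k_lo) / 3.0
        m2 = k_hi - (k_hi - k_lo) / 3.0
        if g(m1) < g(m2):
            k_lo = m1
        else:
            k_hi = m2
    return g(0.5 * (k_lo + k_hi))

def relative_entropy(s, p):
    """Closed-form rate function of the Bernoulli sample mean."""
    return s * math.log(s / p) + (1.0 - s) * math.log((1.0 - s) / (1.0 - p))

p = 0.2
for s in (0.1, 0.2, 0.3, 0.5):
    print(s, rate_function(s, p), relative_entropy(s, p))
```

The rate function vanishes at s = p and grows as s moves away from the expected value, in agreement with the closed-form expression.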

A useful application of the LDP is to estimate the likelihood that the empirical average

A_{t} (f)

takes a value far from its expected value. Let us assume that

I_{f} (s)

is a positive differentiable convex function. Then,

λ_{f} (k)

is also differentiable [31] (for a comprehensive discussion about the differentiability of

λ_{f} (k)

see [30].) Then, as

I_{f} (s)

is convex it has a unique global minimum. Denoting this minimum by

s^{*}

, from the differentiability of

I_{f} (s)

it follows that

I_{f}^{'} (s^{*}) = 0

. Additionally, it follows from properties of the Legendre transform that

s^{*} = λ_{f}^{'} (0) = E_{p} {f}

, which is the LLN that says that

A_{t} (f)

concentrates around

s^{*}

. Consider

s \neq s^{*}

and that

I_{f} (s)

admits a Taylor expansion around

s^{*}

I_{f} (s) = I_{f} (s^{*}) + I_{f}^{'} (s^{*}) (s - s^{*}) + \frac{I_{f}^{″} (s^{*}) {(s - s^{*})}^{2}}{2} + O ({(s - s^{*})}^{3}) .

As

s^{*}

is both a zero and a minimum of

I (s)

, the first two terms of this expansion are zero, and as

I (s)

is convex

I^{″} (s) > 0

. For large t, it follows from (6) that

\begin{matrix} P_{π} {A_{t} (f) > s} & \approx e^{- t I_{f} (s)} \\ & \approx e^{- t \frac{I_{f}^{″} (s^{*}) {(s - s^{*})}^{2}}{2}}, \end{matrix}

(9)

so the “small deviations” (we are using Taylor expansion) of

A_{t} (f)

around

s^{*}

are Gaussian (in Equation (9)

1 / I_{f}^{″} (s^{*}) = λ_{f}^{″} (0) = σ^{2}

). In this sense, the LDP can be considered an extension of the CLT, as it characterizes not only the small deviations around

s^{*}

(Gaussian), but also the large deviations (not Gaussian) of

A_{t} (f)

.

4. Building Maximum Entropy Temporal Models

This section presents the main concepts behind the construction of maximum entropy models for temporal data. Section 4.1 introduces the concept of entropy rate, and Section 4.2 formulates the problem of maximizing it under constraints. Methods for solving this problem are discussed in Section 4.3 and illustrated with an example in Section 4.4.

4.1. The Entropy Rate of a Temporal Model

4.1.1. Basic Definitions

In order to give mathematical meaning to the rather vague notion of uncertainty, a natural approach is to employ the well-established notion of Shannon entropy. For any probability measure p defined over the state space E (not necessarily

S

), the Shannon entropy of p is given by

S [p] : = - \sum_{x \in E} p (x) log p (x) .

Note that this definition can be used for measures on the spaces of infinite sequences

E^{N}

. However, in most cases of interest its value diverges. A better suited notion in this context is the entropy rate, which plays a crucial role in the rest of this tutorial.

Definition 1 (entropy rate).

Let μ be a probability measure on the space of sequences

S^{N}

. For

n \geq 1

let

μ_{n}

be the probability measure induced by μ on the initial n coordinates, i.e.,

μ_{n}

is the probability distribution on

E^{n}

given by:

\begin{matrix} μ_{n} (x_{0}, x_{1}, \dots, x_{n - 1}) = μ (ω \in S^{N} : X_{i} = x_{i} for i = 0, 1, \dots, n - 1) . \end{matrix}

The entropy rate of the measure μ is defined by:

S [μ] = lim_{n \to \infty} \frac{1}{n} S [μ_{n}] .

(10)

The above definition applies to any probability distribution on the space of sequences. Intuitively, the entropy rate corresponds to the entropy per time unit, and represents how much “uncertainty” is created by the process as time moves forward.

4.1.2. The Entropy Rate of I.I.D. and Markov Models

Let us first consider a null model of spike activity, in which consecutive spike patterns are statistically independent. For this, first recall that

S = {0, 1}^{N}

, where N is the fixed number of neurons. Without loss of generality, we can enumerate the elements of

S

as

s_{1}, s_{2}, \dots, s_{2^{N}}

. Let

ν = (ν_{1}, ν_{2}, \dots, ν_{2^{N}})

be a probability measure on

S

such that:

\begin{matrix} ν (s_{k}) = ν_{k} \end{matrix}

For a T–block

x = (x_{0}, x_{1}, \dots, x_{T - 1}) \in S^{T}

and for every

s \in S

, we set:

\begin{matrix} N_{s}^{T} (x) = \sum_{i = 0}^{T - 1} 1_{{x_{i} = s}} . \end{matrix}

On the space of infinite spike trains

S^{N}

we consider the probability

μ = ν^{\otimes N}

, i.e., the product measure on the space of spike trains. Observe that the induced measure is given by:

\begin{matrix} μ_{T} (x_{0}, x_{1}, \dots, x_{T - 1}) = \prod_{k = 1}^{2^{N}} ν_{k}^{N_{s_{k}}^{T} (x_{0}, \dots, x_{T - 1})} . \end{matrix}

With this, a straightforward calculation shows that

\begin{matrix} S [μ] = S [ν] = - \sum_{k = 1}^{2^{N}} ν_{k} ln (ν_{k}), \end{matrix}

and in this case we observe that the entropy rate is equal to the entropy of the probability distribution induced by each coordinate map.

A reasonable next step in the hierarchy of models is to weaken the independence hypothesis and assume instead that the spike activity keeps some bounded memory of the past. For this, following the considerations of Section 2, let us consider an ergodic discrete Markov chain with transition matrix P and invariant distribution

π

taking values in

S

. Let

μ = μ (P, π)

be the measure induced by this chain on the space

S^{N}

. Observe that, with the above notation:

\begin{matrix} μ_{n} (x_{0}, x_{1}, \dots, x_{n - 1}) = π_{x_{0}} \prod_{j = 1}^{n - 1} P_{x_{j - 1} x_{j}} . \end{matrix}

A direct computation shows that

\begin{matrix} S [μ_{2}] & = - \sum_{(x_{0}, x_{1}) \in S^{2}} π_{x_{0}} P_{x_{0} x_{1}} ln (π_{x_{0}} P_{x_{0} x_{1}}) \\ & = - \sum_{x \in S} π_{x} ln (π_{x}) - \sum_{(x_{0}, x_{1}) \in S^{2}} π_{x_{0}} P_{x_{0} x_{1}} ln (P_{x_{0} x_{1}}), \end{matrix}

and induction shows that:

\begin{matrix} S [μ_{n}] = - \sum_{x \in S} π_{x} ln (π_{x}) - (n - 1) \sum_{(x_{0}, x_{1}) \in S^{2}} π_{x_{0}} P_{x_{0} x_{1}} ln (P_{x_{0} x_{1}}) . \end{matrix}

Thus dividing by n and taking the limit in Equation (10), one finds that

\begin{matrix} S [μ] = - \sum_{(x_{0}, x_{1}) \in S^{2}} π_{x_{0}} P_{x_{0} x_{1}} ln (P_{x_{0} x_{1}}) . \end{matrix}
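The limit in Equation (10) can be observed numerically on a small chain. The sketch below uses a hypothetical two-state ergodic chain (the transition probabilities are illustrative assumptions), computes the entropy of n-blocks by brute-force enumeration, and compares the entropy per symbol with the closed-form entropy rate.

```python
import itertools
import math

# Hypothetical two-state ergodic chain (illustrative values only).
P = [[0.9, 0.1],
     [0.4, 0.6]]
pi = [0.8, 0.2]          # stationary measure: pi P = pi

def block_entropy(n):
    """Shannon entropy of the law of (X_0, ..., X_{n-1}), by enumeration."""
    total = 0.0
    for block in itertools.product(range(2), repeat=n):
        prob = pi[block[0]]
        for a, b in zip(block, block[1:]):
            prob *= P[a][b]
        if prob > 0.0:
            total -= prob * math.log(prob)
    return total

def entropy_rate():
    """Closed form: -sum_{x0, x1} pi_{x0} P_{x0 x1} ln P_{x0 x1}."""
    return -sum(pi[i] * P[i][j] * math.log(P[i][j])
                for i in range(2) for j in range(2) if P[i][j] > 0.0)

h = entropy_rate()
for n in (1, 2, 4, 8, 12):
    print(n, block_entropy(n) / n, h)
```

The entropy per symbol decreases towards the entropy rate as n grows, because the contribution of the initial distribution is diluted by the factor 1/n.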

4.2. Entropy Rate Maximization under Constraints

Now we introduce the central problem of this tutorial. Assume we have empirical data from spiking activity. Consider the empirical averages of K observables,

〈 f_{k} 〉

, for

f_{k}, k = 1, \dots, K

. We need to characterize the Markov chains that are consistent with these average values. Except for trivial and uninteresting situations, there is no finite set of empirical averages that uniquely determines a distribution

μ

on

S^{N}

that fits the averages, in the sense that

\begin{matrix} μ (f_{k}) = 〈 f_{k} 〉 for k = 1, \dots, K . \end{matrix}

Consequently, we need to impose further restrictions in order to guarantee uniqueness. A useful and meaningful approach is the so-called Maximum Entropy Markov Chain model (MEMC), which selects the unique probability measure

μ

among all the stationary Markov measures

ν

on

S^{N}

that match the expected values of a given set of observables and that maximizes the entropy rate. Mathematically, it is written in the following form:

\begin{matrix} \max_{ν \in M_{i n v}} & S [ν] \\ subject to & ν (f_{k}) = {〈 f_{k} 〉}_{e} = C_{k}, \forall k \in {1, \dots, K}, \end{matrix}

where

M_{i n v}

is shorthand for the set of stationary Markov measures on

S^{N}

. Formally:

\begin{matrix} M_{i n v} : = {(π, P) : π is a probability on S, P is stochastic, π P = π} . \end{matrix}

It is to be noted that the maximum entropy principle can be derived in some scenarios from more general principles based on large deviation theory [30]. In this framework, entropy maximization corresponds to Kullback-Leibler divergence minimization. This approach can be useful for accounting for additional information that is not in the form of functional constraints, but of a Bayesian prior. A major drawback of applying this approach to spike trains is that it assumes stationarity in the data. While this condition is not to be naturally expected in biological systems, controlled experiments can be carried out in the context of spike train analysis in order to maintain these conditions [3,8,32,33]. The maximum entropy principle as presented here is useful only in the stationary case. However, some extensions have been proposed [34,35]. Note also that there are alternative variational principles that can be used to find distributions extremizing other quantities, such as the maximum entropy production principle [36,37,38] or the Prigogine minimum entropy production principle [39,40]. To the best of our knowledge, these alternatives have not yet been explored in the context of spike train statistics.

4.3. Solving the Optimization Problem

We now discuss techniques for finding models that maximize the entropy rate.

4.3.1. Lagrange Multipliers and the Variational Principle

To solve the above optimization problem, let us introduce the set of Lagrange multipliers

h_{k} \in R

and an energy function

H = \sum_{k = 1}^{K} h_{k} f_{k}

, which is a linear combination of observables. Consider the following unconstrained optimization problem, which can be framed in the context of the variational principle of the thermodynamic formalism [41]:

F [H] = sup_{ν \in M_{i n v}} \{S [ν] + ν (H)\} = S [μ] + μ (H),

(11)

where

F [H]

is called the free energy and

ν (H) = \sum_{k = 1}^{K} h_{k} ν (f_{k})

is the average value of

H

with respect to the measure

ν

. The following holds:

\frac{\partial F [H_{h}]}{\partial h_{k}} = E_{p} {f_{k}} = C_{k}, \forall k \in {1, \dots, K},

where

E_{p} {f_{k}}

is the average of

f_{k}

with respect to p (the maximum entropy measure), which equals, by the constraints, the average value of

f_{k}

with respect to the empirical measure from the data.

The maximum-entropy (ME) principle [42] has been successfully applied to spike data from the cortex and the retina [3,8,9,11,12,43]. The approach starts by fixing the set of constraints determined by the empirical averages of observables measured from spiking data. Maximizing the entropy, a concave functional, under these constraints gives a unique distribution. The choice of observables to measure in the empirical data (constraints) determines the statistical model. The approach of Lagrange multipliers may not be practical when trying to fit a MEMC. In the next section we introduce an alternative optimization based on spectral properties.

4.3.2. Transfer Matrix Method

In order to illustrate the transfer matrix method, we start with a classical example that allows us to introduce a fundamental definition. Let A be an adjacency matrix, i.e., a

{0, 1}

-valued square matrix with rows and columns indexed by the elements of

S

. If there exists an

n \geq 0

such that

\begin{matrix} A_{i j}^{n} > 0 \end{matrix}

for every

i, j \in S

, we say that A is primitive. The next well-known theorem of Linear Algebra is crucial [44] for the uniqueness of the MEMC.

Theorem 3 (Perron-Frobenius theorem).

Let A be a primitive matrix. Then,

There is a positive maximal eigenvalue ρ > 0 such that all other eigenvalues satisfy $∣ ρ^{'} ∣ < ρ$ . Moreover ρ is simple;
There are positive left- and right-eigenvectors $u = (u_{1}, \dots, u_{k}), v = (v_{1}, \dots, v_{k})$ s.t. $u A = ρ u, A v = ρ v .$

Apply the above theorem to a primitive matrix A, and define:

P_{i j} = \frac{A_{i j} v_{j}}{ρ v_{i}}; π_{i} = \frac{u_{i} v_{i}}{〈 u, v 〉},

where

〈 u, v 〉

is the standard inner product in

R^{2^{N}}

(we refer the reader to [44] for details). The matrix P built above is stochastic. Moreover,

π

is its unique stationary measure. Define the Parry measure to be the Markov measure:

\begin{matrix} μ (i_{0}, i_{1}, \dots, i_{n}) = π_{i_{0}} P_{i_{0} i_{1}} \cdots P_{i_{n - 1} i_{n}} . \end{matrix}

It is well known that the Parry measure is the unique measure of maximal entropy consistent with the adjacency matrix A [45,46].
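The construction of the Parry measure can be sketched on a small adjacency matrix. The example below uses the so-called golden mean shift, in which the transition from state 1 to itself is forbidden; this matrix, the power iteration used to obtain the Perron eigenvectors, and the function names are illustrative assumptions rather than part of the original text.

```python
import math

def perron(A, iters=2000):
    """Largest eigenvalue and positive right eigenvector of a primitive matrix,
    by power iteration (the vector is normalized to sum to one)."""
    n = len(A)
    v = [1.0 / n] * n
    rho = 0.0
    for _ in range(iters):
        w = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
        rho = sum(w)                 # since sum(v) == 1, sum(A v) converges to rho
        v = [x / rho for x in w]
    return rho, v

def transpose(A):
    return [list(row) for row in zip(*A)]

# Adjacency matrix of the "golden mean" shift: the transition 1 -> 1 is forbidden.
A = [[1, 1],
     [1, 0]]

rho, v = perron(A)                   # A v = rho v
_, u = perron(transpose(A))          # u A = rho u
uv = sum(ui * vi for ui, vi in zip(u, v))

P = [[A[i][j] * v[j] / (rho * v[i]) for j in range(2)] for i in range(2)]
pi = [u[i] * v[i] / uv for i in range(2)]

entropy_rate = -sum(pi[i] * P[i][j] * math.log(P[i][j])
                    for i in range(2) for j in range(2) if P[i][j] > 0)
print(rho, entropy_rate, math.log(rho))
```

For this matrix the maximal eigenvalue is the golden ratio, and the entropy rate of the resulting Markov measure coincides with its logarithm.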

Inspired by this result, we consider now the general case. Consider constraints given by a set of empirical averages of observables, as explained in the previous section. The above example certainly fits this setting: just consider binary observables associated to each pair of states

(i, j)

that evaluate to 1 when a transition from state i to state j has been observed in the data. In our general setting, we assume that the chosen observables have a finite maximum range R. From these observables, the energy function

H

of finite range R is built as a linear combination of these observables. Using this energy function we build a matrix denoted by

L_{H}

, so that for every

y, w \in S^{R}

its entries are given as follows:

L_{H} (y, w) = \{\begin{matrix} e^{H (y_{1} w_{1, R - 1})} & if y_{1, R - 1} = w_{0, R - 2} \\ 0, & otherwise . \end{matrix}

(12)

where

y_{1} w_{1, R - 1}

is the concatenated block built from

y_{1}

and

w_{1, R - 1}

. For observables of range one, the matrix above is defined as

L_{H} (y, w) = e^{H (y)}

. Assuming

H > - \infty

, the elements of the matrix

L_{H}

are non-negative. Furthermore, in every non trivial case, the matrix is primitive and satisfies the Perron-Frobenius theorem [44]. Denote by

ρ

the unique largest eigenvalue of

L_{H}

. Just as above, we denote by

u

and

v

the left and right eigenvectors of

L_{H}

associated to

ρ

. Notice that

u_{i} > 0

and

v_{i} > 0

, for all

i \in S

. The free energy associated to a transfer matrix is the logarithm of the unique maximum eigenvalue.

The matrix

L_{H}

can be turned into a Markov matrix of maximum entropy. For a primitive matrix M with spectral radius

ρ

, and positive right eigenvector

v

associated to

ρ

, the stochastic matrix built from M is computed as follows:

S (M) = \frac{1}{ρ} D^{- 1} M D,

where D is the diagonal matrix with entries

D_{i i} = v_{i}

. The MEMC transition matrix P and unique stationary probability measure

π

are explicitly given by

P = S (L_{H}); π_{i} : = \frac{u_{i} v_{i}}{〈 u, v 〉}, \forall i \in S .

(13)

Note that when

H = 0

, the MEMC is characterized by the Markov transition matrix with components [47]:

P_{i j} = \frac{A_{i j} v_{j}}{ρ v_{i}},

where A is the adjacency matrix.

4.3.3. Finite Range Gibbs Measures

For a fixed energy function

H

of range

R \geq 2

, there is a unique stationary Markov measure

μ

for which there exists a constant

γ \geq 1

such that [46],

γ^{- 1} \leq \frac{μ [x_{1, n}]}{exp (\sum_{k = 1}^{n - R + 1} H (x_{k, k + R - 1}) - (n + R - 1) F [H])} \leq γ,

(14)

that attains the supremum (11). The measure

μ

, as defined by (14), is known in the symbolic dynamics literature as Gibbs measure in the sense of Bowen [48]. All MEMCs belong to this class of measures. Moreover, the classical Gibbs measures in statistical mechanics are particular cases of (14), when

γ = 1

,

F [H] = log Z

and

H

is an energy function of range one, leading to an i.i.d. stochastic process characterized by the product measure

μ

. In this case the following holds:

μ (x) = \frac{e^{H (x)}}{Z} \forall x \in S; Z = \sum_{x \in S} e^{H (x)} .
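The range-one case is simple enough to verify the main identities directly. The sketch below assumes N = 2 neurons and an energy built from the two firing-rate observables, with arbitrary illustrative multipliers; it checks that the Gibbs distribution attains the value log Z in the variational principle (11) and that the derivative of the free energy returns the average of the corresponding observable.

```python
import itertools
import math

states = list(itertools.product((0, 1), repeat=2))   # spike patterns of N = 2 neurons

def energy(x, h):
    """Range-one energy H(x) = h[0] * x^1 + h[1] * x^2 (firing-rate observables)."""
    return h[0] * x[0] + h[1] * x[1]

def log_partition(h):
    """Free energy F[H] = log Z for a range-one energy."""
    return math.log(sum(math.exp(energy(x, h)) for x in states))

def gibbs(h):
    """Gibbs distribution mu(x) = exp(H(x)) / Z."""
    z = math.exp(log_partition(h))
    return {x: math.exp(energy(x, h)) / z for x in states}

h = (-1.0, 0.5)                      # hypothetical Lagrange multipliers
mu = gibbs(h)

entropy = -sum(p * math.log(p) for p in mu.values())
mean_energy = sum(p * energy(x, h) for x, p in mu.items())
rate_1 = sum(p * x[0] for x, p in mu.items())         # E_mu[x^1]

# Central finite difference of F with respect to h_1.
eps = 1e-6
dF_dh1 = (log_partition((h[0] + eps, h[1])) - log_partition((h[0] - eps, h[1]))) / (2 * eps)

print(entropy + mean_energy, log_partition(h))   # both sides of the variational principle
print(dF_dh1, rate_1)                            # derivative of F versus the average
```

In this setting the neurons are independent, so the firing rate of the first neuron reduces to a logistic function of its multiplier.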

The free energy defined here has a deep relationship with the free energy in thermodynamics. Consider a thermodynamic system in equilibrium. The Helmholtz free energy is derived from the partition function as follows:

F (β) = - β^{- 1} log Z

where

β = 1 / (k T)

and k is Boltzmann’s constant and T is the temperature.

This quantity is related to the cumulant generating function for the energy. In the context of the maximum entropy principle, the physical temperature and Boltzmann’s constant play no role, so both are usually set equal to 1. From the free energy, all of the thermodynamic properties of the system can be obtained via its derivatives, for example the internal energy, the specific heat, and the entropy. It is to be noted that the definition of the free energy (11) used in this tutorial follows the conventions of the field of thermodynamic formalism [41,45,46] and differs in sign from the usual convention in statistical mechanics.

4.4. Example

We present here the toy example that we will use to explore statistical properties of spike trains using the non-equilibrium statistical physics approach. We present the transfer matrix technique to compute the Markov transition matrix, its invariant measure and free energy from a potential

H

.

Consider a range-2 potential with two neurons (

N = 2

). We use the notation introduced in Section 2.1:

H (x^{0, 1}) = h_{1} x_{0}^{1} x_{1}^{2} + h_{2} x_{0}^{2} x_{1}^{1} .

The state space of this problem is given by:

(\begin{matrix} 0 \\ 0 \end{matrix}), (\begin{matrix} 0 \\ 1 \end{matrix}), (\begin{matrix} 1 \\ 0 \end{matrix}), (\begin{matrix} 1 \\ 1 \end{matrix}) .

The transfer matrix (12) associated to

H

is, in this case, a

4 \times 4

matrix

L_{{xx}^{'}} = (\begin{matrix} 1 & 1 & 1 & 1 \\ 1 & 1 & e^{h_{2}} & e^{h_{2}} \\ 1 & e^{h_{1}} & 1 & e^{h_{1}} \\ 1 & e^{h_{1}} & e^{h_{2}} & e^{h_{1} + h_{2}} \end{matrix}) .

This matrix satisfies the hypotheses of the Perron-Frobenius theorem. The maximum eigenvalue is

ρ = \frac{1}{2} (3 + e^{(h_{1} + h_{2})} + \sqrt{5 + 4 e^{h_{1}} + 4 e^{h_{2}} + 2 e^{(h_{1} + h_{2})} + e^{(2 h_{1} + 2 h_{2})}}),

and the free energy

F [H] = log (ρ) .

(15)
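The transfer matrix of this example and its maximum eigenvalue can be reproduced with a few lines of code. The sketch below builds the matrix entry by entry from the energy function, estimates the largest eigenvalue by power iteration (an illustrative numerical choice), and compares it with the closed-form expression above; the parameter values h1 = 0.3 and h2 = -0.7 are arbitrary assumptions.

```python
import itertools
import math

STATES = list(itertools.product((0, 1), repeat=2))   # (x^1, x^2), ordered as in the text

def transfer_matrix(h1, h2):
    """Entries exp(H(x, x')) with H = h1 * x^1 * x'^2 + h2 * x^2 * x'^1."""
    return [[math.exp(h1 * x[0] * y[1] + h2 * x[1] * y[0]) for y in STATES]
            for x in STATES]

def perron_eigenvalue(M, iters=500):
    """Largest eigenvalue of a positive matrix by power iteration."""
    n = len(M)
    v = [1.0 / n] * n
    rho = 0.0
    for _ in range(iters):
        w = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
        rho = sum(w)              # v sums to one, so sum(M v) converges to rho
        v = [x / rho for x in w]
    return rho

def rho_closed_form(h1, h2):
    """Closed-form maximum eigenvalue quoted in the text."""
    s = math.exp(h1 + h2)
    disc = 5 + 4 * math.exp(h1) + 4 * math.exp(h2) + 2 * s + s * s
    return 0.5 * (3 + s + math.sqrt(disc))

h1, h2 = 0.3, -0.7                # arbitrary illustrative parameters
rho = perron_eigenvalue(transfer_matrix(h1, h2))
print(rho, rho_closed_form(h1, h2), math.log(rho))   # the last value is F[H]
```

When both parameters vanish, every entry of the matrix equals one and the maximum eigenvalue is 4, so the free energy reduces to log 4, the entropy of four equiprobable patterns.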

5. Statistical Properties of Markov Maximum Entropy Measures

The procedure of finding a maximum entropy model gives us a full statistical model of the system of interest. In this section we discuss the added value that having such a model can provide.

5.1. Cumulants from Free Energy

All the statistical properties of the observables and their correlations can be obtained by taking the successive derivatives of the free energy with respect to the Lagrange Multipliers. This property explains the important role played by the free energy in the framework of MEMC. In general,

\frac{\partial^{n} F [H]}{\partial h_{k}^{n}} = κ_{n} \forall k \in {1, \dots, K},

where

κ_{n}

is the cumulant of order n (Equation (5)). In particular, taking the first derivative:

\frac{\partial F [H]}{\partial h_{k}} = E_{p} {f_{k}} \forall k \in {1, \dots, K},

(16)

where

E_{p} {f_{k}}

is the average with respect to the maximum entropy distribution p, which is equal to the average value of

f_{k}

with respect to the empirical measure. With Equation (16) the parameters of the MEMC can be fitted to be consistent with fixed average values of observables.

Suppose that we compute from data the average values of the following observables

〈 x_{0}^{1} x_{1}^{2} 〉 = 0.1

and

〈 x_{0}^{2} x_{1}^{1} 〉 = 0.3

. We solve (16) (two equations and two unknowns) and obtain

h_{1} = - 1.98306

and

h_{2} = 1.48406

. With these parameters, the following Markov transition matrix and invariant measure are obtained from (13):

P_{x x^{'}} = (\begin{matrix} 0.232971 & 0.469441 & 0.0987018 & 0.198886 \\ 0.115617 & 0.232971 & 0.216056 & 0.435357 \\ 0.549892 & 0.15252 & 0.232971 & 0.0646176 \\ 0.272896 & 0.0756914 & 0.509966 & 0.141446 \end{matrix}) π (x) = (\begin{matrix} 0.29102 \\ 0.248443 \\ 0.248443 \\ 0.212095 \end{matrix}) .
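The numbers above can be reproduced from Equation (13). The sketch below takes the two multipliers quoted in the text as given (it does not solve Equation (16)), builds the transfer matrix, and obtains the Perron eigenvectors by power iteration, which is an illustrative numerical choice; function names are hypothetical.

```python
import itertools
import math

STATES = list(itertools.product((0, 1), repeat=2))   # (x^1, x^2), ordered as in the text

def transfer_matrix(h1, h2):
    """Entries exp(H(x, x')) with H = h1 * x^1 * x'^2 + h2 * x^2 * x'^1."""
    return [[math.exp(h1 * x[0] * y[1] + h2 * x[1] * y[0]) for y in STATES]
            for x in STATES]

def perron(M, iters=500):
    """Largest eigenvalue and positive right eigenvector (power iteration)."""
    n = len(M)
    v = [1.0 / n] * n
    rho = 0.0
    for _ in range(iters):
        w = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
        rho = sum(w)
        v = [x / rho for x in w]
    return rho, v

def memc(h1, h2):
    """Transition matrix P and stationary measure pi from Equation (13)."""
    L = transfer_matrix(h1, h2)
    rho, v = perron(L)                               # L v = rho v
    _, u = perron([list(col) for col in zip(*L)])    # u L = rho u
    n = len(L)
    P = [[L[i][j] * v[j] / (rho * v[i]) for j in range(n)] for i in range(n)]
    uv = sum(a * b for a, b in zip(u, v))
    pi = [u[i] * v[i] / uv for i in range(n)]
    return P, pi

def average(P, pi, f):
    """Stationary average of a range-two observable f(x, x')."""
    return sum(pi[i] * P[i][j] * f(STATES[i], STATES[j])
               for i in range(4) for j in range(4))

P, pi = memc(-1.98306, 1.48406)                      # multipliers quoted in the text
avg_1 = average(P, pi, lambda x, y: x[0] * y[1])     # constraint on x_0^1 x_1^2
avg_2 = average(P, pi, lambda x, y: x[1] * y[0])     # constraint on x_0^2 x_1^1
print([round(p, 6) for p in P[0]], [round(p, 6) for p in pi])
print(avg_1, avg_2)
```

The two printed averages should recover the empirical constraints 0.1 and 0.3 up to the rounding of the multipliers.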

5.2. Fluctuation-Dissipation Relations

For a first-order stationary Markov chain, since each

X_{n}, n \geq 1

depends on its predecessor, this induces a non-zero time-correlation between

X_{n}

and

X_{n + r}

, even when the distance r is greater than 1. This correlation, and more generally, time correlations between observables can be directly derived from the free energy. This relationship is usually referred to as Fluctuation-dissipation, and is also related to the linear response function that is presented in Section 5.7.

Let P be an ergodic matrix indexed by the states in some finite set E, and

π

be its unique stationary measure. In this general context, for two real-valued functions that depend on a fixed finite number of components, we define the n–step correlation as

\begin{matrix} C_{f, g} (n) = E_{π} (f (X_{0}) g (X_{n})) - E_{π} (f (X_{0})) E_{π} (g (X_{0})) . \end{matrix}

In the particular case of MEMC with potentials of range

R > 1

there is a positive time correlation between pairs of observables

f (x_{n})

and

g (x_{n + r})

. Suppose the correlations decay fast enough so that (at least)

\sum_{n = 0}^{\infty} | C_{f, g} (n) | < \infty .

Then the following sum (known as the Green-Kubo formula [49]) converges and is non-negative:

σ_{f_{k}, f_{j}}^{2} = C_{f_{k}, f_{j}} (0) + \sum_{r = 1}^{\infty} C_{f_{k}, f_{j}} (r) + \sum_{r = 1}^{\infty} C_{f_{j}, f_{k}} (r) .

(17)

Additionally, it can be shown that the energy function and the free energy depend smoothly on the maximum entropy parameters. Moreover, the correlations between observables can be obtained from the free energy through:

σ_{f_{k}, f_{j}}^{2} = \frac{\partial^{2} F [H]}{\partial h_{k} \partial h_{j}} = \frac{\partial μ (f_{j})}{\partial h_{k}} .

The relationship between a correlation and a derivative of the free energy is called the fluctuation-dissipation theorem [50]. For a MEMC characterized by

μ (P, π)

, the fluctuation-dissipation relationships can be obtained explicitly:

\begin{matrix} \frac{\partial^{2} F [H]}{\partial h_{k} \partial h_{j}} = & E_{μ} [f_{k} f_{j}] - E_{μ} [f_{k}] E_{μ} [f_{j}] + \sum_{r = 1}^{\infty} (\sum_{x, x^{'} \in S} f_{k} (x) f_{j} (x^{'}) π_{x} P_{x x^{'}}^{r} - E_{μ} [f_{k}] E_{μ} [f_{j}]) \\ & + \sum_{r = 1}^{\infty} (\sum_{x, x^{'} \in S} f_{j} (x) f_{k} (x^{'}) π_{x} P_{x x^{'}}^{r} - E_{μ} [f_{k}] E_{μ} [f_{j}]) . \end{matrix}

(18)

For MEMC built from K observables, the correlations can be conveniently arranged in a

K \times K

symmetric matrix denoted by

χ

(the symmetry refers to the Onsager reciprocity relations [51]).

χ_{j k} = \frac{\partial^{2} F [H]}{\partial h_{k} \partial h_{j}} = \frac{\partial μ (f_{j})}{\partial h_{k}} = \frac{\partial μ (f_{k})}{\partial h_{j}} = χ_{k j} .

(19)

For the example of Section 4.4, we obtain the matrix

χ

by taking the second derivatives of (15) and evaluating them at the parameters found previously,

χ_{k j} = (\begin{matrix} 0.0971481 & 0.0606071 \\ 0.0606071 & 0.127964 \end{matrix}) .
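This matrix can be reproduced numerically from the closed-form free energy of the example. The sketch below approximates the derivatives by central finite differences, which is an illustrative choice (symbolic differentiation would serve equally well); the step sizes and function names are assumptions.

```python
import math

def free_energy(h1, h2):
    """F[H] = log(rho), using the closed-form rho of the example in Section 4.4."""
    s = math.exp(h1 + h2)
    disc = 5 + 4 * math.exp(h1) + 4 * math.exp(h2) + 2 * s + s * s
    return math.log(0.5 * (3 + s + math.sqrt(disc)))

def gradient(h1, h2, eps=1e-5):
    """First derivatives of F: the averages of the two observables."""
    F = free_energy
    return [(F(h1 + eps, h2) - F(h1 - eps, h2)) / (2 * eps),
            (F(h1, h2 + eps) - F(h1, h2 - eps)) / (2 * eps)]

def hessian(h1, h2, eps=1e-4):
    """Second derivatives of F: the matrix chi of Equation (19)."""
    F = free_energy
    d11 = (F(h1 + eps, h2) - 2 * F(h1, h2) + F(h1 - eps, h2)) / eps**2
    d22 = (F(h1, h2 + eps) - 2 * F(h1, h2) + F(h1, h2 - eps)) / eps**2
    d12 = (F(h1 + eps, h2 + eps) - F(h1 + eps, h2 - eps)
           - F(h1 - eps, h2 + eps) + F(h1 - eps, h2 - eps)) / (4 * eps**2)
    return [[d11, d12], [d12, d22]]

h1, h2 = -1.98306, 1.48406           # multipliers quoted in Section 5.1
grad = gradient(h1, h2)
chi = hessian(h1, h2)
print(grad)
print(chi)
```

The gradient recovers the constrained averages, and the symmetry of the Hessian reflects the equality of mixed partial derivatives in Equation (19).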

In Figure 2, we plot the right hand side of Equation (18) for the MEMC built from the example of Section 4.4, consistent with the constraints considered in the example of Section 5.1, for the auto-correlation of the observable

x_{0}^{2} x_{1}^{1}

.

5.3. Resonances and Decay of Correlations

We now turn back to the general setting of an arbitrary ergodic matrix P with stationary measure

π

associated to a Markov chain taking values on a finite state space (not necessarily the space of spike-patterns). Without loss of generality, assume that P is indexed by the states in

E = {1, 2, \dots, M}

. It can be proved that in this case there exists

(l_{i} : i = 1, 2, \dots, M)

and

(r_{i} : i = 1, 2, \dots, M)

, sets of left and right eigenvectors respectively, associated to the eigenvalues

(ρ_{i} : i = 1, \dots, M)

. We can assume that the eigenvalues and the left and right eigenvectors have been sorted and normalized in such a way that

ρ_{1} = 1

,

l_{1}

is the unique P–stationary probability vector

π

,

r_{1} = {(1, 1, \dots, 1)}^{T}

, and

\begin{matrix} 〈 l_{i} | r_{j} 〉 = δ_{i, j}, \end{matrix}

where

δ_{i, j}

is the Kronecker delta, and

〈 u | v 〉 = 〈 u, v 〉

is Dirac’s bra-ket notation, with

| u 〉 〈 v | = u v^{T}

. With the same notation, the spectral decomposition of P is written:

\begin{matrix} P = \sum_{i = 1}^{M} ρ_{i} | r_{i} 〉 〈 l_{i} | . \end{matrix}

Hence:

\begin{matrix} P^{n} = \sum_{i = 1}^{M} ρ_{i}^{n} | r_{i} 〉 〈 l_{i} | . \end{matrix}

(20)

Given two functions

f : E \mapsto R

and

g : E \mapsto R

the following holds,

\begin{matrix} C_{f, g} (n) & : = E_{π} (f (X_{0}) g (X_{n})) - E_{π} (f (X_{0})) E_{π} (g (X_{0})) \\ & = 〈 π | f \circ P^{n} g 〉 - 〈 π | f 〉 〈 π | g 〉 . \end{matrix}

(21)

Recall the discussion of the reverse chain in Section 2.4. Writing

E_{π}^{\leftarrow}

for the expectation operator associated to the reverse Markov measure, i.e., to the measure

μ = μ (π, \overset{\leftarrow}{P})

, one can see that

\begin{matrix} E_{π} (f (X_{0}) g (X_{n})) = E_{π}^{\leftarrow} (f (X_{n}) g (X_{0})), \end{matrix}

and hence (21) becomes

\begin{matrix} 〈 π | g \circ {\overset{\leftarrow}{P}}^{n} f 〉 - 〈 π | f 〉 〈 π | g 〉 . \end{matrix}

From (20)

\begin{matrix} f \circ P^{n} g = \sum_{i = 1}^{M} ρ_{i}^{n} 〈 l_{i} | g 〉 | f \circ r_{i} 〉 \end{matrix}

and thus (21) becomes

\begin{matrix} C_{f, g} (n) & = \sum_{i = 1}^{M} ρ_{i}^{n} 〈 l_{i} | g 〉 〈 π | f \circ r_{i} 〉 - 〈 π | f 〉 〈 π | g 〉 \\ & = \sum_{i = 2}^{M} ρ_{i}^{n} 〈 l_{i} | g 〉 〈 π | f \circ r_{i} 〉 . \end{matrix}

(22)

In Figure 3, we show the auto-correlations of the same observable considered in Figure 1, for the same MEMC. We observe modulations in the decay of the auto-correlations due to the complex eigenvalues in Equation (22), which arise in the non-symmetric transition matrix induced by the irreversibility of the MEMC.

We have found in Equation (22) an explicit expression for the decay of correlation for observables from the set of eigenvalues and eigenvectors of the transition matrix P. This is relevant in the context of spike train statistics because as the matrix P characterizing the spike trains is not expected to be symmetric, its eigenvalues are not necessarily real and modulations in the decay of correlations are expected (resonances). When measuring correlations between observables from data, one may observe this oscillatory situation that resembles resonances. This may be a symptom of a non-equilibrium situation.
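The oscillatory decay can be made concrete on a chain whose spectrum is known explicitly. The sketch below uses a hypothetical three-state circulant chain with a preferred direction of rotation (it is not the MEMC of the running example); its non-real eigenvalues are 0.7 exp(±2πi/3), so the auto-correlation of an indicator observable is a damped oscillation of period three. The correlation is computed directly from Equation (21) and compared with this spectral prediction.

```python
import math

# Hypothetical three-state chain with a preferred direction 0 -> 1 -> 2 -> 0.
# It is doubly stochastic, so its stationary measure is uniform.
P = [[0.1, 0.8, 0.1],
     [0.1, 0.1, 0.8],
     [0.8, 0.1, 0.1]]
pi = [1.0 / 3.0] * 3

def correlation(f, g, n_max):
    """C_{f,g}(n) = E_pi[f(X_0) g(X_n)] - E_pi[f] E_pi[g] for n = 0..n_max."""
    mean_f = sum(p * a for p, a in zip(pi, f))
    mean_g = sum(p * b for p, b in zip(pi, g))
    out, png = [], list(g)                 # png holds the vector P^n g
    for _ in range(n_max + 1):
        out.append(sum(pi[i] * f[i] * png[i] for i in range(3)) - mean_f * mean_g)
        png = [sum(P[i][j] * png[j] for j in range(3)) for i in range(3)]
    return out

f = [1.0, 0.0, 0.0]                        # indicator of state 0
C = correlation(f, f, 12)

# Prediction from the spectral formula with eigenvalues 0.7 * exp(+-2*pi*i/3).
predicted = [(2.0 / 9.0) * 0.7**n * math.cos(2.0 * math.pi * n / 3.0)
             for n in range(13)]
for n in range(13):
    print(n, round(C[n], 6), round(predicted[n], 6))
```

The sign changes of the correlation are entirely due to the complex eigenvalues; a reversible chain on the same state space would have a real spectrum and no such rotation-induced oscillation.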

5.4. Large Deviations for Average Values of Observables in MEMC

Obtaining the probability of “rare” average values of firing rates, pairwise correlations, triplets or non-synchronous observables is relevant in spike train statistics, as these observables are likely to play an important role in neuronal information processing, and rare values may convey crucial information or be a symptom that the system is not working properly.

Here, we build on a previous article [52] where it is shown that the SCGF (7) can be obtained directly from the inferred Markov transition matrix P, so that the rate function follows from the Gärtner-Ellis theorem (8). Consider a MEMC with transition matrix P. Let f be an observable of finite range and

k \in R

. We introduce the transition matrix of P tilted by f, parametrized by k and denoted by

{\tilde{P}}^{(f)} (k)

[53] as follows:

{\tilde{P}}_{i j}^{(f)} (k) = P_{i j} e^{k f (i j)}, i, j \in S .

(23)

The tilted transition matrix can be directly obtained from the spectral properties of the transfer matrix (12),

\begin{matrix} {\tilde{P}}_{i j}^{(f)} (k) & = & \frac{e^{H_{i j}} v_{j}}{v_{i} ρ} e^{k f (i j)} \\ & = & \frac{e^{[H_{i j} + k f (i j)]} v_{j}}{v_{i} ρ}, i, j \in S . \end{matrix}

Recall that

v

is the right eigenvector associated to the maximum eigenvalue

ρ

of the transfer matrix

L

. Here we use the notation

H_{i j}

to specify that the energy function is built from the elements of the state space i and j. Remarkably, this result is valid not only for the observables in the energy function, i.e., from here the LDP of more general observables can be computed.

To obtain explicit expressions for the SCGF

λ_{f} (k)

, it is possible to take advantage of the structure of the underlying stochastic process. For instance, for an i.i.d. random process

X_{t}

where

X_{i} \sim X

from Equation (7), one obtains

λ (k) = lim_{t \to \infty} \frac{1}{t} ln {(E [e^{k f (X)}])}^{t} = ln E [e^{k f (X)}],

which is the case of range one observables. Using the Equation (23), we obtain that the maximum eigenvalue of the tilted matrix

ρ ({\tilde{P}}_{f} (k))

is,

ρ ({\tilde{P}}_{f} (k)) = \sum_{j} π_{j} e^{k f (j)} j \in S .

As

{\tilde{P}}_{f}

is a primitive matrix, the uniqueness of

ρ ({\tilde{P}}_{f} (k))

is ensured from the Perron-Frobenius theorem.

For additive observables of ergodic Markov chains, a direct calculation (see [54]) leads us to

λ_{f} (k) = ln (ρ ({\tilde{P}}^{(f)} (k))) .

It can also be proved that

λ_{f} (k)

, in this case, is differentiable [54], setting up the scene to use the Gärtner-Ellis theorem to obtain

I_{f} (s)

as shown in Figure 4.
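The whole pipeline (tilted matrix, SCGF, Legendre transform) can be sketched on a small chain. The example below assumes a hypothetical two-state ergodic chain and an observable that equals one whenever the chain arrives at state 1; the power iteration and the ternary search are illustrative numerical choices, and the chain is not the MEMC of the running example.

```python
import math

# Hypothetical two-state ergodic chain; state 1 is read as "spike".
P = [[0.9, 0.1],
     [0.4, 0.6]]

def f(i, j):
    """Observable on transitions: 1 if the chain arrives at state 1."""
    return 1.0 if j == 1 else 0.0

def scgf(k, iters=200):
    """lambda_f(k) = ln rho(tilted P), with tilted_ij = P_ij * exp(k f(i, j))."""
    tilted = [[P[i][j] * math.exp(k * f(i, j)) for j in range(2)] for i in range(2)]
    v, rho = [0.5, 0.5], 1.0
    for _ in range(iters):                       # power iteration
        w = [sum(tilted[i][j] * v[j] for j in range(2)) for i in range(2)]
        rho = sum(w)
        v = [x / rho for x in w]
    return math.log(rho)

def rate_function(s, k_lo=-30.0, k_hi=30.0, iters=100):
    """Legendre transform max_k {k s - lambda_f(k)} by ternary search
    (the objective is concave in k)."""
    for _ in range(iters):
        m1 = k_lo + (k_hi - k_lo) / 3.0
        m2 = k_hi - (k_hi - k_lo) / 3.0
        if m1 * s - scgf(m1) < m2 * s - scgf(m2):
            k_lo = m1
        else:
            k_hi = m2
    k = 0.5 * (k_lo + k_hi)
    return k * s - scgf(k)

for s in (0.1, 0.2, 0.3, 0.5):
    print(s, rate_function(s))
```

For this chain the stationary probability of state 1 is 0.2, so the rate function should vanish at s = 0.2 and be positive elsewhere.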

5.5. Information Entropy Production

Consider a Markov chain

(X_{t} : t \geq 0)

on a finite state space E with transition matrix P, started from the distribution

ν

, and denote by

ν^{(n)}

the distribution of

X_{n}

, namely, for

i \in E

:

\begin{matrix} ν^{(n)} (i) = P_{ν} (X_{n} = i) . \end{matrix}

Obviously,

ν^{(0)} = ν

, and

\begin{matrix} ν_{j}^{(n + 1)} = \sum_{i \in E} ν_{i}^{(n)} P_{i j} . \end{matrix}

The information-theoretic entropy of the probability distribution

ν

at time n is given by

\begin{matrix} S_{n} (ν) : = - \sum_{i \in E} ν_{i}^{(n)} log ν_{i}^{(n)}, \end{matrix}

and the change of entropy over one time step is defined as

\begin{matrix} Δ S_{n} : = S_{n + 1} (ν) - S_{n} (ν) . \end{matrix}

A bit of algebra yields

Δ S_{n} = - \sum_{i, j \in E} ν_{j}^{(n)} P_{j i} log \frac{ν_{i}^{(n + 1)} P_{j i}}{ν_{i}^{(n)} P_{i j}} + \frac{1}{2} \sum_{i, j \in E} [ν_{j}^{(n)} P_{j i} - ν_{i}^{(n)} P_{i j}] log \frac{ν_{j}^{(n)} P_{j i}}{ν_{i}^{(n)} P_{i j}} .

The first term on the right hand side above is called information entropy flow and the second term information entropy production [12].
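The entropy balance can be followed step by step on a small example. This is a minimal sketch, assuming a hypothetical three-state chain with strictly positive entries; the production term is evaluated from its formula, while the flow is obtained here as the remainder ΔS_n minus production rather than from its own sum.

```python
import numpy as np

def entropy(nu):
    """Information-theoretic entropy -sum_i nu_i log nu_i."""
    nz = nu > 0
    return float(-np.sum(nu[nz] * np.log(nu[nz])))

def entropy_production(nu, P):
    """(1/2) sum_{i,j} [nu_j P_ji - nu_i P_ij] log(nu_j P_ji / (nu_i P_ij))."""
    flux = nu[:, None] * P  # flux[i, j] = nu_i P_ij, assumed strictly positive
    return float(0.5 * np.sum((flux.T - flux) * np.log(flux.T / flux)))

def one_step_balance(nu, P):
    """Return (Delta S_n, flow, production) for one time step."""
    nu_next = nu @ P
    delta_S = entropy(nu_next) - entropy(nu)
    production = entropy_production(nu, P)
    flow = delta_S - production  # remainder of the balance equation
    return delta_S, flow, production

# Hypothetical chain and initial distribution (not taken from the text).
P = np.array([[0.1, 0.6, 0.3],
              [0.2, 0.1, 0.7],
              [0.5, 0.3, 0.2]])
nu = np.array([0.7, 0.2, 0.1])

history = []
for n in range(60):
    history.append(one_step_balance(nu, P))
    nu = nu @ P  # nu^{(n+1)} = nu^{(n)} P
```

Along the iteration the production term stays non-negative, and once the distribution has relaxed the entropy change vanishes while the flow and the production cancel each other.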

In the stationary case, i.e., when P admits a stationary measure

π

and the chain is started from that distribution, one has that

ν^{(n)} = π

for every

n \geq 0

; thus, in this case, the entropy does not change over time, i.e., for stationary chains, the information entropy flow equals minus the information entropy production. This case is the focus of this work, where the chain describes spike train activity through transitions between L-blocks. Starting from stationarity, the information entropy production is explicitly given by

\begin{matrix} I E P (P, π) : = \frac{1}{2} \sum_{x, x^{'} \in S^{L}} [π (x^{'}) P_{x^{'} x} - π (x) P_{x x^{'}}] log \frac{π (x^{'}) P_{x^{'} x}}{π (x) P_{x x^{'}}} \geq 0 . \end{matrix}

(24)

The non-negativity implies that the information entropy production is strictly positive whenever the process violates the detailed balance conditions (4). This is analogous to the second law of thermodynamics [55]. Conversely, it is easy to see from Equation (24) that if the Markov chain satisfies the detailed balance condition, the information entropy production is zero.
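Equation (24) is straightforward to evaluate for small chains. The sketch below assumes two hypothetical three-state chains (neither is from the text): a birth-death chain, which satisfies detailed balance, and a biased cycle, which does not. It also assumes that P_ij > 0 whenever P_ji > 0, so that every logarithm is finite.

```python
import numpy as np

def stationary_distribution(P):
    """Left Perron eigenvector of P, normalised to sum to one."""
    w, V = np.linalg.eig(P.T)
    pi = np.real(V[:, np.argmax(np.real(w))])
    return pi / pi.sum()

def iep(P, pi):
    """Information entropy production of Equation (24)."""
    flux = pi[:, None] * P  # flux[i, j] = pi_i P_ij
    fwd, bwd = flux, flux.T
    mask = (fwd > 0) & (bwd > 0)  # pairs with no transition either way add 0
    return float(0.5 * np.sum((bwd[mask] - fwd[mask])
                              * np.log(bwd[mask] / fwd[mask])))

# Hypothetical reversible chain (birth-death structure).
P_rev = np.array([[0.50, 0.50, 0.00],
                  [0.25, 0.50, 0.25],
                  [0.00, 0.50, 0.50]])

# Hypothetical non-reversible chain (biased rotation 0 -> 1 -> 2 -> 0).
P_cyc = np.array([[0.1, 0.7, 0.2],
                  [0.2, 0.1, 0.7],
                  [0.7, 0.2, 0.1]])

iep_rev = iep(P_rev, stationary_distribution(P_rev))
iep_cyc = iep(P_cyc, stationary_distribution(P_cyc))
```

For the biased cycle the stationary distribution is uniform, and the sum in Equation (24) reduces to 0.5 ln(0.7/0.2), which the code reproduces; the reversible chain gives zero up to rounding.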

In Figure 5, we compute the information entropy production from Equation (24), for the MEMC of the example (Section 4.4) for different values of the parameters

h_{1}, h_{2}

.
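A computation of this kind can be sketched as follows. This is an illustration under assumptions, not the code behind Figure 5: it assumes N = 2 neurons and the energy H = h_1 x_0^1 x_1^2 + h_2 x_0^2 x_1^1 suggested by the constraints in the figure captions, builds P from the Perron eigenpair of the transfer matrix as in Section 5.4, and takes the invariant measure as π_i = u_i v_i / ⟨u, v⟩ with u the left Perron eigenvector. The helper names `memc` and `iep` are hypothetical.

```python
import numpy as np
from itertools import product

# Spike patterns of N = 2 neurons; each row is (x^1, x^2).
patterns = np.array(list(product([0, 1], repeat=2)), dtype=float)
x1, x2 = patterns[:, 0], patterns[:, 1]

def memc(h1, h2):
    """Transition matrix P and invariant measure pi from L = exp(H)."""
    # Assumed energy: H(i, j) = h1 * x^1(i) x^2(j) + h2 * x^2(i) x^1(j)
    H = h1 * np.outer(x1, x2) + h2 * np.outer(x2, x1)
    L = np.exp(H)
    w, V = np.linalg.eig(L)
    top = np.argmax(np.real(w))
    rho = float(np.real(w[top]))
    v = np.abs(np.real(V[:, top]))  # right Perron eigenvector
    wl, U = np.linalg.eig(L.T)
    u = np.abs(np.real(U[:, np.argmax(np.real(wl))]))  # left Perron eigenvector
    P = L * v[None, :] / (rho * v[:, None])
    pi = u * v / np.dot(u, v)
    return P, pi

def iep(P, pi):
    """Equation (24) for a chain with strictly positive transitions."""
    flux = pi[:, None] * P
    return float(0.5 * np.sum((flux.T - flux) * np.log(flux.T / flux)))

P_sym, pi_sym = memc(0.5, 0.5)     # h1 = h2: symmetric energy
P_asym, pi_asym = memc(1.0, -1.0)  # h1 != h2: asymmetric energy
iep_sym = iep(P_sym, pi_sym)
iep_asym = iep(P_asym, pi_asym)
```

Under these assumptions the energy is symmetric in its two arguments when h_1 = h_2, so detailed balance holds and the IEP vanishes, while unequal parameters give a strictly positive value.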

It may seem contradictory that in the stationary state the entropy is constant while there is a positive “production” of entropy. In the stationary state, however, the information entropy production is always compensated by the information entropy flow, which leaves the information entropy constant. Such states are referred to as non-equilibrium steady states (NESS).

5.6. Gallavotti-Cohen Fluctuation Theorem

To characterize the fluctuations of the IEP, consider the MEMC

μ (P, π)

and the following observable:

W_{n} (x_{0, n}) = ln (\frac{μ (x_{0, n})}{μ (x_{n, 0})}) .

It can be shown that

lim_{n \to \infty} \frac{1}{n} W_{n} = I E P (P, π)

. The Gallavotti-Cohen fluctuation theorem is a statement about symmetry properties of the SCGF and the rate function of the IEP [14]:

λ_{W} (k) = λ_{W} (- k - 1), I_{W} (s) = I_{W} (- s) - s .

(25)

This symmetry holds for a general class of stochastic processes including NESS from Markov chains [56], and is a universal property of the IEP, i.e., it is independent of the parameters of the MEMC. To compute

λ_{W} (k)

and

I_{W} (s)

, define

A {(k)}_{i j} = P_{i j} {[\frac{π_{i} P_{i j}}{π_{j} P_{j i}}]}^{k}

. If

ρ (k)

is the largest eigenvalue of

A (k)

, then

λ_{W} (k) = {lim}_{n \to \infty} \frac{1}{n} ln E (e^{k W_{n}}) = ln ρ (k)

.

In Figure 6, we illustrate the Gallavotti-Cohen symmetry property of the large deviation functions associated to the IEP (Equation (25)).
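The symmetry of the SCGF can be verified numerically from the matrix A(k) defined above. This is a minimal sketch on a hypothetical three-state biased cycle (not the MEMC used for Figure 6), assuming strictly positive transition probabilities so that A(k) is well defined for every k.

```python
import numpy as np

def stationary_distribution(P):
    """Left Perron eigenvector of P, normalised to sum to one."""
    w, V = np.linalg.eig(P.T)
    pi = np.real(V[:, np.argmax(np.real(w))])
    return pi / pi.sum()

def scgf_W(P, pi, k):
    """ln of the largest eigenvalue of A(k)_ij = P_ij (pi_i P_ij / (pi_j P_ji))^k."""
    flux = pi[:, None] * P
    A = P * (flux / flux.T) ** k
    return float(np.log(np.max(np.abs(np.linalg.eigvals(A)))))

# Hypothetical non-reversible chain.
P = np.array([[0.1, 0.7, 0.2],
              [0.2, 0.1, 0.7],
              [0.7, 0.2, 0.1]])
pi = stationary_distribution(P)

ks = np.linspace(-2.0, 1.0, 13)
lam = np.array([scgf_W(P, pi, k) for k in ks])
lam_mirror = np.array([scgf_W(P, pi, -k - 1.0) for k in ks])

# The slope of the SCGF at k = 0 should match the IEP of Equation (24).
eps = 1e-4
slope_at_zero = (scgf_W(P, pi, eps) - scgf_W(P, pi, -eps)) / (2 * eps)
flux = pi[:, None] * P
iep_value = float(0.5 * np.sum((flux.T - flux) * np.log(flux.T / flux)))
```

The symmetry is exact here because A(-k-1) is similar to the transpose of A(k), through conjugation with the diagonal matrix of π, so both matrices share the same spectrum.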

These properties are relevant to the large deviations of the averaged entropy production denoted

\frac{W_{t}}{t}

over a trajectory

x_{0, t - 1}

of the Markov chain

p (π, P)

. The following relationship holds,

\frac{p \{\frac{W_{t}}{t} \approx s\}}{p \{\frac{W_{t}}{t} \approx - s\}} ≍ e^{t s} .

This means that the positive fluctuations of

\frac{W_{t}}{t}

are exponentially more probable than negative fluctuations of equal magnitude.
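The ratio above can be recovered from the rate-function symmetry in Equation (25). A brief sketch of that step, assuming the large deviation form p{W_t/t ≈ s} ≍ e^{-t I_W(s)} for the averaged entropy production:

```latex
\frac{p\{W_t/t \approx s\}}{p\{W_t/t \approx -s\}}
  \asymp \frac{e^{-t I_W(s)}}{e^{-t I_W(-s)}}
  = e^{t\,[I_W(-s) - I_W(s)]}
  = e^{t s},
\qquad \text{since } I_W(-s) - I_W(s) = s \text{ by (25)}.
```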

5.7. Linear Response

The linear response serves to quantify how a small perturbation

δ h

of the maximum entropy parameters affects the average values of observables in terms of the unperturbed measure. This is relevant in the context of spike train statistics to identify stiff and sloppy directions in the space of parameters. A small change in a sloppy parameter has very little impact on the statistical model. In contrast, a small change in a stiff parameter produces a significant change. For a MEMC characterized by

μ = (P, π)

corresponding to an energy function with fixed parameters

h

denoted by

H_{h}

, one can obtain the average value of a given observable

f_{k}

from (16).

Now, consider a perturbed energy denoted by

\tilde{H} = H_{h + δ h}

. Using a Taylor expansion, the average value of an arbitrary observable

f_{k}

with respect to the MEMC

\tilde{μ} = (\tilde{P}, \tilde{π})

associated with the perturbed energy can be obtained. Considering the Taylor expansion of

F [H_{h + δ h}]

about

H_{h}

\begin{matrix} \frac{\partial F [H_{h + δ h}]}{\partial h_{k}} & = & \frac{\partial F [H_{h}]}{\partial h_{k}} + \sum_{j} \frac{\partial^{2} F [H_{h}]}{\partial h_{k} \partial h_{j}} δ h_{j} + O ({∥ δ h ∥}^{2}), \end{matrix}

(26)

\begin{matrix} E_{\tilde{μ}} [f_{k}] & = & E_{μ} [f_{k}] + \sum_{j} \frac{\partial^{2} F [H_{h}]}{\partial h_{k} \partial h_{j}} δ h_{j} + O ({∥ δ h ∥}^{2}), \end{matrix}

(27)

\begin{matrix} Δ E [f] & \approx & χ \cdot δ h . \end{matrix}

(28)

We use (16) to go from (26) to (27). Observe from (27) that a small perturbation of a parameter

h_{j}

influences the average value of all other observables in the energy function (as

f_{k}

is arbitrary). The perturbation is modulated by the second derivatives of the free energy corresponding to the unperturbed regime

F [H_{h}]

(see Figure 7).
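The comparison between the exact change in the averages and the linear prediction of Equation (28) can be sketched numerically. This is an illustration under assumptions, not the procedure behind Figure 7: it assumes N = 2 neurons, the two observables x_0^1 x_1^2 and x_0^2 x_1^1, a free energy F = ln ρ(L) with L = e^H, and averages obtained as first derivatives of F as in (16). All derivatives are approximated by central finite differences; the parameter values `h` and `dh` are hypothetical.

```python
import numpy as np
from itertools import product

# Spike patterns of N = 2 neurons; each row is (x^1, x^2).
patterns = np.array(list(product([0, 1], repeat=2)), dtype=float)
x1, x2 = patterns[:, 0], patterns[:, 1]

# features[0][i, j] = x^1(i) x^2(j), features[1][i, j] = x^2(i) x^1(j)
features = np.array([np.outer(x1, x2), np.outer(x2, x1)])

def free_energy(h):
    """F[H_h] = ln of the largest eigenvalue of the transfer matrix exp(H_h)."""
    H = np.tensordot(h, features, axes=1)
    L = np.exp(H)
    return float(np.log(np.max(np.abs(np.linalg.eigvals(L)))))

def averages(h, eps=1e-5):
    """E_mu[f_k] approximated as dF/dh_k by central differences."""
    g = np.zeros(len(h))
    for k in range(len(h)):
        e = np.zeros(len(h))
        e[k] = eps
        g[k] = (free_energy(h + e) - free_energy(h - e)) / (2 * eps)
    return g

def susceptibility(h, eps=1e-4):
    """chi_kj approximated as d^2 F / dh_k dh_j by central differences."""
    n = len(h)
    chi = np.zeros((n, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = eps
        chi[:, j] = (averages(h + e) - averages(h - e)) / (2 * eps)
    return chi

h = np.array([0.3, -0.2])     # hypothetical unperturbed parameters
dh = np.array([0.01, -0.02])  # hypothetical small perturbation

forward = averages(h + dh) - averages(h)  # change computed directly
chi = susceptibility(h)
linear = chi @ dh                         # prediction of Equation (28)
```

The two results agree up to terms of second order in the perturbation, and the off-diagonal entries of `chi` show how perturbing one parameter shifts the average of the other observable.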

6. Discussion and Future Work

This tutorial explores how one can use maximum entropy methods to capture asymmetric temporal aspects of spike trains from experimental data. In particular, we showed how spatio-temporal constraints can produce homogeneous irreducible Markov chains whose unique steady state is, in general, a non-equilibrium steady state (NESS): the detailed balance condition is not satisfied, which results in strictly positive entropy production. This fact highlights that only non-synchronous maximum entropy models induce time-irreversible processes, and time irreversibility is one of the key hallmarks of biological systems.

We have presented a survey of diverse techniques from mathematics and statistical mechanics to study these NESS, which together form a rich toolkit that can be employed to study unexplored aspects of spike train statistics. We emphasise that many of these concepts, including entropy production and fluctuation-dissipation relationships, have not been explored much in the context of spike train analysis. However, the fact that time irreversibility is such an important feature of living systems suggests that these notions might play an important role in neural dynamics.

Possible extensions include measuring the entropy production for different choices of spatio-temporal constraints using the maximum entropy method on biological spike train recordings. A more ambitious extension is to compute the entropy production from experimental data obtained from different physiological processes and to relate it to features such as adaptation or learning. Concerning time-dependent neuronal network models, future studies might lead to a better understanding of the impact of particular synaptic topologies of neuronal network models on the corresponding entropy production, decay of correlations, resonances and other sophisticated statistical properties.

Other possible extensions are related to the drawbacks of current approaches. These include limitations of the maximum entropy method related to the requirement of stationarity in the data, which is not a natural condition for some biological scenarios. However, several of the techniques presented in this tutorial naturally extend to the non-stationary case, including the information entropy production, which can still be defined along non-stationary trajectories [14]. A related issue is that the approach presented in this tutorial does not make any reference to the stimulus. While this issue has been addressed in the synchronous framework [34], the Markovian extension of these ideas remains an open field to explore. Another interesting topic for future studies is the inclusion of non-stationary approaches such as the state-space analysis proposed in [35]. Another open problem is the efficient implementation of the transfer matrix technique, which currently requires a considerable computational effort in the case of large neural networks. Recently, some improvements of this approach have been proposed based on Monte Carlo methods [57].

In summary, we believe that these topics are fertile ground for multi-disciplinary exploration by teams composed of mathematicians, physicists, and neuroscientists. It is our hope that this work may foster future collaborative research among disciplines, which might bring new breakthroughs to advance our fundamental understanding of how the brain works.

Author Contributions

The three authors conceived the main ideas and concepts, wrote and revised the manuscript. All authors have read and approved the final manuscript.

Funding

L.V. was supported by CONICYT-Beca de Doctorado No. 21170406 Convocatoria 2017. F.R. was supported by the Ad Astra Chandaria Foundation. R.C. was supported by CONICYT-PAI Inserción 79160120, Proyecto REDES ETAPA INICIAL, Convocatoria 2017 REDI170457, Fondecyt Iniciación 2018 Proyecto 11181072.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

MEP	Maximum entropy principle
MEMC	Maximum entropy Markov chain
SCGF	Scaled cumulant generating function
CLT	Central limit theorem
LLN	Law of large numbers
LDP	Large deviation principle
IEP	Information entropy production
KSE	Kolmogorov-Sinai entropy
NESS	Non-equilibrium steady states

Symbol list

$S$	$\{0, 1\}^{N}$, the state space of spike patterns of N neurons
$Ω$	The set of infinite sequences of spike patterns
$x_{n}^{k}$	Spiking state of neuron k at time n
$x_{n}$	Spike pattern at time n
$x_{t_{1}, t_{2}}$	Spike block from time $t_{1}$ to $t_{2}$
$ν (f)$	Expectation of the observable f w.r.t. the probability measure $ν$
$A_{T} (f)$	Empirical average value of the observable f considering T spike patterns
$S^{R}$	Space of spike blocks of N neurons and length R
$S [μ]$	Entropy of the probability measure $μ$
$H$	Energy function
$F [H]$	Free energy

References

1. Rieke, F.; Warland, D.; de Ruyter van Steveninck, R.; Bialek, W. Spikes, Exploring the Neural Code; M.I.T. Press: Cambridge, MA, USA, 1996.
2. Bialek, W. Biophysics: Searching for Principles; Princeton University Press: Princeton, NJ, USA, 2012.
3. Schneidman, E.; Berry, M.J.; Segev, R.; Bialek, W. Weak pairwise correlations imply strongly correlated network states in a neural population. Nature 2006, 440, 1007–1012.
4. Ganmor, E.; Segev, R.; Schneidman, E. Sparse low-order interaction network underlies a highly correlated and learnable neural population code. Proc. Natl. Acad. Sci. USA 2011, 108, 9679–9684.
5. Tkačik, G.; Marre, O.; Amodei, D.; Schneidman, E.; Bialek, W.; Berry, M.J. Searching for collective behavior in a large network of sensory neurons. PLoS Comput. Biol. 2014, 10, e1003408.
6. Palsson, B. Systems Biology: Properties of Reconstructed Networks; Cambridge University Press: Cambridge, UK, 2006.
7. Tang, A.; Jackson, D.; Hobbs, J.; Chen, W.; Smith, J.; Patel, H.; Prieto, A.; Petrusca, D.; Grivich, M.; Sher, A.; et al. A maximum entropy model applied to spatial and temporal correlations from cortical networks in vitro. J. Neurosci. 2008, 28, 505–518.
8. Marre, O.; El Boustani, S.; Frégnac, Y.; Destexhe, A. Prediction of spatiotemporal patterns of neural activity from pairwise correlations. Phys. Rev. Lett. 2009, 102, 138101.
9. Vasquez, J.; Palacios, A.; Marre, O.; Berry, M., II; Cessac, B. Gibbs distribution analysis of temporal correlation structure on multicell spike trains from retina ganglion cells. J. Physiol. Paris 2012, 106, 120–127.
10. Mora, T.; Deny, S.; Marre, O. Dynamical criticality in the collective activity of a population of retinal neurons. Phys. Rev. Lett. 2015, 114, 078105.
11. Cofré, R.; Cessac, B. Exact computation of the maximum entropy potential of spiking neural networks models. Phys. Rev. E 2014, 89, 052117.
12. Cofré, R.; Maldonado, C. Information entropy production of maximum entropy Markov chains from spike trains. Entropy 2018, 20, 34.
13. Schulman, L.S. Time’s Arrows and Quantum Measurement; Cambridge University Press: Cambridge, UK, 1997.
14. Jiang, D.Q.; Qian, M.; Qian, M.P. Mathematical Theory of Non-Equilibrium Steady States; Springer: Berlin/Heidelberg, Germany, 2004.
15. Schrödinger, E. What Is Life? The Physical Aspect of the Living Cell; Cambridge University Press: Cambridge, UK, 1944.
16. Prigogine, I. Nonequilibrium Statistical Mechanics; Monographs in Statistical Physics; Interscience publishers, John Wiley & Sons: Hoboken, NJ, USA, 1962.
17. Deem, M. Mathematical adventures in biology. Phys. Today 2007, 60, 42–47.
18. Filyukov, A.; Karpov, V. Description of steady transport processes by the method of the most probable path of evolution. Inzhenerno-Fizicheskii Zhurnal 1967, 13, 624–630.
19. Filyukov, A.; Karpov, V. Method of the most probable path of evolution in the theory of stationary irreversible processes. Inzhenerno-Fizicheskii Zhurnal 1967, 13, 798–804.
20. Favretti, M. The maximum entropy rate description of a thermodynamic system in a stationary non-equilibrium state. Entropy 2009, 4, 675–687.
21. Monthus, C. Non-equilibrium steady states: maximization of the Shannon entropy associated with the distribution of dynamical trajectories in the presence of constraints. J. Stat. Mech. Theor. Exp. 2011, 3, P03008.
22. Shi, P.; Qian, H. Frontiers in Computational and Systems Biology; Feng, J., Fu, W., Sun, F., Eds.; Springer: London, UK, 2010; chapter Irreversible Stochastic Processes, Coupled Diffusions and Systems Biochemistry; pp. 175–201.
23. Galves, A.; Löcherbach, E. Infinite systems of interacting chains with memory of variable length-A stochastic model for biological neural nets. J. Stat. Phys. 2013, 151, 896–921.
24. Cofré, R.; Cessac, B. Dynamics and spike trains statistics in conductance-based Integrate-and-Fire neural networks with chemical and electric synapses. Chaos Solitons Fractals 2013, 50, 13–31.
25. Halmos, P.R. Measure Theory; Graduate Texts in Mathematics; Springer: New York, NY, USA, 1974.
26. Levin, D.; Peres, Y. Markov Chains and Mixing Times, 2nd ed.; American Mathematical Society: Providence, RI, USA, 2017.
27. Gerstner, W.; Kistler, W. Spiking Neuron Models; Cambridge University Press: Cambridge, UK, 2002.
28. Jones, G.L. On the Markov chain central limit theorem. Probab. Surv. 2004, 1, 299–320.
29. Ellis, R. Entropy, Large Deviations and Statistical Mechanics; Springer: Berlin, Germany, 1985.
30. Touchette, H. The large deviation approach to statistical mechanics. Phys. Rep. 2009, 478, 1–69.
31. Dembo, A.; Zeitouni, O. Large deviations techniques and applications. In Stochastic Modelling and Applied Probability; Springer: Berlin, Germany, 2010; Volume 38.
32. Marre, O.; Amodei, D.; Deshmukh, N.; Sadeghi, K.; Soo, F.; Holy, T.; Berry, M., II. Mapping a complete neural population in the Retina. J. Neurosci. 2012, 43, 14859–14873.
33. Tkačik, G.; Mora, T.; Marre, O.; Amodei, D.; Berry, M., II; Bialek, W. Thermodynamics for a network of neurons: Signatures of criticality. Proc. Natl. Acad. Sci. USA 2015, 112, 11508–11513.
34. Granot-Atedgi, E.; Tkačik, G.; Segev, R.; Schneidman, E. Stimulus-dependent maximum entropy models of neural population codes. PLoS Comput. Biol. 2013, 9, e1002922.
35. Shimazaki, H.; Amari, S.; Brown, E.N.; Grün, S. State-space analysis of time-varying higher-order spike correlation for multiple neural spike train data. PLoS Comput. Biol. 2012, 8, e1002385.
36. Dewar, R. Information theory explanation of the fluctuation theorem, maximum entropy production and self-organized criticality in non-equilibrium stationary states. J. Phys. A Math. Gen. 2003, 36, 631.
37. Dewar, R. Maximum entropy production and the fluctuation theorem. J. Phys. A Math. Gen. 2005, 38, L371.
38. Martyushev, L.; Seleznev, V. Maximum entropy production principle in physics, chemistry and biology. Phys. Rep. 2006, 426, 1–45.
39. Jaynes, E. The minimum entropy production principle. Ann. Rev. Phys. Chem. 1980, 31, 579–601.
40. Pressé, S.; Ghosh, K.; Lee, J.; Dill, K.A. Principles of maximum entropy and maximum caliber in statistical physics. Rev. Mod. Phys. 2013, 85, 1115.
41. Ruelle, D. Thermodynamic Formalism; Addison-Wesley: Reading, MA, USA, 1978.
42. Jaynes, E. Information theory and statistical mechanics. Phys. Rev. 1957, 106, 620.
43. Tkačik, G.; Marre, O.; Mora, T.; Amodei, D.; Berry, M., II; Bialek, W. The simplest maximum entropy model for collective behavior in a neural network. J. Stat. Mech. 2013, 2013, P03011.
44. Seneta, E. Non-Negative Matrices and Markov Chains; Springer: New York, NY, USA, 2006.
45. Walters, P. Ruelle’s operator theorem and g-measures. Trans. Am. Math. Soc. 1975, 214, 375–387.
46. Bowen, R. Equilibrium States and the Ergodic Theory of Anosov Diffeomorphisms. In Lecture Notes in Mathematics, revised ed.; Springer: Berlin, Germany, 2008; Volume 470.
47. Parry, W.; Pollicott, M. Zeta functions and the periodic orbit structure of hyperbolic dynamics. Astérisque, Société mathématique de France 1990, 187–188. Available online: http://www.numdam.org/issue/AST_1990__187-188__1_0.pdf (accessed on 11 September 2019).
48. Chazottes, J. Fluctuations of observables in dynamical systems: From limit theorems to concentration inequalities. In Nonlinear Dynamics New Directions; González-Aguilar, H., Ugalde, E., Eds.; Springer: Cham, Switzerland, 2015; pp. 47–85.
49. Gaspard, P. Chaos, Scattering and Statistical Mechanics; Non-Linear Science series; Cambridge University Press: Cambridge, UK, 1998; Volume 9.
50. Bettolo, U.M.; Puglisi, A.; Rondoni, L.; Vulpiani, A. Fluctuation–dissipation: Response theory in statistical physics. Phys. Rep. 2008, 461, 111–195.
51. Gaspard, P. Random paths and current fluctuations in nonequilibrium statistical mechanics. J. Math. Phys. 2014, 55, 075208.
52. Cofré, R.; Maldonado, C.; Rosas, F. Large deviations properties of maximum entropy Markov chains from spike trains. Entropy 2018, 20, 573.
53. Touchette, H. A Basic Introduction to Large Deviations: Theory, Applications, Simulations. arXiv 2012, arXiv:1106.4146v3.
54. Ellis, R.S. The theory of large deviations and applications to statistical mechanics. In Long-Range Interacting Systems; Oxford University Press: Oxford, UK, 2010.
55. Nicolis, G.; Nicolis, C. Foundations of Complex Systems: Emergence, Information and Prediction; World Scientific: Singapore, 2012.
56. Maes, C. The fluctuation theorem as a Gibbs property. J. Stat. Phys. 1999, 95, 367–392.
57. Nasser, H.; Cessac, B. Parameter estimation for spatio-temporal maximum entropy distributions: Application to neural spike trains. Entropy 2014, 16, 2244–2277.

Figure 1. Illustration of a spike train, a spiking state and spike pattern. The time bin size

Δ t_{b}

determines the binary patterns.

Figure 2. Plot of the auto-correlation of the observable

x_{0}^{2} x_{1}^{1}

with respect to the MEMC consistent with constraints

〈 x_{0}^{1} x_{1}^{2} 〉 = 0.1

and

〈 x_{0}^{2} x_{1}^{1} 〉 = 0.3

. The plot shows the sum of Equation (18) from

r = 1

up to the number in the abscissa. Note the fast convergence towards

χ_{22}

.

Figure 3. Auto-correlations of the observable

x_{0}^{2} x_{1}^{1}

for the MEMC with the same parameters as in Figure 1. Modulations in the decay of correlations are due to the complex eigenvalues of the MEMC.

Figure 4. Rate functions of observables

x_{0}^{1} x_{1}^{2}

in red, and

x_{0}^{2} x_{1}^{1}

in blue for the MEMC consistent with constraints

〈 x_{0}^{1} x_{1}^{2} 〉 = 0.1

and

〈 x_{0}^{2} x_{1}^{1} 〉 = 0.3

. The minimum of each rate function is attained at the expected value of the corresponding observable with respect to the MEMC. Around the minimum, Gaussian fluctuations are expected (9); large deviations correspond to values far from the expected value.

Figure 5. IEP for the MEMC of the example (Section 4.4) for different values of parameters

h_{1}, h_{2}

. Observe that

I E P (P, π) = 0

when

h_{1} = h_{2}

and that it increases as the parameters become more different (more asymmetry in P).

Figure 6. Gallavotti-Cohen symmetry property for the SCGF and rate function of the IEP (Equation (25)). Left: SCGF of the IEP of the MEMC with the same parameters considered in the previous examples. Right: Rate function of the observable W; the minimum is attained at the expected value of the IEP.

Figure 7. Linear response for the MEMC of the example (Section 4.4) for different values of perturbations

δ h_{1}

and

δ h_{2}

. The colors represent

∥ E_{\tilde{μ}} [f_{k}] - E_{μ} [f_{k}] ∥

computed using two methods. The “forward” method consists in computing

E_{\tilde{μ}} [f_{k}]

from

\tilde{μ}

and

E_{μ} [f_{k}]

from

μ

. The figure in the middle is obtained by computing

∥ E_{\tilde{μ}} [f_{k}] - E_{μ} [f_{k}] ∥

from

χ

using Equation (28). (Right) The difference between both methods illustrated in a scatter plot in logarithmic scale.

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Cofré, R.; Videla, L.; Rosas, F. An Introduction to the Non-Equilibrium Steady States of Maximum Entropy Spike Trains. Entropy 2019, 21, 884. https://doi.org/10.3390/e21090884
