Information Bottleneck Approach to Predictive Inference

This paper synthesizes a recent line of work on automated predictive model making inspired by Rate-Distortion theory, in particular by the Information Bottleneck method. Predictive inference is interpreted as a strategy for efficient communication. The relationship to thermodynamic efficiency is discussed. The overall aim of this paper is to explain how this information theoretic approach provides an intuitive, overarching framework for predictive inference.


Introduction
"The fundamental problem of scientific progress, and a fundamental one of every day life, is that of learning from experience. Knowledge obtained in this way is partly merely description of what we have already observed, but partly consists of making inferences from past experience to predict future experience. This part may be called generalization. It is the most important part; events that are merely described and have no apparent relation to others may as well be forgotten, and in fact usually are." Sir Harold Jeffreys [1].
Predictive inference lies at the heart of science, as one of our fundamental cognitive tools for discovery-making. The general idea is that a good model should have predictive power, while not being overly complicated. This notion is deeply rooted in our culture, and has been made specific in a variety of different technical approaches in modern times, e.g., [1-5]. Complexity measures abound [6], as do utility functions that prescribe which aspects of the data are considered relevant. In fact, generalization is a core issue in statistics and machine learning [3]. There, one often analyzes a "bag of data". Time stamps are either not important or simply not available, as data may stem from experiments designed to reveal a relationship between independent and dependent variables. However, one of the most fundamental assumptions we typically make about physical reality is that of the progression of time, and a causal structure of spacetime (see, e.g., [7,8]). Natural systems, and living systems in particular, are usually thought of as dynamical systems embedded in an environment in which they interact with other dynamical systems. Observations of dynamical systems typically yield time series data, containing the information necessary to understand the underlying physics of the observed system [9,10]. This paper thus focusses on the analysis of time series data.
Setting aside intricate philosophical arguments, let us take a simple information theoretic view that makes minimal assumptions about the process underlying the generation of the data. Motivated by the fact that any model constructed from the data will produce, in some form, a summary of past observations, let us try to make that summary efficient in the sense that it shall contain, to the greatest extent possible, information that can be used to predict future data without keeping irrelevant detail that would lead to unnecessary model complexity.
Formulated this way, it is clear that the Information Bottleneck (IB) [11] method provides an excellent framework for predictive inference, because it is a lossy compression scheme that finds that summary of the data which contains maximal relevant information at a fixed level of allowed detail. The data to be compressed, or summarized, are past experiences, and the summary should be useful for predicting future experiences. We can thus identify relevant information as information about future data [12-20]. The use of a model that is not overly complex is related to the goal of compression in communication. How much detail the model contains is measured by the information that is kept about the past, i.e., the memory contained in the summary. Irrelevant, nonpredictive information has to be discarded.
The IB method then allows one to find, for any given amount of retained memory, that representation that has the largest predictive power by means of containing the largest amount of information about future experiences or, conversely, the most compact model with a desired level of predictive power.
Direct application of the IB method (Section 2) compresses past trajectories of a given length to predict future trajectories of a given length. While limited, this approach already has interesting applications and is related to other data analysis methods: to the causal state partition [21], as shown in [16], and, under restrictive assumptions [18], to slow feature analysis [22].
Going beyond the original IB framework, a dynamical learning paradigm was established in [14] (Section 4). Using this approach, feedback from the learner to the environment can be treated naturally, and connections to reinforcement learning can be made [17] (Section 4.1). The scope of the present paper is, however, limited to passive predictive inference, without feedback (Section 4.2). The ε-machine [21], known as an optimal predictor [23], is a limiting case of this more general approach (Section 4.3).
The dynamical approach taken in [14] allows for a thermodynamic treatment of computation and predictive inference away from thermodynamic equilibrium [24] (Section 3). Results may point towards a physical reason for the philosophy underlying predictive inference.

Information Theoretic Treatment of Prediction
The dynamics of complex systems can be viewed as transforming, deleting, and creating information, i.e., they are performing computations. A body of work has argued this, and has made use of predictive information as a general measure for temporal correlations (e.g., [20,21,25-32] and refs. therein), since it gives a fuller description than the correlation function [33].
Let the value of a signal obtained at time $t$ be denoted by $x_t$. In general, this could be a concatenation of several measured quantities at time $t$. Then the (Shannon) mutual information [34],
$$I[x_t; x_{t'}] = \left\langle \log\left[\frac{p(x_t, x_{t'})}{p(x_t)\,p(x_{t'})}\right] \right\rangle_{p(x_t, x_{t'})},$$
measures how much an observation at one instant can tell us about an observation at another instant in time [25]. (Throughout the paper, information is measured in nats, for convenience, and log denotes the natural logarithm. Brackets, $\langle \cdot \rangle_{p(\cdot)}$, denote averages, with subscripts denoting the probability distribution, $p$, over which the average is taken [35].) Now, let $t' = t + \Delta t$, where $\Delta t$ is a finite interval, discretizing time. For simplicity of the exposition, let us rescale, such that $\Delta t = 1$. The instantaneous predictive information, $I[x_t; x_{t+1}]$, suffices to describe the complete causal connections in the observed time series only in cases where $x_t$ contains a complete measurement of a dynamical system, such that the dynamics are first-order (either Hamiltonian or governed by a first-order Markov process) [26]. In the general case, if $x_t$ does not fully determine the underlying state of the observed dynamical system, then information between the extended past, $\overleftarrow{x}_t$, which collects the observations up to and including time $t$, and future, $\overrightarrow{x}_t = (x_{t+\Delta t}, \ldots, x_{\tau_F})$, trajectories should be considered ($\tau_P$ and $\tau_F$ parameterize the length of past and future trajectories, respectively, and here we set $\Delta t = 1$):
$$I[\overleftarrow{x}_t; \overrightarrow{x}_t] = \left\langle \log\left[\frac{p(\overleftarrow{x}_t, \overrightarrow{x}_t)}{p(\overleftarrow{x}_t)\,p(\overrightarrow{x}_t)}\right] \right\rangle_{p(\overleftarrow{x}_t, \overrightarrow{x}_t)}. \tag{1}$$
In the limit, $\tau_F \to \infty$, the growth of predictive information as a function of $\tau_P$ reveals the complexity of the underlying physical process [29]. It has been argued that predictive information reveals the causal structure of the physical process that generated the observed signal [21,26,28,36]. This approach has been useful in dynamical systems theory and chaos theory, as well as neuroscience and machine learning ([14,16,17,20,21,23,28,29,36,37] and refs. therein). Recently, predictive information was proposed as "a universal order parameter to study phase transitions in physical systems" [38].
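For discrete data, the instantaneous predictive information can be estimated by replacing probabilities with normalized counts. The following is a minimal sketch under that assumption; the function name and the toy signal are illustrative and do not come from the cited works, and the plug-in estimate is biased for short series.

```python
from collections import Counter
from math import log

def instantaneous_predictive_information(series):
    """Plug-in estimate of I[x_t; x_{t+1}] in nats from a discrete sequence."""
    pairs = list(zip(series[:-1], series[1:]))
    n = len(pairs)
    joint = Counter(pairs)                    # counts of (x_t, x_{t+1})
    left = Counter(a for a, _ in pairs)       # counts of x_t
    right = Counter(b for _, b in pairs)      # counts of x_{t+1}
    info = 0.0
    for (a, b), c in joint.items():
        p_ab = c / n
        # p(a,b) / (p(a) p(b)) written with raw counts
        info += p_ab * log(p_ab * n * n / (left[a] * right[b]))
    return info

# Toy signal: strictly alternating symbols, so x_t determines x_{t+1}
# and the estimate is close to log(2) nats.
alternating = [0, 1] * 500
print(instantaneous_predictive_information(alternating))
```

A constant signal yields zero, since nothing can be learned about the next symbol that is not already known.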
The information theoretic approach to predictive inference taken in the present paper is based on a simple extension of predictive information (Eq. 1) to include a (learning) system that receives the signal, $x$, as an input. At time $t$, the state of the system, $s_t$, contains memory about the past trajectory of the environment, $\overleftarrow{x}_t$, quantified by the mutual information:
$$I[s_t; \overleftarrow{x}_t] = \left\langle \log\left[\frac{p(s_t|\overleftarrow{x}_t)}{p(s_t)}\right] \right\rangle_{p(s_t, \overleftarrow{x}_t)}. \tag{2}$$
If the external signal has a causal structure and temporal correlations, that is, if the predictive information, $I[\overleftarrow{x}_t; \overrightarrow{x}_t]$, is nonzero, then part of this memory may be information about the future:
$$I[s_t; \overrightarrow{x}_t] = \left\langle \log\left[\frac{p(\overrightarrow{x}_t|s_t)}{p(\overrightarrow{x}_t)}\right] \right\rangle_{p(s_t, \overrightarrow{x}_t)}. \tag{3}$$
If the probabilistic map that determines the state $s_t$ from the input data can be chosen freely, then there is the potential for predictive filtering. The model then contains predictions encoded in the distributions, $p(\overrightarrow{x}_t|s_t)$, and the predictive bits retained (Equation 3) quantify the predictive power of the model. This paper discusses different formalizations of learning systems. Direct application of the IB method (Section 2) results in the construction of a map from past trajectories to system states, $p(s_t|\overleftarrow{x}_t)$.
Alternatively, the learning system can be viewed as a dynamical system itself (Section 4), coupled to the environment. Then, a recursive information bottleneck algorithm (RIB) can be used to find optimally predictive dynamics (Sections 4.2 and 4.3). When the physical reality of the system is taken into account, energetic considerations factor in (Section 3). Finally, there can be feedback from the learning system to the environment, endowing the system with some level of control (Section 4.1). This approach describes interactive learning. It goes beyond passive predictive inference (and therefore beyond the scope of the present paper) and extends the IB framework to a feedback situation.
The material in this paper is a synthesis of results that have been presented at conferences or published elsewhere [14-17,24], together with some new material. Section 2 makes use of material from [15,16], Section 3 uses material from [24], and Section 4 uses material from [14,17].

Direct Use of Information Bottleneck for Predictive Inference
Compression of past trajectories results in a model that contains a map from $\overleftarrow{x}_t$ to the summary variable, $s_t$, which we can think of as denoting the state of a learning machine. Another function maps from $s_t$ to future trajectories. Both of these functions can, in general, be probabilistic maps. A model therefore contains the probability distributions $p(s_t|\overleftarrow{x}_t)$ and $p(\overrightarrow{x}_t|s_t)$. In theory, past and future trajectories could be infinite.
Let us measure the model's complexity by the amount of information that the state contains about the data it summarizes, that is, the mutual information [34] $I[s_t; \overleftarrow{x}_t]$. The predictive power of the model is given by the information that the model captures about future experiences, $I[s_t; \overrightarrow{x}_t]$. One then looks for an assignment of pasts to states that results in a model with maximal predictive power at fixed memory. Formally, this is done by solving the constrained optimization problem [11,16]:
$$\max_{p(s_t|\overleftarrow{x}_t)} \left( I[s_t; \overrightarrow{x}_t] - \lambda\, I[s_t; \overleftarrow{x}_t] \right),$$
where $\lambda$ is a Lagrange multiplier controlling the trade-off between model complexity and predictive power [11,15,16]. Importantly, when the past is known, then the probability of the future does not depend on knowledge of the state, i.e., $p(\overrightarrow{x}_t|\overleftarrow{x}_t, s_t) = p(\overrightarrow{x}_t|\overleftarrow{x}_t)$. This optimization problem is thus identical to the information bottleneck problem: past trajectories, $\overleftarrow{x}_t$, are compressed, such that information about the future, $\overrightarrow{x}_t$, is kept, i.e., predictive information is relevant information. The optimization is, of course, equivalent to Shannon's rate-distortion theory [34], with the relative entropy, or Kullback-Leibler divergence [39], $D_{KL}[p(\overrightarrow{x}_t|\overleftarrow{x}_t) \,\|\, p(\overrightarrow{x}_t|s_t)]$, used as a distortion function [15] measuring the predictability loss one encounters in replacing the trajectory, $\overleftarrow{x}_t$, by the summary, $s_t$. For each value of $\lambda$, this optimization results in an optimal probabilistic assignment of past trajectories to model states, i.e., IB finds a family of optimal models [11], parameterized by $\lambda$. Each obeys the self-consistent equations [11,16]:
$$p(s_t|\overleftarrow{x}_t) = \frac{p(s_t)}{Z(\overleftarrow{x}_t)} \exp\left(-\frac{1}{\lambda} D_{KL}[p(\overrightarrow{x}_t|\overleftarrow{x}_t) \,\|\, p(\overrightarrow{x}_t|s_t)]\right), \tag{5}$$
$$p(\overrightarrow{x}_t|s_t) = \frac{1}{p(s_t)} \left\langle p(\overrightarrow{x}_t|\overleftarrow{x}_t)\, p(s_t|\overleftarrow{x}_t) \right\rangle_{p(\overleftarrow{x}_t)}, \tag{6}$$
$$p(s_t) = \left\langle p(s_t|\overleftarrow{x}_t) \right\rangle_{p(\overleftarrow{x}_t)}, \tag{7}$$
where $Z(\overleftarrow{x}_t)$ ensures normalization:
$$Z(\overleftarrow{x}_t) = \sum_{s_t} p(s_t) \exp\left(-\frac{1}{\lambda} D_{KL}[p(\overrightarrow{x}_t|\overleftarrow{x}_t) \,\|\, p(\overrightarrow{x}_t|s_t)]\right). \tag{8}$$
The exponential distribution in Equation (5) can be compared to a Gibbs-Boltzmann distribution. With this analogy, the Lagrange parameter, $\lambda$, has been identified as a "temperature"-like parameter (not to be confused with physical temperature, cf. Section 3.4). The analogy leads to the intuition that in the limit of large $\lambda$, fluctuations prevent any structure from being resolved. As $\lambda$ is lowered, more and more structure is recovered as solutions pass through a series of phase transitions [11].
Note that the model's prediction is represented by the distribution, $p(\overrightarrow{x}_t|s_t)$, which is computed from $p(\overrightarrow{x}_t|\overleftarrow{x}_t)$ (Equation 6). In practice, the latter is usually not known, but has to be estimated from the given data, e.g., via normalized frequencies [16], or model assumptions have to be made (e.g., [18]). However, once the model, Equations (5-8), has been constructed, then any trajectory $\overleftarrow{x}_t$ can be mapped onto a state $s_t$ via the probabilistic map $p(s_t|\overleftarrow{x}_t)$. From knowledge of the state, an estimate of the future can be generated using the distribution $p(\overrightarrow{x}_t|s_t)$. The probability of the state sequence $(\ldots, s_{t-1}, s_t, s_{t+1}, \ldots)$ given the input data (past trajectories) is proportional to (Equations 5-8)
$$\exp\left(\sum_t \left[ -\frac{1}{\lambda} D_{KL}[p(\overrightarrow{x}_t|\overleftarrow{x}_t) \,\|\, p(\overrightarrow{x}_t|s_t)] + \log[p(s_t)] \right]\right).$$
The first term in the sum results in the most likely state sequence having minimal predictability loss. This term becomes increasingly relevant as $\lambda$ decreases. The second term, $\log[p(s_t)]$, corresponds to an extra penalty, favoring simpler models over more complex ones. This explains why, out of all possible, equally predictive representations [21,23,28,36], the minimal one is chosen [16].
To study the causal compressibility [15] of a signal, x(t), one can then plot the predictive power of the best possible model vs. its memory, that is, both quantities are evaluated at the solution to the optimization at different values of the Lagrange multiplier to obtain an information curve [11] in analogy to a rate-distortion curve.Processes of qualitatively different causal structure can be thus identified [15].Numerically, the curve can be traced out using the information bottleneck algorithm [11], which is closely related to the Blahut-Arimoto algorithm [40,41].
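The iteration of the self-consistent equations, Equations (5-8), can be sketched for discrete pasts and futures as follows. This is an illustrative sketch, not the implementation used in [11] or [16]: the function names, the random initialization, the small constant `eps` that guards the logarithms, and the toy distribution are all assumptions made here. Running it for several values of $\lambda$ gives points on the information curve.

```python
import numpy as np

def mutual_information(joint):
    """Mutual information (nats) of a 2D joint probability table."""
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float((joint[mask] * np.log(joint[mask] / (px * py)[mask])).sum())

def information_bottleneck(p_past, p_future_given_past, n_states, lam,
                           n_iter=200, seed=0):
    """Iterate Equations (5-8); returns p(s|past), memory, predictive power."""
    rng = np.random.default_rng(seed)
    eps = 1e-12                                   # guards log(0); an assumption
    q = rng.random((len(p_past), n_states))       # p(s | past), random start
    q /= q.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        p_s = p_past @ q                                          # Eq. (7)
        p_future_given_s = ((q * p_past[:, None]).T @ p_future_given_past
                            / (p_s[:, None] + eps))               # Eq. (6)
        log_ratio = (np.log(p_future_given_past + eps)[:, None, :]
                     - np.log(p_future_given_s + eps)[None, :, :])
        kl = (p_future_given_past[:, None, :] * log_ratio).sum(axis=2)
        logits = np.log(p_s + eps)[None, :] - kl / lam            # Eq. (5)
        logits -= logits.max(axis=1, keepdims=True)               # stability
        q = np.exp(logits)
        q /= q.sum(axis=1, keepdims=True)                         # Eq. (8)
    joint_past_state = q * p_past[:, None]
    memory = mutual_information(joint_past_state)
    predictive = mutual_information(joint_past_state.T @ p_future_given_past)
    return q, memory, predictive

# Toy input: four equally likely pasts with two distinct future distributions.
p_past = np.full(4, 0.25)
p_future_given_past = np.array([[0.9, 0.1], [0.9, 0.1],
                                [0.2, 0.8], [0.2, 0.8]])
for lam in (2.0, 0.5, 0.05):
    _, memory, predictive = information_bottleneck(
        p_past, p_future_given_past, n_states=2, lam=lam)
    print(lam, memory, predictive)
```

Because the state depends on the future only through the past, the predictive power returned can never exceed the memory, nor the predictive information of the input itself.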

Asymptotic Behavior
The full power of this method is revealed by studying the solution in the regime in which the trade-off parameter, λ → 0, so that the emphasis is on predictive power.When infinitely long pasts are used to predict infinite futures, then the states computed in this limit are the causal states constituting minimal sufficient statistics [21,23,28,36].This result holds independent of the exact form of the distribution that generated the input data.A detailed proof can be found in [16].Here, we give only an intuitive derivation.Note that similar intuition has previously been pointed out in [42].
Using results from [43], it is easy to see that in the limit $\lambda \to 0$, $p(s_t|\overleftarrow{x}_t)$ will tend towards zero for all states, except those that minimize $D_{KL}[p(\overrightarrow{x}_t|\overleftarrow{x}_t) \,\|\, p(\overrightarrow{x}_t|s_t)]$. Assuming that there is no restriction on the state space of the learning machine, one can always ensure that there exists one state, $s_t^*$, such that $D_{KL}[p(\overrightarrow{x}_t|\overleftarrow{x}_t) \,\|\, p(\overrightarrow{x}_t|s_t^*)] = 0$. The assignment $p(s_t|\overleftarrow{x}_t)$ then equals one if $s_t = s_t^*$ and zero otherwise. This constitutes a deterministic assignment of pasts to states. A method known as deterministic annealing [44,45] can be used to numerically find the solution for the $\lambda \to 0$ limit. The distributions specified by Equation (5) assign past trajectories to model states by means of distributional clustering [11,46]. In the context of time series data, this means that two pasts, $\overleftarrow{x}_t$ and $\overleftarrow{x}'_t$, that have similar conditional future distributions, $p(\overrightarrow{x}_t|\overleftarrow{x}_t)$ and $p(\overrightarrow{x}_t|\overleftarrow{x}'_t)$, will likely end up in the same cluster, denoted by $s_t$. This results in a partitioning of the space of all past trajectories. In general, this partition is what is often called soft, or fuzzy, because the assignments are probabilistic. The resulting partition can only be hard when the assignments become deterministic.
The hard partition discovered in the $\lambda \to 0$ limit can alternatively be described by an equivalence relation, yielding the very definition of the causal state partition [21]: two past trajectories, $\overleftarrow{x}_t$ and $\overleftarrow{x}'_t$, are equivalent for purposes of prediction if and only if $p(\overrightarrow{x}_t|\overleftarrow{x}_t) = p(\overrightarrow{x}_t|\overleftarrow{x}'_t)$. They are then mapped onto the same causal state. This equivalence relation has many desirable properties; most notably, the causal states are unique and minimal sufficient statistics [21,23,28,36,37]. The causal state partition is a probabilistic bisimulation [47], and is also fundamentally related to observable operator models [48], see e.g. [49]. This result shows that IB has the capacity for predictive inference when used on time series data, because it discovers minimal sufficient statistics. This fact has motivated the use of the name optimal causal inference (OCI) [16]. For finite lengths, $\tau_P$ and $\tau_F$, an equivalence relation can be defined in analogy to the causal state partition, and OCI recovers the corresponding partition in the limit $\lambda \to 0$. Other algorithms exist for constructing the causal state partition [31,50]. One advantage of the IB approach is that it allows for a principled relaxation of the complexity constraint by adjustment of $\lambda$.
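For finite pasts with known conditional future distributions, the equivalence relation can be applied directly. The sketch below is only an illustration of that definition, not one of the algorithms of [16,31,50]; the function name, the rounding used to compare floating point probabilities, and the toy process (a binary signal in which a 1 is always followed by a 0) are assumptions made here.

```python
from collections import defaultdict

def causal_state_partition(p_future_given_past, decimals=6):
    """Group pasts whose conditional future distributions coincide.

    p_future_given_past maps each past to a tuple of probabilities over
    futures. Rounding to `decimals` places is an assumption of this sketch,
    used only to compare floating point numbers.
    """
    classes = defaultdict(list)
    for past, dist in p_future_given_past.items():
        key = tuple(round(p, decimals) for p in dist)
        classes[key].append(past)
    return sorted(sorted(members) for members in classes.values())

# Toy process: pasts of length two, futures of length one (next symbol 0 or 1).
# A past ending in 1 is always followed by 0; after a 0, both symbols are
# equally likely.
toy = {
    "00": (0.5, 0.5),
    "10": (0.5, 0.5),
    "01": (1.0, 0.0),
}
print(causal_state_partition(toy))  # [['00', '10'], ['01']]
```

With estimated rather than known distributions, exact equality is too strict, which is one motivation for the soft, $\lambda$-controlled clustering described above.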
Here, the words causality and causal inference do not refer to what has been coined causal inference in statistics (see e.g.[51] and references therein), an approach that involves the logic of counterfactuals.In statistics it often makes sense that abstracting away from given data to as-yet unseen data does not necessarily rely on the data being ordered in time.But the notion of causality cannot easily be separated from the concept of time in any physically meaningful way.Therefore, temporal causal structure has to be taken into account to understand the physics underlying natural computation, particularly if one assumes that nature computes by means of its dynamics (see Sections 1.1 and 3.4).

Advantages and Disadvantages
This approach has the great advantage that no assumptions have to be made about the distribution underlying the generation of the time series.OCI finds the best model in terms of predictive power at any desired level of complexity.
While conceptually elegant, in practice $p(\overrightarrow{x}_t|\overleftarrow{x}_t)$ has to be estimated from the data. Finite sample effects set a natural lower bound on $\lambda$, due to an upward bias in the estimated information content (see [43] and references therein). Therefore, the number of causal states that can be used to describe the data without over-fitting is limited. The bias correction method of [43] has been used in the predictive inference context [16]. In general, it works well only when the number of data points is significantly larger than the number of bins that are used to estimate probabilities (by normalized counts). Since $\overleftarrow{x}_t$ and $\overrightarrow{x}_t$ are potentially infinitely long sequences, finite sample effects could easily be overwhelming. The first simple step to address this problem is to make $\tau_P$ and $\tau_F$ finite and not too large. This restriction, however, comes at the disadvantage of restricting the predictive power of the model and its ability to discover structure. It also introduces a new problem: the discovered model might depend on how we choose to distribute the total length, $\tau_P + \tau_F$, between the parameters $\tau_P$ and $\tau_F$. The recursive IB we discuss in Section 4.2 addresses this problem.

Linear and Gaussian Models
An alternative is to use a model for the process that generated the data. For linear models with Gaussian noise, the explicit form of the IB solution can be calculated analytically [52]. This is known as Gaussian information bottleneck (GIB). Applied to time series, for $\tau_P = \tau_F = 1$, one assumes that past, $x_t$, and future, $x_{t+1}$, are jointly multivariate Gaussian variables [18]. GIB furthermore assumes that there is a linear transformation, the matrix $M$, mapping the input data to the model variable, here, $s_t = M x_t + \xi$, in a noisy fashion, where $\xi \sim \mathcal{N}(0, \Sigma)$ is normally distributed. In this approach, GIB is used to find a reduced description of the underlying model. The optimization is now over the linear transformation, $M$, and the bottleneck problem becomes [18]:
$$\max_{M} \left( I[s_t; x_{t+1}] - \lambda\, I[s_t; x_t] \right).$$
The matrix that solves this optimization problem contains, as $\lambda$ decreases, an increasing number of (scaled) eigenvectors of $I - (\Sigma_{x_t; x_{t+1}} \Sigma_{x_t}^{-1})^2$ [18,52], where $\Sigma_{x_t}$ denotes the covariance matrix of the inputs, $\Sigma_{x_t; x_{t+1}}$ characterizes temporal correlations, and $I$ is the identity matrix. The method is related [18] to slow feature analysis [22], and is able to deconvolve and filter composite signals [18].
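To make the role of the matrix $I - (\Sigma_{x_t; x_{t+1}} \Sigma_{x_t}^{-1})^2$ concrete, consider the simplest special case, which is an assumption of this sketch and not an example taken from [18,52]: a scalar, stationary Gaussian signal with lag-one correlation coefficient $\rho$. Then $\Sigma_{x_t; x_{t+1}} \Sigma_{x_t}^{-1} = \rho$, the matrix reduces to $1 - \rho^2$, and the standard formula for the mutual information of two jointly Gaussian scalars gives the predictive information $I[x_t; x_{t+1}] = -\frac{1}{2}\log(1 - \rho^2)$. The function name is hypothetical.

```python
from math import log

def gaussian_predictive_information(rho):
    """I[x_t; x_{t+1}] in nats for jointly Gaussian scalars with correlation rho."""
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    return -0.5 * log(1.0 - rho ** 2)

# Stronger temporal correlation means more information available for prediction.
for rho in (0.0, 0.5, 0.9):
    print(rho, gaussian_predictive_information(rho))
```

This value is the upper limit on the predictive power that any summary $s_t$ of $x_t$ can reach in the scalar case.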
When analyzing nonlinear complex systems in which the Gaussian assumption is violated, the practitioner has to consider carefully whether the method can still be applied.

Thermodynamic Foundations
Predictive inference is an efficient way of processing information, implemented by living organisms in various ways.Prediction can be useful for different stages of information processing, ranging from genetic networks, to vision, to motor behavior, to higher cognitive function [20].All information processing has to happen on physical devices, natural or synthetic, and many authors have argued that, in general, all information is physical (e.g., [53]).Ultimately, one would like to know if there are physical reasons for the emergence of predictive inference.
Whenever many small (computing) units are tightly packed together, as is the case in living systems, heat generation due to dissipation generally poses a problem.Thermodynamic efficiency is thus a relevant consideration, and is also becoming increasingly relevant for the design of modern artificial systems, as the size of components shrinks and their packing density increases.This section asks about the relationship between thermodynamic efficiency and information processing efficiency (in the sense implemented by predictive inference).Could energetic efficiency be an underlying motivation for predictive inference?
Memory and predictive power are the relevant diagnostics for how well a system is implementing predictive inference: on the one hand, a model should have large predictive power, on the other hand, it should not be overly complex, i.e., we do not wish to retain memory beyond what is useful for prediction.The nonpredictive part of the memory thus measures the inefficiency of the model, in terms of prediction, by quantifying how much irrelevant information is retained.
It turns out that there is a simple relationship between dissipation and instantaneous nonpredictive information [24].Before this is explained in Section 3.4, a brief overview of some relevant context is given in Sections 3.1 to 3.3.
Let us imagine that predictive inference is implemented by means of a physical device, and, as before, let us denote the state of this computing system at time $t$ by $s_t$. The input time series, $X = x_0, x_1, \ldots, x_\tau$, is fed into the system by means of a change in some external parameter(s). For simplicity, we shall denote these environmental variables also by $x_t$ (assuming that a one-to-one mapping can be constructed between the input time series and the external parameters driving the computing system). Changes in the external signal then cause changes in the system's state.
In the previous section, the dependency of states on data was given abstractly by the probabilistic map, $p(s_t|\overleftarrow{x}_t)$, from all past experiences up to time $t$ onto states $s_t$. However, specifying this map in an explicit way in practice would require a buffer for the entire history, in order to determine the state $s_t$ directly from the currently observed trajectory.
An alternative strategy is to use the system's state space as memory by making the state-update depend not only on the input data, but also on the system's previous state. The dynamics of the device can be characterized by conditionally Markovian transition probabilities, $p(s_t|s_{t-1}, x_t)$. These dynamics result in an implicit model of the input time series, and also determine the physics of the computing device, as the incoming data drive the computing machine via changes in control parameters. During this process, work is done. The thermodynamic inefficiency of this process can be characterized by the amount of work that is lost to dissipation. Since this process may drive the computing system arbitrarily far from thermodynamic equilibrium, equilibrium thermodynamics is no longer an adequate description.

Driven Systems Far from Thermodynamic Equilibrium
During the last two decades, significant progress has been made in understanding driven systems far from equilibrium [54,55], most notably, Jarzynski's work relation [56], associating the work, $W$, done on a thermodynamic system to the resulting change in free energy of the system, $\Delta F$:
$$e^{-\beta \Delta F} = \left\langle e^{-\beta W} \right\rangle. \tag{10}$$
The brackets, $\langle \cdot \rangle$, denote the statistical average. It is assumed that the system is started in thermodynamic equilibrium, then is driven arbitrarily far from equilibrium, due to changes in external parameters that follow a known protocol (the experiment). After execution of this protocol, the system is finally allowed to relax back to thermodynamic equilibrium. Throughout, the system is connected to a heat bath at inverse temperature $\beta = 1/k_B T$, where $k_B$ is Boltzmann's constant, and $T$ is the temperature. The work relation holds for systems driven arbitrarily far from thermodynamic equilibrium, thus going beyond the near-equilibrium predictions of linear response theory [57].
The fact that the work done on the system, on average, has to be larger than the work that can be derived from the system, $\langle W \rangle \geq \Delta F$, follows from Equation (10) [56] via Jensen's inequality. The logarithm of the r.h.s. of Equation (10) can be expanded into a sum of cumulants of $W$. Assuming a Gaussian distribution, only the first two survive and a fluctuation-dissipation relation follows [56]:
$$\langle W \rangle - \Delta F = \frac{\beta}{2}\left( \langle W^2 \rangle - \langle W \rangle^2 \right).$$
While the original derivation was done using Hamiltonian dynamics, the work relation can also be derived for stochastic systems governed by conditionally Markovian dynamics [58]. State-to-state transitions are given by $p(s_t|s_{t-1}, x_t)$, taking the system through a sequence of states $S = s_0, \ldots, s_\tau$ in response to the input, or protocol, $X = x_0, \ldots, x_\tau$, which is assumed to be given. The system is in thermodynamic equilibrium at $t = 0$, so that the Boltzmann distribution, $p_{eq}(s|x) := e^{-\beta(E(s,x) - F[x])}$, describes the initial distribution, $p(s_0|x_0) = p_{eq}(s_0|x_0)$ (subscript "eq" denotes thermodynamic equilibrium). The equilibrium free energy is
$$F[x] := \langle E(s,x) \rangle_{p_{eq}(s|x)} - k_B T\, H[p_{eq}(s|x)],$$
where $k_B H[p_{eq}(s|x)]$ is the thermodynamic entropy, given by [59] the Shannon entropy of the equilibrium distribution, $H[p_{eq}(s|x)] := -\langle \log[p_{eq}(s|x)] \rangle_{p_{eq}(s|x)}$, times the Boltzmann constant $k_B$.
The total work done, $W$, can be split into incremental changes in energy [58,60], as can the total heat, $Q$, exchanged with the bath (heat flowing into the system is positive by convention):
$$W = \sum_{t=1}^{\tau} \left[ E(s_{t-1}, x_t) - E(s_{t-1}, x_{t-1}) \right], \qquad Q = \sum_{t=1}^{\tau} \left[ E(s_t, x_t) - E(s_{t-1}, x_t) \right].$$
Energy changes during these work- and relaxation-steps sum up to the total change in energy, $W + Q = \Delta E = E(s_\tau, x_\tau) - E(s_0, x_0)$ (first law of thermodynamics). Now, assume that after completion of the protocol at time $\tau$, the system can relax back to thermodynamic equilibrium. Then the amount of work dissipated during this process is that part of the total work, $W$, that did not contribute to increasing the equilibrium free energy of the system, i.e., $W_{diss} := W - \Delta F$. Consider the protocol run in reverse time, $\tilde{X} = x_\tau, \ldots, x_0$, and ask for the probability, $p_R(\tilde{S}|\tilde{X})$, of finding the exact reverse-time path through state space, $\tilde{S} = s_\tau, \ldots, s_0$. The ratio between forward time probability, $p_F(S|X)$, and reverse time probability, $p_R(\tilde{S}|\tilde{X})$, depends exponentially on the work done in excess of the equilibrium free energy change [58,60,61]:
$$\frac{p_F(S|X)}{p_R(\tilde{S}|\tilde{X})} = e^{\beta (W - \Delta F)}.$$
The work relation, Equation (10), then follows immediately from normalization of probability [60]:
$$1 = \left\langle \frac{p_R(\tilde{S}|\tilde{X})}{p_F(S|X)} \right\rangle_{p_F(S|X)} = e^{\beta \Delta F} \left\langle e^{-\beta W} \right\rangle_{p_F(S|X)},$$
whereby one obtains $e^{-\beta \Delta F} = \langle e^{-\beta W} \rangle_{p_F(S|X)}$.
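The work relation can be checked exactly on a small discrete system by enumerating all paths. The sketch below does this for a two-state system; the energies, the three-step protocol, and the Metropolis transition rule are illustrative assumptions chosen here (any rule that leaves $p_{eq}(s|x_t)$ invariant would serve), and the work along a path is accumulated over the steps in which the protocol changes while the state is held fixed.

```python
import itertools
import numpy as np

beta = 1.0
# Hypothetical energies E[s, x] for two states s and three protocol values x.
E = np.array([[0.0, 0.0, 0.0],
              [1.0, 2.0, 3.0]])
protocol = [0, 1, 2]  # x_0, x_1, x_2

def p_eq(x):
    """Boltzmann distribution over states at protocol value x."""
    w = np.exp(-beta * E[:, x])
    return w / w.sum()

def free_energy(x):
    """Equilibrium free energy F[x] = -(1/beta) log sum_s exp(-beta E(s, x))."""
    return -np.log(np.exp(-beta * E[:, x]).sum()) / beta

def transition(s_new, s_old, x):
    """Metropolis rule p(s_t | s_{t-1}, x_t): propose the other state."""
    other = 1 - s_old
    accept = min(1.0, float(np.exp(-beta * (E[other, x] - E[s_old, x]))))
    return accept if s_new == other else 1.0 - accept

def jarzynski_average():
    """Return <exp(-beta W)> and <W>, summing over every path S exactly."""
    tau = len(protocol) - 1
    exp_avg, mean_work = 0.0, 0.0
    for path in itertools.product([0, 1], repeat=tau + 1):
        prob = p_eq(protocol[0])[path[0]]          # start in equilibrium
        work = 0.0
        for t in range(1, tau + 1):
            x_prev, x_now = protocol[t - 1], protocol[t]
            work += E[path[t - 1], x_now] - E[path[t - 1], x_prev]  # work step
            prob *= transition(path[t], path[t - 1], x_now)         # relaxation
        exp_avg += prob * np.exp(-beta * work)
        mean_work += prob * work
    return exp_avg, mean_work

lhs, mean_work = jarzynski_average()
delta_F = free_energy(protocol[-1]) - free_energy(protocol[0])
print(lhs, np.exp(-beta * delta_F), mean_work, delta_F)
```

The two exponential averages agree to numerical precision, and the mean work is no smaller than the free energy change, as Jensen's inequality requires.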

Nonequilibrium Free Energy and Dissipation
If the system does not instantaneously relax back to thermodynamic equilibrium after being driven away from equilibrium, then detailed knowledge of the system's state would allow for the extraction of additional free energy, beyond the free energy of the corresponding equilibrium system. This additional free energy (at time $t$) is proportional [26] to the relative entropy, or Kullback-Leibler divergence, between the out-of-equilibrium distribution, $p_t$, and the corresponding Boltzmann distribution, $p^{x_t}_{eq} := p_{eq}(s|x_t)$:
$$F_{add}[p_t, x_t] := k_B T\, D_{KL}[p_t \,\|\, p^{x_t}_{eq}]. \tag{14}$$
The total nonequilibrium free energy [62-64] is then given by the equilibrium free energy $F[x_t]$ plus this additional free energy:
$$F_{neq}[p_t, x_t] := F[x_t] + F_{add}[p_t, x_t] = \langle E(s, x_t) \rangle_{p_t} - k_B T\, H[p_t],$$
with $H[p_t]$ denoting the Shannon entropy associated with the distribution $p_t$. The second equality follows directly from inserting the Boltzmann distribution into Equation (14).
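The step from the relative entropy in Equation (14) to the energy-minus-entropy form of the nonequilibrium free energy can be written out as follows. This is a short worked derivation that uses only the Boltzmann distribution $p_{eq}(s|x) = e^{-\beta(E(s,x) - F[x])}$ and $\beta = 1/k_B T$ as defined in the text; the symbols $F_{add}$ and $F_{neq}$ denote the additional and the total nonequilibrium free energy.

```latex
\begin{align*}
F_{add}[p_t, x_t]
  &= k_B T \left\langle \log p_t(s) - \log p_{eq}(s|x_t) \right\rangle_{p_t} \\
  &= -k_B T\, H[p_t] + k_B T \beta \left\langle E(s, x_t) - F[x_t] \right\rangle_{p_t} \\
  &= \langle E(s, x_t) \rangle_{p_t} - k_B T\, H[p_t] - F[x_t], \\
F_{neq}[p_t, x_t]
  &= F[x_t] + F_{add}[p_t, x_t]
   = \langle E(s, x_t) \rangle_{p_t} - k_B T\, H[p_t].
\end{align*}
```

For $p_t = p_{eq}(s|x_t)$ the relative entropy vanishes and the expression reduces to the equilibrium free energy.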
The additional free energy may be difficult to harness, as doing so requires knowledge about the system that may not be available. However, it could theoretically be used by a clever device. Dissipation, i.e., the work that is irretrievably lost, is therefore work done on the system in excess of the system's nonequilibrium free energy change $\Delta F_{neq}$, i.e.,
$$\langle W_{diss} \rangle := \langle W \rangle - \langle \Delta F_{neq} \rangle. \tag{16}$$
Brackets denote the statistical average. Since we are interested in how much information a system can carry about an arbitrary environment, we have to allow the external driving signal (= protocol) to be stochastic, as anything else would be too limiting a restriction. We are therefore interested in quantities averaged not only over $P(S|X)$, but also over $P(X)$, and therefore we have to take the average over the joint distribution $P(S, X)$. Note, however, that an argument analogous to the above can be made when the protocol is assumed to be fixed (see e.g., [65] and references therein). Average dissipation (Equation 16) is related to the average work done in excess of the equilibrium free energy change, $\langle W_{ex} \rangle := \langle W \rangle - \langle \Delta F \rangle$, by the change in additional free energy due to being out of equilibrium, $\Delta F_{add}$, i.e., the change in relative entropy (see Equation (14)):
$$\langle W_{diss} \rangle = \langle W_{ex} \rangle - \langle \Delta F_{add} \rangle.$$
Using the fact that dissipation is non-negative, we see that $\langle W_{ex} \rangle \geq \langle \Delta F_{add} \rangle$. Extra work, in the amount of $\Delta F_{add}$, can be extracted from a system that was started in thermodynamic equilibrium ($F_{add}[p_0, x_0] = 0$) and driven out of equilibrium by the protocol $X$, as for such a system $\Delta F_{add} = F_{add}[p_\tau, x_\tau] \geq 0$ due to the non-negativity of relative entropy. This observation has been used to motivate the claim that additional work could be extracted using a "feedback" protocol, i.e., an experimental protocol that is adjusted in response to knowledge of the system, available via a measurement (see e.g. [65,66] and references therein). The notion of "feedback" in this context (e.g. [66]) is more restrictive than the feedback referred to in Section 4.1, and in most of the robotics and signal processing literature, where typically both the system and the environment (here: the protocol) evolve in time and influence each other.

Landauer's Principle
Imagine building a computing machine that is composed of many small devices (microscale, or even nanoscale).If every part of the machine dissipates heat, it may become a challenge to keep the machine from overheating.Synthetic devices face this problem, as do biological computing systems.It is thus relevant to know what the physical limits on heat generation are, and how they can be achieved.
Landauer argued [67] that the heat generated when one bit of information is erased from a device has to be at least $k_B T \ln(2)$. Take, for example, a simple model for a bistable system: a particle in a double well potential. Assume that one could measure which well the particle is in, but that one would not attempt any other measurements on the device. Then, from the point of view of the potential observer, i.e., the user of the device, this device can store one bit of information. The particle can be either in the left or in the right well.
Now assume that at the beginning of an "erasure" protocol, the observer does not know, but could measure, where the particle is.Proceed with the erasure of this one bit of information by deforming the potential, such that the particle is forced into, say, the left well.At the end of the protocol, the user knows where the particle is.Hence, no further information can be obtained from the device.Therefore, one bit of information has been deleted.This is Szilard's engine [68] run in reverse [69].Landauer assumed that both at the beginning and at the end of the protocol the device would be in thermodynamic equilibrium.
He argued that the information erased, $I_e^L$, is then directly related to the difference in system entropy, $\Delta H := H[p_{\rm eq}(s_\tau|x_\tau)] - H[p_{\rm eq}(s_0|x_0)]$ (given the protocol and assuming that the protocol starts and ends in equilibrium): $I_e^L := -\Delta H / \log(2)$, where the factor $\log(2)$ converts from nats to bits. The change in equilibrium free energy, $\Delta F = F[x_\tau] - F[x_0]$, can be written as an average change in energy and an entropic change: $\Delta F = \Delta E - k_B T \Delta H$. Combining this with the first law of thermodynamics, $\Delta E = W + Q$ (with $Q$ the heat absorbed by the system), we obtain $W_{\rm ex} = W - \Delta F = -Q - k_B T \log(2)\, I_e^L$. This quantity has to be non-negative, according to the second law of thermodynamics, and therefore we have $-Q \geq k_B T \log(2)\, I_e^L$. So, if one bit of information is erased, $I_e^L = 1$, then the generated heat is at least $k_B T \log(2)$.
Landauer's argument is a restatement of the second law of thermodynamics, based on the fact that dissipation is the total change in entropy, which splits up into the change in the environment, given by the heat generated, plus the change in the system's entropy, which, in turn, he argued, is the negative information erased during a protocol. Landauer's principle can be generalized to the case where one does not assume that the system ends in thermodynamic equilibrium [24,65], and to stochastic driving [24]. These generalizations equally are restatements of the ("generalized" [65]) second law.
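To give a sense of scale for Landauer's bound, the following minimal sketch evaluates $k_B T \ln(2)$ per erased bit. The temperature of 300 K and the function name `landauer_bound` are illustrative assumptions, not taken from the text.

```python
import math

K_B = 1.380649e-23  # Boltzmann constant in J/K (exact SI value)

def landauer_bound(bits_erased: float, temperature_kelvin: float) -> float:
    """Minimum heat (in joules) generated when erasing `bits_erased` bits
    at the given temperature, according to Landauer's bound."""
    return bits_erased * K_B * temperature_kelvin * math.log(2)

# Assumed example: one bit erased at roughly room temperature (300 K).
heat_one_bit = landauer_bound(1.0, 300.0)
print(f"Landauer bound at 300 K: {heat_one_bit:.3e} J per bit")
```

At this assumed temperature the bound is a few zeptojoules per bit, which is why it matters mainly on the small scales discussed here.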

Thermodynamics of Prediction
While the treatment of systems driven far from thermodynamic equilibrium usually assumes that a driving protocol is given by some prescribed experimental protocol (e.g. [56,58,65,70]), we obviously cannot make this restrictive assumption in the context of adaptation and learning. Instead, the focus has to be on systems embedded in stochastic environments.
Let us therefore assume that the protocol is not known, but rather reflects one instantiation of a stochastic environment, which can be described by some distribution, $P(X)$, underlying the generation of the data, $x_0, \ldots, x_\tau$. The system's dynamics could be adapted to the environment. This may have occurred via some natural process, for example, evolution, or by means of engineering a synthetic device in some optimal fashion. A system that is adapted to a certain type of stochastic environment would not be optimized with respect to one particular realization of the environment, but rather with respect to the average. Therefore, the following treatment focusses on average quantities, averaged not only over $P(S|X)$, but also over all possible realizations of the environment (or protocol), $P(X)$.
When the environment is changed from $x_t$ to $x_{t+1}$, work is being done (see Equation (11)). How much of this work can be extracted from the system? We assume that at each point in time only the current state of system and environment can be measured, and that there is no additional, extraneous lookup table (or any additional memory of any kind). The best estimate of the current environmental signal given the system's state is then $p(x_t|s_t)$, and the best estimate of the next environmental signal is $p(x_{t+1}|s_t)$. Conversely, the best estimate of the system's state given the environment is $p(s_t|x_t)$ before the change in the environment, and $p(s_t|x_{t+1})$ after. For the equilibrium free energy change associated with this change, see [71]. The average instantaneous work lost is then given by $\langle W_{\rm diss}[x_t \to x_{t+1}] \rangle = \langle W[x_t \to x_{t+1}] \rangle - \langle \Delta F_{\rm neq}[x_t \to x_{t+1}] \rangle$. This dissipated work is proportional [24] to instantaneous nonpredictive information,

$$\langle W_{\rm diss}[x_t \to x_{t+1}] \rangle = k_B T \left( H[s_t|x_{t+1}] - H[s_t|x_t] \right) \qquad (19)$$
$$= k_B T \left( I[s_t, x_t] - I[s_t, x_{t+1}] \right) . \qquad (20)$$

Line (19) follows directly from averaging Equation (15) over the external signal, whereby the conditional entropy of the system conditioned on the external control parameter can be expressed in terms of that part of the average energy that is not available as nonequilibrium free energy: $k_B T\, H[s|x] = \langle E \rangle - \langle F_{\rm neq} \rangle$. The last equality (20) follows immediately from the fact that mutual information measures a reduction in entropy [39]: $I[s, x] = H[s] - H[s|x]$.

Importantly, the instantaneous nonpredictive information allows us to judge the model that is implicit in the system's dynamics by the criteria governing predictive inference, namely that a good model should have large predictive power while not being overly complicated. Some part of the instantaneous memory, $I[s_t, x_t]$, retained in the system's state at time $t$, is predictive, $I[s_t, x_{t+1}]$. But the rest, $I[s_t, x_t] - I[s_t, x_{t+1}]$, represents a measure for how much nonpredictive "clutter" [20], or "nostalgia" [24], is kept at any instant in time. In that way it characterizes the ineffectiveness, or inefficiency, of the predictive inference that the system implements implicitly via its dynamics. It reaches zero when none of the instantaneous memory gets wasted on nonpredictive information, i.e. $I[s_t, x_t] = I[s_t, x_{t+1}]$. This could be achieved trivially by keeping no memory at all, $I[s_t, x_t] = 0$, and having no predictive power. However, a system that fulfills a function and operates at a finite rate usually has to be correlated to some degree with the environmental driving, and thus will have nonzero instantaneous memory. In many cases $I[s_t, x_{t+1}] \leq I[s_t, x_t]$, but in some cases, it is possible to have more instantaneous predictive power than instantaneous memory, i.e. $I[s_t, x_{t+1}] > I[s_t, x_t]$, because information from the more distant past that may increase predictive power can be carried by the system dynamics.
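The instantaneous memory, predictive power, and nonpredictive information can be computed directly for small discrete examples. The sketch below assumes a toy setting that is not taken from the text: a binary, stationary Markov environment with transition matrix `transition`, and a system state that is a noisy copy of the current signal, so that $s_t$ depends on $x_{t+1}$ only through $x_t$. Under this assumption the nonpredictive information is non-negative; the case of predictive power exceeding memory requires system dynamics that carry older information and is not covered here. All names are hypothetical.

```python
import numpy as np

def mutual_information(joint):
    """Mutual information (in nats) of a 2D joint probability table."""
    joint = np.asarray(joint, dtype=float)
    p_rows = joint.sum(axis=1, keepdims=True)
    p_cols = joint.sum(axis=0, keepdims=True)
    mask = joint > 0  # skip zero entries, which contribute nothing
    return float(np.sum(joint[mask] * np.log(joint[mask] / (p_rows @ p_cols)[mask])))

def information_terms(p_x, p_s_given_x, transition):
    """Return (memory, predictive, nonpredictive) information in nats.

    p_x[i]            : p(x_t = i)
    p_s_given_x[i, k] : p(s_t = k | x_t = i)
    transition[i, j]  : p(x_{t+1} = j | x_t = i)
    Assumes s_t and x_{t+1} are conditionally independent given x_t.
    """
    joint_x_s = p_x[:, None] * p_s_given_x     # p(x_t, s_t)
    joint_s_xnext = joint_x_s.T @ transition   # p(s_t, x_{t+1})
    memory = mutual_information(joint_x_s)
    predictive = mutual_information(joint_s_xnext)
    return memory, predictive, memory - predictive

# Assumed toy numbers: 5% read-out noise, 10% flip probability per step.
p_x = np.array([0.5, 0.5])
p_s_given_x = np.array([[0.95, 0.05], [0.05, 0.95]])
transition = np.array([[0.9, 0.1], [0.1, 0.9]])
memory, predictive, nonpredictive = information_terms(p_x, p_s_given_x, transition)
print(memory, predictive, nonpredictive)
```

Multiplying `nonpredictive` by $k_B T$ gives the average dissipation of Equation (20) for this toy system. Making the environment perfectly persistent (identity transition matrix) drives it to zero, while a memoryless environment makes all of the memory nonpredictive.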
The proportionality between instantaneous nonpredictive information and dissipation, Equation (20), carries over to the quantum regime [72], where this framework has proven useful to give a new interpretation to quantum discord, identified as "the thermodynamic inefficiency of the most energetically efficient classical approximation of a quantum memory" [72].
Total instantaneous nonpredictive information (summed over the entire protocol), $I_{\rm nonpred} := \sum_{t=0}^{\tau-1} \left( I[s_t, x_t] - I[s_t, x_{t+1}] \right)$, provides a lower bound on work lost on average, assuming that the system is in thermodynamic equilibrium at the beginning of the protocol [24]:

$$\langle W_{\rm ex} \rangle \geq k_B T\, I_{\rm nonpred} . \qquad (22)$$

To see this, let us compute the contributions to dissipation during relaxation steps, and add them to the dissipation during work steps (given by Equation (20)). Note that during a relaxation step, no work is done. Furthermore, the environmental variable does not change, and neither does the free energy of the corresponding equilibrium distribution. Therefore, dissipation during relaxation steps is solely given by the negative change in additional free energy, which is proportional to the difference in Kullback-Leibler divergence from equilibrium,

$$\langle W_{\rm diss}[s_t \to s_{t+1}] \rangle = -\langle \Delta F_{\rm add}[s_t \to s_{t+1}] \rangle = k_B T \left\langle D_{KL}[p(s_t|x_{t+1}) \| p_{\rm eq}(s|x_{t+1})] - D_{KL}[p(s_{t+1}|x_{t+1}) \| p_{\rm eq}(s|x_{t+1})] \right\rangle .$$

Since the relative entropy between the actual distribution and the corresponding equilibrium distribution is a Lyapunov function [73], this contribution is non-negative, and we obtain, using Equation (20), the following bound on the total average dissipation:

$$\langle W_{\rm diss} \rangle \geq k_B T\, I_{\rm nonpred} . \qquad (24)$$

The average dissipation on the l.h.s., in turn, sets a lower bound to the average work done in excess of equilibrium free energy change, $\langle W_{\rm ex} \rangle$: remember that, if the system is in thermodynamic equilibrium at time $t = 0$, we have $F_{\rm add}[p(s_0|x_0)] = 0$, and the l.h.s. of (24) is then $\langle W \rangle - \Delta F - F_{\rm add}[p(s_\tau|x_\tau)] \leq \langle W_{\rm ex} \rangle$, due to the non-negativity of relative entropy. Altogether, we arrive at (22).
Inequality (22) leads to a refinement of Landauer's argument. We use, as before, the fact that $\beta W_{\rm ex} = -\beta Q - I_e$ (where $I_e = \log(2)\, I_e^L$ is Landauer's erasure in nats, for convenience), to arrive at the conclusion that the heat leaving the system is bounded from below by

$$-\beta \langle Q \rangle \geq I_e + I_{\rm nonpred} .$$

Landauer's bound is augmented by the total instantaneous nonpredictive information. These insights are applicable to systems that can be driven arbitrarily far from thermodynamic equilibrium, spanning a wide range, including artificial computing devices, as well as biomolecular machines. The direct connection between nonpredictive information and dissipation hints towards a possible underlying physical reason for the emergence of predictive inference: naturally occurring computation may be implementing predictive inference because this information processing strategy may also allow for the efficient use of energy. This may be relevant on the small scales on which the machinery of life operates, where $k_B T$ is not negligible. Some bio-machines approach 100% thermodynamic efficiency, such as the F$_1$-ATPase [74], a molecule crucial for the energy metabolism of cells. When driven in a natural fashion, the stall torque is near the maximal values possible, given the free energy liberated by ATP hydrolysis and the size of the rotation. Similar efficiencies have been observed also in other bio-molecular motors, e.g., [75]. Such optimal behavior may imply that even these micro-machines might implement predictive inference implicitly via their dynamics.

Recursive Schemes
The physical view of information processing laid out above suggests that one could compress a time series by constructing a dynamical rule that determines the conditional state-updates, such that the resulting model has maximal predictive power at fixed memory [14].

Interactive Learning
In the most general case, learning machines are able to interact with, and change, their environment, as animals do when they learn. The learning system then no longer learns passively. Instead, its actions feed back to the environment so that the time evolution of the environment depends on the actions taken. Thereby the future that the learning system encounters depends, to some degree, on its own actions. This type of learning is interactive learning.
Dynamical rules have to be found not only for the internal state of the learning machine, but also for the actions that the machine can take. If one postulates that the predictive power of the resulting behavior should be maximized at fixed coding cost, then an information theoretic approach reveals that optimal action policies must balance control and exploration [14]. The approach provides a generalization of previously existing concepts, such as the causal state partition, to interactive learning. It furthermore allows for an intuitive treatment of curiosity-driven reinforcement learning [17], to which many alternative and some related approaches exist (see discussion and references in [17]).

Recursive Information Bottleneck Method (RIB)
For the present discussion of predictive inference, the treatment shall remain restricted to representations extracted from the data that do not influence future experiences directly (i.e. passive learning). This case is contained in the more general treatment developed in [14] and can be retrieved directly by deleting the action variables throughout [14]. By doing so, one obtains a recursive version of the information bottleneck method.
The learning machine updates its state from $s_{t-1}$ to $s_t$, after it receives a new input, $x_t$. The update dynamics are given by $p(s_t|s_{t-1}, x_t)$. They characterize the model, together with the predictions encoded in $p(\overrightarrow{x}_t|s_t)$, and the state distribution, $p(s_t)$. The coding rate associated with the dynamics, $I[s_t; \{s_{t-1}, x_t\}]$, measures the complexity of the model. Dynamics are then constructed, such that predictive power is maximized, under a constraint on the coding cost, or memory [14]:

$$\max_{p(s_t|s_{t-1}, x_t)} \left( I[s_t; \overrightarrow{x}_t] - \lambda\, I[s_t; \{s_{t-1}, x_t\}] \right) .$$

This optimization problem is solved by dynamics that obey the following equations (Equations (14)-(17) in [14], with actions taken out, and history = $\{x_t, s_{t-1}\}$):

$$p(s_t|x_t, s_{t-1}) = \frac{p(s_t)}{Z(x_t, s_{t-1}, \lambda)} \exp\left( -\frac{1}{\lambda} D_{KL}\left[ p(\overrightarrow{x}_t|x_t, s_{t-1}) \,\|\, p(\overrightarrow{x}_t|s_t) \right] \right) , \qquad (28)$$
$$p(\overrightarrow{x}_t|s_t) = \frac{1}{p(s_t)} \sum_{x_t, s_{t-1}} p(\overrightarrow{x}_t|x_t, s_{t-1})\, p(s_t|x_t, s_{t-1})\, p(x_t, s_{t-1}) ,$$
$$p(s_t) = \sum_{x_t, s_{t-1}} p(s_t|x_t, s_{t-1})\, p(x_t, s_{t-1}) ,$$

where $Z(x_t, s_{t-1}, \lambda)$ ensures normalization. These dynamics group new incoming data, $x_t$, together with the current state, $s_{t-1}$, according to their similarity in terms of conditional future distributions, thereby creating, in each step, an incrementally more predictive model. Since the state, $s_t$, is computed from the variables, $s_{t-1}$ and $x_t$, knowing it does not add information about the future, due to the data processing inequality [39]. This is equivalent to saying that knowledge of the new state does not change the distribution over futures, when the old state and the current input are given, because the new state is obtained as a function of only these two variables and hence, no external information enters: $p(\overrightarrow{x}_t|s_t, s_{t-1}, x_t) = p(\overrightarrow{x}_t|s_{t-1}, x_t)$. A similar assumption is made in the original Information Bottleneck method, where the compression of the input data does not change the distribution over the relevant quantity, assuming the data are given. This assumption is crucial and warrants the name "recursive information bottleneck (RIB)" for this new, recursive compression scheme. Note that in the general case, however, when the learner's actions can change its future input, this assumption is no longer valid, and the rate-distortion framework has to be extended beyond its original scope, leaving the recursive information bottleneck method as a special case of this more general scenario [14]. Learning with feedback as in [14] can easily be extended from the "dynamical" learning discussed in this Section to the "static" learning treated by the original IB method (see Sec. 2).
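The self-consistent equations for the state assignments can be iterated numerically in the usual information-bottleneck fashion. The following is a minimal sketch under stated assumptions, not the implementation of [14]: histories $(x_t, s_{t-1})$ are flattened into a single index, the conditional future distributions are assumed to be known exactly, and the function name, argument names, and the fixed iteration count are all hypothetical.

```python
import numpy as np

def rib_state_assignments(p_h, p_f_given_h, init_assignments, lam, n_iter=100):
    """Iterate the self-consistent equations for p(s_t | history).

    p_h[h]              : p(x_t, s_{t-1}), histories flattened into one index h
    p_f_given_h[h, f]   : p(future = f | history = h)
    init_assignments    : initial guess for p(s_t | h), shape (n_histories, n_states)
    lam                 : trade-off parameter lambda (small -> near-deterministic)
    """
    eps = 1e-12  # guards against log(0) and division by zero
    p_s_given_h = np.asarray(init_assignments, dtype=float)
    for _ in range(n_iter):
        p_s = p_h @ p_s_given_h                                   # p(s_t)
        joint_s_f = (p_s_given_h * p_h[:, None]).T @ p_f_given_h  # p(s_t, future)
        p_f_given_s = joint_s_f / np.maximum(p_s[:, None], eps)   # p(future | s_t)
        # Relative entropy D_KL[p(future|h) || p(future|s)] for every (h, s) pair.
        log_ratio = (np.log(p_f_given_h[:, None, :] + eps)
                     - np.log(p_f_given_s[None, :, :] + eps))
        kl = np.sum(p_f_given_h[:, None, :] * log_ratio, axis=2)
        # Boltzmann-like update, computed in log space for numerical stability.
        logits = np.log(p_s + eps)[None, :] - kl / lam
        logits -= logits.max(axis=1, keepdims=True)
        p_s_given_h = np.exp(logits)
        p_s_given_h /= p_s_given_h.sum(axis=1, keepdims=True)
    return p_s_given_h

# Assumed toy input: four histories, but only two distinct future distributions.
p_h = np.full(4, 0.25)
p_f_given_h = np.array([[0.9, 0.1], [0.9, 0.1], [0.2, 0.8], [0.2, 0.8]])
init = np.array([[0.7, 0.3], [0.6, 0.4], [0.45, 0.55], [0.2, 0.8]])
assignments = rib_state_assignments(p_h, p_f_given_h, init, lam=0.1)
print(np.round(assignments, 3))
```

In this toy case the histories with identical conditional futures end up in the same state, which is the grouping behavior described above. Like other IB-type iterations, the result can depend on the initial guess.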
The RIB method's algorithmic procedure [14] starts by initializing the machine states to a uniform distribution that is uncorrelated with the observations, and as a consequence, the first iteration of RIB simply runs the IB algorithm (as explained in Sec. 2) with pasts of length one, and futures of length $\tau_F$, where the initial input distribution, $p(\overrightarrow{x}_t|x_t)$, is acquired from the input data. Then, a sequence of $L$ states is produced, using the optimal assignments which have been obtained by the initial compression: $p_0(s_t|x_t)$. The duration, $L$, controls how long the learning machine tests its model of the environment, before re-evaluating it. After $L$ steps, which are used to produce new states and to acquire the associated new input statistics, $p_j(\overrightarrow{x}_t|x_t, s_{t-1})$, RIB then solves Equations (28)-(31) and produces new optimal state assignments, $p_j(s_t|x_t, s_{t-1})$, which are, in turn, used to produce the next $L$ states. This procedure is repeated iteratively. (The index $j$ labels iteration number.) Each step, $j$, has a trade-off parameter, $\lambda_j$, associated with it. Sampling errors can lead to over-fitting when this parameter is too small [43]. This becomes more pronounced, the smaller $L$ is. $L$ can be as short as $L = 1$, its minimum, when the algorithm operates in an "online" fashion. One can then decrease $\lambda_j$ as a function of $j$, since sampling errors decrease as more data are accumulated over time. If a finite time series is given, then the maximum value of $L$ is limited by the length of the time series. In that case, when $L$ is set to its maximum, the algorithm makes several passes over the entire data batch. For theoretical guidance regarding $\lambda_j$, results from [43] can be used.
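The step of acquiring the input statistics $p_j(\overrightarrow{x}_t|x_t, s_{t-1})$ from a produced state sequence can be sketched with simple frequency counts. This is only an illustration under assumptions that are not specified in the text: discrete symbols, a future that starts at $x_{t+1}$ and has length `tau_f`, and plain maximum-likelihood estimates without any correction for sampling errors. The function name is hypothetical.

```python
from collections import Counter, defaultdict

def estimate_future_statistics(xs, states, tau_f):
    """Estimate p(future | x_t, s_{t-1}) by counting.

    xs[t]     : observed symbol at time t
    states[t] : machine state after processing xs[t]
    tau_f     : future length; the future at time t is (x_{t+1}, ..., x_{t+tau_f})
    Returns a dict: (x_t, s_{t-1}) -> {future tuple: relative frequency}.
    """
    counts = defaultdict(Counter)
    # Start at t = 1 so that s_{t-1} exists; stop where a full future is available.
    for t in range(1, len(xs) - tau_f):
        history = (xs[t], states[t - 1])
        future = tuple(xs[t + 1 : t + 1 + tau_f])
        counts[history][future] += 1
    return {
        history: {f: c / sum(fc.values()) for f, c in fc.items()}
        for history, fc in counts.items()
    }

# Assumed toy data: a period-two sequence and a machine with a single state.
xs = [0, 1, 0, 1, 0, 1, 0, 1]
states = [0] * len(xs)
stats = estimate_future_statistics(xs, states, tau_f=1)
print(stats)
```

With short sequences (small $L$) these counts are noisy, which is the sampling-error issue that motivates choosing $\lambda_j$ with care.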
A possible advantage of this scheme is that the state space of the learning machine will increase only as much as necessary for prediction. Instead of having to estimate $p(\overleftarrow{x}_t, \overrightarrow{x}_t)$ (as for OCI), one iteratively re-estimates $p(\overrightarrow{x}_t|x_t, s_{t-1})$, potentially easing the estimation problem significantly.
However, depending on the complexity of the input data, the model quality may rely on the use of long futures. If infinitely long futures are required to learn the best possible model, then any practical procedure has to be suboptimal, as it has to deal with finite data. But it is interesting to study this asymptotic regime ($\tau_F \to \infty$) theoretically, in order to understand the general capacity of the method.

Asymptotic Behavior of Recursive Information Bottleneck
In the limit $\tau_F \to \infty$, and taking the limit $\lambda_j \to 0$ $\forall j$, RIB finds the causal state partition (and thus minimal sufficient statistics), together with deterministic state transitions, yielding as a special case in this limit the "$\epsilon$-machine", which is the unique maximally predictive and deterministic Hidden Markov Model of a given time series [21,23,28,36].
To see how this works, consider pasts of length $\tau_P = t$, whereby $\overleftarrow{x}_t = (x_0, \ldots, x_t)$, and define the function, $f(\overleftarrow{x}_t)$, which creates a partition of the space of all past trajectories, such that all $\overleftarrow{x}_t$ which give the same conditional future distribution, $p(\overrightarrow{x}_t|\overleftarrow{x}_t)$, are mapped to the same value: $f(\overleftarrow{x}_t) = f(\overleftarrow{x}'_t) \Leftrightarrow p(\overrightarrow{x}_t|\overleftarrow{x}_t) = p(\overrightarrow{x}_t|\overleftarrow{x}'_t)$. This is one (intuitive) way of defining the causal state partition. We have to show that the states of RIB recover $f(\overleftarrow{x}_t)$, $\forall t$. This can be done by mathematical induction (we present here an intuitive argument and leave tedious technical details for later).
Since the RIB procedure is initialized by OCI run on pasts of length one, we have after the initial optimization, in the limit $\lambda_0 \to 0$, the assignments $p_0(s_t|\overleftarrow{x}_0) = 1$ if $s_t = s_0$, and $0$ otherwise, where the optimal solution is $s_0 = f(\overleftarrow{x}_0)$ [16]. This is the basis step for the proof by induction. The inductive hypothesis is that $s_t = f(\overleftarrow{x}_t)$, and we have to show that if this is true, then $s_{t+1} = f(\overleftarrow{x}_{t+1})$.
Since we let $\lambda_j \to 0$, $\forall j$, we know [16,43] that the optimal solution is the one that minimizes the relative entropy in the exponent of Equation (28). So, we have to evaluate $D_{KL}[p(\overrightarrow{x}_{t+1}|x_{t+1}, s_t) \,\|\, p(\overrightarrow{x}_{t+1}|s_{t+1})] \geq 0$, which is non-negative by definition of the optimal $s_{t+1}$. Now consider the probability that the system goes from the optimal state $s_t$ to the optimal state $s_{t+1}$, $p(s_{t+1}|x_{t+1}, s_t)$, which tends to one as $\lambda \to 0$, assuming that $p(s_{t+1})$ is non-zero. This result holds in general, regardless of the underlying distribution that generated the data. It is important, because it shows that RIB has the capacity to discover the minimal unique and sufficient predictive model, a representation that can be regarded as the best possible predictive model that can be constructed from observations of a stochastic process alone [23,36,42]. Conveniently, we do not have to evaluate or compare infinite past trajectories when using this recursive method. Furthermore, while the $\epsilon$-machine is defined for infinite future trajectories, it is obvious from the above treatment that when the notion of causal state partition is extended to finite futures, the argument above still applies for finite values of $\tau_F$.
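The causal state partition, extended to finite pasts and futures as mentioned above, can be illustrated by grouping pasts whose conditional future distributions coincide. The sketch below is a direct grouping rather than the RIB procedure itself, and it assumes that the conditional distributions are known exactly. The example process (no two consecutive ones, a fair coin after a zero), the numerical tolerance, and all names are illustrative assumptions.

```python
import numpy as np

def causal_state_partition(cond_future, tol=1e-9):
    """Assign a label to every past such that two pasts share a label
    if and only if their conditional future distributions agree within `tol`.

    cond_future : dict mapping past (tuple) -> sequence of future probabilities
    Returns a dict mapping past -> integer label.
    """
    representatives = []  # one future distribution per class found so far
    labels = {}
    for past, dist in cond_future.items():
        dist = np.asarray(dist, dtype=float)
        for label, rep in enumerate(representatives):
            if np.allclose(dist, rep, rtol=0.0, atol=tol):
                labels[past] = label
                break
        else:
            # No existing class matches: open a new one.
            representatives.append(dist)
            labels[past] = len(representatives) - 1
    return labels

# Assumed example: pasts of length two, futures of length one, for a binary
# process in which a 1 is always followed by a 0 and a 0 is followed by 0 or 1
# with equal probability. The past (1, 1) never occurs and is omitted.
cond_future = {
    (0, 0): [0.5, 0.5],
    (1, 0): [0.5, 0.5],
    (0, 1): [1.0, 0.0],
}
labels = causal_state_partition(cond_future)
print(labels)
```

For this process only the most recent symbol matters, so the three allowed pasts collapse into two classes.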

Modeling Linear Dynamical Systems
Assume that the transfer function that characterizes a dynamical system generating the observed data is linear with additive Gaussian noise. Then, using the IB to find a reduced representation of the dynamical system becomes an eigenvalue problem. The reduced system dynamics can be computed analytically, given the assumed underlying linear system dynamics. This approach was termed past-future IB (PFIB) [19]. The eigenvectors that determine the reduced system are the same as those computed by canonical correlation analysis (CCA) [19,76].
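Since the relevant eigenvectors coincide with those of CCA, the structure of the problem can be explored with a plain CCA between past and future windows. The sketch below is not the PFIB algorithm of [19]; it only computes sample canonical correlations via a singular value decomposition of the whitened cross-covariance. The test signal (a scalar first-order autoregressive process with coefficient 0.9), the window lengths, and all names are assumptions chosen for illustration.

```python
import numpy as np

def canonical_correlations(past, future):
    """Sample canonical correlations between two data matrices
    (rows are samples, columns are variables)."""
    past = past - past.mean(axis=0)
    future = future - future.mean(axis=0)
    n = past.shape[0]
    c_pp = past.T @ past / n        # covariance of the past window
    c_ff = future.T @ future / n    # covariance of the future window
    c_pf = past.T @ future / n      # cross-covariance
    # Whiten with Cholesky factors: C = L L^T, M = L_p^{-1} C_pf L_f^{-T}.
    l_p = np.linalg.cholesky(c_pp)
    l_f = np.linalg.cholesky(c_ff)
    m = np.linalg.solve(l_p, c_pf) @ np.linalg.inv(l_f).T
    # Singular values of M are the canonical correlations, in decreasing order.
    return np.linalg.svd(m, compute_uv=False)

# Assumed test signal: x_{t+1} = a * x_t + Gaussian noise.
rng = np.random.default_rng(0)
a, n = 0.9, 20000
noise = rng.standard_normal(n)
x = np.zeros(n)
for t in range(1, n):
    x[t] = a * x[t - 1] + noise[t]

past = np.column_stack([x[0 : n - 3], x[1 : n - 2]])   # (x_{t-1}, x_t)
future = np.column_stack([x[2 : n - 1], x[3:n]])       # (x_{t+1}, x_{t+2})
rho = canonical_correlations(past, future)
print(rho)
```

For this first-order process all predictive information passes through $x_t$, so only one canonical correlation should be clearly non-zero (close to the coefficient 0.9), suggesting that a one-dimensional reduced representation suffices.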

Conclusions
Predictive inference can be interpreted as a strategy for effective and efficient communication: past experiences are compressed into a representation that is maximally informative about future experiences. The information bottleneck (IB) framework can thus be applied, either in a direct way, or in its recursive form (RIB). Both methods find, asymptotically, the causal state partition, i.e., minimal sufficient statistics. RIB additionally recovers, asymptotically, the $\epsilon$-machine, which is a maximally predictive and minimally complex deterministic HMM, believed to be the best predictive description of a stochastic process that can be extracted from the data alone.
While the main appeal of the IB framework is its generality, and that no assumptions have to be made about the distribution that generated the data, linear and Gaussian model assumptions do result in signal processing methods that are related to known methods, such as canonical correlation analysis.
Beyond philosophical motivations, the information theoretic approach to predictive inference laid out here can also be motivated from thermodynamic considerations. Prediction and energetic efficiency are tightly coupled, because instantaneous nonpredictive information is fundamentally related to dissipation. Implemented on a physical system, predictive inference may thus constitute a strategy for using energy efficiently by minimizing dissipation.
In summary, predictive inference may have advantages, not only in the abstract world of thoughts, where it enables efficient communication, but also in a concrete thermodynamic sense. The information bottleneck framework offers an intuitive approach to an overarching theory.

[...] $I[s_t; \overleftarrow{x}_t]$ [35]; or in other words, the model's memory. For deterministic maps, we have $H[s_t|\overleftarrow{x}_t] = 0$, and the memory thus reduces to the entropy, $H[s_t]$, which reflects, for predictive systems, the statistical complexity [28]. However, in the general case of probabilistic maps, $I[s_t, \overleftarrow{x}_t]$ is a more adequate measure of model complexity than $H[s_t]$. To see this, consider a model with a large number of states, $n$, but where each data point is mapped with equal probability to each of the $n$ states [15]: $p(s_t|\overleftarrow{x}_t) = 1/n$.