Review

The Liang-Kleeman Information Flow: Theory and Applications

Information flow, or information transfer as it is sometimes called, is a fundamental notion in general physics with wide applications across scientific disciplines. Recently, a rigorous formalism has been established for both deterministic and stochastic systems, with flow measures obtained explicitly. These measures possess some important properties, among which is flow or transfer asymmetry. The formalism has been validated and applied to a variety of benchmark systems, such as the baker transformation, the Hénon map, the truncated Burgers-Hopf system, and the Langevin equation. In the chaotic Burgers-Hopf system, all the transfers, save for one, are essentially zero, indicating that the processes underlying a dynamical phenomenon, albeit complex, could be simple. (Truth is simple.) In the Langevin equation case, it is found that no information may flow from one time series to another even though the two are highly correlated. Information flow/transfer thus provides a potential measure of the cause-effect relation between dynamical events, a relation usually hidden behind correlation in the traditional sense.


Introduction
Information flow, or information transfer as it sometimes appears in the literature, refers to the transference of information between two entities in a dynamical system through some processes, with one entity being the source, and another the receiver. Its importance lies beyond its literal meaning in that it actually carries an implication of causation, uncertainty propagation, predictability transfer, etc., and, therefore, has applications in a wide variety of disciplines. In the following, we first give a brief demonstration of how it may be applied in different disciplines; the reader may skip this part and go directly to the last two paragraphs of this section.
According to how the source and receiver are chosen, information flow may appear in two forms. The first is what one would envision in the usual sense, i.e., the transference between two parallel parties (for example, two chaotic circuits [1]), which are linked through some mechanism within a system. This is found in neuroscience (e.g., [2][3][4]), network dynamics (e.g., [5][6][7]), atmosphere-ocean science (e.g., [8][9][10][11]), financial economics (e.g., [12,13]), to name but a few. For instance, neuroscientists focus their studies on the brain and its impact on behavior and cognitive functions, which are associated with flows of information within the nervous system (e.g., [3]). This includes how information flows from one neuron to another neuron across the synapse, how dendrites bring information to the cell body, how axons take information away from the cell body, and so forth. Similar issues arise in computer and social networks, where the node-node interconnection, causal dependencies, and directedness of information flow, among others, are of concern [6,14,15]. In atmosphere-ocean science, the application is vast, albeit newly begun. An example is provided by the extensively studied El Niño phenomenon in the Pacific Ocean, which is well known for its linkage to natural disasters around the globe: the floods in Ecuador; the droughts in Southeast Asia, southern Africa, and northern Australia; the death of birds and dolphins in Peru; the increased number of storms over the Pacific; and the famine and epidemic diseases in far-flung parts of the world [16][17][18]. A major focus in El Niño research is the predictability of the onset of this irregularly occurring event, in order to issue advance warnings of potentially hazardous impacts [19][20][21]. It has now become known that the variabilities in the Indian Ocean could affect the El Niño predictability (e.g., [22]).
That is to say, at least a part of the uncertainty source for El Niño predictions is from the Indian Ocean. Therefore, to some extent, the El Niño predictability may also be posed as an information flow problem, i.e., a problem on how information flows from the Indian Ocean to the Pacific Ocean to make the El Niño more predictable or more uncertain.
Financial economics provides another field of application of information flow of the first type; this field has received enormous public attention since the recent global financial crisis triggered by the subprime mortgage meltdown. A conspicuous example is the cause-effect relation between the equity and options markets, which reflects the preference of traders in deciding where to place their trades. Usually, information is believed to flow unidirectionally from equity to options markets because informed traders prefer to trade in the options markets (e.g., [23]), but recent studies show that the flow may also exist in the opposite direction: informed traders actually trade both stocks and "out-of-the-money" options, and hence the causal relation from stocks to options may reverse [12]. More (and perhaps the most important) applications are seen in predictability studies. For instance, the predictability of asset return characteristics is a continuing problem in financial economics, which is largely due to the information flow in markets. Understanding the information flow helps to assess the relative impact of the markets and of diffusive innovation on financial management. In particular, it helps predict jump timing, a fundamental question in financial decision making, by determining the information covariates that affect jump occurrence down to the intraday level, hence providing empirical evidence in the equity markets and pointing toward efficient financial management [13].
The second type of information flow appears in a more abstract way. In this case, we have one dynamical event; the transference occurs between different levels, or sometimes scales, within the same event. Examples of this type are found in disciplines such as evolutionary biology [24][25][26], statistical physics [27,28], turbulence, etc., and are also seen in network dynamics. Consider the transitions in biological complexity. A reductionist, for example, holds that the emergence of new, higher level entities can be traced back to lower level entities, and hence there is a "bottom-up" causation, i.e., an information flow from the lower levels to the higher levels. Bottom-up causation lays the theoretical foundation for statistical mechanics, which explains macroscopic thermodynamic states from the point of view of molecular motions.
On the other hand, "top-down" causation is also important [29,30]. In evolution (e.g., [31]), it has been shown that higher level processes may constrain and influence what happens at lower levels; particularly, in transiting complexity, there is a transition of information flow, from the bottom-up to top-down, leading to a radical change in the structure of causation (see, for example [32]). Similar to evolutionary biology, in network dynamics, some simple computer networks may experience a transition from a low traffic state to a high congestion state, beneath which is a flow of information from a bunch of almost independent entities to a collective pattern representing a higher level of organization (e.g., [33]). In the study of turbulence, the notoriously challenging problem in classical physics, it is of much interest to know how information flows over the spectrum to form patterns on different scales. This may help to better explain the cause of the observed higher moments of the statistics, such as excess kurtosis and skewness, of velocity components and velocity derivatives [34]. Generally, the flows/transfers are two-way, i.e., both from small scales to large scales, and from large scales to small scales, but the flow or transfer rates may be quite different.
Apart from the diverse real-world applications, information flow/transfer is important in that it offers a methodology for scientific research.
In particular, it offers a new way of time series analysis [35][36][37]. Traditionally, correlation analysis is widely used for identifying the relation between two events represented by time series of measurements; an alternative approach is through mutual information analysis, which may be viewed as a type of nonlinear correlation analysis. But both correlation analysis and mutual information analysis put the two events on an equal footing. As a result, there is no way to pick out the cause and the effect. In econometrics, Granger causality [38] is usually employed to characterize the causal relation between time series, but the characterization is just in a qualitative sense; when two events are mutually causal, it is difficult to differentiate their relative strengths. The concept of information flow/transfer is expected to remedy this deficiency, with the mutual causal relation quantitatively expressed.
Causality implies directionality. Perhaps the most conspicuous observation about information flow/transfer is its asymmetry between the involved parties. A typical example is seen in our daily life when a baker kneads dough. As the baker stretches, cuts, and folds, he guides a unilateral flow of information from the horizontal to the vertical: information goes only from the stretching direction to the folding direction, not vice versa. The one-way information flow (in the conventional point of view) between the equity and options markets offers another good example. In other cases, such as the aforementioned El Niño event, though the Indian and Pacific Oceans may interact with each other, i.e., the flow route could be a two-way street, the flow rate generally differs from one direction to the other. On all accounts, transfer asymmetry is a basic property of information flow; it is this property that distinguishes information flow from traditional concepts such as mutual information.
As an aside, one should not confuse dynamics with causality, the important property reflected in the asymmetry of information flow. It is tempting to think that, for a system, once the dynamics are known, the causal relations are determined. While this might be the case for linear deterministic systems, in general it need not be true. Nonlinearity may lead a deterministic system to chaos; the future may not be predictable after a certain period of time, even though the dynamics are explicitly given. The concept of emergence in complex systems offers another example. It has long been found that irregular motions obeying some simple rules may result in the emergence of regular patterns (such as the inverse cascade in planar turbulence in the natural world [39,40]). Obviously, this instantaneous flow of information from the low-level entities to the high-level entities, i.e., the patterns, cannot be simply explained by the rudimentary rules set a priori. In the language of complexity, emergence does not result from rules only (e.g., [41][42][43]); rather, as said by Corning (2002) [44], "Rules, or laws, have no causal efficacy; they do not in fact 'generate' anything... the underlying causal agencies must be separately specified."

Historically, quantification of information flow has been an enduring problem. The challenge lies in that this is a real physical notion, while the physical foundation is not as clear as those of well-known physical laws. During the past decades, formalisms have been established empirically or half-empirically based on observations in the aforementioned diverse disciplines, among which are Vastano and Swinney's time-delayed mutual information [45], and Schreiber's transfer entropy [46,47]. In particular, transfer entropy is established with an emphasis on the above transfer asymmetry between the source and receiver, so as to have the causal relation represented; it has been successfully applied in many real-problem studies.
These formalisms, when carefully analyzed, can be approximately understood as dealing with the change of marginal entropy in the Shannon sense, and how this change may be altered in the presence of information flow (see [48], section 4 for a detailed analysis). This motivates us to think about the possibility of a rigorous formalism when the dynamics of the system is known. As such, the underlying evolution of the joint probability density function (pdf) will also be given, for deterministic systems, by the Liouville equation or, for stochastic systems, by the Fokker-Planck equation (cf. §4 and §5 below). From the joint pdf, it is easy to obtain the marginal density, and hence the marginal entropy. One thus expects that the concept of information flow/transfer may be built on a rigorous footing when the dynamics are known, as is the case with many real world problems like those in atmosphere-ocean science. And, indeed, Liang and Kleeman (2005) [49] find that, for two-dimensional (2D) systems, there is a concise law on entropy evolution that makes the hypothesis come true. Since then, the formalism has been extended to systems in different forms and of arbitrary dimensionality, and has been applied with success in benchmark dynamical systems and more realistic problems. In the following sections, we will give a systematic introduction of the theories and a brief review of some of the important applications.
In the rest of this review, we first set up a theoretical framework, then illustrate through a simple case how a rigorous formalism can be achieved. Specifically, our goal is to compute within the framework, for a continuous-time system, the transference rate of information, and, for a discrete-time system or mapping, the amount of the transference upon each application of the mapping. To unify the terminology, we may simply use "information flow/transfer" to indicate either the "rate of information flow/transfer" or the "amount of information flow/transfer" wherever no ambiguity exists in the context. The next three sections are devoted to the derivations of the transference formulas for three different systems. Sections 3 and 4 are for deterministic systems, with randomness limited to the initial conditions, where the former deals with discrete mappings and the latter with continuous flows. Section 5 discusses the case when stochasticity is taken into account. In the section that follows, four major applications are briefly reviewed. While these applications are important per se, some of them also provide validations for the formalism. They are also typical in terms of computation: different approaches (both analytical and computational) have been employed in computing the flow or transfer rates for these systems. We summarize in Section 7 the major results regarding the formulas and their corresponding properties, and give a brief discussion of future research along this line. Following the convention in the historical development, the terms "information flow" and "information transfer" will be used synonymously. Throughout this review, by entropy we always mean Shannon or absolute entropy, unless otherwise specified. Whenever a theorem is stated, generally only the result is given and interpreted; for detailed proofs, the reader is referred to the original papers.

Theoretical Framework
Consider a system with n state variables, x1, x2, ..., xn, which we put together as a column vector x = (x1, ..., xn)^T. Throughout this paper, x may be either deterministic or random, depending on the context in which it appears. This is a notational convention adopted in the physics literature, where random and deterministic states of the same variable are not distinguished. (In probability theory, they are usually distinguished with lower and upper cases, like x and X.) Consider a sample space of x, Ω ⊂ R^n. Defined on Ω is a joint probability density function (pdf) ρ = ρ(x). For convenience, assume that ρ and its derivatives (up to as high an order as needed) are compactly supported. This makes sense, as in the real physical world the probability of extreme events vanishes. Thus, without loss of generality, we may extend Ω to R^n and consider the problem on R^n, giving a joint density in L¹(R^n) and n marginal densities ρi ∈ L¹(R):

	ρi(xi) = ∫_{R^{n−1}} ρ(x1, x2, ..., xn) dx1 ... dx(i−1) dx(i+1) ... dxn,    i = 1, ..., n.	(1)

Correspondingly, we have an entropy functional of ρ (joint entropy) in the Shannon sense,

	H = −∫_{R^n} ρ(x) log ρ(x) dx,	(2)

and n marginal entropies

	Hi = −∫_R ρi(xi) log ρi(xi) dxi,    i = 1, ..., n.

Consider an n-dimensional dynamical system, autonomous or nonautonomous,

	dx/dt = F(x, t),	(3)

where F = (F1, F2, ..., Fn)^T is the vector field. With random inputs at the initial stage, the system generates a continuous stochastic process {x(t), t ≥ 0}, which is what we are concerned with. In many cases, the process may not be continuous in time (such as that generated by the baker transformation, as mentioned in the introduction). We thence also need to consider a system in the discrete mapping form,

	x(τ + 1) = Φ(x(τ)),	(4)

with τ being positive integers. Here Φ is an n-dimensional transformation, the counterpart of the vector field F. Again, the system is assumed to be perfect, with randomness limited to the initial conditions. Cases with stochasticity due to model inaccuracies are deferred to Section 5. The stochastic process thus formed is in a discrete-time form {x(τ)}, with τ > 0 signifying the time steps.
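As a quick numerical illustration of these functionals, the joint and marginal entropies of a bivariate Gaussian can be computed by direct quadrature and checked against the closed-form Gaussian entropy. This sketch is ours, not from the original papers; the correlation value and the grid are arbitrary choices.

```python
import numpy as np

# Bivariate Gaussian with unit variances and correlation r (arbitrary choice)
r = 0.5
x = np.linspace(-8.0, 8.0, 801)
dx = x[1] - x[0]
X1, X2 = np.meshgrid(x, x, indexing="ij")

# Joint pdf rho(x1, x2)
det = 1.0 - r**2
rho = np.exp(-(X1**2 - 2*r*X1*X2 + X2**2) / (2*det)) / (2*np.pi*np.sqrt(det))

# Marginal pdf rho_1(x1): integrate rho over x2
rho1 = rho.sum(axis=1) * dx

# Shannon entropies: H = -int rho log rho, H1 = -int rho1 log rho1
H = -(rho * np.log(rho)).sum() * dx * dx
H1 = -(rho1 * np.log(rho1)).sum() * dx

# Closed forms for Gaussian densities
H_exact = 0.5 * np.log((2*np.pi*np.e)**2 * det)   # ~2.694
H1_exact = 0.5 * np.log(2*np.pi*np.e)             # ~1.419

print(H, H_exact)
print(H1, H1_exact)
```

The quadrature values agree with the closed forms to high accuracy, since the density decays rapidly within the chosen domain.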
Our formalism will be established henceforth within these frameworks.

Toward a Rigorous Formalism-A Heuristic Argument
First, let us look at the two-dimensional (2D) case originally studied by Liang and Kleeman [49],

	dx1/dt = F1(x1, x2, t),	(6)
	dx2/dt = F2(x1, x2, t).	(7)

This is a system of minimal dimensionality that admits information flow. Without loss of generality, examine only the flow/transfer from x2 to x1. Under the vector field F = (F1, F2)^T, x evolves with time; correspondingly its joint pdf ρ(x) evolves, observing a Liouville equation [50]:

	∂ρ/∂t + ∂(F1 ρ)/∂x1 + ∂(F2 ρ)/∂x2 = 0.	(8)

As argued in the introduction, what matters here is the evolution of H1, namely the marginal entropy of x1. For this purpose, integrate (8) with respect to x2 over R to get:

	∂ρ1/∂t + ∫_R ∂(F1 ρ)/∂x1 dx2 = 0.	(9)

Other terms vanish, thanks to the compact support assumption for ρ. Multiplication of (9) by −(1 + log ρ1) followed by an integration over R gives the tendency of H1:

	dH1/dt = −E( F1 ∂ log ρ1/∂x1 ),	(10)

where E stands for mathematical expectation with respect to ρ. In the derivation, integration by parts has been used, as well as the compact support assumption. Now what is the rate of information flow from x2 to x1? In [49], Liang and Kleeman argue that, as the system steers a state forward, the marginal entropy of x1 is replenished from two different sources: one is from x1 itself, the other from x2. The latter is through the very mechanism namely information flow/transfer. If we write the former as dH1*/dt, and denote by T2→1 the rate of information flow/transfer from x2 to x1 (T stands for "transfer"), this gives a decomposition of the marginal entropy increase according to the underlying mechanisms:

	dH1/dt = dH1*/dt + T2→1.	(11)

Here dH1/dt is known from Equation (10). To find T2→1, one may look for dH1*/dt instead. In [49], Liang and Kleeman find that this is indeed possible, based on a heuristic argument. To see this, multiply the Liouville Equation (8) by −(1 + log ρ), then integrate over R². This yields an equation governing the evolution of the joint entropy H which, after a series of manipulations, is reduced to

	dH/dt = ∫∫_{R²} ∇ · (ρ F log ρ) dx1 dx2 + ∫∫_{R²} ρ (∇ · F) dx1 dx2,

where ∇ is the divergence operator. With the assumption of compact support, the first term on the right hand side goes to zero.
Using E to indicate the operator of mathematical expectation, this becomes

	dH/dt = E(∇ · F).	(12)

That is to say, the time rate of change of H is precisely equal to the mathematical expectation of the divergence of the vector field. This remarkably concise result tells us that, as a system moves on, the change of its joint entropy is totally controlled by the contraction or expansion of the phase space of the system. Later on, Liang and Kleeman showed that this is actually a property holding for deterministic systems of arbitrary dimensionality, even without invoking the compact support assumption [51]. Moreover, it has also been shown that the local marginal entropy production observes a law in a similar form, if no remote effect is taken into account [52]. With Equation (12), Liang and Kleeman argue that, apart from the complicated relations, the rate of change of the marginal entropy H1 due to x1 only (i.e., dH1*/dt as symbolized above) should be

	dH1*/dt = E( ∂F1/∂x1 ).	(13)

This heuristic reasoning makes the separation (11) possible. Hence the information flows from x2 to x1 at a rate of

	T2→1 = dH1/dt − dH1*/dt = −E( F1 ∂ log ρ1/∂x1 ) − E( ∂F1/∂x1 ) = −∫∫_{R²} ρ2|1(x2|x1) ∂(F1 ρ1)/∂x1 dx1 dx2,	(14)

where ρ2|1 is the conditional pdf of x2, given x1. The rate of information flow from x1 to x2, written T1→2, can be derived in the same way. This tight formalism (called the "LK2005 formalism" henceforth), albeit based on heuristic reasoning, turns out to be very successful. The same strategy has been applied again in a similar study by Majda and Harlim [53]. We will have a chance to see these in Sections 4 and 6.
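The 2D rate formula (14) can be checked numerically on a simple case. For a linear system dx/dt = Ax with a zero-mean Gaussian state, carrying the expectation in (14) through by hand gives T2→1 = a12 σ12/σ11, where aij and σij denote the entries of A and of the covariance matrix. The sketch below is our own illustration, not from the original papers, and the matrices are arbitrary; it evaluates (14) by quadrature on a grid and compares with this closed form.

```python
import numpy as np

# Linear vector field F = A x and a zero-mean Gaussian density (arbitrary choices)
A = np.array([[-1.0, 0.5],
              [ 0.3, -2.0]])
S = np.array([[1.0, 0.4],
              [0.4, 1.5]])       # covariance of the current state

x = np.linspace(-10.0, 10.0, 1001)
dx = x[1] - x[0]
X1, X2 = np.meshgrid(x, x, indexing="ij")

# Joint Gaussian pdf rho(x1, x2) and marginal rho1(x1)
Sinv = np.linalg.inv(S)
q = Sinv[0, 0]*X1**2 + 2*Sinv[0, 1]*X1*X2 + Sinv[1, 1]*X2**2
rho = np.exp(-0.5*q) / (2*np.pi*np.sqrt(np.linalg.det(S)))
rho1 = rho.sum(axis=1) * dx

# Rate formula: T_{2->1} = -int int rho_{2|1} d(F1 rho1)/dx1 dx1 dx2
F1 = A[0, 0]*X1 + A[0, 1]*X2
g = F1 * rho1[:, None]            # F1 * rho1(x1)
dg = np.gradient(g, dx, axis=0)   # d(F1 rho1)/dx1
T21 = -((rho / rho1[:, None]) * dg).sum() * dx * dx

T21_exact = A[0, 1] * S[0, 1] / S[0, 0]   # closed form a12*s12/s11 = 0.2
print(T21, T21_exact)
```

Note that T2→1 depends only on F1 and on how x1 and x2 co-vary, reflecting that the flow is into x1.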

Mathematical Formalism
The success of the LK2005 formalism is remarkable. However, its utility is limited to systems of dimensionality 2. For an n-dimensional system with n > 2, the so-obtained Equation (14) is not the transfer from x2 to x1, but the cumulative transfer to x1 from all the other components x2, x3, ..., xn. Unless one can screen out from Equation (14) the part contributed from x2, it seems that the formalism does not yield the desiderata for high-dimensional systems.
To overcome the difficulty, Liang and Kleeman [48,51] observe that the key part in Equation (14), namely dH1*/dt, actually can be alternatively interpreted, for a 2D system, as the evolution of H1 with the effect of x2 excluded. In other words, it is the tendency of H1 with x2 frozen instantaneously at time t. To avoid confusion with dH1*/dt, denote it as dH1\2/dt, with the subscript \2 signifying that the effect of x2 is removed. In this way dH1/dt is decomposed into two disjoint parts: T2→1, namely the rate of information flow, and dH1\2/dt. The flow is then the difference between dH1/dt and dH1\2/dt:

	T2→1 = dH1/dt − dH1\2/dt.	(15)

For 2D systems, this is just a restatement of Equation (14) in another set of symbols; but for systems with dimensionality higher than 2, they are quite different. Since the above partitioning does not place any restraints on n, Equation (15) is applicable to systems of arbitrary dimensionality.
In the same spirit, we can formulate the information transfer for discrete systems in the form of Equation (4). As x is mapped forth under the transformation Φ from time step τ to τ + 1, correspondingly its density ρ is steered forward by an operator named after Georg Frobenius and Oskar Perron, which we will introduce later. Accordingly the entropies H, H1, and H2 also change with time. On the interval [τ, τ + 1], let H1 be incremented by ∆H1 from τ to τ + 1. By the foregoing argument, the evolution of H1 can be decomposed into two exclusive parts according to their driving mechanisms, i.e., the information flow from x2, T2→1, and the evolution with the effect of x2 excluded, written as ∆H1\2. We therefore obtain the discrete counterpart of Equation (15):

	T2→1 = ∆H1 − ∆H1\2.	(16)

Equations (15) and (16) give the rate and the amount, respectively, of information flow/transfer from component x2 to component x1 for systems (3) and (4). One may switch the corresponding indices to obtain the flow between any component pair xi and xj, i ≠ j. In the following two sections we will be exploring how these equations are evaluated.

Frobenius-Perron Operator
For discrete systems in the form of Equation (4), as x is carried forth under the transformation Φ, there is another transformation, called Frobenius-Perron operator P (F-P operator hereafter), steering ρ(x), i.e., the pdf of x, to Pρ (see a schematic in Figure 1). The F-P operator governs the evolution of the density of x.
A rigorous definition requires some ingredients of measure theory, which are beyond the scope of this review; the reader may consult reference [50]. Loosely speaking, given a transformation Φ : Ω → Ω, the F-P operator P : L¹(Ω) → L¹(Ω) is defined such that

	∫_ω Pρ(x) dx = ∫_{Φ⁻¹(ω)} ρ(x) dx	(17)

for any ω ⊂ Ω. If Φ is nonsingular and invertible, P actually can be explicitly evaluated. Making the transformation y = Φ(x), the right hand side is, in this case,

	∫_{Φ⁻¹(ω)} ρ(x) dx = ∫_ω ρ(Φ⁻¹(y)) |J⁻¹| dy,

where J is the Jacobian of Φ:

	J = det[ ∂(Φ1, Φ2, ..., Φn) / ∂(x1, x2, ..., xn) ],

and J⁻¹ that of its inverse. Since ω is arbitrarily chosen, we have

	Pρ(x) = ρ(Φ⁻¹(x)) · |J⁻¹|.	(18)

If no nonsingularity is assumed for the transformation Φ, but the sample space Ω is in a Cartesian product form, as is the case for this review, the F-P operator can also be evaluated, though not in an explicit form.
Specifically, consider ω = [a1, x1] × [a2, x2] × ... × [an, xn], where a = (a1, ..., an) is some constant point (usually it can be set to be the origin). Let the counterimage of ω be Φ⁻¹(ω); then it has been proved (cf. [50]) that

	Pρ(x) = ∂ⁿ/(∂x1 ∂x2 ... ∂xn) ∫_{Φ⁻¹(ω)} ρ(ξ) dξ.

In this review, we consider a sample space R^n, so essentially all the F-P operators can be calculated this way.
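For an invertible one-dimensional map, Equation (18) is straightforward to implement. The toy example below is our own illustration: pushing a standard normal density through the affine map Φ(x) = 2x + 1 must yield the N(1, 2²) density, which the sketch verifies pointwise, along with conservation of total probability.

```python
import numpy as np

def fp_affine(rho, a, b):
    """F-P operator for the invertible map Phi(x) = a*x + b (a != 0).
    By Equation (18): (P rho)(x) = rho(Phi^{-1}(x)) * |J^{-1}|, with J = a."""
    def prho(x):
        return rho((x - b) / a) / abs(a)
    return prho

def gauss(mu, sig):
    return lambda x: np.exp(-(x - mu)**2 / (2*sig**2)) / (sig*np.sqrt(2*np.pi))

rho0 = gauss(0.0, 1.0)            # density of x
prho = fp_affine(rho0, 2.0, 1.0)  # density of Phi(x) = 2x + 1

x = np.linspace(-10.0, 12.0, 2001)
dx = x[1] - x[0]
# If x ~ N(0,1), then 2x + 1 ~ N(1, 2^2)
err = np.max(np.abs(prho(x) - gauss(1.0, 2.0)(x)))
norm = prho(x).sum() * dx         # P preserves total probability
print(err, norm)
```

The pointwise error is at round-off level, and the pushed-forward density integrates to one, as the defining property (17) requires.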

Information Flow
The F-P operator P allows for an evaluation of the change of entropy as the system evolves forth. By the formalism (16), we need to examine how the marginal entropy changes on a time interval [τ, τ + 1]. Without loss of generality, consider only the flow from x2 to x1. First look at the increase of H1. Let ρ be the joint density at step τ; then the joint density at step τ + 1 is Pρ, and hence

	H1(τ + 1) = −∫_R (Pρ)1(y1) log (Pρ)1(y1) dy1.	(19)

Here (Pρ)1 means the marginal density of x1 at τ + 1; it is equal to Pρ with all components of x but x1 integrated out. The independent variables with respect to which the integrations are taken are dummy; but for the sake of clarity, we use different notations, i.e., x and y, for them at time steps τ and τ + 1, respectively.
The key to the formalism (16) is the finding of

	∆H1\2 = H1\2(τ + 1) − H1(τ),	(20)

namely the increment of the marginal entropy of x1 on [τ, τ + 1] with the contribution from x2 excluded. (At step τ the modified system starts from the same density, so H1\2(τ) = H1(τ).)
Here the system in question is no longer Equation (4), but a system with a mapping modified from Φ:

	Φ\2 : (x1, x3, ..., xn) ↦ (y1, y3, ..., yn),    yi = Φi(x1, x2, x3, ..., xn),  i ≠ 2,	(21)

with x2 frozen instantaneously at τ as a parameter. Again, we use xi and yi, i = 1, ..., n, to indicate the state variables at steps τ and τ + 1, respectively, to avoid any possible confusion. In the meantime, the dependence on τ and τ + 1 is suppressed for notational economy. Corresponding to the modified transformation Φ\2 is a modified F-P operator, written P\2. To find H1\2(τ + 1), examine the quantity h = − log(P\2 ρ)1(y1), where the subscript 1 indicates that this is a marginal density of the first component, and the dependence on y1 tells that this is evaluated at step τ + 1. Recall how Shannon entropy is defined: H1\2(τ + 1) is essentially the mathematical expectation, or "average" in loose language, of h. More specifically, it is h multiplied with some pdf followed by an integration over R^n, i.e., the corresponding sample space. The pdf is composed of several different factors. The first is, of course, (P\2 ρ)1(y1). But h, as well as (P\2 ρ)1, also has a dependence on x2, which is embedded within the subscript \2. Recall how x2 is treated during [τ, τ + 1]: it is frozen at step τ and kept on as a parameter, given all other components at τ. Therefore, the second part of the density is ρ(x2|x1, x3, ..., xn), i.e., the conditional density of x2 on x1, x3, ..., xn. (Note again that xi means variables at time step τ.) This factor introduces extra dependencies: x3, x4, ..., xn (that of x1 is embedded in y1), which must also be averaged out, so the third factor of the density is ρ3...n(x3, ..., xn), namely the joint density of (x3, x4, ..., xn). Putting all these together,

	H1\2(τ + 1) = −∫_{R^n} (P\2 ρ)1(y1) log(P\2 ρ)1(y1) · ρ(x2|x1, x3, ..., xn) · ρ3...n(x3, ..., xn) dy1 dx2 dx3 ... dxn.	(22)

Subtraction of H1\2(τ + 1) from H1(τ + 1) in Equation (19) (the common term H1(τ) cancels) gives, eventually, the information flow/transfer from x2 to x1:

	T2→1 = ∆H1 − ∆H1\2 = H1(τ + 1) − H1\2(τ + 1).	(23)

Notice that the conditional density of x2 is on x1, not on y1.
(x1 and y1 are the same state variable evaluated at different time steps, and are connected via y1 = Φ1(x1, x2, ..., xn).) Likewise, it is easy to obtain the information flow between any pair of components. If, for example, we are concerned with the flow from xj to xi (i, j = 1, 2, ..., n, i ≠ j), replacement of the indices 1 and 2 in Equation (23) respectively with i and j gives

	Tj→i = ∆Hi − ∆Hi\j = Hi(τ + 1) − Hi\j(τ + 1).	(24)

Here Hi\j is computed with the modified F-P operator P\j, i.e., that with the effect of the jth component excluded through freezing it instantaneously as a parameter. We have also abused the notation a little bit for the density function to indicate the marginalization of that component. That is to say,

	ρ\j(x1, ..., x(j−1), x(j+1), ..., xn) = ∫_R ρ(x1, ..., xn) dxj,	(25)

and ρ\i\j is the density after being marginalized twice, with respect to xi and xj. To avoid this potential notational complexity, alternatively, one may reorganize the order of the components of the vector x = (x1, ..., xn)^T such that the pair appears in the first two slots, and modify the mapping Φ accordingly. In this case, the flow/transfer is precisely the same in form as Equation (23). Equations (23) and (24) can be evaluated explicitly for systems that are definitely specified. In the following sections we will see several concrete examples.

Properties
The information flow obtained in Equations (23) or (24) has some nice properties. The first is a concretization of the transfer asymmetry emphasized by Schreiber [47] (as mentioned in the introduction), and the second a special property for 2D systems.
Theorem 3.1 For the system Equation (4), if Φi does not depend on xj, then Tj→i = 0.

The proof is rather technically involved; the reader is referred to [48] for details. This theorem states that, if the evolution of xi has nothing to do with xj, then there will be no information flowing from xj to xi. This is in agreement with observations, and with what one would argue on physical grounds. On the other hand, a vanishing Tj→i yields no clue about Ti→j; i.e., the flow from xi to xj need not be zero in the meantime, unless Φj does not rely on xi. This is indicative of a very important physical fact: information flow between a component pair is not symmetric, in contrast to the notion of mutual information in information theory. As emphasized by Schreiber [47], a faithful formalism must be able to recover this asymmetry. The theorem shows that our formalism yields precisely what is expected. Since transfer asymmetry is a reflection of causality, the above theorem is also referred to as the property of causality by Liang and Kleeman [48].

Theorem 3.2 For the system Equation (4), if n = 2 and Φ1 is invertible, then

	T2→1 = ∆H1 − E log |J1|,    J1 = ∂Φ1/∂x1.	(26)
A brief proof will help to gain a better understanding of the theorem. If n = 2, the modified system has a mapping Φ\2 which is simply Φ1 with x2 as a parameter. Equation (22) is thus reduced to

	H1\2(τ + 1) = −∫∫_{R²} (P\2 ρ)1(y1) log(P\2 ρ)1(y1) · ρ(x2|x1) dy1 dx2,

where y1 = Φ1(x1, x2), and (P\2 ρ)1 is the marginal density of x1 evolving from ρ\2 = ρ1 upon one application of Φ\2 = Φ1. By assumption Φ1 is invertible, that is to say, J1 = ∂Φ1/∂x1 ≠ 0. The F-P operator hence can be explicitly written out:

	(P\2 ρ)1(y1) = ρ1(Φ1⁻¹(y1; x2)) · |J1⁻¹|,

with x2 a parameter. So, changing the integration variable y1 back to x1 (dy1 = |J1| dx1),

	H1\2(τ + 1) = −∫∫_{R²} ρ1(x1) [log ρ1(x1) − log |J1|] ρ(x2|x1) dx1 dx2 = H1(τ) + E log |J1|.

The conclusion follows subsequently from Equation (16).
The above theorem actually states another interesting fact that parallels what we introduced previously in §2.2 via heuristic reasoning. To see this, reconsider the mapping Φ : R^n → R^n, x ↦ y = Φ(x). Let Φ be nonsingular and invertible. By Equation (18), the F-P operator of the joint pdf ρ can be explicitly evaluated. Accordingly, the entropy increases, as time moves from step τ to step τ + 1, by

	∆H = H(τ + 1) − H(τ) = −∫_{R^n} Pρ(y) log Pρ(y) dy + ∫_{R^n} ρ(x) log ρ(x) dx.	(27)

After some manipulation (see [48] for details), this is reduced to

	∆H = E log |J|.	(28)

This is the discrete counterpart of Equation (12), yet another remarkably concise formula. Now, if the system in question is 2-dimensional, then, as argued in §2.2, the information flow from x2 to x1 should be ∆H1 − ∆H1*, with ∆H1* being the marginal entropy increase due to x1 itself. Furthermore, if Φ1 is nonsingular and invertible, then Equation (28), applied to the modified system with x2 frozen as a parameter, gives ∆H1* = E log |J1|, so that T2→1 = ∆H1 − E log |J1|, in accordance with Theorem 3.2.

Continuous Systems
For continuous systems in the form of Equation (3), we may take advantage of what we already have from the previous section to obtain the information flow. Without loss of generality, consider only the flow/transfer from x2 to x1, T2→1. We adopt the following strategy to fulfill the task:

• Discretize the continuous system in time on [t, t + ∆t], and construct a mapping Φ to take x(t) to x(t + ∆t);
• Freeze x2 in Φ throughout [t, t + ∆t] to obtain a modified mapping Φ\2;
• Compute the marginal entropy change ∆H1 as Φ steers the system from t to t + ∆t;
• Derive the marginal entropy change ∆H1\2 as Φ\2 steers the modified system from t to t + ∆t;
• Take the limit

	T2→1 = lim_{∆t→0} (∆H1 − ∆H1\2) / ∆t

to arrive at the desiderata.
Clearly, this mapping is always invertible so long as Δt is small enough. In fact, to the first order of Δt we have Φ(x) = x + Δt F(x, t), and its Jacobian is

J = 1 + Δt ∇·F + O(Δt²).

Likewise, it is easy to get J⁻¹ = 1 − Δt ∇·F + O(Δt²). This makes it possible to evaluate the F-P operator associated with Φ by Equation (18). Here ∇·F = ∑_i ∂F_i/∂x_i; we have suppressed its dependence on x to simplify the notation. As an aside, the explicit evaluation (31), and subsequently (32) and (33), can actually be utilized to arrive at the important entropy evolution law (12) without invoking any assumptions. To see this, recall that ΔH = E log |J| by Equation (28). Letting Δt go to zero gives

dH/dt = E(∇·F),

which is the very result (12), just as one may expect.
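This limiting relation is easy to check numerically. The sketch below (ours, not from the original derivation) uses the 1D linear flow dx/dt = −x, for which ∇·F = −1 and the Gaussian solution x ~ N(0, σ₀² e^{−2t}) is exact, so dH/dt can be formed by a finite difference:

```python
import math

def gaussian_entropy(sigma):
    # H = (1/2) log(2*pi*e*sigma^2) for a 1D Gaussian
    return 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)

# under dx/dt = -x, an initial N(0, s0^2) stays Gaussian with
# sigma(t) = s0 * exp(-t)
s0, dt = 2.0, 1e-6
H0 = gaussian_entropy(s0)
H1 = gaussian_entropy(s0 * math.exp(-dt))
dH_dt = (H1 - H0) / dt   # should equal E(div F) = -1
```

The finite difference reproduces E(∇·F) = −1, as the entropy evolution law predicts.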

Information Flow
To compute the information flow T_{2→1}, we need to know dH_1/dt and dH_{1\2}/dt. The former is easy to find from the Liouville equation associated with Equation (3), ∂ρ/∂t + ∇·(ρF) = 0, following the same derivation as that in §2.2:

dH_1/dt = −E[F_1 ∂(log ρ_1)/∂x_1].

The challenge lies in the evaluation of dH_{1\2}/dt. We summarize the result in the following proposition, in which ρ_{\2} = ∫_R ρ dx_2 and ρ_{\1\2} = ∫_{R²} ρ dx_1 dx_2 are the densities marginalized with respect to x_2 and to (x_1, x_2), respectively.
The proof is rather technically involved; for details, see [51], section 5.
With the above result, subtract dH_{1\2}/dt from dH_1/dt and one obtains the rate of information flow from x_2 to x_1. Likewise, the information flow between any component pair (x_i, x_j), i, j = 1, 2, ..., n, i ≠ j, can be obtained in the same way.

Theorem 4.1 For the dynamical system (3), the rate of information flow from x_j to x_i, T_{j→i}, is given by the formula (39), in which a quantity Θ_{j|i} appears. Θ_{j|i} reminds one of the conditional density of x_j on x_i and, if n = 2, it is indeed so. We may therefore call it the "generalized conditional density" of x_j on x_i.

Properties
Recall that, as we argued in §2.2 based on the entropy evolution law (12), the time rate of change of the marginal entropy of a component, say x_1, due to its own dynamics is dH*_1/dt = E(∂F_1/∂x_1). Since for a 2D system dH*_1/dt is precisely dH_{1\2}/dt, we expect the above formalism, (36) or (39), to verify this result.

Theorem 4.2 If the system (3) has dimensionality 2, then

dH_{1\2}/dt = E(∂F_1/∂x_1),

and hence the rate of information flow from x_2 to x_1 is T_{2→1} = dH_1/dt − E(∂F_1/∂x_1). What makes a 2D system so special is that, when n = 2, ρ_{\2} = ρ_1, and Θ_{2|1} is just the conditional distribution of x_2 given x_1, ρ/ρ_1 = ρ(x_2|x_1). Equation (36) can thereby be greatly simplified. Subtract this from what has been obtained above for dH_1/dt, and we get an information flow just as that in Equation (14), obtained via heuristic argument.

As in the discrete case, one important property that T_{j→i} must possess is transfer asymmetry, which has been emphasized previously, particularly by Schreiber [47]. The following is a concretization of the argument. Look at the right-hand side of the formula (39), and suppose that F_i does not depend on x_j. Given that (1 + log ρ_i) and ρ_{\j}, as well as F_i (by assumption), are independent of x_j, the integration with respect to x_j can be taken inside the multiple integrals. Consider the second integral first. All the factors except Θ_{j|i} have no dependence on x_j; but ∫ Θ_{j|i} dx_j = 1, so the whole term reduces to an integral in divergence form, which vanishes by the assumption of compact support. For the first integral, move the integration with respect to x_j inside the parentheses, as the factor outside has nothing to do with x_j; this integration, combined with the compact-support assumption, makes the term vanish as well. On this account, both integrals on the right-hand side of Equation (39) vanish, leaving a zero flow of information from x_j to x_i. This is the content of Theorem 4.3: if F_i is independent of x_j, then T_{j→i} = 0. Notice that this vanishing T_{j→i} gives no hint about the flow in the opposite direction. In other words, this kind of flow or transfer is not symmetric, reflecting the causal relation between the component pair. As Theorem 3.1 does for discrete systems, Theorem 4.3 expresses the property of causality for continuous systems.

Stochastic Systems
So far, all the systems considered have been deterministic. In this section we turn to systems with stochasticity included. Consider the stochastic counterpart of Equation (3),

dx = F(x, t) dt + B(x, t) dw,

where w is a vector of standard Wiener processes, and B = (b_ij) the matrix of perturbation amplitudes. In this section, we limit our discussion to 2D systems, and hence have only two flows/transfers to discuss. Without loss of generality, consider only T_{2→1}, i.e., the rate of flow/transfer from x_2 to x_1. As before, we first need to find the time rate of change of H_1, the marginal entropy of x_1. This can be derived from the density evolution equation corresponding to Equation (47), i.e., the Fokker-Planck equation,

∂ρ/∂t + ∇·(Fρ) = (1/2) ∑_{i,j} ∂²(g_ij ρ)/∂x_i ∂x_j,

where g_ij = g_ji = ∑_{k=1}^{2} b_ik b_jk, i, j = 1, 2. Integrated over R with respect to x_2, this gives the evolution of ρ_1, Equation (49). Multiply (49) by −(1 + log ρ_1), and integrate with respect to x_1 over R. After some manipulation, one obtains, using the compact support assumption,

dH_1/dt = −E[F_1 ∂(log ρ_1)/∂x_1] − (1/2) E[g_11 ∂²(log ρ_1)/∂x_1²],

where E is the mathematical expectation with respect to ρ. Again, the key to the formalism is finding dH_{1\2}/dt. For stochastic systems, this can be a challenging task: we cannot obtain an F-P operator as nice as that in the previous section for the map resulting from discretization. Earlier, Majda and Harlim [53] applied the heuristic argument of §2.2 to a special system modeling the atmosphere-ocean interaction, in which the governing equation for x_1 is deterministic while that for x_2 is stochastic. Their purpose was to find T_{2→1}, namely the information transfer from x_2 to x_1. In this case, since the governing equation for x_1 is deterministic, the result is precisely the same as that of LK05, which is shown in §2.2. The problem is that the approach cannot be extended even to finding T_{1→2}, since the nice law on which the argument is based, i.e., Equation (12), does not hold for stochastic processes. Liang (2008) [54] adopted a different approach to give this problem a solution.
As in the previous section, the general strategy is to discretize the system in time, modify the discretized system with x_2 frozen as a parameter on an interval [t, t + Δt], and then let Δt go to zero and take the limit. But this time no operator analogous to the F-P operator is sought; instead, we discretize the Fokker-Planck equation and expand x_{1\2}(t + Δt), namely the first component at t + Δt with x_2 frozen at time t, using the Euler-Bernstein approximation. The complete derivation is beyond the scope of this review; the reader is referred to [54] for details. In the following, the final result is supplied in the form of a proposition.
Proposition 5.1 For the 2D stochastic system (47), the time change of the marginal entropy of x_1 with the contribution from x_2 excluded is

dH_{1\2}/dt = E(∂F_1/∂x_1) − (1/2) E[g_11 ∂²(log ρ_1)/∂x_1²] − (1/2) E[(1/ρ_1) ∂²(g_11 ρ_1)/∂x_1²].

In this equation, the second and third terms on the right-hand side come from the stochastic perturbation. The first term, as one may recall, is precisely the result of Theorem 4.2; the heuristic argument for 2D systems in Equation (13) is thus successfully recovered here. With this, the rate of information flow can be easily obtained by subtracting dH_{1\2}/dt from dH_1/dt.
Theorem 5.1 For the 2D stochastic system (47), the rate of information flow from x_2 to x_1 is

T_{2→1} = −E[(1/ρ_1) ∂(F_1 ρ_1)/∂x_1] + (1/2) E[(1/ρ_1) ∂²(g_11 ρ_1)/∂x_1²],

where E is the expectation with respect to ρ(x_1, x_2).
It has become routine to check the obtained flow for the property of causality, or asymmetry. Here, in Equation (52), the first term on the right-hand side comes from the deterministic part of the system, which has been checked before. For the second term, if b_11, b_12, and hence g_11 = ∑_k b_1k b_1k, have no dependence on x_2, then the integration with respect to x_2 can be taken inside, with ρ/ρ_1 = ρ(x_2|x_1) integrating to 1. The remaining part is in a divergence form, which, by the assumption of compact support, gives a zero contribution from the stochastic perturbation. We therefore have the following theorem:

Theorem 5.2 If, in the stochastic system (47), the evolution of x_1 is independent of x_2, then T_{2→1} = 0.
The above argument actually has further implications. Suppose B = (b_ij) is independent of x, i.e., the noise amplitudes are uncorrelated with the state variables. This model is indeed of interest: in the real world a large portion of noise is additive; in other words, b_ij, and hence g_ij, are constant more often than not. In this case, no matter what the vector field F is, by the above argument the resulting information flows within the system involve no contribution from the stochastic perturbation. That is to say,

Theorem 5.3 Within a stochastic system, if the noise is additive, then the information flows are the same in form as those of the corresponding deterministic system.

This theorem shows that, if only information flows are considered, a stochastic system with additive noise functions just like a deterministic one. Of course, the resemblance is limited to the form of the formula; the marginal density ρ_1 in Equation (52) already takes the effect of stochasticity into account, as can be seen from the integrated Fokker-Planck Equation (49). A more appropriate statement might be that, in this case, stochasticity is disguised within the formula of information flow.

Applications
Since its establishment, the formalism of information flow has been applied to a variety of dynamical system problems. In the following we give a brief description of these applications.

Baker Transformation
The baker transformation, a prototype of an area-conserving chaotic map, is one of the most studied discrete dynamical systems. Topologically it is conjugate to another well-studied system, the horseshoe map, and it has been used to model diffusion processes in the real physical world.
The baker transformation mimics the kneading of dough: first the dough is compressed, then cut in half; the two halves are stacked on one another, compressed, and so forth. Formally, it is defined as a mapping on the unit square Ω = [0, 1] × [0, 1], Φ : Ω → Ω,

Φ(x_1, x_2) = (2x_1, x_2/2),            0 ≤ x_1 ≤ 1/2,
Φ(x_1, x_2) = (2x_1 − 1, (x_2 + 1)/2),  1/2 < x_1 ≤ 1,

with Jacobian J = 1. This is the area-conserving property, which, by Equation (28), yields ΔH = E log |J| = 0; that is to say, the entropy is also conserved. The nonvanishing Jacobian implies that the map is invertible, and the F-P operator P can thus be easily found (Equation (55)). First compute T_{2→1}, the information flow from x_2 to x_1. Let ρ_1 be the marginal density of x_1 at time step τ. Integrating Equation (55) with respect to x_2, one obtains the marginal density of x_1 at τ + 1:

(Pρ)_1(x_1) = (1/2) [ρ_1(x_1/2) + ρ_1((x_1 + 1)/2)].

One may also compute the marginal entropy H_1(τ + 1), the entropy functional of (Pρ)_1; however, this is not necessary here, as will soon become clear. If, on the other hand, x_2 is frozen as a parameter, the transformation (53) reduces to a dyadic mapping in the stretching direction, x_1 ↦ 2x_1 (mod 1). Two observations follow: (1) the resulting (P_{\2}ρ)_1 is exactly the same as Equation (56), i.e., (P_{\2}ρ)_1 is equal to (Pρ)_1.
(2) The resulting (P_{\2}ρ)_1 has no dependence on the parameter x_2. The latter helps to simplify the computation of H_{1\2}(τ + 1) in Equation (22): the integration with respect to x_2 can be taken inside, giving ∫ ρ(x_2|x_1) dx_2 = 1, so H_{1\2}(τ + 1) is precisely the entropy functional of (P_{\2}ρ)_1. But (P_{\2}ρ)_1 = (Pρ)_1 by observation (1). Thus H_1(τ + 1) = H_{1\2}(τ + 1), leading to a flow/transfer

T_{2→1} = 0.

The information flow in the opposite direction is different. As above, first compute the marginal density (Pρ)_2, and from it the marginal entropy increase ΔH_2 of x_2, which reduces, after some algebraic manipulation, to the sum of two integrals, denoted I and II. To compute H_{2\1}, freeze x_1. The transformation is then invertible, and the Jacobian J_2 is equal to the constant 1/2; by Theorem 3.2, ΔH_{2\1} = E log |J_2| = −log 2. So T_{1→2} = ΔH_2 − ΔH_{2\1}, which Equation (64) expresses in terms of I and II. In the expressions for I and II, since both ρ and the terms within the brackets are nonnegative, I + II ≥ 0. Furthermore, the two brackets cannot vanish simultaneously, hence I + II > 0. By Equation (64), T_{1→2} is strictly positive; in other words, there is always information flowing from x_1 to x_2. To summarize, the baker transformation transfers information asymmetrically between the two directions. As the baker stretches the dough and folds it back on top, information flows continuously from the stretching direction x_1 to the folding direction x_2 (T_{1→2} > 0), while no transfer occurs in the opposite direction (T_{2→1} = 0). These results are schematically illustrated in Figure 2; they are in agreement with what one would observe in daily life, as described at the beginning of this review.
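The two branches of the transformation and its inverse are easy to code up; the following sketch (ours, for illustration) checks the invertibility used above. Each branch stretches by 2 in x_1 and compresses by 1/2 in x_2, so |J| = 1, the area-conserving property:

```python
def baker(x1, x2):
    """One kneading step: stretch in x_1, cut, and stack in x_2."""
    if x1 <= 0.5:
        return 2 * x1, x2 / 2
    return 2 * x1 - 1, (x2 + 1) / 2

def baker_inv(y1, y2):
    """Inverse map: which half y_2 lies in tells which branch was taken."""
    if y2 <= 0.5:
        return y1 / 2, 2 * y2
    return (y1 + 1) / 2, 2 * y2 - 1

# round trip: baker_inv(baker(p)) recovers p for points in the unit square
points = [(0.3, 0.7), (0.8, 0.2), (0.5, 0.5)]
roundtrips = [baker_inv(*baker(x1, x2)) for (x1, x2) in points]
```

The round trip returning every point confirms the nonvanishing Jacobian claimed above.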

Hénon Map
The Hénon map is another much-studied discrete dynamical system exhibiting chaotic behavior. Introduced by Michel Hénon as a simplified Poincaré section of the Lorenz system, it is the mapping

Φ(x_1, x_2) = (1 + x_2 − a x_1², b x_1),

with a > 0, b > 0. When a = 1.4 and b = 0.3, the map is termed "canonical"; for these values, an initial point will either diverge to infinity or approach an invariant set known as the Hénon strange attractor. Shown in Figure 3 is the attractor.
Like the baker transformation, the Hénon map is invertible, with inverse

Φ⁻¹(x_1, x_2) = (x_2/b, x_1 − 1 + (a/b²) x_2²).

The F-P operator thus can be easily found from Equation (18). In the following, we compute the flows/transfers between x_1 and x_2. By Equation (23), we need to find the marginal density of x_1 at step τ + 1 with and without the effect of x_2, i.e., (Pρ)_1 and (P_{\2}ρ)_1. With the F-P operator obtained above, (Pρ)_1 follows upon integration over x_2. If a were zero, this integral would equal ρ_2(x_1 − 1); note that it is the marginal density of x_2, but with argument x_1 − 1. Here a > 0, so the integration is taken along a parabolic curve rather than a straight line. Still, the final result is related to the marginal density of x_2, and we may as well write it ρ̃_2(x_1). Again, notice that the argument is x_1.
To compute (P_{\2}ρ)_1, let y = Φ(x), following our convention to distinguish variables at different steps. Modify the system so that x_2 is now a parameter. As before, we need to find the counterimage of (−∞, y_1] under the transformation with x_2 frozen. Denote by ρ̄_1(x_1) the average of ρ_1(−x_1) and ρ_1(x_1), an even function of x_1. Then (P_{\2}ρ)_1 is simply expressed in terms of ρ̄_1. Note that the parameter x_2 does not appear in the arguments. Furthermore, the Jacobian of the frozen transformation is J_1 = ∂Φ_1/∂x_1 = −2a x_1.

Substituting all the above into Equation (23) gives T_{2→1}. Taking the integration with respect to x_2 inside the integral is legitimate, since all the terms except the conditional density are independent of x_2. With the fact ∫_R ρ(x_2|x_1) dx_2 = 1, and with H̃ and H̄ denoting the entropy functionals of ρ̃ and ρ̄, respectively, T_{2→1} is obtained as a difference of these entropy functionals. Next, consider T_{1→2}, the flow from the quadratic component to the linear component. As is common practice, one might start off by computing (Pρ)_2 and (P_{\1}ρ)_2. In this case, however, things are much simplified. Observe that, for the modified system with x_1 frozen as a parameter, the Jacobian of the transformation is J_2 = ∂Φ_2/∂x_2 = 0. So, by Equation (24) together with Equation (67), we arrive at an information flow from x_1 to x_2 in the amount of

T_{1→2} = H_1(τ) + log b.

That is to say, the flow from x_1 to x_2 has nothing to do with x_2; it is equal to the marginal entropy of x_1, plus a correction term due to the factor b.
The simple result of Equation (71) is remarkable; in particular, if b = 1, the information flow from x_1 to x_2 is just the entropy of x_1. This is precisely what one would expect of the mapping component Φ_2(x_1, x_2) = b x_1 in Equation (65). While the information flow here is interesting per se, it also serves as an excellent example for the verification of our formalism.
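As an aside, the canonical map and the inverse quoted above are easily verified numerically; the sketch below (ours) iterates the map onto the attractor and checks that Φ⁻¹ undoes Φ:

```python
def henon(x1, x2, a=1.4, b=0.3):
    # Phi_1 = 1 + x_2 - a*x_1^2, Phi_2 = b*x_1
    return 1 + x2 - a * x1 ** 2, b * x1

def henon_inv(y1, y2, a=1.4, b=0.3):
    # invert: x_1 = y_2/b, then x_2 = y_1 - 1 + a*x_1^2
    return y2 / b, y1 - 1 + a * (y2 / b) ** 2

# iterate the canonical map; the orbit settles onto the strange attractor
p = (0.1, 0.1)
for _ in range(1000):
    p = henon(*p)
```

Starting inside the basin of attraction, the orbit stays bounded, consistent with convergence toward the Hénon attractor.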

Truncated Burgers-Hopf System
In this section, we examine a more complicated system, the truncated Burgers-Hopf system (TBS hereafter). Originally introduced by Majda and Timofeyev [55] as a prototype of climate modeling, the TBS results from a Galerkin truncation, to the n-th order, of the Fourier expansion of the inviscid Burgers equation, ∂u/∂t + u ∂u/∂x = 0. Liang and Kleeman [51] examined such a system with two Fourier modes retained, which is governed by four ordinary differential equations, Equations (73)-(76). Despite its simplicity, the system is intrinsically chaotic, with a strange attractor lying within a bounded region. Shown in Figure 4 are its projections onto the x_1-x_2-x_4 and x_1-x_3-x_4 subspaces, respectively. Finding the information flows within the TBS turns out to be a computational challenge, since the Liouville equation corresponding to Equations (73)-(76) is a four-dimensional partial differential equation. In [51], Liang and Kleeman adopt a strategy of ensemble prediction to reduce the computation to an acceptable level, summarized in the following steps:
1. Initialize the joint density of (x_1, x_2, x_3, x_4) with some distribution ρ_0; make random draws according to ρ_0 to form an ensemble. The ensemble should be large enough to resolve the sample space adequately.
2. Discretize the sample space into "bins."
3. Do ensemble prediction for the system (73)-(76).
4. At each step, estimate the probability density function ρ by counting the bins.
5. Plug the estimated ρ back into Equation (39) to compute the rates of information flow at that step.
For ρ_0, Liang and Kleeman choose a Gaussian N(µ, Σ) and generate an ensemble of 2,560,000 members, each steered independently under the system (73)-(76). For details about the sample-space discretization, probability estimation, etc., the reader is referred to [51]. Shown in the following are only the major results.
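Steps 1-4 above can be sketched in a few lines. Since Equations (73)-(76) are not reproduced here, the snippet below (an illustration of ours, not code from [51]) uses a stand-in 2D linear vector field purely to show the draw/predict/bin-count machinery:

```python
import numpy as np

rng = np.random.default_rng(42)

# Step 1: initialize an ensemble from rho_0 (a Gaussian, as in [51])
N = 100_000
x = rng.normal(loc=1.0, scale=1.0, size=(N, 2))

# Stand-in vector field; the actual TBS right-hand side (73)-(76) is
# quadratic and four-dimensional.
def F(x):
    return np.column_stack([-x[:, 0] + 0.5 * x[:, 1], -0.5 * x[:, 1]])

# Step 3: ensemble prediction (Euler time stepping of every member)
dt, nsteps = 0.01, 100
for _ in range(nsteps):
    x = x + dt * F(x)

# Steps 2 and 4: discretize the sample space into bins and estimate the
# density by counting
hist, edges = np.histogramdd(x, bins=30, density=True)

# e.g., a histogram estimate of the joint entropy H = -E[log rho]
widths = [np.diff(e) for e in edges]
cell = np.outer(widths[0], widths[1])   # area of each bin
mask = hist > 0
H = -np.sum(hist[mask] * cell[mask] * np.log(hist[mask]))
```

Step 5 would evaluate Equation (39) with the estimated ρ; that requires density derivatives as well and is omitted from this sketch.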
Between the four components of the TBS system, there are pairwise 12 information flows, T_{i→j}, i, j = 1, ..., 4, i ≠ j. To compute these flows, Liang and Kleeman [51] tried different parameters µ and σ_k² (k = 1, 2, 3, 4), but found the final results to be the same after t = 2, when the trajectories have been attracted into the invariant set. It therefore suffices to show the result of just one experiment: µ_k = 9 and σ_k² = 9, k = 1, 2, 3, 4. Plotted in Figure 5 are the 12 flow rates. First observe that T_{3→4} = T_{4→3} = 0. This is easy to understand: both F_3 and F_4 in Equations (75) and (76) depend on neither x_3 nor x_4, implying a zero flow in either direction between the pair (x_3, x_4), by the property of causality. What makes the result remarkable is that, besides T_{3→4} and T_{4→3}, essentially all the flows are negligible, except T_{3→2}, although obvious oscillations are found in T_{2→1}, T_{3→1}, T_{1→2}, T_{4→1}, T_{2→3}, and T_{2→4}. The only significant flow, T_{3→2}, means that, within the TBS system, it is the fine-scale component that causes an increase of uncertainty in the coarse-scale component, not conversely. Originally the TBS was introduced by Majda and Timofeyev [55] to test their stochastic closure scheme, which models the unresolved high Fourier modes as noise. Since additive noises are independent of the state variables, information can only be transferred from the former to the latter. The transfer asymmetry observed here is thus reflected in the scheme.

Langevin Equation
Most of the applications of information flow/transfer are expected to involve stochastic systems. Here we illustrate this with a simple 2D system, which was studied in [54] for the validation of Equation (52):

dx = Ax dt + B dw,

where A = (a_ij) and B = (b_ij) are 2 × 2 constant matrices. This is the linear version of Equation (47). Linear systems are special in that, if initialized with a normally distributed ensemble, the distribution of the variables remains Gaussian ever after (e.g., [56]). This greatly simplifies the computation which, as we have seen in the previous subsection, is often a formidable task.
For such a Gaussian process, the joint density of x is determined by the mean vector µ and the covariance matrix Σ (note that BB^T is the matrix (g_ij) we have seen in Section 5). By Theorem 5.1, the rates of information flow can thus be computed accurately. Several sets of parameters were chosen in [54] to study the model behavior. Here we just look at one such choice of A and B (see [54] for the values), with a sample path of x starting from (1, 2). The computed rates of information flow, T_{2→1} and T_{1→2}, are plotted in Figure 7a and b. As time moves on, T_{2→1} increases monotonically and eventually approaches a constant; T_{1→2}, on the other hand, vanishes throughout. While this is within one's expectations, since dx_2 = −0.5 x_2 dt + dw_1 + dw_2 has no dependence on x_1 and hence there should be no transfer of information from x_1 to x_2, it is interesting to observe that, in contrast, the typical paths of x_1 and x_2 can be highly correlated, as shown in Figure 6c. In other words, for two highly correlated time series, say x_1(t) and x_2(t), one series may have nothing to do with the other. This is a good example illustrating how information flow extends the classical notion of correlation analysis, and how it may potentially be utilized to identify the causal relation between complex dynamical events.
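The point that one-way coupling can coexist with high correlation is easy to reproduce with an Euler-Maruyama sketch. The parameters below are illustrative, not those of [54]: x_2 evolves with no dependence on x_1, so by Theorem 5.2 there is no flow from x_1 to x_2, yet the two sample paths come out strongly correlated:

```python
import numpy as np

rng = np.random.default_rng(0)

# dx1 = (-x1 + 0.5*x2) dt        (deterministic, driven by x2)
# dx2 = -0.5*x2 dt + dw          (independent of x1)
dt, n = 0.01, 50_000
dw = np.sqrt(dt) * rng.standard_normal(n)
x1 = np.empty(n)
x2 = np.empty(n)
u, v = 0.0, 0.0
for k in range(n):
    u += (-u + 0.5 * v) * dt
    v += -0.5 * v * dt + dw[k]
    x1[k], x2[k] = u, v

corr = np.corrcoef(x1, x2)[0, 1]   # high, despite the one-way coupling
```

Here x_1 is essentially a low-pass filter of x_2, which is why the correlation is strong even though the causal influence runs in one direction only.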

Summary
The past decades have seen a surge of interest in information flow (or information transfer, as it is sometimes called) in different fields of scientific research, mostly in some empirical or half-empirical form. We have shown that, given a dynamical system, deterministic or stochastic, this important notion can actually be formulated on a rigorous footing, with flow measures explicitly derived. The general results are summarized in the theorems of Sections 3, 4 and 5. For two-dimensional systems, the result is fairly tight. In fact, writing such a system as

dx_1 = F_1(x_1, x_2, t) dt + b_11 dw_1 + b_12 dw_2,
dx_2 = F_2(x_1, x_2, t) dt + b_21 dw_1 + b_22 dw_2,

where (w_1, w_2) are standard Wiener processes, we have a rate of information flowing from x_2 to x_1,

T_{2→1} = −E[F_1 ∂(log ρ_1)/∂x_1] − E[∂F_1/∂x_1] + (1/2) E[(1/ρ_1) ∂²(g_11 ρ_1)/∂x_1²].

This is an alternative expression of that in Theorem 5.1; T_{1→2} can be obtained by switching the subscripts 1 and 2. In the formula, g_11 = ∑_k b_1k², ρ_1 is the marginal density of x_1, and E stands for the mathematical expectation with respect to ρ, i.e., the joint probability density. On the right-hand side, the third term is contributed by the Brownian motion; if the system is deterministic, this term vanishes. Of the remaining two terms, the first is the tendency of H_1, namely the marginal entropy of x_1; the second can be interpreted as the rate of H_1 increase due to x_1 itself, thanks to the law of entropy production (12) [49], which we restate here: for an n-dimensional system dx/dt = F(x, t), its joint entropy H evolves as

dH/dt = E(∇·F).

This interpretation lies at the core of all the theories along this line. It illustrates that the marginal entropy increase of a component, say x_1, is due to two different mechanisms: the information transferred from some other component, say x_2, and the marginal entropy increase associated with the system without taking x_2 into account. On this ground, the formalism is established with respect to discrete mappings, continuous flows, and stochastic systems, respectively.
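For a linear system dx = Ax dt + B dw with additive noise and a Gaussian density centered at zero, the Brownian term above integrates to zero (g_11 is constant and ∫ ∂²ρ_1/∂x_1² dx_1 = 0), and ∂(log ρ_1)/∂x_1 = −x_1/σ_11, so the formula collapses to T_{2→1} = a_12 σ_12/σ_11. The Monte Carlo sketch below (our check, with an assumed covariance snapshot, not parameters from the literature) confirms this reduction numerically:

```python
import numpy as np

rng = np.random.default_rng(1)

# assumed Gaussian snapshot rho ~ N(0, Sigma) and linear drift F = A x
Sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])
A = np.array([[-1.0, 0.6],
              [0.3, -0.5]])

x = rng.multivariate_normal([0.0, 0.0], Sigma, size=400_000)
F1 = x @ A[0]                         # F_1 = a11*x1 + a12*x2

# -E[F_1 d(log rho_1)/dx_1] - E[dF_1/dx_1]; the additive-noise term is zero
dlog_rho1 = -x[:, 0] / Sigma[0, 0]
T21_mc = -np.mean(F1 * dlog_rho1) - A[0, 0]
T21_exact = A[0, 1] * Sigma[0, 1] / Sigma[0, 0]   # a12 * sigma12 / sigma11
```

The sampled estimate agrees with the closed-form value to within Monte Carlo error; note also that T_{2→1} is proportional to a_12, so it vanishes exactly when F_1 is independent of x_2, as the causality property requires.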
Correspondingly, the resulting measures are summarized in Equations (24), (39) and (52). These measures possess several interesting properties, some of which one may expect from daily-life experience. The first is flow/transfer asymmetry, which has been set as the basic requirement for the identification of causal relations between dynamical events: the information flowing from one event to another, denoted respectively x_2 and x_1, may yield no clue about its counterpart in the opposite direction, i.e., the flow/transfer from x_1 to x_2. The second says that, if the evolution of x_1 is independent of x_2, then the flow from x_2 to x_1 is zero. The third concerns the role of stochasticity: if the stochastic perturbation to the receiving component does not depend on the giving component, the flow measure has the same form as that for the corresponding deterministic system. As a direct corollary, when the noise is additive, the stochastic system functions, in terms of information flow, in a deterministic manner.
The formalism has been applied to benchmark dynamical systems. In the context of the baker transformation, it is found that there is always information flowing from the stretching direction to the folding direction, while no flow exists in the opposite direction. This is in agreement with what one would observe in kneading dough. Application to the Hénon map also yields a result just as expected on physical grounds. In a more complex case, the formalism has been applied to the study of the scale-scale interaction and information flow between the first two modes of the chaotic truncated Burgers-Hopf system. Remarkably, all twelve flows are essentially zero, save for one strong flow from the high-frequency mode to the low-frequency mode. This demonstrates that the route of information flow within a dynamical system, albeit seemingly complex, could be simple. In another application, we test how one may control the information flow by tuning the coefficients in a two-dimensional Langevin system. A remarkable observation is that, for two highly correlated time series, there could be no transfer from one certain series, say x_2, to the other (x_1). That is to say, the evolution of x_1 may have nothing to do with x_2, even though x_1 and x_2 are highly correlated. Information flow/transfer analysis thus extends the traditional notions of correlation analysis and mutual information analysis by providing a quantitative measure of causality between dynamical events, a quantification based firmly on a rigorous mathematical and physical footing.
The above applications are mostly with idealized systems; this is, to a large extent, intended for the validation of the obtained flow measures. Next, we would extend the results to more complex systems, and develop important applications to realistic problems in different disciplines, as envisioned at the beginning of this paper. The scale-scale information flow within the Burgers-Hopf system in §6.3, for example, may be extended to the flow between scale windows. By a scale window we mean, loosely, a subspace with a range of scales included (cf. [57]). In atmosphere-ocean science, important phenomena are usually defined on scale windows, rather than on individual scales (e.g., [58]). As discussed in [53], the dynamical core of the atmosphere and ocean general circulation models is essentially a quadratically nonlinear system, with the linear and nonlinear operators possessing certain symmetries resulting from conservation properties (such as energy conservation). Majda and Harlim [53] argue that the state space may be decomposed into a direct sum of scale windows which inherit evolution properties from the quadratic system, and then information flow/transfer may be investigated between these windows. Intriguing as this conceptual model might be, some theoretical difficulties remain. For example, the governing equation for a window may be problem-specific; there may be no governing equations as simply written as Equation (3) for individual components. Hence one may need to seek new ways to derive the information flow formula. Nonetheless, central to the problem is still the aforementioned classification of the mechanisms that govern the marginal entropy evolution; we expect new breakthroughs along this line of development.
The formalism we have presented thus far is with respect to Shannon entropy, or absolute entropy as one may choose to refer to it. In many cases, such as the El Niño case where predictability is concerned, this may need to be modified, since the predictability of a dynamical system is measured by relative entropy. Relative entropy, also called the Kullback-Leibler divergence, is defined as

D(ρ ∥ q) = E_ρ[log(ρ/q)] = ∫ ρ log(ρ/q) dx,

i.e., the expectation of the logarithmic difference between a probability density ρ and a reference density q, where the expectation is taken with respect to ρ. Roughly, it may be interpreted as the "distance" between ρ and q, though it does not satisfy all the axioms for a distance functional.
Therefore, for a system, if we let the reference density be the initial distribution, the relative entropy at a time t tells how much additional information has been added (rather than how much information the system has). This provides a natural choice for the measure of the utility of a prediction, as pointed out by Kleeman (2002) [59]. Kleeman also argues in favor of relative entropy because of its appealing properties, such as nonnegativity and invariance under nonlinear transformations [60]. Besides, in the context of a Markov chain, it has been proved to decrease monotonically with time, a property usually referred to as the generalized second law of thermodynamics (e.g., [60,61]). The concept of relative entropy is now a well-accepted measure of predictability (e.g., [59,62]). When predictability problems (such as those in atmosphere-ocean science and financial economics mentioned in the introduction) are dealt with, it is necessary to extend the current formalism to one with respect to the relative entropy functional. For all the dynamical system settings in this review, the extension should be straightforward.
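For discrete distributions the definition reads D(p ∥ q) = ∑_i p_i log(p_i/q_i); a few lines suffice to illustrate the nonnegativity just mentioned, and the asymmetry that keeps D from being a true distance:

```python
import math

def kl_divergence(p, q):
    # D(p || q) = sum_i p_i * log(p_i / q_i), expectation taken under p
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]
d_pq = kl_divergence(p, q)   # > 0
d_qp = kl_divergence(q, p)   # > 0, but different: D is not symmetric
```

Both directions are positive, D(p ∥ p) = 0, and D(p ∥ q) ≠ D(q ∥ p) in general, consistent with the "distance" caveat above.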