Probabilistic Inference for Dynamical Systems

A general framework for inference in dynamical systems is described, based on the language of Bayesian probability theory and making use of the maximum entropy principle. Taking the concept of a path as fundamental, the continuity equation and Cauchy’s equation for fluid dynamics arise naturally, while the specific information about the system can be included using the maximum caliber (or maximum path entropy) principle.


Introduction
Dynamical system models are widely used to describe complex physical systems (e.g., the weather), as well as social and economic systems (e.g., the stock market). These systems are usually subject to high levels of uncertainty, in their initial conditions and/or in their interactions with their environment. From the point of view of constructing predictive models, finding the optimal description of the time-dependent state of such a system given external constraints is a challenge with promising applications in both fundamental and applied science. This is an inference problem in which we must choose the most likely solution out of the (possibly infinite) alternatives compatible with the information we have about the system. A unified framework for performing inference on dynamical systems may therefore open new possibilities in several areas, including non-equilibrium statistical mechanics and thermodynamics, hydrodynamics (including magnetohydrodynamics), and classical mechanics under stochastic forces, among other possible fields of application. This vision of inference applied to dynamical systems is not new: the clearest exposition of the ideas that we aim to extend here was given by E. T. Jaynes [1,2], followed by several others [3-5].
In this work, we present some elements for a general framework of inference in dynamical systems, written in the language of Bayesian probability. The first is the master equation, which is shown to be a direct consequence of the laws of probability. Next, we develop the treatment of inference over paths, from which we obtain the continuity equation and Cauchy's equation for fluid dynamics, and discuss their range of applicability. Finally, we close with some concluding remarks.

Why Bayesian Inference?
Unlike the standard ("frequentist") interpretation of probability theory, in which probabilities are frequencies of occurrence of repeatable events, Bayesian probability can be understood as the natural extension of classical logic in the case of uncertainty [6,7]. Bayesian probability deals with unknown quantities rather than identical repetitions/copies of an event or system, and is able to include prior information when needed.
The conceptual framework of Bayesian probability provides an elegant language to describe dynamical systems under uncertainty. A straightforward advantage of the Bayesian framework is that one does not need to assume an ensemble of "many identical copies of a system". A single system with uncertain initial conditions and/or forces is sufficient to construct a theory. The probability of finding this particular system in a given state at a given time would not be a frequency, but rather a degree of plausibility conditioned on the known information. In fact, we can lift even the common assumption of "many degrees of freedom". The motion of a single particle could be used to construct an internally consistent theory, where the equations of motion for time-dependent probability densities are similar to the ones of hydrodynamics. We will describe in detail both features of the Bayesian formulation in the following sections.
A brief overview of the Bayesian notation used in this work follows. We will take P(Q|I) as the probability of a particular proposition Q being true given knowledge I. On the other hand, $\langle G \rangle_I$ will denote the expected value of an arbitrary quantity G given knowledge I, and will be given by

$$\langle G \rangle_I = \sum_u G(u)\, P(u|I), \tag{1}$$

where u represents one of the possible states of the system.

Dynamical Evolution of Probabilities
Consider a discrete-time system that can transit between n possible states $\{x_1, \ldots, x_n\}$ at different times. If we denote by X(t) the state of the system at time t, the joint probability of being in state a at time t and in state b at time t + ∆t is given by

$$P(X(t) = a, X(t+\Delta t) = b\,|\,I) = P(X(t+\Delta t) = b\,|\,X(t) = a, I)\, P(X(t) = a\,|\,I). \tag{2}$$

By summing over a in Equation (2) and taking b = x, we have

$$P(X(t+\Delta t) = x\,|\,I) = \sum_a P(X(t+\Delta t) = x\,|\,X(t) = a, I)\, P(X(t) = a\,|\,I), \tag{3}$$

while by summing over b and taking a = x, we have

$$P(X(t) = x\,|\,I) = P(X(t) = x\,|\,I) \sum_b P(X(t+\Delta t) = b\,|\,X(t) = x, I). \tag{4}$$
From these two identities (Equations (3) and (4)), we can construct the discrete-time difference,

$$P(X(t+\Delta t) = x\,|\,I) - P(X(t) = x\,|\,I) = \sum_{x'} \Big[ P(X(t+\Delta t) = x\,|\,X(t) = x', I)\, P(X(t) = x'\,|\,I) - P(X(t+\Delta t) = x'\,|\,X(t) = x, I)\, P(X(t) = x\,|\,I) \Big]. \tag{5}$$

This equation is a discrete-time form of the celebrated "master equation" [8-10], involving time-dependent transition probabilities. The case where $P(X(t+\Delta t) = x\,|\,X(t) = x', I)$ is independent of t (i.e., a function only of the initial state x' and final state x) is more commonly known as the master equation in the literature. In the continuous-time limit when ∆t → 0, we can write this equation as

$$\frac{\partial \rho(x; t)}{\partial t} = \sum_{x'} \Big[ W_t(x|x')\, \rho(x'; t) - W_t(x'|x)\, \rho(x; t) \Big], \tag{6}$$

where the instantaneous density ρ is given by ρ(a; t) := P(X(t) = a|I) and we have defined the (continuous-time) transition rate $W_t$ as

$$W_t(x|x') := \lim_{\Delta t \to 0} \frac{1}{\Delta t}\, P(X(t+\Delta t) = x\,|\,X(t) = x', I), \qquad x \neq x'. \tag{7}$$

Only the rates with x ≠ x' are needed, because the x' = x terms cancel in the sums of Equations (5) and (6).

In this sense, the master equation as written in Equation (5) is a direct consequence of the laws of probability, and its validity is universal whenever we have transitions between states. Please note that there is no requirement for the system to be Markovian: regardless of the form of the joint probability over the complete history of the system, the sum rule always allows us to marginalize over all other times and obtain the two-time joint probability

$$P(X(t) = x', X(t+\Delta t) = x\,|\,I), \tag{8}$$

which yields the transition probability in Equation (7) as:

$$P(X(t+\Delta t) = x\,|\,X(t) = x', I) = \frac{P(X(t) = x', X(t+\Delta t) = x\,|\,I)}{P(X(t) = x'\,|\,I)}. \tag{9}$$

It is for this probability that Equation (5) holds. In general, the transition rate $W_t$ will be time-dependent, because it captures the dependence on the previous history of the system up to t. It follows that all probabilities of time-dependent quantities must evolve in time according to Equation (5) (or Equation (6) in the case of continuous time) for some (possibly time-dependent) transition probability (rate).

This continuous-time master equation (Equation (6)) is more general than the continuity equation, as it includes the case where some quantities can be created or destroyed during a process. However, time evolution under global and local conservation laws is a fundamental case that can also be readily obtained from the Bayesian formalism, as we will see in the following sections.
As is well-known, the continuous-time master equation can be approximated in the limit of infinitesimally small transitions to obtain the Fokker-Planck equation [9,11], but in the next section we start from the existence of continuous paths as a postulate.
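The statement that the discrete-time master equation follows from the sum and product rules alone can be checked numerically. The following is a minimal sketch, not part of the original derivation: the four-state system, the randomly drawn transition matrices, and all function names are assumptions made for illustration. A different matrix is drawn at every step to mimic time-dependent transition probabilities.

```python
import random

def random_transition_matrix(n, rng):
    """Return T with T[b][a] = P(X(t+dt)=b | X(t)=a); every column sums to 1."""
    cols = []
    for _ in range(n):
        w = [rng.random() for _ in range(n)]
        s = sum(w)
        cols.append([wi / s for wi in w])
    # cols[a][b] holds P(b | a); transpose so that T[b][a] = P(b | a)
    return [[cols[a][b] for a in range(n)] for b in range(n)]

def step(p, T):
    """Marginalization (sum rule): p_next(x) = sum_a T[x][a] * p(a)."""
    n = len(p)
    return [sum(T[x][a] * p[a] for a in range(n)) for x in range(n)]

def gain_minus_loss(p, T):
    """Right-hand side of the discrete master equation: gain minus loss."""
    n = len(p)
    return [sum(T[x][a] * p[a] - T[a][x] * p[x] for a in range(n))
            for x in range(n)]

rng = random.Random(0)          # fixed seed, purely for reproducibility
n = 4                           # hypothetical number of states
p = [0.7, 0.1, 0.1, 0.1]        # hypothetical initial distribution
max_residual = 0.0
for _ in range(5):
    T = random_transition_matrix(n, rng)   # new matrix each step
    p_next = step(p, T)
    rhs = gain_minus_loss(p, T)
    residual = max(abs((p_next[x] - p[x]) - rhs[x]) for x in range(n))
    max_residual = max(max_residual, residual)
    p = p_next
total = sum(p)                  # normalization should be preserved
```

In this sketch the residual between the probability increment and the gain-minus-loss sum stays at round-off level for every step, and the total probability remains normalized, without any assumption on how the matrices were generated.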

Fluid Theories in a Bayesian Formulation
We will now consider a dynamical system that follows a path X(t) ∈ 𝕏 in time, where 𝕏 denotes the space of all paths consistent with given boundary conditions. The path X(t) is not completely known, and we only have access to partial information denoted by I.
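In this setting, Bayesian theory defines a functional P[X|I] that is the probability density of the path X(t) being the "true path" under the known information. For any arbitrary functional F[X] of the path, we can then write its expected value as a path integral:

$$\langle F \rangle_I = \int \mathcal{D}X\, P[X|I]\, F[X]. \tag{10}$$

On the other hand, the expected value of any instantaneous quantity A(x; t) is given by

$$\langle A \rangle_{t,I} = \int dx\, \rho(x; t)\, A(x; t), \tag{11}$$

where ρ(x; t) := P(X(t) = x|I) is the instantaneous probability density at time t. By using a quantity A(x'; t) = δ(x' − x), we see that the probability density itself has a path integral representation:

$$\rho(x; t) = \big\langle \delta(X(t) - x) \big\rangle_I = \int \mathcal{D}X\, P[X|I]\, \delta(X(t) - x). \tag{12}$$

By differentiating Equation (12) with respect to time, we obtain the continuity equation for the instantaneous probability density [12]:

$$\frac{\partial \rho}{\partial t} + \nabla \cdot (\rho\, v) = 0, \tag{13}$$

where v(x; t) is the velocity field that describes the flow of probability, given by

$$v(x; t) = \big\langle \dot{X}(t) \big\rangle_{X(t) = x,\, I} = \frac{1}{\rho(x; t)} \big\langle \delta(X(t) - x)\, \dot{X}(t) \big\rangle_I. \tag{14}$$

This equation describes the global and local conservation of the probability of finding the system in a given state x at a time t, and is guaranteed to hold for any system moving continuously in time through paths X(t). In the same way, it is possible to derive a dynamical equation for the velocity field v(x; t) itself, by differentiating it with respect to time. With summation over repeated indices implied, we have:

$$\begin{aligned}
\partial_t (\rho\, v_\mu) &= \partial_t \big\langle \delta(X(t) - x)\, \dot{X}_\mu(t) \big\rangle_I \\
&= \big\langle \delta(X(t) - x)\, \ddot{X}_\mu(t) \big\rangle_I - \partial_\nu \big\langle \delta(X(t) - x)\, \dot{X}_\mu(t)\, \dot{X}_\nu(t) \big\rangle_I \\
&= \rho\, a_\mu - \partial_\nu \big[ \rho\, (C_{\mu\nu} + v_\mu v_\nu) \big],
\end{aligned} \tag{15}$$

where in the last line we have defined the acceleration field as

$$a(x; t) := \big\langle \ddot{X}(t) \big\rangle_{X(t) = x,\, I}$$

and the velocity covariance matrix

$$C_{\mu\nu}(x; t) := \big\langle \dot{X}_\mu(t)\, \dot{X}_\nu(t) \big\rangle_{X(t) = x,\, I} - v_\mu v_\nu.$$

By using the continuity equation (Equation (13)) to rewrite the left-hand side as

$$\partial_t (\rho\, v_\mu) = \rho\, \partial_t v_\mu - v_\mu\, \partial_\nu (\rho\, v_\nu),$$

we obtain, dividing both sides by ρ and using Equation (15), that

$$\partial_t v_\mu - v_\mu\, \partial_\nu v_\nu - v_\nu v_\mu\, \partial_\nu \ln \rho = a_\mu - \frac{1}{\rho}\, \partial_\nu (\rho\, C_{\mu\nu}) - \partial_\nu (v_\mu v_\nu) - v_\nu v_\mu\, \partial_\nu \ln \rho.$$

The term $v_\nu v_\mu\, \partial_\nu \ln \rho$ cancels, and we have:

$$\partial_t v_\mu - v_\mu\, \partial_\nu v_\nu = a_\mu - \frac{1}{\rho}\, \partial_\nu (\rho\, C_{\mu\nu}) - (\partial_\nu v_\nu)\, v_\mu - v_\nu\, \partial_\nu v_\mu,$$

from which we now cancel the term $(\partial_\nu v_\nu)\, v_\mu$, arriving at

$$\partial_t v_\mu = a_\mu - \frac{1}{\rho}\, \partial_\nu (\rho\, C_{\mu\nu}) - v_\nu\, \partial_\nu v_\mu.$$

Rearranging the derivatives of $v_\mu$ in the left-hand side, we have

$$\rho\, \frac{D v_\mu}{D t} := \rho\, (\partial_t + v_\nu\, \partial_\nu)\, v_\mu = \rho\, a_\mu + \partial_\nu \sigma_{\mu\nu}, \tag{23}$$

which is the Cauchy momentum equation with D/Dt the convective derivative and $\sigma_{\mu\nu} = -\rho\, C_{\mu\nu}$ the stress tensor. Equations (13) and (23) form a closed coupled system of equations for ρ(x; t) and v(x; t), needing as their only external input the velocity covariance matrix $C_{\mu\nu}$. These equations are then built-in features of inference over paths. In a Bayesian approach, they are valid for any system that moves continuously in time. The Cauchy momentum equation includes most notably the Navier-Stokes equation as a particular case [13].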
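As a sanity check of the continuity equation and the Cauchy momentum equation, one can take a case where everything is known in closed form. The sketch below assumes a one-dimensional free "particle" with paths X(t) = x0 + u t, where x0 and u are independent zero-mean Gaussians with standard deviations S0 and SU. This ensemble, the parameter values, and the function names are illustrative assumptions, not taken from the text. For this ensemble the acceleration field vanishes, and ρ, v, and C follow from standard Gaussian conditioning.

```python
import math

S0, SU = 1.0, 0.5   # assumed standard deviations of x0 and u

def var(t):
    """Variance of X(t) = x0 + u*t."""
    return S0**2 + (SU * t)**2

def rho(x, t):
    """Instantaneous density: Gaussian with variance var(t)."""
    s2 = var(t)
    return math.exp(-x * x / (2 * s2)) / math.sqrt(2 * math.pi * s2)

def vel(x, t):
    """Velocity field: conditional mean of u given X(t) = x."""
    return x * SU**2 * t / var(t)

def cov(x, t):
    """Velocity covariance: conditional variance of u given X(t) = x."""
    return (S0 * SU)**2 / var(t)

def d_dx(f, x, t, h=1e-4):
    """Central finite difference in x."""
    return (f(x + h, t) - f(x - h, t)) / (2 * h)

def d_dt(f, x, t, h=1e-4):
    """Central finite difference in t."""
    return (f(x, t + h) - f(x, t - h)) / (2 * h)

def continuity_residual(x, t):
    """d(rho)/dt + d(rho*v)/dx, expected to vanish."""
    flux = lambda y, s: rho(y, s) * vel(y, s)
    return d_dt(rho, x, t) + d_dx(flux, x, t)

def cauchy_residual(x, t):
    """Dv/Dt - (1/rho) d(sigma)/dx with sigma = -rho*C and a = 0."""
    stress = lambda y, s: -rho(y, s) * cov(y, s)
    lhs = d_dt(vel, x, t) + vel(x, t) * d_dx(vel, x, t)
    rhs = d_dx(stress, x, t) / rho(x, t)
    return lhs - rhs

points = [(-1.0, 0.5), (0.3, 1.0), (2.0, 2.0)]   # arbitrary (x, t) samples
max_cont = max(abs(continuity_residual(x, t)) for x, t in points)
max_cauchy = max(abs(cauchy_residual(x, t)) for x, t in points)
```

Both residuals are limited only by the finite-difference error, which illustrates that a single particle with uncertain initial conditions already obeys fluid-like equations with a nonzero "stress" coming purely from the spread of velocities.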

Including Particular Knowledge into Our Models
At this point, we have developed a generic framework where no particular details about a system have been included. Clearly all those details have to be contained in P[X|I ], or rather, in the covariance matrix C µν (x; t) which can be derived from it. The question remains about how to incorporate these details in the most unbiased manner. In principle, we could start from the null assumption of equiprobable paths, P[X|I 0 ] = constant, and add new information R later on, by updating our probability functional P[X|I 0 ] to a new P[X|I ], where I = (I 0 , R). There are essentially two equivalent methods to achieve this, and depending on the actual form of R, one of them may be more directly applicable than the other.
(1) Bayes' theorem: the posterior distribution $P(u|I_0, R)$ is given in terms of the prior $P(u|I_0)$ by

$$P(u|I_0, R) = \frac{P(u|I_0)\, P(R|u, I_0)}{P(R|I_0)}.$$

This method is most useful when R comprises statements about the states u (e.g., boundary conditions).

(2) Principle of maximum entropy: the posterior distribution p(u) is the one that maximizes

$$\mathcal{S}[p_0 \to p] = -\sum_u p(u) \ln \frac{p(u)}{p_0(u)},$$

where $p_0(u)$ is the prior distribution. This method is most useful when R consists of constraints on the final model p(u), usually expressed as fixed expected values.

In Reference [2], Jaynes assumes the continuity equation from the start and derives the flux J(x; t) = ρ · v(x; t) from symmetry considerations, the central limit theorem, and Bayes' theorem. In our classification, this corresponds to method (1). In Reference [5], Gull recovers Brownian motion by essentially performing discrete-time maximum caliber inference under constraints over location and particle speed, hence corresponding to an application of method (2).

The Maximum Caliber Principle
The function p(u) that is closest to our prior probability $p_0(u)$ and is consistent with the constraints R is the one that maximizes the relative entropy [14,15] among the set of functions p that are compatible with R. The negative of this relative entropy, known as the Kullback-Leibler divergence, is commonly used to measure the "informational distance" from $p_0$ to p. It is important to note that this is a rule of inference and not a physical principle, and therefore it is not bound by the meaning assigned to the states x, as long as we can write (Bayesian) probabilities over them.
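For the general case of m constraints of the form

$$\big\langle f_j(u) \big\rangle_I = F_j, \qquad j = 1, \ldots, m,$$

the maximum entropy solution starting from $P(u|I_0)$ is obtained through the use of m Lagrange multipliers (one for each constraint),

$$P(u|I) = \frac{1}{Z(\lambda)}\, P(u|I_0)\, \exp\Big( -\sum_{j=1}^{m} \lambda_j f_j(u) \Big),$$

where $Z(\lambda) = \sum_u P(u|I_0) \exp\big( -\sum_j \lambda_j f_j(u) \big)$ is the partition function. This is compatible with Bayesian updating, as this posterior distribution is proportional to the prior. The Lagrange multipliers are solutions of the constraint equations in terms of Z:

$$-\frac{\partial}{\partial \lambda_j} \ln Z(\lambda) = F_j.$$

In exactly the same way, the path (relative) entropy (sometimes known as the caliber) is defined as the path integral [1,16-22]:

$$\mathcal{S}[p_0 \to p] = -\int \mathcal{D}X\, p[X] \ln \frac{p[X]}{p_0[X]},$$

where $p_0[X] := P[X|I_0]$ is the prior path probability. The use of this generalization is justified based on the fact that we can write any path X(t) in terms of a complete orthonormal basis $\{B_i\}$, $X(t) = \sum_{i=0}^{N-1} \gamma_i B_i(t)$, and then there is a one-to-one correspondence between every path X(t) and its coordinates $(\gamma_0, \gamma_1, \ldots, \gamma_{N-1})$. Inference over paths X then becomes completely equivalent to inference over the coefficients γ, which form a system with N degrees of freedom.

In summary, for the general maximum caliber inference problem we have m constraints, written as

$$\big\langle f_j[X] \big\rangle_I = F_j, \qquad j = 1, \ldots, m,$$

from which the probability functional obtained is

$$P[X|I] = \frac{1}{Z(\lambda)}\, P[X|I_0]\, \exp\Big( -\sum_{j=1}^{m} \lambda_j f_j[X] \Big).$$

Any such maximum caliber solution can be cast in the "canonical" form, as

$$P[X|I] = \frac{1}{Z(\alpha)}\, P[X|I_0]\, \exp\Big( -\frac{1}{\alpha}\, \mathcal{A}[X] \Big),$$

where $\mathcal{A}$ is a functional, analogous to the Hamilton action of a classical system, and α > 0 is a constant with the same physical units as $\mathcal{A}$. By simple inspection of this canonical form, it is straightforward to see that, for a flat prior $P[X|I_0]$, the most probable path is the one with minimum action:

$$X^{*} = \arg\min_{X}\, \mathcal{A}[X].$$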
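The maximum entropy recipe with a single expectation constraint can be illustrated on a small discrete problem. The sketch below is an assumption-laden example rather than part of the original text: it uses a six-sided die with a uniform prior and a prescribed mean of 4.5 (a classic textbook setting), and solves for the single Lagrange multiplier by bisection, using the fact that the constrained mean decreases monotonically with the multiplier. The function names and bracketing interval are hypothetical.

```python
import math

def maxent_distribution(lam, values, prior):
    """p(u) proportional to prior(u) * exp(-lam * f(u)), normalized by Z(lam)."""
    weights = [p0 * math.exp(-lam * f) for f, p0 in zip(values, prior)]
    z = sum(weights)                      # partition function Z(lam)
    return [w / z for w in weights]

def expected(values, p):
    """Expected value of f under the distribution p."""
    return sum(f * pi for f, pi in zip(values, p))

def solve_multiplier(target, values, prior, lo=-50.0, hi=50.0, tol=1e-12):
    """Bisection on lam; d<f>/d(lam) = -Var(f) < 0, so <f> decreases with lam."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if expected(values, maxent_distribution(mid, values, prior)) > target:
            lo = mid                      # mean too large: increase lam
        else:
            hi = mid                      # mean too small: decrease lam
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

faces = [1, 2, 3, 4, 5, 6]                # states u, with f(u) = u
prior = [1 / 6] * 6                       # uniform prior P(u|I0)
lam = solve_multiplier(4.5, faces, prior)
posterior = maxent_distribution(lam, faces, prior)
mean = expected(faces, posterior)
```

The resulting posterior is the prior reweighted by an exponential factor, so it reproduces the constrained mean while remaining proportional to the prior, in line with the compatibility with Bayesian updating noted above. The same structure carries over to maximum caliber once paths are represented by a finite set of coefficients.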

An Illustration: Newtonian Mechanics of Charged Particles
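As an example of the application of this formalism, consider a "particle" with known squared speed $\nu^2(t)$, known instantaneous probability density ρ(x; t), and known velocity field v(x; t) for all times t ∈ [0, τ]. The corresponding constraints are then

$$\big\langle \dot{X}(t)^2 \big\rangle_I = \nu^2(t), \tag{31}$$

$$\big\langle \delta(X(t) - x) \big\rangle_I = \rho(x; t), \tag{32}$$

$$\big\langle \delta(X(t) - x)\, \dot{X}(t) \big\rangle_I = \rho(x; t)\, v(x; t). \tag{33}$$

The resulting maximum caliber solution is of the form

$$P[X|I] = \frac{1}{Z}\, P[X|I_0]\, \exp\Big( -\frac{1}{\alpha}\, \mathcal{A}[X] \Big),$$

with the Hamilton action

$$\mathcal{A}[X] = \int_0^{\tau} dt\, L\big(X(t), \dot{X}(t); t\big)$$

and a Lagrangian defined as

$$L\big(X, \dot{X}; t\big) = \lambda_1(t)\, \dot{X}^2 + \int dx\, \delta(X - x)\, \Big[ \lambda_2(x; t) + \lambda_3(x; t) \cdot \dot{X} \Big],$$

where $\lambda_1$, $\lambda_2$, and $\lambda_3$ are Lagrange multipliers (with the constant α absorbed into them). This Lagrangian can be cast into a more familiar form, by simply renaming the Lagrange multipliers and integrating the delta function [18]:

$$L\big(X, \dot{X}; t\big) = \frac{1}{2}\, m(t)\, \dot{X}^2 - \Phi(X; t) + \dot{X} \cdot A(X; t).$$

Interestingly, this is none other than the Lagrangian for a particle with time-dependent "mass" m(t) in an external "electromagnetic potential" (Φ, A). The most probable path under these constraints is determined by the solution of the Euler-Lagrange equation, which reduces to Newton's second law under a "Lorentz force", as shown in the Appendix:

$$\frac{d}{dt}\Big( m(t)\, \dot{X} \Big) = -\nabla \Phi - \frac{\partial A}{\partial t} + \dot{X} \times (\nabla \times A). \tag{38}$$

In particular, it is important to note that it is the constraint on the squared speed $\nu^2(t)$ that adds the mass m(t) to the model, as $m = 2\lambda_1$; the constraint on the probability density ρ(x; t) adds the scalar potential Φ(x; t) to the model, as $\Phi = -\lambda_2$; and finally the constraint on the local velocity field v(x; t) adds the vector potential A(x; t) to the model, because $A = \lambda_3$. Nowhere in the derivation of this Lagrangian have we assumed the existence of charges, electromagnetic fields, or the Lorentz force. The structure that is revealed is the most unbiased under the constraints given in Equations (31) to (33), that is, with approximate knowledge of its location (given by ρ) and velocity "field lines" (given by v). This model could be used for people in a busy street crossing, or vehicles in a city.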
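The step from the Euler-Lagrange equation to a Lorentz-type force rests on a vector identity, which can be checked numerically. The sketch below assumes a Lagrangian of the form L = (1/2) m v² − Φ + v · A and compares the force written in index form, −∂_i Φ − ∂_t A_i + v_j (∂_i A_j − ∂_j A_i), with the form −∇Φ − ∂A/∂t + v × (∇ × A). The particular fields `phi` and `vec_a`, the test point, and all function names are arbitrary illustrative choices, and derivatives are approximated by central differences.

```python
import math

def phi(x, t):
    """Hypothetical scalar potential."""
    return 0.5 * (x[0]**2 + x[1]**2) + 0.1 * t * x[2]

def vec_a(x, t):
    """Hypothetical vector potential."""
    return [-x[1] * math.cos(t), x[0] * x[2], math.sin(x[0]) + t * x[1]]

def component(j):
    """Scalar function returning the j-th component of vec_a."""
    return lambda x, t: vec_a(x, t)[j]

def grad(f, x, t, h=1e-5):
    """Central-difference spatial gradient of a scalar function f(x, t)."""
    g = []
    for i in range(3):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        g.append((f(xp, t) - f(xm, t)) / (2 * h))
    return g

def d_dt(f, x, t, h=1e-5):
    """Central-difference partial time derivative of f(x, t)."""
    return (f(x, t + h) - f(x, t - h)) / (2 * h)

def force_index_form(x, v, t):
    """F_i = -d_i(phi) - d_t(A_i) + v_j * (d_i A_j - d_j A_i)."""
    g_phi = grad(phi, x, t)
    jac = [grad(component(j), x, t) for j in range(3)]   # jac[j][i] = d_i A_j
    return [-g_phi[i] - d_dt(component(i), x, t)
            + sum(v[j] * (jac[j][i] - jac[i][j]) for j in range(3))
            for i in range(3)]

def force_lorentz_form(x, v, t):
    """F = E + v x B with E = -grad(phi) - dA/dt and B = curl(A)."""
    g_phi = grad(phi, x, t)
    jac = [grad(component(j), x, t) for j in range(3)]
    e = [-g_phi[i] - d_dt(component(i), x, t) for i in range(3)]
    b = [jac[2][1] - jac[1][2], jac[0][2] - jac[2][0], jac[1][0] - jac[0][1]]
    v_cross_b = [v[1] * b[2] - v[2] * b[1],
                 v[2] * b[0] - v[0] * b[2],
                 v[0] * b[1] - v[1] * b[0]]
    return [e[i] + v_cross_b[i] for i in range(3)]

x0, v0, t0 = [0.3, -0.7, 1.1], [1.0, 0.5, -2.0], 0.4     # arbitrary test point
f_index = force_index_form(x0, v0, t0)
f_lorentz = force_lorentz_form(x0, v0, t0)
max_diff = max(abs(a - b) for a, b in zip(f_index, f_lorentz))
```

Since both forms are built from the same finite-difference derivatives, they agree to round-off, which confirms the algebraic identity but not the accuracy of the derivatives themselves.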

Concluding Remarks
We have shown that it is possible to construct a fluid theory from Bayesian inference of an abstract system with N degrees of freedom moving along paths X(t), and that this theory automatically includes the continuity equation and the Cauchy momentum equation as built-in features. Moreover, through the use of the Maximum Caliber principle, it is possible to formulate the dynamics of such an abstract system in terms of an action that is minimal for the most probable path, resembling the well-known structures of Lagrangian and Hamiltonian mechanics.
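By entering the squared speed, instantaneous probability density, and velocity field into our model, a Lagrangian of a "particle" under external fields emerges naturally. This "particle" moves on average according to Newton's law of motion F = m · a under the Lorentz force, with scalar and vector potentials determined by the known information about the location and velocity lines. In this formulation, the only ingredient that we could call physical was the existence of an N-dimensional "particle" moving continuously along (unknown) paths. In this application, we see that position and velocity are the only intrinsic (real or ontological) properties of the particle at a given time t. On the other hand, the time-dependent mass m(t) and the fields Φ(x; t) and A(x; t) are emergent parameters (in fact, Lagrange multipliers) needed to impose the constraints on the known information used to construct the model.

Appendix
Applying the Euler-Lagrange equation to the Lagrangian $L = \frac{1}{2} m(t)\, \dot{X}^2 - \Phi(X; t) + \dot{X} \cdot A(X; t)$, with summation over repeated indices implied, we have

$$\frac{d}{dt}\Big( m\, \dot{X}_i + A_i \Big) = -\partial_i \Phi + \dot{X}_j\, \partial_i A_j,$$

which can be written as

$$\frac{d}{dt}\Big( m\, \dot{X}_i \Big) = -\partial_i \Phi - \partial_t A_i + \dot{X}_j\, \big( \partial_i A_j - \partial_j A_i \big).$$

Using the fact that the i-th component of v × (∇ × A) is given by

$$\big[ v \times (\nabla \times A) \big]_i = v_j\, \partial_i A_j - v_j\, \partial_j A_i,$$

we finally obtain

$$\frac{d}{dt}\Big( m(t)\, \dot{X} \Big) = -\nabla \Phi - \frac{\partial A}{\partial t} + \dot{X} \times (\nabla \times A),$$

which is Equation (38).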