Simultaneous State and Parameter Estimation Using Maximum Relative Entropy with Nonhomogenous Differential Equation Constraints

In this paper, we continue our efforts to show how maximum relative entropy (MrE) can be used as a universal updating algorithm. Here, our purpose is to tackle a joint state and parameter estimation problem where our system is nonlinear and in a non-equilibrium state, i.e., perturbed by varying external forces. Traditional parameter estimation can be performed by using filters, such as the extended Kalman filter (EKF). However, as shown with a toy example of a system with first order non-homogeneous ordinary differential equations, assumptions made by the EKF algorithm (such as the Markov assumption) may not be valid. The problem can be solved with exponential smoothing, e.g., exponentially weighted moving average (EWMA). Although this has been shown to produce acceptable filtering results in real exponential systems, it still cannot simultaneously estimate both the state and its parameters and has its own assumptions that are not always valid, for example when jump discontinuities exist. We show that by applying MrE as a filter, we can not only develop the closed form solutions, but we can also infer the parameters of the differential equation simultaneously with the means. This is useful in real, physical systems, where we want to not only filter the noise from our measurements, but we also want to simultaneously infer the parameters of the dynamics of a nonlinear and non-equilibrium system. Although there were many assumptions made throughout the paper to illustrate that EKF and exponential smoothing are special cases of MrE, we are not “constrained”, by these assumptions. In other words, MrE is completely general and can be used in broader ways. Entropy 2014, 16 4975


Introduction
In this paper, we continue our efforts to show how maximum relative entropy (MrE) [1] links to existing data filtering and parameter estimation approaches in data analysis.In our previous paper [2], we showed the direct connection between MrE and the Kalman filter (KF) [3], which resulted in KF being shown as a special case of MrE.Here, our purpose is to proceed into a joint state and parameter estimation problem where our system is nonlinear and in a non-equilibrium state, i.e., perturbed by known varying external forces.Traditional parameter estimation can be performed by maximizing a likelihood with respect to the unknown parameter [4].The algorithm is called the expectation-maximization algorithm (EM algorithm).However, this approach is sequential, i.e., it has two steps: first, the expectation step and then the maximization step.Both steps are executed in separate, subsequent processing operations [5].Again, our purpose though is to perform joint state and parameter estimation in a single, simultaneous updating step to avoid splitting the operations.
The extended Kalman filter (EKF) [6,7] is a method that is commonly used to infer simultaneously both nonlinear state and parameter values by using an augmented vector.However, EKF does this inference on every occurrence of the discretization interval and as Jazwinski states, "... the filter operates sequentially, requiring no data storage."Memory size has increased over the years, and embedded processing units are already processing at GHz levels; therefore, the storage requirement becomes less critical compared to the actual implementation of the filter.Practically, however, it is very important that the filter be as stable as possible in dynamical situations when the system is perturbed by external forces, and the system parameter is constantly changing.
To illustrate this point, a toy example of a system with a first order non-homogeneous ordinary differential equation that could describe a resistance-inductance (RL) circuit is shown.When the circuit is initially disconnected from a power source and suddenly it is reconnected (switched on), the initial growth rate of the current will be proportional to the power applied and the inductance, L, and will be more "visible" (more information regarding the parameter) at the beginning of the time series.The resistance, R, can be calculated from the steady-state conditions (when a constant voltage source is present), but the end of the time series will result in more information about this parameter.Thus, in real applications, the beginning of the time series might contain more information about one parameter of the system, while the end of that same time series might contain more information about another parameter.Due to the iterative nature of the EKF algorithm and its Markov assumption, results from the EKF algorithm may not be acceptable when the parameter to be inferred is more "visible" at one end of the time series versus the other end of the same time series.Exponential smoothing attempts to solve this problem, e.g., exponentially weighted moving average (EWMA) [8], by assuming that the time series follows an exponential curve.EWMA has been shown to produce acceptable filtering results in real exponential systems [9], but it still cannot estimate both the state and its parameters simultaneously.
Variational data assimilation [10] used in numeric weather prediction applications is another method designed to find the best fit for a mathematical model given a set of observations.As Courtier and Talagrand state "one first defines a scalar function which, for any solution of the model over (t 0 , t 1 ), measures the "distance" between that solution and the observations" [10].In this paper, the Euclidean distance is derived, which is an optimization criterion, from the principle of MrE.The variational data assimilation method uses the principles of optimal control and constrains the functional with the model equations by incorporating them in the form of adjoint equations [11].In our paper, optimal control variables as defined in [10] are considered as estimates of the means of probabilistic variables, and the origin of these constraints is explicitly shown.The optimal control signal, which perturbs the actual physical system, is a priori given and known and defined in this paper's filtering application.We extend the data assimilation approach by not only performing the data assimilation (calculating estimates), but also inferring the parameters of our mathematical model by incorporating additional information about the real control signal.
In many systems, such as weather and climate, imposing a control signal is not possible.One possibility to deal with this issue would be to treat it as a regular state variable and directly or indirectly measure or infer it.Such cases are out of the scope of this paper.
The main purpose of this paper is to show how MrE can be used to simultaneously infer both the parameters and state values, while taking into account the transient characteristic of the exponential process.This demonstrates that an exponential smoothing filter is a special case of using MrE as a filter with first order nonhomogenous ordinary differential constraints applied to the posterior solution.

Maximum Relative Entropy Filter
The maximum relative entropy method can be seen as the criterion to determine the 'updated' posterior state of a system given certain constraints on the posterior, including data constraints.For a general, simple example, see [1,12], where both sequential and simultaneous updating are shown.A posterior might not have an analytical closed form solution for determining the estimate for a parameter if the constraints are complex or non-linear.A numerical approach has to be sought to find the estimate in that case.However, the method can be used to update a posterior that is partially known or is partially assumed, because it is of a tractable form, such as used in mean field approaches, [13].MrE will be used in a similar manner in this paper to infer both state and parameter values simultaneously.
In online applications, where the reaction of the system is required after a few milliseconds, performance is crucial and irreducible.Closed form solutions are needed that are scalable in not only sophisticated processing systems, but in embedded environments, as well.In the following example information about the form of the posterior distribution is included.This helps to achieve a closed form analytical solution for the estimates needed, which, in turn, helps to keep iteration steps at a minimum.For example, posterior distribution can be assumed to be a Gaussian distribution or that it is known to be of this form, but the scale is not known.By applying MrE, the distribution's mean and variance can be inferred, which represent the scaling parameters of the distribution.

Maximum Relative Entropy as a Universal Approach for Filtering Applications
Let the following be either an approximate or exact mathematical model about the behavior of the system, f (x (t) , δ,u (t)) = 0, where f is a mathematical model's function, x(t) is the system state vector, u (t) is a control signal, which perturbs the system into a non-equilibrium state (this is assumed to be a known solution, but it does not take away from the generality of the approach), and δ is an unknown system parameter vector (which can be initially guessed or assumed).When the function, f , is differentiable, the same model is valid at any time moment by selecting that time moment's (discretization interval's) probabilistic variable where k is time index of any discretization interval.When the function arguments are not differentiable at all points of their domain, they might have discontinuities at their boundary conditions.However, often, smoothness in one state space variables causes discontinuities in the other state variables, which implicitly enforces a boundary constraint.These connections are often only seen at the macroscopic (course-grained) level.For example, the inductor's current just before the commutation (−0) is equal to the inductor's current right after commutation (+0) in the RL circuit.However, the inductor's voltage has an abrupt change at the commutation moment, and its (−0) and (+0) values can be found implicitly from the boundary condition of the current signal using Kirchhoff's circuit laws.This information becomes useful when the functional with partly differentiable functions need to be constrained.When discontinuity points exist in the function, MrE optimization on the domain that is differentiable is used.This results in unknowns at the boundary conditions or the discontinuity points.Some states, control variables or parameters will be equal at time moments (−0) and (+0).This information might be seen at the macro (course-grained) level of the system.Then, we solve every boundary condition for the unknowns and, finally, infer the parameters of the differential equations simultaneously.
A toy numeric example with a control signal perturbing a first order nonhomogenous ordinary differential equation system by a relay control (on/off or so-called bang-bang control) is shown below.This type of control signal causes a discontinuity point in the domain of the differential equations system.Simultaneous inferring of the differential equation parameters is illustrated.
If the measured time series is in a vector form, cx k , i.e., at time moments t 0 , t 1 , ..., t n observations (data) are collected regarding the one dimensional variable x(t) as cx 0 , cx 1 , ..., cx n , then our entropy will have the form, where the posterior distribution, P = P ( x k , x k ), would update the prior distribution, P likelihood (cx k , x k ), so that our set of estimated values for x k maximizes the entropy; Equation (3).This enforces the constraints in Equation ( 2) on the posterior and produces a Lagrangian, The integral Equation ( 4) can be multidimensional.Furthermore, the exact form of the constraints may be unknown or so-called non-parametric, i.e., their behavior is known, but the constraints' forms or mathematical models are undefined.Integrals should be summed over time windows to establish a Lagrangian for the joint action in such cases.One should consider using the computationally feasible methods of the Monte Carlo family or similar brute force sampling approaches for these cases.In our toy example, an example with a computable solution is provided, and other complicated cases are out of the scope of the paper.
In some cases, Equation ( 4) does not have a closed form solution, and one should seek a numeric iterative approach.For example, when the mathematical models are parabolic partial differential equations, applying the variational data assimilation method as in [11] is suggested.In our toy example, Equation ( 4) has a closed form solution with some unknown parameters from our mathematical model.These parameters are inferred by using boundary conditions, which are satisfied for the time series.If it is assumed that the control signal that perturbs the system is known and it is further assumed that there are two subsequent sets of time series with different perturbations, then one approach to explicitly estimate the value of the unknown parameter δ is to find it, such that it satisfies the boundary condition between the two time series, i.e., the last estimate of the first time series must be equal to the very first estimate of the next time series.This is illustrated in the following section, where this is accomplished by incorporating the differential equation's boundary condition.

Example with Nonhomogenous Differential Equation Constraints
We examine a toy system below that uses a one-dimensional time series of measurements.At time moments t 0 , t 1 , ..., t n observations (data) are collected regarding the one-dimensional variable x(t) as cx 0 , cx 1 , ..., cx n .We assume the observations are represented by independent probabilistic variables x 0 , x 1 , ..., x n , and their uncertainties are equal to σ 2 cx .Then, the probability density function of the joint prior or the likelihood [2] for the time series variables is, (5) Note: It is reasonable to see the likelihood as containing "prior" information, since it represents the relationship that has already been established between observation values and probabilistic variables.
The goal here is to find a posterior distribution for the variables that will satisfy the constraints (dynamics).Frequently, the unbiased model is unknown for the variable as a function of time, because it contains unobservable nonlinearities.Further, the analytical closed form model can be more complicated than what is known a priori about the system based on experience.Yet, often, the main behavior of a system can be approximated by an exponential law.Thus, inferring the posterior distribution may be done by not incorporating the dynamic constraints, which exactly represent the physical system behavior, but by the constraints representing an exponential approximation.This type of assumption that uses an exponential approximation allows for the closed form solutions to be found for practical filtering applications and will be shown below, e.g., simultaneously approximating a time series with an exponential curve, while inferring parameters of the differential equation.
The main problem with the estimation of an exponential mathematical model is that the start and the end processes, and thus, the data from them, are equally important.Markov assumptions cannot then be explicitly assumed.In the resistance-inductance (RL) circuit, where the control signal is the voltage signal for the circuit, the parameters (R and L) of the circuit can be calculated from the transient conditions when a step control signal function is applied and the system is in a steady state.At the beginning, the rate of current increase will give information about the inertia of the system, which is represented by L. At the end of the transient process, the steady state of the system will give information about the resistance of the system R. Thus, the time series is "smoothed" with an approximate exponential law, and the estimates must be determined with a wide enough processing window that not only covers the start of the transient process, but also so that the pattern of the saturation effect is partially "seen" in the time series.This way, the estimates, which behave according to the exponential law, as well as the parameters of the differential equation are inferred.In other words, the more information that is included, the less uncertainty is left in the estimate expression after the entropy is maximized.
Again, for illustration, assume that the following approximation for the system behavior model is, where u (t) is the control signal that perturbs the system variable, x(t), according to Equation ( 6), and where δ and C are the parameters of the first order nonhomogenous, differential equation and x steady is the steady state value of the variable x(t) when there is no control signal u (t) applied, assuming that the system is in a steady state.Further, it can be assumed that the system control signal, u(t), is constant over the duration of all discretization intervals, ∆t, at any time moment k.It can also be assume that Equation ( 6) is the constraint over the probabilistic variable's mean.In this case, the constraint is, Here, the mean of some probabilistic variable is a function of time.The inference process should not only estimate the means of the probabilistic variable at time moments k, but also take into account the dynamics, which is given a priori.The solution of Equation ( 6) is represented by, where x 0 represents the initial condition value at some initial time.If it is assumed that the initial conditions of the variable x(t) are the previous sample's value and the current sample's value is the final value after the discretization interval passed, then Equation ( 8) can be rewritten as: The exponential term does not depend on the actual means, so an intermediate notation will used, , which simplifies the expression to, The common denominator is found, and all of the terms to the left side of Equation ( 10) are moved, which yields, The entropy is maximized to find the joint distribution that is closest to P obs , but that also satisfies the approximate model (knowledge about the system) in Equation ( 9), which is assumed is enough to mimic the behavior of the system at the a macroscopic level.Before doing so, the general shape of the posterior distribution will be assumed, but the scaling factors are unknown, so that, where it is assumed that the variance of the means σ 2 x is constant for each variable, respectively, for illustrative purposes.Then, the entropy to be maximized is, P posterior (x 0 , . . ., x n ) • log P posterior (x 0 ,...,xn) P likelihood (x 0 ,...,xn) dΩ. ( This is maximized to derive the Lagrangian, where dΩ = dx 0 • dx 1 • . . .• dx k • . . .• dx n and where λ k are the Lagrange multipliers, which enforce the constraints (dynamics) over the means of the variables under consideration.By inserting Equations ( 5) and (12) into Equation ( 14) and integrating over the variables, the Lagrangian is simplified to, where ε is an integration constant along with additive terms, which do not affect the Lagrangian in Equation ( 15).This constant can be omitted, which yields the final Lagrangian, It should be noted at this point that the "distance" that is mentioned in [10] is the quadratic term in the Lagrangian.

Maximization of the Expectation
After differentiating the entropy with respect to the variables means, x 0 , . . ., x n , a system of n+1 equations is formed, Every discretization interval constraint (which originates from Equation (11) forms the system of n equations, which together with Equation (17) form a system of 2n + 1 linear equations that have a closed form solution.
For illustration, the solutions for the cases n = 4, n = 6 and n = 8 are provided.After this is done, the pattern of the solution will be generalized.
In the first case (n = 4), the function f is differentiable.For illustration, we assume that there are jump discontinuities at discretization points x 0 , x 4 , x 8 , x 12 , etc.It is not a necessity to strictly enforce this assumption, but it is made so that the sequence of actions necessary for this approach are clear.At the jump discontinuities in this system (control signal in the whole period of the differentiable interval is constant), the boundary conditions claim that x k (−0) = x k (+0) , i.e., the boundary condition is of the Dirichlet type.Then, for every differentiable period of the function, we have two systems of equations, and: which results in to a system of nine equations and mine unknowns consisting of five state variables and four Lagrange multipliers, After some manipulations, the final solution for the two state variables x 0 and x 4 is: and: Values x 0 and x 4 belong to the very first differentiable period of the time series when a control signal u is a constant, i.e., u f irst = const.Thus, the closed form solutions for x 0 (+0) and x 4 (−0) values have been found.
After introducing the following coefficients for convenience, is of the same order as the discretization interval's size.For this example, it is a priori known that some external disturbances affect our system through changing the value of δ, i.e., it is actually a function of time, δ (t).However, in many physical systems, the parameters' rate of change is much smaller than the rates of change of the state and control variables.Specifically, in a current application of this, the rate of change of δ (t) in our system was ten-times smaller than the duration of differentiable period.This allowed us to assume that γ (δ) was constant, because any change in δ resulted in approximately a ten-to one hundred-times smaller change in γ (δ) compared to the change effecting Equation (37).Therefore,the algorithm can start from a certain value of δ, find the δ using Equation (38), then put it back into expressions Equations ( 24)-( 31), with the recalculated estimates of the means at the boundary condition defined by discontinuity point.Normally, after 3-5 iterations, the value of δ converges to the acceptable precision.
When δ converges to an acceptable value, the remaining estimates of , x 1 , x 2 and x 3 are calculated.Thus, some information from observations is precalculated and then applied to the boundary conditions.This yields the exact values for the differentiable points, which are in between the discontinuity points.
Let us analyze the example with n = 6.Similarly to previous example, we derive our boundary solution for the system of 13 equations as, (40) The convenience coefficients are then redefined as: which yields,  38) that has to be avoided when implementing the filtering algorithm.The coefficients A f irst and A second depend on the observation values, as shown in Equations ( 54) and (55).Thus, the size of n needs to be such that the change in observations in the processing window is noticeable in order to avoid this singularity in a real system, i.e., the system must be perturbed enough or its control signal is reduced so that the change of measurements of the input channel is noticeable.If this condition is satisfied, the MrE filter will be stable.Our physical system naturally satisfied this requirement.In this example, during the whole measurement and estimation process, there are no singularities.
We need to make an important theoretical note here regarding this approach as compared to exponential smoothing used in EWMA.For this, the numeric example with n = 4 is used.
After solving the systems of equations in Equation ( 21), in addition to the x 4 value in Equation ( 23), also the expression for x 3 is constructed, From exponential smoothing [9], the next discretization interval's estimate is calculated using, At this point, it is unclear what the parameters θ and x unknown represent in the MrE filter and how to resolve this by analyzing the equations Equations ( 58) and (23).It is clear that parameter θ is our function γ (δ), and Equation (59) can be rewritten in the form, Inserting Equations ( 58) and (23) to Equation (60), and solving for x unknown , yields, All Equation (61) values are constants for a single differentiable period, as was mentioned earlier.Thus, exponential smoothing is a special case of applying the MrE filter.Because of this, the physical meaning of the coefficients in the exponential smoothing can be understood.Furthermore, because MrE is more general, when the situation changes (for example, when the control force changes, or external disturbances change the system parameter δ), it can be adapted to our exponential filter, so that it returns unbiased estimates.

MrE Filtering Experimental Results
In this section, a numerical example using MrE is shown.It should be noted that the inference of the differential equation parameters and calculation of the estimates is performed simultaneously according to Equations ( 38), ( 51   During this experiment, the following values for the exponential equation were selected (the physical dimensions are relative, and all values are taken from the actual physical system) with an initial value of, δ = 0.8, C = 1.5, x steady = 294.15,∆t = 0.001s, u f irst = 100 and u second = 20.The processing was performed on a 2.0-GHz PC, which consumed less than 10% of the total CPU time.No double precision optimizations were implemented for the estimation routines, which means that there is a possibility to implement an even faster MrE filter variant for embedded applications.The longest processing window during the experiment was approximately one second.During the whole experimentation, MrE showed stable behavior, and the estimated δ coefficient varied in the range from 0.88 to 1.17.The duration of differentiable periods of function f varied from 50 up to 300 discretization intervals.The filter is immune to any problems that might occur because of jump discontinuities in the function f .

Summary and Final Remarks
In this paper, we developed the closed form solutions for the probabilistic variable means when first order nonhomogenous ordinary differential equations are used as constraints for an univariate system with jump discontinuities in the control signal using maximum relative entropy (MrE).Further, we also showed that we can infer the parameters of the differential equation simultaneously with the means.This is useful in real, physical systems, where we want to filter the noise from our measurements and simultaneously infer the parameters connected to the dynamics of a nonlinear and nonequilibrium system.
Algorithms, such as extended Kalman filter (EKF), may produce poor results, because its Markov assumptions may not be applicable.An example of this was shown using a RL circuit, where important information on one parameter is at the beginning of the time series, while other important information about the other parameter is at the end of the time series.Thus, the Markov assumption is not valid.This problem can be solved traditionally by using exponential smoothing, like the exponentially weighted moving average (EWMA) method.However, this method also has problems, since it cannot simultaneously estimate both the state and the parameters under consideration.The MrE derivation of the exponential smoothing filter gives insight into the physical meaning of the parameters found in the method of exponential smoothing filters and how this should be adopted when control forces change or external disturbances change our system parameters.The exponential smoothing filter does not involve information about abrupt changes (jump discontinuities), while MrE techniques resolve this by handling such discontinuities.Finally, since EWMA assumes that the data series is exponential, it does not incorporate information from other measurement channels in the same simultaneous step of inference, which limits its applicability.
Another approach for exponential smoothing, which is similar to the one presented in this paper, was the variation data assimilation method.It was shown in this paper that the "distance" between our model and observations can be derived from the principles of MrE.It was also shown that we can not only perform data assimilation, but also perform simultaneous estimation of unknown parameters of the system.
It should be made very clear that although there were many assumptions made throughout the paper to illustrate that EKF and exponential smoothing are special cases of MrE, we are not in general confined by these assumptions.In other words, MrE is completely general and can be used in much broader ways.
The coefficient estimation shown in this paper can cover the estimation of inductance, dissipation factors or the speed of an induction motor and would serve as a practical method for indirect measurement applications, which are of great need in high-speed monitoring applications in dynamical systems.Moreover, practically, singularities were avoided by using MrE filtering, which might lead to the instability of the filter, as mentioned in [14].Future work will include connecting MrE to the unscented Kalman filter (UKF), the ensemble Kalman filter and other advanced filtering methods.
e ) s t e a d y s t a t e v a l u e o f v a r i a b l e ; w h i l e ( t r u e ) / / l o o p c o n t i n u o u s l y { c a l c u l a t e a l p h a and b e t a c o e f f i c i e n t s ; c a l c u l a t e A f i r s t , B f i r s t , A sec , B s e c ; d e l t a = ( B sec − B f i r s t ) / ( A f i r s t −A s e c ) ; i f ( ( d e l t a − e a r l i e r d e l t a )ˆ2 < p r e c i s i o n ) break ; / / h a l t t h e l o o p e a r l i e r d e l t a = d e l t a ; } r e t u r n d e l t a ; } Notice that there is still a singular point in Equation ( )-(57).The intermediate curve values where calculated by Equation (8).

Figure 1
presents the real time curve of estimates and the spread of the observation values.The observations represent {cx 0 , . . ., cx n , . ..} values, and the estimated values are { x 0 , . . ., x n , . ..}.

Figure 1 .
Figure 1.Noisy observations and their estimates after inferring the δ coefficient by the maximization of relative entropy.