Dynamic Expectation Maximization Algorithm for Estimation of Linear Systems with Colored Noise

The free energy principle from neuroscience has recently gained traction as one of the most prominent brain theories that can emulate the brain’s perception and action in a bio-inspired manner. This potential suggests that the theory may hold a key to general artificial intelligence. Leveraging this potential, this paper aims to bridge the gap between neuroscience and robotics by reformulating an FEP-based inference scheme—Dynamic Expectation Maximization—into an algorithm that can perform simultaneous state, input, parameter, and noise hyperparameter estimation of any stable linear state space system subjected to colored noise. The resulting estimator is shown to take the form of an augmented coupled linear estimator. Using this mathematical formulation, we prove that the estimation steps have theoretical guarantees of convergence. The algorithm was rigorously tested in simulation on a wide variety of linear systems with colored noise. The paper concludes by demonstrating the superior performance of DEM for parameter estimation under colored noise in simulation, when compared to state-of-the-art estimators such as the subspace method, Prediction Error Minimization (PEM), and the Expectation Maximization (EM) algorithm. These results contribute to the applicability of DEM as a robust learning algorithm for safe robotic applications.


Introduction
The free energy principle (FEP) has emerged from neuroscience as a unifying theory of the brain [1] and has begun to guide the search for a brain-inspired learning algorithm for robots. Many attempts have been made in this direction, including the state and input observer [2,3], the adaptive controller for robot manipulators [4][5][6], the body perception and action scheme for humanoid robots [7], and the navigation of ground robots [8]. However, the design of a parameter estimation algorithm for linear systems with colored noise remains unexplored. Since the design of an accurate parameter estimator for dynamic systems sits at the core of control systems and robotics, the reformulation of FEP into a brain-inspired estimation algorithm could have a substantial impact on industry and applied robotics.
A wide range of estimators have been proposed in the literature for linear time-invariant (LTI) systems [9], including blind system identification [10][11][12]. However, most of them assume the noises to be temporally uncorrelated (white noise), an assumption that is often violated in practice. This results in biased estimation for the least-squares (LS)-based methods [13] and inaccurate convergence for the iterative methods [14]. Although many attempts have been made to solve this problem, mainly through bias compensation methods [15,16], none of them perform state, input, parameter, and noise estimation for systems with colored noise [17]. The only method that does is the Dynamic Expectation Maximization (DEM) [18] algorithm, which uses FEP to invert a highly nonlinear and hierarchical brain model from sensory data. However, the disconnect between the neuroscience formulation and the standard control systems formulation has so far prevented its adoption; this paper addresses that gap by reformulating DEM for LTI systems with colored noise. We reformulate DEM for LTI systems into a form that is widely used in the robotics domain and use this mathematical formulation to prove that the estimation steps of DEM have theoretical guarantees of convergence [41]. In our prior work, we have discussed the stability conditions of our DEM-based linear state and input observer design [2]. These convergence guarantees and stability criteria are essential for robot safety during operation and are of high relevance to the robotics community. Through extensive simulations on a range of random systems, we show that DEM is a competitive estimator when compared to other benchmarks in the control systems domain. The core contributions of the paper are:

1. Reformulating DEM into an estimation algorithm for LTI systems with colored noise (Section 12).
2. Proving that the estimator has theoretical guarantees of convergence for the estimation steps (Section 14).
3. Proving through rigorous simulation that DEM outperforms the state-of-the-art system identification methods for parameter estimation under colored noise (Section 16).

Problem Statement
Consider the linear plant dynamics given in Equation (1), where A, B and C are constant system matrices, x ∈ R^n is the hidden state, v ∈ R^r is the input and y ∈ R^m is the output:

ẋ = Ax + Bv + w,    y = Cx + z.    (1)
Here, w ∈ R^n and z ∈ R^m represent the process and measurement noise, respectively. The notations of the plant are denoted in boldface, whereas its estimates are denoted in non-boldface letters. The noises in this paper are generated through the convolution of white noise with a Gaussian kernel. The use of colored noise is motivated by the fact that in robotics, unmodelled dynamics and nonlinearity errors can enter the plant dynamics through the noise terms, thereby violating the white noise assumption in practice [20]. The goal of this paper is to reformulate DEM for an LTI system such that, given the output of the system y, the estimator computes the associated states x, inputs v, parameters θ containing the matrices A, B and C, hyperparameters λ that model the noise precision (Π = e^λ), and the uncertainties of all its estimates (Σ^x, Σ^v, Σ^θ, Σ^λ), with the help of the prior (learned) knowledge encoded in the robot brain, such that the estimate best predicts the data. The schematic of the proposed robot brain's inference process is given in Figure 1: a simple block diagram of the robot brain's inference process using DEM. It uses the measurement data y generated from the environment (also called the generative process), and DEM enables the direct fusion of the prior information into the inference process. The concept of generalized coordinates will be detailed in Section 3.1.
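As a concrete illustration of the plant in Equation (1), the following minimal sketch (the system matrices, function names, and discretization are our own assumptions, not the paper's benchmark setup) simulates a stable LTI system whose process and measurement noises are produced exactly as described above, by convolving white noise with a Gaussian kernel:

```python
import numpy as np

def gaussian_colored_noise(n_samples, sigma, dt, rng):
    """Colored noise: white noise convolved with a Gaussian kernel of width sigma."""
    t = np.arange(-4 * sigma, 4 * sigma + dt, dt)
    kernel = np.exp(-t**2 / (2 * sigma**2))
    kernel /= np.linalg.norm(kernel)            # keep the variance comparable
    white = rng.standard_normal(n_samples + len(kernel))
    return np.convolve(white, kernel, mode='same')[:n_samples]

def simulate_plant(A, B, C, v, dt, sigma, rng):
    """Forward-Euler simulation of x' = Ax + Bv + w, y = Cx + z with colored w, z."""
    n, m = A.shape[0], C.shape[0]
    N = len(v)
    w = np.stack([gaussian_colored_noise(N, sigma, dt, rng) for _ in range(n)], axis=1)
    z = np.stack([gaussian_colored_noise(N, sigma, dt, rng) for _ in range(m)], axis=1)
    x, ys = np.zeros(n), []
    for k in range(N):
        ys.append(C @ x + z[k])
        x = x + dt * (A @ x + B @ np.atleast_1d(v[k]) + w[k])
    return np.array(ys)
```

With a stable A (eigenvalues in the left half-plane) and a small dt, the simulated output stays bounded while its noise remains temporally correlated.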

Preliminaries
To reformulate DEM into an estimation algorithm for LTI systems, this section introduces the key concepts and terminologies that are familiar to the FEP audience.

Generative Model
The generative model (plant model) is the robot brain's estimate of the generative process in the environment that generated the data. The robot brain infers this model via model evidencing from the measurement data. The key idea behind DEM's handling of colored noise is to model the trajectory of the time-varying components (states, for example) using generalized coordinates. The use of generalized coordinates is new to the control systems literature and differs from the familiar definition in classical mechanics. In mechanics, generalized coordinates are the minimum number of independent coordinates that define the configuration of a system, whereas in DEM, they form the vector defining the motion of a point through its higher derivatives. In DEM, the emphasis is on tracking the trajectories of states, inputs and outputs instead of their point estimates. The states are expressed in generalized coordinates using their higher-order derivatives, i.e., x̃ = [x, x', x'', ...]^T. The generalized state vector x̃ with an order of generalized motion of p has p + 1 terms: a copy of the state vector as the first term, followed by its p derivatives. Variables in generalized coordinates are denoted by a tilde, and their components (higher derivatives) are denoted by primes. The evolution of the states is then written as x̃' = D_x x̃. The colored noises are modeled such that the covariances of the noise derivatives z̃ = [z, z', z'', ...]^T and w̃ = [w, w', w'', ...]^T are well defined (to be explained in Section 3.3). The generative model representing the system is compactly written as:

x̃' = D_x x̃ = Ã x̃ + B̃ ṽ + w̃,    ỹ = C̃ x̃ + z̃,    (3)

where D_x = S_{(p+1)×(p+1)} ⊗ I_{n×n}, with S the shift matrix having ones on the first superdiagonal and zeros elsewhere. D_x performs the block derivative operation, equivalent to shifting all components in generalized coordinates up by one block. A similar definition holds for D_v (which appears later), with size r(d + 1) × r(d + 1), where p and d are the orders of generalized motion of the states and inputs, respectively.
Here, Ã = I_{p+1} ⊗ A, B̃ = I_{p+1} ⊗ B, and C̃ = I_{p+1} ⊗ C, where ⊗ is the Kronecker tensor product.
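The generalized-coordinate operators have a compact numerical form; this hedged sketch (function names `shift_operator` and `lift` are ours) builds D_x as a superdiagonal shift matrix Kronecker-multiplied with the identity, and lifts A to Ã = I_{p+1} ⊗ A:

```python
import numpy as np

def shift_operator(p, n):
    """Block derivative operator D_x: shifts generalized coordinates up one block."""
    S = np.diag(np.ones(p), k=1)      # (p+1)x(p+1), ones on the first superdiagonal
    return np.kron(S, np.eye(n))

def lift(M, p):
    """Lift a system matrix to generalized coordinates: M_tilde = I_{p+1} kron M."""
    return np.kron(np.eye(p + 1), M)
```

For p = 2 and a scalar state, D_x maps the generalized vector [x, x', x''] to [x', x'', 0], i.e., each component is replaced by its derivative and the highest order is dropped.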

Parameters and Hyperparameters
To simplify the parameter estimation steps of DEM for LTI systems (Section 13.2) and to facilitate the convergence proof (Section 14), we introduce an alternative generative model, a direct reformulation of Equation (3), given in Equations (4) and (5). Throughout the paper, θ represents the set of parameters, whereas λ = [λ^z, λ^w] represents the hyperparameters that define the precision matrices (inverse covariance matrices) of the observation noise and the process noise. For noise modelling, we parametrize the noise precisions using an exponential relation with the hyperparameters, given by [18]:

Π^z = Ω^z e^{λ^z},    Π^w = Ω^w e^{λ^w},    (6)

where Ω^w and Ω^z represent constant matrices encoding how the different noise components are correlated.
Here Π^w and Π^z are the inverse covariances (Σ^w)^{-1} and (Σ^z)^{-1}, or precisions, of the noises. This parameterization ensures that the hyperparameter updates always select positive definite noise precision matrices. We assume zero cross-correlation between w and z. We also assume that θ and λ are time-invariant.
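A quick numerical check of the positive definiteness claim (Ω here is an arbitrary assumed correlation structure, not one from the paper): with Π = e^λ Ω, any real-valued hyperparameter update keeps Π positive definite as long as Ω is.

```python
import numpy as np

# Hedged check: Pi = exp(lambda) * Omega stays positive definite for any real
# hyperparameter value, since exp(lambda) > 0 scales a PD matrix.
Omega = np.array([[1.0, 0.2], [0.2, 1.0]])   # assumed PD correlation structure
for lam in (-3.0, 0.0, 4.5):                 # arbitrary hyperparameter updates
    Pi = np.exp(lam) * Omega
    assert np.all(np.linalg.eigvalsh(Pi) > 0)
```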

Colored Noise
The next step towards handling the colored noise is to embed the noise correlation between different components in the generalized noises w̃ and z̃ given in Equation (3). DEM uses generalized coordinates, which model the correlation between noise derivatives through the temporal precision matrix S (the inverse of the temporal covariance matrix). The generalized noise correlation is assumed to arise from a Gaussian filter, for which S can be calculated as in [18] (Equation (7)), where σ (from the Gaussian kernel) denotes the noise smoothness level. While σ² → 0 denotes white noise, a nonzero σ² denotes colored noise. The covariance between noise derivatives increases exponentially with the order of the noise derivatives. Simulations show that derivatives above order 6 can be neglected [18]. The generalized noise precision matrices are given by Π̃^w = S(σ²) ⊗ Π^w and Π̃^z = S(σ²) ⊗ Π^z, where Π^w and Π^z are the inverse noise covariances.
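The effect of the Gaussian filter can be observed empirically. This sketch (all parameters chosen arbitrarily) generates colored noise as above, numerically differentiates it, and checks two properties implied by a Gaussian autocorrelation: the odd cross-covariance Cov(w, w') vanishes, while Cov(w, w'') is negative.

```python
import numpy as np

# Empirical check (arbitrary parameters): colored noise made with a Gaussian
# kernel has correlated derivatives; corr(w, w') ~ 0 while corr(w, w'') < 0.
rng = np.random.default_rng(1)
dt, sigma, N = 0.01, 0.05, 20_000
t = np.arange(-4 * sigma, 4 * sigma + dt, dt)
kernel = np.exp(-t**2 / (2 * sigma**2))
kernel /= np.linalg.norm(kernel)
w = np.convolve(rng.standard_normal(N), kernel, mode='same')

w1 = np.gradient(w, dt)    # finite-difference first derivative
w2 = np.gradient(w1, dt)   # second derivative

c_01 = np.corrcoef(w, w1)[0, 1]
c_02 = np.corrcoef(w, w2)[0, 1]
assert abs(c_01) < 0.1 and c_02 < -0.3
```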

Generalized Motion of the Outputs and Noises
The generalized motion of the output ỹ is practically expensive to obtain and inaccessible to robotic systems. Moreover, most sensors, such as encoders, operate with discrete measurements. Therefore, as a pre-processing step for estimation (refer to Figure 1), ỹ should be computed from discrete measurements. The goal is to first express the discrete measurements as a function of the output derivatives and then invert this function to compute the derivatives from the discrete measurements. Given the p temporal derivatives in ỹ at time t, the p + 1 output samples surrounding y, stacked in ŷ of size m(p + 1) × 1, can be approximated using a Taylor series expansion [18], with indices i, j = 1, 2, ..., p + 1, where ceil(.) is the smallest-integer (ceiling) function. Inverting this linear relation then gives the generalized motion of the output ỹ at time t from the measured samples. Using ỹ embeds more temporal information about the plant into the data in the form of conditional trajectories, with the disadvantage of a time latency of (p/2)dt in the estimation. For robotic systems with high sampling rates, this latency is negligible [3,20].
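The inversion described above can be sketched as follows. The elementwise form used here, E_{ij} = ((i − ceil((p+1)/2))·dt)^(j−1)/(j−1)!, is our assumption, consistent with the index ranges and ceiling function mentioned in the text; for p = 2 it reduces to the familiar central-difference stencil, and solving E ỹ = ŷ recovers the output derivatives from three consecutive samples:

```python
import numpy as np
from math import ceil, factorial

def taylor_matrix(p, dt):
    # Hypothetical elementwise form E[i,j] = ((i - c) dt)^(j-1) / (j-1)!,
    # with c = ceil((p+1)/2): maps the derivative vector to samples around t.
    c = ceil((p + 1) / 2)
    E = np.empty((p + 1, p + 1))
    for i in range(1, p + 2):
        for j in range(1, p + 2):
            E[i - 1, j - 1] = ((i - c) * dt) ** (j - 1) / factorial(j - 1)
    return E

# Recover [y, y', y''] of y(t) = sin(t) at t0 from 3 samples around t0.
p, dt, t0 = 2, 0.01, 1.0
c = ceil((p + 1) / 2)
samples = np.array([np.sin(t0 + (i - c) * dt) for i in range(1, p + 2)])
y_tilde = np.linalg.solve(taylor_matrix(p, dt), samples)
```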
The next section employs this generalized output along with the generative model for observer designs.

Notations and Conventions
Throughout the paper, the superscript notation represents the quantity being referred to, and the subscript notation represents the derivative operation. For example, Π̃^w_λ represents the derivative of the generalized precision of the process noise w with respect to the hyperparameters λ. The tilde operator (x̃) represents a quantity in generalized coordinates, whereas the bar operator (F̄) represents the time integral of a quantity. All probability distributions are represented by the p(.) notation, whereas expectations are represented by the ⟨·⟩ notation.

Free Energy Objectives
With the preliminaries in place, we can build up towards the complete DEM algorithm. Eventually, in Sections 8 and 12, we will see that it is an optimization algorithm that finds the best estimates for states, inputs, parameters, and hyperparameters for given measurement data. This result is achieved by optimizing two objective functions at the core of the entire Free Energy Principle: the free energy and the free action [1].
Here we derive and simplify these objectives.
The derivation starts from the fundamentals of Variational Inference (VI) [39]. In VI, the posterior distribution p(ϑ/y) of the parameters ϑ, given the measurement y, is expressed as:

p(ϑ/y) = p(ϑ, y) / p(y),    p(y) = ∫ p(ϑ, y) dϑ.    (10)

However, the marginalization over ϑ to calculate p(y) is often intractable because the search space of ϑ is large. A widely used technique is to introduce a variational distribution q(ϑ), known as the recognition density, which acts as an approximate representation of the posterior distribution with q(ϑ) ≈ p(ϑ/y). A common method used among variational Bayes algorithms is to minimize the Kullback-Leibler (KL) divergence between the two distributions, defined as:

KL(q(ϑ)||p(ϑ/y)) = ⟨ln q(ϑ) − ln p(ϑ/y)⟩_{q(ϑ)},    (11)

where ⟨·⟩_{q(ϑ)} represents the expectation over q(ϑ). Substituting Equation (10) in Equation (11) and using ∫ q(ϑ)dϑ = 1 yields:

KL(q(ϑ)||p(ϑ/y)) = ⟨ln q(ϑ)⟩_{q(ϑ)} − ⟨ln p(ϑ, y)⟩_{q(ϑ)} + ln p(y).    (12)
The rearrangement of terms yields:

ln p(y) = ⟨ln p(ϑ, y)⟩_{q(ϑ)} − ⟨ln q(ϑ)⟩_{q(ϑ)} + KL(q(ϑ)||p(ϑ/y)),    (13)

where ln p(y) is the log-evidence, U(ϑ, y) = ln p(ϑ, y) is the internal energy, and H(ϑ) = −⟨ln q(ϑ)⟩_{q(ϑ)} is the entropy of the recognition density. The free energy term in Equation (13) is defined as the sum of the expected internal energy and the entropy:

F = ⟨U(ϑ, y)⟩_{q(ϑ)} + H(ϑ).    (14)

It acts as a lower bound on the log-evidence because the KL divergence term KL(q(ϑ)||p(ϑ/y)) is always non-negative. The maximization of free energy minimizes the divergence term in Equation (13) because the log-evidence is independent of q(ϑ), thereby rendering the variational density q(ϑ) a close approximation of p(ϑ/y). Therefore, the difficult evaluation of the intractable integral in Equation (10) is converted into the much simpler optimization problem of maximizing the free energy. This reduction of inference to a direct optimization of its free energy objectives is the fundamental idea behind variational inference. The free energy term in Equation (13) is equivalent to the Evidence Lower Bound (ELBO) [39]. However, when the parameter set to be estimated includes both time-variant and time-invariant parameters, the free action is used as the objective function to be maximized. The free action is defined as the time integral of the free energy and is given by:

F̄ = ∫ F dt = ∫ [V(ϑ) + H(ϑ)] dt = V̄ + H̄,    (15)

where V(ϑ) = ⟨U(ϑ, y)⟩_{q(ϑ)} is called the variational free energy (VFE). The next sections will deal with two main assumptions used in DEM to simplify the free energy objectives, namely the Laplace approximation and the mean-field assumption. The aim is to derive the simplified free energy objectives for an LTI system under these assumptions.
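The decomposition ln p(y) = F + KL in Equation (13) can be verified on a tiny discrete example (all probability values below are made up for illustration):

```python
import numpy as np

# Toy discrete check of Equation (13): ln p(y) = F + KL(q || p(theta/y)),
# with F = <ln p(theta, y)>_q + H(q). Probability values are invented.
p_joint = np.array([0.3, 0.1, 0.2])          # p(theta_i, y) over 3 values
p_y = p_joint.sum()                          # evidence p(y)
posterior = p_joint / p_y                    # p(theta/y)
q = np.array([0.5, 0.25, 0.25])              # arbitrary recognition density

F = np.sum(q * np.log(p_joint)) - np.sum(q * np.log(q))   # <U>_q + H
KL = np.sum(q * np.log(q / posterior))
assert np.isclose(F + KL, np.log(p_y))
assert F < np.log(p_y)                       # F lower-bounds the log-evidence
```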

Laplace Approximation
The first common approach to simplify the free energy objective is to assume that the variational density q(ϑ) is Gaussian, with variational parameters µ^ϑ and Σ^ϑ as its mode and covariance, respectively [38]. Here, the inverse of Σ^ϑ (denoted by Π^ϑ and known as the conditional precision) represents the confidence in the estimation. The recognition density takes the following form:

q(ϑ) = N(ϑ : µ^ϑ, Σ^ϑ).    (16)

A main advantage of this approximation is that it facilitates an easy computation of the conditional precision Π^ϑ (derived in Section 7.2) as the negative curvature of the internal energy at its mode µ^ϑ.
Therefore, the main aim of this section is to simplify the expression for the internal energy U(ϑ, y) using the Laplace approximation, for an LTI system with its states, inputs, and outputs expressed in generalized coordinates. The internal energy in Equation (13) can be expressed as the sum of log-likelihood and log-prior terms as:

U(ϑ, y) = ln p(ỹ/x̃, ṽ, θ, λ) + ln p(x̃/ṽ, θ, λ) + ln p(ṽ) + ln p(θ) + ln p(λ).    (18)

The parameter set ϑ includes two types of parameters:
1. the states and inputs, which are time-varying and therefore expressed in generalized coordinates;
2. the parameters and hyperparameters, which are time-invariant and not expressed in generalized coordinates.
DEM combines the new sensory information y coming from the environment with the robot brain's priors (refer to Figure 1) in a Bayesian fashion, through the internal energy U(ϑ, y) expression given in Equation (18). The next sections will deal with simplifying U(ϑ, y) and its action Ū by first modeling the probability distributions for the generative model and the priors.

Generative Model
The probability distribution p(ỹ/x̃, ṽ, θ, λ), given in Equation (18), represents the generative model that predicts the output from the current parameter estimates. This probability can be assumed Gaussian, centered around the model's output prediction C̃x̃ (from Equation (3)), with the same uncertainty as that of the measurement noise, Σ̃^z:

p(ỹ/x̃, ṽ, θ, λ) = N(ỹ : C̃x̃, Σ̃^z).    (19)

Since the robot cannot directly measure x̃' in Equation (3), we track the motion of the generalized states through the approximation x̃' = D_x x̃. The prediction for this motion is Ãx̃ + B̃ṽ, with an uncertainty of Σ̃^w. The corresponding Gaussian distribution becomes:

p(x̃/ṽ, θ, λ) = N(D_x x̃ : Ãx̃ + B̃ṽ, Σ̃^w).    (20)

Prior Distributions
The remaining distributions p(ṽ), p(θ) and p(λ) are the priors of the robot brain that can be transferred from prior (learned) experience to the inference process. Similar to the previous section, a Gaussian prior is placed over the inputs as well, with mean η^ṽ and prior covariance L^ṽ = (P^ṽ)^{-1}:

p(ṽ) = N(ṽ : η^ṽ, L^ṽ).    (21)

The prior distribution of the parameters θ ∈ R^l is assumed to be Gaussian, centered around the prior parameter value η^θ, with prior covariance L^θ = (P^θ)^{-1}:

p(θ) = N(θ : η^θ, L^θ).    (22)

Similarly, a Gaussian prior is placed over the hyperparameters λ ∈ R²:

p(λ) = N(λ : η^λ, L^λ).    (23)

Higher values of P^ṽ, P^θ and P^λ represent the robot's higher confidence in its prior estimates η^ṽ, η^θ, and η^λ, respectively.

Simplification of the Internal Energy Action Ū
This section uses the distributions from the previous sections to simplify Ū. The logarithm of a Gaussian density, after dropping constants, takes the general form:

ln N(a : η, P^{-1}) = −½ εᵀ P ε + ½ ln |P|,  with ε = a − η.    (24)

Therefore, substituting Equations (19)-(23) in Equation (18) and simplifying using the prediction error terms ε̃^x = D_x x̃ − Ãx̃ − B̃ṽ, ε̃^y = ỹ − C̃x̃, ε̃^v = ṽ − η^ṽ, ε^θ = θ − η^θ, and ε^λ = λ − η^λ, after dropping constants, yields:

U(ϑ, y) = −½ [(ε̃^y)ᵀ Π̃^z ε̃^y + (ε̃^x)ᵀ Π̃^w ε̃^x + (ε̃^v)ᵀ P^ṽ ε̃^v + (ε^θ)ᵀ P^θ ε^θ + (ε^λ)ᵀ P^λ ε^λ] + ½ [ln|Π̃^z| + ln|Π̃^w| + ln|P^ṽ| + ln|P^θ| + ln|P^λ|].    (25)

Grouping the internal energy terms of the temporal and nontemporal parameters, and summing the internal energy of the temporal parameters over time, yields the action of the internal energy Ū (Equation (28)). It can be observed from Equation (28) that the robot's priors (η^ṽ, P^ṽ, η^θ, P^θ, η^λ, P^λ) enter the free energy objective through the internal energy term. Intuitively, this can be seen as the direct influence of the robot's prior beliefs on the inference process through the mismatches in the robot's predictions. These weighted prediction errors drive the robot's desire to maintain an equilibrium between its internal model and the generative process in the environment. A large mismatch between the robot's predictions and the data results in a large prediction error, which gets precision-weighted and enters the free energy objective through its internal energy.

Mean-Field Approximation
The second widely used assumption for the simplification of the free energy objectives is the factorization of the recognition density into independent subdensities [18], given by:

q(ϑ) = ∏_i q(ϑ_i) = q(x̃, ṽ) q(θ) q(λ).    (29)

This approximation assumes conditional independence between the subdensities (states and parameters, for example). The subdensities are assumed to interact with each other only through the mean-field quantities. The strong biological plausibility of this approximation with respect to the biological brain's inference process is delineated in [27]. The main advantage of this approximation is the simplification of V̄ in Equation (15). However, the mathematical proof of this simplification is missing from the DEM literature [18]. Therefore, this section fills this gap by deriving these proofs and delineating all the intermediate assumptions.

Simplification of the Entropy Action H̄
The entropy of the recognition density in Equation (14), H(ϑ) = −⟨ln q(ϑ)⟩_{q(ϑ)}, can be written for all parameters as:

H(ϑ) = −∫ q(ϑ) ln q(ϑ) dϑ.    (30)

Substituting the mean-field assumption of Equation (29) in Equation (30) and using the normalization property ∫ q(ϑ_i)dϑ_i = 1 yields:

H(ϑ) = −Σ_i ⟨ln q(ϑ_i)⟩_{q(ϑ_i)}.    (31)

We place the Laplace approximation over the marginals of the recognition densities of all parameters:

q(ϑ_i) = N(ϑ_i : µ^i, Σ^i).    (32)

The recognition densities in Equation (32) might look similar to the prior distributions in Equations (21)-(23), mainly due to the common Gaussian form. However, the prior distributions are centered around the prior means with prior covariances, whereas the recognition density of the i-th parameter set is centered around its conditional mean µ^i with conditional covariance Σ^i. The action of the entropy can be calculated by substituting Equation (32) in Equation (31) and summing the entropy terms of the time-dependent parameters over time. Upon dropping the constant terms, it simplifies to:

H̄ = ½ Σ_t ln |Σ^X| + (n_t/2) ln |Σ^θ| + (n_t/2) ln |Σ^λ|.    (33)

Equation (33) shows how the uncertainty in the robot's inference directly enters the objective F̄ through H̄.
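The entropy contribution ½ ln |Σ| of a Gaussian recognition density can be checked numerically; the covariance below is an arbitrary example, and a Monte Carlo estimate of −⟨ln q⟩ should match the closed form ½ ln|Σ| + (n/2) ln(2πe), of which only the first term survives once constants are dropped:

```python
import numpy as np

# Monte Carlo check (arbitrary covariance): entropy of q = N(0, Sigma) equals
# 1/2 ln|Sigma| + (n/2) ln(2 pi e).
rng = np.random.default_rng(2)
Sigma = np.array([[2.0, 0.3], [0.3, 0.5]])
n = Sigma.shape[0]

analytic = 0.5 * np.log(np.linalg.det(Sigma)) + 0.5 * n * np.log(2 * np.pi * np.e)

L = np.linalg.cholesky(Sigma)
x = (L @ rng.standard_normal((n, 200_000))).T       # samples from q
inv = np.linalg.inv(Sigma)
log_q = (-0.5 * np.einsum('ij,jk,ik->i', x, inv, x)
         - 0.5 * np.log((2 * np.pi) ** n * np.linalg.det(Sigma)))
mc = -log_q.mean()                                   # Monte Carlo -<ln q>
assert abs(mc - analytic) < 0.02
```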

Mean-Field Terms
Given the simplified expressions for Ū and H̄, the next step towards finding F̄ in Equation (15) is to evaluate V̄, given by:

V̄ = ∫ ⟨U(y, ϑ)⟩_{q(ϑ)} dt.    (34)

U(y, ϑ) can be simplified using a second-degree Taylor series expansion near the mean µ^ϑ = {µ^x̃, µ^ṽ, µ^θ, µ^λ}:

U(y, ϑ) ≈ U(y, µ^ϑ) + U(y, µ^ϑ)_ϑᵀ (ϑ − µ^ϑ) + ½ (ϑ − µ^ϑ)ᵀ U(y, µ^ϑ)_{ϑϑ} (ϑ − µ^ϑ),    (35)

where we use the shorthand U(y, µ^ϑ) = U(y, ϑ)|_{ϑ=µ^ϑ} and U(y, µ^ϑ)_ϑ = U(y, ϑ)_ϑ|_{ϑ=µ^ϑ}. This approximation of the internal energy has nontrivial implications for the biological brain's decision-making process [21]. The second-order approximation is justified because the Laplace and mean-field approximations entail an internal energy that is quadratic in x̃, ṽ, and θ, as given in Equation (25), thereby reducing all its higher-order derivatives to zero. Moreover, the higher derivatives of U(y, µ^ϑ) with respect to λ reduce to zero because of the assumptions made in Section 9. By differentiating Equation (25) at the mean µ^i, U(y, µ^ϑ)_{ϑ_i} can be found to be all zeroes, which upon substitution in Equation (35) simplifies to:

U(y, ϑ) ≈ U(y, µ^ϑ) + ½ (ϑ − µ^ϑ)ᵀ U(y, µ^ϑ)_{ϑϑ} (ϑ − µ^ϑ).    (36)

Substituting this in Equation (34) and simplifying yields Equation (37). Since the parameters ϑ_i are assumed to be independent of each other, the covariance between them drops to zero, resulting in:

V̄ = Ū(y, µ^ϑ) + Σ_i W̄^i,    (38)

where

W^i = ½ tr(Σ^i U(y, µ^ϑ)_{ϑ_i ϑ_i})    (39)

is defined as the mean-field term. Equation (38) shows that V̄ is the sum of the internal energy action and the mean-field terms. This simplification of V̄ is one of the main advantages of using the mean-field approximation. The approximation can, however, be relaxed to build Generalized Filtering [37,42], which is mainly relevant for nonlinear identification and involves modeling the parameters and hyperparameters in generalized coordinates (together with the states) for online system identification. In this work, we take the simpler approach.
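Equation (36) can be sanity-checked on a quadratic internal energy, for which the second-order expansion is exact (the matrices and means below are arbitrary stand-ins):

```python
import numpy as np

# Check on a quadratic internal energy U(theta) = -1/2 (theta - a)' M (theta - a):
# <U>_q = U(mu) + 1/2 tr(Sigma U_thetatheta), with U_thetatheta = -M, exactly,
# since all higher-order derivatives vanish.
rng = np.random.default_rng(3)
M = np.array([[3.0, 0.5], [0.5, 1.0]])
a = np.array([1.0, -1.0])
mu = np.array([0.5, 0.2])
Sigma = np.array([[0.4, 0.1], [0.1, 0.3]])

def U(theta):
    d = theta - a
    return -0.5 * d @ M @ d

analytic = U(mu) + 0.5 * np.trace(Sigma @ (-M))

samples = rng.multivariate_normal(mu, Sigma, size=300_000)
d = samples - a
mc = np.mean(-0.5 * np.einsum('ij,jk,ik->i', d, M, d))   # Monte Carlo <U>_q
assert abs(mc - analytic) < 0.02
```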

Simplified Free Energy Objectives
This section aims at simplifying the free energy objectives using the results from Sections 5 and 6.

Simplification of the Parameter Precisions
One of the main advantages of the Laplace approximation is the simple evaluation of the covariance associated with the parameter estimation. This is achieved by setting the gradient of the free action with respect to the individual parameter covariances to zero. The free action gradient with respect to the covariance of the parameters ϑ_i is given by:

F̄_{Σ^i} = ½ Ū(y, µ^ϑ)_{ϑ_i ϑ_i} + ½ Π^i.    (41)

The optimal parameter covariance is the covariance that maximizes the free action, i.e., where the gradient vanishes. Assuming that the parameter covariances are time-invariant, equating the gradients to zero yields:

Π^i = −Ū(y, µ^ϑ)_{ϑ_i ϑ_i}.    (42)

From Equation (42), it is clear that the precision of the parameters can be estimated simply by evaluating the negative curvature of the internal energy at the conditional mode. These precision values denote the confidence of the estimator in the parameter estimates. Ideally, the parameter precision increases as the estimation process proceeds. Intuitively, the robot is more confident about its estimates (higher precision) when its predictions of the sensory data show the least variance.

Free Action at Optimal Precision
Substituting Equation (42) into Equation (39) at the optimal precisions results in constant mean-field terms. Therefore, the mean-field terms in the free action given in Equation (40) reduce to a constant under the optimal precision given by Equation (42), and the free action at optimal precision for an LTI system reduces to Equation (43): the sum of the precision-weighted prediction errors of the (generalized) outputs, inputs, and states, the state and input entropy ½ Σ_t ln |Σ^X|, the parameter prediction error and entropy, and the hyperparameter prediction error and entropy. Note that the free action is not a function of the latent variables (ϑ = {x̃, ṽ, θ, λ}) but of the sufficient statistics (µ^ϑ, Σ^ϑ) of the approximate posterior. We regroup the time-dependent components into one variable X = [µ^x̃; µ^ṽ] and use capital letters for the mean estimates of the time-independent components (Θ = µ^θ, Λ = µ^λ). Note that X, Θ, and Λ are part of the generative model and are the mean estimates of the components of the plant dynamics.

Equivalence with the EM Algorithm
One of the most popular formulations of the Expectation Maximization (EM) algorithm for state space models uses Maximum Likelihood Estimation (MLE), where the objective function is the log-likelihood of the data, given by [43]:

L = ln p(y/θ).    (44)

Comparing the objective functions of EM and DEM, given by Equations (44) and (43) respectively, DEM is equivalent to EM when:
• the mean-field terms are neglected,
• the generalized motion is not considered, and
• the robot's priors on ϑ are not considered.
Therefore, DEM can be considered a generalized version of the EM algorithm with additional capabilities. The free action at optimal precision (Equation (43)) is the sum of the prediction errors (PE) and entropies of the generalized states, parameters, noise, and hyperparameters, and is independent of the mean-field terms. Although the mean-field terms turn into constants at optimal precision, their gradients do not. This property is leveraged in Section 8 to account for the uncertainties in the parameter estimation during state estimation and vice versa, through the gradients of the mean-field terms.

Update Rules for Estimation
The Free Energy objectives (Section 4), together with the two simplifying approximations (Sections 5 and 6), are combined with an optimization procedure to form the ultimate DEM algorithm. The optimization procedure consists of the following two sets of update rules:
1. a gradient ascent over the free action F̄ for the time-invariant parameters θ and λ;
2. a gradient ascent over the free energy F for the time-varying parameters X, where F and F̄ are related through F = ∂F̄/∂t.

The core idea is that the time-varying parameters x̃ and ṽ can be estimated online from the robot's instantaneous free energy, whereas the time-invariant parameters θ and λ can be estimated from its free action after observing a sequence of data. Accordingly, the update rules for the two gradient ascents are given by:

µ̇_i = k_i ∂F̄/∂µ_i    (45)

for the a-th update of the time-invariant parameters, and

µ̇_i = Dµ_i + k_i ∂F/∂µ_i    (46)

for the time-varying parameters, where k_i is the learning rate. The presence of the Dµ_i term in Equation (46) differentiates this update rule from the standard gradient ascent used in machine learning. It accommodates the boundary condition that when F is maximized, ∂F/∂µ_i = 0 and µ̇_i = Dµ_i. In other words, when the free energy is maximized, the motion of the generalized states becomes their generalized motion [44]. The update equations for the time-invariant parameters θ and λ do not require the Dµ_i term. Therefore, the update equations at the t-th time step, a-th parameter update step, and b-th hyperparameter update step, after regrouping the (generalized) states and inputs as X = [µ^x̃; µ^ṽ], are given by:

Ẋ = DX + k^X ∂F/∂X,  Θ̇ = k^Θ ∂F̄/∂Θ,  Λ̇ = k^Λ ∂F̄/∂Λ,    (47)

where D = diag(D_x, D_v). Note that the gradient update rules are written not on the latent variables (x̃, ṽ, θ and λ), but on their mean estimates (X, Θ and Λ). Since the update rules should be implemented in the discrete domain for robotics applications, Equation (47) is discretized under local linearization with the corresponding Jacobians, yielding the (generalized) state and input update at time t, the parameter update at step a, and the hyperparameter update at step b (Equation (48)). Equation (48) shows that the update rules depend only on the gradients and curvatures of the free energy objectives.
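The local-linearization discretization can be sketched as follows; for a linear flow µ̇ = Jµ + b, the step ∆µ = (e^{J∆t} − I) J^{−1} µ̇ is exact, unlike forward Euler (the hand-rolled `expm` below avoids external dependencies, and the system is an arbitrary example, not one from the paper):

```python
import numpy as np

def expm(M):
    """Matrix exponential via scaling-and-squaring on the Taylor series."""
    s = max(0, int(np.ceil(np.log2(max(1e-16, np.linalg.norm(M, 1))))) + 1)
    A = M / 2**s
    E, term = np.eye(len(M)), np.eye(len(M))
    for k in range(1, 20):
        term = term @ A / k
        E = E + term
    for _ in range(s):
        E = E @ E
    return E

def ll_step(mu, f, J, dt):
    """Local-linearization step: mu + (expm(J dt) - I) J^{-1} f(mu)."""
    return mu + (expm(J * dt) - np.eye(len(mu))) @ np.linalg.solve(J, f(mu))

J = np.array([[-2.0, 1.0], [0.0, -3.0]])   # arbitrary stable Jacobian
b = np.array([1.0, 0.5])
f = lambda mu: J @ mu + b
mu, dt = np.array([1.0, 1.0]), 0.5

# Closed-form solution of mu' = J mu + b over one step, for comparison.
exact = expm(J * dt) @ mu + (expm(J * dt) - np.eye(2)) @ np.linalg.solve(J, b)
assert np.allclose(ll_step(mu, f, J, dt), exact)
```

Forward Euler, mu + dt * f(mu), incurs a visible error at this step size, while the local-linearization step matches the closed-form solution.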

The DEM Algorithm
The DEM algorithm is an iterative model inversion algorithm that uses Equations (42) and (48) to perform estimation on causal dynamic systems. It can be expressed in three steps:
1. D step: (generalized) state and input estimation;
2. E step: parameter estimation;
3. M step: noise hyperparameter estimation, a nomenclature similar to that of the EM algorithm.

Figure 2 shows an intuitive block diagram that demonstrates the inference process of DEM as coupled dynamics between the D, E, and M steps. The data from the environment and the robot brain's prior distributions are used to infer the generative process. The pseudocode given in Algorithm 1 demonstrates how DEM performs estimation using only the gradients and curvatures of the free energy objectives. The next sections will focus on deriving the algebraic expressions for these quantities.
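The nesting of the three steps can be sketched schematically; in this toy loop, a quadratic objective stands in for the actual free-energy gradients of Algorithm 1, and all targets are made up, so it illustrates only the D-inside-E-inside-M structure, not the real update expressions:

```python
import numpy as np

# Schematic of the nested D-E-M loop structure only (toy quadratic objective).
target = {'X': np.array([1.0, -1.0]),      # states/inputs (D step)
          'Theta': np.array([0.5]),        # parameters (E step)
          'Lambda': np.array([2.0])}       # hyperparameters (M step)
est = {k: np.zeros_like(v) for k, v in target.items()}
k_lr = 0.2                                 # learning rate

def grad(name):
    # gradient of the toy objective -1/2 ||est - target||^2
    return target[name] - est[name]

for b in range(200):                # M step: hyperparameter loop
    for a in range(1):              # E step: parameter loop
        for t in range(1):          # D step: state/input sweep over time
            est['X'] += k_lr * grad('X')
        est['Theta'] += k_lr * grad('Theta')
    est['Lambda'] += k_lr * grad('Lambda')

assert all(np.allclose(est[k], target[k], atol=1e-3) for k in target)
```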

Update Equations for Estimation
The free energy gradients in Equation (47) can be evaluated by differentiating Equation (40) with respect to µ_i. Substituting the resulting expressions into Equation (47) and simplifying yields the update Equation (49), where E_{µ_i} is the gradient of the prediction error term with respect to µ_i, W^i_{µ_j} = ½ ∂/∂µ_j tr(Σ^i U(y, µ^ϑ)_{µ_i µ_i}) is the gradient of the mean-field term of µ_i with respect to µ_j, and G_Λ = ½ ∂/∂Λ ln |Π̃| |_{ϑ=µ^ϑ} is the gradient of the log-determinant of the precision matrix with respect to Λ. From Equation (49), the Jacobians required for the updates given in Equation (48) can be evaluated (Equation (50)). These Jacobians are employed in the different loops corresponding to the D, E, and M steps of Algorithm 1.

Update Equation for Precision of Estimates
The uncertainty in the estimation is represented by the inverses of the precision matrices Π^{ϑ_i}. Differentiating Equations (25) and (28) twice and substituting into Equation (42) upon simplification yields the precision updates of Equation (51). The only unknowns in the DEM update equations, Equations (49)-(51), are the gradients and curvatures of E, W, and G. Sections 9-11 evaluate the simplified algebraic expressions for these gradients and curvatures. Using these simplifications, Section 12 then expands Algorithm 1.

Gradients of (Log Determinant of) Precision
This section evaluates the gradients of the log-determinant of the noise precision (G_Λ, G_ΛΛ) that are required for the hyperparameter update rules of the DEM algorithm. The precision matrix for hyperparameter estimation is modeled as:

Π̃(λ) = diag(Π̃^z, P^ṽ, Π̃^w),  with  Π̃^z = e^{λ^z} (S(σ²) ⊗ Ω^z),  Π̃^w = e^{λ^w} (S(σ²) ⊗ Ω^w),    (52)

where S is the noise smoothness matrix given by Equation (7) and Ω^z, Ω^w are the constant matrices given in Equation (6). Here the precision matrix is parametrized by λ = [λ^z, λ^w], which is the only unknown in Equation (52). Therefore, the gradients of the log-determinant of the precision can be written as:

G_{λ_i} = ½ tr(Π̃^{-1} Π̃_{λ_i}),    (53)

where λ_i is the i-th element of λ. The gradients of the precision in Equation (53) can be evaluated by differentiating Equation (52):

Π̃_{λ^z} = e^{λ^z} (S ⊗ Ω^z) = Π̃^z,  Π̃_{λ^w} = e^{λ^w} (S ⊗ Ω^w) = Π̃^w.    (54)

Substituting Equations (52) and (54) in Equation (53) yields:

G_Λ = ½ [tr(I_{nΠ̃^z}); tr(I_{nΠ̃^w})],    (55)

where I_{nΠ̃^z} is the identity matrix of size nΠ̃^z, the size of the Π̃^z matrix. Π^w and Π^z are modeled to have an exponential relation with λ so that any update on λ results in positive definite precision matrices. However, this relation entails an infinitely differentiable precision matrix with respect to λ, increasing the computational complexity of the algorithm. Therefore, an approximation is made by forcefully setting Π̃_{λ^z λ^z} = Π̃_{λ^w λ^w} = O, while maintaining the exponential relation between Π̃ and λ, thereby ensuring that the optimization proceeds along the correct gradients Π̃_λ. Together with Equation (54), this approximation results in Π̃_{λλ} = O. This assumption has two direct consequences:
• it simplifies all the update rules given in Equations (49) and (50), and
• it simplifies the precision update rule for hyperparameters given in Equation (51).
A direct consequence of this approximation is the simplification of G_ΛΛ in Equation (53), expressed as Equation (56), which upon substitution of Equations (52) and (54) yields Equation (57), where nΠ̃_z = m(p + 1) and nΠ̃_w = n(p + 1) are the sizes of Π̃_z and Π̃_w, respectively.
From Equations (55) and (57), G_Λ and G_ΛΛ are constants and can therefore be pre-computed in Algorithm 1.

Gradients of Prediction Error
In contrast to the result above, the gradients of the prediction error are not constant, as shown in this section.

Gradients of Prediction Error Along (Generalized) States
The errors in the prediction of the (generalized) outputs, inputs, and states are collected in ε̃, which makes up the precision-weighted prediction error defined by E = (1/2) ε̃^T Π̃ ε̃ |_{ϑ=μ_ϑ}, where Π̃ = diag(Π̃_z, P̃_ṽ, Π̃_w). The error and its gradient are given in Equation (58). The gradient of the prediction error with respect to X simplifies to Equation (59). Since E_X is linear in X, differentiating Equation (59) with respect to X yields the simple expression for the curvature E_XX given in Equation (60).
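Since the error is linear in X, the precision-weighted energy has a gradient linear in X and a constant curvature. A minimal numerical sketch, with a generic Jacobian `J` standing in for ε̃_X and illustrative sizes, confirms the analytic gradient against finite differences:

```python
import numpy as np

rng = np.random.default_rng(1)
nE, nX = 5, 3
J = rng.standard_normal((nE, nX))      # stand-in for the constant Jacobian of the error
c = rng.standard_normal(nE)
Pi = 2.0 * np.eye(nE)                  # stand-in precision matrix

def E(X):
    eps = J @ X + c                    # error linear in X
    return 0.5 * eps @ Pi @ eps        # precision-weighted prediction error

X = rng.standard_normal(nX)
grad_analytic = J.T @ Pi @ (J @ X + c)        # E_X, linear in X
curv_analytic = J.T @ Pi @ J                  # E_XX, constant in X

h = 1e-6
grad_numeric = np.array(
    [(E(X + h * np.eye(nX)[i]) - E(X - h * np.eye(nX)[i])) / (2 * h) for i in range(nX)]
)
print(np.max(np.abs(grad_analytic - grad_numeric)))
```

Because E is quadratic, the central difference matches the analytic gradient essentially to machine precision, and the curvature needs to be computed only once.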

Gradients of Prediction Error Along Parameters
To evaluate the gradients of the prediction error along the parameters Θ, the reformulated definition of ε̃ in Equation (61) is used, where M and N are given in Equation (5). This ensures that Θ can be separated out of the expression for E_Θ, making it linear in Θ, as given in Equation (62). Since E_Θ is linear in Θ, differentiating Equation (62) with respect to Θ yields the simple expression for E_ΘΘ in Equation (63).

Gradients of Prediction Error Along Hyperparameters
The gradients of the prediction error along the hyperparameters Λ are simpler: the curvature is given by (1/2) ε̃^T Π̃_{λλ} ε̃ |_{ϑ=μ_ϑ}, which upon using Π̃_{λλ} = 0 gives Equation (64). In summary, Equations (59), (62) and (64) represent the gradients of the prediction error term, whereas Equations (60), (63) and (64) represent its curvatures. The next section evaluates the analytic expressions for all the gradients and curvatures of the mean field term.

Gradients of Mean Field Terms
This section aims to derive the analytic expressions for the mean field terms and their gradients:

Gradients of Mean Field Terms along Hyperparameters
In this section, we prove that all the gradients and curvatures of W^Λ (namely W^Λ_{μ_i} and W^Λ_{μ_i μ_i}) are zero. The mean field term for the hyperparameters Λ is given in Equation (65). To compute the gradients of W^Λ, we need the curvature of the internal energy with respect to λ. This can be evaluated by first differentiating Equation (25) with respect to λ and evaluating it at ϑ = μ_ϑ, which yields Equation (66), where G_Λ is given by Equation (55). Further differentiation gives Equation (67). The assumption Π̃_{λλ} = 0 applied to Equation (67) yields Equation (68), which contains only constants. Therefore, the assumption Π̃_{λλ} = 0 reduces all the gradients and curvatures of the mean field terms of Λ to zero, as stated in Equation (69). Moreover, since the internal energy given in Equation (25) is quadratic in ϑ_i, and since Π̃_{λλ} = 0, all the gradients and curvatures of the mean field term of ϑ_i with respect to itself are zero, as stated in Equation (70).

Gradients of Mean Field Terms Along Generalized States
The mean field term of the combined generalized states X is given in Equation (71). The curvature of the internal energy with respect to X can be calculated by differentiating Equation (25) twice with respect to X, resulting in U(y, μ_ϑ)_XX = −(1/2) ε̃_X^T Π̃ ε̃_X. Substituting this into Equation (71) and differentiating with respect to Θ yields the gradient of the mean field term with respect to Θ, given in Equation (72); similarly, the elements of its curvature matrix with respect to Θ are given in Equation (73). The gradient and curvature of the mean field term of X with respect to Λ are given in Equation (74), where Σ_x̃ is a component of Σ_X = [Σ_x̃, Σ_x̃ṽ; Σ_x̃ṽ^T, Σ_ṽ]. Here the curvature W^X_ΛΛ vanishes due to the assumption that Π̃_{λλ} = 0.

Gradients of Mean Field Terms Along Parameters
The mean field term of the parameters Θ is given in Equation (75). Differentiating Equation (75) with respect to X and substituting Equation (61) yields the gradient in Equation (76) and the elements of the curvature matrix in Equation (77). Differentiating Equation (75) with respect to λ yields the gradient and curvature in Equation (78). Here, W^Θ_ΛΛ vanishes due to the assumption that Π̃_{λλ} = 0.

The Complete DEM Algorithm
By combining the gradients derived in Sections 9, 10, and 11 with Algorithm 1, we can finalize the full DEM algorithm so that it iteratively computes the estimates and their associated precisions from data.

DEM Estimates
The main equations required to perform the DEM update rules given in Equation (48) are summarized in Equations (79)-(81). Substituting Equation (57) into the expression for J_Λ in Equation (79) yields Equation (81), which is independent of Λ. This reduces the algorithm's computational complexity, as J_Λ can now be pre-computed.

Precision of Estimates
This section simplifies the precisions of DEM's estimates for an LTI system. The confidence in the estimates of the (generalized) states and inputs can be simplified using Equations (51) and (60) into Equation (82). From Equation (82), the precisions of the state and input estimates are Π_x̃x̃ = C̃^T Π̃_z C̃ + (D_x − Ã)^T Π̃_w (D_x − Ã) and Π_ṽṽ = P̃_ṽ + B̃^T Π̃_w B̃, respectively, and the cross-correlation between the (generalized) states and inputs is given by −B̃^T Π̃_w (D_x − Ã). Since Π_x̃,ṽ is independent of X, it can be updated outside the D step.
Combining the results of Equations (51) and (63) yields the precision of the parameter estimates Π_θ in Equation (83), which is independent of Θ. From Equations (51), (57) and (64), the precision of the hyperparameter estimates is given in Equation (84), which is a constant and hence is never updated in the algorithm. In conclusion, the estimates obtained from Equation (79), along with the precisions of these estimates given by Equations (82)-(84), completely define the DEM algorithm for an LTI system with colored noise. The complete DEM algorithm is given in Algorithm 2.

Translation into Simplified Mathematical Form
Although the pseudocode derived in the previous sections is sufficient to replicate the DEM algorithm for an LTI system with colored noise, it does not lend itself to analysis with standard control systems tools for stability, convergence, etc. Therefore, in this section we translate the algorithm into a simplified mathematical form that control engineers can readily analyze. The following subsections convert the DEM updates into a coupled linear system.

State and Input Estimation as a Linear Observer
This section reformulates the D step of DEM for an LTI system as a (generalized) state and input observer. Substituting Equation (59) into Equation (79) with a learning rate of k_X = 1 yields the observer in Equation (85) [2]. We now prove that the (generalized) state and input observer of DEM can be reduced to an augmented LTI system, for which an exact discretization can be performed. We proceed by simplifying the mean field terms in Equation (39), introducing matrices Z_M and Z_N whose elements are 0 and 1. This leads to the mean field term being expressed as a linear transformation of X in Equation (89). Substituting Equation (89) into Equation (85) simplifies the observer into Equation (90). The (generalized) state and input observer given by Equation (90) has the form of an augmented LTI system. Therefore, an exact discretization can be used to solve it without the second-order gradient J_X given in Equation (48). This reduces the algorithm's computational complexity because E_XX and W^Θ_XX are no longer necessary for the J_X calculation. Figure 3 shows the simplified control diagram of the observer. The stability condition of this observer (under known θ and λ) and its similarity to the Kalman filter are discussed in our prior work [2]. To evaluate W^Θ_X, one can use either Equation (89) or Equation (76); Equation (90) is derived mainly for simplification and exact discretization.
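The exact discretization of such an augmented LTI observer can be computed with a single matrix exponential. The sketch below uses `A_o` and `B_o` as illustrative stand-ins for the observer matrices of Equation (90) and applies the standard block-matrix trick for zero-order-hold discretization:

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(2)
n, r = 4, 2
A_o = -2.0 * np.eye(n) + 0.1 * rng.standard_normal((n, n))  # stable stand-in observer matrix
B_o = rng.standard_normal((n, r))
dt = 0.1

# Zero-order-hold exact discretization via one matrix exponential of the
# block matrix [[A_o, B_o], [0, 0]]: the top row gives (Ad, Bd).
M = np.zeros((n + r, n + r))
M[:n, :n], M[:n, n:] = A_o, B_o
Md = expm(M * dt)
Ad, Bd = Md[:n, :n], Md[:n, n:]

X = np.zeros(n)
u = np.ones(r)
X_next = Ad @ X + Bd @ u          # one exactly discretized observer step
print(X_next)
```

The discretization is exact for piecewise-constant inputs, which is why no second-order gradient step is needed for the D step.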

Parameter Estimation-System Identification
This section proves that the E step can be reduced to an augmented LTI system, for which an exact discretization can be performed. We proceed by first simplifying the parameter update equation given in Equation (79) into Equation (91). Grouping all W^X_{Θ_i} using Equation (72) yields Equation (92), where Z_θ and Z_I are constant matrices with elements 0 and 1. Substituting vec(ε̃_X) from Equation (93) into Equation (92) yields Equations (94) and (95), where A_2 and B_2 are given in Equation (62). Equation (95) is a linear differential equation in Θ for which an exact discretization can be computed. For each Θ update in Algorithm 2, A_2 and B_2 are also updated, consequently updating A_5 and B_5; therefore, Equation (95) is equivalent to a linear time-varying system. Figure 4 shows the simplified parameter estimation step of the robot brain. To evaluate W^X_Θ, one can use either Equation (72) or Equation (94); Equation (94) was derived mainly for the exact discretization and for the convergence proof in Section 14.

Hyperparameter Update
The update equation in Equation (80) can be simplified into Equation (96), where a_1, a_2, and a_3 are constants independent of Λ. Since Equation (96) is nonlinear in Λ, an approximate discretization, such as the conventional Gauss-Newton update scheme given in Equation (48), must be used for the M step. In summary, the D and E steps follow an exact discretization, whereas the M step follows an approximate discretization.
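The approximate update can be sketched as a local-linearization step: for dλ/dt = f(λ) with Jacobian J_f, update λ ← λ + (e^{J_f Δt} − I) J_f^{-1} f(λ). Below, `f` is a toy quadratic stand-in for the true free-energy gradient in Λ (not the paper's expression), chosen so that the optimum sits at λ_z = λ_w = 8 as in the demonstrative example:

```python
import numpy as np
from scipy.linalg import expm

# Toy stand-in for the free-energy gradient in Lambda; optimum at lam = [8, 8].
def f(lam):
    return -(lam - 8.0)

def jac(lam):
    return -np.eye(2)

lam = np.zeros(2)                 # zero prior, as in the demonstrative example
dt = 1.0
for _ in range(20):
    Jf = jac(lam)
    # Local-linearization (Gauss-Newton style) step of Equation (48):
    lam = lam + (expm(Jf * dt) - np.eye(2)) @ np.linalg.solve(Jf, f(lam))
print(lam)
```

For this linear toy gradient the step contracts the error by a factor e^{-1} per iteration, so the iterate approaches [8, 8].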

Convergence Proof for Parameter and Hyperparameter Estimation
In robotics, it is important that learning algorithms provide a stable solution, especially when robot safety during operation is a concern. A proof of convergence for DEM is therefore important for its widespread use in robotics as a learning algorithm. However, the DEM literature lacks such a mathematical proof of convergence for the estimator. This section provides one for the parameter and hyperparameter estimation steps on LTI systems.
Since the update equation given by Equation (95) is a linear differential equation, proving that A_5 ≺ O is sufficient to prove that Θ converges to a stable solution. Substituting the expression for A_2 from Equation (62) into A_5 in Equation (95) yields Equation (97). Since the prior precision matrix can be chosen to be positive definite, P_θ ≻ O, and it is straightforward to note from the expression for A_2 in Equation (62) that the remaining term of A_5 is negative semi-definite. Combining Equations (92) and (93), after some nontrivial linear algebra [41], yields Equation (98), from which A_5 ≺ O follows directly. This completes the proof that the parameter estimation step of DEM converges for an LTI system. Similarly, from Equation (81), J_Λ ≺ O proves the convergence of the hyperparameter estimation step. For a detailed account of the linear algebra behind the proof of convergence, readers may refer to [41].
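The convergence argument can be illustrated numerically: any A_5 of the form −(G^T Π̃ G + P_θ) with P_θ ≻ O is negative definite, and the exactly discretized update of Equation (95) then converges to the fixed point −A_5^{-1} B_5. The matrices below are random stand-ins, not the paper's actual A_2 and B_2:

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(3)
G = rng.standard_normal((6, 4))            # random stand-in playing the role of A_2
Pi = np.eye(6)                             # stand-in precision matrix
P_theta = 0.5 * np.eye(4)                  # positive-definite prior precision
A5 = -(G.T @ Pi @ G + P_theta)             # negative definite by construction
B5 = rng.standard_normal(4)

assert np.all(np.linalg.eigvalsh(A5) < 0)  # A5 < 0: the update is a stable linear ODE

dt = 0.1
Ad = expm(A5 * dt)
Bd = np.linalg.solve(A5, (Ad - np.eye(4)) @ B5)   # exact discretization of the affine ODE

theta = rng.standard_normal(4)
for _ in range(500):
    theta = Ad @ theta + Bd
theta_star = -np.linalg.solve(A5, B5)      # fixed point of d(theta)/dt = A5 theta + B5
print(np.max(np.abs(theta - theta_star)))
```

Since all eigenvalues of A5 are strictly negative, the iterate contracts toward the fixed point regardless of the initialization.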

A Demonstrative Example
This section provides a proof of concept for DEM through the simulation of an LTI system with colored noise. Since the algorithm can find an infinite number of solutions for a black-box estimation of x̃, ṽ, θ, and λ from y, a black-box estimation is not ideal as a demonstrative example. Therefore, we restrict this section to the joint estimation of x, A, B, Π_w, and Π_z from known y and C. A Gaussian bump input signal v = e^{−0.25(t−12)^2}, centered around t = 12 s and sampled at dt = 0.1 s until T = 32 s, was used. The colored noise was generated with a smoothness value of σ = 0.5 for the Gaussian kernel. The noise precisions were Π_w = e^8 I_2 and Π_z = e^8 I_4, making λ_z = λ_w = 8. The embedding orders of the generalized motion of states and inputs were p = 6 and d = 2, respectively.
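The colored noise of this example can be generated by convolving white Gaussian noise with a Gaussian kernel of smoothness σ = 0.5 s. The sketch below reproduces the setup above (dt = 0.1 s, T = 32 s, Gaussian bump input); the kernel truncation and normalization are implementation choices, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(4)
dt, T, sigma = 0.1, 32.0, 0.5              # sampling time, horizon, kernel smoothness
t = np.arange(0.0, T, dt)
v = np.exp(-0.25 * (t - 12.0) ** 2)        # the Gaussian bump input of this example

# Colored noise: white Gaussian noise convolved with a Gaussian kernel.
white = rng.standard_normal(t.size)
tk = np.arange(-4 * sigma, 4 * sigma + dt, dt)
kernel = np.exp(-tk ** 2 / (2 * sigma ** 2))
kernel /= kernel.sum()                     # normalization is an implementation choice
colored = np.convolve(white, kernel, mode="same")

# Smoothing induces temporal autocorrelation, the hallmark of colored noise:
ac1 = np.corrcoef(colored[:-1], colored[1:])[0, 1]
print(ac1)
```

The lag-1 autocorrelation of the smoothed sequence is close to one, in contrast to white noise, which is what violates the i.i.d. noise assumption of classical estimators.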

Priors for Estimation
As discussed in Section 5.2, three prior distributions are necessary for the algorithm. Since the inputs are known, the input prior η_v is initialized with the known input v, and a tight prior precision of P_v = e^{32} I_1 is used to restrict any changes in v. Similarly, since the parameter C is known, the corresponding prior parameters in η_θ are initialized with C, with tight priors of P_{θ_i} = e^{32}. The prior parameters η_{θ_i} for the unknown A and B matrices are randomly sampled from the range [−2, 2], and a low prior precision of P_{θ_i} = e^6 is used to encourage exploratory behavior. In summary, η_θ = [vec(rand(2,2)^T)^T, vec(rand(2,1)^T)^T, vec(C^T)^T]^T and P_θ = diag(e^6 I_4, e^6 I_2, e^{32} I_8).
Since the hyperparameters are unknown, their priors were set to zero, η_λ = [0 0]^T, with a prior precision of P_λ = e^3 I_2 to encourage exploration.

Results of Estimation
The data y generated from the system in Section 15.1 was used to run the DEM algorithm given in Algorithm 2. Figure 5a demonstrates the successful state estimation of the algorithm. The results of the parameter estimation (A and B) are shown in Figure 5b: the updates begin from randomly selected priors η_θ, marked by red circles, and finally converge. Table 1 shows that DEM's estimates of A and B are close to the real values. This confirms that the parameter estimation can converge close to the real parameters, even when η_θ is randomly selected from the range η_θ ∈ [−2, 2], which is double the size of the real parameter range θ_i ∈ [−1, 1]. Figure 5c shows the successful hyperparameter convergence close to λ_z = λ_w = 8. DEM's confidence in its estimates increases with the E step iterations, as can be seen from Figure 6a, which shows an increase in the parameter precision Π_θ. A similar trend can be observed for Π_X, whereas Π_λ remains constant during the entire algorithm, as proved in Section 12.2. The key idea behind DEM's inference is the maximization of its free energy objectives. Read together, Figures 5 and 6 demonstrate that DEM successfully estimates x̃, θ, and λ, with increasing confidence in its estimates as the estimation proceeds by maximizing the free energy objective from Equation (40). In summary, DEM can be used for the joint estimation of the states, parameters, and hyperparameters of an LTI system subjected to colored noise.

Benchmarking
This section deals with benchmarking DEM against the state-of-the-art parameter estimation methods such as Expectation Maximization (EM), Subspace method (SS), and Prediction Error Minimization (PEM), for black-box estimation (fully unknown x, θ and λ).

Evaluation Metric for Parameter Estimation
For black-box identification, with completely unknown x, θ, and λ, there are infinitely many solutions with an accurate input-output mapping. However, for LTI systems, equivalent realizations are related by a unique similarity transformation. We therefore use the companion canonical form to check the validity of the parameter estimation: both the real and the estimated parameters are transformed into their companion canonical forms, and the (squared) Euclidean distance between them is taken as the sum of squared error (SSE) in parameter estimation. This evaluation metric will be used for parameter estimation in the next section.
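As a simplified illustration of why a canonical form is needed: two realizations related by a similarity transformation describe the same system, so coordinate-free quantities such as the characteristic polynomial coefficients (the entries of the companion form of A) must match. The sketch below checks this for a transformed copy of an illustrative system; the full metric in the paper also standardizes B and C:

```python
import numpy as np

rng = np.random.default_rng(5)
A = np.array([[0.0, 1.0],
              [-0.5, -1.0]])                        # "true" stable A (illustrative)
T = rng.standard_normal((2, 2)) + 2.0 * np.eye(2)   # random similarity transform
A_est = np.linalg.solve(T, A @ T)                   # same system, different state basis

# The companion form of A is built from the characteristic polynomial, which
# is invariant under similarity transforms -- identical systems agree there.
coeff_true = np.poly(A)
coeff_est = np.poly(A_est)
sse = np.sum((coeff_true - coeff_est) ** 2)
print(sse)
```

An SSE near zero for identical systems (and a growing SSE for diverging parameter estimates) is what makes this a usable metric for black-box identification.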

Simulation Setup
A total of 500 (5 × 100) different randomly generated stable systems, with five different noise smoothness values, were used for parameter estimation. All systems had the same number of parameters, n_θ = 14 (n = 2, m = 4, and r = 1), with each θ_i ∈ [−1, 1], while ensuring that the A matrix is stable. All noises were generated with a precision of e^6 (Π_w = e^6 I_2, Π_z = e^6 I_4), with the embedding orders of states and inputs set to p = 6 and d = 2. A Gaussian bump v = e^{−0.25(t−12)^2} was used as the input signal, with dt = 0.5 s and T = 32 s. The prior parameters were randomly initialized such that all η_{θ_i} ∈ [−2, 2], with a prior precision of P_θ = e^4 I_14. Both hyperparameter priors η_{λ_i} were set to zero, with a prior precision of P_λ = e^{−4} I_2.
The System Identification Toolbox from MATLAB was used for the SS (n4sid()) and PEM methods; the solution of SS was used to initialize PEM. An implementation of the EM algorithm for state space models was written in MATLAB based on [45]. n4sid() is inherently designed to handle colored noise, whereas the implemented EM algorithm is not. The code for the DEM algorithm will be openly available at: https://github.com/ajitham123/DEM_LTI.

Results
The results shown in Figure 7 demonstrate the superior performance of DEM in comparison with EM, PEM, and SS, with the minimum SSE during parameter estimation across different noise smoothness values. Additionally, EM and PEM occasionally exploded (<5% of the runs), resulting in outliers in SSE, which were removed for better visualization.
DEM demonstrated a consistent performance without generating any such outliers or exploding solutions, which could be explained by DEM's convergence guarantees for parameter estimation under colored noise [41], as proved in Section 14. In summary, DEM is a competitive parameter estimator for LTI systems with colored noise.

Discussion
The quest for a brain-inspired learning algorithm for robots has culminated in the free energy principle, which postulates that the biological brain's perception is an optimization over its free energy objectives. FEP is of prime importance to robotics because its use of generalized coordinates enables it to gracefully handle colored noise. Colored noise appears in real robotic systems through unmodeled dynamics and nonlinearity errors in the model, giving DEM an advantage over other estimators during estimation. Examples include unmodeled wind disturbances acting on an unmanned aerial vehicle in flight, or the nonlinearity errors in the dynamic model of a robotic manipulator performing a pick-and-place operation. The scope of this work spans the blind system identification of such linear dynamic systems with colored noise.
The fundamental difference between this work and prior work is the reformulation of DEM for an LTI system. While DEM from computational neuroscience focuses on emulating the biological brain's perception through the hierarchical abstraction of a number of interacting non-linear dynamic systems, our work reduces this method to an algorithm for the system identification of an LTI system with colored noise, a well-known problem in robotics. This reformulation enables the standard analyses of convergence, stability, and unbiased estimation that are essential in practical robotics. It also enables DEM to be compared with other estimation algorithms in the control systems domain. The widespread use of DEM in robotics necessitates these mathematical analyses, especially concerning the stable and safe operation of robots in industry and during human-robot interaction.
An algorithm with proven convergence is preferred for safe robotic applications. Therefore, one of the main contributions of this work is the reduction of the estimation algorithm to a coupled augmented system, enabling the proof of convergence of the parameter and hyperparameter estimation steps. This work also demonstrated the successful application of DEM to the estimation of a randomly selected LTI system. Furthermore, we showed through rigorous simulations on a wide range of randomly generated LTI systems that DEM is a competitive algorithm for system identification under colored noise, thereby widening the scope of DEM to a large number of LTI systems in robotics.
One of the main drawbacks of the algorithm is its higher computational complexity compared to estimation algorithms that do not keep track of the trajectory of states. Future work can therefore focus on online estimation using DEM with a reduced computational load. Future work can also extend this algorithm to linear time-varying systems to deal with robots whose system parameters change during operation (a delivery drone dropping deliveries mid-flight, for example). From a practical robotics point of view, DEM's parameter estimation module can be directly applied to a wide range of robots, such as quadrotors, robotic arms, and wheeled robots, for black-box system identification; the input estimation module can be employed in fault-detection systems; and the hyperparameter estimation module can be used for online noise estimation for robust control. DEM can also be extended with a control loop for active inference to perform simultaneous perception and action on robots. This would result in cognitive robots that learn the generative model of the environment by interacting with it and actively seeking new information (active learning) for uncertainty resolution. This would influence multiple domains in robotics, such as human-robot interaction for task learning, swarm robotics for collective learning and distributed control, and informative path planning of aerial robots for environment monitoring. The development of such brain-inspired autonomous agents sits at the core of cognitive robotics research. In summary, DEM has huge potential to be the bio-inspired learning algorithm for future robots.

Conclusions
The free energy principle from neuroscience has great potential to become one of the most prominent frameworks for learning and control in future autonomous systems. Therefore, this paper converted the FEP-based inference scheme called DEM into a joint state, input, parameter, and hyperparameter estimation algorithm for LTI systems with colored noise. We derived the mathematical framework of DEM for LTI systems and proved that the resulting estimator is a combination of coupled linear estimators. We provided a proof of convergence for the estimation steps. Through rigorous simulations on randomly generated linear systems with colored noise at varying smoothness levels, we demonstrated that the DEM algorithm outperforms the EM, PEM, and SS methods for parameter estimation with minimal estimation error. In light of DEM's potential to solve the parameter estimation problem, future research will aim at applying DEM to a quadcopter flying in wind.

Data Availability Statement: The MATLAB code for the DEM algorithm will be openly available at: https://github.com/ajitham123/DEM_LTI.

Acknowledgments:
The authors would like to thank Karl Friston for the thought-provoking discussions on the use of generalized coordinates within the FEP framework.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: