Article

Factor Graph-Based Online Bayesian Identification and Component Evaluation for Multivariate Autoregressive Exogenous Input Models †

Department of Electrical Engineering, Eindhoven University of Technology, 5612 AZ Eindhoven, The Netherlands
* Author to whom correspondence should be addressed.
This paper is an extended version of our paper published in the proceedings of the IEEE European Control Conference, held at Thessaloniki, Greece, 24–27 June 2025.
Entropy 2025, 27(7), 679; https://doi.org/10.3390/e27070679
Submission received: 9 March 2025 / Revised: 5 June 2025 / Accepted: 9 June 2025 / Published: 26 June 2025
(This article belongs to the Special Issue Advances in Probabilistic Machine Learning)

Abstract

We present a Forney-style factor graph representation for the class of multivariate autoregressive models with exogenous inputs, and we propose an online Bayesian parameter-identification procedure based on message passing within this graph. We derive message-update rules for (1) a custom factor node that represents the multivariate autoregressive likelihood function and (2) the matrix normal Wishart distribution over the parameters. The flow of messages reveals how parameter uncertainty propagates into predictive uncertainty over the system outputs and how individual factor nodes and edges contribute to the overall model evidence. We evaluate the message-passing-based procedure on (i) a simulated autoregressive system, demonstrating convergence, and (ii) a benchmark task, demonstrating strong predictive performance.

1. Introduction

Autoregressive models provide a simple yet powerful framework for capturing dynamical systems [1,2,3,4,5]. Multivariate autoregressive models with exogenous inputs (MARX) exhibit a complex dependence structure. Each component of the vector signal evolves as a weighted combination of (i) its own past observations, (ii) other components, and (iii) an exogenous vector-valued input signal [6,7]. This intricate dependence structure generates significant uncertainty in parameter estimation.
Bayesian inference offers a principled approach for quantifying and propagating this uncertainty into predictions for future system outputs [8,9]. Moreover, uncertainty quantification enables the incorporation of information-theoretic quantities into cost functions, which is useful for optimal experimental design and adaptive control [10,11]. Markov Chain Monte Carlo techniques are typically employed to approximate posterior distributions. However, their computational cost makes them impractical for large-scale real-time applications such as online system identification and adaptive control. In contrast, exact and variational inference methods provide full posterior distributions over parameters, thereby enabling robust decision-making under uncertainty [12,13]. This capability is particularly crucial in safety-critical applications, such as robotics, where understanding uncertainty is as important as making accurate predictions.
To address this challenge, we introduce an exact recursive Bayesian estimator that maintains a full posterior distribution and is computationally efficient. Recursive estimators offer a scalable alternative to batch estimators, but they either lack posterior uncertainty over parameters or rely on approximations [3,8]. Shaarawy and Ali proposed an exact recursive Bayesian estimator based on the matrix normal Wishart distribution, demonstrating its effectiveness for system identification [9]. We extend their approach by casting the inference procedure as a message-passing algorithm on a factor graph, thereby improving both computational efficiency and interpretability.
Factor graphs are graphical tools that capture the probabilistic relationships between random variables [14]. Many algorithms, including inference, can be formulated as message passing on a factor graph. Thus, message passing on factor graphs provides a structured and scalable framework for Bayesian inference, offering several key advantages over conventional inference frameworks [15,16,17]. We specifically consider Forney-style factor graphs, for their simplicity and compact visual representation [18]. First, factor graphs offer an intuitive representation of probabilistic models and data flow by depicting distinct probabilistic relationships as separate factor nodes that explicitly capture dependencies between variables [15,17]. This structured representation makes the inference process more interpretable and supports a more flexible model design, contributing to explainable artificial intelligence [19,20]. Second, message passing on factor graphs enables distributed computation by structuring inference into localized update rules at each node [21]. In particular, casting inference as message passing on a factor graph can enable federated learning, which accelerates learning in a multi-agent setting where physically separated agents share likelihood messages for joint parameter estimation [22]. This formulation significantly reduces the computational complexity compared to traditional recursive methods, making real-time Bayesian inference more tractable in large-scale settings [23,24]. Localized updates facilitate the efficient propagation of uncertainty throughout the graph, allowing for the attribution of uncertainty to specific sources, for example, distinguishing between prediction uncertainty arising from the likelihood model versus uncertainty in the inferred parameters. This fine-grained decomposition of uncertainties further enables a novel evaluation of model performance: the negative log-model evidence (surprisal) can be decomposed into contributions from individual nodes and edges in the factor graph. By analyzing how these contributions evolve over time, one gains detailed insights into the learning dynamics during system identification, thus linking model evaluation directly to the underlying probabilistic structure. Lastly, message passing unifies a broad class of algorithms, spanning signal filtering, optimal control, and path planning [14,17,24,25], making it a computationally efficient tool for probabilistic reasoning in large-scale problems. Overall, by leveraging this structured inference technique, our approach not only enhances Bayesian inference for dynamical systems but also yields more interpretable, scalable, and computationally efficient probabilistic machine learning models.
In summary, our key contributions are as follows:
  • We derive a message-passing algorithm for exact recursive Bayesian inference in MARX models, maintaining full posterior distributions while ensuring computational efficiency.
  • We extend the inference framework to predict future system outputs that explicitly account for parameter uncertainty, improving robustness for real-time applications.
  • We introduce a novel model evaluation method by decomposing the negative log-model evidence (surprisal) into contributions from individual nodes and edges in the factor graph, providing insights into uncertainty and learning dynamics.
  • We demonstrate the effectiveness of our approach through empirical evaluations on (i) a synthetic MARX system with known parameters for verification, and (ii) two synthetic dynamical systems with unknown parameters: a double mass-spring-damper system and a nonlinear double pendulum system.
The remainder of this paper is organized as follows. In Section 2, we formally describe the class of discrete-time dynamical systems considered. In Section 3, we present our probabilistic MARX model and its representation using Forney-style factor graphs. In Section 4, we detail the message-passing algorithm for recursive Bayesian inference, including both parameter estimation and predictive inference. In Section 5, we introduce our novel evaluation method based on decomposing surprisal. In Section 6, we demonstrate the effectiveness of our approach on synthetic system identification tasks. In Section 7, we discuss the computational benefits, interpretability, and broader implications of our method. Finally, in Section 8, we conclude this paper.

2. Problem Statement

We consider discrete-time dynamical systems, represented by a state $z_k \in \mathbb{R}^{D_z}$ and driven by a control signal $u_k \in \mathbb{R}^{D_u}$. These systems evolve according to a state transition function $f : \mathbb{R}^{D_z} \times \mathbb{R}^{D_u} \to \mathbb{R}^{D_z}$. At each time step, we observe a noisy measurement $y_k \in \mathbb{R}^{D_y}$ of the state via a measurement function $g : \mathbb{R}^{D_z} \to \mathbb{R}^{D_y}$. This can be expressed as a state–space model of the form
$$z_k = f(z_{k-1}, u_k), \qquad y_k = g(z_k) + e_k,$$
where $e_k \in \mathbb{R}^{D_y}$ is a stochastic disturbance. Our objective is to predict future observations $y_t$ for $t > k$, given future inputs $u_t$, without prior knowledge about the system dynamics.

3. Model Specification

To address the problem defined in Section 2, we propose a probabilistic model that enables recursive learning and prediction of future observations in a partially observed dynamical system. Specifically, we assume that the unknown system can be approximated by a multivariate autoregressive model with exogenous inputs of order $N$, denoted as MARX($N$). Let $y_k \in \mathbb{R}^{D_y}$ denote the $D_y$-dimensional observation at time step $k$. We collect the past $N_y$ outputs into the matrix
$$\bar{y}_{k-1} \triangleq \begin{bmatrix} y_{k-1,1} & y_{k-2,1} & \cdots & y_{k-N_y,1} \\ \vdots & \vdots & & \vdots \\ y_{k-1,D_y} & y_{k-2,D_y} & \cdots & y_{k-N_y,D_y} \end{bmatrix},$$
and, similarly, the most recent $N_u$ control inputs into
$$\bar{u}_{k} \triangleq \begin{bmatrix} u_{k,1} & u_{k-1,1} & \cdots & u_{k-N_u+1,1} \\ \vdots & \vdots & & \vdots \\ u_{k,D_u} & u_{k-1,D_u} & \cdots & u_{k-N_u+1,D_u} \end{bmatrix}.$$
We then reshape both matrices $\bar{y}_{k-1}$ and $\bar{u}_k$ into a single vector $x_k \in \mathbb{R}^{D_x}$, where $D_x = N_y D_y + N_u D_u$:
$$x_k \triangleq \begin{bmatrix} \mathrm{vec}(\bar{y}_{k-1}) \\ \mathrm{vec}(\bar{u}_k) \end{bmatrix},$$
and $\mathrm{vec}(\cdot)$ denotes the column-wise vectorization operator that stacks the columns of a matrix into a single column vector [26]. At the core of our MARX($N$) model is a vector autoregressive process with exogenous inputs, characterized by the following likelihood function:
$$p(y_k \mid \Theta, x_k) = \mathcal{N}(y_k \mid A^\top x_k, W^{-1}) = \sqrt{\frac{|W|}{(2\pi)^{D_y}}}\exp\!\Big(-\tfrac{1}{2}(y_k - A^\top x_k)^\top W (y_k - A^\top x_k)\Big),$$
where the parameters—jointly denoted as $\Theta = (A, W)$—consist of a regression coefficient matrix $A \in \mathbb{R}^{D_x \times D_y}$ and a noise precision matrix $W \in \mathbb{R}_{+}^{D_y \times D_y}$, with $\mathbb{R}_{+}$ denoting the space of positive semi-definite matrices. Each column $A_{:,j}$ specifies how the full memory vector $x_k$ (comprising past outputs and inputs) linearly predicts the $j$th component of the current observation, $y_{k,j}$. In state–space terminology, $A$ captures both the temporal memory and the cross-variable coupling by weighting each lagged signal in $x_k$. The matrix $W$ represents the inverse covariance (precision) of the Gaussian measurement noise: its diagonal entries set the inverse variances for each observed dimension, while the off-diagonals model instantaneous noise correlations between different components of $y_k$.
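For concreteness, the construction of the memory vector in (1) and the evaluation of the likelihood (2) can be sketched in a few lines of Julia. This is an illustrative sketch and not the accompanying repository's implementation; the function names `memory_vector` and `marx_loglikelihood` are ours, and `ys`/`us` are assumed to hold at least $N_y$ past outputs and $N_u$ inputs, ordered from oldest to newest.

```julia
using LinearAlgebra

# Build x_k = [vec(ȳ_{k-1}); vec(ū_k)] from the N_y most recent outputs and the
# N_u most recent inputs. With ys[end] = y_{k-1} and us[end] = u_k, the stacking
# order matches (1): [y_{k-1}; …; y_{k-N_y}; u_k; …; u_{k-N_u+1}].
function memory_vector(ys, us, Ny, Nu)
    ybar = reduce(vcat, reverse(ys[end-Ny+1:end]))
    ubar = reduce(vcat, reverse(us[end-Nu+1:end]))
    return vcat(ybar, ubar)
end

# Log of the Gaussian likelihood N(y_k | Aᵀ x_k, W⁻¹) in (2).
function marx_loglikelihood(y, x, A, W)
    r = y - A' * x
    return 0.5 * logdet(W) - 0.5 * length(y) * log(2π) - 0.5 * dot(r, W * r)
end
```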
For computational convenience (see Section 4.1), we specify our prior distribution over $\Theta$ as a matrix normal Wishart distribution [27]:
$$p(\Theta) = p(A \mid W)\, p(W) = \mathcal{MN}(A \mid M_0, \Lambda_0^{-1}, W^{-1})\,\mathcal{W}(W \mid \Omega_0^{-1}, \nu_0).$$
Here, the coefficient matrix $A$ follows a matrix normal distribution with mean $M_0 \in \mathbb{R}^{D_x \times D_y}$, row covariance $\Lambda_0^{-1} \in \mathbb{R}^{D_x \times D_x}$, and column covariance $W^{-1} \in \mathbb{R}^{D_y \times D_y}$,
$$p(A \mid W) = \mathcal{MN}(A \mid M_0, \Lambda_0^{-1}, W^{-1}) = \sqrt{\frac{|W|^{D_x}|\Lambda_0|^{D_y}}{(2\pi)^{D_x D_y}}}\exp\!\Big(-\tfrac{1}{2}\mathrm{tr}\big[W(A - M_0)^\top \Lambda_0 (A - M_0)\big]\Big),$$
where $\mathrm{tr}(\cdot)$ denotes the trace of a square matrix, i.e., the sum of its diagonal entries [26]. The precision matrix $W$ follows a Wishart distribution with scale matrix $\Omega_0^{-1} \in \mathbb{R}^{D_y \times D_y}$ and degrees of freedom $\nu_0 \in \mathbb{R}$,
$$p(W) = \mathcal{W}(W \mid \Omega_0^{-1}, \nu_0) = \sqrt{\frac{|\Omega_0|^{\nu_0}|W|^{\nu_0 - D_y - 1}}{2^{\nu_0 D_y}}}\frac{1}{\Gamma_{D_y}(\nu_0/2)}\exp\!\Big(-\tfrac{1}{2}\mathrm{tr}\big[W\Omega_0\big]\Big).$$
Here, $\Gamma_{D_y}(\cdot)$ is the multivariate gamma function with dimension $D_y$ [28]. Our goal is to infer the posterior distribution over $A$ and $W$ and subsequently use these parameter posterior distributions to make predictions for future outputs $y_t$.
The chosen prior and likelihood define the following generative model over the joint distribution of observations, inputs, and parameters:
$$p(y_{1:k}, u_{1:k}, \Theta) = p(\Theta)\prod_{i=1}^{k}p(y_i \mid \Theta, x_i).$$
We consider two inference paradigms for parameter estimation [29]. In batch estimation, the full dataset is used to compute the posterior:
$$p(\Theta \mid y_{1:k}, u_{1:k}) \propto p(\Theta)\prod_{i=1}^{k}p(y_i \mid \Theta, x_i).$$
Alternatively, in recursive estimation, the posterior is updated incrementally as new data arrives:
$$p(\Theta \mid y_{1:k}, u_{1:k}) \propto p(\Theta \mid y_{1:k-1}, u_{1:k-1})\, p(y_k \mid \Theta, y_{1:k-1}, u_{1:k}).$$
In this paper, we focus on the recursive formulation, which enables efficient online model updates and is well suited for real-time applications and systems where storing and reprocessing the entire history is infeasible.

Factor Graph

The probabilistic graphical model underlying the recursive formulation is straightforward, consisting of a prior distribution and a likelihood function. Figure 1 presents a Forney-style factor graph in which nodes represent factors, edges denote variables, and each edge connects exactly two nodes [15]. In the graph, time flows from left to right, predictions flow from top to bottom, and corrections flow from bottom to top. The factor node labeled MNW represents the matrix normal Wishart prediction distribution along with its associated prior parameters. The dashed box represents the composite likelihood node, which comprises (i) the concatenation operation described in (1), (ii) the dot-product operation between the regression coefficient matrix $A$ and the memory $x_k$, and (iii) the stochastic disturbance. The equality node connects the parameters $\Theta$ to the likelihood nodes for each time step $k$.

4. Inference

Inference consists of two stages: (i) parameter estimation, where we infer the model parameters from observed outputs $y_k$ (Section 4.1), and (ii) output prediction, where we forecast future outputs $y_t$ for $t > k$, given future system inputs $u_{k+1}$ (Section 4.2).

4.1. Parameter Estimation

We wish to recursively estimate the posterior distribution over the model parameters:
$$p(\Theta \mid \mathcal{D}_k) = \frac{p(y_k \mid \Theta, x_k)}{p(y_k \mid u_k, \mathcal{D}_{k-1})}\, p(\Theta \mid \mathcal{D}_{k-1}),$$
where $\mathcal{D}_k = \{y_i, u_i\}_{i=1}^{k}$ denotes the data up to time $k$. Note that the memory vector $x_k$ is a subset of $\mathcal{D}_{k-1}$. The evidence term in the denominator is
$$p(y_k \mid u_k, \mathcal{D}_{k-1}) = \int p(y_k \mid \Theta, x_k)\, p(\Theta \mid \mathcal{D}_{k-1})\, d\Theta.$$
This evidence term will be discussed in detail in Section 5.
Lemma 1.
Combining the MARX likelihood (2) with a matrix normal Wishart prior distribution over the MARX coefficient matrix $A$ and precision matrix $W$ (3) yields a matrix normal Wishart distribution:
$$p(\Theta \mid \mathcal{D}_k) = \mathcal{MNW}(A, W \mid M_k, \Lambda_k^{-1}, \Omega_k^{-1}, \nu_k),$$
with the following parameter updates:
$$\begin{aligned}
\nu_k &= \nu_{k-1} + 1, \\
\Lambda_k &= \Lambda_{k-1} + x_k x_k^\top, \\
M_k &= (\Lambda_{k-1} + x_k x_k^\top)^{-1}(\Lambda_{k-1}M_{k-1} + x_k y_k^\top), \\
\Omega_k &= \Omega_{k-1} + y_k y_k^\top + M_{k-1}^\top \Lambda_{k-1} M_{k-1} - (\Lambda_{k-1}M_{k-1} + x_k y_k^\top)^\top(\Lambda_{k-1} + x_k x_k^\top)^{-1}(\Lambda_{k-1}M_{k-1} + x_k y_k^\top).
\end{aligned}$$
See Appendix A for the proof. This solution can be cast as a message-passing procedure on a factor graph, allowing distributed computation [15,30].
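As an illustration of how lightweight the resulting update is, the four equations of Lemma 1 translate directly into code. The sketch below is ours (the repository linked in Section 6 contains the reference implementation) and assumes the posterior is stored as the tuple $(M, \Lambda, \Omega, \nu)$.

```julia
using LinearAlgebra

# One recursive update of the matrix normal Wishart posterior (Lemma 1),
# given the memory vector x = x_k and the new observation y = y_k.
function mnw_update(M, Λ, Ω, ν, x, y)
    Λ_new = Λ + x * x'
    Ξ_new = Λ * M + x * y'           # Λ_{k-1} M_{k-1} + x_k y_kᵀ
    M_new = Λ_new \ Ξ_new            # M_k = Λ_k⁻¹ Ξ_k
    Ω_new = Ω + y * y' + M' * Λ * M - Ξ_new' * M_new   # uses M_kᵀ Λ_k M_k = Ξ_kᵀ M_k
    ν_new = ν + 1
    return M_new, Λ_new, Ω_new, ν_new
end
```

The single linear solve dominates the cost of each update, which is consistent with the complexity discussion in Section 7.1.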
In Figure 1, circled messages indicate the information flow between the factor nodes along the edges. The message from the prior node carries the previous posterior belief over $\Theta = (A, W)$:
$$p(\Theta \mid \mathcal{D}_{k-1}) = \mathcal{MNW}(A, W \mid M_{k-1}, \Lambda_{k-1}^{-1}, \Omega_{k-1}^{-1}, \nu_{k-1}).$$
The sum–product message from the composite MARX likelihood towards its parameters is the likelihood function itself, re-expressible as a probability distribution over $\Theta$.
Lemma 2.
The message from the composite MARX likelihood (2) towards its parameters is matrix normal Wishart distributed as follows:
$$p(y_k \mid \Theta, x_k) \propto \mathcal{MNW}(A, W \mid \bar{M}_k, \bar{\Lambda}_k^{-1}, \bar{\Omega}_k^{-1}, \bar{\nu}_k).$$
Its parameters are
$$\bar{\nu}_k = 2 - D_x + D_y, \qquad \bar{\Lambda}_k = x_k x_k^\top, \qquad \bar{M}_k = (x_k x_k^\top)^{-1}x_k y_k^\top, \qquad \bar{\Omega}_k = \mathbf{0}_{D_y \times D_y}.$$
See Appendix B for the proof. Note that the scale matrix is not positive-definite, which implies that this message is an improper distribution. Utilizing improper distributions is not uncommon when messages are intermediate results. For example, in variational and particle-based message passing, the messages are unnormalized and therefore also technically improper distributions [31,32]. However, should one want to visualize this message or convert it to a related distribution, for instance, then the scale matrix can be perturbed with a machine-precision offset (i.e., $\bar{\Omega}_k = 10^{-8}\cdot I_{D_y \times D_y}$).
The outgoing message from the equality node results from multiplying the two incoming messages, i.e., the prior message and the likelihood message [15].
Lemma 3.
Let $p_1$ and $p_2$ be two matrix normal Wishart distributions over the same random variables $\Theta$:
$$p_1(\Theta) = \mathcal{MNW}(A, W \mid M_1, \Lambda_1^{-1}, \Omega_1^{-1}, \nu_1), \qquad p_2(\Theta) = \mathcal{MNW}(A, W \mid M_2, \Lambda_2^{-1}, \Omega_2^{-1}, \nu_2).$$
Their product is proportional to another matrix normal Wishart distribution:
$$p_1(\Theta)\,p_2(\Theta) \propto \mathcal{MNW}(A, W \mid M_3, \Lambda_3^{-1}, \Omega_3^{-1}, \nu_3),$$
and its parameters are combinations of the parameters of $p_1$ and $p_2$:
$$\begin{aligned}
\nu_3 &= \nu_1 + \nu_2 + D_x - D_y - 1, \\
\Lambda_3 &= \Lambda_1 + \Lambda_2, \\
M_3 &= (\Lambda_1 + \Lambda_2)^{-1}(\Lambda_1 M_1 + \Lambda_2 M_2), \\
\Omega_3 &= \Omega_1 + \Omega_2 + M_1^\top\Lambda_1 M_1 + M_2^\top\Lambda_2 M_2 - (\Lambda_1 M_1 + \Lambda_2 M_2)^\top(\Lambda_1 + \Lambda_2)^{-1}(\Lambda_1 M_1 + \Lambda_2 M_2).
\end{aligned}$$
See Appendix C for the proof.
Theorem 1.
The outgoing message from the equality node, given by the product of the incoming prior and likelihood messages, is proportional to the exact recursive posterior distribution:
$$p(\Theta \mid \mathcal{D}_{k-1})\, p(y_k \mid \Theta, x_k) \propto \mathcal{MNW}(A, W \mid M_k, \Lambda_k^{-1}, \Omega_k^{-1}, \nu_k).$$
Proof. 
Combining the parameters of the messages in (6) and (7) according to the product operation in Lemma 3 yields
$$\begin{aligned}
\nu_k &= \nu_{k-1} + \bar{\nu}_k + D_x - D_y - 1 = \nu_{k-1} + 1, \\
\Lambda_k &= \Lambda_{k-1} + \bar{\Lambda}_k = \Lambda_{k-1} + x_k x_k^\top, \\
M_k &= (\Lambda_{k-1} + \bar{\Lambda}_k)^{-1}(\Lambda_{k-1}M_{k-1} + \bar{\Lambda}_k\bar{M}_k) = (\Lambda_{k-1} + x_k x_k^\top)^{-1}(\Lambda_{k-1}M_{k-1} + x_k y_k^\top), \\
\Omega_k &= \Omega_{k-1} + \bar{\Omega}_k + M_{k-1}^\top\Lambda_{k-1}M_{k-1} + \bar{M}_k^\top\bar{\Lambda}_k\bar{M}_k - (\Lambda_{k-1}M_{k-1} + \bar{\Lambda}_k\bar{M}_k)^\top(\Lambda_{k-1} + \bar{\Lambda}_k)^{-1}(\Lambda_{k-1}M_{k-1} + \bar{\Lambda}_k\bar{M}_k) \\
&= \Omega_{k-1} + M_{k-1}^\top\Lambda_{k-1}M_{k-1} + y_k y_k^\top - (\Lambda_{k-1}M_{k-1} + x_k y_k^\top)^\top(\Lambda_{k-1} + x_k x_k^\top)^{-1}(\Lambda_{k-1}M_{k-1} + x_k y_k^\top).
\end{aligned}$$
These match the parameter update rules outlined in Lemma 1. □

4.2. Output Prediction

Predicting future system outputs amounts to computing the posterior predictive distribution, i.e., the marginal distribution of $y_t$ for $t > k$:
$$p(y_t \mid u_t, \mathcal{D}_k) = \int p(y_t \mid \Theta, x_t)\, p(\Theta \mid \mathcal{D}_k)\, d\Theta.$$
We exploit the factorization of the parameter posterior over $(A, W)$ to split this into a marginalization over $A$,
$$p(y_t \mid W, u_t, \mathcal{D}_k) = \int p(y_t \mid \Theta, x_t)\, p(A \mid W, \mathcal{D}_k)\, dA,$$
and a marginalization over $W$:
$$p(y_t \mid u_t, \mathcal{D}_k) = \int p(y_t \mid W, u_t, \mathcal{D}_k)\, p(W \mid \mathcal{D}_k)\, dW.$$
Theorem 2.
Marginalizing the composite MARX likelihood (2) over the matrix normal distribution (4) for $A$ yields a multivariate normal distribution:
$$\int \mathcal{N}(y_t \mid A^\top x_t, W^{-1})\,\mathcal{MN}(A \mid M_k, \Lambda_k^{-1}, W^{-1})\, dA = \mathcal{N}\big(y_t \mid M_k^\top x_t, (\lambda_t W)^{-1}\big),$$
where $\lambda_t \triangleq (1 + x_t^\top \Lambda_k^{-1} x_t)^{-1}$.
See Appendix D for the proof.
Theorem 3.
Marginalizing a multivariate normal distribution over a Wishart distribution on its precision parameter yields a multivariate location-scale Student's t-distribution [27]:
$$\int \mathcal{N}\big(y_t \mid M_k^\top x_t, (\lambda_t W)^{-1}\big)\,\mathcal{W}(W \mid \Omega_k^{-1}, \nu_k)\, dW = \mathcal{T}(y_t \mid \mu_t, \Psi_t^{-1}, \eta_t),$$
where $\mu_t \triangleq M_k^\top x_t$, $\eta_t \triangleq \nu_k - D_y + 1$, and $\Psi_t \triangleq \eta_t \Omega_k^{-1}\lambda_t$.
See Appendix E for the proof. The resulting posterior predictive distribution provides a recursive estimate of output uncertainty, which is valuable for decision-making and adaptive control.
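The predictive parameters of Theorems 2 and 3 are cheap to compute once the posterior $(M_k, \Lambda_k, \Omega_k, \nu_k)$ is available. The following sketch is our own (the function name `predictive_params` is illustrative) and returns the location, scale, and degrees of freedom of the Student's t posterior predictive.

```julia
using LinearAlgebra

# Parameters (μ_t, Ψ_t, η_t) of the posterior predictive T(y_t | μ_t, Ψ_t⁻¹, η_t)
# for a memory vector x = x_t, given the posterior (M, Λ, Ω, ν).
function predictive_params(M, Λ, Ω, ν, x)
    Dy = size(Ω, 1)
    λ  = 1 / (1 + dot(x, Λ \ x))    # λ_t = (1 + x_tᵀ Λ_k⁻¹ x_t)⁻¹
    μ  = M' * x                      # μ_t = M_kᵀ x_t
    η  = ν - Dy + 1                  # η_t = ν_k − D_y + 1
    Ψ  = η * λ * inv(Ω)              # Ψ_t = η_t Ω_k⁻¹ λ_t
    return μ, Ψ, η
end

# For η > 2, the predictive mean is μ and the covariance is inv(Ψ) * η / (η - 2).
```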

5. Model Evaluation

A key criterion for probabilistic model evaluation is the negative log-model evidence (or surprisal) $-\log p(y_k)$, which quantifies how surprising the observed data $y_k$ is under the model [33,34]. To gain deeper insights into model performance, we analyze surprisal from the perspective of variational inference on factor graphs. This approach enables us to decompose the overall model score into contributions from the individual nodes and edges of the graph.
Variational inference casts Bayesian inference as an optimization problem by approximating the true posterior p ( Θ | D k ) with a computationally tractable variational posterior q ( Θ | D k ) , chosen from a variational family Q [33,35]. At time k, the optimal variational posterior is obtained by minimizing variational free energy (VFE) [36,37]:
$$q^*(\Theta \mid \mathcal{D}_k) = \arg\min_{q \in \mathcal{Q}}\ \mathcal{F}_{\mathrm{VFE}}\big[q(\Theta \mid \mathcal{D}_k),\, p(y_k, \Theta)\big],$$
where the VFE functional $\mathcal{F}_{\mathrm{VFE}}$ is defined as
$$\mathcal{F}_{\mathrm{VFE}}\big[q(\Theta \mid \mathcal{D}_k),\, p(y_k, \Theta)\big] = \underbrace{D_{KL}\big[q(\Theta \mid \mathcal{D}_k)\,\|\,p(\Theta \mid \mathcal{D}_k)\big]}_{\text{Inference Cost}}\ \underbrace{-\ \log p(y_k \mid u_k, \mathcal{D}_{k-1})}_{\text{Model Evidence}}.$$
In exact inference, where the true posterior is computed via Bayes’ rule, the inference cost becomes zero, and the VFE equals the exact surprisal. When exact inference is intractable, VFE is expressed in a different way. By absorbing the evidence term into the Kullback–Leibler (KL)-divergence, the product of the posterior and the evidence becomes the joint distribution of the generative model, which can be decomposed into a likelihood times prior distribution. This yields the decomposition of free energy into complexity and accuracy terms [37]:
$$D_{KL}\big[q(\Theta \mid \mathcal{D}_k)\,\|\,p(\Theta \mid \mathcal{D}_k)\big] - \log p(y_k \mid u_k, \mathcal{D}_{k-1}) = \mathbb{E}_{q(\Theta \mid \mathcal{D}_k)}\Big[\log\frac{q(\Theta \mid \mathcal{D}_k)}{p(y_k, \Theta \mid \mathcal{D}_{k-1})}\Big] = \underbrace{D_{KL}\big[q(\Theta \mid \mathcal{D}_k)\,\|\,p(\Theta \mid \mathcal{D}_{k-1})\big]}_{\text{Complexity}} + \underbrace{H\big[q(\Theta \mid \mathcal{D}_k),\, p(y_k \mid \Theta, x_k)\big]}_{\text{Accuracy}},$$
where complexity measures how much the variational posterior deviates from the prior, penalizing unnecessary deviations from prior knowledge and controlling overfitting. Accuracy quantifies the model's ability to explain the observed data, expressed as the expected negative log-likelihood under the variational posterior. To refine this decomposition further, we introduce an auxiliary entropy term $H[q(\Theta \mid \mathcal{D}_k)]$ and rewrite (10) as
$$\begin{aligned}
\mathcal{F}_{\mathrm{VFE}}\big[q(\Theta \mid \mathcal{D}_k),\, p(y_k, \Theta)\big] &= D_{KL}\big[q(\Theta \mid \mathcal{D}_k)\,\|\,p(\Theta \mid \mathcal{D}_{k-1})\big] + H\big[q(\Theta \mid \mathcal{D}_k),\, p(y_k \mid \Theta, x_k)\big] - H\big[q(\Theta \mid \mathcal{D}_k)\big] + H\big[q(\Theta \mid \mathcal{D}_k)\big] \\
&= D_{KL}\big[q(\Theta \mid \mathcal{D}_k)\,\|\,p(\Theta \mid \mathcal{D}_{k-1})\big] + D_{KL}\big[q(\Theta \mid \mathcal{D}_k)\,\|\,p(y_k \mid \Theta, x_k)\big] + H\big[q(\Theta \mid \mathcal{D}_k)\big].
\end{aligned}$$
For models formulated as Forney-style factor graphs, inference is performed by optimizing the Bethe Free Energy (BFE), a generalization of VFE, which accounts for the graph’s structure [13,21,38]:
$$\mathcal{F}_{\mathrm{BFE}}\big[q(\Theta \mid \mathcal{D}_k),\, p(y_k, \Theta)\big] \triangleq \sum_{a \in \mathcal{V}} D_{KL}\big[q_a \,\|\, p_a\big] + \sum_{i \in \mathcal{E}} H\big[q_i\big],$$
where $\mathcal{V}$ is the set of factor nodes and $\mathcal{E}$ is the set of edges. In this formulation, each $q_a$ is the local variational belief at node $a$, $p_a$ is the corresponding exact local distribution, and each edge $i$ contributes an entropy term $H[q_i]$. In our recursive MARX model—comprising a MARX likelihood node, a prior node, and an edge for the joint parameters $\Theta$—the BFE decomposition in (12) coincides with the VFE decomposition in (11). Thus, factor graphs enable a fine-grained attribution of surprisal to specific components of the system.

5.1. MARX Model Evidence and Surprisal

To evaluate the model properly, we must compute the model evidence (marginal likelihood), which is the probability of an observed sample marginalized over the parameters, weighted by their prior probabilities. Equation (5) already detailed the evidence term, but it still involved an integral. This integral is identical to the integral for the posterior predictive distribution (8), except that $y_k$ and $u_k$ are observed and the prior parameters are those from time step $k-1$. Concretely,
$$p(y_k \mid u_k, \mathcal{D}_{k-1}) = \int p(y_k \mid \Theta, x_k)\, p(\Theta \mid \mathcal{D}_{k-1})\, d\Theta = \mathcal{T}(y_k \mid m_k, \Psi_k^{-1}, \eta_k) = \sqrt{\frac{|\Psi_k|}{(\eta_k\pi)^{D_y}}}\,\frac{\Gamma_{D_y}\big((\eta_k + D_y)/2\big)}{\Gamma_{D_y}\big((\eta_k + D_y - 1)/2\big)}\Big(1 + \frac{1}{\eta_k}(y_k - m_k)^\top\Psi_k(y_k - m_k)\Big)^{-(\eta_k + D_y)/2},$$
where $m_k = M_{k-1}^\top x_k$, $\eta_k = \nu_{k-1} - D_y + 1$, $\Psi_k = \eta_k\Omega_{k-1}^{-1}\lambda_k$, and $\lambda_k = (1 + x_k^\top\Lambda_{k-1}^{-1}x_k)^{-1}$. Here, $\mathcal{T}(\cdot \mid \mu, \Sigma^{-1}, \nu)$ denotes the multivariate Student's t-distribution with location $\mu$, scale $\Sigma^{-1}$, and degrees of freedom $\nu$. Unlike the posterior predictive distribution, the model evidence is a scalar: higher values indicate that the model better explains the observed data. Hence, the surprisal for our model is
$$-\log p(y_k \mid u_k, \mathcal{D}_{k-1}) = -\tfrac{1}{2}\log|\Psi_k| + \tfrac{D_y}{2}\log(\eta_k\pi) - \log\Gamma_{D_y}\Big(\tfrac{\eta_k + D_y}{2}\Big) + \log\Gamma_{D_y}\Big(\tfrac{\eta_k + D_y - 1}{2}\Big) + \tfrac{\eta_k + D_y}{2}\log\Big(1 + \tfrac{1}{\eta_k}(y_k - m_k)^\top\Psi_k(y_k - m_k)\Big).$$
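To make the evaluation concrete, the surprisal above can be computed directly from the prior-predictive parameters $(m_k, \Psi_k, \eta_k)$. The sketch below is illustrative; `logmvgamma` is a small helper we define here for the multivariate log-gamma function $\log\Gamma_p(\cdot)$.

```julia
using LinearAlgebra, SpecialFunctions

# log Γ_p(a) = (p(p-1)/4) log π + Σ_{j=1}^p log Γ(a + (1-j)/2)
logmvgamma(p, a) = 0.25 * p * (p - 1) * log(π) + sum(loggamma(a + (1 - j) / 2) for j in 1:p)

# Surprisal −log p(y_k | u_k, D_{k-1}) of a new observation y, given the
# prior-predictive parameters m = m_k, Ψ = Ψ_k, η = η_k.
function surprisal(y, m, Ψ, η)
    Dy = length(y)
    quad = dot(y - m, Ψ * (y - m))
    return -0.5 * logdet(Ψ) + 0.5 * Dy * log(η * π) -
           logmvgamma(Dy, (η + Dy) / 2) + logmvgamma(Dy, (η + Dy - 1) / 2) +
           0.5 * (η + Dy) * log(1 + quad / η)
end
```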

5.2. MARX Variational Free Energy

Lemma 4.
Let q and p be two matrix normal Wishart distributions over the same random variables Θ, representing the posterior and prior, respectively:
$$q(\Theta \mid \mathcal{D}_k) = \mathcal{MNW}(\Theta \mid M_k, \Lambda_k^{-1}, \Omega_k^{-1}, \nu_k), \qquad p(\Theta \mid \mathcal{D}_{k-1}) = \mathcal{MNW}(\Theta \mid M_{k-1}, \Lambda_{k-1}^{-1}, \Omega_{k-1}^{-1}, \nu_{k-1}).$$
The differential cross-entropy $H[q(\Theta \mid \mathcal{D}_k), p(\Theta \mid \mathcal{D}_{k-1})]$ of the posterior relative to the prior is
$$\begin{aligned}
H\big[q(\Theta \mid \mathcal{D}_k),\, p(\Theta \mid \mathcal{D}_{k-1})\big] ={}& -\tfrac{1}{2}D_y\log|\Lambda_{k-1}| + \tfrac{1}{2}(\nu_{k-1} + D_x - D_y - 1)\log|\Omega_k| - \tfrac{1}{2}\nu_{k-1}\log|\Omega_{k-1}| \\
&+ \tfrac{1}{2}(D_y + 1)D_y\log 2 + \tfrac{1}{2}D_xD_y\log\pi + \log\Gamma_{D_y}\big(\tfrac{\nu_{k-1}}{2}\big) - \tfrac{1}{2}(\nu_{k-1} + D_x - D_y - 1)\psi_{D_y}\big(\tfrac{\nu_k}{2}\big) \\
&+ \tfrac{1}{2}\nu_k\,\mathrm{tr}\big[\Omega_k^{-1}(M_k - M_{k-1})^\top\Lambda_{k-1}(M_k - M_{k-1})\big] + \tfrac{1}{2}\Big(D_y\,\mathrm{tr}\big(\Lambda_k^{-1}\Lambda_{k-1}\big) + \nu_k\,\mathrm{tr}\big(\Omega_k^{-1}\Omega_{k-1}\big)\Big).
\end{aligned}$$
See Appendix F for the proof.
Lemma 5.
Consider the matrix normal Wishart posterior:
$$q(\Theta \mid \mathcal{D}_k) = \mathcal{MNW}(A, W \mid M_k, \Lambda_k^{-1}, \Omega_k^{-1}, \nu_k).$$
Its (differential) entropy is
$$\begin{aligned}
H\big[q(\Theta \mid \mathcal{D}_k)\big] ={}& -\tfrac{1}{2}D_y\log|\Lambda_k| + \tfrac{1}{2}(D_x - D_y - 1)\log|\Omega_k| + \tfrac{1}{2}(D_y + 1)D_y\log 2 + \tfrac{1}{2}D_xD_y\log\pi \\
&+ \tfrac{1}{2}(D_x + \nu_k)D_y + \log\Gamma_{D_y}\big(\tfrac{\nu_k}{2}\big) - \tfrac{1}{2}(\nu_k + D_x - D_y - 1)\psi_{D_y}\big(\tfrac{\nu_k}{2}\big).
\end{aligned}$$
See Appendix G for the proof.
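For reference, the entropy in Lemma 5 translates directly into code. The sketch below is ours and uses small helpers, `logmvgamma` and `mvdigamma`, for the multivariate log-gamma and digamma functions.

```julia
using LinearAlgebra, SpecialFunctions

logmvgamma(p, a) = 0.25 * p * (p - 1) * log(π) + sum(loggamma(a + (1 - j) / 2) for j in 1:p)
mvdigamma(p, a)  = sum(digamma(a + (1 - j) / 2) for j in 1:p)

# Differential entropy of MNW(A, W | M, Λ⁻¹, Ω⁻¹, ν) as in Lemma 5.
function mnw_entropy(Λ, Ω, ν)
    Dx, Dy = size(Λ, 1), size(Ω, 1)
    return -0.5 * Dy * logdet(Λ) + 0.5 * (Dx - Dy - 1) * logdet(Ω) +
           0.5 * (Dy + 1) * Dy * log(2) + 0.5 * Dx * Dy * log(π) +
           0.5 * (Dx + ν) * Dy + logmvgamma(Dy, ν / 2) -
           0.5 * (ν + Dx - Dy - 1) * mvdigamma(Dy, ν / 2)
end
```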
Lemma 6.
Let q and p be two matrix normal Wishart distributions over the same random variables Θ, representing the posterior and prior, respectively:
$$q(\Theta \mid \mathcal{D}_k) = \mathcal{MNW}(\Theta \mid M_k, \Lambda_k^{-1}, \Omega_k^{-1}, \nu_k), \qquad p(\Theta \mid \mathcal{D}_{k-1}) = \mathcal{MNW}(\Theta \mid M_{k-1}, \Lambda_{k-1}^{-1}, \Omega_{k-1}^{-1}, \nu_{k-1}).$$
The KL-divergence $D_{KL}[q(\Theta \mid \mathcal{D}_k)\,\|\,p(\Theta \mid \mathcal{D}_{k-1})]$ of the posterior from the prior (complexity) is
$$\begin{aligned}
D_{KL}\big[q(\Theta \mid \mathcal{D}_k)\,\|\,p(\Theta \mid \mathcal{D}_{k-1})\big] ={}& \tfrac{1}{2}D_y\log\frac{|\Lambda_k|}{|\Lambda_{k-1}|} + \tfrac{1}{2}\nu_{k-1}\log\frac{|\Omega_k|}{|\Omega_{k-1}|} - \tfrac{1}{2}(D_x + \nu_k)D_y - \log\Gamma_{D_y}\big(\tfrac{\nu_k}{2}\big) + \log\Gamma_{D_y}\big(\tfrac{\nu_{k-1}}{2}\big) + \tfrac{1}{2}(\nu_k - \nu_{k-1})\psi_{D_y}\big(\tfrac{\nu_k}{2}\big) \\
&+ \tfrac{1}{2}\nu_k\,\mathrm{tr}\big[\Omega_k^{-1}(M_k - M_{k-1})^\top\Lambda_{k-1}(M_k - M_{k-1})\big] + \tfrac{1}{2}\Big(D_y\,\mathrm{tr}\big(\Lambda_k^{-1}\Lambda_{k-1}\big) + \nu_k\,\mathrm{tr}\big(\Omega_k^{-1}\Omega_{k-1}\big)\Big).
\end{aligned}$$
See Appendix H for the proof.
Lemma 7.
Consider a matrix normal Wishart distribution q and a multivariate normal distribution p, representing the posterior and MARX likelihood:
$$q(\Theta \mid \mathcal{D}_k) = \mathcal{MNW}(A, W \mid M_k, \Lambda_k^{-1}, \Omega_k^{-1}, \nu_k), \qquad p(y_k \mid \Theta, x_k) = \mathcal{N}(y_k \mid A^\top x_k, W^{-1}).$$
The differential cross-entropy $H[q(\Theta \mid \mathcal{D}_k), p(y_k \mid \Theta, x_k)]$ of the posterior relative to the likelihood (accuracy) is
$$H\big[q(\Theta \mid \mathcal{D}_k),\, p(y_k \mid \Theta, x_k)\big] = -\tfrac{1}{2}\psi_{D_y}\big(\tfrac{\nu_k}{2}\big) + \tfrac{1}{2}\log|\Omega_k| + \tfrac{1}{2}D_y\log\pi + \tfrac{1}{2}\nu_k(y_k - M_k^\top x_k)^\top\Omega_k^{-1}(y_k - M_k^\top x_k) + \tfrac{1}{2}x_k^\top\Lambda_k^{-1}x_k\, D_y.$$
See Appendix I for the proof.
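Putting Lemmas 6 and 7 together gives the complexity and accuracy terms tracked in the experiments (Figure 9). The sketch below is our own rendering of those two expressions, assuming the posterior $(M, \Lambda, \Omega, \nu)$, the prior $(M_0, \Lambda_0, \Omega_0, \nu_0)$, and the same multivariate log-gamma and digamma helpers as in the previous sketch (redefined here for self-containedness).

```julia
using LinearAlgebra, SpecialFunctions

logmvgamma(p, a) = 0.25 * p * (p - 1) * log(π) + sum(loggamma(a + (1 - j) / 2) for j in 1:p)
mvdigamma(p, a)  = sum(digamma(a + (1 - j) / 2) for j in 1:p)

# Complexity: KL-divergence of the posterior from the prior (Lemma 6).
function complexity(M, Λ, Ω, ν, M0, Λ0, Ω0, ν0)
    Dx, Dy = size(M)
    ΔM = M - M0
    return 0.5 * Dy * (logdet(Λ) - logdet(Λ0)) +
           0.5 * ν0 * (logdet(Ω) - logdet(Ω0)) -
           0.5 * (Dx + ν) * Dy -
           logmvgamma(Dy, ν / 2) + logmvgamma(Dy, ν0 / 2) +
           0.5 * (ν - ν0) * mvdigamma(Dy, ν / 2) +
           0.5 * ν * tr(Ω \ (ΔM' * Λ0 * ΔM)) +
           0.5 * (Dy * tr(Λ \ Λ0) + ν * tr(Ω \ Ω0))
end

# Accuracy: cross-entropy of the posterior relative to the likelihood (Lemma 7).
function accuracy(M, Λ, Ω, ν, y, x)
    Dy = length(y)
    r = y - M' * x
    return -0.5 * mvdigamma(Dy, ν / 2) + 0.5 * logdet(Ω) + 0.5 * Dy * log(π) +
           0.5 * ν * dot(r, Ω \ r) + 0.5 * Dy * dot(x, Λ \ x)
end
```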

6. Experiments

We conducted three experiments: one verification experiment and two validation experiments (Code: https://github.com/biaslab/MDPI2025-MARX, accessed on 8 March 2025). In the verification experiment (Section 6.2), we tested whether the MARX estimator could identify a dynamical system with known parameters. In the validation experiments (Section 6.3), we assessed the estimator's performance on two complex dynamical systems with unknown parameters: a linear double mass-spring-damper system and a nonlinear double pendulum. In all the experiments, we compared the performance of the MARX estimator to a baseline approach.

6.1. Baseline Estimator

We compare against a recursive least squares (RLS) estimator [3]. Let $\hat{A}_k$ be a point estimate of the coefficient matrix based on the previous $k$ data points, and let $P_0 = I_{D_x}$ be an initial inverse sample covariance matrix. These matrices are updated at each time step according to
$$\begin{aligned}
P_k &= P_{k-1} - P_{k-1}x_k\big(1 + x_k^\top P_{k-1}x_k\big)^{-1}x_k^\top P_{k-1}, \\
\hat{A}_k &= \hat{A}_{k-1} + P_{k-1}x_k\big(1 + x_k^\top P_{k-1}x_k\big)^{-1}\big(y_k - \hat{A}_{k-1}^\top x_k\big)^\top.
\end{aligned}$$
Note that this formulation corresponds to a forgetting factor of $1.0$, meaning that older data points are not down-weighted. The system outputs are predicted with $\hat{y}_t = \hat{A}_k^\top x_t$.
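A minimal sketch of this baseline, under the same notation ($\hat{A}$ is $D_x \times D_y$ and $P$ is $D_x \times D_x$), is given below; the function name is illustrative.

```julia
using LinearAlgebra

# One recursive least squares update with forgetting factor 1.0.
function rls_update(Â, P, x, y)
    g = P * x / (1 + dot(x, P * x))    # gain vector
    P_new = P - g * (x' * P)
    Â_new = Â + g * (y - Â' * x)'
    return Â_new, P_new
end

# One-step prediction: ŷ_t = Â' * x_t
```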

6.2. Verification

We perform a verification experiment on a MARX system with state $z_k = x_k$ (1), memory sizes $N_y = 2$, $N_u = 3$, and dimensions $D_y = D_u = 2$. The system has true parameters $\tilde{\Theta} = (\tilde{A}, \tilde{W})$. It evolves according to $g(f(x_k)) = \tilde{A}^\top x_k$, where $\tilde{A}$ is the known coefficient matrix (see Figure 2). For each output dimension $i$, the lag-dependent coefficients were generated using a Butterworth low-pass filter (cutoff frequency 20 Hz) applied to that same dimension, while cross-dimensional coefficients were sampled from $\mathcal{N}(0, 0.1^2)$ [39]. We chose the Butterworth filter because its maximally flat response in the passband ensures that signals below the cutoff frequency are transmitted with little distortion while attenuating higher-frequency components [40]. This makes it suitable for generating stable linear dynamics and mimicking the low-pass behavior often observed in physical dynamical systems—such as mechanical or electrical processes [41,42]—and is common in applications like audio and biomedical signal processing [41,43]. The disturbance follows $e_k \sim \mathcal{N}(0, \tilde{W}^{-1})$ with precision matrix $\tilde{W} = \begin{bmatrix} 300 & 100 \\ 100 & 200 \end{bmatrix}$.
We evaluated each estimator for training sizes $T_{\text{train}} \in \{2^l \mid l \in \{2, 3, 4, 5, 6\}\}$, using Monte Carlo experiments with $N_{MC} = 100$ runs. To learn the parameters, each estimator uses $T_{\text{train}}$ state transitions, starting from state $z_0 = \mathbf{0}_{D_z}$. After training, each estimator is tested for $T_{\text{test}} = 100$ time steps, again starting from $z_0$ but with different control signals. For the MARX estimator, we compare two priors (see Table 1): uninformative (MARX-UI) and weakly informative (MARX-WI). The uninformative prior uses small precision values for $\Lambda_0$ and $\Omega_0$, corresponding to large prior variances that reflect minimal prior belief about the parameters. The weakly informative prior assigns higher precision (lower variance), introducing a mild preference for more stable parameter values while still letting the data dominate. In both cases, the degrees of freedom $\nu_0$ are kept minimal at $D_y + 3$, just above the threshold for the Wishart distribution to be well defined, further reinforcing the limited informativeness of the prior. The weakly informative prior also encodes approximate prior knowledge about the observation noise. Specifically, the Wishart component $p(W)$ has a mode at $\nu_0\Omega_0^{-1} = \begin{bmatrix} 500 & 0 \\ 0 & 500 \end{bmatrix}$, which is of similar magnitude to the true noise precision $\tilde{W}$. In contrast, the uninformative prior sets $\Omega_0$ to much larger values, placing its mode far from the true noise characteristics. Thus, the weakly informative prior softly incorporates domain knowledge about expected noise levels, improving convergence and stability in the early stages of recursive estimation. For each training size, we calculate the root mean squared error (RMSE),
$$\mathrm{RMSE} = \sqrt{\frac{1}{T_{\text{test}}}\sum_{k=1}^{T_{\text{test}}}\big(\hat{y}_k - y_k\big)^2},$$
between the predicted output $\hat{y}_k$, i.e., the mean of the posterior predictive $p(y_k \mid u_k, \mathcal{D}_{k-1})$, and the true output $y_k$ for all $k \le T_{\text{test}}$ evaluation steps.
Figure 3 shows the simulation errors for MARX-UI, MARX-WI, and RLS as a function of the training size. For small sample sizes, MARX-WI consistently outperforms RLS, while MARX-UI performs slightly worse. All three estimators converge to the same performance level as the training size increases.
Figure 4 focuses on a single Monte Carlo experiment with $T_{\text{train}} = 2^6$. It plots $\log(\lVert \tilde{A} - A \rVert_F)$, the log of the Frobenius norm between the true coefficient matrix $\tilde{A}$ and each estimate $A$. MARX-WI consistently yields better estimates of $\tilde{A}$ than MARX-UI and RLS. Although MARX-UI struggles during the first 25 time steps, it eventually produces a more accurate estimate of $\tilde{A}$ compared to RLS.
Unlike RLS, the MARX estimator also estimates the noise precision matrix $W$. Figure 5 shows $\log(\lVert \tilde{W} - W \rVert_F)$ for both MARX-WI and MARX-UI. MARX-WI consistently achieves more accurate estimates of $\tilde{W}$ than MARX-UI.
Figure 6 plots the negative log posterior probability of the true parameters $\tilde{\Theta}$ (lower is better), showing that the posterior concentrates sharply on the true values. As a probabilistic estimator, MARX also quantifies uncertainty in its estimates of $\tilde{A}$ and $\tilde{W}$ via the posterior precision (or scale) parameters. Figure 7 illustrates the evolution of MARX-WI's estimates of $W$ for a single run with $T_{\text{train}} = 2^6$. The ribbon represents one standard deviation around the mean. Initially, MARX-WI exhibits high uncertainty (large variance), which generally decreases over time. Because $\tilde{W}$ and $W$ are symmetric, only the upper-triangular elements are shown.
Figure 8 (top) shows a heatmap of the difference $A - \tilde{A}$. To save space, we plot only a subset of the elements of $A$, marked by "X". This subset includes the elements with the largest estimation errors and two randomly selected elements. Figure 8 (bottom) shows the evolution of these selected elements for the same Monte Carlo experiment run, with ribbons indicating one standard deviation around each mean estimate.
Furthermore, we apply the model score decomposition from Section 5 to evaluate our recursive MARX model. By tracking how surprisal and its constituent terms evolve, we obtain fine-grained insights into the model’s learning dynamics and uncertainty reduction. We can recall from (10) that surprisal decomposes into an accuracy term—given by the cross-entropy of the variational posterior relative to the likelihood, reflecting data fit—and a complexity term—given by the KL-divergence of the variational posterior from the prior, quantifying deviation from prior beliefs. Figure 9 illustrates this decomposition. In the early stages of model training, the complexity term (green) dominates overall surprisal (dashed blue), indicating substantial updates from the prior as the model learns the system parameters. As training progresses and the posterior stabilizes, the complexity term diminishes, and the accuracy term (red) becomes the main source of uncertainty. Spikes in overall surprisal during later stages align with spikes in the accuracy term, which we interpret as indicators of measurement outliers that temporarily degrade model fit.
Figure 10 complements this analysis by plotting the entropy of the variational posterior $q(\Theta \mid \mathcal{D}_k)$ over time. This highlights how quickly the inference procedure narrows the parameter space, providing insight into convergence speed and residual uncertainty in the model parameters.
We also demonstrate model evaluation using model evidence. Figure 11 shows the evolution of surprisal (lower is better) over time for MARX-WI and MARX-UI. This plot highlights that the prior choice matters only initially; with sufficient data, MARX-WI and MARX-UI converge to the same performance.

6.3. Validation

To evaluate the proposed method, we perform validation experiments on two distinct mechanical systems: a linear double mass-spring-damper system and a nonlinear double pendulum system. These testbeds span a range of dynamical complexity and are standard benchmarks for modeling and control tasks. Despite their differences, both systems share a common formulation as second-order dynamical systems expressed in first-order ODE form:
$$I_k\ddot{z}_k = F(z_k, \dot{z}_k, u_k),$$
where $z_k$ denotes the generalized coordinates, $\dot{z}_k$ and $\ddot{z}_k$ are the first and second time derivatives of $z_k$, $u_k$ are the control inputs, $I_k$ is a (state-dependent) generalized inertia matrix, and $F$ encodes the system-specific generalized forces (including passive dynamics and external control inputs). Time evolution is performed using a forward Euler integrator with a system-specific time step $\Delta t$:
$$z_{k+1} = z_k + \Delta t\,\dot{z}_k \qquad \text{and} \qquad \dot{z}_{k+1} = \dot{z}_k + \Delta t\,\ddot{z}_k.$$
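A minimal sketch of this integrator, assuming a user-supplied force function `F(z, zd, u)` and inertia function `Imat(z)` (names are ours), is:

```julia
# Forward Euler step for I(z) z̈ = F(z, ż, u): returns (z_{k+1}, ż_{k+1}).
function euler_step(z, zd, u, F, Imat, Δt)
    zdd = Imat(z) \ F(z, zd, u)
    return z + Δt * zd, zd + Δt * zdd
end
```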
For both validation systems, we choose a disturbance $e_k \sim \mathcal{N}(0, \tilde{W}^{-1})$ with a precision matrix $\tilde{W} = \begin{bmatrix} 2000 & 1000 \\ 1000 & 2000 \end{bmatrix}$. The validation experiments follow the same procedure as the verification experiment: we perform Monte Carlo experiments with $N_{MC} = 100$ runs and $\Delta t = 0.05$, in which each estimator has $T_{\text{train}} \in \{2^l \mid l \in \{2, 3, 4, 5, 6\}\}$ state transitions to learn the parameters (starting from state $z_0 = \mathbf{0}_{D_z}$), and we test each estimator with $T_{\text{test}} = 100$ transitions. However, we increase the memory sizes of the MARX model to $N_y = N_u = 5$.
In the following, we describe each validation system individually, and then present the combined validation results.

6.3.1. Linear System: Double Mass-Spring-Damper

The linear system consists of two masses: $m_1 = 1.0$ kg, connected to a fixed base by a spring and damper with stiffness $k_1 = 0.99$ and damping $c_1 = 0.4$, and $m_2 = 2.0$ kg, connected to $m_1$ via a second spring and damper with $k_2 = 0.8$ and $c_2 = 0.4$. The generalized coordinates $z_k \in \mathbb{R}^2$ represent the displacements of each mass from the equilibrium, and the generalized inertia matrix is constant: $I_k = \mathrm{diag}(m_1, m_2)$, where $\mathrm{diag}(\cdot)$ denotes a diagonal matrix with the given entries [26]. The generalized force function $F$ combines the internal spring and damping forces with external inputs:
$$F(z_k, \dot{z}_k, u_k) = Kz_k + C\dot{z}_k + u_k,$$
with the stiffness and damping matrices
$$K = \begin{bmatrix} -(k_1 + k_2) & k_2 \\ k_2 & -k_2 \end{bmatrix}, \qquad C = \begin{bmatrix} -(c_1 + c_2) & c_2 \\ c_2 & -c_2 \end{bmatrix}.$$
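Under the parameter values and sign convention written above, the linear system's force function can be sketched as follows (variable and function names are illustrative):

```julia
# Double mass-spring-damper: constant inertia and linear internal forces.
m1, m2 = 1.0, 2.0
k1, k2 = 0.99, 0.8
c1, c2 = 0.4, 0.4

K = [-(k1 + k2) k2; k2 -k2]
C = [-(c1 + c2) c2; c2 -c2]
Imat(z) = [m1 0.0; 0.0 m2]
F(z, zd, u) = K * z + C * zd + u
```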

6.3.2. Nonlinear System: Double Pendulum

The nonlinear system is a planar double pendulum (also called an acrobot) with two links of lengths $l_1 = 1.0$ m and $l_2 = 1.0$ m and masses $m_1 = 1.0$ kg and $m_2 = 1.0$ kg, respectively. The generalized coordinates $z_k \in \mathbb{R}^2$ represent the joint angles, and the generalized inertia matrix is captured implicitly through a structured nonlinear force formulation. The dynamics are governed by gravity and nonlinear velocity coupling, yielding
$$F(z_k, \dot{z}_k, u_k) = \mathrm{diag}\Big(g\big(\tfrac{1}{2}m_1 + m_2\big)l_1,\; \tfrac{1}{2}g\,m_2l_2\Big)\sin(z_k) + J_xV\dot{z}_k^{2} + u_k,$$
where $g$ is the gravitational acceleration, $J_x \triangleq \tfrac{1}{2}m_2l_1l_2$, and $V$ is the nonlinear velocity-coupling matrix:
$$V = \begin{bmatrix} 0 & -\sin(z_{k,1} - z_{k,2}) \\ \sin(z_{k,1} - z_{k,2}) & 0 \end{bmatrix}.$$

6.3.3. Results

As in the verification experiment, Figure 12 shows the simulation errors for MARX-UI, MARX-WI, and RLS for both the double mass-spring-damper system (Figure 12a) and the double pendulum system (Figure 12b). Convergence to stable performance is slower in both systems compared to the verification case. Nevertheless, both MARX variants outperform RLS and converge to similar levels of predictive performance. This confirms that the MARX model generalizes to more complex dynamical systems. As expected, the overall RMSE is higher for the nonlinear double pendulum system. A peak in prediction error is present for MARX-UI, which is more pronounced in the double mass-spring-damper system.
Figure 13 shows $\log(\lVert \tilde{W} - W \rVert_F)$ for both MARX-WI and MARX-UI for the validation systems. Initially, MARX-WI achieves better accuracy and lower variability than MARX-UI. Unlike in the verification setting, MARX-UI improves significantly over time and ultimately approaches similar estimation quality.
Figure 14 illustrates estimates of $\tilde{W}$ by MARX-WI for a single Monte Carlo experiment ($T_{\text{train}} = 2^6$) for both systems. The model struggles with learning and initially shows high uncertainty, followed by a sharp reduction as learning progresses. This reflects the challenge of inferring observation noise structure in nonlinear systems from limited data.
Figure 15 displays the evolution of MARX-WI’s surprisal and its decomposition into accuracy and complexity. The early learning phases show that surprisal reduction is dominated by decreasing model complexity. This trend is more difficult to sustain in the nonlinear system, where complexity remains elevated for longer. Later in training, fluctuations in surprisal are primarily driven by changes in accuracy.
Finally, Figure 16 shows the entropy of the variational posterior $q(\Theta \mid \mathcal{D}_k)$ for each validation system. In both systems, MARX-WI rapidly reduces entropy, indicating fast convergence to informative parameter regions despite the different complexities of the systems.

7. Discussion

The modular nature of the factor graph methodology provides substantial practical advantages. As demonstrated by Loeliger et al. [15], factor graphs facilitate the visual construction of complex algorithms by incorporating, eliminating, or merging established computational units. For example, the MARX model’s factor graph (Figure 1) could be extended to support time-varying parameters by introducing state transition factor nodes between the equality nodes over the parameters [24]. In multi-agent robotics, where sensors and actuators are spread across various platforms, each agent can update its local beliefs through message passing and share only the most informative summaries [44]. This targeted communication reduces bandwidth demands while enabling swift convergence to an accurate global model. Recent research highlights the importance of transmitting informative variational beliefs in multi-agent environments [22,45], facilitating scalable cooperative learning among heterogeneous agents. The resulting computational decentralization opens promising opportunities for federated system identification and coordination in multi-robot systems, especially when subject to privacy or bandwidth constraints [46,47,48].

7.1. Computational Efficiency

The dominant computational cost in our inference algorithm arises from the matrix inversion of $\Lambda$ (4), which scales as $\mathcal{O}(D_x^3)$ in the worst case. We benchmarked the update rule computations on a Julia-based implementation running on an Apple MacBook M1, averaging over 1,000,000 runs. For a state dimension of $D_x = 10$, updating the parameters for a single time step took approximately 2 nanoseconds (excluding garbage collection). Further computational savings are possible by adopting an information filter parameterization, where $\Xi_k$ (A3) is stored instead of $M_k$ (3) [49]. This approach defers the matrix inversion until $M_k$ is explicitly needed, offering an efficiency boost, particularly in high-dimensional or resource-constrained scenarios.
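A sketch of that parameterization (our own, with illustrative names): store the natural statistics $(\Lambda_k, \Xi_k)$, update them additively, and recover $M_k$ only when it is needed.

```julia
using LinearAlgebra

# Information-filter style update: Λ_k = Λ_{k-1} + x xᵀ and Ξ_k = Ξ_{k-1} + x yᵀ,
# since Λ_{k-1} M_{k-1} = Ξ_{k-1}. The solve for M_k is deferred.
info_update(Λ, Ξ, x, y) = (Λ + x * x', Ξ + x * y')

posterior_mean(Λ, Ξ) = Λ \ Ξ    # M_k, computed only on demand
```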

7.2. Limitations

Despite its efficiency and modularity, our method has several limitations. First, it does not support fully Bayesian k-step-ahead predictions. Computing joint posterior predictives over a longer horizon is intractable under the current formulation, as it requires marginalization over a (deeply) nested set of autoregressive coefficients. Second, the model is built on a linear multivariate autoregressive likelihood, which—while computationally efficient—limits its expressiveness. In systems characterized by strong nonlinearities, this assumption can lead to underfitting and reduced predictive performance. Lastly, although we explored both uninformative and weakly informative priors, the model remains sensitive to the choice of prior, particularly in data-scarce regimes or during the early stages of recursive estimation. In these scenarios, poor prior choices can significantly degrade both convergence speed and final performance.

7.3. Future Work

Future work may explore extending the MARX framework to accommodate time-varying parameters by inserting state-transition factors between the equality nodes—analogous to prior work on univariate autoregressive models [24]. Another extension is to utilize the posterior distributions over the parameters to formulate a mutual information-based cost function for input signal design [10].

8. Conclusions

We presented a recursive Bayesian estimation procedure for multivariate autoregressive models with exogenous inputs. The method produces matrix-variate posterior distributions over both the model coefficients and the noise precision, allowing uncertainty to be explicitly propagated into future output predictions. We also demonstrated how these uncertainty estimates enable the analysis of individual factor nodes and edges within the model, making it possible to assess their contributions to the overall model score and to identify potential outliers. The ability to track sources of uncertainty online and evaluate their impact on output predictions is especially valuable for applications such as Bayesian optimal experimental design or information-theoretic adaptive control.

Author Contributions

T.N.N. contributed to the derivations, simulations, experimental results, and writing. W.M.K. contributed to the conception, direction, derivations, software, and writing. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Eindhoven Artificial Intelligence Systems Institute.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

All data in this work is synthetic. For details on how it was simulated, see the accompanying repository at https://github.com/biaslab/MDPI2025-MARX (accessed on 8 March 2025).

Acknowledgments

The authors gratefully acknowledge the support from Albert Podusenko.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
MARX: multivariate autoregressive models with exogenous inputs
MARX-UI: MARX model with uninformative prior
MARX-WI: MARX model with weakly informative prior
VFE: variational free energy
BFE: Bethe free energy
RLS: recursive least squares
RMSE: root mean squared error
ODE: ordinary differential equation
KL: Kullback–Leibler

Appendix A. Parameter Estimation

Proof. 
The functional form of the likelihood is
$$p(y_k \mid \Theta, x_k) \propto \sqrt{|W|}\exp\!\Big(-\tfrac{1}{2}\mathrm{tr}\big[WL_k\big]\Big),$$
where $L_k \triangleq (y_k - A^\top x_k)(y_k - A^\top x_k)^\top$. The prior is
$$p(\Theta \mid \mathcal{D}_{k-1}) \propto \sqrt{|W|^{\nu_{k-1} + \bar{D}}}\exp\!\Big(-\tfrac{1}{2}\mathrm{tr}\big[W(H_{k-1} + \Omega_{k-1})\big]\Big),$$
where $H_{k-1} \triangleq (A - M_{k-1})^\top\Lambda_{k-1}(A - M_{k-1})$ and $\bar{D} \triangleq D_x - D_y - 1$. The posterior is proportional to the likelihood times the prior:
$$p(\Theta \mid \mathcal{D}_k) \propto p(y_k \mid \Theta, x_k)\, p(\Theta \mid \mathcal{D}_{k-1}) \propto \sqrt{|W|^{\nu_{k-1} + 1 + \bar{D}}}\exp\!\Big(-\tfrac{1}{2}\mathrm{tr}\big[W(L_k + H_{k-1} + \Omega_{k-1})\big]\Big).$$
We expand the first terms in the exponent and group them as follows:
$$\begin{aligned}
L_k + H_{k-1} &= y_ky_k^\top - y_kx_k^\top A - A^\top x_ky_k^\top + A^\top x_kx_k^\top A + A^\top\Lambda_{k-1}A - A^\top\Lambda_{k-1}M_{k-1} - M_{k-1}^\top\Lambda_{k-1}A + M_{k-1}^\top\Lambda_{k-1}M_{k-1} \\
&= A^\top(\Lambda_{k-1} + x_kx_k^\top)A - A^\top(x_ky_k^\top + \Lambda_{k-1}M_{k-1}) - (M_{k-1}^\top\Lambda_{k-1} + y_kx_k^\top)A + y_ky_k^\top + M_{k-1}^\top\Lambda_{k-1}M_{k-1}.
\end{aligned}$$
Let $\Lambda_k \triangleq \Lambda_{k-1} + x_kx_k^\top$, $\Xi_k \triangleq x_ky_k^\top + \Lambda_{k-1}M_{k-1}$, and $M_k \triangleq \Lambda_k^{-1}\Xi_k$. Adding and subtracting $\Xi_k^\top\Lambda_k^{-1}\Xi_k$ to (A2) yields
$$\begin{aligned}
L_k + H_{k-1} &= A^\top\Lambda_kA - A^\top\Xi_k - \Xi_k^\top A + \Xi_k^\top\Lambda_k^{-1}\Xi_k - \Xi_k^\top\Lambda_k^{-1}\Xi_k + y_ky_k^\top + M_{k-1}^\top\Lambda_{k-1}M_{k-1} \\
&= (A - \Lambda_k^{-1}\Xi_k)^\top\Lambda_k(A - \Lambda_k^{-1}\Xi_k) - M_k^\top\Lambda_kM_k + y_ky_k^\top + M_{k-1}^\top\Lambda_{k-1}M_{k-1}.
\end{aligned}$$
Plugging the above into (A1), we recognize the functional form of the matrix normal Wishart distribution:
$$\sqrt{|W|^{\nu_k + \bar{D}}}\exp\!\Big(-\tfrac{1}{2}\mathrm{tr}\big[W\big((A - M_k)^\top\Lambda_k(A - M_k) + \Omega_k\big)\big]\Big) \propto \mathcal{MNW}(A, W \mid M_k, \Lambda_k^{-1}, \Omega_k^{-1}, \nu_k),$$
whose parameters are
$$\nu_k = \nu_{k-1} + 1, \qquad \Lambda_k = \Lambda_{k-1} + x_kx_k^\top, \qquad M_k = (\Lambda_{k-1} + x_kx_k^\top)^{-1}(\Lambda_{k-1}M_{k-1} + x_ky_k^\top), \qquad \text{and} \qquad \Omega_k = \Omega_{k-1} + y_ky_k^\top + M_{k-1}^\top\Lambda_{k-1}M_{k-1} - M_k^\top\Lambda_kM_k.$$
This concludes the proof. □

Appendix B. Backwards Message from Likelihood

Proof. 
The MARX likelihood function is
$$p(y_k \mid \Theta, x_k) \propto \sqrt{|W|}\exp\!\Big(-\tfrac{1}{2}\mathrm{tr}\big[WL_k\big]\Big),$$
where the completed square is
$$L_k \triangleq (y_k - A^\top x_k)(y_k - A^\top x_k)^\top = y_ky_k^\top - A^\top x_ky_k^\top - y_kx_k^\top A + A^\top x_kx_k^\top A.$$
Let $\bar{\Lambda}_k \triangleq x_kx_k^\top$, $\bar{\Xi}_k \triangleq x_ky_k^\top$, and $\bar{M}_k = \bar{\Lambda}_k^{-1}\bar{\Xi}_k$. Then adding and subtracting $\bar{\Xi}_k^\top\bar{\Lambda}_k^{-1}\bar{\Xi}_k$ allows us to rewrite the square in terms of $A$:
$$L_k + \bar{\Xi}_k^\top\bar{\Lambda}_k^{-1}\bar{\Xi}_k - \bar{\Xi}_k^\top\bar{\Lambda}_k^{-1}\bar{\Xi}_k = y_ky_k^\top + (A - \bar{M}_k)^\top\bar{\Lambda}_k(A - \bar{M}_k) - \bar{\Xi}_k^\top\bar{\Lambda}_k^{-1}\bar{\Xi}_k.$$
The two remaining terms cancel:
$$y_ky_k^\top - \bar{\Xi}_k^\top\bar{\Lambda}_k^{-1}\bar{\Xi}_k = y_ky_k^\top - y_kx_k^\top(x_kx_k^\top)^{-1}x_ky_k^\top = y_ky_k^\top - y_ky_k^\top = \mathbf{0}_{D_y \times D_y}.$$
If we define $\bar{\nu}_k \triangleq 1 - \bar{D}$ for $\bar{D} = D_x - D_y - 1$ and $\bar{\Omega}_k \triangleq \mathbf{0}_{D_y \times D_y}$, then we may recognize the functional form of a matrix normal Wishart in (A4):
$$p(y_k \mid \Theta, x_k) \propto \sqrt{|W|^{\bar{\nu}_k + \bar{D}}}\exp\!\Big(-\tfrac{1}{2}\mathrm{tr}\big[W\big((A - \bar{M}_k)^\top\bar{\Lambda}_k(A - \bar{M}_k) + \bar{\Omega}_k\big)\big]\Big) \propto \mathcal{MNW}(A, W \mid \bar{M}_k, \bar{\Lambda}_k^{-1}, \bar{\Omega}_k^{-1}, \bar{\nu}_k).$$
This concludes the proof. □

Appendix C. Product of Matrix Normal Wishart Distributions

Proof. 
Let $p_1, p_2$ be two matrix normal Wishart distributions over the same random variables $\Theta$:
$$p_1(\Theta) = \mathcal{MNW}(A, W \mid M_1, \Lambda_1^{-1}, \Omega_1^{-1}, \nu_1), \qquad p_2(\Theta) = \mathcal{MNW}(A, W \mid M_2, \Lambda_2^{-1}, \Omega_2^{-1}, \nu_2).$$
Their product is proportional to
$$p_1(\Theta)\,p_2(\Theta) \propto \sqrt{|W|^{\nu_1 + \bar{D}}}\exp\!\Big(-\tfrac{1}{2}\mathrm{tr}\big[WL_1\big]\Big)\sqrt{|W|^{\nu_2 + \bar{D}}}\exp\!\Big(-\tfrac{1}{2}\mathrm{tr}\big[WL_2\big]\Big) = \sqrt{|W|^{\nu_3 + \bar{D}}}\exp\!\Big(-\tfrac{1}{2}\mathrm{tr}\big[W(L_1 + L_2)\big]\Big)$$
for $\bar{D} \triangleq D_x - D_y - 1$, $\nu_3 \triangleq \nu_1 + \nu_2 + D_x - D_y - 1$, and $L_i \triangleq (A - M_i)^\top\Lambda_i(A - M_i) + \Omega_i$. The sum of the $L_i$ is
$$L_1 + L_2 = A^\top(\Lambda_1 + \Lambda_2)A - A^\top(\Lambda_1M_1 + \Lambda_2M_2) - (M_1^\top\Lambda_1 + M_2^\top\Lambda_2)A + M_1^\top\Lambda_1M_1 + M_2^\top\Lambda_2M_2 + \Omega_1 + \Omega_2.$$
Let $\Lambda_3 \triangleq \Lambda_1 + \Lambda_2$ and $\Theta_3 \triangleq \Lambda_1M_1 + \Lambda_2M_2$. Then,
$$(A - \Lambda_3^{-1}\Theta_3)^\top\Lambda_3(A - \Lambda_3^{-1}\Theta_3) = A^\top\Lambda_3A - A^\top\Theta_3 - \Theta_3^\top A + \Theta_3^\top\Lambda_3^{-1}\Theta_3.$$
Using $M_3 \triangleq \Lambda_3^{-1}\Theta_3$, (A5) can be written as
$$p_1(\Theta)\,p_2(\Theta) \propto \sqrt{|W|^{\nu_3 + \bar{D}}}\exp\!\Big(-\tfrac{1}{2}\mathrm{tr}\big[W\big((A - M_3)^\top\Lambda_3(A - M_3) - \Theta_3^\top\Lambda_3^{-1}\Theta_3 + M_1^\top\Lambda_1M_1 + M_2^\top\Lambda_2M_2 + \Omega_1 + \Omega_2\big)\big]\Big).$$
Note that $\Theta_3^\top\Lambda_3^{-1}\Theta_3 = \Theta_3^\top\Lambda_3^{-1}\Lambda_3\Lambda_3^{-1}\Theta_3 = M_3^\top\Lambda_3M_3$. Let
$$\Omega_3 \triangleq \Omega_1 + \Omega_2 + M_1^\top\Lambda_1M_1 + M_2^\top\Lambda_2M_2 - M_3^\top\Lambda_3M_3.$$
Then (A6) may be recognized as an unnormalized matrix normal Wishart:
$$\sqrt{|W|^{\nu_3 + \bar{D}}}\exp\!\Big(-\tfrac{1}{2}\mathrm{tr}\big[W\big((A - M_3)^\top\Lambda_3(A - M_3) + \Omega_3\big)\big]\Big) \propto \mathcal{MNW}\big(A, W \mid M_3, \Lambda_3^{-1}, \Omega_3^{-1}, \nu_3\big).$$
As such, the product of two matrix normal Wishart distributions is proportional to another matrix normal Wishart distribution. □

Appendix D. Marginalization over A

Proof. 
The marginalization over $A$ is
$$p(y_t \mid W, u_t, \mathcal{D}_k) = \int p(y_t \mid \Theta, x_t)\, p(A \mid W, \mathcal{D}_k)\, dA = \int \mathcal{N}\big(y_t \mid A^\top x_t, W^{-1}\big)\,\mathcal{MN}\big(A \mid M_k, \Lambda_k^{-1}, W^{-1}\big)\, dA = \int \sqrt{\frac{|W|^{D_x + 1}|\Lambda_k|^{D_y}}{(2\pi)^{D_y(1 + D_x)}}}\exp\!\Big(-\tfrac{1}{2}\mathrm{tr}\big[W(L_t + H_k)\big]\Big)\, dA,$$
where the terms inside the trace are
$$L_t \triangleq (y_t - A^\top x_t)(y_t - A^\top x_t)^\top, \qquad H_k \triangleq (A - M_k)^\top\Lambda_k(A - M_k).$$
Expanding $L_t$ and $H_k$ and adding them yields
$$\begin{aligned}
L_t + H_k &= y_ty_t^\top - A^\top x_ty_t^\top - y_tx_t^\top A + A^\top x_tx_t^\top A + A^\top\Lambda_kA - A^\top\Lambda_kM_k - M_k^\top\Lambda_kA + M_k^\top\Lambda_kM_k \\
&= y_ty_t^\top + M_k^\top\Lambda_kM_k + A^\top(\Lambda_k + x_tx_t^\top)A - A^\top(\Lambda_kM_k + x_ty_t^\top) - (\Lambda_kM_k + x_ty_t^\top)^\top A.
\end{aligned}$$
Let $\Lambda_t \triangleq \Lambda_k + x_tx_t^\top$, $\Theta_t \triangleq \Lambda_kM_k + x_ty_t^\top$, and $M_t \triangleq \Lambda_t^{-1}\Theta_t$. Completing the square gives
$$L_t + H_k = (A - M_t)^\top\Lambda_t(A - M_t) - M_t^\top\Lambda_tM_t + y_ty_t^\top + M_k^\top\Lambda_kM_k.$$
Plugging this result into the integral in (A8) gives
$$\int\exp\!\Big(-\tfrac{1}{2}\mathrm{tr}\big[W(L_t + H_k)\big]\Big)\, dA = \exp\!\Big(-\tfrac{1}{2}\mathrm{tr}\big[W\big(y_ty_t^\top + M_k^\top\Lambda_kM_k - M_t^\top\Lambda_tM_t\big)\big]\Big)\int\exp\!\Big(-\tfrac{1}{2}\mathrm{tr}\big[W(A - M_t)^\top\Lambda_t(A - M_t)\big]\Big)\, dA.$$
We can recognize the integrand as the functional form of a matrix normal distribution. Thus, the integral evaluates to its inverse normalization factor:
$$\int\exp\!\Big(-\tfrac{1}{2}\mathrm{tr}\big[W(A - M_t)^\top\Lambda_t(A - M_t)\big]\Big)\, dA = \sqrt{\frac{(2\pi)^{D_yD_x}}{|W|^{D_x}|\Lambda_t|^{D_y}}}.$$
Using this result, the marginalization over $A$ is
$$\int p(y_t \mid \Theta, x_t)\, p(A \mid W, \mathcal{D}_k)\, dA = \sqrt{\frac{|W|}{(2\pi)^{D_y}}\frac{|\Lambda_k|^{D_y}}{|\Lambda_t|^{D_y}}}\exp\!\Big(-\tfrac{1}{2}\mathrm{tr}\big[W\big(y_ty_t^\top + M_k^\top\Lambda_kM_k - M_t^\top\Lambda_tM_t\big)\big]\Big).$$
Note that, under the matrix determinant lemma,
$$|\Lambda_t| = |\Lambda_k + x_tx_t^\top| = |\Lambda_k|\big(1 + x_t^\top\Lambda_k^{-1}x_t\big),$$
which implies that the ratio of determinants is
$$\frac{|\Lambda_k|^{D_y}}{|\Lambda_t|^{D_y}} = \big(1 + x_t^\top\Lambda_k^{-1}x_t\big)^{-D_y}.$$
Let $\lambda_t \triangleq (1 + x_t^\top\Lambda_k^{-1}x_t)^{-1}$. As $W$ is $D_y$-dimensional, $|W|\lambda_t^{D_y} = |W\lambda_t|$. Furthermore, note that
$$M_t^\top\Lambda_tM_t = M_k^\top\Lambda_k(x_tx_t^\top + \Lambda_k)^{-1}\Lambda_kM_k + y_tx_t^\top(x_tx_t^\top + \Lambda_k)^{-1}\Lambda_kM_k + M_k^\top\Lambda_k(x_tx_t^\top + \Lambda_k)^{-1}x_ty_t^\top + y_tx_t^\top(x_tx_t^\top + \Lambda_k)^{-1}x_ty_t^\top.$$
Combining this with the other terms in the trace gives
$$y_ty_t^\top + M_k^\top\Lambda_kM_k - M_t^\top\Lambda_tM_t = M_k^\top\Lambda_k\big(I - (x_tx_t^\top + \Lambda_k)^{-1}\Lambda_k\big)M_k - y_tx_t^\top(x_tx_t^\top + \Lambda_k)^{-1}\Lambda_kM_k - M_k^\top\Lambda_k(x_tx_t^\top + \Lambda_k)^{-1}x_ty_t^\top + y_t\big(1 - x_t^\top(x_tx_t^\top + \Lambda_k)^{-1}x_t\big)y_t^\top.$$
Using the Sherman–Morrison formula, we have
$$1 - x_t^\top(x_tx_t^\top + \Lambda_k)^{-1}x_t = 1 - x_t^\top\Big(\Lambda_k^{-1} - \frac{\Lambda_k^{-1}x_tx_t^\top\Lambda_k^{-1}}{1 + x_t^\top\Lambda_k^{-1}x_t}\Big)x_t = \big(1 + x_t^\top\Lambda_k^{-1}x_t\big)^{-1} = \lambda_t.$$
Another application of Sherman–Morrison yields
$$I - (x_tx_t^\top + \Lambda_k)^{-1}\Lambda_k = \Big[\Lambda_k^{-1} - \Big(\Lambda_k^{-1} - \frac{\Lambda_k^{-1}x_tx_t^\top\Lambda_k^{-1}}{1 + x_t^\top\Lambda_k^{-1}x_t}\Big)\Big]\Lambda_k = \lambda_t\Lambda_k^{-1}x_tx_t^\top.$$
A third Sherman–Morrison gives
$$\Lambda_k(x_tx_t^\top + \Lambda_k)^{-1}x_t = \Lambda_k\Big(\Lambda_k^{-1} - \frac{\Lambda_k^{-1}x_tx_t^\top\Lambda_k^{-1}}{1 + x_t^\top\Lambda_k^{-1}x_t}\Big)x_t = x_t - x_t\frac{x_t^\top\Lambda_k^{-1}x_t}{1 + x_t^\top\Lambda_k^{-1}x_t} = x_t\lambda_t.$$
Using these three simplifications, we have
$$\mathrm{tr}\big[W\big(y_ty_t^\top + M_k^\top\Lambda_kM_k - M_t^\top\Lambda_tM_t\big)\big] = y_t^\top W\lambda_ty_t - y_t^\top W\lambda_tM_k^\top x_t - x_t^\top M_kW\lambda_ty_t + x_t^\top M_kW\lambda_tM_k^\top x_t = (y_t - M_k^\top x_t)^\top W\lambda_t(y_t - M_k^\top x_t).$$
Plugging (A9) into (A8) yields
$$p(y_t \mid W, u_t, \mathcal{D}_k) = \sqrt{\frac{|W\lambda_t|}{(2\pi)^{D_y}}}\exp\!\Big(-\frac{\lambda_t}{2}(y_t - M_k^\top x_t)^\top W(y_t - M_k^\top x_t)\Big) = \mathcal{N}\big(y_t \mid M_k^\top x_t, (W\lambda_t)^{-1}\big).$$
This concludes the proof. □

Appendix E. Marginalization over W

Proof. 
The marginalization over $W$ is
$$\begin{aligned}
&\int \mathcal{N}\big(y_t \mid M_k^\top x_t, (W\lambda_t)^{-1}\big)\,\mathcal{W}\big(W \mid \Omega_k^{-1}, \nu_k\big)\, dW \\
&\quad= \int \sqrt{\frac{|W\lambda_t|}{(2\pi)^{D_y}}}\exp\!\Big(-\frac{\lambda_t}{2}(y_t - M_k^\top x_t)^\top W(y_t - M_k^\top x_t)\Big)\frac{1}{\Gamma_{D_y}(\frac{\nu_k}{2})}\sqrt{\frac{|\Omega_k|^{\nu_k}|W|^{\nu_k - D_y - 1}}{2^{\nu_kD_y}}}\exp\!\Big(-\tfrac{1}{2}\mathrm{tr}\big[W\Omega_k\big]\Big)\, dW \\
&\quad= \frac{1}{\Gamma_{D_y}(\frac{\nu_k}{2})}\sqrt{\frac{|\Omega_k|^{\nu_k}\lambda_t^{D_y}}{(2\pi)^{D_y}2^{\nu_kD_y}}}\int\sqrt{|W|^{\nu_k + 1 - D_y - 1}}\exp\!\Big(-\tfrac{1}{2}\mathrm{tr}\big[W\big(\Omega_k + \lambda_t(y_t - M_k^\top x_t)(y_t - M_k^\top x_t)^\top\big)\big]\Big)\, dW \\
&\quad= \frac{1}{\Gamma_{D_y}(\frac{\nu_k}{2})}\sqrt{\frac{|\Omega_k|^{\nu_k}\lambda_t^{D_y}}{(2\pi)^{D_y}2^{\nu_kD_y}}}\,\Gamma_{D_y}\Big(\frac{\nu_k + 1}{2}\Big)\sqrt{\frac{2^{(\nu_k + 1)D_y}}{\big|\Omega_k + \lambda_t(y_t - M_k^\top x_t)(y_t - M_k^\top x_t)^\top\big|^{\nu_k + 1}}},
\end{aligned}$$
where we made use of the normalization factor of a Wishart distribution. Note that
$$\sqrt{\frac{2^{(\nu_k + 1)D_y}}{(2\pi)^{D_y}2^{\nu_kD_y}}} = \sqrt{\frac{2^{D_y}}{2^{D_y}\pi^{D_y}}} = \sqrt{\frac{1}{\pi^{D_y}}}.$$
Let $\eta_t \triangleq \nu_k - D_y + 1$. Then,
$$\frac{\Gamma_{D_y}\big(\frac{\nu_k + 1}{2}\big)}{\Gamma_{D_y}\big(\frac{\nu_k}{2}\big)} = \frac{\Gamma_{D_y}\big(\frac{\eta_t + D_y}{2}\big)}{\Gamma_{D_y}\big(\frac{\eta_t + D_y - 1}{2}\big)}.$$
The determinants simplify as follows:
$$\sqrt{\frac{|\Omega_k|^{\nu_k}}{\big|\Omega_k + \lambda_t(y_t - M_k^\top x_t)(y_t - M_k^\top x_t)^\top\big|^{\nu_k + 1}}} = \sqrt{\big|\lambda_t(y_t - M_k^\top x_t)(y_t - M_k^\top x_t)^\top\Omega_k^{-1} + I\big|^{-(\nu_k + 1)}\,\big|\Omega_k^{-1}\big|},$$
and then, using the matrix determinant lemma, we have
$$\sqrt{\big|\lambda_t(y_t - M_k^\top x_t)(y_t - M_k^\top x_t)^\top\Omega_k^{-1} + I\big|^{-(\nu_k + 1)}} = \Big((y_t - M_k^\top x_t)^\top\Omega_k^{-1}\lambda_t(y_t - M_k^\top x_t) + 1\Big)^{-(\eta_t + D_y)/2}.$$
With Equations (A11)–(A14), we may write (A10) as
$$\begin{aligned}
\int \mathcal{N}\big(y_t \mid M_k^\top x_t, (W\lambda_t)^{-1}\big)\,\mathcal{W}\big(W \mid \Omega_k^{-1}, \nu_k\big)\, dW
&= \sqrt{\frac{|\Omega_k^{-1}|}{\pi^{D_y}}}\,\frac{\Gamma_{D_y}\big(\frac{\eta_t + D_y}{2}\big)}{\Gamma_{D_y}\big(\frac{\eta_t + D_y - 1}{2}\big)}\sqrt{\lambda_t^{D_y}}\Big(1 + (y_t - M_k^\top x_t)^\top\Omega_k^{-1}\lambda_t(y_t - M_k^\top x_t)\Big)^{-(\eta_t + D_y)/2} \\
&= \sqrt{\frac{|\eta_t\Omega_k^{-1}\lambda_t|}{(\eta_t\pi)^{D_y}}}\,\frac{\Gamma_{D_y}\big((\eta_t + D_y)/2\big)}{\Gamma_{D_y}\big((\eta_t + D_y - 1)/2\big)}\Big(1 + \frac{1}{\eta_t}(y_t - M_k^\top x_t)^\top\eta_t\Omega_k^{-1}\lambda_t(y_t - M_k^\top x_t)\Big)^{-(\eta_t + D_y)/2} \\
&= \mathcal{T}(y_t \mid \mu_t, \Psi_t^{-1}, \eta_t),
\end{aligned}$$
where $\mu_t \triangleq M_k^\top x_t$ and $\Psi_t \triangleq \eta_t\Omega_k^{-1}\lambda_t$. □

Appendix F. Cross-Entropy of a Matrix Normal Wishart Relative to a Matrix Normal Wishart

Proof. 
The functional form of a matrix normal Wishart with general parameters $M$, $\Lambda$, $\Omega$, and $\nu$ is
$$p(\Theta) = \mathcal{MNW}(A, W \mid M, \Lambda^{-1}, \Omega^{-1}, \nu) = \sqrt{\frac{|\Lambda|^{D_y}|\Omega|^{\nu}|W|^{\nu + D_x - D_y - 1}}{2^{(\nu + D_x)D_y}\pi^{D_xD_y}}}\frac{1}{\Gamma_{D_y}(\frac{\nu}{2})}\exp\!\Big(-\tfrac{1}{2}\mathrm{tr}\big[W\big((A - M)^\top\Lambda(A - M) + \Omega\big)\big]\Big).$$
Consider two matrix normal Wishart distributions over the same parameters $\Theta$:
$$q(\Theta) = \mathcal{MNW}(A, W \mid M_q, \Lambda_q^{-1}, \Omega_q^{-1}, \nu_q), \qquad p(\Theta) = \mathcal{MNW}(A, W \mid M_p, \Lambda_p^{-1}, \Omega_p^{-1}, \nu_p).$$
The differential cross-entropy $H[q, p]$ of $q$ relative to $p$ is
$$\begin{aligned}
H[q, p] = \mathbb{E}_{q(\Theta)}\big[-\log p(\Theta)\big] ={}& -\tfrac{1}{2}D_y\log|\Lambda_p| - \tfrac{1}{2}\nu_p\log|\Omega_p| - \tfrac{1}{2}(\nu_p + D_x - D_y - 1)\,\mathbb{E}_q\big[\log|W|\big] \\
&+ \tfrac{1}{2}(\nu_p + D_x)D_y\log 2 + \tfrac{1}{2}D_xD_y\log\pi + \log\Gamma_{D_y}\big(\tfrac{\nu_p}{2}\big) + \tfrac{1}{2}\mathbb{E}_q\Big[\mathrm{tr}\big[W\big((A - M_p)^\top\Lambda_p(A - M_p) + \Omega_p\big)\big]\Big].
\end{aligned}$$
The first expectation is the expectation of a Wishart log-determinant [28]:
$$\mathbb{E}_{q(\Theta)}\big[\log|W|\big] = \mathbb{E}_{q(W)}\mathbb{E}_{q(A \mid W)}\big[\log|W|\big] = \psi_{D_y}\big(\tfrac{\nu_q}{2}\big) + D_y\log 2 - \log|\Omega_q|,$$
where $\psi_{D_y}$ is the multivariate digamma function of dimension $D_y$. For the second expectation, we first define the following expectations:
$$\mathbb{E}_{q(A \mid W)}\big[A\big] = M_q, \qquad \mathbb{E}_{q(A \mid W)}\big[A^\top\big] = M_q^\top, \qquad \mathbb{E}_{q(A \mid W)}\big[A^\top BA\big] = M_q^\top BM_q + \mathrm{tr}\big(\Lambda_q^{-1}B\big)W^{-1}, \qquad \mathbb{E}_{q(W)}\big[W\big] = \nu_q\Omega_q^{-1},$$
for an appropriately dimensioned matrix $B$. Equation (A19) is a property of matrix normal distributions [28]. We apply $\mathbb{E}_q[\mathrm{tr}(\cdot)] = \mathrm{tr}(\mathbb{E}_q[\cdot])$ [27], and make use of the factorization of the matrix normal Wishart:
$$\begin{aligned}
\mathbb{E}_{q(\Theta)}\Big[\mathrm{tr}\big[W\big((A - M_p)^\top\Lambda_p(A - M_p) + \Omega_p\big)\big]\Big] &= \mathrm{tr}\Big[\mathbb{E}_{q(\Theta)}\big[W\big((A - M_p)^\top\Lambda_p(A - M_p) + \Omega_p\big)\big]\Big] \\
&= \mathrm{tr}\Big[\mathbb{E}_{q(W)}\mathbb{E}_{q(A \mid W)}\big[W\big((A - M_p)^\top\Lambda_p(A - M_p) + \Omega_p\big)\big]\Big] \\
&= \mathrm{tr}\Big[\mathbb{E}_{q(W)}\big[W\,\mathbb{E}_{q(A \mid W)}\big[(A - M_p)^\top\Lambda_p(A - M_p)\big] + W\Omega_p\big]\Big].
\end{aligned}$$
We expand the term in the inner expectation and plug in Equations (A17)–(A19):
$$\begin{aligned}
\mathbb{E}_{q(A \mid W)}\big[(A - M_p)^\top\Lambda_p(A - M_p)\big] &= \mathbb{E}_{q(A \mid W)}\big[A^\top\Lambda_pA - A^\top\Lambda_pM_p - M_p^\top\Lambda_pA + M_p^\top\Lambda_pM_p\big] \\
&= M_q^\top\Lambda_pM_q + \mathrm{tr}\big(\Lambda_q^{-1}\Lambda_p\big)W^{-1} - M_q^\top\Lambda_pM_p - M_p^\top\Lambda_pM_q + M_p^\top\Lambda_pM_p \\
&= (M_q - M_p)^\top\Lambda_p(M_q - M_p) + \mathrm{tr}\big(\Lambda_q^{-1}\Lambda_p\big)W^{-1}.
\end{aligned}$$
We plug (A22) into (A21), expand, and use (A20) to resolve the remaining expectations:
$$\begin{aligned}
\mathrm{tr}\Big[\mathbb{E}_{q(W)}\big[W\,\mathbb{E}_{q(A \mid W)}\big[(A - M_p)^\top\Lambda_p(A - M_p)\big] + W\Omega_p\big]\Big]
&= \mathrm{tr}\Big[\mathbb{E}_{q(W)}\big[W(M_q - M_p)^\top\Lambda_p(M_q - M_p) + \mathrm{tr}\big(\Lambda_q^{-1}\Lambda_p\big)WW^{-1} + W\Omega_p\big]\Big] \\
&= \mathrm{tr}\Big[\nu_q\Omega_q^{-1}(M_q - M_p)^\top\Lambda_p(M_q - M_p) + \mathrm{tr}\big(\Lambda_q^{-1}\Lambda_p\big)I_{D_y} + \nu_q\Omega_q^{-1}\Omega_p\Big] \\
&= \nu_q\,\mathrm{tr}\Big[\Omega_q^{-1}(M_q - M_p)^\top\Lambda_p(M_q - M_p)\Big] + D_y\,\mathrm{tr}\big(\Lambda_q^{-1}\Lambda_p\big) + \nu_q\,\mathrm{tr}\big(\Omega_q^{-1}\Omega_p\big).
\end{aligned}$$
We plug (A23) and (A16) into (A15) to yield the differential cross-entropy:
$$\begin{aligned}
H[q, p] ={}& -\tfrac{1}{2}D_y\log|\Lambda_p| - \tfrac{1}{2}\nu_p\log|\Omega_p| - \tfrac{1}{2}(\nu_p + D_x - D_y - 1)\Big(\psi_{D_y}\big(\tfrac{\nu_q}{2}\big) + D_y\log 2 - \log|\Omega_q|\Big) + \tfrac{1}{2}(\nu_p + D_x)D_y\log 2 + \tfrac{1}{2}D_xD_y\log\pi + \log\Gamma_{D_y}\big(\tfrac{\nu_p}{2}\big) \\
&+ \tfrac{1}{2}\nu_q\,\mathrm{tr}\Big[\Omega_q^{-1}(M_q - M_p)^\top\Lambda_p(M_q - M_p)\Big] + \tfrac{1}{2}D_y\,\mathrm{tr}\big(\Lambda_q^{-1}\Lambda_p\big) + \tfrac{1}{2}\nu_q\,\mathrm{tr}\big(\Omega_q^{-1}\Omega_p\big) \\
={}& -\tfrac{1}{2}D_y\log|\Lambda_p| + \tfrac{1}{2}(\nu_p + D_x - D_y - 1)\log|\Omega_q| - \tfrac{1}{2}\nu_p\log|\Omega_p| + \tfrac{1}{2}(D_y + 1)D_y\log 2 + \tfrac{1}{2}D_xD_y\log\pi + \log\Gamma_{D_y}\big(\tfrac{\nu_p}{2}\big) \\
&- \tfrac{1}{2}(\nu_p + D_x - D_y - 1)\psi_{D_y}\big(\tfrac{\nu_q}{2}\big) + \tfrac{1}{2}\nu_q\,\mathrm{tr}\Big[\Omega_q^{-1}(M_q - M_p)^\top\Lambda_p(M_q - M_p)\Big] + \tfrac{1}{2}\Big(D_y\,\mathrm{tr}\big(\Lambda_q^{-1}\Lambda_p\big) + \nu_q\,\mathrm{tr}\big(\Omega_q^{-1}\Omega_p\big)\Big).
\end{aligned}$$
This concludes the proof. □
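For numerical work, for instance when evaluating how individual factors contribute to the model evidence, expression (A24) can be implemented directly and verified by Monte Carlo. The sketch below is a Python illustration with arbitrary parameter values, not the authors' implementation; it samples from $q$ using the standard factorization $W \sim \mathcal{W}(\Omega_q^{-1}, \nu_q)$, $A \mid W \sim \mathcal{MN}(M_q, \Lambda_q^{-1}, W^{-1})$.

```python
# Closed-form cross-entropy H[q, p] between two matrix normal Wishart distributions,
# transcribed from (A24), together with a Monte Carlo estimate of E_q[-log p(A, W)].
# Illustrative sketch with arbitrary parameter values; not the authors' implementation.
import numpy as np
from scipy.special import digamma, multigammaln
from scipy.stats import wishart

def multidigamma(a, d):
    """Multivariate digamma psi_d(a) = sum_{j=1}^{d} psi(a + (1 - j) / 2)."""
    return sum(digamma(a + (1 - j) / 2) for j in range(1, d + 1))

def mnw_logpdf(A, W, M, L, O, nu):
    """log MNW(A, W | M, L^{-1}, O^{-1}, nu): Dx x Dy mean M, Dx x Dx precision L,
    Dy x Dy rate matrix O, and degrees of freedom nu."""
    Dx, Dy = M.shape
    E = A - M
    return (0.5 * Dy * np.linalg.slogdet(L)[1] + 0.5 * nu * np.linalg.slogdet(O)[1]
            + 0.5 * (nu + Dx - Dy - 1) * np.linalg.slogdet(W)[1]
            - 0.5 * (nu + Dx) * Dy * np.log(2) - 0.5 * Dx * Dy * np.log(np.pi)
            - multigammaln(nu / 2, Dy) - 0.5 * np.trace(W @ (E.T @ L @ E + O)))

def mnw_cross_entropy(Mq, Lq, Oq, nq, Mp, Lp, Op, nup):
    """Differential cross-entropy H[q, p] of MNW q relative to MNW p, following (A24)."""
    Dx, Dy = Mq.shape
    dM = Mq - Mp
    return (-0.5 * Dy * np.linalg.slogdet(Lp)[1]
            + 0.5 * (nup + Dx - Dy - 1) * np.linalg.slogdet(Oq)[1]
            - 0.5 * nup * np.linalg.slogdet(Op)[1]
            + 0.5 * (Dy + 1) * Dy * np.log(2) + 0.5 * Dx * Dy * np.log(np.pi)
            + multigammaln(nup / 2, Dy)
            - 0.5 * (nup + Dx - Dy - 1) * multidigamma(nq / 2, Dy)
            + 0.5 * nq * np.trace(np.linalg.solve(Oq, dM.T @ Lp @ dM))
            + 0.5 * Dy * np.trace(np.linalg.solve(Lq, Lp))
            + 0.5 * nq * np.trace(np.linalg.solve(Oq, Op)))

# Arbitrary example parameters (Dx = 4 regressors, Dy = 2 outputs)
rng = np.random.default_rng(2)
Dx, Dy = 4, 2
Mq, Mp = rng.normal(size=(Dx, Dy)), rng.normal(size=(Dx, Dy))
Lq, Lp = np.diag(rng.uniform(0.5, 2.0, Dx)), np.diag(rng.uniform(0.5, 2.0, Dx))
Oq, Op = np.diag(rng.uniform(0.5, 2.0, Dy)), np.diag(rng.uniform(0.5, 2.0, Dy))
nq, nup = Dy + 3.0, Dy + 5.0

# Monte Carlo estimate of E_q[-log p]: W ~ W(Oq^{-1}, nq), A | W ~ MN(Mq, Lq^{-1}, W^{-1})
n = 20_000
Ws = wishart(df=nq, scale=np.linalg.inv(Oq)).rvs(size=n, random_state=rng)
Lchol = np.linalg.cholesky(np.linalg.inv(Lq))       # among-row factor of Lq^{-1}
total = 0.0
for W in Ws:
    Wchol = np.linalg.cholesky(np.linalg.inv(W))    # among-column factor of W^{-1}
    A = Mq + Lchol @ rng.normal(size=(Dx, Dy)) @ Wchol.T
    total -= mnw_logpdf(A, W, Mp, Lp, Op, nup)

print(total / n, mnw_cross_entropy(Mq, Lq, Oq, nq, Mp, Lp, Op, nup))  # should roughly agree
```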

Appendix G. Entropy of a Matrix Normal Wishart

Proof. 
Consider a matrix normal Wishart distribution:
$$
q(\Theta) = \mathrm{MNW}(A, W \mid M_q, \Lambda_q^{-1}, \Omega_q^{-1}, \nu_q).
$$
By definition, the differential entropy $H[q]$ of a distribution $q$ is the special case of the differential cross-entropy $H[q, p]$ in which $p = q$, i.e., $H[q] = H[q, q]$. Substituting the parameters of $q$ for those of $p$ in (A24) from Appendix F yields the entropy:
$$
\begin{aligned}
H[q] = {} & -\tfrac{1}{2} D_y \log |\Lambda_q| + \tfrac{1}{2} (\nu_q + D_x - D_y - 1) \log |\Omega_q| - \tfrac{1}{2} \nu_q \log |\Omega_q| + \tfrac{1}{2} (D_y + 1) D_y \log 2 \\
& + \tfrac{1}{2} D_x D_y \log \pi + \log \Gamma_{D_y}\!\left(\tfrac{\nu_q}{2}\right) - \tfrac{1}{2} (\nu_q + D_x - D_y - 1)\, \psi_{D_y}\!\left(\tfrac{\nu_q}{2}\right) \\
& + \tfrac{1}{2} \nu_q\, \underbrace{\mathrm{tr}\!\left\{ \Omega_q^{-1} (M_q - M_q)^{\top} \Lambda_q (M_q - M_q) \right\}}_{=\,0} + \tfrac{1}{2} D_y\, \underbrace{\mathrm{tr}\!\left( \Lambda_q^{-1} \Lambda_q \right)}_{=\,D_x} + \tfrac{1}{2} \nu_q\, \underbrace{\mathrm{tr}\!\left( \Omega_q^{-1} \Omega_q \right)}_{=\,D_y} \\
= {} & -\tfrac{1}{2} D_y \log |\Lambda_q| + \tfrac{1}{2} (D_x - D_y - 1) \log |\Omega_q| + \tfrac{1}{2} (D_y + 1) D_y \log 2 + \tfrac{1}{2} D_x D_y \log \pi \\
& + \tfrac{1}{2} (D_x + \nu_q) D_y + \log \Gamma_{D_y}\!\left(\tfrac{\nu_q}{2}\right) - \tfrac{1}{2} (\nu_q + D_x - D_y - 1)\, \psi_{D_y}\!\left(\tfrac{\nu_q}{2}\right).
\end{aligned}
$$
This concludes the proof. □
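The closed-form entropy can be cross-checked against the chain-rule decomposition $H[q] = H[W] + \mathbb{E}_{W}[H(A \mid W)]$, which combines the Wishart entropy (available in SciPy) with the entropy of the conditional matrix normal. The Python sketch below is illustrative only, with arbitrary parameter values, and is not the paper's code.

```python
# Closed-form differential entropy of a matrix normal Wishart, transcribed from the
# expression above, checked against the chain rule H[q] = H[W] + E_W[H(A | W)], where
# A | W ~ MN(Mq, Lq^{-1}, W^{-1}) and W ~ Wishart(Oq^{-1}, nq). Illustrative sketch with
# arbitrary parameter values; not the authors' implementation.
import numpy as np
from scipy.special import digamma, multigammaln
from scipy.stats import wishart

def multidigamma(a, d):
    """Multivariate digamma psi_d(a) = sum_{j=1}^{d} psi(a + (1 - j) / 2)."""
    return sum(digamma(a + (1 - j) / 2) for j in range(1, d + 1))

def mnw_entropy(Mq, Lq, Oq, nq):
    """Differential entropy of MNW(A, W | Mq, Lq^{-1}, Oq^{-1}, nq)."""
    Dx, Dy = Mq.shape
    return (-0.5 * Dy * np.linalg.slogdet(Lq)[1]
            + 0.5 * (Dx - Dy - 1) * np.linalg.slogdet(Oq)[1]
            + 0.5 * (Dy + 1) * Dy * np.log(2) + 0.5 * Dx * Dy * np.log(np.pi)
            + 0.5 * (Dx + nq) * Dy + multigammaln(nq / 2, Dy)
            - 0.5 * (nq + Dx - Dy - 1) * multidigamma(nq / 2, Dy))

# Arbitrary example parameters
rng = np.random.default_rng(3)
Dx, Dy = 4, 2
Mq = rng.normal(size=(Dx, Dy))
Lq = np.diag(rng.uniform(0.5, 2.0, Dx))
Oq = np.diag(rng.uniform(0.5, 2.0, Dy))
nq = Dy + 3.0

# Chain-rule check: Wishart entropy plus the expected conditional matrix normal entropy,
# using E[log |W|] = psi_Dy(nq / 2) + Dy log 2 - log |Oq| from (A16).
e_logdet_W = multidigamma(nq / 2, Dy) + Dy * np.log(2) - np.linalg.slogdet(Oq)[1]
h_wishart = wishart(df=nq, scale=np.linalg.inv(Oq)).entropy()
h_conditional = (0.5 * Dx * Dy * np.log(2 * np.pi * np.e)
                 - 0.5 * Dy * np.linalg.slogdet(Lq)[1]
                 - 0.5 * Dx * e_logdet_W)

print(mnw_entropy(Mq, Lq, Oq, nq), h_wishart + h_conditional)  # the two values should match
```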

Appendix H. KL-Divergence of a Matrix Normal Wishart from a Matrix Normal Wishart

Proof. 
Consider two matrix normal Wishart distributions over the same parameters Θ :
$$
q(\Theta) = \mathrm{MNW}(A, W \mid M_q, \Lambda_q^{-1}, \Omega_q^{-1}, \nu_q), \qquad p(\Theta) = \mathrm{MNW}(A, W \mid M_p, \Lambda_p^{-1}, \Omega_p^{-1}, \nu_p).
$$
By definition, the KL-divergence $D_{\mathrm{KL}}[q \,\|\, p]$ of a distribution $q$ from another distribution $p$ is the difference between the differential cross-entropy $H[q, p]$ of $q$ relative to $p$ (A24) and the differential entropy of $q$ (14) [50]:
$$
\begin{aligned}
D_{\mathrm{KL}}[q \,\|\, p] = {} & -\tfrac{1}{2} D_y \log |\Lambda_p| + \tfrac{1}{2} (\nu_p + D_x - D_y - 1) \log |\Omega_q| - \tfrac{1}{2} \nu_p \log |\Omega_p| + \tfrac{1}{2} (D_y + 1) D_y \log 2 \\
& + \tfrac{1}{2} D_x D_y \log \pi + \log \Gamma_{D_y}\!\left(\tfrac{\nu_p}{2}\right) - \tfrac{1}{2} (\nu_p + D_x - D_y - 1)\, \psi_{D_y}\!\left(\tfrac{\nu_q}{2}\right) \\
& + \tfrac{1}{2} \nu_q\, \mathrm{tr}\!\left\{ \Omega_q^{-1} (M_q - M_p)^{\top} \Lambda_p (M_q - M_p) \right\} + \tfrac{1}{2} D_y\, \mathrm{tr}\!\left( \Lambda_q^{-1} \Lambda_p \right) + \tfrac{1}{2} \nu_q\, \mathrm{tr}\!\left( \Omega_q^{-1} \Omega_p \right) \\
& + \tfrac{1}{2} D_y \log |\Lambda_q| - \tfrac{1}{2} (D_x - D_y - 1) \log |\Omega_q| - \tfrac{1}{2} (D_y + 1) D_y \log 2 - \tfrac{1}{2} D_x D_y \log \pi \\
& - \tfrac{1}{2} (D_x + \nu_q) D_y - \log \Gamma_{D_y}\!\left(\tfrac{\nu_q}{2}\right) + \tfrac{1}{2} (\nu_q + D_x - D_y - 1)\, \psi_{D_y}\!\left(\tfrac{\nu_q}{2}\right) \\
= {} & \tfrac{1}{2} D_y \log \frac{|\Lambda_q|}{|\Lambda_p|} + \tfrac{1}{2} \nu_p \log \frac{|\Omega_q|}{|\Omega_p|} - \tfrac{1}{2} (D_x + \nu_q) D_y - \log \Gamma_{D_y}\!\left(\tfrac{\nu_q}{2}\right) + \log \Gamma_{D_y}\!\left(\tfrac{\nu_p}{2}\right) \\
& + \tfrac{1}{2} (\nu_q - \nu_p)\, \psi_{D_y}\!\left(\tfrac{\nu_q}{2}\right) + \tfrac{1}{2} \nu_q\, \mathrm{tr}\!\left\{ \Omega_q^{-1} (M_q - M_p)^{\top} \Lambda_p (M_q - M_p) \right\} \\
& + \tfrac{1}{2} D_y\, \mathrm{tr}\!\left( \Lambda_q^{-1} \Lambda_p \right) + \tfrac{1}{2} \nu_q\, \mathrm{tr}\!\left( \Omega_q^{-1} \Omega_p \right).
\end{aligned}
$$
This concludes the proof. □
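Two quick sanity checks on this expression are that it vanishes when $p = q$ and that it is non-negative otherwise. The Python sketch below, with arbitrary parameter values and not taken from the paper, implements the final expression and runs both checks.

```python
# Closed-form KL divergence between two matrix normal Wishart distributions, transcribed
# from the final expression above. Sanity checks: D_KL[q || q] = 0 and D_KL[q || p] >= 0.
# Illustrative sketch with arbitrary parameter values; not the authors' implementation.
import numpy as np
from scipy.special import digamma, multigammaln

def multidigamma(a, d):
    """Multivariate digamma psi_d(a) = sum_{j=1}^{d} psi(a + (1 - j) / 2)."""
    return sum(digamma(a + (1 - j) / 2) for j in range(1, d + 1))

def mnw_kl(Mq, Lq, Oq, nq, Mp, Lp, Op, nup):
    """D_KL[q || p] for q = MNW(Mq, Lq^{-1}, Oq^{-1}, nq) and p defined analogously."""
    Dx, Dy = Mq.shape
    dM = Mq - Mp
    return (0.5 * Dy * (np.linalg.slogdet(Lq)[1] - np.linalg.slogdet(Lp)[1])
            + 0.5 * nup * (np.linalg.slogdet(Oq)[1] - np.linalg.slogdet(Op)[1])
            - 0.5 * (Dx + nq) * Dy
            - multigammaln(nq / 2, Dy) + multigammaln(nup / 2, Dy)
            + 0.5 * (nq - nup) * multidigamma(nq / 2, Dy)
            + 0.5 * nq * np.trace(np.linalg.solve(Oq, dM.T @ Lp @ dM))
            + 0.5 * Dy * np.trace(np.linalg.solve(Lq, Lp))
            + 0.5 * nq * np.trace(np.linalg.solve(Oq, Op)))

# Arbitrary example parameters
rng = np.random.default_rng(4)
Dx, Dy = 4, 2
Mq = rng.normal(size=(Dx, Dy))
Lq = np.diag(rng.uniform(0.5, 2.0, Dx))
Oq = np.diag(rng.uniform(0.5, 2.0, Dy))
nq = Dy + 3.0

print(mnw_kl(Mq, Lq, Oq, nq, Mq, Lq, Oq, nq))                          # 0 up to rounding
print(mnw_kl(Mq, Lq, Oq, nq, Mq + 0.5, 2.0 * Lq, 0.5 * Oq, nq + 2.0))  # strictly positive
```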

Appendix I. Cross-Entropy of a Matrix Normal Wishart Relative to a Multivariate Normal

Proof. 
Consider a matrix normal Wishart distribution q and a multivariate normal distribution p:
$$
q(\Theta) = \mathrm{MNW}(A, W \mid M_q, \Lambda_q^{-1}, \Omega_q^{-1}, \nu_q), \qquad p(y \mid \Theta, x) = \mathcal{N}(y \mid A^{\top} x, W^{-1}).
$$
The differential cross-entropy $H[q, p]$ of $q$ relative to $p$ is
$$
H[q, p] = \mathbb{E}_{q(\Theta)}\!\left[ -\log p(y \mid \Theta, x) \right] = -\tfrac{1}{2}\, \mathbb{E}_{q(\Theta)}\!\left[ \log |W| \right] + \tfrac{1}{2} D_y \log 2 + \tfrac{1}{2} D_y \log \pi + \tfrac{1}{2}\, \mathbb{E}_{q(\Theta)}\!\left[ (y - A^{\top} x)^{\top} W (y - A^{\top} x) \right]. \tag{A25}
$$
The first expectation is again the expectation of a Wishart log-determinant [28], given in (A16). For the second expectation, we make use of the factorization of the matrix normal Wishart, write the term (a scalar) in trace form, apply $\mathbb{E}_q[\mathrm{tr}(\cdot)] = \mathrm{tr}(\mathbb{E}_q[\cdot])$ [27] together with the cyclic property of the trace, and rearrange as follows:
$$
\begin{aligned}
\mathbb{E}_{q(\Theta)}\!\left[ (y - A^{\top} x)^{\top} W (y - A^{\top} x) \right]
&= \mathbb{E}_{q(W)} \mathbb{E}_{q(A \mid W)}\!\left[ (y - A^{\top} x)^{\top} W (y - A^{\top} x) \right] \\
&= \mathrm{tr}\!\left\{ \mathbb{E}_{q(W)} \mathbb{E}_{q(A \mid W)}\!\left[ W (y - A^{\top} x)(y - A^{\top} x)^{\top} \right] \right\} \\
&= \mathrm{tr}\!\left\{ \mathbb{E}_{q(W)}\!\left[ W\, \mathbb{E}_{q(A \mid W)}\!\left[ (y - A^{\top} x)(y - A^{\top} x)^{\top} \right] \right] \right\}.
\end{aligned} \tag{A26}
$$
We expand the term in the inner expectation and plug in (A17) and (A19) (with $B = x x^{\top}$):
$$
\begin{aligned}
\mathbb{E}_{q(A \mid W)}\!\left[ (y - A^{\top} x)(y - A^{\top} x)^{\top} \right]
&= y y^{\top} - \mathbb{E}_{q(A \mid W)}\!\left[ y x^{\top} A \right] - \mathbb{E}_{q(A \mid W)}\!\left[ A^{\top} x y^{\top} \right] + \mathbb{E}_{q(A \mid W)}\!\left[ A^{\top} x x^{\top} A \right] \\
&= y y^{\top} - y x^{\top}\, \mathbb{E}_{q(A \mid W)}\!\left[ A \right] - \mathbb{E}_{q(A \mid W)}\!\left[ A^{\top} \right] x y^{\top} + \mathbb{E}_{q(A \mid W)}\!\left[ A^{\top} x x^{\top} A \right] \\
&= y y^{\top} - y x^{\top} M_q - M_q^{\top} x y^{\top} + M_q^{\top} x x^{\top} M_q + \mathrm{tr}\!\left( \Lambda_q^{-1} x x^{\top} \right) W^{-1} \\
&= (y - M_q^{\top} x)(y - M_q^{\top} x)^{\top} + \mathrm{tr}\!\left( \Lambda_q^{-1} x x^{\top} \right) W^{-1} \\
&= (y - M_q^{\top} x)(y - M_q^{\top} x)^{\top} + x^{\top} \Lambda_q^{-1} x\, W^{-1}.
\end{aligned} \tag{A27}
$$
Note that all terms appear within the outer trace, so we can apply the cyclic property of the trace, and $x^{\top} \Lambda_q^{-1} x$ is a scalar. We plug (A27) into (A26) and use (A20) to resolve the remaining expectation:
$$
\begin{aligned}
\mathbb{E}_{q(\Theta)}\!\left[ (y - A^{\top} x)^{\top} W (y - A^{\top} x) \right]
&= \mathrm{tr}\!\left\{ \mathbb{E}_{q(W)}\!\left[ W \left( (y - M_q^{\top} x)(y - M_q^{\top} x)^{\top} + x^{\top} \Lambda_q^{-1} x\, W^{-1} \right) \right] \right\} \\
&= \mathrm{tr}\!\left\{ \mathbb{E}_{q(W)}\!\left[ W (y - M_q^{\top} x)(y - M_q^{\top} x)^{\top} + x^{\top} \Lambda_q^{-1} x\, \underbrace{W W^{-1}}_{=\, I_{D_y}} \right] \right\} \\
&= \mathrm{tr}\!\left\{ \mathbb{E}_{q(W)}\!\left[ W (y - M_q^{\top} x)(y - M_q^{\top} x)^{\top} \right] \right\} + x^{\top} \Lambda_q^{-1} x\, \mathrm{tr}\!\left( I_{D_y} \right) \\
&= \mathrm{tr}\!\left\{ \nu_q \Omega_q^{-1} (y - M_q^{\top} x)(y - M_q^{\top} x)^{\top} \right\} + x^{\top} \Lambda_q^{-1} x\, D_y \\
&= \nu_q (y - M_q^{\top} x)^{\top} \Omega_q^{-1} (y - M_q^{\top} x) + x^{\top} \Lambda_q^{-1} x\, D_y.
\end{aligned} \tag{A28}
$$
We plug (A28) and (A16) into (A25) to yield the differential cross-entropy:
$$
H[q, p] = -\tfrac{1}{2}\, \psi_{D_y}\!\left(\tfrac{\nu_q}{2}\right) + \tfrac{1}{2} \log |\Omega_q| + \tfrac{1}{2} D_y \log \pi + \tfrac{1}{2} \nu_q (y - M_q^{\top} x)^{\top} \Omega_q^{-1} (y - M_q^{\top} x) + \tfrac{1}{2} x^{\top} \Lambda_q^{-1} x\, D_y.
$$
This concludes the proof. □
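This expectation can be evaluated in closed form at every time step. The Python sketch below illustrates the expression with arbitrary parameter values, assuming the likelihood parameterization $\mathcal{N}(y \mid A^{\top} x, W^{-1})$ used in the proof; it is not the authors' implementation, and it checks the closed form by Monte Carlo.

```python
# Closed-form E_q[-log N(y | A^T x, W^{-1})] for a matrix normal Wishart q, transcribed
# from the expression above, with a Monte Carlo check that samples (A, W) from q.
# Illustrative sketch with arbitrary parameter values; not the authors' implementation.
import numpy as np
from scipy.special import digamma
from scipy.stats import wishart

def multidigamma(a, d):
    """Multivariate digamma psi_d(a) = sum_{j=1}^{d} psi(a + (1 - j) / 2)."""
    return sum(digamma(a + (1 - j) / 2) for j in range(1, d + 1))

def mnw_gaussian_cross_entropy(y, x, Mq, Lq, Oq, nq):
    """H[q, p] = E_q[-log N(y | A^T x, W^{-1})] for q = MNW(Mq, Lq^{-1}, Oq^{-1}, nq)."""
    Dy = Oq.shape[0]
    r = y - Mq.T @ x                                  # residual with respect to the mean Mq^T x
    return (-0.5 * multidigamma(nq / 2, Dy) + 0.5 * np.linalg.slogdet(Oq)[1]
            + 0.5 * Dy * np.log(np.pi)
            + 0.5 * nq * r @ np.linalg.solve(Oq, r)
            + 0.5 * Dy * (x @ np.linalg.solve(Lq, x)))

# Arbitrary example parameters and data point
rng = np.random.default_rng(5)
Dx, Dy = 4, 2
Mq = rng.normal(size=(Dx, Dy))
Lq = np.diag(rng.uniform(0.5, 2.0, Dx))
Oq = np.diag(rng.uniform(0.5, 2.0, Dy))
nq = Dy + 3.0
x = rng.normal(size=Dx)
y = Mq.T @ x + rng.normal(scale=0.3, size=Dy)

# Monte Carlo check: W ~ W(Oq^{-1}, nq), A | W ~ MN(Mq, Lq^{-1}, W^{-1}),
# then average -log N(y | A^T x, W^{-1}) over the draws.
n = 20_000
Ws = wishart(df=nq, scale=np.linalg.inv(Oq)).rvs(size=n, random_state=rng)
Lchol = np.linalg.cholesky(np.linalg.inv(Lq))
total = 0.0
for W in Ws:
    A = Mq + Lchol @ rng.normal(size=(Dx, Dy)) @ np.linalg.cholesky(np.linalg.inv(W)).T
    e = y - A.T @ x
    total += 0.5 * (Dy * np.log(2 * np.pi) - np.linalg.slogdet(W)[1] + e @ W @ e)

print(total / n, mnw_gaussian_cross_entropy(y, x, Mq, Lq, Oq, nq))  # should roughly agree
```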

References

1. Nisslbeck, T.N.; Kouw, W.M. Online Bayesian system identification in multivariate autoregressive models via message passing. (accepted). In Proceedings of the European Control Conference, Thessaloniki, Greece, 24–27 June 2025; IEEE: New York, NY, USA, 2025.
2. Tiao, G.C.; Zellner, A. On the Bayesian estimation of multivariate regression. J. R. Stat. Soc. Ser. B 1964, 26, 277–285.
3. Hannan, E.J.; McDougall, A.; Poskitt, D.S. Recursive estimation of autoregressions. J. R. Stat. Soc. Ser. B 1989, 51, 217–233.
4. Karlsson, S. Forecasting with Bayesian vector autoregression. Handb. Econ. Forecast. 2013, 2, 791–897.
5. Nisslbeck, T.N.; Kouw, W.M. Coupled autoregressive active inference agents for control of multi-joint dynamical systems. In Proceedings of the International Workshop on Active Inference, Oxford, UK, 9–11 September 2024; Springer: Berlin/Heidelberg, Germany, 2024.
6. Barber, D. Bayesian Reasoning and Machine Learning; Cambridge University Press: Cambridge, UK, 2012.
7. Hecq, A.; Issler, J.V.; Telg, S. Mixed causal–noncausal autoregressions with exogenous regressors. J. Appl. Econom. 2020, 35, 328–343.
8. Penny, W.; Harrison, L. Multivariate autoregressive models. In Statistical Parametric Mapping: The Analysis of Functional Brain Images; Academic Press: Amsterdam, The Netherlands, 2007; pp. 534–540.
9. Shaarawy, S.M.; Ali, S.S. Bayesian identification of multivariate autoregressive processes. Commun. Stat. Methods 2008, 37, 791–802.
10. Chaloner, K.; Verdinelli, I. Bayesian experimental design: A review. Stat. Sci. 1995, 10, 273–304.
11. Williams, G.; Drews, P.; Goldfain, B.; Rehg, J.M.; Theodorou, E.A. Information-theoretic model predictive control: Theory and applications to autonomous driving. IEEE Trans. Robot. 2018, 34, 1603–1622.
12. Kschischang, F.R.; Frey, B.J.; Loeliger, H.A. Factor graphs and the sum-product algorithm. IEEE Trans. Inf. Theory 2001, 47, 498–519.
13. Şenöz, İ.; van de Laar, T.; Bagaev, D.; de Vries, B. Variational message passing and local constraint manipulation in factor graphs. Entropy 2021, 23, 807.
14. Hoffmann, C.; Rostalski, P. Linear optimal control on factor graphs—a message passing perspective. IFAC-PapersOnLine 2017, 50, 6314–6319.
15. Loeliger, H.A.; Dauwels, J.; Hu, J.; Korl, S.; Ping, L.; Kschischang, F.R. The factor graph approach to model-based signal processing. Proc. IEEE 2007, 95, 1295–1322.
16. Cox, M.; van de Laar, T.; de Vries, B. A factor graph approach to automated design of Bayesian signal processing algorithms. Int. J. Approx. Reason. 2019, 104, 185–204.
17. Palmieri, F.A.; Pattipati, K.R.; Di Gennaro, G.; Fioretti, G.; Verolla, F.; Buonanno, A. A unifying view of estimation and control using belief propagation with application to path planning. IEEE Access 2022, 10, 15193–15216.
18. Forney, G.D. Codes on graphs: Normal realizations. IEEE Trans. Inf. Theory 2001, 47, 520–548.
19. Le, F.; Srivatsa, M.; Reddy, K.K.; Roy, K. Using graphical models as explanations in deep neural networks. In Proceedings of the IEEE International Conference on Mobile Ad-Hoc and Smart Systems, Monterey, CA, USA, 4–7 November 2019; pp. 283–289.
20. Lecue, F. On the role of knowledge graphs in explainable AI. Semant. Web 2020, 11, 41–51.
21. Yedidia, J.S.; Freeman, W.T.; Weiss, Y. Bethe free energy, Kikuchi approximations, and belief propagation algorithms. Adv. Neural Inf. Process. Syst. 2001, 13. Available online: https://merl.com/publications/docs/TR2001-16.pdf (accessed on 8 June 2025).
22. Zhang, Y.; Xu, W.; Liu, A.; Lau, V. Message Passing Based Wireless Federated Learning via Analog Message Aggregation. In Proceedings of the IEEE/CIC International Conference on Communications in China, Hangzhou, China, 7–9 August 2024; pp. 2161–2166.
23. Bagaev, D.; de Vries, B. Reactive message passing for scalable Bayesian inference. Sci. Program. 2023, 2023, 6601690.
24. Podusenko, A.; Kouw, W.M.; de Vries, B. Message passing-based inference for time-varying autoregressive models. Entropy 2021, 23, 683.
25. Kouw, W.M.; Podusenko, A.; Koudahl, M.T.; Schoukens, M. Variational message passing for online polynomial NARMAX identification. In Proceedings of the American Control Conference, Atlanta, GA, USA, 8–10 June 2022; IEEE: New York, NY, USA, 2022; pp. 2755–2760.
26. Petersen, K.B.; Pedersen, M.S. The matrix cookbook. Tech. Univ. Den. 2008, 7, 510.
27. Soch, J.; Allefeld, C.; Faulkenberry, T.J.; Pavlovic, M.; Petrykowski, K.; Sarıtaş, K.; Balkus, S.; Kipnis, A.; Atze, H.; Martin, O.A. The Book of Statistical Proofs (Version 2023). 2024. Available online: https://zenodo.org/records/10495684 (accessed on 8 June 2025).
28. Gupta, A.K.; Nagar, D.K. Matrix Variate Distributions; Chapman and Hall/CRC: Boca Raton, FL, USA, 2018.
29. Särkkä, S. Bayesian Filtering and Smoothing; Cambridge University Press: London, UK; New York, NY, USA, 2013.
30. Lopes, M.T.; Castello, D.A.; Matt, C.F.T. A Bayesian inference approach to estimate elastic and damping parameters of a structure subjected to vibration tests. In Proceedings of the Inverse Problems, Design and Optimization Symposium, Joao Pessoa, Brazil, 25–27 August 2010.
31. Winn, J.; Bishop, C.M.; Jaakkola, T. Variational message passing. J. Mach. Learn. Res. 2005, 6, 661–694.
32. Dauwels, J.; Korl, S.; Loeliger, H.A. Particle methods as message passing. In Proceedings of the IEEE International Symposium on Information Theory, Seattle, WA, USA, 9–14 July 2006; pp. 2052–2056.
33. Murphy, K.P. Machine Learning: A Probabilistic Perspective; MIT Press: Cambridge, MA, USA, 2012.
34. Smith, R.; Friston, K.J.; Whyte, C.J. A step-by-step tutorial on active inference and its application to empirical data. J. Math. Psychol. 2022, 107, 102632.
35. Blei, D.M.; Kucukelbir, A.; McAuliffe, J.D. Variational inference: A review for statisticians. J. Am. Stat. Assoc. 2017, 112, 859–877.
36. Parr, T.; Pezzulo, G.; Friston, K.J. Active Inference: The Free Energy Principle in Mind, Brain, and Behavior; MIT Press: Cambridge, MA, USA, 2022.
37. Friston, K.; Da Costa, L.; Sajid, N.; Heins, C.; Ueltzhöffer, K.; Pavliotis, G.A.; Parr, T. The free energy principle made simpler but not too simple. Phys. Rep. 2023, 1024, 1–29.
38. Yedidia, J.S.; Freeman, W.T.; Weiss, Y. Constructing free-energy approximations and generalized belief propagation algorithms. IEEE Trans. Inf. Theory 2005, 51, 2282–2312.
39. Proakis, J.G. Digital Signal Processing: Principles Algorithms and Applications; Pearson Education India: Noida, India, 2001.
40. Robertson, D.G.E.; Dowling, J.J. Design and responses of Butterworth and critically damped digital filters. J. Electromyogr. Kinesiol. 2003, 13, 569–573.
41. Smith, J.O. Introduction to Digital Filters: With Audio Applications; Smith, J., Ed.; W3K Publishing: San Francisco, CA, USA, 2008; Volume 2.
42. Zumbahlen, H. (Ed.) Linear Circuit Design Handbook; Newnes: Oxford, UK, 2011.
43. Mello, R.G.; Oliveira, L.F.; Nadal, J. Digital Butterworth filter for subtracting noise from low magnitude surface electromyogram. Comput. Methods Programs Biomed. 2007, 87, 28–35.
44. Damgaard, M.R.; Pedersen, R.; Bak, T. Study of variational inference for flexible distributed probabilistic robotics. Robotics 2022, 11, 38.
45. Tedeschini, B.C.; Brambilla, M.; Nicoli, M. Message passing neural network versus message passing algorithm for cooperative positioning. IEEE Trans. Cogn. Commun. Netw. 2023, 9, 1666–1676.
46. Ta, D.N.; Kobilarov, M.; Dellaert, F. A factor graph approach to estimation and model predictive control on unmanned aerial vehicles. In Proceedings of the International Conference on Unmanned Aircraft Systems, Orlando, FL, USA, 27–30 May 2014; IEEE: New York, NY, USA, 2014; pp. 181–188.
47. Castaldo, F.; Palmieri, F.A. A multi-camera multi-target tracker based on factor graphs. In Proceedings of the IEEE International Symposium on Innovations in Intelligent Systems and Applications, Alberobello, Italy, 23–25 June 2014; IEEE: New York, NY, USA, 2014; pp. 131–137.
48. van Erp, B.; Bagaev, D.; Podusenko, A.; Şenöz, İ.; de Vries, B. Multi-agent trajectory planning with NUV priors. In Proceedings of the American Control Conference, Toronto, ON, Canada, 10–12 July 2024; IEEE: New York, NY, USA, 2024; pp. 2766–2771.
49. Assimakis, N.; Adam, M.; Douladiris, A. Information filter and Kalman filter comparison: Selection of the faster filter. In Proceedings of the Information Engineering, Chongqing, China, 26–28 October 2012; Volume 2, pp. 1–5.
50. Cover, T.M. Elements of Information Theory; John Wiley & Sons: Hoboken, NJ, USA, 1999.
Figure 1. Forney-style factor graph of the MARX model in recursive form. A matrix normal Wishart node sends a prior message (1) to an equality node. A likelihood-based message (2) passes upwards from the MARX likelihood node (dashed box), attached to the observed variables $y_k$, $\bar{y}_{k-1}$, and $\bar{u}_k$. Combining the prior-based and likelihood-based messages at the equality node yields the posterior (message 3). Message 4 is the posterior predictive distribution for the system output.
Figure 2. Heatmap of true system parameter $\tilde{A}$. "X" denotes coefficients generated from a Butterworth filter.
Figure 3. Simulation errors (average RMSE) of all three estimators for the MARX system, with ribbons indicating standard errors.
Figure 4. Log-scale Frobenius norm of the difference between true coefficient matrix $\tilde{A}$ and estimates $A$ of each estimator in a single Monte Carlo run with $T_{\mathrm{train}} = 2^6$ for the MARX system, with ribbons indicating standard errors.
Figure 5. Log-scale Frobenius norm of the difference between true coefficient matrix $\tilde{W}$ and estimates $W$ of each MARX estimator for the MARX system, with ribbons indicating standard errors.
Figure 6. Negative log posterior probability of the true system parameters $\tilde{\Theta}$ under each prior choice for the MARX system (lower is better).
Figure 7. Time series of the estimated noise precision matrix $W$ for the MARX-WI for the MARX system. Ribbons indicate one standard deviation, and horizontal lines denote the true values of $\tilde{W}$.
Figure 8. Top: Heatmap of the final $\tilde{A}$ coefficient matrix parameter estimate by the MARX-WI model. "X" marks selected elements, and the trajectories are shown below. Bottom: Time series of the selected elements of $\tilde{A}$ estimated by MARX-WI, with ribbons indicating one standard deviation. Horizontal lines show the true values of the corresponding elements of $\tilde{A}$.
Figure 9. MARX-WI surprisal (dashed blue line) and its decomposition into accuracy (red line) and complexity (green line) over time for the MARX system.
Figure 10. Entropy of the MARX-WI variational posterior $q(\Theta \mid D_k)$ over time for the MARX system.
Figure 11. Surprisal over time for MARX-WI versus MARX-UI for the MARX system.
Figure 12. Simulation errors (average RMSE) of all three estimators for each validation system, with ribbons indicating standard errors.
Figure 13. Log-scale Frobenius norm of the error between the true coefficient matrix $\tilde{W}$ and its estimates $W$ from each MARX estimator for each validation system. Ribbons represent standard errors.
Figure 14. Time series of $\tilde{W}$ estimates from MARX-WI for each validation system, with ribbons representing one standard deviation. Horizontal lines mark true parameter values.
Figure 15. Surprisal (dashed blue) and its decomposition into accuracy (red) and complexity (green) for MARX-WI over time for each validation system.
Figure 16. Entropy of the MARX-WI model parameters over time for each validation system.
Table 1. Sets of prior parameters used in the experiments.

| Prior | $M_0$ | $\Lambda_0$ | $\Omega_0$ | $\nu_0$ |
| --- | --- | --- | --- | --- |
| Uninformative | $0_{D_x \times D_y}$ | $1 \times 10^{-4} \cdot I_{D_x}$ | $1 \times 10^{-5} \cdot I_{D_y}$ | $D_y + 3$ |
| Weakly informative | $0_{D_x \times D_y}$ | $1 \times 10^{-1} \cdot I_{D_x}$ | $1 \times 10^{-2} \cdot I_{D_y}$ | $D_y + 3$ |