Article

Factor Graph-Based Online Bayesian Identification and Component Evaluation for Multivariate Autoregressive Exogenous Input Models †

Department of Electrical Engineering, Eindhoven University of Technology, 5612 AZ Eindhoven, The Netherlands
* Author to whom correspondence should be addressed.
This paper is an extended version of our paper published in the proceedings of the IEEE European Control Conference, held at Thessaloniki, Greece, 24–27 June 2025.
Entropy 2025, 27(7), 679; https://doi.org/10.3390/e27070679
Submission received: 9 March 2025 / Revised: 5 June 2025 / Accepted: 9 June 2025 / Published: 26 June 2025
(This article belongs to the Special Issue Advances in Probabilistic Machine Learning)

Abstract

We present a Forney-style factor graph representation for the class of multivariate autoregressive models with exogenous inputs, and we propose an online Bayesian parameter-identification procedure based on message passing within this graph. We derive message-update rules for (1) a custom factor node that represents the multivariate autoregressive likelihood function and (2) the matrix normal Wishart distribution over the parameters. The flow of messages reveals how parameter uncertainty propagates into predictive uncertainty over the system outputs and how individual factor nodes and edges contribute to the overall model evidence. We evaluate the message-passing-based procedure on (i) a simulated autoregressive system, demonstrating convergence, and (ii) a benchmark task, demonstrating strong predictive performance.

1. Introduction

Autoregressive models provide a simple yet powerful framework for capturing dynamical systems [1,2,3,4,5]. Multivariate autoregressive models with exogenous inputs (MARX) exhibit a complex dependence structure. Each component of the vector signal evolves as a weighted combination of (i) its own past observations, (ii) other components, and (iii) an exogenous vector-valued input signal [6,7]. This intricate dependence structure generates significant uncertainty in parameter estimation.
Bayesian inference offers a principled approach for quantifying and propagating this uncertainty into predictions for future system outputs [8,9]. Moreover, uncertainty quantification enables the incorporation of information-theoretic quantities into cost functions, which is useful for optimal experimental design and adaptive control [10,11]. Markov Chain Monte Carlo techniques are typically employed to approximate posterior distributions. However, their computational cost makes them impractical for large-scale real-time applications such as online system identification and adaptive control. In contrast, exact and variational inference methods provide full posterior distributions over parameters, thereby enabling robust decision-making under uncertainty [12,13]. This capability is particularly crucial in safety-critical applications, such as robotics, where understanding uncertainty is as important as making accurate predictions.
To address this challenge, we introduce an exact recursive Bayesian estimator that maintains a full posterior distribution and is computationally efficient. Recursive estimators offer a scalable alternative to batch estimators, but they either lack posterior uncertainty over parameters or rely on approximations [3,8]. Shaarawy and Ali proposed an exact recursive Bayesian estimator based on the matrix normal Wishart distribution, demonstrating its effectiveness for system identification [9]. We extend their approach by casting the inference procedure as a message-passing algorithm on a factor graph, thereby improving both computational efficiency and interpretability.
Factor graphs are graphical tools that capture the probabilistic relationships between random variables [14]. Many algorithms, including inference, can be formulated as message passing on a factor graph. Thus, message passing on factor graphs provides a structured and scalable framework for Bayesian inference, offering several key advantages over conventional inference frameworks [15,16,17]. We specifically consider Forney-style factor graphs, for their simplicity and compact visual representation [18]. First, factor graphs offer an intuitive representation of probabilistic models and data flow by depicting distinct probabilistic relationships as separate factor nodes that explicitly capture dependencies between variables [15,17]. This structured representation makes the inference process more interpretable and supports a more flexible model design, contributing to explainable artificial intelligence [19,20]. Second, message passing on factor graphs enables distributed computation by structuring inference into localized update rules at each node [21]. In particular, casting inference as message passing on a factor graph can enable federated learning, which accelerates learning in a multi-agent setting where physically separated agents share likelihood messages for joint parameter estimation [22]. This formulation significantly reduces the computational complexity compared to traditional recursive methods, making real-time Bayesian inference more tractable in large-scale settings [23,24]. Localized updates facilitate the efficient propagation of uncertainty throughout the graph, allowing for the attribution of uncertainty to specific sources, for example, distinguishing between prediction uncertainty arising from the likelihood model versus uncertainty in the inferred parameters. This fine-grained decomposition of uncertainties further enables a novel evaluation of model performance: the negative log-model evidence (surprisal) can be decomposed into contributions from individual nodes and edges in the factor graph. By analyzing how these contributions evolve over time, one gains detailed insights into the learning dynamics during system identification, thus linking model evaluation directly to the underlying probabilistic structure. Lastly, message passing unifies a broad class of algorithms, spanning signal filtering, optimal control, and path planning [14,17,24,25], making it a computationally efficient tool for probabilistic reasoning in large-scale problems. Overall, by leveraging this structured inference technique, our approach not only enhances Bayesian inference for dynamical systems but also yields more interpretable, scalable, and computationally efficient probabilistic machine learning models.
In summary, our key contributions are as follows:
  • We derive a message-passing algorithm for exact recursive Bayesian inference in MARX models, maintaining full posterior distributions while ensuring computational efficiency.
  • We extend the inference framework to predict future system outputs that explicitly account for parameter uncertainty, improving robustness for real-time applications.
  • We introduce a novel model evaluation method by decomposing the negative log-model evidence (surprisal) into contributions from individual nodes and edges in the factor graph, providing insights into uncertainty and learning dynamics.
  • We demonstrate the effectiveness of our approach through empirical evaluations on (i) a synthetic MARX system with known parameters for verification, and (ii) two synthetic dynamical systems with unknown parameters: a double mass-spring-damper system and a nonlinear double pendulum system.
The remainder of this paper is organized as follows. In Section 2, we formally describe the class of discrete-time dynamical systems considered. In Section 3, we present our probabilistic MARX model and its representation using Forney-style factor graphs. In Section 4, we detail the message-passing algorithm for recursive Bayesian inference, including both parameter estimation and predictive inference. In Section 5, we introduce our novel evaluation method based on decomposing surprisal. In Section 6, we demonstrate the effectiveness of our approach on synthetic system identification tasks. In Section 7, we discuss the computational benefits, interpretability, and broader implications of our method. Finally, in Section 8, we conclude this paper.

2. Problem Statement

We consider discrete-time dynamical systems, represented by a state $z_k \in \mathbb{R}^{D_z}$ and driven by a control signal $u_k \in \mathbb{R}^{D_u}$. These systems evolve according to a state transition function $f : \mathbb{R}^{D_z} \times \mathbb{R}^{D_u} \to \mathbb{R}^{D_z}$. At each time step, we observe a noisy measurement $y_k \in \mathbb{R}^{D_y}$ of the state via a measurement function $g : \mathbb{R}^{D_z} \to \mathbb{R}^{D_y}$. This can be expressed as a state–space model of the form
$$z_k = f(z_{k-1}, u_k), \qquad y_k = g(z_k) + e_k,$$
where $e_k \in \mathbb{R}^{D_y}$ is a stochastic disturbance. Our objective is to predict future observations $y_t$ for $t > k$, given future inputs $u_t$, without prior knowledge about the system dynamics.

3. Model Specification

To address the problem defined in Section 2, we propose a probabilistic model that enables recursive learning and prediction of future observations in a partially observed dynamical system. Specifically, we assume that the unknown system can be approximated by a multivariate autoregressive model with exogenous inputs of order $N$, denoted as MARX($N$). Let $y_k \in \mathbb{R}^{D_y}$ denote the $D_y$-dimensional observation at time step $k$. We collect the past $N_y$ outputs into the matrix
$$\bar{y}_{k-1} \triangleq \begin{bmatrix} y_{k-1,1} & y_{k-2,1} & \cdots & y_{k-N_y,1} \\ \vdots & \vdots & & \vdots \\ y_{k-1,D_y} & y_{k-2,D_y} & \cdots & y_{k-N_y,D_y} \end{bmatrix},$$
and, similarly, the most recent $N_u$ control inputs into
$$\bar{u}_{k} \triangleq \begin{bmatrix} u_{k,1} & u_{k-1,1} & \cdots & u_{k-N_u+1,1} \\ \vdots & \vdots & & \vdots \\ u_{k,D_u} & u_{k-1,D_u} & \cdots & u_{k-N_u+1,D_u} \end{bmatrix}.$$
We then reshape both matrices $\bar{y}_{k-1}$ and $\bar{u}_k$ into a single vector $x_k \in \mathbb{R}^{D_x}$, where $D_x = N_y D_y + N_u D_u$:
$$x_k \triangleq \begin{bmatrix} \mathrm{vec}(\bar{y}_{k-1}) \\ \mathrm{vec}(\bar{u}_k) \end{bmatrix},$$
and $\mathrm{vec}(\cdot)$ denotes the column-wise vectorization operator that stacks the columns of a matrix into a single column vector [26]. At the core of our MARX($N$) model is a vector autoregressive process with exogenous inputs, characterized by the following likelihood function:
$$p(y_k \mid \Theta, x_k) = \mathcal{N}(y_k \mid A^\top x_k, W^{-1}) = \sqrt{\frac{|W|}{(2\pi)^{D_y}}}\exp\!\Big(-\tfrac{1}{2}(y_k - A^\top x_k)^\top W (y_k - A^\top x_k)\Big),$$
where the parameters—jointly denoted as $\Theta = (A, W)$—consist of a regression coefficient matrix $A \in \mathbb{R}^{D_x \times D_y}$ and a noise precision matrix $W \in \mathbb{R}_{+}^{D_y \times D_y}$, with $\mathbb{R}_{+}$ denoting the space of positive semi-definite matrices. Each column $A_{:,j}$ specifies how the full memory vector $x_k$ (comprising past outputs and inputs) linearly predicts the $j$th component of the current observation, $y_{k,j}$. In state–space terminology, $A$ captures both the temporal memory and the cross-variable coupling by weighting each lagged signal in $x_k$. The matrix $W$ represents the inverse covariance (precision) of the Gaussian measurement noise: its diagonal entries set the inverse variances for each observed dimension, while the off-diagonals model instantaneous noise correlations between different components of $y_k$.
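For concreteness, the construction of the memory vector in (1) and the evaluation of the likelihood (2) can be sketched in a few lines of Julia. This is an illustrative sketch and not the accompanying repository's implementation; the function names `memory_vector` and `marx_loglikelihood` are ours, and `ys`/`us` are assumed to hold at least $N_y$ past outputs and $N_u$ inputs, ordered from oldest to newest.

```julia
using LinearAlgebra

# Build x_k = [vec(ȳ_{k-1}); vec(ū_k)] from the N_y most recent outputs and the
# N_u most recent inputs. With ys[end] = y_{k-1} and us[end] = u_k, the stacking
# order matches (1): [y_{k-1}; …; y_{k-N_y}; u_k; …; u_{k-N_u+1}].
function memory_vector(ys, us, Ny, Nu)
    ybar = reduce(vcat, reverse(ys[end-Ny+1:end]))
    ubar = reduce(vcat, reverse(us[end-Nu+1:end]))
    return vcat(ybar, ubar)
end

# Log of the Gaussian likelihood N(y_k | Aᵀ x_k, W⁻¹) in (2).
function marx_loglikelihood(y, x, A, W)
    r = y - A' * x
    return 0.5 * logdet(W) - 0.5 * length(y) * log(2π) - 0.5 * dot(r, W * r)
end
```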
For computational convenience (see Section 4.1), we specify our prior distribution over $\Theta$ as a matrix normal Wishart distribution [27]:
$$p(\Theta) = p(A \mid W)\, p(W) = \mathcal{MN}(A \mid M_0, \Lambda_0^{-1}, W^{-1})\,\mathcal{W}(W \mid \Omega_0^{-1}, \nu_0).$$
Here, the coefficient matrix $A$ follows a matrix normal distribution with mean $M_0 \in \mathbb{R}^{D_x \times D_y}$, row covariance $\Lambda_0^{-1} \in \mathbb{R}^{D_x \times D_x}$, and column covariance $W^{-1} \in \mathbb{R}^{D_y \times D_y}$,
$$p(A \mid W) = \mathcal{MN}(A \mid M_0, \Lambda_0^{-1}, W^{-1}) = \sqrt{\frac{|W|^{D_x}|\Lambda_0|^{D_y}}{(2\pi)^{D_x D_y}}}\exp\!\Big(-\tfrac{1}{2}\mathrm{tr}\big[W(A - M_0)^\top \Lambda_0 (A - M_0)\big]\Big),$$
where $\mathrm{tr}(\cdot)$ denotes the trace of a square matrix, i.e., the sum of its diagonal entries [26]. The precision matrix $W$ follows a Wishart distribution with scale matrix $\Omega_0^{-1} \in \mathbb{R}^{D_y \times D_y}$ and degrees of freedom $\nu_0 \in \mathbb{R}$,
$$p(W) = \mathcal{W}(W \mid \Omega_0^{-1}, \nu_0) = \sqrt{\frac{|\Omega_0|^{\nu_0}|W|^{\nu_0 - D_y - 1}}{2^{\nu_0 D_y}}}\frac{1}{\Gamma_{D_y}(\nu_0/2)}\exp\!\Big(-\tfrac{1}{2}\mathrm{tr}\big[W\Omega_0\big]\Big).$$
Here, $\Gamma_{D_y}(\cdot)$ is the multivariate gamma function with dimension $D_y$ [28]. Our goal is to infer the posterior distribution over $A$ and $W$ and subsequently use these parameter posterior distributions to make predictions for future outputs $y_t$.
The chosen prior and likelihood define the following generative model over the joint distribution of observations, inputs, and parameters:
$$p(y_{1:k}, u_{1:k}, \Theta) = p(\Theta)\prod_{i=1}^{k}p(y_i \mid \Theta, x_i).$$
We consider two inference paradigms for parameter estimation [29]. In batch estimation, the full dataset is used to compute the posterior:
$$p(\Theta \mid y_{1:k}, u_{1:k}) \propto p(\Theta)\prod_{i=1}^{k}p(y_i \mid \Theta, x_i).$$
Alternatively, in recursive estimation, the posterior is updated incrementally as new data arrives:
$$p(\Theta \mid y_{1:k}, u_{1:k}) \propto p(\Theta \mid y_{1:k-1}, u_{1:k-1})\, p(y_k \mid \Theta, y_{1:k-1}, u_{1:k}).$$
In this paper, we focus on the recursive formulation, which enables efficient online model updates and is well suited for real-time applications and systems where storing and reprocessing the entire history is infeasible.

Factor Graph

The probabilistic graphical model underlying the recursive formulation is straightforward, consisting of a prior distribution and a likelihood function. Figure 1 presents a Forney-style factor graph in which nodes represent factors, edges denote variables, and each edge connects exactly two nodes [15]. In the graph, time flows from left to right, predictions flow from top to bottom, and corrections flow from bottom to top. The factor node labeled MNW represents the matrix normal Wishart prediction distribution along with its associated prior parameters. The dashed box represents the composite likelihood node, which comprises (i) the concatenation operation described in (1), (ii) the dot-product operation between the regression coefficient matrix $A$ and the memory $x_k$, and (iii) the stochastic disturbance. The equality node connects the parameters $\Theta$ to the likelihood nodes for each time step $k$.

4. Inference

Inference consists of two stages: (i) parameter estimation, where we infer the model parameters from observed outputs $y_k$ (Section 4.1), and (ii) output prediction, where we forecast future outputs $y_t$ for $t > k$, given future system inputs $u_{k+1}$ (Section 4.2).

4.1. Parameter Estimation

We wish to recursively estimate the posterior distribution over the model parameters:
$$p(\Theta \mid \mathcal{D}_k) = \frac{p(y_k \mid \Theta, x_k)}{p(y_k \mid u_k, \mathcal{D}_{k-1})}\, p(\Theta \mid \mathcal{D}_{k-1}),$$
where $\mathcal{D}_k = \{y_i, u_i\}_{i=1}^{k}$ denotes the data up to time $k$. Note that the memory vector $x_k$ is a subset of $\mathcal{D}_{k-1}$. The evidence term in the denominator is
$$p(y_k \mid u_k, \mathcal{D}_{k-1}) = \int p(y_k \mid \Theta, x_k)\, p(\Theta \mid \mathcal{D}_{k-1})\, d\Theta.$$
This evidence term will be discussed in detail in Section 5.
Lemma 1.
Combining the MARX likelihood (2) with a matrix normal Wishart prior distribution over the MARX coefficient matrix $A$ and precision matrix $W$ (3) yields a matrix normal Wishart distribution:
$$p(\Theta \mid \mathcal{D}_k) = \mathcal{MNW}(A, W \mid M_k, \Lambda_k^{-1}, \Omega_k^{-1}, \nu_k),$$
with the following parameter updates:
$$\begin{aligned}
\nu_k &= \nu_{k-1} + 1, \\
\Lambda_k &= \Lambda_{k-1} + x_k x_k^\top, \\
M_k &= (\Lambda_{k-1} + x_k x_k^\top)^{-1}(\Lambda_{k-1}M_{k-1} + x_k y_k^\top), \\
\Omega_k &= \Omega_{k-1} + y_k y_k^\top + M_{k-1}^\top \Lambda_{k-1} M_{k-1} - (\Lambda_{k-1}M_{k-1} + x_k y_k^\top)^\top(\Lambda_{k-1} + x_k x_k^\top)^{-1}(\Lambda_{k-1}M_{k-1} + x_k y_k^\top).
\end{aligned}$$
See Appendix A for the proof. This solution can be cast as a message-passing procedure on a factor graph, allowing distributed computation [15,30].
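As an illustration of how lightweight the resulting update is, the four equations of Lemma 1 translate directly into code. The sketch below is ours (the repository linked in Section 6 contains the reference implementation) and assumes the posterior is stored as the tuple $(M, \Lambda, \Omega, \nu)$.

```julia
using LinearAlgebra

# One recursive update of the matrix normal Wishart posterior (Lemma 1),
# given the memory vector x = x_k and the new observation y = y_k.
function mnw_update(M, Λ, Ω, ν, x, y)
    Λ_new = Λ + x * x'
    Ξ_new = Λ * M + x * y'           # Λ_{k-1} M_{k-1} + x_k y_kᵀ
    M_new = Λ_new \ Ξ_new            # M_k = Λ_k⁻¹ Ξ_k
    Ω_new = Ω + y * y' + M' * Λ * M - Ξ_new' * M_new   # uses M_kᵀ Λ_k M_k = Ξ_kᵀ M_k
    ν_new = ν + 1
    return M_new, Λ_new, Ω_new, ν_new
end
```

The single linear solve dominates the cost of each update, which is consistent with the complexity discussion in Section 7.1.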
In Figure 1, circled messages indicate the information flow between the factor nodes along the edges. The message from the prior node carries the previous posterior belief over $\Theta = (A, W)$:
$$p(\Theta \mid \mathcal{D}_{k-1}) = \mathcal{MNW}(A, W \mid M_{k-1}, \Lambda_{k-1}^{-1}, \Omega_{k-1}^{-1}, \nu_{k-1}).$$
The sum–product message from the composite MARX likelihood towards its parameters is the likelihood function itself, re-expressible as a probability distribution over $\Theta$.
Lemma 2.
The message from the composite MARX likelihood (2) towards its parameters is matrix normal Wishart distributed as follows:
$$p(y_k \mid \Theta, x_k) \propto \mathcal{MNW}(A, W \mid \bar{M}_k, \bar{\Lambda}_k^{-1}, \bar{\Omega}_k^{-1}, \bar{\nu}_k).$$
Its parameters are
$$\bar{\nu}_k = 2 - D_x + D_y, \qquad \bar{\Lambda}_k = x_k x_k^\top, \qquad \bar{M}_k = (x_k x_k^\top)^{-1}x_k y_k^\top, \qquad \bar{\Omega}_k = \mathbf{0}_{D_y \times D_y}.$$
See Appendix B for the proof. Note that the scale matrix is not positive-definite, which implies that this message is an improper distribution. Utilizing improper distributions is not uncommon when messages are intermediate results. For example, in variational and particle-based message passing, the messages are unnormalized and therefore also technically improper distributions [31,32]. However, should one want to visualize this message or convert it to a related distribution, for instance, then the scale matrix can be perturbed with a machine-precision offset (i.e., $\bar{\Omega}_k = 10^{-8}\cdot I_{D_y \times D_y}$).
The outgoing message from the equality node results from multiplying the two incoming messages, i.e., the prior message and the likelihood message [15].
Lemma 3.
Let $p_1$ and $p_2$ be two matrix normal Wishart distributions over the same random variables $\Theta$:
$$p_1(\Theta) = \mathcal{MNW}(A, W \mid M_1, \Lambda_1^{-1}, \Omega_1^{-1}, \nu_1), \qquad p_2(\Theta) = \mathcal{MNW}(A, W \mid M_2, \Lambda_2^{-1}, \Omega_2^{-1}, \nu_2).$$
Their product is proportional to another matrix normal Wishart distribution:
$$p_1(\Theta)\,p_2(\Theta) \propto \mathcal{MNW}(A, W \mid M_3, \Lambda_3^{-1}, \Omega_3^{-1}, \nu_3),$$
and its parameters are combinations of the parameters of $p_1$ and $p_2$:
$$\begin{aligned}
\nu_3 &= \nu_1 + \nu_2 + D_x - D_y - 1, \\
\Lambda_3 &= \Lambda_1 + \Lambda_2, \\
M_3 &= (\Lambda_1 + \Lambda_2)^{-1}(\Lambda_1 M_1 + \Lambda_2 M_2), \\
\Omega_3 &= \Omega_1 + \Omega_2 + M_1^\top\Lambda_1 M_1 + M_2^\top\Lambda_2 M_2 - (\Lambda_1 M_1 + \Lambda_2 M_2)^\top(\Lambda_1 + \Lambda_2)^{-1}(\Lambda_1 M_1 + \Lambda_2 M_2).
\end{aligned}$$
See Appendix C for the proof.
Theorem 1.
The outgoing message from the equality node, given by the product of the incoming prior and likelihood messages, is proportional to the exact recursive posterior distribution:
$$p(\Theta \mid \mathcal{D}_{k-1})\, p(y_k \mid \Theta, x_k) \propto \mathcal{MNW}(A, W \mid M_k, \Lambda_k^{-1}, \Omega_k^{-1}, \nu_k).$$
Proof. 
Combining the parameters of the messages in (6) and (7) according to the product operation in Lemma 3 yields
$$\begin{aligned}
\nu_k &= \nu_{k-1} + \bar{\nu}_k + D_x - D_y - 1 = \nu_{k-1} + 1, \\
\Lambda_k &= \Lambda_{k-1} + \bar{\Lambda}_k = \Lambda_{k-1} + x_k x_k^\top, \\
M_k &= (\Lambda_{k-1} + \bar{\Lambda}_k)^{-1}(\Lambda_{k-1}M_{k-1} + \bar{\Lambda}_k\bar{M}_k) = (\Lambda_{k-1} + x_k x_k^\top)^{-1}(\Lambda_{k-1}M_{k-1} + x_k y_k^\top), \\
\Omega_k &= \Omega_{k-1} + \bar{\Omega}_k + M_{k-1}^\top\Lambda_{k-1}M_{k-1} + \bar{M}_k^\top\bar{\Lambda}_k\bar{M}_k - (\Lambda_{k-1}M_{k-1} + \bar{\Lambda}_k\bar{M}_k)^\top(\Lambda_{k-1} + \bar{\Lambda}_k)^{-1}(\Lambda_{k-1}M_{k-1} + \bar{\Lambda}_k\bar{M}_k) \\
&= \Omega_{k-1} + M_{k-1}^\top\Lambda_{k-1}M_{k-1} + y_k y_k^\top - (\Lambda_{k-1}M_{k-1} + x_k y_k^\top)^\top(\Lambda_{k-1} + x_k x_k^\top)^{-1}(\Lambda_{k-1}M_{k-1} + x_k y_k^\top).
\end{aligned}$$
These match the parameter update rules outlined in Lemma 1. □

4.2. Output Prediction

Predicting future system outputs amounts to computing the posterior predictive distribution, i.e., the marginal distribution of $y_t$ for $t > k$:
$$p(y_t \mid u_t, \mathcal{D}_k) = \int p(y_t \mid \Theta, x_t)\, p(\Theta \mid \mathcal{D}_k)\, d\Theta.$$
We exploit the factorization of the parameter posterior over $(A, W)$ to split this into a marginalization over $A$,
$$p(y_t \mid W, u_t, \mathcal{D}_k) = \int p(y_t \mid \Theta, x_t)\, p(A \mid W, \mathcal{D}_k)\, dA,$$
and a marginalization over $W$:
$$p(y_t \mid u_t, \mathcal{D}_k) = \int p(y_t \mid W, u_t, \mathcal{D}_k)\, p(W \mid \mathcal{D}_k)\, dW.$$
Theorem 2.
Marginalizing the composite MARX likelihood (2) over the matrix normal distribution (4) for $A$ yields a multivariate normal distribution:
$$\int \mathcal{N}(y_t \mid A^\top x_t, W^{-1})\,\mathcal{MN}(A \mid M_k, \Lambda_k^{-1}, W^{-1})\, dA = \mathcal{N}\big(y_t \mid M_k^\top x_t, (\lambda_t W)^{-1}\big),$$
where $\lambda_t \triangleq (1 + x_t^\top \Lambda_k^{-1} x_t)^{-1}$.
See Appendix D for the proof.
Theorem 3.
Marginalizing a multivariate normal distribution over a Wishart distribution on its precision parameter yields a multivariate location-scale Student's t-distribution [27]:
$$\int \mathcal{N}\big(y_t \mid M_k^\top x_t, (\lambda_t W)^{-1}\big)\,\mathcal{W}(W \mid \Omega_k^{-1}, \nu_k)\, dW = \mathcal{T}(y_t \mid \mu_t, \Psi_t^{-1}, \eta_t),$$
where $\mu_t \triangleq M_k^\top x_t$, $\eta_t \triangleq \nu_k - D_y + 1$, and $\Psi_t \triangleq \eta_t \Omega_k^{-1}\lambda_t$.
See Appendix E for the proof. The resulting posterior predictive distribution provides a recursive estimate of output uncertainty, which is valuable for decision-making and adaptive control.
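The predictive parameters of Theorems 2 and 3 are cheap to compute once the posterior $(M_k, \Lambda_k, \Omega_k, \nu_k)$ is available. The following sketch is our own (the function name `predictive_params` is illustrative) and returns the location, scale, and degrees of freedom of the Student's t posterior predictive.

```julia
using LinearAlgebra

# Parameters (μ_t, Ψ_t, η_t) of the posterior predictive T(y_t | μ_t, Ψ_t⁻¹, η_t)
# for a memory vector x = x_t, given the posterior (M, Λ, Ω, ν).
function predictive_params(M, Λ, Ω, ν, x)
    Dy = size(Ω, 1)
    λ  = 1 / (1 + dot(x, Λ \ x))    # λ_t = (1 + x_tᵀ Λ_k⁻¹ x_t)⁻¹
    μ  = M' * x                      # μ_t = M_kᵀ x_t
    η  = ν - Dy + 1                  # η_t = ν_k − D_y + 1
    Ψ  = η * λ * inv(Ω)              # Ψ_t = η_t Ω_k⁻¹ λ_t
    return μ, Ψ, η
end

# For η > 2, the predictive mean is μ and the covariance is inv(Ψ) * η / (η - 2).
```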

5. Model Evaluation

A key criterion for probabilistic model evaluation is the negative log-model evidence (or surprisal) $-\log p(y_k)$, which quantifies how surprising the observed data $y_k$ is under the model [33,34]. To gain deeper insights into model performance, we analyze surprisal from the perspective of variational inference on factor graphs. This approach enables us to decompose the overall model score into contributions from the individual nodes and edges of the graph.
Variational inference casts Bayesian inference as an optimization problem by approximating the true posterior p ( Θ | D k ) with a computationally tractable variational posterior q ( Θ | D k ) , chosen from a variational family Q [33,35]. At time k, the optimal variational posterior is obtained by minimizing variational free energy (VFE) [36,37]:
$$q^*(\Theta \mid \mathcal{D}_k) = \arg\min_{q \in \mathcal{Q}}\ \mathcal{F}_{\mathrm{VFE}}\big[q(\Theta \mid \mathcal{D}_k),\, p(y_k, \Theta)\big],$$
where the VFE functional $\mathcal{F}_{\mathrm{VFE}}$ is defined as
$$\mathcal{F}_{\mathrm{VFE}}\big[q(\Theta \mid \mathcal{D}_k),\, p(y_k, \Theta)\big] = \underbrace{D_{KL}\big[q(\Theta \mid \mathcal{D}_k)\,\|\,p(\Theta \mid \mathcal{D}_k)\big]}_{\text{Inference Cost}}\ \underbrace{-\ \log p(y_k \mid u_k, \mathcal{D}_{k-1})}_{\text{Model Evidence}}.$$
In exact inference, where the true posterior is computed via Bayes’ rule, the inference cost becomes zero, and the VFE equals the exact surprisal. When exact inference is intractable, VFE is expressed in a different way. By absorbing the evidence term into the Kullback–Leibler (KL)-divergence, the product of the posterior and the evidence becomes the joint distribution of the generative model, which can be decomposed into a likelihood times prior distribution. This yields the decomposition of free energy into complexity and accuracy terms [37]:
$$D_{KL}\big[q(\Theta \mid \mathcal{D}_k)\,\|\,p(\Theta \mid \mathcal{D}_k)\big] - \log p(y_k \mid u_k, \mathcal{D}_{k-1}) = \mathbb{E}_{q(\Theta \mid \mathcal{D}_k)}\Big[\log\frac{q(\Theta \mid \mathcal{D}_k)}{p(y_k, \Theta \mid \mathcal{D}_{k-1})}\Big] = \underbrace{D_{KL}\big[q(\Theta \mid \mathcal{D}_k)\,\|\,p(\Theta \mid \mathcal{D}_{k-1})\big]}_{\text{Complexity}} + \underbrace{H\big[q(\Theta \mid \mathcal{D}_k),\, p(y_k \mid \Theta, x_k)\big]}_{\text{Accuracy}},$$
where complexity measures how much the variational posterior deviates from the prior, penalizing unnecessary deviations from prior knowledge and controlling overfitting. Accuracy quantifies the model's ability to explain the observed data, expressed as the expected negative log-likelihood under the variational posterior. To refine this decomposition further, we introduce an auxiliary entropy term $H[q(\Theta \mid \mathcal{D}_k)]$ and rewrite (10) as
$$\begin{aligned}
\mathcal{F}_{\mathrm{VFE}}\big[q(\Theta \mid \mathcal{D}_k),\, p(y_k, \Theta)\big] &= D_{KL}\big[q(\Theta \mid \mathcal{D}_k)\,\|\,p(\Theta \mid \mathcal{D}_{k-1})\big] + H\big[q(\Theta \mid \mathcal{D}_k),\, p(y_k \mid \Theta, x_k)\big] - H\big[q(\Theta \mid \mathcal{D}_k)\big] + H\big[q(\Theta \mid \mathcal{D}_k)\big] \\
&= D_{KL}\big[q(\Theta \mid \mathcal{D}_k)\,\|\,p(\Theta \mid \mathcal{D}_{k-1})\big] + D_{KL}\big[q(\Theta \mid \mathcal{D}_k)\,\|\,p(y_k \mid \Theta, x_k)\big] + H\big[q(\Theta \mid \mathcal{D}_k)\big].
\end{aligned}$$
For models formulated as Forney-style factor graphs, inference is performed by optimizing the Bethe Free Energy (BFE), a generalization of VFE, which accounts for the graph’s structure [13,21,38]:
$$\mathcal{F}_{\mathrm{BFE}}\big[q(\Theta \mid \mathcal{D}_k),\, p(y_k, \Theta)\big] \triangleq \sum_{a \in \mathcal{V}} D_{KL}\big[q_a \,\|\, p_a\big] + \sum_{i \in \mathcal{E}} H\big[q_i\big],$$
where $\mathcal{V}$ is the set of factor nodes and $\mathcal{E}$ is the set of edges. In this formulation, each $q_a$ is the local variational belief at node $a$, $p_a$ is the corresponding exact local distribution, and each edge $i$ contributes an entropy term $H[q_i]$. In our recursive MARX model—comprising a MARX likelihood node, a prior node, and an edge for the joint parameters $\Theta$—the BFE decomposition in (12) coincides with the VFE decomposition in (11). Thus, factor graphs enable a fine-grained attribution of surprisal to specific components of the system.

5.1. MARX Model Evidence and Surprisal

To evaluate the model properly, we must compute the model evidence (marginal likelihood), which is the probability of an observed sample marginalized over the parameters, weighted by their prior probabilities. Equation (5) already detailed the evidence term, but it still involved an integral. This integral is identical to the integral for the posterior predictive distribution (8), except that $y_k$ and $u_k$ are observed and the prior parameters are those from time step $k-1$. Concretely,
$$p(y_k \mid u_k, \mathcal{D}_{k-1}) = \int p(y_k \mid \Theta, x_k)\, p(\Theta \mid \mathcal{D}_{k-1})\, d\Theta = \mathcal{T}(y_k \mid m_k, \Psi_k^{-1}, \eta_k) = \sqrt{\frac{|\Psi_k|}{(\eta_k\pi)^{D_y}}}\,\frac{\Gamma_{D_y}\big((\eta_k + D_y)/2\big)}{\Gamma_{D_y}\big((\eta_k + D_y - 1)/2\big)}\Big(1 + \frac{1}{\eta_k}(y_k - m_k)^\top\Psi_k(y_k - m_k)\Big)^{-(\eta_k + D_y)/2},$$
where $m_k = M_{k-1}^\top x_k$, $\eta_k = \nu_{k-1} - D_y + 1$, $\Psi_k = \eta_k\Omega_{k-1}^{-1}\lambda_k$, and $\lambda_k = (1 + x_k^\top\Lambda_{k-1}^{-1}x_k)^{-1}$. Here, $\mathcal{T}(\cdot \mid \mu, \Sigma^{-1}, \nu)$ denotes the multivariate Student's t-distribution with location $\mu$, scale $\Sigma^{-1}$, and degrees of freedom $\nu$. Unlike the posterior predictive distribution, the model evidence is a scalar: higher values indicate that the model better explains the observed data. Hence, the surprisal for our model is
$$-\log p(y_k \mid u_k, \mathcal{D}_{k-1}) = -\tfrac{1}{2}\log|\Psi_k| + \tfrac{D_y}{2}\log(\eta_k\pi) - \log\Gamma_{D_y}\Big(\tfrac{\eta_k + D_y}{2}\Big) + \log\Gamma_{D_y}\Big(\tfrac{\eta_k + D_y - 1}{2}\Big) + \tfrac{\eta_k + D_y}{2}\log\Big(1 + \tfrac{1}{\eta_k}(y_k - m_k)^\top\Psi_k(y_k - m_k)\Big).$$
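To make the evaluation concrete, the surprisal above can be computed directly from the prior-predictive parameters $(m_k, \Psi_k, \eta_k)$. The sketch below is illustrative; `logmvgamma` is a small helper we define here for the multivariate log-gamma function $\log\Gamma_p(\cdot)$.

```julia
using LinearAlgebra, SpecialFunctions

# log Γ_p(a) = (p(p-1)/4) log π + Σ_{j=1}^p log Γ(a + (1-j)/2)
logmvgamma(p, a) = 0.25 * p * (p - 1) * log(π) + sum(loggamma(a + (1 - j) / 2) for j in 1:p)

# Surprisal −log p(y_k | u_k, D_{k-1}) of a new observation y, given the
# prior-predictive parameters m = m_k, Ψ = Ψ_k, η = η_k.
function surprisal(y, m, Ψ, η)
    Dy = length(y)
    quad = dot(y - m, Ψ * (y - m))
    return -0.5 * logdet(Ψ) + 0.5 * Dy * log(η * π) -
           logmvgamma(Dy, (η + Dy) / 2) + logmvgamma(Dy, (η + Dy - 1) / 2) +
           0.5 * (η + Dy) * log(1 + quad / η)
end
```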

5.2. MARX Variational Free Energy

Lemma 4.
Let q and p be two matrix normal Wishart distributions over the same random variables Θ, representing the posterior and prior, respectively:
$$q(\Theta \mid \mathcal{D}_k) = \mathcal{MNW}(\Theta \mid M_k, \Lambda_k^{-1}, \Omega_k^{-1}, \nu_k), \qquad p(\Theta \mid \mathcal{D}_{k-1}) = \mathcal{MNW}(\Theta \mid M_{k-1}, \Lambda_{k-1}^{-1}, \Omega_{k-1}^{-1}, \nu_{k-1}).$$
The differential cross-entropy $H[q(\Theta \mid \mathcal{D}_k), p(\Theta \mid \mathcal{D}_{k-1})]$ of the posterior relative to the prior is
$$\begin{aligned}
H\big[q(\Theta \mid \mathcal{D}_k),\, p(\Theta \mid \mathcal{D}_{k-1})\big] ={}& -\tfrac{1}{2}D_y\log|\Lambda_{k-1}| + \tfrac{1}{2}(\nu_{k-1} + D_x - D_y - 1)\log|\Omega_k| - \tfrac{1}{2}\nu_{k-1}\log|\Omega_{k-1}| \\
&+ \tfrac{1}{2}(D_y + 1)D_y\log 2 + \tfrac{1}{2}D_xD_y\log\pi + \log\Gamma_{D_y}\big(\tfrac{\nu_{k-1}}{2}\big) - \tfrac{1}{2}(\nu_{k-1} + D_x - D_y - 1)\psi_{D_y}\big(\tfrac{\nu_k}{2}\big) \\
&+ \tfrac{1}{2}\nu_k\,\mathrm{tr}\big[\Omega_k^{-1}(M_k - M_{k-1})^\top\Lambda_{k-1}(M_k - M_{k-1})\big] + \tfrac{1}{2}\Big(D_y\,\mathrm{tr}\big(\Lambda_k^{-1}\Lambda_{k-1}\big) + \nu_k\,\mathrm{tr}\big(\Omega_k^{-1}\Omega_{k-1}\big)\Big).
\end{aligned}$$
See Appendix F for the proof.
Lemma 5.
Consider the matrix normal Wishart posterior:
$$q(\Theta \mid \mathcal{D}_k) = \mathcal{MNW}(A, W \mid M_k, \Lambda_k^{-1}, \Omega_k^{-1}, \nu_k).$$
Its (differential) entropy is
$$\begin{aligned}
H\big[q(\Theta \mid \mathcal{D}_k)\big] ={}& -\tfrac{1}{2}D_y\log|\Lambda_k| + \tfrac{1}{2}(D_x - D_y - 1)\log|\Omega_k| + \tfrac{1}{2}(D_y + 1)D_y\log 2 + \tfrac{1}{2}D_xD_y\log\pi \\
&+ \tfrac{1}{2}(D_x + \nu_k)D_y + \log\Gamma_{D_y}\big(\tfrac{\nu_k}{2}\big) - \tfrac{1}{2}(\nu_k + D_x - D_y - 1)\psi_{D_y}\big(\tfrac{\nu_k}{2}\big).
\end{aligned}$$
See Appendix G for the proof.
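For reference, the entropy in Lemma 5 translates directly into code. The sketch below is ours and uses small helpers, `logmvgamma` and `mvdigamma`, for the multivariate log-gamma and digamma functions.

```julia
using LinearAlgebra, SpecialFunctions

logmvgamma(p, a) = 0.25 * p * (p - 1) * log(π) + sum(loggamma(a + (1 - j) / 2) for j in 1:p)
mvdigamma(p, a)  = sum(digamma(a + (1 - j) / 2) for j in 1:p)

# Differential entropy of MNW(A, W | M, Λ⁻¹, Ω⁻¹, ν) as in Lemma 5.
function mnw_entropy(Λ, Ω, ν)
    Dx, Dy = size(Λ, 1), size(Ω, 1)
    return -0.5 * Dy * logdet(Λ) + 0.5 * (Dx - Dy - 1) * logdet(Ω) +
           0.5 * (Dy + 1) * Dy * log(2) + 0.5 * Dx * Dy * log(π) +
           0.5 * (Dx + ν) * Dy + logmvgamma(Dy, ν / 2) -
           0.5 * (ν + Dx - Dy - 1) * mvdigamma(Dy, ν / 2)
end
```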
Lemma 6.
Let q and p be two matrix normal Wishart distributions over the same random variables Θ, representing the posterior and prior, respectively:
$$q(\Theta \mid \mathcal{D}_k) = \mathcal{MNW}(\Theta \mid M_k, \Lambda_k^{-1}, \Omega_k^{-1}, \nu_k), \qquad p(\Theta \mid \mathcal{D}_{k-1}) = \mathcal{MNW}(\Theta \mid M_{k-1}, \Lambda_{k-1}^{-1}, \Omega_{k-1}^{-1}, \nu_{k-1}).$$
The KL-divergence $D_{KL}[q(\Theta \mid \mathcal{D}_k)\,\|\,p(\Theta \mid \mathcal{D}_{k-1})]$ of the posterior from the prior (complexity) is
$$\begin{aligned}
D_{KL}\big[q(\Theta \mid \mathcal{D}_k)\,\|\,p(\Theta \mid \mathcal{D}_{k-1})\big] ={}& \tfrac{1}{2}D_y\log\frac{|\Lambda_k|}{|\Lambda_{k-1}|} + \tfrac{1}{2}\nu_{k-1}\log\frac{|\Omega_k|}{|\Omega_{k-1}|} - \tfrac{1}{2}(D_x + \nu_k)D_y - \log\Gamma_{D_y}\big(\tfrac{\nu_k}{2}\big) + \log\Gamma_{D_y}\big(\tfrac{\nu_{k-1}}{2}\big) + \tfrac{1}{2}(\nu_k - \nu_{k-1})\psi_{D_y}\big(\tfrac{\nu_k}{2}\big) \\
&+ \tfrac{1}{2}\nu_k\,\mathrm{tr}\big[\Omega_k^{-1}(M_k - M_{k-1})^\top\Lambda_{k-1}(M_k - M_{k-1})\big] + \tfrac{1}{2}\Big(D_y\,\mathrm{tr}\big(\Lambda_k^{-1}\Lambda_{k-1}\big) + \nu_k\,\mathrm{tr}\big(\Omega_k^{-1}\Omega_{k-1}\big)\Big).
\end{aligned}$$
See Appendix H for the proof.
Lemma 7.
Consider a matrix normal Wishart distribution q and a multivariate normal distribution p, representing the posterior and MARX likelihood:
$$q(\Theta \mid \mathcal{D}_k) = \mathcal{MNW}(A, W \mid M_k, \Lambda_k^{-1}, \Omega_k^{-1}, \nu_k), \qquad p(y_k \mid \Theta, x_k) = \mathcal{N}(y_k \mid A^\top x_k, W^{-1}).$$
The differential cross-entropy $H[q(\Theta \mid \mathcal{D}_k), p(y_k \mid \Theta, x_k)]$ of the posterior relative to the likelihood (accuracy) is
$$H\big[q(\Theta \mid \mathcal{D}_k),\, p(y_k \mid \Theta, x_k)\big] = -\tfrac{1}{2}\psi_{D_y}\big(\tfrac{\nu_k}{2}\big) + \tfrac{1}{2}\log|\Omega_k| + \tfrac{1}{2}D_y\log\pi + \tfrac{1}{2}\nu_k(y_k - M_k^\top x_k)^\top\Omega_k^{-1}(y_k - M_k^\top x_k) + \tfrac{1}{2}x_k^\top\Lambda_k^{-1}x_k\, D_y.$$
See Appendix I for the proof.
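Putting Lemmas 6 and 7 together gives the complexity and accuracy terms tracked in the experiments (Figure 9). The sketch below is our own rendering of those two expressions, assuming the posterior $(M, \Lambda, \Omega, \nu)$, the prior $(M_0, \Lambda_0, \Omega_0, \nu_0)$, and the same multivariate log-gamma and digamma helpers as in the previous sketch (redefined here for self-containedness).

```julia
using LinearAlgebra, SpecialFunctions

logmvgamma(p, a) = 0.25 * p * (p - 1) * log(π) + sum(loggamma(a + (1 - j) / 2) for j in 1:p)
mvdigamma(p, a)  = sum(digamma(a + (1 - j) / 2) for j in 1:p)

# Complexity: KL-divergence of the posterior from the prior (Lemma 6).
function complexity(M, Λ, Ω, ν, M0, Λ0, Ω0, ν0)
    Dx, Dy = size(M)
    ΔM = M - M0
    return 0.5 * Dy * (logdet(Λ) - logdet(Λ0)) +
           0.5 * ν0 * (logdet(Ω) - logdet(Ω0)) -
           0.5 * (Dx + ν) * Dy -
           logmvgamma(Dy, ν / 2) + logmvgamma(Dy, ν0 / 2) +
           0.5 * (ν - ν0) * mvdigamma(Dy, ν / 2) +
           0.5 * ν * tr(Ω \ (ΔM' * Λ0 * ΔM)) +
           0.5 * (Dy * tr(Λ \ Λ0) + ν * tr(Ω \ Ω0))
end

# Accuracy: cross-entropy of the posterior relative to the likelihood (Lemma 7).
function accuracy(M, Λ, Ω, ν, y, x)
    Dy = length(y)
    r = y - M' * x
    return -0.5 * mvdigamma(Dy, ν / 2) + 0.5 * logdet(Ω) + 0.5 * Dy * log(π) +
           0.5 * ν * dot(r, Ω \ r) + 0.5 * Dy * dot(x, Λ \ x)
end
```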

6. Experiments

We conducted three experiments: one verification experiment and two validation experiments (Code: https://github.com/biaslab/MDPI2025-MARX, accessed on 8 March 2025). In the verification experiment (Section 6.2), we tested whether the MARX estimator could identify a dynamical system with known parameters. In the validation experiments (Section 6.3), we assessed the estimator's performance on two complex dynamical systems with unknown parameters: a linear double mass-spring-damper system and a nonlinear double pendulum. In all the experiments, we compared the performance of the MARX estimator to a baseline approach.

6.1. Baseline Estimator

We compare against a recursive least squares (RLS) estimator [3]. Let $\hat{A}_k$ be a point estimate of the coefficient matrix based on the previous $k$ data points, and let $P_0 = I_{D_x}$ be an initial inverse sample covariance matrix. These matrices are updated at each time step according to
$$\begin{aligned}
P_k &= P_{k-1} - P_{k-1}x_k\big(1 + x_k^\top P_{k-1}x_k\big)^{-1}x_k^\top P_{k-1}, \\
\hat{A}_k &= \hat{A}_{k-1} + P_{k-1}x_k\big(1 + x_k^\top P_{k-1}x_k\big)^{-1}\big(y_k - \hat{A}_{k-1}^\top x_k\big)^\top.
\end{aligned}$$
Note that this formulation corresponds to a forgetting factor of $1.0$, meaning that older data points are not down-weighted. The system outputs are predicted with $\hat{y}_t = \hat{A}_k^\top x_t$.
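A minimal sketch of this baseline, under the same notation ($\hat{A}$ is $D_x \times D_y$ and $P$ is $D_x \times D_x$), is given below; the function name is illustrative.

```julia
using LinearAlgebra

# One recursive least squares update with forgetting factor 1.0.
function rls_update(Â, P, x, y)
    g = P * x / (1 + dot(x, P * x))    # gain vector
    P_new = P - g * (x' * P)
    Â_new = Â + g * (y - Â' * x)'
    return Â_new, P_new
end

# One-step prediction: ŷ_t = Â' * x_t
```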

6.2. Verification

We perform a verification experiment on a MARX system with state $z_k = x_k$ (1), memory sizes $N_y = 2$, $N_u = 3$, and dimensions $D_y = D_u = 2$. The system has true parameters $\tilde{\Theta} = (\tilde{A}, \tilde{W})$. It evolves according to $g(f(x_k)) = \tilde{A}^\top x_k$, where $\tilde{A}$ is the known coefficient matrix (see Figure 2). For each output dimension $i$, the lag-dependent coefficients were generated using a Butterworth low-pass filter (cutoff frequency 20 Hz) applied to that same dimension, while cross-dimensional coefficients were sampled from $\mathcal{N}(0, 0.1^2)$ [39]. We chose the Butterworth filter because its maximally flat response in the passband ensures that signals below the cutoff frequency are transmitted with little distortion while attenuating higher-frequency components [40]. This makes it suitable for generating stable linear dynamics and mimicking the low-pass behavior often observed in physical dynamical systems—such as mechanical or electrical processes [41,42]—and is common in applications like audio and biomedical signal processing [41,43]. The disturbance follows $e_k \sim \mathcal{N}(0, \tilde{W}^{-1})$ with precision matrix $\tilde{W} = \begin{bmatrix} 300 & 100 \\ 100 & 200 \end{bmatrix}$.
We evaluated each estimator for training sizes $T_{\text{train}} \in \{2^l \mid l \in \{2, 3, 4, 5, 6\}\}$, using Monte Carlo experiments with $N_{MC} = 100$ runs. To learn the parameters, each estimator uses $T_{\text{train}}$ state transitions, starting from state $z_0 = \mathbf{0}_{D_z}$. After training, each estimator is tested for $T_{\text{test}} = 100$ time steps, again starting from $z_0$ but with different control signals. For the MARX estimator, we compare two priors (see Table 1): uninformative (MARX-UI) and weakly informative (MARX-WI). The uninformative prior uses small precision values for $\Lambda_0$ and $\Omega_0$, corresponding to large prior variances that reflect minimal prior belief about the parameters. The weakly informative prior assigns higher precision (lower variance), introducing a mild preference for more stable parameter values while still letting the data dominate. In both cases, the degrees of freedom $\nu_0$ are kept minimal at $D_y + 3$, just above the threshold for the Wishart distribution to be well defined, further reinforcing the limited informativeness of the prior. The weakly informative prior also encodes approximate prior knowledge about the observation noise. Specifically, the Wishart component $p(W)$ has a mode at $\nu_0\Omega_0^{-1} = \begin{bmatrix} 500 & 0 \\ 0 & 500 \end{bmatrix}$, which is of similar magnitude to the true noise precision $\tilde{W}$. In contrast, the uninformative prior sets $\Omega_0$ to much larger values, placing its mode far from the true noise characteristics. Thus, the weakly informative prior softly incorporates domain knowledge about expected noise levels, improving convergence and stability in the early stages of recursive estimation. For each training size, we calculate the root mean squared error (RMSE),
$$\mathrm{RMSE} = \sqrt{\frac{1}{T_{\text{test}}}\sum_{k=1}^{T_{\text{test}}}\big(\hat{y}_k - y_k\big)^2},$$
between the predicted output $\hat{y}_k$, i.e., the mean of the posterior predictive $p(y_k \mid u_k, \mathcal{D}_{k-1})$, and the true output $y_k$ for all $k \le T_{\text{test}}$ evaluation steps.
Figure 3 shows the simulation errors for MARX-UI, MARX-WI, and RLS as a function of the training size. For small sample sizes, MARX-WI consistently outperforms RLS, while MARX-UI performs slightly worse. All three estimators converge to the same performance level as the training size increases.
Figure 4 focuses on a single Monte Carlo experiment with $T_{\text{train}} = 2^6$. It plots $\log(\lVert \tilde{A} - A \rVert_F)$, the log of the Frobenius norm between the true coefficient matrix $\tilde{A}$ and each estimate $A$. MARX-WI consistently yields better estimates of $\tilde{A}$ than MARX-UI and RLS. Although MARX-UI struggles during the first 25 time steps, it eventually produces a more accurate estimate of $\tilde{A}$ compared to RLS.
Unlike RLS, the MARX estimator also estimates the noise precision matrix $W$. Figure 5 shows $\log(\lVert \tilde{W} - W \rVert_F)$ for both MARX-WI and MARX-UI. MARX-WI consistently achieves more accurate estimates of $\tilde{W}$ than MARX-UI.
Figure 6 plots the negative log posterior probability of the true parameters $\tilde{\Theta}$ (lower is better), showing that the posterior concentrates sharply on the true values. As a probabilistic estimator, MARX also quantifies uncertainty in its estimates of $\tilde{A}$ and $\tilde{W}$ via the posterior precision (or scale) parameters. Figure 7 illustrates the evolution of MARX-WI's estimates of $W$ for a single run with $T_{\text{train}} = 2^6$. The ribbon represents one standard deviation around the mean. Initially, MARX-WI exhibits high uncertainty (large variance), which generally decreases over time. Because $\tilde{W}$ and $W$ are symmetric, only the upper-triangular elements are shown.
Figure 8 (top) shows a heatmap of the difference $A - \tilde{A}$. To save space, we plot only a subset of the elements of $A$, marked by "X". This subset includes the elements with the largest estimation errors and two randomly selected elements. Figure 8 (bottom) shows the evolution of these selected elements for the same Monte Carlo experiment run, with ribbons indicating one standard deviation around each mean estimate.
Furthermore, we apply the model score decomposition from Section 5 to evaluate our recursive MARX model. By tracking how surprisal and its constituent terms evolve, we obtain fine-grained insights into the model’s learning dynamics and uncertainty reduction. We can recall from (10) that surprisal decomposes into an accuracy term—given by the cross-entropy of the variational posterior relative to the likelihood, reflecting data fit—and a complexity term—given by the KL-divergence of the variational posterior from the prior, quantifying deviation from prior beliefs. Figure 9 illustrates this decomposition. In the early stages of model training, the complexity term (green) dominates overall surprisal (dashed blue), indicating substantial updates from the prior as the model learns the system parameters. As training progresses and the posterior stabilizes, the complexity term diminishes, and the accuracy term (red) becomes the main source of uncertainty. Spikes in overall surprisal during later stages align with spikes in the accuracy term, which we interpret as indicators of measurement outliers that temporarily degrade model fit.
Figure 10 complements this analysis by plotting the entropy of the variational posterior $q(\Theta \mid \mathcal{D}_k)$ over time. This highlights how quickly the inference procedure narrows the parameter space, providing insight into convergence speed and residual uncertainty in the model parameters.
We also demonstrate model evaluation using model evidence. Figure 11 shows the evolution of surprisal (lower is better) over time for MARX-WI and MARX-UI. This plot highlights that the prior choice matters only initially; with sufficient data, MARX-WI and MARX-UI converge to the same performance.

6.3. Validation

To evaluate the proposed method, we perform validation experiments on two distinct mechanical systems: a linear double mass-spring-damper system and a nonlinear double pendulum system. These testbeds span a range of dynamical complexity and are standard benchmarks for modeling and control tasks. Despite their differences, both systems share a common formulation as second-order dynamical systems expressed in first-order ODE form:
$$I_k\ddot{z}_k = F(z_k, \dot{z}_k, u_k),$$
where $z_k$ denotes the generalized coordinates, $\dot{z}_k$ and $\ddot{z}_k$ are the first and second time derivatives of $z_k$, $u_k$ are the control inputs, $I_k$ is a (state-dependent) generalized inertia matrix, and $F$ encodes the system-specific generalized forces (including passive dynamics and external control inputs). Time evolution is performed using a forward Euler integrator with a system-specific time step $\Delta t$:
$$z_{k+1} = z_k + \Delta t\,\dot{z}_k \qquad \text{and} \qquad \dot{z}_{k+1} = \dot{z}_k + \Delta t\,\ddot{z}_k.$$
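A minimal sketch of this integrator, assuming a user-supplied force function `F(z, zd, u)` and inertia function `Imat(z)` (names are ours), is:

```julia
# Forward Euler step for I(z) z̈ = F(z, ż, u): returns (z_{k+1}, ż_{k+1}).
function euler_step(z, zd, u, F, Imat, Δt)
    zdd = Imat(z) \ F(z, zd, u)
    return z + Δt * zd, zd + Δt * zdd
end
```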
For both validation systems, we choose a disturbance $e_k \sim \mathcal{N}(0, \tilde{W}^{-1})$ with a precision matrix $\tilde{W} = \begin{bmatrix} 2000 & 1000 \\ 1000 & 2000 \end{bmatrix}$. The validation experiments follow the same procedure as the verification experiment: we perform Monte Carlo experiments with $N_{MC} = 100$ runs and $\Delta t = 0.05$, in which each estimator has $T_{\text{train}} \in \{2^l \mid l \in \{2, 3, 4, 5, 6\}\}$ state transitions to learn the parameters (starting from state $z_0 = \mathbf{0}_{D_z}$), and we test each estimator with $T_{\text{test}} = 100$ transitions. However, we increase the memory sizes of the MARX model to $N_y = N_u = 5$.
In the following, we describe each validation system individually, and then present the combined validation results.

6.3.1. Linear System: Double Mass-Spring-Damper

The linear system consists of two masses: $m_1 = 1.0$ kg, connected to a fixed base by a spring and damper with stiffness $k_1 = 0.99$ and damping $c_1 = 0.4$, and $m_2 = 2.0$ kg, connected to $m_1$ via a second spring and damper with $k_2 = 0.8$ and $c_2 = 0.4$. The generalized coordinates $z_k \in \mathbb{R}^2$ represent the displacements of each mass from the equilibrium, and the generalized inertia matrix is constant: $I_k = \mathrm{diag}(m_1, m_2)$, where $\mathrm{diag}(\cdot)$ denotes a diagonal matrix with the given entries [26]. The generalized force function $F$ combines the internal spring and damping forces with external inputs:
$$F(z_k, \dot{z}_k, u_k) = Kz_k + C\dot{z}_k + u_k,$$
with the stiffness and damping matrices
$$K = \begin{bmatrix} -(k_1 + k_2) & k_2 \\ k_2 & -k_2 \end{bmatrix}, \qquad C = \begin{bmatrix} -(c_1 + c_2) & c_2 \\ c_2 & -c_2 \end{bmatrix}.$$
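Under the parameter values and sign convention written above, the linear system's force function can be sketched as follows (variable and function names are illustrative):

```julia
# Double mass-spring-damper: constant inertia and linear internal forces.
m1, m2 = 1.0, 2.0
k1, k2 = 0.99, 0.8
c1, c2 = 0.4, 0.4

K = [-(k1 + k2) k2; k2 -k2]
C = [-(c1 + c2) c2; c2 -c2]
Imat(z) = [m1 0.0; 0.0 m2]
F(z, zd, u) = K * z + C * zd + u
```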

6.3.2. Nonlinear System: Double Pendulum

The nonlinear system is a planar double pendulum (also called an acrobot) with two links of lengths $l_1 = 1.0$ m and $l_2 = 1.0$ m and masses $m_1 = 1.0$ kg and $m_2 = 1.0$ kg, respectively. The generalized coordinates $z_k \in \mathbb{R}^2$ represent the joint angles, and the generalized inertia matrix is captured implicitly through a structured nonlinear force formulation. The dynamics are governed by gravity and nonlinear velocity coupling, yielding
$$F(z_k, \dot{z}_k, u_k) = \mathrm{diag}\Big(g\big(\tfrac{1}{2}m_1 + m_2\big)l_1,\; \tfrac{1}{2}g\,m_2l_2\Big)\sin(z_k) + J_xV\dot{z}_k^{2} + u_k,$$
where $g$ is the gravitational acceleration, $J_x \triangleq \tfrac{1}{2}m_2l_1l_2$, and $V$ is the nonlinear velocity-coupling matrix:
$$V = \begin{bmatrix} 0 & -\sin(z_{k,1} - z_{k,2}) \\ \sin(z_{k,1} - z_{k,2}) & 0 \end{bmatrix}.$$

6.3.3. Results

As in the verification experiment, Figure 12 shows the simulation errors for MARX-UI, MARX-WI, and RLS for both the double mass-spring-damper system (Figure 12a) and the double pendulum system (Figure 12b). Convergence to stable performance is slower in both systems compared to the verification case. Nevertheless, both MARX variants outperform RLS and converge to similar levels of predictive performance. This confirms that the MARX model generalizes to more complex dynamical systems. As expected, the overall RMSE is higher for the nonlinear double pendulum system. A peak in prediction error is present for MARX-UI, which is more pronounced in the double mass-spring-damper system.
Figure 13 shows $\log(\lVert \tilde{W} - W \rVert_F)$ for both MARX-WI and MARX-UI for the validation systems. Initially, MARX-WI achieves better accuracy and lower variability than MARX-UI. Unlike in the verification setting, MARX-UI improves significantly over time and ultimately approaches similar estimation quality.
Figure 14 illustrates estimates of $\tilde{W}$ by MARX-WI for a single Monte Carlo experiment ($T_{\text{train}} = 2^6$) for both systems. The model struggles with learning and initially shows high uncertainty, followed by a sharp reduction as learning progresses. This reflects the challenge of inferring observation noise structure in nonlinear systems from limited data.
Figure 15 displays the evolution of MARX-WI’s surprisal and its decomposition into accuracy and complexity. The early learning phases show that surprisal reduction is dominated by decreasing model complexity. This trend is more difficult to sustain in the nonlinear system, where complexity remains elevated for longer. Later in training, fluctuations in surprisal are primarily driven by changes in accuracy.
Finally, Figure 16 shows the entropy of the variational posterior $q(\Theta \mid \mathcal{D}_k)$ for each validation system. In both systems, MARX-WI rapidly reduces entropy, indicating fast convergence to informative parameter regions despite the different complexities of the systems.

7. Discussion

The modular nature of the factor graph methodology provides substantial practical advantages. As demonstrated by Loeliger et al. [15], factor graphs facilitate the visual construction of complex algorithms by incorporating, eliminating, or merging established computational units. For example, the MARX model’s factor graph (Figure 1) could be extended to support time-varying parameters by introducing state transition factor nodes between the equality nodes over the parameters [24]. In multi-agent robotics, where sensors and actuators are spread across various platforms, each agent can update its local beliefs through message passing and share only the most informative summaries [44]. This targeted communication reduces bandwidth demands while enabling swift convergence to an accurate global model. Recent research highlights the importance of transmitting informative variational beliefs in multi-agent environments [22,45], facilitating scalable cooperative learning among heterogeneous agents. The resulting computational decentralization opens promising opportunities for federated system identification and coordination in multi-robot systems, especially when subject to privacy or bandwidth constraints [46,47,48].

7.1. Computational Efficiency

The dominant computational cost in our inference algorithm arises from the matrix inversion of $\Lambda$ (4), which scales as $\mathcal{O}(D_x^3)$ in the worst case. We benchmarked the update rule computations on a Julia-based implementation running on an Apple MacBook M1, averaging over 1,000,000 runs. For a state dimension of $D_x = 10$, updating the parameters for a single time step took approximately 2 nanoseconds (excluding garbage collection). Further computational savings are possible by adopting an information filter parameterization, where $\Xi_k$ (A3) is stored instead of $M_k$ (3) [49]. This approach defers the matrix inversion until $M_k$ is explicitly needed, offering an efficiency boost, particularly in high-dimensional or resource-constrained scenarios.
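A sketch of that parameterization (our own, with illustrative names): store the natural statistics $(\Lambda_k, \Xi_k)$, update them additively, and recover $M_k$ only when it is needed.

```julia
using LinearAlgebra

# Information-filter style update: Λ_k = Λ_{k-1} + x xᵀ and Ξ_k = Ξ_{k-1} + x yᵀ,
# since Λ_{k-1} M_{k-1} = Ξ_{k-1}. The solve for M_k is deferred.
info_update(Λ, Ξ, x, y) = (Λ + x * x', Ξ + x * y')

posterior_mean(Λ, Ξ) = Λ \ Ξ    # M_k, computed only on demand
```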

7.2. Limitations

Despite its efficiency and modularity, our method has several limitations. First, it does not support fully Bayesian k-step-ahead predictions. Computing joint posterior predictives over a longer horizon is intractable under the current formulation, as it requires marginalization over a (deeply) nested set of autoregressive coefficients. Second, the model is built on a linear multivariate autoregressive likelihood, which—while computationally efficient—limits its expressiveness. In systems characterized by strong nonlinearities, this assumption can lead to underfitting and reduced predictive performance. Lastly, although we explored both uninformative and weakly informative priors, the model remains sensitive to the choice of prior, particularly in data-scarce regimes or during the early stages of recursive estimation. In these scenarios, poor prior choices can significantly degrade both convergence speed and final performance.

7.3. Future Work

Future work may explore extending the MARX framework to accommodate time-varying parameters by inserting state-transition factors between the equality nodes—analogous to prior work on univariate autoregressive models [24]. Another extension is to utilize the posterior distributions over the parameters to formulate a mutual information-based cost function for input signal design [10].

8. Conclusions

We presented a recursive Bayesian estimation procedure for multivariate autoregressive models with exogenous inputs. The method produces matrix-variate posterior distributions over both the model coefficients and the noise precision, allowing uncertainty to be explicitly propagated into future output predictions. We also demonstrated how these uncertainty estimates enable the analysis of individual factor nodes and edges within the model, making it possible to assess their contributions to the overall model score and to identify potential outliers. The ability to track sources of uncertainty online and evaluate their impact on output predictions is especially valuable for applications such as Bayesian optimal experimental design or information-theoretic adaptive control.

Author Contributions

T.N.N. contributed to the derivations, simulations, experimental results, and writing. W.M.K. contributed to the conception, direction, derivations, software, and writing. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Eindhoven Artificial Intelligence Systems Institute.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

All data in this work is synthetic. For details on how it was simulated, see the accompanying repository at https://github.com/biaslab/MDPI2025-MARX (accessed on 8 March 2025).

Acknowledgments

The authors gratefully acknowledge the support from Albert Podusenko.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
MARX: multivariate autoregressive models with exogenous inputs
MARX-UI: MARX model with uninformative prior
MARX-WI: MARX model with weakly informative prior
VFE: variational free energy
BFE: Bethe free energy
RLS: recursive least squares
RMSE: root mean squared error
ODE: ordinary differential equation
KL: Kullback–Leibler

Appendix A. Parameter Estimation

Proof. 
The functional form of the likelihood is
$$p(y_k \mid \Theta, x_k) \propto \sqrt{|W|}\exp\!\Big(-\tfrac{1}{2}\mathrm{tr}\big[WL_k\big]\Big),$$
where $L_k \triangleq (y_k - A^\top x_k)(y_k - A^\top x_k)^\top$. The prior is
$$p(\Theta \mid \mathcal{D}_{k-1}) \propto \sqrt{|W|^{\nu_{k-1} + \bar{D}}}\exp\!\Big(-\tfrac{1}{2}\mathrm{tr}\big[W(H_{k-1} + \Omega_{k-1})\big]\Big),$$
where $H_{k-1} \triangleq (A - M_{k-1})^\top\Lambda_{k-1}(A - M_{k-1})$ and $\bar{D} \triangleq D_x - D_y - 1$. The posterior is proportional to the likelihood times the prior:
$$p(\Theta \mid \mathcal{D}_k) \propto p(y_k \mid \Theta, x_k)\, p(\Theta \mid \mathcal{D}_{k-1}) \propto \sqrt{|W|^{\nu_{k-1} + 1 + \bar{D}}}\exp\!\Big(-\tfrac{1}{2}\mathrm{tr}\big[W(L_k + H_{k-1} + \Omega_{k-1})\big]\Big).$$
We expand the first terms in the exponent and group them as follows:
$$\begin{aligned}
L_k + H_{k-1} &= y_ky_k^\top - y_kx_k^\top A - A^\top x_ky_k^\top + A^\top x_kx_k^\top A + A^\top\Lambda_{k-1}A - A^\top\Lambda_{k-1}M_{k-1} - M_{k-1}^\top\Lambda_{k-1}A + M_{k-1}^\top\Lambda_{k-1}M_{k-1} \\
&= A^\top(\Lambda_{k-1} + x_kx_k^\top)A - A^\top(x_ky_k^\top + \Lambda_{k-1}M_{k-1}) - (M_{k-1}^\top\Lambda_{k-1} + y_kx_k^\top)A + y_ky_k^\top + M_{k-1}^\top\Lambda_{k-1}M_{k-1}.
\end{aligned}$$
Let $\Lambda_k \triangleq \Lambda_{k-1} + x_kx_k^\top$, $\Xi_k \triangleq x_ky_k^\top + \Lambda_{k-1}M_{k-1}$, and $M_k \triangleq \Lambda_k^{-1}\Xi_k$. Adding and subtracting $\Xi_k^\top\Lambda_k^{-1}\Xi_k$ to (A2) yields
$$\begin{aligned}
L_k + H_{k-1} &= A^\top\Lambda_kA - A^\top\Xi_k - \Xi_k^\top A + \Xi_k^\top\Lambda_k^{-1}\Xi_k - \Xi_k^\top\Lambda_k^{-1}\Xi_k + y_ky_k^\top + M_{k-1}^\top\Lambda_{k-1}M_{k-1} \\
&= (A - \Lambda_k^{-1}\Xi_k)^\top\Lambda_k(A - \Lambda_k^{-1}\Xi_k) - M_k^\top\Lambda_kM_k + y_ky_k^\top + M_{k-1}^\top\Lambda_{k-1}M_{k-1}.
\end{aligned}$$
Plugging the above into (A1), we recognize the functional form of the matrix normal Wishart distribution:
$$\sqrt{|W|^{\nu_k + \bar{D}}}\exp\!\Big(-\tfrac{1}{2}\mathrm{tr}\big[W\big((A - M_k)^\top\Lambda_k(A - M_k) + \Omega_k\big)\big]\Big) \propto \mathcal{MNW}(A, W \mid M_k, \Lambda_k^{-1}, \Omega_k^{-1}, \nu_k),$$
whose parameters are
$$\nu_k = \nu_{k-1} + 1, \qquad \Lambda_k = \Lambda_{k-1} + x_kx_k^\top, \qquad M_k = (\Lambda_{k-1} + x_kx_k^\top)^{-1}(\Lambda_{k-1}M_{k-1} + x_ky_k^\top), \qquad \text{and} \qquad \Omega_k = \Omega_{k-1} + y_ky_k^\top + M_{k-1}^\top\Lambda_{k-1}M_{k-1} - M_k^\top\Lambda_kM_k.$$
This concludes the proof. □

Appendix B. Backwards Message from Likelihood

Proof. 
The MARX likelihood function is
$$p(y_k \mid \Theta, x_k) \propto \sqrt{|W|}\exp\!\Big(-\tfrac{1}{2}\mathrm{tr}\big[WL_k\big]\Big),$$
where the completed square is
$$L_k \triangleq (y_k - A^\top x_k)(y_k - A^\top x_k)^\top = y_ky_k^\top - A^\top x_ky_k^\top - y_kx_k^\top A + A^\top x_kx_k^\top A.$$
Let $\bar{\Lambda}_k \triangleq x_kx_k^\top$, $\bar{\Xi}_k \triangleq x_ky_k^\top$, and $\bar{M}_k = \bar{\Lambda}_k^{-1}\bar{\Xi}_k$. Then adding and subtracting $\bar{\Xi}_k^\top\bar{\Lambda}_k^{-1}\bar{\Xi}_k$ allows us to rewrite the square in terms of $A$:
$$L_k + \bar{\Xi}_k^\top\bar{\Lambda}_k^{-1}\bar{\Xi}_k - \bar{\Xi}_k^\top\bar{\Lambda}_k^{-1}\bar{\Xi}_k = y_ky_k^\top + (A - \bar{M}_k)^\top\bar{\Lambda}_k(A - \bar{M}_k) - \bar{\Xi}_k^\top\bar{\Lambda}_k^{-1}\bar{\Xi}_k.$$
The two remaining terms cancel:
$$y_ky_k^\top - \bar{\Xi}_k^\top\bar{\Lambda}_k^{-1}\bar{\Xi}_k = y_ky_k^\top - y_kx_k^\top(x_kx_k^\top)^{-1}x_ky_k^\top = y_ky_k^\top - y_ky_k^\top = \mathbf{0}_{D_y \times D_y}.$$
If we define $\bar{\nu}_k \triangleq 1 - \bar{D}$ for $\bar{D} = D_x - D_y - 1$ and $\bar{\Omega}_k \triangleq \mathbf{0}_{D_y \times D_y}$, then we may recognize the functional form of a matrix normal Wishart in (A4):
$$p(y_k \mid \Theta, x_k) \propto \sqrt{|W|^{\bar{\nu}_k + \bar{D}}}\exp\!\Big(-\tfrac{1}{2}\mathrm{tr}\big[W\big((A - \bar{M}_k)^\top\bar{\Lambda}_k(A - \bar{M}_k) + \bar{\Omega}_k\big)\big]\Big) \propto \mathcal{MNW}(A, W \mid \bar{M}_k, \bar{\Lambda}_k^{-1}, \bar{\Omega}_k^{-1}, \bar{\nu}_k).$$
This concludes the proof. □

Appendix C. Product of Matrix Normal Wishart Distributions

Proof. 
Let $p_1, p_2$ be two matrix normal Wishart distributions over the same random variables $\Theta$:
$$p_1(\Theta) = \mathcal{MNW}(A, W \mid M_1, \Lambda_1^{-1}, \Omega_1^{-1}, \nu_1), \qquad p_2(\Theta) = \mathcal{MNW}(A, W \mid M_2, \Lambda_2^{-1}, \Omega_2^{-1}, \nu_2).$$
Their product is proportional to
$$p_1(\Theta)\,p_2(\Theta) \propto \sqrt{|W|^{\nu_1 + \bar{D}}}\exp\!\Big(-\tfrac{1}{2}\mathrm{tr}\big[WL_1\big]\Big)\sqrt{|W|^{\nu_2 + \bar{D}}}\exp\!\Big(-\tfrac{1}{2}\mathrm{tr}\big[WL_2\big]\Big) = \sqrt{|W|^{\nu_3 + \bar{D}}}\exp\!\Big(-\tfrac{1}{2}\mathrm{tr}\big[W(L_1 + L_2)\big]\Big)$$
for $\bar{D} \triangleq D_x - D_y - 1$, $\nu_3 \triangleq \nu_1 + \nu_2 + D_x - D_y - 1$, and $L_i \triangleq (A - M_i)^\top\Lambda_i(A - M_i) + \Omega_i$. The sum of the $L_i$ is
$$L_1 + L_2 = A^\top(\Lambda_1 + \Lambda_2)A - A^\top(\Lambda_1M_1 + \Lambda_2M_2) - (M_1^\top\Lambda_1 + M_2^\top\Lambda_2)A + M_1^\top\Lambda_1M_1 + M_2^\top\Lambda_2M_2 + \Omega_1 + \Omega_2.$$
Let $\Lambda_3 \triangleq \Lambda_1 + \Lambda_2$ and $\Theta_3 \triangleq \Lambda_1M_1 + \Lambda_2M_2$. Then,
$$(A - \Lambda_3^{-1}\Theta_3)^\top\Lambda_3(A - \Lambda_3^{-1}\Theta_3) = A^\top\Lambda_3A - A^\top\Theta_3 - \Theta_3^\top A + \Theta_3^\top\Lambda_3^{-1}\Theta_3.$$
Using $M_3 \triangleq \Lambda_3^{-1}\Theta_3$, (A5) can be written as
$$p_1(\Theta)\,p_2(\Theta) \propto \sqrt{|W|^{\nu_3 + \bar{D}}}\exp\!\Big(-\tfrac{1}{2}\mathrm{tr}\big[W\big((A - M_3)^\top\Lambda_3(A - M_3) - \Theta_3^\top\Lambda_3^{-1}\Theta_3 + M_1^\top\Lambda_1M_1 + M_2^\top\Lambda_2M_2 + \Omega_1 + \Omega_2\big)\big]\Big).$$
Note that $\Theta_3^\top\Lambda_3^{-1}\Theta_3 = \Theta_3^\top\Lambda_3^{-1}\Lambda_3\Lambda_3^{-1}\Theta_3 = M_3^\top\Lambda_3M_3$. Let
$$\Omega_3 \triangleq \Omega_1 + \Omega_2 + M_1^\top\Lambda_1M_1 + M_2^\top\Lambda_2M_2 - M_3^\top\Lambda_3M_3.$$
Then (A6) may be recognized as an unnormalized matrix normal Wishart:
$$\sqrt{|W|^{\nu_3 + \bar{D}}}\exp\!\Big(-\tfrac{1}{2}\mathrm{tr}\big[W\big((A - M_3)^\top\Lambda_3(A - M_3) + \Omega_3\big)\big]\Big) \propto \mathcal{MNW}\big(A, W \mid M_3, \Lambda_3^{-1}, \Omega_3^{-1}, \nu_3\big).$$
As such, the product of two matrix normal Wishart distributions is proportional to another matrix normal Wishart distribution. □

Appendix D. Marginalization over A

Proof. 
The marginalization over $A$ is
$$p(y_t \mid W, u_t, \mathcal{D}_k) = \int p(y_t \mid \Theta, x_t)\, p(A \mid W, \mathcal{D}_k)\, dA = \int \mathcal{N}\big(y_t \mid A^\top x_t, W^{-1}\big)\,\mathcal{MN}\big(A \mid M_k, \Lambda_k^{-1}, W^{-1}\big)\, dA = \int \sqrt{\frac{|W|^{D_x + 1}|\Lambda_k|^{D_y}}{(2\pi)^{D_y(1 + D_x)}}}\exp\!\Big(-\tfrac{1}{2}\mathrm{tr}\big[W(L_t + H_k)\big]\Big)\, dA,$$
where the terms inside the trace are
$$L_t \triangleq (y_t - A^\top x_t)(y_t - A^\top x_t)^\top, \qquad H_k \triangleq (A - M_k)^\top\Lambda_k(A - M_k).$$
Expanding $L_t$ and $H_k$ and adding them yields
$$\begin{aligned}
L_t + H_k &= y_ty_t^\top - A^\top x_ty_t^\top - y_tx_t^\top A + A^\top x_tx_t^\top A + A^\top\Lambda_kA - A^\top\Lambda_kM_k - M_k^\top\Lambda_kA + M_k^\top\Lambda_kM_k \\
&= y_ty_t^\top + M_k^\top\Lambda_kM_k + A^\top(\Lambda_k + x_tx_t^\top)A - A^\top(\Lambda_kM_k + x_ty_t^\top) - (\Lambda_kM_k + x_ty_t^\top)^\top A.
\end{aligned}$$
Let $\Lambda_t \triangleq \Lambda_k + x_tx_t^\top$, $\Theta_t \triangleq \Lambda_kM_k + x_ty_t^\top$, and $M_t \triangleq \Lambda_t^{-1}\Theta_t$. Completing the square gives
$$L_t + H_k = (A - M_t)^\top\Lambda_t(A - M_t) - M_t^\top\Lambda_tM_t + y_ty_t^\top + M_k^\top\Lambda_kM_k.$$
Plugging this result into the integral in (A8) gives
$$\int\exp\!\Big(-\tfrac{1}{2}\mathrm{tr}\big[W(L_t + H_k)\big]\Big)\, dA = \exp\!\Big(-\tfrac{1}{2}\mathrm{tr}\big[W\big(y_ty_t^\top + M_k^\top\Lambda_kM_k - M_t^\top\Lambda_tM_t\big)\big]\Big)\int\exp\!\Big(-\tfrac{1}{2}\mathrm{tr}\big[W(A - M_t)^\top\Lambda_t(A - M_t)\big]\Big)\, dA.$$
We can recognize the integrand as the functional form of a matrix normal distribution. Thus, the integral evaluates to its inverse normalization factor:
$$\int\exp\!\Big(-\tfrac{1}{2}\mathrm{tr}\big[W(A - M_t)^\top\Lambda_t(A - M_t)\big]\Big)\, dA = \sqrt{\frac{(2\pi)^{D_yD_x}}{|W|^{D_x}|\Lambda_t|^{D_y}}}.$$
Using this result, the marginalization over $A$ is
$$\int p(y_t \mid \Theta, x_t)\, p(A \mid W, \mathcal{D}_k)\, dA = \sqrt{\frac{|W|}{(2\pi)^{D_y}}\frac{|\Lambda_k|^{D_y}}{|\Lambda_t|^{D_y}}}\exp\!\Big(-\tfrac{1}{2}\mathrm{tr}\big[W\big(y_ty_t^\top + M_k^\top\Lambda_kM_k - M_t^\top\Lambda_tM_t\big)\big]\Big).$$
Note that, under the matrix determinant lemma,
$$|\Lambda_t| = |\Lambda_k + x_tx_t^\top| = |\Lambda_k|\big(1 + x_t^\top\Lambda_k^{-1}x_t\big),$$
which implies that the ratio of determinants is
$$\frac{|\Lambda_k|^{D_y}}{|\Lambda_t|^{D_y}} = \big(1 + x_t^\top\Lambda_k^{-1}x_t\big)^{-D_y}.$$
Let $\lambda_t \triangleq (1 + x_t^\top\Lambda_k^{-1}x_t)^{-1}$. As $W$ is $D_y$-dimensional, $|W|\lambda_t^{D_y} = |W\lambda_t|$. Furthermore, note that
$$M_t^\top\Lambda_tM_t = M_k^\top\Lambda_k(x_tx_t^\top + \Lambda_k)^{-1}\Lambda_kM_k + y_tx_t^\top(x_tx_t^\top + \Lambda_k)^{-1}\Lambda_kM_k + M_k^\top\Lambda_k(x_tx_t^\top + \Lambda_k)^{-1}x_ty_t^\top + y_tx_t^\top(x_tx_t^\top + \Lambda_k)^{-1}x_ty_t^\top.$$
Combining this with the other terms in the trace gives
$$y_ty_t^\top + M_k^\top\Lambda_kM_k - M_t^\top\Lambda_tM_t = M_k^\top\Lambda_k\big(I - (x_tx_t^\top + \Lambda_k)^{-1}\Lambda_k\big)M_k - y_tx_t^\top(x_tx_t^\top + \Lambda_k)^{-1}\Lambda_kM_k - M_k^\top\Lambda_k(x_tx_t^\top + \Lambda_k)^{-1}x_ty_t^\top + y_t\big(1 - x_t^\top(x_tx_t^\top + \Lambda_k)^{-1}x_t\big)y_t^\top.$$
Using the Sherman–Morrison formula, we have
$$1 - x_t^\top(x_tx_t^\top + \Lambda_k)^{-1}x_t = 1 - x_t^\top\Big(\Lambda_k^{-1} - \frac{\Lambda_k^{-1}x_tx_t^\top\Lambda_k^{-1}}{1 + x_t^\top\Lambda_k^{-1}x_t}\Big)x_t = \big(1 + x_t^\top\Lambda_k^{-1}x_t\big)^{-1} = \lambda_t.$$
Another application of Sherman–Morrison yields
$$I - (x_tx_t^\top + \Lambda_k)^{-1}\Lambda_k = \Big[\Lambda_k^{-1} - \Big(\Lambda_k^{-1} - \frac{\Lambda_k^{-1}x_tx_t^\top\Lambda_k^{-1}}{1 + x_t^\top\Lambda_k^{-1}x_t}\Big)\Big]\Lambda_k = \lambda_t\Lambda_k^{-1}x_tx_t^\top.$$
A third Sherman–Morrison gives
$$\Lambda_k(x_tx_t^\top + \Lambda_k)^{-1}x_t = \Lambda_k\Big(\Lambda_k^{-1} - \frac{\Lambda_k^{-1}x_tx_t^\top\Lambda_k^{-1}}{1 + x_t^\top\Lambda_k^{-1}x_t}\Big)x_t = x_t - x_t\frac{x_t^\top\Lambda_k^{-1}x_t}{1 + x_t^\top\Lambda_k^{-1}x_t} = x_t\lambda_t.$$
Using these three simplifications, we have
$$\mathrm{tr}\big[W\big(y_ty_t^\top + M_k^\top\Lambda_kM_k - M_t^\top\Lambda_tM_t\big)\big] = y_t^\top W\lambda_ty_t - y_t^\top W\lambda_tM_k^\top x_t - x_t^\top M_kW\lambda_ty_t + x_t^\top M_kW\lambda_tM_k^\top x_t = (y_t - M_k^\top x_t)^\top W\lambda_t(y_t - M_k^\top x_t).$$
Plugging (A9) into (A8) yields
$$p(y_t \mid W, u_t, \mathcal{D}_k) = \sqrt{\frac{|W\lambda_t|}{(2\pi)^{D_y}}}\exp\!\Big(-\frac{\lambda_t}{2}(y_t - M_k^\top x_t)^\top W(y_t - M_k^\top x_t)\Big) = \mathcal{N}\big(y_t \mid M_k^\top x_t, (W\lambda_t)^{-1}\big).$$
This concludes the proof. □

Appendix E. Marginalization over W

Proof. 
The marginalization over $W$ is
$$\begin{aligned}
&\int \mathcal{N}\big(y_t \mid M_k^\top x_t, (W\lambda_t)^{-1}\big)\,\mathcal{W}\big(W \mid \Omega_k^{-1}, \nu_k\big)\, dW \\
&\quad= \int \sqrt{\frac{|W\lambda_t|}{(2\pi)^{D_y}}}\exp\!\Big(-\frac{\lambda_t}{2}(y_t - M_k^\top x_t)^\top W(y_t - M_k^\top x_t)\Big)\frac{1}{\Gamma_{D_y}(\frac{\nu_k}{2})}\sqrt{\frac{|\Omega_k|^{\nu_k}|W|^{\nu_k - D_y - 1}}{2^{\nu_kD_y}}}\exp\!\Big(-\tfrac{1}{2}\mathrm{tr}\big[W\Omega_k\big]\Big)\, dW \\
&\quad= \frac{1}{\Gamma_{D_y}(\frac{\nu_k}{2})}\sqrt{\frac{|\Omega_k|^{\nu_k}\lambda_t^{D_y}}{(2\pi)^{D_y}2^{\nu_kD_y}}}\int\sqrt{|W|^{\nu_k + 1 - D_y - 1}}\exp\!\Big(-\tfrac{1}{2}\mathrm{tr}\big[W\big(\Omega_k + \lambda_t(y_t - M_k^\top x_t)(y_t - M_k^\top x_t)^\top\big)\big]\Big)\, dW \\
&\quad= \frac{1}{\Gamma_{D_y}(\frac{\nu_k}{2})}\sqrt{\frac{|\Omega_k|^{\nu_k}\lambda_t^{D_y}}{(2\pi)^{D_y}2^{\nu_kD_y}}}\,\Gamma_{D_y}\Big(\frac{\nu_k + 1}{2}\Big)\sqrt{\frac{2^{(\nu_k + 1)D_y}}{\big|\Omega_k + \lambda_t(y_t - M_k^\top x_t)(y_t - M_k^\top x_t)^\top\big|^{\nu_k + 1}}},
\end{aligned}$$
where we made use of the normalization factor of a Wishart distribution. Note that
$$\sqrt{\frac{2^{(\nu_k + 1)D_y}}{(2\pi)^{D_y}2^{\nu_kD_y}}} = \sqrt{\frac{2^{D_y}}{2^{D_y}\pi^{D_y}}} = \sqrt{\frac{1}{\pi^{D_y}}}.$$
Let $\eta_t \triangleq \nu_k - D_y + 1$. Then,
$$\frac{\Gamma_{D_y}\big(\frac{\nu_k + 1}{2}\big)}{\Gamma_{D_y}\big(\frac{\nu_k}{2}\big)} = \frac{\Gamma_{D_y}\big(\frac{\eta_t + D_y}{2}\big)}{\Gamma_{D_y}\big(\frac{\eta_t + D_y - 1}{2}\big)}.$$
The determinants simplify as follows:
$$\sqrt{\frac{|\Omega_k|^{\nu_k}}{\big|\Omega_k + \lambda_t(y_t - M_k^\top x_t)(y_t - M_k^\top x_t)^\top\big|^{\nu_k + 1}}} = \sqrt{\big|\lambda_t(y_t - M_k^\top x_t)(y_t - M_k^\top x_t)^\top\Omega_k^{-1} + I\big|^{-(\nu_k + 1)}\,\big|\Omega_k^{-1}\big|},$$
and then, using the matrix determinant lemma, we have
$$\sqrt{\big|\lambda_t(y_t - M_k^\top x_t)(y_t - M_k^\top x_t)^\top\Omega_k^{-1} + I\big|^{-(\nu_k + 1)}} = \Big((y_t - M_k^\top x_t)^\top\Omega_k^{-1}\lambda_t(y_t - M_k^\top x_t) + 1\Big)^{-(\eta_t + D_y)/2}.$$
With Equations (A11)–(A14), we may write (A10) as
$$\begin{aligned}
\int \mathcal{N}\big(y_t \mid M_k^\top x_t, (W\lambda_t)^{-1}\big)\,\mathcal{W}\big(W \mid \Omega_k^{-1}, \nu_k\big)\, dW
&= \sqrt{\frac{|\Omega_k^{-1}|}{\pi^{D_y}}}\,\frac{\Gamma_{D_y}\big(\frac{\eta_t + D_y}{2}\big)}{\Gamma_{D_y}\big(\frac{\eta_t + D_y - 1}{2}\big)}\sqrt{\lambda_t^{D_y}}\Big(1 + (y_t - M_k^\top x_t)^\top\Omega_k^{-1}\lambda_t(y_t - M_k^\top x_t)\Big)^{-(\eta_t + D_y)/2} \\
&= \sqrt{\frac{|\eta_t\Omega_k^{-1}\lambda_t|}{(\eta_t\pi)^{D_y}}}\,\frac{\Gamma_{D_y}\big((\eta_t + D_y)/2\big)}{\Gamma_{D_y}\big((\eta_t + D_y - 1)/2\big)}\Big(1 + \frac{1}{\eta_t}(y_t - M_k^\top x_t)^\top\eta_t\Omega_k^{-1}\lambda_t(y_t - M_k^\top x_t)\Big)^{-(\eta_t + D_y)/2} \\
&= \mathcal{T}(y_t \mid \mu_t, \Psi_t^{-1}, \eta_t),
\end{aligned}$$
where $\mu_t \triangleq M_k^\top x_t$ and $\Psi_t \triangleq \eta_t\Omega_k^{-1}\lambda_t$. □

Appendix F. Cross-Entropy of a Matrix Normal Wishart Relative to a Matrix Normal Wishart

Proof. 
The functional form of a matrix normal Wishart with general parameters $M$, $\Lambda$, $\Omega$, and $\nu$ is
$$p(\Theta) = \mathcal{MNW}(A, W \mid M, \Lambda^{-1}, \Omega^{-1}, \nu) = \sqrt{\frac{|\Lambda|^{D_y}|\Omega|^{\nu}|W|^{\nu + D_x - D_y - 1}}{2^{(\nu + D_x)D_y}\pi^{D_xD_y}}}\frac{1}{\Gamma_{D_y}(\frac{\nu}{2})}\exp\!\Big(-\tfrac{1}{2}\mathrm{tr}\big[W\big((A - M)^\top\Lambda(A - M) + \Omega\big)\big]\Big).$$
Consider two matrix normal Wishart distributions over the same parameters $\Theta$:
$$q(\Theta) = \mathcal{MNW}(A, W \mid M_q, \Lambda_q^{-1}, \Omega_q^{-1}, \nu_q), \qquad p(\Theta) = \mathcal{MNW}(A, W \mid M_p, \Lambda_p^{-1}, \Omega_p^{-1}, \nu_p).$$
The differential cross-entropy $H[q, p]$ of $q$ relative to $p$ is
$$\begin{aligned}
H[q, p] = \mathbb{E}_{q(\Theta)}\big[-\log p(\Theta)\big] ={}& -\tfrac{1}{2}D_y\log|\Lambda_p| - \tfrac{1}{2}\nu_p\log|\Omega_p| - \tfrac{1}{2}(\nu_p + D_x - D_y - 1)\,\mathbb{E}_q\big[\log|W|\big] \\
&+ \tfrac{1}{2}(\nu_p + D_x)D_y\log 2 + \tfrac{1}{2}D_xD_y\log\pi + \log\Gamma_{D_y}\big(\tfrac{\nu_p}{2}\big) + \tfrac{1}{2}\mathbb{E}_q\Big[\mathrm{tr}\big[W\big((A - M_p)^\top\Lambda_p(A - M_p) + \Omega_p\big)\big]\Big].
\end{aligned}$$
The first expectation is the expectation of a Wishart log-determinant [28]:
$$\mathbb{E}_{q(\Theta)}\big[\log|W|\big] = \mathbb{E}_{q(W)}\mathbb{E}_{q(A \mid W)}\big[\log|W|\big] = \psi_{D_y}\big(\tfrac{\nu_q}{2}\big) + D_y\log 2 - \log|\Omega_q|,$$
where $\psi_{D_y}$ is the multivariate digamma function of dimension $D_y$. For the second expectation, we first define the following expectations:
$$\mathbb{E}_{q(A \mid W)}\big[A\big] = M_q, \qquad \mathbb{E}_{q(A \mid W)}\big[A^\top\big] = M_q^\top, \qquad \mathbb{E}_{q(A \mid W)}\big[A^\top BA\big] = M_q^\top BM_q + \mathrm{tr}\big(\Lambda_q^{-1}B\big)W^{-1}, \qquad \mathbb{E}_{q(W)}\big[W\big] = \nu_q\Omega_q^{-1},$$
for an appropriately dimensioned matrix $B$. Equation (A19) is a property of matrix normal distributions [28]. We apply $\mathbb{E}_q[\mathrm{tr}(\cdot)] = \mathrm{tr}(\mathbb{E}_q[\cdot])$ [27], and make use of the factorization of the matrix normal Wishart:
$$\begin{aligned}
\mathbb{E}_{q(\Theta)}\Big[\mathrm{tr}\big[W\big((A - M_p)^\top\Lambda_p(A - M_p) + \Omega_p\big)\big]\Big] &= \mathrm{tr}\Big[\mathbb{E}_{q(\Theta)}\big[W\big((A - M_p)^\top\Lambda_p(A - M_p) + \Omega_p\big)\big]\Big] \\
&= \mathrm{tr}\Big[\mathbb{E}_{q(W)}\mathbb{E}_{q(A \mid W)}\big[W\big((A - M_p)^\top\Lambda_p(A - M_p) + \Omega_p\big)\big]\Big] \\
&= \mathrm{tr}\Big[\mathbb{E}_{q(W)}\big[W\,\mathbb{E}_{q(A \mid W)}\big[(A - M_p)^\top\Lambda_p(A - M_p)\big] + W\Omega_p\big]\Big].
\end{aligned}$$
We expand the term in the inner expectation and plug in Equations (A17)–(A19):
$$\begin{aligned}
\mathbb{E}_{q(A \mid W)}\big[(A - M_p)^\top\Lambda_p(A - M_p)\big] &= \mathbb{E}_{q(A \mid W)}\big[A^\top\Lambda_pA - A^\top\Lambda_pM_p - M_p^\top\Lambda_pA + M_p^\top\Lambda_pM_p\big] \\
&= M_q^\top\Lambda_pM_q + \mathrm{tr}\big(\Lambda_q^{-1}\Lambda_p\big)W^{-1} - M_q^\top\Lambda_pM_p - M_p^\top\Lambda_pM_q + M_p^\top\Lambda_pM_p \\
&= (M_q - M_p)^\top\Lambda_p(M_q - M_p) + \mathrm{tr}\big(\Lambda_q^{-1}\Lambda_p\big)W^{-1}.
\end{aligned}$$
We plug (A22) into (A21), expand, and use (A20) to resolve the remaining expectations:
$$\begin{aligned}
\mathrm{tr}\Big[\mathbb{E}_{q(W)}\big[W\,\mathbb{E}_{q(A \mid W)}\big[(A - M_p)^\top\Lambda_p(A - M_p)\big] + W\Omega_p\big]\Big]
&= \mathrm{tr}\Big[\mathbb{E}_{q(W)}\big[W(M_q - M_p)^\top\Lambda_p(M_q - M_p) + \mathrm{tr}\big(\Lambda_q^{-1}\Lambda_p\big)WW^{-1} + W\Omega_p\big]\Big] \\
&= \mathrm{tr}\Big[\nu_q\Omega_q^{-1}(M_q - M_p)^\top\Lambda_p(M_q - M_p) + \mathrm{tr}\big(\Lambda_q^{-1}\Lambda_p\big)I_{D_y} + \nu_q\Omega_q^{-1}\Omega_p\Big] \\
&= \nu_q\,\mathrm{tr}\Big[\Omega_q^{-1}(M_q - M_p)^\top\Lambda_p(M_q - M_p)\Big] + D_y\,\mathrm{tr}\big(\Lambda_q^{-1}\Lambda_p\big) + \nu_q\,\mathrm{tr}\big(\Omega_q^{-1}\Omega_p\big).
\end{aligned}$$
We plug (A23) and (A16) into (A15) to yield the differential cross-entropy:
$$\begin{aligned}
H[q, p] ={}& -\tfrac{1}{2}D_y\log|\Lambda_p| - \tfrac{1}{2}\nu_p\log|\Omega_p| - \tfrac{1}{2}(\nu_p + D_x - D_y - 1)\Big(\psi_{D_y}\big(\tfrac{\nu_q}{2}\big) + D_y\log 2 - \log|\Omega_q|\Big) + \tfrac{1}{2}(\nu_p + D_x)D_y\log 2 + \tfrac{1}{2}D_xD_y\log\pi + \log\Gamma_{D_y}\big(\tfrac{\nu_p}{2}\big) \\
&+ \tfrac{1}{2}\nu_q\,\mathrm{tr}\Big[\Omega_q^{-1}(M_q - M_p)^\top\Lambda_p(M_q - M_p)\Big] + \tfrac{1}{2}D_y\,\mathrm{tr}\big(\Lambda_q^{-1}\Lambda_p\big) + \tfrac{1}{2}\nu_q\,\mathrm{tr}\big(\Omega_q^{-1}\Omega_p\big) \\
={}& -\tfrac{1}{2}D_y\log|\Lambda_p| + \tfrac{1}{2}(\nu_p + D_x - D_y - 1)\log|\Omega_q| - \tfrac{1}{2}\nu_p\log|\Omega_p| + \tfrac{1}{2}(D_y + 1)D_y\log 2 + \tfrac{1}{2}D_xD_y\log\pi + \log\Gamma_{D_y}\big(\tfrac{\nu_p}{2}\big) \\
&- \tfrac{1}{2}(\nu_p + D_x - D_y - 1)\psi_{D_y}\big(\tfrac{\nu_q}{2}\big) + \tfrac{1}{2}\nu_q\,\mathrm{tr}\Big[\Omega_q^{-1}(M_q - M_p)^\top\Lambda_p(M_q - M_p)\Big] + \tfrac{1}{2}\Big(D_y\,\mathrm{tr}\big(\Lambda_q^{-1}\Lambda_p\big) + \nu_q\,\mathrm{tr}\big(\Omega_q^{-1}\Omega_p\big)\Big).
\end{aligned}$$
This concludes the proof. □
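For numerical work, for instance when evaluating how individual factors contribute to the model evidence, expression (A24) can be implemented directly and verified by Monte Carlo. The sketch below is a Python illustration with arbitrary parameter values, not the authors' implementation; it samples from $q$ using the standard factorization $W \sim \mathcal{W}(\Omega_q^{-1}, \nu_q)$, $A \mid W \sim \mathcal{MN}(M_q, \Lambda_q^{-1}, W^{-1})$.

```python
# Closed-form cross-entropy H[q, p] between two matrix normal Wishart distributions,
# transcribed from (A24), together with a Monte Carlo estimate of E_q[-log p(A, W)].
# Illustrative sketch with arbitrary parameter values; not the authors' implementation.
import numpy as np
from scipy.special import digamma, multigammaln
from scipy.stats import wishart

def multidigamma(a, d):
    """Multivariate digamma psi_d(a) = sum_{j=1}^{d} psi(a + (1 - j) / 2)."""
    return sum(digamma(a + (1 - j) / 2) for j in range(1, d + 1))

def mnw_logpdf(A, W, M, L, O, nu):
    """log MNW(A, W | M, L^{-1}, O^{-1}, nu): Dx x Dy mean M, Dx x Dx precision L,
    Dy x Dy rate matrix O, and degrees of freedom nu."""
    Dx, Dy = M.shape
    E = A - M
    return (0.5 * Dy * np.linalg.slogdet(L)[1] + 0.5 * nu * np.linalg.slogdet(O)[1]
            + 0.5 * (nu + Dx - Dy - 1) * np.linalg.slogdet(W)[1]
            - 0.5 * (nu + Dx) * Dy * np.log(2) - 0.5 * Dx * Dy * np.log(np.pi)
            - multigammaln(nu / 2, Dy) - 0.5 * np.trace(W @ (E.T @ L @ E + O)))

def mnw_cross_entropy(Mq, Lq, Oq, nq, Mp, Lp, Op, nup):
    """Differential cross-entropy H[q, p] of MNW q relative to MNW p, following (A24)."""
    Dx, Dy = Mq.shape
    dM = Mq - Mp
    return (-0.5 * Dy * np.linalg.slogdet(Lp)[1]
            + 0.5 * (nup + Dx - Dy - 1) * np.linalg.slogdet(Oq)[1]
            - 0.5 * nup * np.linalg.slogdet(Op)[1]
            + 0.5 * (Dy + 1) * Dy * np.log(2) + 0.5 * Dx * Dy * np.log(np.pi)
            + multigammaln(nup / 2, Dy)
            - 0.5 * (nup + Dx - Dy - 1) * multidigamma(nq / 2, Dy)
            + 0.5 * nq * np.trace(np.linalg.solve(Oq, dM.T @ Lp @ dM))
            + 0.5 * Dy * np.trace(np.linalg.solve(Lq, Lp))
            + 0.5 * nq * np.trace(np.linalg.solve(Oq, Op)))

# Arbitrary example parameters (Dx = 4 regressors, Dy = 2 outputs)
rng = np.random.default_rng(2)
Dx, Dy = 4, 2
Mq, Mp = rng.normal(size=(Dx, Dy)), rng.normal(size=(Dx, Dy))
Lq, Lp = np.diag(rng.uniform(0.5, 2.0, Dx)), np.diag(rng.uniform(0.5, 2.0, Dx))
Oq, Op = np.diag(rng.uniform(0.5, 2.0, Dy)), np.diag(rng.uniform(0.5, 2.0, Dy))
nq, nup = Dy + 3.0, Dy + 5.0

# Monte Carlo estimate of E_q[-log p]: W ~ W(Oq^{-1}, nq), A | W ~ MN(Mq, Lq^{-1}, W^{-1})
n = 20_000
Ws = wishart(df=nq, scale=np.linalg.inv(Oq)).rvs(size=n, random_state=rng)
Lchol = np.linalg.cholesky(np.linalg.inv(Lq))       # among-row factor of Lq^{-1}
total = 0.0
for W in Ws:
    Wchol = np.linalg.cholesky(np.linalg.inv(W))    # among-column factor of W^{-1}
    A = Mq + Lchol @ rng.normal(size=(Dx, Dy)) @ Wchol.T
    total -= mnw_logpdf(A, W, Mp, Lp, Op, nup)

print(total / n, mnw_cross_entropy(Mq, Lq, Oq, nq, Mp, Lp, Op, nup))  # should roughly agree
```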

Appendix G. Entropy of a Matrix Normal Wishart

Proof. 
Consider a matrix normal Wishart distribution:
$$
q(\Theta) = \mathrm{MNW}(A, W \mid M_q, \Lambda_q^{-1}, \Omega_q^{-1}, \nu_q).
$$
By definition, the differential entropy $H[q]$ of a distribution $q$ is the special case of the differential cross-entropy $H[q, p]$ in which $p = q$, i.e., $H[q] = H[q, q]$. Substituting the parameters of $q$ for those of $p$ in (A24) from Appendix F yields the entropy:
$$
\begin{aligned}
H[q] = {} & -\tfrac{1}{2} D_y \log |\Lambda_q| + \tfrac{1}{2} (\nu_q + D_x - D_y - 1) \log |\Omega_q| - \tfrac{1}{2} \nu_q \log |\Omega_q| + \tfrac{1}{2} (D_y + 1) D_y \log 2 \\
& + \tfrac{1}{2} D_x D_y \log \pi + \log \Gamma_{D_y}\!\left(\tfrac{\nu_q}{2}\right) - \tfrac{1}{2} (\nu_q + D_x - D_y - 1)\, \psi_{D_y}\!\left(\tfrac{\nu_q}{2}\right) \\
& + \tfrac{1}{2} \nu_q\, \underbrace{\mathrm{tr}\!\left\{ \Omega_q^{-1} (M_q - M_q)^{\top} \Lambda_q (M_q - M_q) \right\}}_{=\,0} + \tfrac{1}{2} D_y\, \underbrace{\mathrm{tr}\!\left( \Lambda_q^{-1} \Lambda_q \right)}_{=\,D_x} + \tfrac{1}{2} \nu_q\, \underbrace{\mathrm{tr}\!\left( \Omega_q^{-1} \Omega_q \right)}_{=\,D_y} \\
= {} & -\tfrac{1}{2} D_y \log |\Lambda_q| + \tfrac{1}{2} (D_x - D_y - 1) \log |\Omega_q| + \tfrac{1}{2} (D_y + 1) D_y \log 2 + \tfrac{1}{2} D_x D_y \log \pi \\
& + \tfrac{1}{2} (D_x + \nu_q) D_y + \log \Gamma_{D_y}\!\left(\tfrac{\nu_q}{2}\right) - \tfrac{1}{2} (\nu_q + D_x - D_y - 1)\, \psi_{D_y}\!\left(\tfrac{\nu_q}{2}\right).
\end{aligned}
$$
This concludes the proof. □
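The closed-form entropy can be cross-checked against the chain-rule decomposition $H[q] = H[W] + \mathbb{E}_{W}[H(A \mid W)]$, which combines the Wishart entropy (available in SciPy) with the entropy of the conditional matrix normal. The Python sketch below is illustrative only, with arbitrary parameter values, and is not the paper's code.

```python
# Closed-form differential entropy of a matrix normal Wishart, transcribed from the
# expression above, checked against the chain rule H[q] = H[W] + E_W[H(A | W)], where
# A | W ~ MN(Mq, Lq^{-1}, W^{-1}) and W ~ Wishart(Oq^{-1}, nq). Illustrative sketch with
# arbitrary parameter values; not the authors' implementation.
import numpy as np
from scipy.special import digamma, multigammaln
from scipy.stats import wishart

def multidigamma(a, d):
    """Multivariate digamma psi_d(a) = sum_{j=1}^{d} psi(a + (1 - j) / 2)."""
    return sum(digamma(a + (1 - j) / 2) for j in range(1, d + 1))

def mnw_entropy(Mq, Lq, Oq, nq):
    """Differential entropy of MNW(A, W | Mq, Lq^{-1}, Oq^{-1}, nq)."""
    Dx, Dy = Mq.shape
    return (-0.5 * Dy * np.linalg.slogdet(Lq)[1]
            + 0.5 * (Dx - Dy - 1) * np.linalg.slogdet(Oq)[1]
            + 0.5 * (Dy + 1) * Dy * np.log(2) + 0.5 * Dx * Dy * np.log(np.pi)
            + 0.5 * (Dx + nq) * Dy + multigammaln(nq / 2, Dy)
            - 0.5 * (nq + Dx - Dy - 1) * multidigamma(nq / 2, Dy))

# Arbitrary example parameters
rng = np.random.default_rng(3)
Dx, Dy = 4, 2
Mq = rng.normal(size=(Dx, Dy))
Lq = np.diag(rng.uniform(0.5, 2.0, Dx))
Oq = np.diag(rng.uniform(0.5, 2.0, Dy))
nq = Dy + 3.0

# Chain-rule check: Wishart entropy plus the expected conditional matrix normal entropy,
# using E[log |W|] = psi_Dy(nq / 2) + Dy log 2 - log |Oq| from (A16).
e_logdet_W = multidigamma(nq / 2, Dy) + Dy * np.log(2) - np.linalg.slogdet(Oq)[1]
h_wishart = wishart(df=nq, scale=np.linalg.inv(Oq)).entropy()
h_conditional = (0.5 * Dx * Dy * np.log(2 * np.pi * np.e)
                 - 0.5 * Dy * np.linalg.slogdet(Lq)[1]
                 - 0.5 * Dx * e_logdet_W)

print(mnw_entropy(Mq, Lq, Oq, nq), h_wishart + h_conditional)  # the two values should match
```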

Appendix H. KL-Divergence of a Matrix Normal Wishart from a Matrix Normal Wishart

Proof. 
Consider two matrix normal Wishart distributions over the same parameters Θ :
$$
q(\Theta) = \mathrm{MNW}(A, W \mid M_q, \Lambda_q^{-1}, \Omega_q^{-1}, \nu_q), \qquad p(\Theta) = \mathrm{MNW}(A, W \mid M_p, \Lambda_p^{-1}, \Omega_p^{-1}, \nu_p).
$$
By definition, the KL-divergence $D_{\mathrm{KL}}[q \,\|\, p]$ of a distribution $q$ from another distribution $p$ is the difference between the differential cross-entropy $H[q, p]$ of $q$ relative to $p$ (A24) and the differential entropy of $q$ (14) [50]:
$$
\begin{aligned}
D_{\mathrm{KL}}[q \,\|\, p] = {} & -\tfrac{1}{2} D_y \log |\Lambda_p| + \tfrac{1}{2} (\nu_p + D_x - D_y - 1) \log |\Omega_q| - \tfrac{1}{2} \nu_p \log |\Omega_p| + \tfrac{1}{2} (D_y + 1) D_y \log 2 \\
& + \tfrac{1}{2} D_x D_y \log \pi + \log \Gamma_{D_y}\!\left(\tfrac{\nu_p}{2}\right) - \tfrac{1}{2} (\nu_p + D_x - D_y - 1)\, \psi_{D_y}\!\left(\tfrac{\nu_q}{2}\right) \\
& + \tfrac{1}{2} \nu_q\, \mathrm{tr}\!\left\{ \Omega_q^{-1} (M_q - M_p)^{\top} \Lambda_p (M_q - M_p) \right\} + \tfrac{1}{2} D_y\, \mathrm{tr}\!\left( \Lambda_q^{-1} \Lambda_p \right) + \tfrac{1}{2} \nu_q\, \mathrm{tr}\!\left( \Omega_q^{-1} \Omega_p \right) \\
& + \tfrac{1}{2} D_y \log |\Lambda_q| - \tfrac{1}{2} (D_x - D_y - 1) \log |\Omega_q| - \tfrac{1}{2} (D_y + 1) D_y \log 2 - \tfrac{1}{2} D_x D_y \log \pi \\
& - \tfrac{1}{2} (D_x + \nu_q) D_y - \log \Gamma_{D_y}\!\left(\tfrac{\nu_q}{2}\right) + \tfrac{1}{2} (\nu_q + D_x - D_y - 1)\, \psi_{D_y}\!\left(\tfrac{\nu_q}{2}\right) \\
= {} & \tfrac{1}{2} D_y \log \frac{|\Lambda_q|}{|\Lambda_p|} + \tfrac{1}{2} \nu_p \log \frac{|\Omega_q|}{|\Omega_p|} - \tfrac{1}{2} (D_x + \nu_q) D_y - \log \Gamma_{D_y}\!\left(\tfrac{\nu_q}{2}\right) + \log \Gamma_{D_y}\!\left(\tfrac{\nu_p}{2}\right) \\
& + \tfrac{1}{2} (\nu_q - \nu_p)\, \psi_{D_y}\!\left(\tfrac{\nu_q}{2}\right) + \tfrac{1}{2} \nu_q\, \mathrm{tr}\!\left\{ \Omega_q^{-1} (M_q - M_p)^{\top} \Lambda_p (M_q - M_p) \right\} \\
& + \tfrac{1}{2} D_y\, \mathrm{tr}\!\left( \Lambda_q^{-1} \Lambda_p \right) + \tfrac{1}{2} \nu_q\, \mathrm{tr}\!\left( \Omega_q^{-1} \Omega_p \right).
\end{aligned}
$$
This concludes the proof. □
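Two quick sanity checks on this expression are that it vanishes when $p = q$ and that it is non-negative otherwise. The Python sketch below, with arbitrary parameter values and not taken from the paper, implements the final expression and runs both checks.

```python
# Closed-form KL divergence between two matrix normal Wishart distributions, transcribed
# from the final expression above. Sanity checks: D_KL[q || q] = 0 and D_KL[q || p] >= 0.
# Illustrative sketch with arbitrary parameter values; not the authors' implementation.
import numpy as np
from scipy.special import digamma, multigammaln

def multidigamma(a, d):
    """Multivariate digamma psi_d(a) = sum_{j=1}^{d} psi(a + (1 - j) / 2)."""
    return sum(digamma(a + (1 - j) / 2) for j in range(1, d + 1))

def mnw_kl(Mq, Lq, Oq, nq, Mp, Lp, Op, nup):
    """D_KL[q || p] for q = MNW(Mq, Lq^{-1}, Oq^{-1}, nq) and p defined analogously."""
    Dx, Dy = Mq.shape
    dM = Mq - Mp
    return (0.5 * Dy * (np.linalg.slogdet(Lq)[1] - np.linalg.slogdet(Lp)[1])
            + 0.5 * nup * (np.linalg.slogdet(Oq)[1] - np.linalg.slogdet(Op)[1])
            - 0.5 * (Dx + nq) * Dy
            - multigammaln(nq / 2, Dy) + multigammaln(nup / 2, Dy)
            + 0.5 * (nq - nup) * multidigamma(nq / 2, Dy)
            + 0.5 * nq * np.trace(np.linalg.solve(Oq, dM.T @ Lp @ dM))
            + 0.5 * Dy * np.trace(np.linalg.solve(Lq, Lp))
            + 0.5 * nq * np.trace(np.linalg.solve(Oq, Op)))

# Arbitrary example parameters
rng = np.random.default_rng(4)
Dx, Dy = 4, 2
Mq = rng.normal(size=(Dx, Dy))
Lq = np.diag(rng.uniform(0.5, 2.0, Dx))
Oq = np.diag(rng.uniform(0.5, 2.0, Dy))
nq = Dy + 3.0

print(mnw_kl(Mq, Lq, Oq, nq, Mq, Lq, Oq, nq))                          # 0 up to rounding
print(mnw_kl(Mq, Lq, Oq, nq, Mq + 0.5, 2.0 * Lq, 0.5 * Oq, nq + 2.0))  # strictly positive
```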

Appendix I. Cross-Entropy of a Matrix Normal Wishart Relative to a Multivariate Normal

Proof. 
Consider a matrix normal Wishart distribution q and a multivariate normal distribution p:
$$
q(\Theta) = \mathrm{MNW}(A, W \mid M_q, \Lambda_q^{-1}, \Omega_q^{-1}, \nu_q), \qquad p(y \mid \Theta, x) = \mathcal{N}(y \mid A^{\top} x, W^{-1}).
$$
The differential cross-entropy $H[q, p]$ of $q$ relative to $p$ is
$$
H[q, p] = \mathbb{E}_{q(\Theta)}\!\left[ -\log p(y \mid \Theta, x) \right] = -\tfrac{1}{2}\, \mathbb{E}_{q(\Theta)}\!\left[ \log |W| \right] + \tfrac{1}{2} D_y \log 2 + \tfrac{1}{2} D_y \log \pi + \tfrac{1}{2}\, \mathbb{E}_{q(\Theta)}\!\left[ (y - A^{\top} x)^{\top} W (y - A^{\top} x) \right]. \tag{A25}
$$
The first expectation is again the expectation of a Wishart log-determinant [28], given in (A16). For the second expectation, we make use of the factorization of the matrix normal Wishart, write the term (a scalar) in trace form, apply $\mathbb{E}_q[\mathrm{tr}(\cdot)] = \mathrm{tr}(\mathbb{E}_q[\cdot])$ [27] together with the cyclic property of the trace, and rearrange as follows:
$$
\begin{aligned}
\mathbb{E}_{q(\Theta)}\!\left[ (y - A^{\top} x)^{\top} W (y - A^{\top} x) \right]
&= \mathbb{E}_{q(W)} \mathbb{E}_{q(A \mid W)}\!\left[ (y - A^{\top} x)^{\top} W (y - A^{\top} x) \right] \\
&= \mathrm{tr}\!\left\{ \mathbb{E}_{q(W)} \mathbb{E}_{q(A \mid W)}\!\left[ W (y - A^{\top} x)(y - A^{\top} x)^{\top} \right] \right\} \\
&= \mathrm{tr}\!\left\{ \mathbb{E}_{q(W)}\!\left[ W\, \mathbb{E}_{q(A \mid W)}\!\left[ (y - A^{\top} x)(y - A^{\top} x)^{\top} \right] \right] \right\}.
\end{aligned} \tag{A26}
$$
We expand the term in the inner expectation and plug in (A17) and (A19) (with $B = x x^{\top}$):
$$
\begin{aligned}
\mathbb{E}_{q(A \mid W)}\!\left[ (y - A^{\top} x)(y - A^{\top} x)^{\top} \right]
&= y y^{\top} - \mathbb{E}_{q(A \mid W)}\!\left[ y x^{\top} A \right] - \mathbb{E}_{q(A \mid W)}\!\left[ A^{\top} x y^{\top} \right] + \mathbb{E}_{q(A \mid W)}\!\left[ A^{\top} x x^{\top} A \right] \\
&= y y^{\top} - y x^{\top}\, \mathbb{E}_{q(A \mid W)}\!\left[ A \right] - \mathbb{E}_{q(A \mid W)}\!\left[ A^{\top} \right] x y^{\top} + \mathbb{E}_{q(A \mid W)}\!\left[ A^{\top} x x^{\top} A \right] \\
&= y y^{\top} - y x^{\top} M_q - M_q^{\top} x y^{\top} + M_q^{\top} x x^{\top} M_q + \mathrm{tr}\!\left( \Lambda_q^{-1} x x^{\top} \right) W^{-1} \\
&= (y - M_q^{\top} x)(y - M_q^{\top} x)^{\top} + \mathrm{tr}\!\left( \Lambda_q^{-1} x x^{\top} \right) W^{-1} \\
&= (y - M_q^{\top} x)(y - M_q^{\top} x)^{\top} + x^{\top} \Lambda_q^{-1} x\, W^{-1}.
\end{aligned} \tag{A27}
$$
Note that all terms appear within the outer trace, so we can apply the cyclic property of the trace, and $x^{\top} \Lambda_q^{-1} x$ is a scalar. We plug (A27) into (A26) and use (A20) to resolve the remaining expectation:
$$
\begin{aligned}
\mathbb{E}_{q(\Theta)}\!\left[ (y - A^{\top} x)^{\top} W (y - A^{\top} x) \right]
&= \mathrm{tr}\!\left\{ \mathbb{E}_{q(W)}\!\left[ W \left( (y - M_q^{\top} x)(y - M_q^{\top} x)^{\top} + x^{\top} \Lambda_q^{-1} x\, W^{-1} \right) \right] \right\} \\
&= \mathrm{tr}\!\left\{ \mathbb{E}_{q(W)}\!\left[ W (y - M_q^{\top} x)(y - M_q^{\top} x)^{\top} + x^{\top} \Lambda_q^{-1} x\, \underbrace{W W^{-1}}_{=\, I_{D_y}} \right] \right\} \\
&= \mathrm{tr}\!\left\{ \mathbb{E}_{q(W)}\!\left[ W (y - M_q^{\top} x)(y - M_q^{\top} x)^{\top} \right] \right\} + x^{\top} \Lambda_q^{-1} x\, \mathrm{tr}\!\left( I_{D_y} \right) \\
&= \mathrm{tr}\!\left\{ \nu_q \Omega_q^{-1} (y - M_q^{\top} x)(y - M_q^{\top} x)^{\top} \right\} + x^{\top} \Lambda_q^{-1} x\, D_y \\
&= \nu_q (y - M_q^{\top} x)^{\top} \Omega_q^{-1} (y - M_q^{\top} x) + x^{\top} \Lambda_q^{-1} x\, D_y.
\end{aligned} \tag{A28}
$$
We plug (A28) and (A16) into (A25) to yield the differential cross-entropy:
$$
H[q, p] = -\tfrac{1}{2}\, \psi_{D_y}\!\left(\tfrac{\nu_q}{2}\right) + \tfrac{1}{2} \log |\Omega_q| + \tfrac{1}{2} D_y \log \pi + \tfrac{1}{2} \nu_q (y - M_q^{\top} x)^{\top} \Omega_q^{-1} (y - M_q^{\top} x) + \tfrac{1}{2} x^{\top} \Lambda_q^{-1} x\, D_y.
$$
This concludes the proof. □
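This expectation can be evaluated in closed form at every time step. The Python sketch below illustrates the expression with arbitrary parameter values, assuming the likelihood parameterization $\mathcal{N}(y \mid A^{\top} x, W^{-1})$ used in the proof; it is not the authors' implementation, and it checks the closed form by Monte Carlo.

```python
# Closed-form E_q[-log N(y | A^T x, W^{-1})] for a matrix normal Wishart q, transcribed
# from the expression above, with a Monte Carlo check that samples (A, W) from q.
# Illustrative sketch with arbitrary parameter values; not the authors' implementation.
import numpy as np
from scipy.special import digamma
from scipy.stats import wishart

def multidigamma(a, d):
    """Multivariate digamma psi_d(a) = sum_{j=1}^{d} psi(a + (1 - j) / 2)."""
    return sum(digamma(a + (1 - j) / 2) for j in range(1, d + 1))

def mnw_gaussian_cross_entropy(y, x, Mq, Lq, Oq, nq):
    """H[q, p] = E_q[-log N(y | A^T x, W^{-1})] for q = MNW(Mq, Lq^{-1}, Oq^{-1}, nq)."""
    Dy = Oq.shape[0]
    r = y - Mq.T @ x                                  # residual with respect to the mean Mq^T x
    return (-0.5 * multidigamma(nq / 2, Dy) + 0.5 * np.linalg.slogdet(Oq)[1]
            + 0.5 * Dy * np.log(np.pi)
            + 0.5 * nq * r @ np.linalg.solve(Oq, r)
            + 0.5 * Dy * (x @ np.linalg.solve(Lq, x)))

# Arbitrary example parameters and data point
rng = np.random.default_rng(5)
Dx, Dy = 4, 2
Mq = rng.normal(size=(Dx, Dy))
Lq = np.diag(rng.uniform(0.5, 2.0, Dx))
Oq = np.diag(rng.uniform(0.5, 2.0, Dy))
nq = Dy + 3.0
x = rng.normal(size=Dx)
y = Mq.T @ x + rng.normal(scale=0.3, size=Dy)

# Monte Carlo check: W ~ W(Oq^{-1}, nq), A | W ~ MN(Mq, Lq^{-1}, W^{-1}),
# then average -log N(y | A^T x, W^{-1}) over the draws.
n = 20_000
Ws = wishart(df=nq, scale=np.linalg.inv(Oq)).rvs(size=n, random_state=rng)
Lchol = np.linalg.cholesky(np.linalg.inv(Lq))
total = 0.0
for W in Ws:
    A = Mq + Lchol @ rng.normal(size=(Dx, Dy)) @ np.linalg.cholesky(np.linalg.inv(W)).T
    e = y - A.T @ x
    total += 0.5 * (Dy * np.log(2 * np.pi) - np.linalg.slogdet(W)[1] + e @ W @ e)

print(total / n, mnw_gaussian_cross_entropy(y, x, Mq, Lq, Oq, nq))  # should roughly agree
```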

References

1. Nisslbeck, T.N.; Kouw, W.M. Online Bayesian system identification in multivariate autoregressive models via message passing. (accepted). In Proceedings of the European Control Conference, Thessaloniki, Greece, 24–27 June 2025; IEEE: New York, NY, USA, 2025.
2. Tiao, G.C.; Zellner, A. On the Bayesian estimation of multivariate regression. J. R. Stat. Soc. Ser. B 1964, 26, 277–285.
3. Hannan, E.J.; McDougall, A.; Poskitt, D.S. Recursive estimation of autoregressions. J. R. Stat. Soc. Ser. B 1989, 51, 217–233.
4. Karlsson, S. Forecasting with Bayesian vector autoregression. Handb. Econ. Forecast. 2013, 2, 791–897.
5. Nisslbeck, T.N.; Kouw, W.M. Coupled autoregressive active inference agents for control of multi-joint dynamical systems. In Proceedings of the International Workshop on Active Inference, Oxford, UK, 9–11 September 2024; Springer: Berlin/Heidelberg, Germany, 2024.
6. Barber, D. Bayesian Reasoning and Machine Learning; Cambridge University Press: Cambridge, UK, 2012.
7. Hecq, A.; Issler, J.V.; Telg, S. Mixed causal–noncausal autoregressions with exogenous regressors. J. Appl. Econom. 2020, 35, 328–343.
8. Penny, W.; Harrison, L. Multivariate autoregressive models. In Statistical Parametric Mapping: The Analysis of Functional Brain Images; Academic Press: Amsterdam, The Netherlands, 2007; pp. 534–540.
9. Shaarawy, S.M.; Ali, S.S. Bayesian identification of multivariate autoregressive processes. Commun. Stat. Methods 2008, 37, 791–802.
10. Chaloner, K.; Verdinelli, I. Bayesian experimental design: A review. Stat. Sci. 1995, 10, 273–304.
11. Williams, G.; Drews, P.; Goldfain, B.; Rehg, J.M.; Theodorou, E.A. Information-theoretic model predictive control: Theory and applications to autonomous driving. IEEE Trans. Robot. 2018, 34, 1603–1622.
12. Kschischang, F.R.; Frey, B.J.; Loeliger, H.A. Factor graphs and the sum-product algorithm. IEEE Trans. Inf. Theory 2001, 47, 498–519.
13. Şenöz, İ.; van de Laar, T.; Bagaev, D.; de Vries, B. Variational message passing and local constraint manipulation in factor graphs. Entropy 2021, 23, 807.
14. Hoffmann, C.; Rostalski, P. Linear optimal control on factor graphs—a message passing perspective. IFAC-PapersOnLine 2017, 50, 6314–6319.
15. Loeliger, H.A.; Dauwels, J.; Hu, J.; Korl, S.; Ping, L.; Kschischang, F.R. The factor graph approach to model-based signal processing. Proc. IEEE 2007, 95, 1295–1322.
16. Cox, M.; van de Laar, T.; de Vries, B. A factor graph approach to automated design of Bayesian signal processing algorithms. Int. J. Approx. Reason. 2019, 104, 185–204.
17. Palmieri, F.A.; Pattipati, K.R.; Di Gennaro, G.; Fioretti, G.; Verolla, F.; Buonanno, A. A unifying view of estimation and control using belief propagation with application to path planning. IEEE Access 2022, 10, 15193–15216.
18. Forney, G.D. Codes on graphs: Normal realizations. IEEE Trans. Inf. Theory 2001, 47, 520–548.
19. Le, F.; Srivatsa, M.; Reddy, K.K.; Roy, K. Using graphical models as explanations in deep neural networks. In Proceedings of the IEEE International Conference on Mobile Ad-Hoc and Smart Systems, Monterey, CA, USA, 4–7 November 2019; pp. 283–289.
20. Lecue, F. On the role of knowledge graphs in explainable AI. Semant. Web 2020, 11, 41–51.
21. Yedidia, J.S.; Freeman, W.T.; Weiss, Y. Bethe free energy, Kikuchi approximations, and belief propagation algorithms. Adv. Neural Inf. Process. Syst. 2001, 13. Available online: https://merl.com/publications/docs/TR2001-16.pdf (accessed on 8 June 2025).
22. Zhang, Y.; Xu, W.; Liu, A.; Lau, V. Message Passing Based Wireless Federated Learning via Analog Message Aggregation. In Proceedings of the IEEE/CIC International Conference on Communications in China, Hangzhou, China, 7–9 August 2024; pp. 2161–2166.
23. Bagaev, D.; de Vries, B. Reactive message passing for scalable Bayesian inference. Sci. Program. 2023, 2023, 6601690.
24. Podusenko, A.; Kouw, W.M.; de Vries, B. Message passing-based inference for time-varying autoregressive models. Entropy 2021, 23, 683.
25. Kouw, W.M.; Podusenko, A.; Koudahl, M.T.; Schoukens, M. Variational message passing for online polynomial NARMAX identification. In Proceedings of the American Control Conference, Atlanta, GA, USA, 8–10 June 2022; IEEE: New York, NY, USA, 2022; pp. 2755–2760.
26. Petersen, K.B.; Pedersen, M.S. The matrix cookbook. Tech. Univ. Den. 2008, 7, 510.
27. Soch, J.; Allefeld, C.; Faulkenberry, T.J.; Pavlovic, M.; Petrykowski, K.; Sarıtaş, K.; Balkus, S.; Kipnis, A.; Atze, H.; Martin, O.A. The Book of Statistical Proofs (Version 2023). 2024. Available online: https://zenodo.org/records/10495684 (accessed on 8 June 2025).
28. Gupta, A.K.; Nagar, D.K. Matrix Variate Distributions; Chapman and Hall/CRC: Boca Raton, FL, USA, 2018.
29. Särkkä, S. Bayesian Filtering and Smoothing; Cambridge University Press: London, UK; New York, NY, USA, 2013.
30. Lopes, M.T.; Castello, D.A.; Matt, C.F.T. A Bayesian inference approach to estimate elastic and damping parameters of a structure subjected to vibration tests. In Proceedings of the Inverse Problems, Design and Optimization Symposium, Joao Pessoa, Brazil, 25–27 August 2010.
31. Winn, J.; Bishop, C.M.; Jaakkola, T. Variational message passing. J. Mach. Learn. Res. 2005, 6, 661–694.
32. Dauwels, J.; Korl, S.; Loeliger, H.A. Particle methods as message passing. In Proceedings of the IEEE International Symposium on Information Theory, Seattle, WA, USA, 9–14 July 2006; pp. 2052–2056.
33. Murphy, K.P. Machine Learning: A Probabilistic Perspective; MIT Press: Cambridge, MA, USA, 2012.
34. Smith, R.; Friston, K.J.; Whyte, C.J. A step-by-step tutorial on active inference and its application to empirical data. J. Math. Psychol. 2022, 107, 102632.
35. Blei, D.M.; Kucukelbir, A.; McAuliffe, J.D. Variational inference: A review for statisticians. J. Am. Stat. Assoc. 2017, 112, 859–877.
36. Parr, T.; Pezzulo, G.; Friston, K.J. Active Inference: The Free Energy Principle in Mind, Brain, and Behavior; MIT Press: Cambridge, MA, USA, 2022.
37. Friston, K.; Da Costa, L.; Sajid, N.; Heins, C.; Ueltzhöffer, K.; Pavliotis, G.A.; Parr, T. The free energy principle made simpler but not too simple. Phys. Rep. 2023, 1024, 1–29.
38. Yedidia, J.S.; Freeman, W.T.; Weiss, Y. Constructing free-energy approximations and generalized belief propagation algorithms. IEEE Trans. Inf. Theory 2005, 51, 2282–2312.
39. Proakis, J.G. Digital Signal Processing: Principles Algorithms and Applications; Pearson Education India: Noida, India, 2001.
40. Robertson, D.G.E.; Dowling, J.J. Design and responses of Butterworth and critically damped digital filters. J. Electromyogr. Kinesiol. 2003, 13, 569–573.
41. Smith, J.O. Introduction to Digital Filters: With Audio Applications; Smith, J., Ed.; W3K Publishing: San Francisco, CA, USA, 2008; Volume 2.
42. Zumbahlen, H. (Ed.) Linear Circuit Design Handbook; Newnes: Oxford, UK, 2011.
43. Mello, R.G.; Oliveira, L.F.; Nadal, J. Digital Butterworth filter for subtracting noise from low magnitude surface electromyogram. Comput. Methods Programs Biomed. 2007, 87, 28–35.
44. Damgaard, M.R.; Pedersen, R.; Bak, T. Study of variational inference for flexible distributed probabilistic robotics. Robotics 2022, 11, 38.
45. Tedeschini, B.C.; Brambilla, M.; Nicoli, M. Message passing neural network versus message passing algorithm for cooperative positioning. IEEE Trans. Cogn. Commun. Netw. 2023, 9, 1666–1676.
46. Ta, D.N.; Kobilarov, M.; Dellaert, F. A factor graph approach to estimation and model predictive control on unmanned aerial vehicles. In Proceedings of the International Conference on Unmanned Aircraft Systems, Orlando, FL, USA, 27–30 May 2014; IEEE: New York, NY, USA, 2014; pp. 181–188.
47. Castaldo, F.; Palmieri, F.A. A multi-camera multi-target tracker based on factor graphs. In Proceedings of the IEEE International Symposium on Innovations in Intelligent Systems and Applications, Alberobello, Italy, 23–25 June 2014; IEEE: New York, NY, USA, 2014; pp. 131–137.
48. van Erp, B.; Bagaev, D.; Podusenko, A.; Şenöz, İ.; de Vries, B. Multi-agent trajectory planning with NUV priors. In Proceedings of the American Control Conference, Toronto, ON, Canada, 10–12 July 2024; IEEE: New York, NY, USA, 2024; pp. 2766–2771.
49. Assimakis, N.; Adam, M.; Douladiris, A. Information filter and Kalman filter comparison: Selection of the faster filter. In Proceedings of the Information Engineering, Chongqing, China, 26–28 October 2012; Volume 2, pp. 1–5.
50. Cover, T.M. Elements of Information Theory; John Wiley & Sons: Hoboken, NJ, USA, 1999.
Figure 1. Forney-style factor graph of the MARX model in recursive form. A matrix normal Wishart node sends a prior message (1) to an equality node. A likelihood-based message (2) passes upwards from the MARX likelihood node (dashed box), attached to the observed variables $y_k$, $\bar{y}_{k-1}$, and $\bar{u}_k$. Combining the prior-based and likelihood-based messages at the equality node yields the posterior (message 3). Message 4 is the posterior predictive distribution for the system output.
Figure 2. Heatmap of true system parameter $\tilde{A}$. "X" denotes coefficients generated from a Butterworth filter.
Figure 3. Simulation errors (average RMSE) of all three estimators for the MARX system, with ribbons indicating standard errors.
Figure 4. Log-scale Frobenius norm of the difference between true coefficient matrix $\tilde{A}$ and estimates $A$ of each estimator in a single Monte Carlo run with $T_{\mathrm{train}} = 2^6$ for the MARX system, with ribbons indicating standard errors.
Figure 5. Log-scale Frobenius norm of the difference between true coefficient matrix $\tilde{W}$ and estimates $W$ of each MARX estimator for the MARX system, with ribbons indicating standard errors.
Figure 6. Negative log posterior probability of the true system parameters $\tilde{\Theta}$ under each prior choice for the MARX system (lower is better).
Figure 7. Time series of the estimated noise precision matrix $W$ for the MARX-WI for the MARX system. Ribbons indicate one standard deviation, and horizontal lines denote the true values of $\tilde{W}$.
Figure 8. Top: Heatmap of the final $\tilde{A}$ coefficient matrix parameter estimate by the MARX-WI model. "X" marks selected elements, and the trajectories are shown below. Bottom: Time series of the selected elements of $\tilde{A}$ estimated by MARX-WI, with ribbons indicating one standard deviation. Horizontal lines show the true values of the corresponding elements of $\tilde{A}$.
Figure 9. MARX-WI surprisal (dashed blue line) and its decomposition into accuracy (red line) and complexity (green line) over time for the MARX system.
Figure 10. Entropy of the MARX-WI variational posterior $q(\Theta \mid D_k)$ over time for the MARX system.
Figure 11. Surprisal over time for MARX-WI versus MARX-UI for the MARX system.
Figure 12. Simulation errors (average RMSE) of all three estimators for each validation system, with ribbons indicating standard errors.
Figure 13. Log-scale Frobenius norm of the error between the true coefficient matrix $\tilde{W}$ and its estimates $W$ from each MARX estimator for each validation system. Ribbons represent standard errors.
Figure 14. Time series of $\tilde{W}$ estimates from MARX-WI for each validation system, with ribbons representing one standard deviation. Horizontal lines mark true parameter values.
Figure 15. Surprisal (dashed blue) and its decomposition into accuracy (red) and complexity (green) for MARX-WI over time for each validation system.
Figure 16. Entropy of the MARX-WI model parameters over time for each validation system.
Table 1. Sets of prior parameters used in the experiments.

| Prior | $M_0$ | $\Lambda_0$ | $\Omega_0$ | $\nu_0$ |
| --- | --- | --- | --- | --- |
| Uninformative | $0_{D_x \times D_y}$ | $1 \times 10^{-4} \cdot I_{D_x}$ | $1 \times 10^{-5} \cdot I_{D_y}$ | $D_y + 3$ |
| Weakly informative | $0_{D_x \times D_y}$ | $1 \times 10^{-1} \cdot I_{D_x}$ | $1 \times 10^{-2} \cdot I_{D_y}$ | $D_y + 3$ |