Extended Variational Message Passing for Automated Approximate Bayesian Inference

Variational Message Passing (VMP) provides an automatable and efficient algorithmic framework for approximate Bayesian inference in factorized probabilistic models that consist of conjugate exponential family distributions. The automation of Bayesian inference tasks is important, since many data processing problems can be formulated as inference tasks on a generative probabilistic model. However, accurate generative models may also contain deterministic and possibly nonlinear variable mappings and non-conjugate factor pairs that complicate the automatic execution of the VMP algorithm. In this paper, we show that executing VMP in complex models relies on the ability to compute the expectations of the statistics of hidden variables. We extend the applicability of VMP by approximating the required expectation quantities, in appropriate cases, by importance sampling and the Laplace approximation. As a result, the proposed Extended VMP (EVMP) approach supports automated and efficient inference for a very wide range of probabilistic model specifications. We implemented EVMP in the Julia language in the probabilistic programming package ForneyLab.jl and show by a number of examples that EVMP yields an almost universal inference engine for factorized probabilistic models.


Introduction
Probabilistic Programming Languages (PPLs) and packages [1] have gained strong popularity over recent years, since they support fast algorithm development through automating Bayesian inference in probabilistic models. Many of these PPLs [2][3][4][5] are based on numerical approximation methods, which lead to inexact inference results, even if the model comprises conjugate factor pairs and exact inference is achievable. Moreover, although a majority of popular PPLs scale well to processing large data sets due to their stochastic inference settings [6], they tend to execute very slowly for certain types of structured dynamic models, such as state space models. Alternatively, some PPLs that execute inference by message passing in a factor graph [7,8] provide efficient inference performance by exploiting factorization and conjugacy between exponential family-based distribution pairs in the model. In particular, the Variational Message Passing (VMP) [9,10] algorithm has gained a good reputation, as it supports efficient inference for conjugate factor pairs in factorized probabilistic models. Unfortunately, non-conjugate factor pairs complicate the automated estimation of posterior distributions due to the intractability of the normalization constants. Likewise, non-linear deterministic relations between model variables often create non-conjugate pairings and thus obstruct the message-passing-based inference mechanism. This paper proposes an Extended VMP (EVMP) algorithm to support automated and efficient inference on a wide class of models that contain both non-conjugate relations between factor pairs and deterministic, possibly non-linear, factor nodes. In our solution proposal, the regular VMP algorithm constructs the functional forms of the messages. These functional forms contain expectations of functions of hidden variables.
In the case that these expectation quantities cannot be evaluated to a closed-form expression, we estimate them by Importance Sampling (IS) [11], a well-known Monte Carlo method that approximates intractable posteriors by a set of weighted samples and estimates expectations over this sample set. We also make use of the Laplace approximation ([12], Section 4.4), supported by automatic differentiation (autodiff) tools [13], in appropriate cases to approximate posteriors by normal distributions, which allows us to calculate the expectations over the approximating normal distribution. Our proposal leads to an efficient, automatable message-passing framework that removes most model specification limitations.
In Section 2, we start with a review of factor graphs and the VMP algorithm. Next, we specify the proposed Extended VMP algorithm in Section 3. In order to keep the paper readable, both for the advanced researcher and for the reader who just needs the results, we defer detailed discussions and derivations of the key equations in EVMP to Appendices A and B. We implemented EVMP in the Julia package ForneyLab.jl [8,14]. In Section 4, we present several comparative experiments of EVMP in ForneyLab versus Turing.jl, an alternative state-of-the-art Julia-based PPL that focuses on Monte Carlo methods for inference. We show that EVMP transforms ForneyLab into an almost universally applicable inference engine, while retaining computational efficiency due to its library of closed-form message passing rules. An extensive comparison to related work is presented in Section 5.

Problem Statement: Variational Message Passing on Forney-Style Factor Graphs
We assume a probabilistic model p(y, z) with a given set of observations y = y_1:N = {y_1, . . . , y_N} and a set of latent variables z = z_1:M = {z_1, . . . , z_M}. Bayesian inference in this model relates to evaluating the posterior

p(z|y) = p(y, z) / ∫ p(y, z) dz,

which relies on evaluating the model evidence ∫ p(y, z) dz. Unfortunately, due to the computational complexity of evaluating the integral for the evidence, exact Bayesian inference is achievable only for a limited set of probabilistic models. Alternatively, inference can be executed by minimization of a variational objective called the free energy:

F[q] = E_q[ log q(z) − log p(y, z) ] = D_KL[q(z)||p(z|y)] − log p(y),
where E_q[·] stands for the expectation with respect to q(z) and D_KL[·||·] denotes a Kullback-Leibler divergence. (In this paper, we denote the expected value ∫ q(x) f(x) dx of a function f with respect to a distribution q both by E_q[f(x)] and ⟨f(x)⟩_q.) The KL divergence is greater than or equal to zero for any distribution q(z), and D_KL[q(z)||p(z|y)] = 0 if and only if q(z) = p(z|y). As a result, minimizing the free energy with respect to q leads both to an approximate posterior q*(z) = arg min_q F[q] and to an upper bound on the negative log-evidence −log p(y). Consider a model that factorizes as p(y, z) = f_a(z_1:k) f_b(z_k:K), with the observations absorbed into the factors, and a mean-field variational distribution q(z_1:K) = ∏_k q(z_k). Isolating the terms of the free energy that depend on the variable z_k gives

F = F̄ + E_q(z_1:K)[ log q(z_k) − log f_a(z_1:k) − log f_b(z_k:K) ] ≜ F̄ + F_k,     (1)

where F̄ holds terms that are not a function of the variable z_k. The term F_k can be re-arranged into a KL divergence between q(z_k) and the normalized product

exp( E_q(z_1:k−1)[log f_a(z_1:k)] ) · exp( E_q(z_k+1:K)[log f_b(z_k:K)] ),     (2)

so it follows that F_k is minimized when q(z_k) is set proportional to (2) [12,17]. In addition, notice that the terms with the local factors f_a and f_b are uncoupled, which paves the way for a message-passing interpretation of coordinate-descent variational inference. As a result, provided that f_a and f_b are not deterministic factors, the VMP algorithm proceeds by repeating the following four steps until convergence [10]:
1. Choose a variable z_k from the set z_1:K.
2. Compute the incoming messages (3).
3. Update the posterior q(z_k) according to (4).
4. Update the free energy according to (5).
On the other hand, f_δ = δ(w_t − g(z_t)) represents a deterministic factor, where g(·) is a deterministic function. It is possible to compose factors and consider them as a single unit.
In this example, f_d, visualized by a dashed box, stands for the composition of f_δ and f_b. It is a notational convention to visualize observed values (y_t) by a small black node. As we see in (3), messages flow on edges in both directions. It is common parlance to call one of the messages the forward message (denoted by →m_z_k(z_k)) and the other the backward message (←m_z_k(z_k)). In this paper, the central problem is how to execute the VMP update Equations (3) through (5) for a wide range of specifications for the factors f_a and f_b. In the next section, we specify the proposed Extended VMP (EVMP) solution. A more detailed derivation of the key equations of EVMP is provided in Appendix B.

Specification of EVMP Algorithm
Variational Message Passing is a fast, efficient and deterministic approximate inference algorithm. However, the applicability of VMP heavily relies on connected factors being conjugate pairs (see Appendix A). In contrast, Monte Carlo methods (see [18] for a message-passing interpretation) are applicable to a wider range of models with non-conjugate factor pairs. Unfortunately, in comparison to VMP, Monte Carlo methods are considerably slower since they rely on stochastic simulations. As we elaborate in Section 5, recent efforts to combine the best of Monte Carlo methods and variational inference predominantly focus on noisy gradient estimation of the free energy through Monte Carlo sampling and do not take full advantage of deterministic message passing steps in inference.
In this section, we specify the EVMP algorithm, which combines the efficiency of VMP with the flexibility of the Laplace approximation and the universality of Monte Carlo methods. In the proposed EVMP algorithm, VMP constructs the functional forms of the messages while importance sampling and Laplace approximations are used to estimate the required expectations of statistical quantities if they are not available in closed form. We first specify the range of probability distribution types for factors, messages and posteriors. These different types are used to identify the specific calculation rules for updating the messages and posteriors in (3) and (4). We refer the interested reader to Appendix B for detailed derivations.

Distribution Types
We consider the following representation types for probability distributions p(z), where z holds a variable. (1) The standard Exponential Family (EF) of distributions, i.e., the following:

p(z) = h(z) exp( η^T φ(z) − A_η(η) ),

where h(z) is the base measure, φ(z) is the sufficient statistics vector, η is the natural parameters vector and A_η(η) is the log-partition function. (2) Distributions that are of the following exponential form:

p(z) ∝ h(z) exp( η^T φ(g(z)) ),

where g(z) is a deterministic function. The key characteristic here is that φ(g(z)) is not recognized as a sufficient statistics vector for any of the standard EF distributions. We call this distribution type a Non-Standard Exponential Family (NEF) distribution. As we show in Section 3.6, this distribution type arises only in backward message calculations.
(3) A List of Weighted Samples (LWS), i.e., the following:

q(z) ≈ {(w^(i), z^(i))}_i=1:N, which represents the weighted-sample approximation q(z) ≈ Σ_i w^(i) δ(z − z^(i)).

(4) Deterministic relations are represented by delta distributions, i.e., the following:

p(x|z) = δ(x − g(z)).

Technically, the equality factor f(x, y, z) = δ(z − x)δ(z − y) also specifies a deterministic relation between variables.
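To make the EF representation concrete, the following sketch (plain Python for illustration, not ForneyLab code) writes the univariate Gaussian in the form h(z) exp(η^T φ(z) − A_η(η)) with φ(z) = (z, z²) and base measure h(z) = 1, and checks it against the standard density formula:

```python
import math

def gaussian_ef(z, eta1, eta2):
    # Evaluate h(z) * exp(eta . phi(z) - A(eta)) with phi(z) = (z, z^2), h(z) = 1.
    # Log-partition of the Gaussian in natural parameters (requires eta2 < 0):
    A = -eta1**2 / (4 * eta2) + 0.5 * math.log(math.pi / -eta2)
    return math.exp(eta1 * z + eta2 * z**2 - A)

def gaussian_pdf(z, mu, var):
    return math.exp(-(z - mu)**2 / (2 * var)) / math.sqrt(2 * math.pi * var)

mu, var = 1.5, 0.7
eta1, eta2 = mu / var, -1 / (2 * var)   # natural parameters of N(mu, var)
for z in (-1.0, 0.0, 2.0):
    assert abs(gaussian_ef(z, eta1, eta2) - gaussian_pdf(z, mu, var)) < 1e-12
```

The mapping η = (μ/σ², −1/(2σ²)) and the log-partition expression are the standard Gaussian EF parameterization; the illustrative values of μ and σ² are arbitrary.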

Factor Types
Factor types f(z) are represented by EF and delta distributions.
In a VMP setting, as discussed in this and previous papers on VMP, conjugate soft factors from the exponential family enjoy some computational advantages. As an extension to VMP, the EVMP algorithm inherits the same computational advantages for conjugate factor pairs. In order to automate and generalize the inference to custom non-conjugate soft factors, we compose a generic soft factor by a delta distribution (to describe a non-linear deterministic function) and a standard EF distribution. This decomposition relieves us from manually deriving VMP messages for each different soft factor specification. For a given composite node (delta + standard EF), the EVMP algorithm uses the predefined VMP messages for the standard EF component to compute messages around the composite node. As we will see, this formulation yields an almost generic inference procedure.
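The computational advantage of conjugate EF pairs can be illustrated with a small sketch: multiplying two Gaussian messages with the same sufficient statistics reduces to adding their natural parameters, which recovers the familiar precision-weighted update (illustrative plain Python, not ForneyLab code):

```python
mu0, v0 = 0.0, 2.0      # forward message N(0, 2), e.g., from a prior
y, vy = 1.0, 0.5        # backward message N(y, 0.5), e.g., from a Gaussian likelihood

# natural parameters (eta1, eta2) of each Gaussian message
fwd = (mu0 / v0, -1 / (2 * v0))
bwd = (y / vy, -1 / (2 * vy))

# posterior natural parameters = sum of the message naturals
eta1, eta2 = fwd[0] + bwd[0], fwd[1] + bwd[1]
v_post = -1 / (2 * eta2)
m_post = eta1 * v_post

# agrees with the familiar precision-weighted Bayesian update
assert abs(v_post - 1 / (1 / v0 + 1 / vy)) < 1e-12
assert abs(m_post - v_post * (mu0 / v0 + y / vy)) < 1e-12
```

All numerical values are arbitrary illustrations; the point is that no optimization or sampling is needed in the conjugate case.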

Message Types
Forward messages carry either an EF or an LWS distribution. Backward messages carry either an EF or an NEF distribution. This assignment is a design choice, made so that in the EVMP algorithm two colliding messages in a posterior calculation are never both of the LWS type, nor both of the NEF type.

Posterior Types
The posteriors q(z) are represented by either EF or LWS distributions. To summarize the terminology so far, we defined four distribution types: Standard EF (EF), Non-Standard EF (NEF), List of Weighted Samples (LWS) and delta distributions. The end user of our algorithm can design a model by using EF and delta distributions. Under the hood, messages may carry EF, NEF or LWS distributions to render the inference. As the output, the end user is provided with either EF or LWS posteriors. Next, we discuss how posteriors, messages and free energies are computed in the EVMP algorithm. The different types can be used to identify which computational recipe applies. As an aside, Julia's support for multiple dispatch in functions [14] makes this a very elegant mechanism that requires almost no if-then rules.
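As a preview of how an LWS posterior arises in practice, the following sketch runs importance sampling on a hypothetical non-conjugate target (a Gaussian prior times a logistic factor; both the target and the proposal are our own illustrative choices, not taken from the paper's models):

```python
import math, random

random.seed(1)
N = 5000
samples = [random.gauss(0.0, 1.0) for _ in range(N)]   # proposal: N(0, 1)

# Importance weights target/proposal: the Gaussian prior cancels against the
# proposal, leaving only the (hypothetical) logistic likelihood factor.
raw = [1.0 / (1.0 + math.exp(-3.0 * z)) for z in samples]
total = sum(raw)
lws = [(w / total, z) for w, z in zip(raw, samples)]   # List of Weighted Samples

post_mean = sum(w * z for w, z in lws)                 # expectation over the LWS
assert abs(sum(w for w, _ in lws) - 1.0) < 1e-9
assert post_mean > 0.0   # the logistic factor tilts mass toward positive z
```

Any expectation of a statistic is then a weighted average over the sample list, which is exactly how the LWS type is consumed downstream.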

Computation of Posteriors
Here, we discuss how EVMP updates the posteriors in (4). In an FFG, computation of the posterior q(z) is realized by a multiplication of colliding forward and backward messages, respectively →m_z(z) and ←m_z(z), followed by normalization. We distinguish four types of updates. (1) In the case that the colliding forward and backward messages both carry EF distributions with the same sufficient statistics φ(z), computing the posterior simplifies to a summation of natural parameters:

q(z) ∝ →m_z(z) · ←m_z(z) ∝ exp( (→η + ←η)^T φ(z) ).

In this case, the posterior q(z) will also be represented by the EF distribution type. This case corresponds to classical VMP with conjugate factor pairs.
(2) The forward message again carries a standard EF distribution, while the backward message carries either an NEF distribution or a non-conjugate EF distribution. We distinguish two sub-cases:
(a) If the forward message is Gaussian, i.e., →m_z(z) = N(z; μ_1, V_1), we use a Laplace approximation to compute the posterior, which is then represented as a Gaussian.
(b) Otherwise (the forward message is not a Gaussian), we use Importance Sampling (IS) to compute the posterior, which is then represented as an LWS.

Computation of Messages
Here, we discuss how EVMP computes the messages in (3). (1) For soft EF factors, the messages follow the standard VMP rules. Note that the message calculation rule for →m_z_k(z_k) requires the computation of the expectation ⟨η_a(z_1:k−1)⟩_q(z_1:k−1), and for ←m_z_k(z_k) we need to compute the expectations ⟨φ_a(z_1)⟩_q(z_1) and ⟨η_a(z_2:k)⟩_q(z_2:k). In the update rules shown below, these expectations of statistics of z appear over and again. In Section 3.8, we detail how we calculate these expectations, and in Appendix A, we further discuss the origins of these expectations. (2) In the case that f_δ(x, z_1:k) = δ(x − g(z_1:k)) is a deterministic factor (see Figure 3b), the forward message from f_δ to x is of LWS type: samples z_1:k^(i) are drawn from the incoming forward messages and pushed through g(·) to yield the weighted samples of x. For the computation of the backward message toward z_k, we distinguish two cases: (a) If all forward incoming messages from the variables z_1:k are Gaussian, we first use a Laplace approximation to obtain a Gaussian joint posterior q(z_1:k) = N(z_1:k; μ_1:k, V_1:k); see Appendices B.1.2 and B.2.2 for details. Then, we evaluate the posteriors for the individual random variables, e.g., q(z_k) = ∫ q(z_1:k) dz_1:k−1 = N(z_k; μ_k, V_k). Finally, we send a Gaussian backward message. (b) Otherwise (the incoming messages from the variables z_1:k are not all Gaussian), we use Monte Carlo and send a message to z_k as an NEF distribution. Note that if f_δ is a single-input deterministic node, i.e., f_δ(x, z_k) = p(x|z_k) = δ(x − g(z_k)), then the backward message simplifies to ←m_z_k(z_k) = ←m_x(g(z_k)) (Appendix B.1.1). (3) The third factor type that leads to a special message computation rule is the equality node; see Figure 3c. The outgoing message from an equality node is computed by following the sum-product rule.
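The Laplace step can be sketched for a hypothetical non-conjugate pair: a Gaussian forward message N(z; 0, 1) combined with a Poisson likelihood under a log-link. Both the model and the Newton-based mode search below are illustrative choices, not the ForneyLab implementation:

```python
import math

# Hypothetical non-conjugate pair: Gaussian forward message N(z; 0, 1) combined
# with a Poisson likelihood under a log-link, y ~ Poisson(exp(z)), observed y = 4.
y = 4
logpost = lambda z: -0.5 * z**2 + y * z - math.exp(z)  # log posterior up to a constant
grad    = lambda z: -z + y - math.exp(z)
hess    = lambda z: -1.0 - math.exp(z)                 # strictly negative: concave

# Newton iterations locate the mode z* of the log posterior
z = 0.0
for _ in range(25):
    z -= grad(z) / hess(z)

mu_laplace  = z
var_laplace = -1.0 / hess(z)   # inverse negative curvature at the mode

assert abs(grad(mu_laplace)) < 1e-8   # converged to a stationary point
assert var_laplace > 0.0
```

The resulting N(mu_laplace, var_laplace) is the Gaussian posterior approximation; in EVMP the gradients and Hessians would come from autodiff rather than the hand-coded derivatives used here.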

Computation of Free Energy
Here, we discuss how EVMP computes the FE update from (5). Note that the FE can be decomposed into a subtraction of energy and entropy terms:

F[q] = −E_q[log p(y, z)] − ( −E_q[log q(z)] ) = U[q] − H[q],

where U[q] denotes the energy and H[q] the entropy. These terms can be evaluated because f_a(z_1:k) and f_b(z_k:K) contain only factors that are defined in the generative model, and q(z_1:K) is also accessible as a result of variational inference. Thus, we evaluate the FE by evaluating the energy and entropy terms separately.
The entropy terms only need to be evaluated for variables z that are not associated with output edges of deterministic nodes. In that case, we calculate the entropy of q(z) as follows:
1. If q(z) is represented by a standard EF distribution, the entropy is available in closed form in terms of the log-partition function and the expected sufficient statistics, i.e., H[q] = A_η(η) − η^T ⟨φ(z)⟩_q − ⟨log h(z)⟩_q.
2. If q(z) is represented by an LWS, the entropy is estimated over the weighted sample set.
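For instance, for a Gaussian q(z) with base measure h(z) = 1, the closed-form entropy H[q] = A_η(η) − η^T ⟨φ(z)⟩ reproduces the textbook value ½ log(2πeσ²), as the following sketch (illustrative values) verifies:

```python
import math

mu, var = 2.0, 0.8
eta1, eta2 = mu / var, -1 / (2 * var)                          # natural parameters
A = -eta1**2 / (4 * eta2) + 0.5 * math.log(math.pi / -eta2)    # log-partition
E_phi = (mu, mu**2 + var)                                      # expected sufficient statistics

# H[q] = A(eta) - eta . E[phi(z)], with base measure h(z) = 1
H = A - (eta1 * E_phi[0] + eta2 * E_phi[1])

assert abs(H - 0.5 * math.log(2 * math.pi * math.e * var)) < 1e-12
```

The same recipe applies to any standard EF member once η, A_η and ⟨φ(z)⟩ are known.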

Expectations of Statistics
In many of the above computations for messages, posteriors and free energies, we need to compute certain expectations of statistics of z; e.g., the computation of the forward message in (13) requires evaluation of ⟨η_a(z_1:k−1)⟩_q(z_1:k−1). Here, we discuss how EVMP evaluates these expectations. Let us denote a statistic of the random variable z by Φ(z) and assume we are interested in the expected value ⟨Φ(z)⟩_q(z). The calculation rule depends on the type of q(z): (1) When q(z) is coded as an EF distribution, we have two cases. If the statistic Φ(z) matches with elements of the sufficient statistics vector φ(z), then ⟨Φ(z)⟩_q(z) is available in closed form as the gradient of the log-partition function (this is worked out in Appendix A.1.1; see (A14) and (A15)). Otherwise, we estimate the expectation by drawing samples from q(z). (2) In case q(z) is represented by an LWS, i.e., q(z) ≈ {(w^(i), z^(i))}_i=1:N, we evaluate ⟨Φ(z)⟩_q(z) ≈ Σ_i w^(i) Φ(z^(i)).
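Both rules can be checked numerically. The sketch below verifies, for a Gaussian, that finite differences of the log-partition function recover E[z] and E[z²], and evaluates a weighted-sample expectation for a toy LWS (all values illustrative):

```python
import math

def A(eta1, eta2):   # Gaussian log-partition in natural parameters (eta2 < 0)
    return -eta1**2 / (4 * eta2) + 0.5 * math.log(math.pi / -eta2)

mu, var = 1.0, 0.5
eta1, eta2 = mu / var, -1 / (2 * var)
eps = 1e-6

# dA/deta1 = E[z] and dA/deta2 = E[z^2] (gradient of the log-partition)
dA1 = (A(eta1 + eps, eta2) - A(eta1 - eps, eta2)) / (2 * eps)
dA2 = (A(eta1, eta2 + eps) - A(eta1, eta2 - eps)) / (2 * eps)
assert abs(dA1 - mu) < 1e-6
assert abs(dA2 - (mu**2 + var)) < 1e-5

# For an LWS posterior, expectations are weighted sample averages:
lws = [(0.2, -1.0), (0.5, 0.0), (0.3, 2.0)]   # toy (weight, sample) pairs
E_z = sum(w * z for w, z in lws)
assert abs(E_z - 0.4) < 1e-12
```

In the actual implementation, the gradient of A_η is available analytically (or via autodiff) rather than by finite differences; finite differences are used here only to make the identity visible.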

Pseudo-Code for the EVMP Algorithm
Sections 3.1-3.8 provide a recipe for almost universal evaluation of variational inference in factor graphs. We use classical VMP with closed-form solutions when possible, and resort to Laplace or IS approximations when needed. We now summarize the EVMP algorithm by a pseudo-code fragment in Algorithm 1. We use the following notation: V = V_f ∪ V_δ ∪ V_= is the set of factor nodes (vertices), where V_f, V_δ and V_= stand for the subsets of soft factor nodes, deterministic nodes and equality nodes, respectively. E is the set of edges that connect the nodes. G = (V, E) represents the entire factor graph. h_1:M+L = z_1:M ∪ x_1:L is the set of hidden variables, where x_1:L are the variables at the output edges of deterministic nodes. The variables z_1:M are also associated with edges in E, but in contrast to x_1:L, they are not associated with output edges of deterministic nodes.
For structured factorizations, the overall structure remains the same, but messages and posteriors are calculated for sub-graphs instead of single random variables.
An example to illustrate the calculation of messages and posteriors in the EVMP algorithm is provided in Appendix F.

Algorithm 1 Extended VMP (Mean-field assumption)
Calculate entropy H[h_j] using Section 3.7
Calculate energy U_v using Section 3.7
Update free energy F = F + U_v
end for
return q(h) for all h ∈ h_1:M+L and F
end for

Experiments
We illustrate EVMP-based inference on three different applications (code for the experiments can be found at https://github.com/biaslab/ExtendedVMP (accessed on 25 June 2021)). For each application, we show the favorable features of EVMP together with its shortcomings in comparison to Turing [5], which is a general-purpose Julia probabilistic programming package.

Filtering with the Hierarchical Gaussian Filter
The Hierarchical Gaussian Filter (HGF) [19,20] is a popular generative model in the neuroscience community. The HGF consists of a Gaussian random walk model, where the variance of the Gaussian is a nonlinear function of the state of the next higher layer, which in turn evolves according to a Gaussian random walk, and so on. Due to the nonlinear link between the layers, classical VMP rules do not have a closed-form solution. While, in principle, variational updates through the Laplace approximation can be manually derived for the HGF model [19], automatically generated EVMP update rules alleviate the need for cumbersome and error-prone manual derivations.
The 2-layer HGF model is defined as

z_t ∼ N(z_t−1, σ_z²),   (21a)
w_t = g(z_t),           (21b)
x_t ∼ N(x_t−1, w_t),    (21c)
y_t ∼ N(x_t, σ_y²),     (21d)

where g(·) is a positive-valued link function that maps the higher-layer state z_t to the random-walk variance w_t. For this experiment, we generated T = 400 data points by the following process. First, we generated noisy hidden states using z_t ∼ N(sin(πt/60), 0.01), t = 1 . . . 400. Next, we generated observations following model (21a-d) with σ_y² = 0.1. The generated data set is visualized in (the lower subgraph of) Figure 4.
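The data-generating process can be sketched as follows; note that the exponential link w_t = exp(z_t) is our illustrative choice for g(·) here, and the seed is arbitrary:

```python
import math, random

random.seed(42)
T = 400
sigma2_y = 0.1

z, x, y = [], [0.0], []
for t in range(1, T + 1):
    z_t = random.gauss(math.sin(math.pi * t / 60), math.sqrt(0.01))  # sinusoidal drive
    w_t = math.exp(z_t)                      # assumed link w_t = exp(z_t)
    x_t = random.gauss(x[-1], math.sqrt(w_t))  # random walk with state-dependent variance
    y_t = random.gauss(x_t, math.sqrt(sigma2_y))
    z.append(z_t); x.append(x_t); y.append(y_t)

assert len(y) == T and all(math.isfinite(v) for v in y)
```

The volatility of the x-walk is thus modulated by the slowly varying top-layer state z_t, which is the structure the filter has to recover.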
Next, we filtered the data set by a second HGF, also given by (21a-d), with priors z_0 ∼ N(0, 1), x_0 ∼ N(0, 1) and parameters σ_z² = σ_y² = 0.1. We used EVMP to track the hidden states z_t and x_t. All inference steps, including the message passing schedule for filtering in the HGF, are detailed in [19]. EVMP was run for 10 iterations at each filtering step. For comparison, we implemented a similar filtering procedure by Automatic Differentiation Variational Inference (ADVI) [21], executed by Julia's Turing.jl [5] package. At each time step t, the priors over z_t−1 and x_t−1 are set to Gaussian distributions, the mean and variance parameters of which are determined by sampling from the variational posteriors at t − 1. The only difference between the ForneyLab and Turing implementations, in terms of posterior distribution factorization, is that in Turing's ADVI we posit a fully factorized posterior distribution. This assumption decreases the number of parameters to be estimated via automatic differentiation and speeds up the inference procedure. On the other hand, pre-defined message passing rules in ForneyLab enable us to retain the dependency structure between x_t−1 and x_t at time step t in exchange for almost no run-time loss. To be more precise, at time step t, we run inference on a local model in which q_f(z_t−1) and q_f(x_t−1) are the posterior approximations from the previous time step. In ForneyLab, we run the inference with the variational distribution q(z_t−1) q_f(z_t) q(x_t−1, x_t); see Figure 4. In ADVI, the variational distribution is q(z_t−1) q_f(z_t) q(x_t−1) q_f(x_t). Once inference has completed, Turing allows drawing samples from the variational distribution. We then calculate the mean and variance of these samples to fit Gaussian distributions on q_f(z_t) and q_f(x_t).
The estimated tracks of z_t are visualized in Figure 4. For both EVMP and ADVI, the estimated hidden states largely coincide. However, we observe that both methods capture the periodic character of the true hidden states z_1:400 with a delay. We believe that there are two plausible explanations for the delayed estimates: (1) in the model specification, we assume that the data generating process is not fully known. The variables z_1:400 are originally generated from a sinusoidal function of the discrete time steps; however, the model specification does not use this information. (2) The model specification defines a random walk over the hidden variables z_1:400 that posits z_t−1 as the mean of z_t. Elaborating on the latter factor, the random walk prevents the hidden variable z_t from changing drastically compared to z_t−1, while x_t forces z_t to explain the volatility in the process. Reconciling the beliefs from x_t and z_t−1, both Extended VMP and ADVI estimate z_t with a delay.
In Turing's ADVI procedure, we used 10 samples per iteration for gradient estimation and set the maximum number of iterations to 4000 per time step to be able to capture this periodic behavior. The overall inference is completed in roughly 1.5 min (this and further experiments were carried out on a machine with the following specs: Julia 1.5.0, Turing v0.15.9, AMD Ryzen 7 3700X 3.6 GHz 8-Core CPU, 16 GB DDR4-3200 MHz RAM). ForneyLab's EVMP procedure, on the other hand, is able to perform inference in under 7 s on this time series; see Table 1. The speed of ForneyLab stems from the hybrid inference nature of EVMP: it resorts to gradient-based optimization only to infer q_f(z_t), and the sampling procedure is required only to estimate statistics related to w_t for the update steps of q(x_t−1, x_t). In contrast, ADVI requires sampling and employs noisy gradients in the estimation of all components of the variational distribution. This experiment validates EVMP as a fast automated variational inference solution for filtering in hierarchical dynamic models.

Parameter Estimation for a Linear Dynamical System
In this experiment, we focused attention on a system identification task in a Linear Dynamical System (LDS) [22,23]. An LDS is generally defined as

x_t ∼ N(A x_t−1, Q),   (22a)
y_t ∼ N(B x_t, R),     (22b)

where y_t are observations and x_t are hidden states.
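A minimal simulation of (22a,b), with illustrative values for A, B and isotropic noise covariances (plain Python, no ForneyLab), looks as follows:

```python
import random

random.seed(0)

def matvec(M, v):
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

A = [[0.9, -0.2], [0.2, 0.9]]   # illustrative transition matrix (slow rotation)
B = [[1.0, 0.0], [0.0, 1.0]]    # illustrative observation matrix (identity)
q, r = 0.01, 0.1                # isotropic process / observation noise variances

x = [1.0, 0.0]
ys = []
for t in range(100):
    x = [m + random.gauss(0, q**0.5) for m in matvec(A, x)]          # (22a)
    ys.append([m + random.gauss(0, r**0.5) for m in matvec(B, x)])   # (22b)

assert len(ys) == 100 and len(ys[0]) == 2
```

In the system identification task below, A is treated as unknown and has to be inferred jointly with the states x_t from the observations ys.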
In this experiment, we are interested in inferring the transition matrix A together with the hidden states from a set of observations. Manually derived closed-form solutions for the system identification task are available in both maximum likelihood estimation [24] and variational Bayesian approximation [25] contexts. Nevertheless, the goal in this and other papers on probabilistic programming packages is to automatically infer posteriors over the hidden states and parameters without resorting to manual derivations. In principle, EVMP supports inferring the hidden states and A, B, Q and R concurrently. Of course, depending on specific circumstances, such as system identifiability and the richness of the observed data, the performance may vary.
In order to execute our experiment, we first extend (22a,b) with a prior on A, as specified in (23a-d), where a holds the vectorized representation of the transition matrix A. Note that (23b) can be written in terms of the deterministic mapping A = reshape(a, (m, m)), and through this manipulation we identify reshape(a, (m, m)) as the deterministic factor in (15). As a result, ForneyLab's EVMP works out-of-the-box for inference of the transition matrix in (23a-d).
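The reshape trick amounts to a purely deterministic mapping from the vector a to the matrix-vector product A x_t−1. The sketch below uses row-major indexing for clarity; note that Julia's reshape is column-major, so this illustrates the idea rather than ForneyLab's exact convention:

```python
def reshape(a, shape):
    # row-major reshape of a flat list into an m x n nested list
    m, n = shape
    return [a[i * n:(i + 1) * n] for i in range(m)]

a = [0.9, -0.2, 0.2, 0.9]       # vectorized transition matrix (illustrative values)
A = reshape(a, (2, 2))
x_prev = [1.0, 2.0]
# mean of x_t under (23b): the deterministic function of (a, x_prev)
mean_x = [sum(A[i][j] * x_prev[j] for j in range(2)) for i in range(2)]

assert A == [[0.9, -0.2], [0.2, 0.9]]
assert abs(mean_x[0] - 0.5) < 1e-12 and abs(mean_x[1] - 2.0) < 1e-12
```

Because the whole composition a ↦ reshape(a, (m, m)) x_t−1 is deterministic, EVMP can treat it with the generic deterministic-node message rules instead of requiring a hand-derived update for a matrix-valued parameter.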
Next, we presented the data set to a second LDS model and aimed to infer posteriors over the hidden states and the transition matrix A. The prior on a was set to a ∼ N(0_4, I_4×4), and all other parameters were set to the same values as in the data generation process.
We compared the performance of ForneyLab's EVMP with Turing's ADVI and NUTS (No-U-Turn Sampler, a Hamiltonian Monte Carlo-based inference method) [26] engines; see Figure 5. EVMP, ADVI and NUTS all successfully converged to almost coinciding estimates of the transition matrix (no notable difference when visualized). We also show free energy tracks for EVMP and ADVI in Figure 5. In this experiment, Turing's ADVI outperformed ForneyLab's EVMP in terms of total execution time and free energy minimization. As a mitigating factor in this analysis, the pre-compilation of the message passing schedule in ForneyLab takes about 13 s, while the actual execution of the generated inference algorithm is on par with Turing's ADVI. Execution time details are shown in Table 2.

EVMP for a Switching State Space Model
In this experiment, we went beyond models that only contain continuously valued variables and investigated the capabilities of EVMP on a Switching State Space Model (SSSM) [27], which consists of both continuous and discrete hidden variables. The assumption of constant model parameters in the LDS of Section 4.2 does not account for the regime changes that occur in many dynamical systems of interest. The SSSM does allow for modeling parameter switches, and in this experiment we used the model specified in (24a-d). In this system, y_t ∈ R are observations, x_t ∈ R is a continuously valued hidden state and z_t is a one-hot coded three-dimensional selection variable, i.e., z_tk ∈ {0, 1} and Σ_k=1:3 z_tk = 1. The parameters of the system are the state variances v_k and the concentration parameters α_k. The elements of α_k are all 1, except the k-th element, which is set to 100 to disfavor frequent regime switches, e.g., α_2 = [1, 100, 1].
We generated T = 120 data points from the random walk process (24c) and (24d) with the process noise variance parameters v_k. From time step t = 2 to t = 25, we set z_t,1 = 1 and consequently p(x_t|x_t−1) = N(x_t−1, 10). From time step t = 26 to t = 75, we set z_t,2 = 1, and between t = 76 and t = T = 120, we set z_t,3 = 1. The generated time series is shown in Figure 6. The main difficulty in state inference for the SSSM stems from the coupling between x and z. This is because the variational message passing rules around the node p(x_t|x_t−1, z_t) are not pre-defined in ForneyLab, although technically they can be worked out to closed-form expressions [27]. If EVMP were not available either, then a ForneyLab end user would be expected to manually derive closed-form update rules and implement these rules in an additional ForneyLab node. This type of manually assisted inference by end-user calculations is what we try to avoid with EVMP and with probabilistic programming packages in general. EVMP enables the user to compensate for the lack of stored message-passing rules by introducing an auxiliary variable s in the model, as specified in (25a-d), with a deterministic relation between s and z. After we extend the model specification (24a-d) by (25a-d), ForneyLab can run EVMP-based inference out of the box. Note that there is no need for manual inference calculations, but rather a simple manipulation of the generative model that makes the system suited for automated inference. We tested the performance of two different constraints on the posterior distribution: (1) a mean-field assumption, i.e., q(x_1:T, z_1:T) = ∏_t=1:T q(x_t) q(z_t); (2) a structured mean-field assumption, i.e., q(x_1:T, z_1:T) = q(x_1:T) ∏_t=1:T q(z_t); see Figure 6. We observe that the structured factorization, being a less stringent constraint on q, yields a slightly better performance than the mean-field factorization, particularly in estimating the length of the first regime.
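The coupling between z_t and the state transition can be made concrete by noting how a one-hot z_t selects the active variance. A common way to write this selection, which we assume here for illustration, is the product form ∏_k v_k^(z_tk); the variance values below are illustrative (only v_1 = 10 is stated in the experiment):

```python
import math

v = [10.0, 0.1, 1.0]            # illustrative regime variances v_k

def selected_variance(z_onehot, v):
    # product form prod_k v_k^{z_k}: the active regime's variance is picked out,
    # because all inactive factors reduce to v_k^0 = 1
    return math.prod(vk**zk for vk, zk in zip(v, z_onehot))

assert selected_variance([1, 0, 0], v) == 10.0
assert selected_variance([0, 1, 0], v) == 0.1
```

Under this form, p(x_t|x_t−1, z_t) mixes discrete and continuous variables multiplicatively, which is exactly why the stock VMP rules do not apply and the auxiliary-variable construction of (25a-d) is needed.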
We also compared the performance of ForneyLab's EVMP method to Turing's inference methods. As opposed to the previous two experiments, we could not solely use ADVI, nor the Hamiltonian Monte Carlo (HMC, [28,29]) and NUTS samplers, in this experiment, since these procedures do not allow inference for discrete random variables. Turing does provide the option to use a Particle Gibbs (PG) sampler [30,31] for the estimation of the discrete random variables (z_1:T) in conjunction with the estimation of the continuous random variables (x_1:T, A) by HMC and NUTS. The performance results for NUTS-PG and HMC-PG are shown in Figure 6. The performance of the NUTS-PG and HMC-PG samplers in estimating the correct regimes is far below the EVMP results, although the HMC-PG sampler correctly identified the third regime. The run-time scores are shown in Table 3.

Related Work
Hybrid combinations of Monte Carlo and variational inference techniques have been studied prior to our work. However, mainstream research predominantly embeds variational methods within Monte Carlo techniques, whereas we embed Monte Carlo methods within a variational inference approach.
For instance, ref. [32] casts variational distributions as proposal distributions in a Markov-Chain Monte Carlo (MCMC) procedure. Similarly, ref. [33] employs variational methods to design adaptive proposal distributions for Importance Sampling (IS). In [34], gradient estimates of a variational objective are used to tune the parameters of the proposal distributions for MCMC. On the other hand, Monte-Carlo Co-Ordinate Ascent Variational Inference (MC-CAVI), proposed in [35], differs from the aforementioned methods in that it uses MCMC in the calculation of expectations required within the fixed-point iterations of Coordinate Ascent Variational Inference (CAVI).
In this paper, we follow a similar approach as [35], but we use IS to estimate the expectation quantities required in VMP. Both MCMC and IS have their own merits. IS smoothly interfaces with the message passing interpretation of Bayesian inference, which further leads to an automated design of proposal distributions. We use the Laplace approximation to obtain Gaussian posteriors for variables with Gaussian priors. In the context of dynamical systems, this approach notably overlaps with Gaussian filtering techniques ([36], Section 6), which are often realized by Assumed Density Filtering ([37], Section 8.4).
As we show in Appendix E, in the approach that we propose, it is also possible to run automated bootstrap particle filtering [36,38] rather than Gaussian filtering methods. As shown in [18], particle filtering can also be framed as message passing on a factor graph. The connection between particle filtering and variational optimization was introduced in [39]. Their formalism is based on an extension of Particle Belief Propagation [40] to Tree-Reweighted Belief Propagation [41], while ours revolves around VMP. Similar to our approach, Particle Variational Inference (PVI) [42] aims at optimizing a variational objective by successive IS approximations to the true posterior distributions. While PVI applies well to inference for discrete random variables, our EVMP proposal applies to both continuous and discrete random variables.
Variational inference in the context of deterministic building blocks in probabilistic models was studied in [43]. Whereas [43] allows non-linearities to be placed only after Gaussian nodes, the proposed EVMP method generalizes this concept to EF-distributed factors.
Non-conjugate Variational Message Passing (NC-VMP) [44] addresses the non-conjugate factor issue in VMP. Assuming that the posterior distribution is an EF distribution, NC-VMP projects the messages to the distribution space of the posterior by equating their sufficient statistics. Thus NC-VMP tunes the natural parameters of the messages in such a way that they converge to the stationary points of the KL divergence between the approximated and true posteriors. Ref. [44] also reports that the algorithm necessitates damping for convergence in practice. In response, ref. [45] presents Conjugate-Computation Variational Inference (CVI) as a universal inference method that is based on stochastic optimization techniques. As opposed to alternative stochastic variational inference techniques, such as Black-Box Variational Inference [46] and Automatic Differentiation Variational Inference [21], CVI exploits the conjugacy structure of the probabilistic models, which leads to faster convergence. In CVI, non-conjugate factors are incorporated into coordinate ascent steps of mean-field variational inference (with ELBO objective) through a stochastic optimization procedure to form compact posterior approximations with standard probability distributions. In our EVMP approach, the Laplace approximation entails a similarly nested optimization procedure to form compact approximations with Gaussian distributions. Nevertheless, our particle approximations to the true posteriors obviate the need for additional gradient-based optimizations to estimate the parameters of the posteriors.
Finally, the original VMP paper [9] itself briefly mentions sampling methods to overcome the issues with non-conjugate priors. However, the authors do not extend this idea to deterministic nodes and rather present it as a fallback method whenever a soft factor is tied to a non-conjugate prior. Inspired by their vision of approximating the expectation quantities by sampling techniques, we introduce here a fully automated, very broadly applicable Extended VMP procedure.

Discussion
In this paper, we present a method for almost universal variational inference on factorized probabilistic models. The core of our method is the locality feature of VMP: the messages at a soft factor are functions of expectations related to arguments of the factor. We employ IS to estimate these expectations or directly approximate posteriors by Laplace approximation if a Gaussian posterior is reasonable. We also extended the Julia package ForneyLab with the proposed EVMP method. In contrast to many alternative PPLs that are solely based on Monte Carlo methods, ForneyLab allows end users to take full advantage of closed-form message passing rules while resorting to small-scale numerical approximations only when needed. We showed that ForneyLab provides an efficient automated variational Bayesian inference tool that in some instances may be preferable to the state-of-the-art Turing package, especially for tasks that include filtering in dynamical models or discrete variables in state space models.
While the experiments support the notion that EVMP is a promising method for inference in non-linear and non-conjugate models, we have not yet tested our method in high-dimensional problems. It is well-known that importance sampling is not efficient in high dimensions [47]. Therefore, we anticipate that for high-dimensional inference tasks with continuous random variables, Hamiltonian Monte Carlo-based methods could outperform EVMP both in terms of run-time and quality of the estimates. Nevertheless, it should be possible to alleviate the deficiencies of EVMP in high dimensions by replacing the IS and Laplace approximations with HMC samplers. In essence, HMC is an MCMC method, and ref. [35] shows the efficiency of MCMC methods in the estimation of the expectations that are required in variational inference. Yet, in lower dimensions, we favor IS and Laplace approximations, both because of their promising performance scores in the experiments and because EVMP relieves users of hyperparameter tuning. Recall that in the SSSM experiments in Section 4.3, we tested HMC with various hyperparameters to attain the best performance, and yet EVMP was more successful in detecting the hidden regimes. Moreover, in contrast to EVMP, plain HMC is not applicable to discrete variables and needs to be combined with other samplers to run inference on models with both discrete and continuous variables.
In Appendix C, we introduce a variational free energy estimation method that resorts to approximations only if closed-form expressions of the information-theoretic measures are not available. This differs from alternative automated variational inference techniques, such as Automatic Differentiation Variational Inference (ADVI), which estimates the entire free energy via Monte Carlo summation. Moreover, like HMC, the applicability of ADVI is limited to continuous variables.
In EVMP, proposal distributions for importance sampling are automatically set to forward messages. Although it is a practical solution with an elegant interpretation in a message passing context, forward messages do not carry information regarding observations. Therefore, we may not acquire useful samples from forward messages if the observations lead to peaky backward messages. In future work, we aim to investigate the effects of alternative proposal distribution design methods.
One major drawback of our ForneyLab implementation is that ForneyLab does not allow loops during the inference procedure. We rarely encounter this problem with soft factors, since the mean-field assumption breaks loops by imposing additional factorizations on the variational distributions. However, this may not be the case with deterministic nodes, because the input and output variables of deterministic nodes remain tied to each other through a deterministic mapping even after the mean-field assumption. For example, consider a mixture model specification with deterministic factors p(w|z) and p(x|z) and likelihood p(y|x, w) = N(y; x, 1/w). Although this is a valid model specification with properly defined message passing rules, the EVMP algorithm is precluded due to the loop: the variable z is connected to two deterministic nodes (p(w|z) and p(x|z)), the outputs of which are connected to the same node p(y|x, w). Belief propagation (BP) [48,49] faces a similar problem on loopy graphs. Nonetheless, it has been shown that iteratively running BP on loopy graphs often yields satisfactory approximations, though convergence is not guaranteed ( [12], Section 8.4.7), ( [37], Section 22.2). Therefore, it is worth investigating the performance of EVMP executed in a loopy setting.
There are similarities between EVMP and Expectation Propagation (EP) [50,51] in the sense that both methods estimate the moment parameters of posteriors. In contrast to EP, which approximates belief propagation (BP) [48,49] messages, EVMP approximates VMP messages, which makes it applicable to a broader range of model specifications. In future work, we aim to investigate and exploit this relation.

Conclusions
We developed a hybrid message passing-based approach to variational Bayesian inference that supports deterministic and non-conjugate model segments. The proposed Extended VMP (EVMP) method defaults to analytical updates for conjugate factor pairs and uses a local Laplace approximation or importance sampling when numerical methods are needed. EVMP was implemented in Julia's ForneyLab package (see Appendix D) and a set of simulations shows very competitive inference performance on various inference tasks, particularly for state and parameter tracking in state-space models.

Acknowledgments:
The authors wish to thank our fellow researchers at BIASlab for interesting discussions on the topics in this paper.

Conflicts of Interest:
The authors declare no conflict of interest.


Appendix A. On the Applicability of VMP
In this section, we show that the applicability of the VMP algorithm relies on connected factors being conjugate pairs in the exponential family of distributions. Non-conjugate connected factors lead to intractable posteriors and messages. Nevertheless, we show that for a given soft factor, the corresponding VMP messages are locally expressed in terms of some expectation quantities. If these expectations are not available in closed form, then we can estimate them to approximate the VMP messages around the non-conjugate factor pairs. Let us focus on Figure 2. We postulate the following assumptions:

Figure A1. Deterministic conditional distributions often complicate VMP. An example deterministic conditional is visualized for f_δ(x, z_k) = p(x|z_k) = δ(x − g(z_k)). f_δ(x, z_k) and f_c(x, z_{k+1:K}) together form the composite node f_b(z_{k:K}).
• f_a(z_{1:k}) is an element of the exponential family (EF) of distributions, i.e.,

f_a(z_{1:k}) = h_a(z_k) exp(φ_a(z_k)^⊤ η_a(z_{1:k−1}) − A_a(z_{1:k−1})). (A1)

In this equation, h_a(z_k) is a base measure, η_a(z_{1:k−1}) is a vector of natural (or canonical) parameters, φ_a(z_k) is the vector of sufficient statistics, and A_a(z_{1:k−1}) is the log-partition function, i.e., A_a(z_{1:k−1}) = log(∫ h_a(z_k) exp(φ_a(z_k)^⊤ η_a(z_{1:k−1})) dz_k). It is always possible to write the log-partition function as a function of the natural parameters, A^η_a(η_a), such that A^η_a(η_a) = A_a(z_{1:k−1}). Throughout the paper, we sometimes prefer the natural parameterization of the log-partition function.
• We differentiate a few cases for f_b:
1. f_b is also an element of the EF, given by (A2), and is a conditionally conjugate pair with f_a for z_k. This (conditional conjugacy) property implies that, given z_{k+1:K}, we can modify f_b in such a way that its sufficient statistics match the sufficient statistics of f_a. Technically, this means we can rewrite f_b as in (A3). The crucial element of this rewrite is that both f_a(z_{1:k}) and f_b(z_{k:K}) are written as exponential functions of the same sufficient statistics function φ_a(z_k). This case leads to the regular VMP update equations; see Appendix A.1. Our Extended VMP does not need this assumption and derives approximate VMP update rules for the following extensions.

2. f_b is an element of the EF but not amenable to the modification given in (A3), i.e., it cannot be written as an exponential function of the sufficient statistics φ_a(z_k). Therefore, f_b is not a conjugate pair with f_a for z_k.

3. f_b(z_{k:K}) is a composition of a deterministic node with an EF node; see Figure A1. In particular, in this case f_b(z_{k:K}) can be decomposed as

f_b(z_{k:K}) = ∫ δ(x − g(z_k)) f_c(x, z_{k+1:K}) dx, (A4)

where x = g(z_k) is a deterministic, possibly nonlinear transformation and f_c(x, z_{k+1:K}) is an element of the EF, cf. (A5). We assume that the conjugate prior to f_c for the random variable x has sufficient statistics vector φ̃_c(x), and hence (A5) can be modified accordingly, where ĉ_c(z_{k+1:K}) refers to the terms that do not include x.
Appendix A.1. VMP with Conjugate Soft Factor Pairs

The original VMP algorithm arises as an efficient inference procedure in models that solely consist of conjugate factor pairs. This is because conjugate factor pairs yield analytically tractable message and posterior calculations. Next, we shortly review the effect of conjugate factor pairs on VMP updates.

Appendix A.1.1. Messages and Posteriors
The VMP message from the factor f_a to z_k can easily be evaluated by applying (3) to (A1), which yields (A7). Since f_b is conjugate to f_a, its functional form can be modified as in (A3), and by applying (3) to (A3), we find the VMP message from the factor f_b to z_k, (A8). Given that the messages −→m_{z_k}(z_k) and ←−m_{z_k}(z_k) have the same sufficient statistics, the posterior update step reduces to a summation of the messages' natural parameters, η_ab = η_a + η̃_b. For the sake of brevity, the distribution subscripts in the expectation notations are dropped. Note that we evaluate the posterior up to a normalization constant. Nevertheless, the log-normalizer function A^η_a(·) is readily available for EF distributions with sufficient statistics vector φ_a(z_k). As a consequence, the posterior evaluates to (A10). Having shown that conjugate factor pairs lead to a closed-form expression for the posterior q(z_k), we now investigate which expectation quantities related to q(z_k) are required in the outgoing VMP messages from the factors f_a and f_b to, say, z_1 and z_K. In practice, the message ←−m_{z_1}(z_1) is explicitly calculated by isolating the terms with z_1 in a sufficient statistics vector, as is done for z_k in (A8). Similarly, −→m_{z_K}(z_K) is explicitly calculated, analogous to the message −→m_{z_k}(z_k) in (A7). Here, we follow a rather different approach to explicitly expose the expectations related to q(z_k) in the message calculations. Substituting f_a and f_b with (A1) and (A3) in (A11a,b), and keeping in mind that the mean-field assumption allows separation of the expectation quantities of distinct random variables, the messages evaluate to (A12a,b). Notice that both messages require the expectation of the sufficient statistics vector φ_a(z_k). Fortunately, for EF distributions, ⟨φ_a(z_k)⟩_{q(z_k)} is available in closed form as the gradient of the log-normalizer ( [52], Proposition 3.1). For the sake of completeness, we now show that this equality holds.
Evaluating this gradient at η_a = η_ab, we arrive at (A13).
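The identity that the expected sufficient statistics equal the gradient of the log-normalizer can also be checked numerically. The following sketch (a minimal Python illustration, not part of the paper's ForneyLab implementation) compares finite-difference gradients of the univariate Gaussian log-partition function with the known moments ⟨x⟩ = µ and ⟨x²⟩ = µ² + σ²:

```python
import math

# Univariate Gaussian in natural parameters:
#   eta1 = mu / sigma^2, eta2 = -1 / (2 sigma^2), phi(x) = (x, x^2)
def log_partition(eta1, eta2):
    return -eta1**2 / (4.0 * eta2) - 0.5 * math.log(-2.0 * eta2)

def grad_log_partition(eta1, eta2, h=1e-6):
    # Central finite differences approximate the gradient of A(eta)
    dA1 = (log_partition(eta1 + h, eta2) - log_partition(eta1 - h, eta2)) / (2 * h)
    dA2 = (log_partition(eta1, eta2 + h) - log_partition(eta1, eta2 - h)) / (2 * h)
    return dA1, dA2

mu, sigma2 = 1.5, 0.8
dA1, dA2 = grad_log_partition(mu / sigma2, -1.0 / (2.0 * sigma2))
# dA1 recovers <x> = mu and dA2 recovers <x^2> = mu^2 + sigma^2
```

The same check can be carried out for any EF member with a differentiable log-partition function.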

Appendix A.1.2. Free Energy
As in the message and posterior calculations, conjugacy eases the free energy calculation. We investigate this for (5), restricted to the free energy terms that include z_k. F_k decomposes into average energy terms and a negative entropy term, (A16). Substituting f_a, f_b, and q(z_k) with (A1), (A3), and (A10) in this expression, the expectation terms related to z_k in F_k are ⟨φ_a(z_k)⟩_{q(z_k)} and ⟨log h_a(z_k)⟩_{q(z_k)}. The former expectation is available in closed form, (A13). Thus, F_k is analytically tractable for those distributions that possess a closed-form solution for ⟨log h_a(z_k)⟩_{q(z_k)}.
In short, conjugate factor pairs facilitate the VMP procedure by allowing closed-form expressions for the updates of messages (3), posteriors (4), and FE (5). Moreover, although exceptions exist, similar to the normalization of the posterior (A10), the messages −→m_{z_k}(z_k) and ←−m_{z_k}(z_k) can be effortlessly normalized if the required expectations are known. Therefore, we can directly parameterize them with standard probability distributions and draw samples from them. This property of EF distributions plays a pivotal role in our automation of the importance sampling procedure.

Appendix A.2. VMP with Non-Conjugate Soft Factor Pairs
Suppose that the soft factors f_a and f_b are no longer conjugate pairs, i.e., f_b given in (A2) can only be written in a form with sufficient statistics φ̃_b(z_k), where crucially φ̃_b(z_k) ≠ φ_a(z_k). Notice that η̃_b (≠ η_b) denotes the natural parameters after the modification of (A2) that isolates the terms with z_k in the sufficient statistics vector. Therefore, the messages −→m_{z_k}(z_k) and ←−m_{z_k}(z_k) differ in their sufficient statistics. In this case, the normalization constant calculation in the posterior update step (5) is no longer straightforward; worse, it is often intractable. The term intractable refers to integrals that are not available in closed form for continuous variables; for discrete variables, it refers to summations that cannot be carried out in a feasible amount of time. The lack of the normalization constant, ∫ −→m_{z_k}(z_k) ←−m_{z_k}(z_k) dz_k, hinders the calculation of the expectations of z_k terms that appear in the outgoing VMP messages from f_a and f_b to the variables z_{1:k−1} and z_{k+1:K}, respectively, e.g., ⟨φ_a(z_k)⟩_{q(z_k)} and ⟨φ̃_b(z_k)⟩_{q(z_k)}. As a result, non-conjugacies obstruct the VMP procedure by hampering closed-form expectation calculations. Bear in mind that, even though the VMP procedure is obstructed by intractable expectations, the messages are still well-defined for soft factors as functions of certain expectation quantities over their arguments. We use this property in our Extended VMP method. (A22) Note that the message in (A22) reduces to the VMP message from f_c to x followed by Belief Propagation (BP) [48,49]. The resulting backward message ←−m_{z_k}(z_k) has the sufficient statistics vector φ̃_c(g(z_k)). If φ̃_c(·) = φ_a(g^{−1}(·)), this case reduces to ordinary VMP as discussed in Appendix A.1; otherwise, this case is a special case of Appendix A.2 and q(z_k) is not available in closed form. Hence, the outgoing messages from the factor nodes f_a and f_b are intractable.
The last line in the above derivation follows from the transformation of variables [53], i.e., q(z_k) = q(x)|dx/dz_k|, and exposes the automatable nature of Variational Message Passing: the VMP message −→m_{z_K}(z_K) requires expectation quantities related to the arguments of the soft factor that z_K is tied to, in this case f_c. Therefore, once the VMP message passing rule is defined for the factor f_c as a function of its arguments, we can instantiate the messages by providing the required expectation quantities. For example, the required expectation quantities related to the argument x are contained in the sufficient statistics vector φ̃_c(x).
The second equality in the above expression is due to the transformation of variables, i.e., q(x) = q(z_k)|dz_k/dx| [53]. Substituting q(z_k) with (4) in the above integral yields (A24), where ←−m_{z_k}(z_k) = ←−m_x(g(z_k)), as given in (A22). Recall from Appendix A.3 that the normalizer, ∫ −→m_{z_k}(z_k) ←−m_{z_k}(z_k) dz_k, is often hard to calculate analytically. We use importance sampling [11,36] to approximate the integral in (A24):

⟨Φ(x)⟩_{q(x)} ≈ Σ_{i=1}^N w^{(i)} Φ(g(z_k^{(i)})), with w^{(i)} ∝ −→m_{z_k}(z_k^{(i)}) ←−m_x(g(z_k^{(i)})) / π(z_k^{(i)}), (A25)

where z_k^{(i)} ∼ π(z_k) for i = 1, ..., N are drawn from the proposal distribution π(z_k); the z_k^{(i)} are called particles, and their corresponding weights, normalized to sum to one, are denoted by w^{(i)}.
The design of a good proposal distribution has a critical role in IS. First, it is supposed to be an easy-to-sample distribution. Second, its support must be no smaller than the support of −→m_{z_k}(z_k) ←−m_x(g(z_k)) [36]. Lastly, the proposal distribution should be a good representation of q(z_k) to attain fast convergence [47]. In our automated design, −→m_{z_k}(z_k) constitutes the proposal distribution. This choice is not optimal in the sense that information regarding the evidence is most often carried by the backward message, which is not incorporated in our proposal design. However, −→m_{z_k}(z_k) satisfies the first two conditions, since the messages are parameterized with standard distributions (easy to sample) and it has nonzero probability wherever the posterior does. Substituting π(z_k) with −→m_{z_k}(z_k) in (A25) yields the following:

⟨Φ(x)⟩_{q(x)} ≈ Σ_{i=1}^N w^{(i)} Φ(g(z_k^{(i)})), with w^{(i)} = ←−m_x(g(z_k^{(i)})) / Σ_{j=1}^N ←−m_x(g(z_k^{(j)})), (A26)

where z_k^{(i)} ∼ −→m_{z_k}(z_k) for i = 1, ..., N, and the right-hand side of (A26) serves as our estimator for ⟨Φ(x)⟩_{q(x)}. Let us summarize the procedure in (A26) to define our first set of rules related to deterministic nodes. (A26) consists of samples that are drawn from −→m_{z_k}(z_k) and transformed through the deterministic mapping g(·). We cast this process as the calculation of the forward message −→m_x(x). Once the samples are transformed, i.e., x^{(i)} = g(z_k^{(i)}), the weights w^{(i)} are determined over ←−m_x(·). We interpret this process as the collision of the forward message −→m_x(x) and the backward message ←−m_x(·); hence, we relate it to the posterior calculation. Setting Φ(g(z_k)) = δ(x − g(z_k)), our interpretation of message collision becomes obvious, since (A26) then results in a Monte Carlo estimate for q(x).
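The automated IS procedure above can be illustrated with a small self-contained sketch (Python here for illustration; the paper's implementation is in Julia/ForneyLab, and the prior, the nonlinearity g(z) = exp(z), the observation y, and the Gaussian backward message below are hypothetical choices). Samples are drawn from the forward message and weighted by the backward message evaluated at the transformed particles:

```python
import math, random

random.seed(1)

def gauss_pdf(x, mean, var):
    return math.exp(-(x - mean)**2 / (2 * var)) / math.sqrt(2 * math.pi * var)

N = 5000
y = 2.0                                          # hypothetical observation
z = [random.gauss(0.0, 1.0) for _ in range(N)]   # z_i ~ forward message (= proposal)
x = [math.exp(zi) for zi in z]                   # transformed particles x_i = g(z_i)
w = [gauss_pdf(y, xi, 1.0) for xi in x]          # unnormalized weights from <-m_x(.)
s = sum(w)
w = [wi / s for wi in w]                         # self-normalization

# Weighted sum over particles estimates <x>_{q(x)}
x_mean = sum(wi * xi for wi, xi in zip(w, x))
```

Replacing x by any other statistic Φ(x) in the last line yields the corresponding expectation estimate.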
As a result, we introduce our first set of rules related to deterministic nodes and the posterior approximation at the output edge of the deterministic node, with particles z_k^{(i)} ∼ −→m_{z_k}(z_k). Here, we introduce the term list of weighted samples (LWS) to refer to distributions that are represented by a set of samples and corresponding weights. Above, −→m_x(x) and q(x) are represented by LWS distributions. Now, we turn our attention to the calculation of q(z_k) and the expectation quantity ⟨Φ(z_k)⟩_{q(z_k)}. For this task we have two different strategies: if −→m_{z_k}(z_k) is a Gaussian message, we approximate q(z_k) by the Laplace approximation, which is also automatable thanks to automatic differentiation; otherwise, we follow the IS procedure introduced above. Let us go over them, starting with the latter.
Appendix B.1.1. Non-Gaussian Case

This time, we are supposed to evaluate ⟨Φ(z_k)⟩_{q(z_k)} so that the VMP messages toward z_1, ..., z_{k−1} can be computed. Notice that the procedure is exactly the same as (A25), except that the expectation quantity of interest, ⟨Φ(z_k)⟩_{q(z_k)}, does not involve the deterministic mapping g(·) this time. Therefore, by using −→m_{z_k}(z_k) as the proposal distribution, we can estimate ⟨Φ(z_k)⟩_{q(z_k)} as

⟨Φ(z_k)⟩_{q(z_k)} ≈ Σ_{i=1}^N w^{(i)} Φ(z_k^{(i)}). (A28)

This gives us the second set of rules related to deterministic mappings. An element of this new set of rules is that the backward message is directly passed in probability density function (pdf) form. Recall from Appendix A.1 that the messages used to carry standard EF distributions. Now, we make an exception and introduce ←−m_{z_k}(z_k), which is no longer associated with any of the standard EF distributions. Nonetheless, ←−m_{z_k}(z_k) takes an exponential form since ←−m_x(·) is an EF distribution (see (A22)). Therefore, we call ←−m_{z_k}(z_k) a non-standard exponential family (NEF) distribution. Having defined the backward message, let us evaluate the posterior q(z_k). Similar to q(x), substituting Φ(z_k) with δ(z_k − z_k^{(i)}) in (A28) gives us a Monte Carlo estimate of q(z_k), where z_k^{(i)} ∼ −→m_{z_k}(z_k).

Appendix B.1.2. Gaussian Case
In FFGs, models are often constructed in such a way that the most prevalent message types are Gaussian. This is because Gaussian messages facilitate inference by allowing many inference-related operations to be executed in closed form, such as summation, conditioning, and scaling and shifting by constants. In order to retain the computational advantages of the Gaussian distribution, we take it as an implicit hint that the posterior distribution is Gaussian-like if −→m_{z_k}(z_k) is a Gaussian message. Then, we use the Laplace approximation ( [12], Section 4.4) to approximate q(z_k) with N(z_k; µ_{z_k}, V_{z_k}), where ∇_{z_k} f denotes the gradient of f with respect to z_k and ∇∇_{z_k} f|_{z_k=µ_{z_k}} refers to the Hessian of f with respect to z_k evaluated at µ_{z_k}. Note that the gradient and the Hessian reduce to the first and second derivatives, respectively, if z_k is scalar. The Laplace approximation is a mode-seeking algorithm. We use automatic differentiation (autodiff) [13] to evaluate the gradient ∇_{z_k}(log −→m_{z_k}(z_k) + log ←−m_{z_k}(z_k)) and employ it in a gradient-ascent algorithm to seek the mode (we supply the implementation details in Appendix D). Once the mode is reached, we evaluate the Hessian at the mode to fit the variance term of our Gaussian approximation.
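For the scalar case, the Laplace step can be sketched as follows (illustrative Python only; the ForneyLab implementation uses autodiff and gradient ascent, whereas this sketch uses finite-difference Newton iterations, and the forward message N(z; 0, 1) and backward message N(y; exp(z), 1) are hypothetical):

```python
import math

def log_joint(z, y=2.0):
    # log forward message N(z; 0, 1) plus log backward message N(y; exp(z), 1),
    # up to additive constants
    return -0.5 * z**2 - 0.5 * (y - math.exp(z))**2

def laplace(f, z0=0.0, h=1e-5, iters=100):
    z = z0
    for _ in range(iters):
        g = (f(z + h) - f(z - h)) / (2 * h)            # numerical gradient
        H = (f(z + h) - 2 * f(z) + f(z - h)) / h**2    # numerical second derivative
        z -= g / H                                     # Newton step toward the mode
    H = (f(z + h) - 2 * f(z) + f(z - h)) / h**2
    return z, -1.0 / H    # mean = mode, variance = -1 / (second derivative at mode)

mu_z, var_z = laplace(log_joint)   # Gaussian approximation N(z; mu_z, var_z)
```

The Gaussian fit N(z; mu_z, var_z) then plays the role of q(z_k) in the subsequent message updates.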
The assumption we make here, that a Gaussian −→m_{z_k}(z_k) implies a Gaussian-like q(z_k), paves the way for automating many well-known inference procedures achieved through the Laplace approximation, such as Bayesian logistic regression ( [37], Section 8.4), Laplace-Gaussian filtering and smoothing in state space models [54], Poisson Linear Dynamical Systems [55], etc. However, this assumption is not appropriate for all configurations. For example, a Gaussian prior on the rate parameter of a Poisson distribution would result in an ill-suited posterior approximation, since the domain of the rate is the positive real numbers while the Gaussian approximate posterior has support on the entire real axis. A better model specification could be achieved by mapping a Gaussian distributed random variable to the rate parameter through an inverse-link function, exp in this example. Likewise, a multi-modal backward message ←−m_{z_k}(z_k) with support on the real numbers often yields a multi-modal posterior, which can be better captured with particle methods. (In Appendix E, we show that it is possible to run particle filtering through Gaussian factor nodes in our technique.) In summary, our method resorts to the Laplace approximation to approximate q(z_k) with a Gaussian distribution whenever −→m_{z_k}(z_k) is Gaussian. Therefore, the user of our method must keep in mind the consequences of prior choices and build her model accordingly. The overall procedure for single-input deterministic functions is depicted in Figure A2. In the next subsection, we extend this procedure to deterministic mappings with multiple inputs.

Appendix B.2. Deterministic Mappings with Multiple Inputs
Consider the deterministic node f_δ(x, z_{1:k}) = δ(x − g(z_{1:k})) shown in Figure A3, where the inputs to the deterministic function g(·) are z_1, ..., z_k and the output is x.

Figure A3. A deterministic node with inputs z_1, ..., z_k and output x.
Before starting the discussion on the backward messages, let us define the forward message −→m_x(x). Analogous to the single-input case, we define −→m_x(x) with an LWS: the samples x^{(i)} = g(z_1^{(i)}, ..., z_k^{(i)}), with z_j^{(i)} ∼ −→m_{z_j}(z_j), carry equal weights. Once the message is calculated as a set of equally weighted samples, we scale the weights according to the importance scores of their corresponding samples to represent q(x). Now, let us define the backward messages propagated by the deterministic node. The exact backward message towards one of the input variables, say z_k, is

←−m_{z_k}(z_k) = ∫ δ(x − g(z_{1:k})) −→m_{z_1}(z_1) ··· −→m_{z_{k−1}}(z_{k−1}) ←−m_x(x) dz_1 ··· dz_{k−1} dx.
Unfortunately, the above integral is often intractable. Even if all the variables are discrete and the integral is replaced by a summation, it becomes intractable in practice as the number of variables increases. Here, we address this issue with two different approximation strategies. As in the previous subsection, the type of approximation depends on the messages incoming to the deterministic node on the input edges: if the messages −→m_{z_1}(z_1), ..., −→m_{z_k}(z_k) are all Gaussian, we approximate the joint posterior distribution of z_1, ..., z_k by a Gaussian distribution and then calculate the backward messages over the approximated joint posterior and the incoming messages. Otherwise, we use Monte Carlo summation. Let us start with the latter case.
Once the message ←−m_{z_k}(z_k) is approximately calculated and propagated as an NEF distribution, q(z_k) is also approximated either by IS or by Laplace, depending on the type of the message −→m_{z_k}(z_k), as discussed in Appendix B.1.
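For concreteness, the Monte Carlo treatment of a multi-input deterministic node can be sketched as follows (illustrative Python; the two Gaussian forward messages, the mapping g(z_1, z_2) = z_1·z_2, and the Gaussian backward message are hypothetical choices):

```python
import math, random

random.seed(0)

def gauss_pdf(x, mean, var):
    return math.exp(-(x - mean)**2 / (2 * var)) / math.sqrt(2 * math.pi * var)

N = 4000
y = 1.0   # hypothetical observation behind the backward message
# Draw joint samples from the forward messages of both inputs
z1 = [random.gauss(1.0, 0.5) for _ in range(N)]
z2 = [random.gauss(2.0, 0.5) for _ in range(N)]
x  = [a * b for a, b in zip(z1, z2)]       # forward LWS message: equal weights
w  = [gauss_pdf(y, xi, 1.0) for xi in x]   # rescale weights by the backward message
s  = sum(w)
w  = [wi / s for wi in w]

# Weighted particles now represent q(x); e.g., estimate <x>_{q(x)}
x_mean = sum(wi * xi for wi, xi in zip(w, x))
```

The same weighted particle set can be marginalized over either input to approximate the corresponding posteriors.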
which are contained in the sufficient statistics vector φ̃_c(x).

Having implemented the BP rule at a soft factor for incoming LWS messages, we have to show how posteriors are approximated through updating weights. Suppose that −→m_{z_k}(z_k) is a message that carries an LWS, and ←−m_{z_k}(z_k) is parameterized either by an EF or an NEF distribution. Then, we define the posterior update rule by rescaling the weights, w̃^{(i)} ∝ w^{(i)} ←−m_{z_k}(z_k^{(i)}). In Bootstrap particle filtering, these rules update the weights at equality nodes automatically. A major drawback of sequential importance sampling methods is that, the further samples propagate over time steps, the more they lose their ability to recover the underlying process, and many of the weights approach zero. This phenomenon is known as the degeneracy problem and can be alleviated by resampling [36,60]. In our automated setting, at each weight update step, we measure the effectiveness of the existing samples by n_eff = 1/Σ_{i=1}^N (w^{(i)})², as shown in [36]. Then, we resample if n_eff < N/10 [36]. A user can effortlessly execute a particle filtering procedure in our method by putting an LWS prior on the first hidden state of a sequential model and running BP inference on the model. (For demonstration purposes, we implemented BP rules at the Gaussian node for LWS messages. The user can implement the very same rules for other soft factors according to their needs. Visit https://github.com/biaslab/ForneyLab.jl/blob/master/demo/bootstrap_particle_filter.ipynb (accessed on 25 June 2021) for a toy example.)
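The effective-sample-size check and the resampling step can be sketched as follows (illustrative Python; the multinomial resampling shown here is one common variant, and the degenerate weight configuration is a hypothetical example):

```python
import random

random.seed(2)

def n_eff(weights):
    # Effective sample size: 1 / sum_i (w_i)^2, for normalized weights
    return 1.0 / sum(w * w for w in weights)

def resample(particles, weights):
    # Multinomial resampling: draw N particles with probability proportional
    # to their weights, then reset to uniform weights
    N = len(particles)
    new = random.choices(particles, weights=weights, k=N)
    return new, [1.0 / N] * N

N = 100
particles = [float(i) for i in range(N)]
weights = [0.1 / (N - 1)] * N
weights[0] = 0.9                      # a degenerate weight configuration

if n_eff(weights) < N / 10:           # resampling criterion used in the text
    particles, weights = resample(particles, weights)
```

After resampling, the weights are uniform and n_eff returns to N, at the cost of duplicating high-weight particles.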

Appendix F. Illustrative Example
Consider the following model, visualized in Figure A5: p(x) = N(x; µ_x, v_x), p(z) = N(z; µ_z, v_z), p(w|z) = δ(w − exp(z)), p(y|x, w) = N(y; x, 1/w), with observation y. In [9], VMP messages are provided as an example for a normal node parameterized by mean and precision. Here, we augment their example with a deterministic node to illustrate how Extended VMP operates to approximate the posteriors for x, w, and z.

Figure A5. The model p(x) = N(x; µ_x, v_x), p(z) = N(z; µ_z, v_z), p(w|z) = δ(w − exp(z)), p(y|x, w) = N(y; x, 1/w) is visualized together with the messages. The EVMP algorithm approximates the backward VMP message towards x by estimating ⟨w⟩ with a Monte Carlo summation. Once this VMP message is approximated, the update for q(x) is available in closed form. The backward messages toward w and z require ⟨x⟩ and ⟨x²⟩. These expectations can be computed analytically since they are the sufficient statistics of q(x). However, this time, the forward and the backward messages differ in sufficient statistics, which impedes the analytical calculations for q(w) and q(z). We approximate them by IS and Laplace, respectively.

• Initialize q(x) and q(z) with Normal distributions and q(w) with an LWS.
• Repeat the following three steps until convergence:
1. Choose w for updating.
2. Choose z for updating.
3. Choose x for updating. The forward message is the prior; update q(x) by rule (1) of Section 3.5.

The expectation quantities ⟨w⟩, ⟨x⟩, and ⟨x²⟩ that appear in the message calculations are computed according to Section 3.8. Therefore, while ⟨w⟩ is estimated via a Monte Carlo summation, ⟨x⟩ and ⟨x²⟩ are available in closed form.
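These alternating updates can be sketched in a minimal, self-contained Python program (the actual implementation is in Julia/ForneyLab; the priors and the observation below are hypothetical, and the Laplace update for q(z) is omitted for brevity, keeping only the closed-form q(x) update and the IS estimate of ⟨w⟩):

```python
import math, random

random.seed(3)

# Hypothetical priors and observation for the toy model
mu_x, v_x = 0.0, 10.0    # p(x) = N(x; mu_x, v_x)
mu_z, v_z = 0.0, 1.0     # p(z) = N(z; mu_z, v_z), w = exp(z)
y = 3.0
N = 3000

E_w = math.exp(mu_z + v_z / 2)   # initialize <w> at the prior mean of exp(z)
for _ in range(20):
    # 1) Closed-form update of q(x) = N(m, v): Gaussian prior combined with a
    #    Gaussian likelihood of precision <w>
    prec = 1.0 / v_x + E_w
    m = (mu_x / v_x + E_w * y) / prec
    v = 1.0 / prec
    E_x, E_x2 = m, v + m * m

    # 2) IS update for <w>: particles from the forward message through exp(.),
    #    weights from the backward VMP message of the normal node,
    #    log <-m_w(w) = 0.5 log w - 0.5 w <(y - x)^2> (up to a constant)
    Eyx2 = y * y - 2 * y * E_x + E_x2
    ws = [math.exp(random.gauss(mu_z, math.sqrt(v_z))) for _ in range(N)]
    lw = [0.5 * math.log(wi) - 0.5 * wi * Eyx2 for wi in ws]
    mx = max(lw)
    uw = [math.exp(l - mx) for l in lw]   # stabilized unnormalized weights
    s = sum(uw)
    E_w = sum(u / s * wi for u, wi in zip(uw, ws))
```

After convergence, q(x) = N(m, v) and the weighted particles (ws, uw/s) represent q(w), from which ⟨w⟩ feeds back into the q(x) update.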