Variational Message Passing and Local Constraint Manipulation in Factor Graphs

Accurate evaluation of Bayesian model evidence for a given data set is a fundamental problem in model development. Since evidence evaluations are usually intractable, in practice variational free energy (VFE) minimization provides an attractive alternative, as the VFE is an upper bound on the negative log-evidence (NLE). In order to improve tractability of the VFE, it is common to manipulate the constraints in the search space for the posterior distribution of the latent variables. Unfortunately, constraint manipulation may also lead to a less accurate estimate of the NLE. Thus, constraint manipulation implies an engineering trade-off between tractability and accuracy of model evidence estimation. In this paper, we develop a unifying account of constraint manipulation for variational inference in models that can be represented by a (Forney-style) factor graph, for which we identify the Bethe free energy (BFE) as an approximation to the VFE. We derive well-known message passing algorithms from first principles, as the result of minimizing the constrained BFE. The proposed method supports evaluation of the BFE in factor graphs for model scoring and development of new message passing-based inference algorithms that potentially improve evidence estimation accuracy.


Introduction
Building models from data is at the core of both science and engineering applications. The search for good models requires a performance measure that scores how well a particular model m captures the hidden patterns in a data set D. In a Bayesian framework, that measure is the Bayesian evidence p(D|m), i.e., the probability that model m would generate D if we were to draw data from m. The art of modeling is then the iterative process of proposing new model specifications, evaluating the evidence for each model and retaining the model with the most evidence [1].
Unfortunately, Bayesian evidence is intractable for most interesting models. A popular solution to evidence evaluation is provided by variational inference, which describes the process of Bayesian evidence evaluation as a (free energy) minimization process, since the variational free energy (VFE) is a tractable upper bound on Bayesian (negative log-)evidence [2]. In practice, the model development process then consists of proposing various candidate models, minimizing VFE for each model and selecting the model with the lowest minimized VFE.
The difference between VFE and negative log-evidence (NLE) is equal to the Kullback-Leibler divergence (KLD) [3] from the (perfect) Bayesian posterior distribution to the variational distribution for the latent variables in the model. The KLD can be interpreted as the cost of conducting variational rather than Bayesian inference. Perfect (Bayesian) inference would lead to zero inference cost (KLD = 0), and the KLD increases as the variational posterior diverges further from the Bayesian posterior. As a result, the minimized VFE may overestimate the NLE, and the quality of this estimate depends on the constraints that are imposed on the variational posterior.

Terminated Forney-Style Factor Graphs

A Forney-style factor graph (FFG) G = (V, E) represents a factorized function

f(s) = ∏_{a∈V} f_a(s_a) ,   (1)

where s_a collects the argument variables of factor f_a. We assume that all factors are positive. In an FFG, a node a ∈ V corresponds to a factor f_a, and the neighboring edges E(a) correspond to the variables s_a that are the arguments of f_a. As an example model, consider the factorization (2), the corresponding FFG of which is shown in Figure 1.
f(s_1, . . . , s_5) = f_a(s_1) f_b(s_1, s_2, s_3) f_c(s_2) f_d(s_3, s_4, s_5) f_e(s_5) .   (2)

The FFG of Figure 1 consists of five nodes V = {a, . . . , e}, as annotated by their corresponding factor functions, and five edges E = {(a, b), . . . , (d, e)}, as annotated by their corresponding variables. An edge that connects to only one node (e.g., the edge for s_4) is called a half-edge. In this example, the neighborhood E(b) = {(a, b), (b, c), (b, d)} and V((b, c)) = {b, c}.
In the FFG representation, a node can be connected to an arbitrary number of edges, while an edge can only be connected to at most two nodes. Therefore, FFGs often contain "equality nodes" that constrain connected edges to carry identical beliefs, with the implication that these beliefs can be made available to more than two factors. An equality node has the factor function

f_a(s_i, s_j, s_k) = δ(s_j − s_i) δ(s_j − s_k) ,   (3)

for which the node-induced subgraph G(a) is drawn in Figure 2. If every edge in the FFG has exactly two connected nodes (including equality nodes), then we designate the graph as a terminated FFG (TFFG). Since multiplication of a function f(s) by 1 does not alter the function, any FFG can be terminated by connecting any half-edge i to a node a that represents the unity factor f_a(s_i) = 1.

Figure 2. Visualization of the node-induced subgraph for an equality node. If the node function f_a is known, a symbol representing the node function is often substituted within the node ("=" in this case).

In Section 4.2 we discuss form constraints on posterior distributions. If such a constraint takes on a Dirac-delta functional form, then we visualize the constraint on the FFG by a small circle in the middle of the edge. For example, the small shaded circle in Figure 11 indicates that the variable has been observed. In Section 4.3.2 we consider form constraints in the context of optimization, in which case the circle annotation will be left open (see, e.g., Figure 14).

Variational Free Energy
Given a model f(s) and a (normalized) probability distribution q(s), we can define a variational free energy (VFE) functional as

F[q, f] ≜ ∫ q(s) log (q(s)/f(s)) ds .   (4)

Variational inference is concerned with finding solutions to the minimization problem

q*(s) = arg min_{q∈Q} F[q, f] ,   (5)

where Q imposes some constraints on q.
If q is unconstrained, then the optimal solution is obtained for q*(s) = p(s), with p(s) = (1/Z) f(s) being the exact posterior, and Z = ∫ f(s) ds a normalizing constant that is commonly referred to as the evidence. The minimum value of the free energy then follows as the negative log-evidence (NLE), F[q*, f] = −log Z, which is also known as the surprisal. The NLE can be interpreted as a measure of model performance, where low NLE is preferred.
As the unconstrained search space for q grows exponentially with the number of variables, the optimization of (5) quickly becomes intractable beyond the most basic models. Therefore, constraints and approximations to the variational free energy (4) are often utilized. As a result, the constrained variational free energy with q* ∈ Q bounds the NLE by

F[q*, f] = −log Z + ∫ q*(s) log (q*(s)/p(s)) ds ,   (6)

where the latter term expresses the divergence from the (intractable) exact solution to the optimal variational belief. In practice, the functional form of q(s) = q(s; θ) is often parameterized, such that gradients of F can be derived with respect to the parameters θ. This effectively converts the variational optimization of F[q, f] to a parametric optimization of F(θ) as a function of θ. This problem can then be solved by a (stochastic) gradient descent procedure [21,22].
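As a numerical illustration of the decomposition (6), the following Julia sketch (a toy example of our own; the two-state factor values and the variational belief are arbitrary choices) evaluates the VFE for a discrete model and confirms that it equals the NLE plus the KLD from the exact posterior:

    # Toy discrete model: f(s) holds the unnormalized factor values for s in {1, 2}.
    f = [0.3, 0.1]
    Z = sum(f)                              # model evidence
    p = f ./ Z                              # exact posterior p(s) = f(s)/Z

    vfe(q) = sum(q .* log.(q ./ f))         # F[q, f] = Σ_s q(s) log(q(s)/f(s))
    kld(q, p) = sum(q .* log.(q ./ p))      # KL divergence between q and p

    q = [0.6, 0.4]                          # an arbitrary normalized variational belief
    @assert isapprox(vfe(q), -log(Z) + kld(q, p))   # decomposition (6)
    @assert vfe(q) >= -log(Z)                       # the VFE upper-bounds the NLE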
In the context of variational calculus, form constraints may lead to interesting properties (see Section 4.2), but they are generally not required. Interestingly, in a variational optimization context, the functional form of q is often not an assumption, but rather a result of optimization (see Section 4.3.1). An example of variational inference is provided in Appendix A.

Bethe Free Energy
The Bethe approximation enjoys a unique place in the landscape of Q, because the Bethe free energy (BFE) defines the fundamental objective of the celebrated belief propagation (BP) algorithm [17,23]. The origin of the Bethe approximation is rooted in tree-like approximations to subgraphs (possibly containing cycles) by enforcing local consistency conditions on the beliefs associated with edges and nodes [24]. Given a TFFG G = (V, E) for a factorized function f(s) = ∏_{a∈V} f_a(s_a) (1), the Bethe free energy is defined as [25]

F[q, f] ≜ ∑_{a∈V} ∫ q_a(s_a) log (q_a(s_a)/f_a(s_a)) ds_a + ∑_{i∈E} ∫ q_i(s_i) log (1/q_i(s_i)) ds_i ,   (7)

where the first sum collects the local free energies F[q_a, f_a] and the second sum collects the edge entropies H[q_i]. The variational distribution factorizes over node and edge beliefs as

q(s) = ∏_{a∈V} q_a(s_a) ∏_{i∈E} q_i(s_i)^{−1} ,   (8)

such that the factorized beliefs satisfy the following constraints:

∫ q_a(s_a) ds_a = 1 , for all a ∈ V ,   (9a)
∫ q_a(s_a) ds_{a∖i} = q_i(s_i) , for all a ∈ V and all i ∈ E(a) .   (9b)
Together, the normalization constraint (9a) and marginalization constraint (9b) imply that the edge marginals are also normalized:

∫ q_i(s_i) ds_i = 1 , for all i ∈ E .   (10)

The Bethe free energy (7) includes a local free energy term F[q_a, f_a] for each node a ∈ V, and an entropy term H[q_i] for each edge i ∈ E. Note that the local free energy also depends on the node function f_a, as specified in the factorization of f (1), whereas the entropy only depends on the local belief q_i.
The Bethe factorization (8) and constraints are summarized by the local polytope [26]

L(G) = {q_a for all a ∈ V s.t. (9a), and q_i for all i ∈ E(a) s.t. (9b)} ,   (11)

which defines the constrained search space for the factorized variational distribution (8).

Problem Statement
In this paper, the problem is to find the beliefs in the local polytope that minimize the Bethe free energy,

q* = arg min_{q ∈ L(G)} F[q, f] ,   (12)

where q is defined by (8), and where q ∈ L(G) offers a shorthand notation for optimizing over the individual beliefs in the local polytope. In the following sections, we will follow the Lagrangian optimization approach to derive various message passing-based inference algorithms.

Sketch of Solution Approach
The problem statement of Section 2.4 defines a global minimization over the beliefs in the Bethe factorization. Instead of solving the global optimization problem directly, we employ the factorization of the variational posterior and the local polytope to subdivide the global problem statement into multiple interdependent local objectives.
From the BFE objective (12) and the local polytope (11), we can construct the Lagrangian

L[q, f] = F[q, f] + ∑_{a∈V} ψ_a [∫ q_a(s_a) ds_a − 1] + ∑_{i∈E} ψ_i [∫ q_i(s_i) ds_i − 1] + ∑_{a∈V} ∑_{i∈E(a)} ∫ λ_{ia}(s_i) [q_i(s_i) − ∫ q_a(s_a) ds_{a∖i}] ds_i ,   (13)

where the Lagrange multipliers ψ_a, ψ_i and λ_{ia} enforce the normalization and marginalization constraints of (9). It can be seen that this Lagrangian contains local beliefs q_a and q_i, which are coupled through the λ_{ia} Lagrange multipliers. The Lagrange multipliers λ_{ia} are doubly indexed, because there is a multiplier associated with each marginalization constraint. The Lagrangian method then converts the constrained optimization problem of F[q, f] into an unconstrained optimization problem of L[q, f]. The total variation of the Lagrangian (13) can then be approached from the perspective of variations of the individual (coupled) local beliefs. More specifically, given a locally connected pair b ∈ V, j ∈ E(b), we can rewrite the optimization of (12) in terms of the local beliefs q_b and q_j, and the constraints in the local polytope (14) that pertain to these beliefs. The problem then becomes finding local stationary solutions

{q_b^*, q_j^*} = arg min_{q_b, q_j ∈ L(G(b, j))} F[q, f] .   (15)

Using (13), the optimization of (15) can then be written in the Lagrangian form

q_b^* = arg min_{q_b} L_b[q_b] ,   (16a)
q_j^* = arg min_{q_j} L_j[q_j] ,   (16b)

where the Lagrangians L_b and L_j include the local polytope of (14) to rewrite (13) as an explicit functional of the beliefs q_b and q_j (see, e.g., Lemmas 1 and 2). The combined stationary solutions to the local objectives then also comprise a stationary solution to the global objective (Appendix B). The current paper shows how to identify stationary solutions to local objectives of the form (15), with the use of variational calculus, under varying constraints as imposed by the local polytope (14). Interestingly, the resulting fixed-point equations can be interpreted as message passing updates on the underlying TFFG representation of the model. In the following Sections 3 and 4, we derive the local stationary solutions under a selection of constraints and show how these relate to known message passing update rules (Table 1). It then becomes possible to derive novel message updates and algorithms by simply altering the local polytope.

Stationary Points of the Bethe Lagrangian
We wish to minimize the Bethe free energy under variations of the variational density. As the Bethe free energy factorizes over factors and variables (7), we first consider variations on separate node- and edge-induced subgraphs.

Lemma 1. Given a TFFG G = (V, E), consider the node-induced subgraph G(b) (Figure 3). The stationary points of the Lagrangian L_b of (16a), written as a functional of q_b as

L_b[q_b] = F[q_b, f_b] + ψ_b [∫ q_b(s_b) ds_b − 1] + ∑_{i∈E(b)} ∫ λ_{ib}(s_i) [q_i(s_i) − ∫ q_b(s_b) ds_{b∖i}] ds_i + C_b ,   (17)

where C_b collects all terms that are independent of q_b, are of the form

q_b(s_b) = f_b(s_b) ∏_{i∈E(b)} μ_{ib}(s_i) / ∫ f_b(s_b) ∏_{i∈E(b)} μ_{ib}(s_i) ds_b .   (18)

Proof. See Appendix D.1.
The μ_{ib}(s_i) are any set of positive functions that makes (18) satisfy (9b), and will be identified in Theorem 1.

Lemma 2. Given a TFFG G = (V, E), consider an edge-induced subgraph G(j) (Figure 4). The stationary solutions of the Lagrangian

L_j[q_j] = H[q_j] + ψ_j [∫ q_j(s_j) ds_j − 1] + ∑_{a∈V(j)} ∫ λ_{ja}(s_j) [q_j(s_j) − ∫ q_a(s_a) ds_{a∖j}] ds_j + C_j ,   (19)

where C_j collects all terms that are independent of q_j, are of the form

q_j(s_j) = μ_{jb}(s_j) μ_{jc}(s_j) / ∫ μ_{jb}(s_j) μ_{jc}(s_j) ds_j .   (20)

Minimizing the Bethe Free Energy by Belief Propagation
We now combine Lemmas 1 and 2 to derive the sum-product message update.
Theorem 1 (Sum-Product Message Update). Given a TFFG G = (V, E), consider the induced subgraph G(b, j) (Figure 5). Given the local polytope L(G(b, j)) of (14), the local stationary solutions to (15) are given by

q_b^*(s_b) = f_b(s_b) ∏_{i∈E(b)} μ_{ib}^*(s_i) / ∫ f_b(s_b) ∏_{i∈E(b)} μ_{ib}^*(s_i) ds_b ,   (21)

with messages μ_{jc}^*(s_j) corresponding to the fixed points of

μ_{jc}^{(k+1)}(s_j) = ∫ f_b(s_b) ∏_{i∈E(b)∖j} μ_{ib}^{(k)}(s_i) ds_{b∖j} ,   (22)

with k representing an iteration index.
Proof. See Appendix D.3.

The sum-product algorithm has proven to be useful in many engineering applications and disciplines. For example, it is widely used for decoding in communication systems [4,20,27]. Furthermore, for a linear Gaussian state space model, Kalman filtering and smoothing can be expressed in terms of sum-product message passing for state inference on a factor graph [28,29]. This equivalence has inspired applications ranging from localization [30] to estimation [31].
The sum-product algorithm with updates (22) obtains the exact Bayesian posterior when the underlying graph is a tree [24,25,32]. Application of the sum-product algorithm to cyclic graphs is not guaranteed to converge and might lead to oscillations in the BFE over iterations. Theorems 3.1 and 3.2 in [33] show that the BFE of a graph with a single cycle is convex, which implies that the sum-product algorithm will converge in this case. Moreover, ref. [19] shows that it is possible to obtain a double-loop message passing algorithm for cyclic graphs, such that the stable fixed points correspond to local minima of the BFE.
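As a minimal illustration of the update (22), the following Julia sketch (a toy example of our own, with arbitrary binary factors) passes sum-product messages on the chain f(s_1, s_2) = f_a(s_1) f_b(s_1, s_2) f_c(s_2) and verifies that the resulting edge beliefs coincide with the brute-force marginals, as expected on a tree:

    # Chain model f(s1, s2) = f_a(s1) f_b(s1, s2) f_c(s2) with binary variables.
    f_a = [0.4, 0.6]
    f_b = [0.7 0.3; 0.2 0.8]               # f_b[s1, s2]
    f_c = [0.5, 0.5]

    # Sum-product messages (22): sum the node function times the incoming messages.
    μ_1b = f_a                                   # message on edge s1 toward node b
    μ_2c = vec(sum(f_b .* μ_1b, dims=1))         # toward node c: Σ_{s1} f_b(s1,s2) μ_1b(s1)
    μ_2b = f_c                                   # message on edge s2 toward node b
    μ_1a = vec(sum(f_b .* μ_2b', dims=2))        # toward node a: Σ_{s2} f_b(s1,s2) μ_2b(s2)

    # Edge beliefs are normalized products of the two opposing messages (Lemma 2).
    q_1 = μ_1b .* μ_1a; q_1 ./= sum(q_1)
    q_2 = μ_2c .* μ_2b; q_2 ./= sum(q_2)

    # Brute-force marginals for comparison (exact, since the graph is a tree).
    joint = f_a .* f_b .* f_c'
    joint ./= sum(joint)
    @assert q_1 ≈ vec(sum(joint, dims=2))
    @assert q_2 ≈ vec(sum(joint, dims=1))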

Example 1.
A Linear Dynamical System. Consider a linear Gaussian state space model specified by the factors in (23). The FFG corresponding to one time segment of the state space model is given in Figure 6. We assume that the matrices used to generate the data are known and specified in (24), with θ = π/8. Given a collection of observations ŷ = {ŷ_1, . . . , ŷ_T}, we constrain the latent states x = {x_0, . . . , x_T} by local marginalization and normalization constraints in accordance with Theorem 1 (for brevity, we omit writing the normalization constraints explicitly). Moreover, we use data constraints in accordance with Theorem 3 (explained in Section 4.2.1) for the observations, state transition matrices and precision matrices, i.e.,

q(y_t) = δ(y_t − ŷ_t), q(A_t) = δ(A_t − Â_t), q(B_t) = δ(B_t − B̂_t), q(Q_t) = δ(Q_t − Q̂_t), q(R_t) = δ(R_t − R̂_t) .
Computation of the sum-product messages by (22) is analytically tractable, and detailed algebraic manipulations can be found in [31]. If the backward messages are not passed, then the resulting sum-product message passing algorithm is equivalent to Kalman filtering; if both forward and backward messages are propagated, then the Rauch-Tung-Striebel smoother is obtained [34] (Ch. 8).
We generated T " 100 observationsŷ using the matrices specified in (24) and the initial conditionx 0 " r5,´5s J . Due to (23a), we have µ x 0 g 1 " N pm x 0 , V x 0 q. We chose V x 0 " 100¨I and m x 0 "x 0 . Under these constraints, the results of sum-product message passing and Bethe free energy evaluation is given in Figure 6. As the underlying graph is a tree, sum-product message passing results are exact and the evaluated BFE corresponds to negative log-evidence. In the follow-up Example 2, we will modify the constraints and give a comparative free energy plot for the examples in Figures 10 and 16. with the sum-product messages computed according to (22). The three small dots at both sides of the graph indicate identical continuation of the graph over time. (Right) The small dots indicate the noisy observations that are synthetically generated by the linear state space model of (23) using parameter matrices as specified in (24). The posterior distribution for the hidden states are inferred by sum-product message passing and are drawn with shaded regions, indicating plus and minus the variance. The Bethe free energy evaluates to Frq, f s " 580.698.

Message Passing Variations through Constraint Manipulation
For generic node functions with arbitrary connectivity, there is no guarantee that the sum-product updates can be solved analytically. When analytic solutions are not possible, there are two ways to proceed. One way is to try to solve the sum-product update equations numerically, e.g., by Monte Carlo methods. Alternatively, we can add constraints to the BFE that lead to simpler update equations, at the cost of inference accuracy. In the remainder of the paper, we explore a variety of constraints that have proven to yield useful inference solutions.

Factorization Constraints
Additional factorizations of the variational density q_a(s_a) are often assumed to ease computation. In particular, we assume a structured mean-field factorization

q_a(s_a) = ∏_{n∈l(a)} q_a^n(s_a^n) ,   (26)

where n indicates a local cluster as a set of edges. To define a local cluster rigorously, let us first denote by P(a) the power set of an edge set E(a), where the power set is the set of all subsets of E(a). Then, a mean-field factorization l(a) ⊆ P(a) can be chosen such that all elements in E(a) are included in l(a) exactly once. Therefore, l(a) is defined as a set of one or multiple sets of edges. For example, if E(a) = {i, j, k}, then l(a) = {{i}, {j, k}} is allowed, as is l(a) = {{i, j, k}} itself, but l(a) = {{i, j}, {j, k}} is not allowed, since the element j occurs twice. More formally, in (26), the intersection of the super- and subscript collects the required variables; see Figure 7 for an example. The special case of a fully factorized l(b) = {{i} for all i ∈ E(b)} is known as the naive mean-field factorization [11,24].
We will analyze the effect of a structured mean-field factorization (26) on the Bethe free energy (7) for a specific factor node b ∈ V. Substituting (26) in the local free energy for factor b yields a sum of cluster-local contributions. We are then interested in the stationary solutions of the resulting objective, where the Lagrangian L_b^m (Lemma 3) enforces the normalization constraints (29a) and marginalization constraints (29b) on the cluster beliefs.

Figure 7. A node-induced subgraph G(b) with shaded sections that enclose the edges of an exemplary structured mean-field factorization l(b) = {m, n, r}. Note that, in this example, the cluster n only encompasses the single edge j, such that q_b^n(s_b^n) = q_j(s_j). In general, the assignment and number of edges in a cluster can be arbitrary.

Lemma 3. Given a terminated FFG G = (V, E), consider a node-induced subgraph G(b) with a structured mean-field factorization l(b) (e.g., Figure 7). Then, local stationary solutions to the corresponding Lagrangian, where C_b^m collects all terms independent of q_b^m, are of the form

q_b^m(s_b^m) ∝ ∏_{i∈m} μ_{ib}(s_i) exp( ∫ ∏_{n∈l(b)∖m} q_b^n(s_b^n) log f_b(s_b) ds_b^{∖m} ) .

Proof. See Appendix D.4.

We now combine Lemmas 2 and 3 to derive the structured variational message passing algorithm.
Theorem 2 (Structured Variational Message Passing). Given a TFFG G = (V, E), consider the induced subgraph G(b, j) with a structured mean-field factorization l(b) ⊆ P(b), with local clusters n ∈ l(b). Let m ∈ l(b) be the cluster with j ∈ m (see, e.g., Figure 8). Given the local polytope

L(G(b, j)) = {q_b^n for all n ∈ l(b) s.t. (29a), and q_j s.t. (29b)} ,   (33)

the local stationary solutions to (34) are given by the corresponding cluster and edge beliefs, with messages μ_{jc}^*(s_j) corresponding to the fixed points of

μ_{jc}^{(k+1)}(s_j) = ∫ ∏_{i∈m∖j} μ_{ib}^{(k)}(s_i) f̃_b^{(k)}(s_b^m) ds_b^{m∖j} ,   (36)

with iteration index k, and where

f̃_b^{(k)}(s_b^m) ≜ exp( ∫ ∏_{n∈l(b)∖m} q_b^{n,(k)}(s_b^n) log f_b(s_b) ds_b^{∖m} ) .

Proof. See Appendix D.5.

Figure 8. An example subgraph corresponding to G(b, j). Dashed ellipses enclose the edges of an exemplary exact cover l(b) = {m, n, r}. In general, the assignment and number of edges in a cluster can be arbitrary.
The structured mean-field factorization applies the marginalization constraint only to the local cluster beliefs, as opposed to the joint node belief. As a result, computation of the local cluster beliefs might become tractable [24] (Ch. 5). The practical appeal of variational message passing (VMP)-based inference becomes evident when the underlying model is composed of conjugate factor pairs from the exponential family. When the underlying factors are conjugate exponential family distributions, the message passing updates (36) amount to adding natural parameters [35] of the underlying exponential family distributions. Structured variational message passing is popular in acoustic signal modelling, e.g., [36], as it allows one to keep track of correlations over time. In [37], a stochastic variant of structured variational inference is utilized for Latent Dirichlet Allocation. Structured approximations are also used to improve inference in auto-encoders. In [38], inference involving non-parametric Beta-Bernoulli process priors is improved by developing a structured approximation to variational auto-encoders. When the data being modelled are time series, structured approximations reflect the transition structure over time. In [39], an efficient structured black-box variational inference algorithm for fitting Gaussian variational models to latent time series is proposed.
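A simple instance of this natural-parameter arithmetic (our own illustration for the univariate Gaussian case, not an excerpt from the derivations above) is the product of two Gaussian messages on an edge, which amounts to adding their canonical parameters:

    # A univariate Gaussian ∝ exp(ξ s − W s²/2) has natural parameters (ξ, W) = (m/v, 1/v).
    natural(m, v) = (m / v, 1 / v)
    moments(ξ, W) = (ξ / W, 1 / W)          # back to (mean, variance)

    # Two incoming Gaussian messages on the same edge:
    ξ1, W1 = natural(1.0, 2.0)
    ξ2, W2 = natural(-0.5, 4.0)

    # Their product (the edge belief, up to normalization) is again Gaussian,
    # with natural parameters simply added:
    m, v = moments(ξ1 + ξ2, W1 + W2)
    # Equivalent precision-weighted form: v = 1/(1/2 + 1/4), m = v*(1.0/2 - 0.5/4)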

Example 2.
Consider the linear Gaussian state space model of Example 1. Let us assume that the precision matrix for the latent-state transitions, Q_t, is not known and cannot be constrained by data. Then, we can augment the state space model by including a prior for Q_t and try to infer a posterior over Q_t from the observations. Since Q_t is the precision of a normal factor, we chose a conjugate Wishart prior and assumed that Q_t is time-invariant, adding the corresponding factors to the model. It is certainly possible to assume a time-varying structure for Q_t; however, our purpose is to illustrate a change in constraints rather than analyzing time-varying properties, which is why we assume time-invariance. In this setting, the sum-product equations around the factor h_t are not analytically tractable. Therefore, we changed the constraints associated with h_t (25b) to those given in Theorem 2, namely (39a) and (39b). We removed the data constraint on q(Q_t) and instead included data constraints on the hyperparameters,

q(V) = δ(V − V̂), q(ν) = δ(ν − ν̂) .
With the new set of constraints (39a) and (39b), we obtained a hybrid of the sum-product and structured VMP algorithms, where the structured messages around the factor h_t are computed by (36) and the rest of the messages are computed by the sum-product rule (22). One time segment of the modified FFG, along with the messages, is given in Figure 9. We used the same observations ŷ that were generated in Example 1 and the same initialization for the hidden states. For the hyperparameters of the Wishart prior, we chose V̂ = 0.1·I and ν̂ = 2. Under these constraints, the results of structured variational message passing, along with the Bethe free energy evaluation, are given in Figure 9.

Naive Variational Message Passing
As a corollary of Theorem 2, we can consider the special case of a naive mean-field factorization, which is defined for node b as

l(b) = {{i} for all i ∈ E(b)} .   (41)

The naive mean-field constraint (41) simplifies the local free energy accordingly.

Corollary 1 (Naive Variational Message Passing). Given a TFFG G = (V, E), consider the induced subgraph G(b, j) with a naive mean-field factorization l(b) = {{i} for all i ∈ E(b)}. Let m ∈ l(b) be the cluster with m = {j}. Given the local polytope of (33), the local stationary solutions to (34) are given by the corresponding edge beliefs, where the messages μ_{jc}^*(s_j) are the fixed points of the iterations

μ_{jc}^{(k+1)}(s_j) = exp( ∫ ∏_{i∈E(b)∖j} q_i^{(k)}(s_i) log f_b(s_b) ds_{b∖j} ) ,   (43)

where k is an iteration index.
The naive mean-field factorization limits the search space of beliefs by imposing strict constraints on the variational posterior. As a result, the variational posterior also loses flexibility. To improve inference performance for sparse Bayesian learning, the authors of [40] propose a hybrid mechanism by augmenting naive mean-field VMP with sum-product updates. This hybrid scheme reduces the complexity of the sum-product algorithm, while improving the accuracy of the naive VMP approach. In [41], naive VMP is applied to semi-parametric regression and allows for scaling of regression models to large data sets.
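A classic instance of the naive mean-field updates is a Gaussian observation model with unknown mean and unknown precision. The following Julia sketch (a toy example of our own, with arbitrary priors and data) iterates the two coordinate updates for q(μ) and q(λ) to a fixed point:

    # Observations x_i ~ N(μ, 1/λ), priors μ ~ N(m0, v0) and λ ~ Gamma(a0, b0) (shape/rate).
    # Naive mean-field: q(μ, λ) = q(μ) q(λ), updated by coordinate ascent.
    function naive_vmp(x; m0=0.0, v0=10.0, a0=1.0, b0=1.0, iters=50)
        m, v = m0, v0                  # q(μ) = N(m, v)
        a, b = a0, b0                  # q(λ) = Gamma(a, b)
        N = length(x)
        for _ in 1:iters
            Eλ = a / b                                 # E_q[λ]
            v = 1 / (1 / v0 + N * Eλ)                  # update q(μ): precisions add
            m = v * (m0 / v0 + Eλ * sum(x))
            a = a0 + N / 2                             # update q(λ)
            b = b0 + 0.5 * sum((x .- m) .^ 2 .+ v)     # uses E_q[(x_i − μ)²] = (x_i − m)² + v
        end
        return (m, v, a, b)
    end

    x = [1.8, 2.3, 1.9, 2.6, 2.1]
    m, v, a, b = naive_vmp(x)
    println("q(μ) ≈ N($m, $v),  q(λ) ≈ Gamma($a, $b)")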

Example 3.
As a follow-up on Example 2, we relaxed the constraints (39a) and (39b) to the naive mean-field constraints (44) presented in Corollary 1. The FFG remains the same and we use identical data constraints as in Example 2. Together with constraint (44), we obtained a hybrid of the naive variational message passing and sum-product message passing algorithms, where the messages around the factor h_t are computed by (43) and the rest of the messages by the sum-product rule (22). Using the same data as in Example 1, the results for naive VMP are given in Figure 10, along with the evaluated Bethe free energy.

Form Constraints
Form constraints limit the functional form of the variational factors q_a(s_a) and q_i(s_i). One of the most widely used form constraints, the data constraint, is also illustrated in Appendix A.

Data Constraints
A data constraint can be viewed as a special case of (9b), where the belief q_j is constrained to be a Dirac-delta function [42], such that

∫ q_a(s_a) ds_{a∖j} = q_j(s_j) = δ(s_j − ŝ_j) ,

where ŝ_j is a known value, e.g., an observation.

Lemma 4.
Given a TFFG G " pV, E q, consider the node-induced subgraph Gpbq (Figure 3). Then local stationary solutions to the Lagrangian where C b collects all terms that are independent of q b , are of the form Proof. See Appendix D.7.
Theorem 3 (Data-Constrained Sum-Product). Given a TFFG G = (V, E), consider the induced subgraph G(b, j) (Figure 11). Given the local polytope in which the belief on edge j is data-constrained, the local stationary solutions q_b^* = arg min_{q_b} L_b[q_b] are obtained with the message

μ_{jb}^*(s_j) = δ(s_j − ŝ_j) .
Proof. See Appendix D.8.

Figure 11. Visualization of a subgraph G(b, j) with indicated messages, where the dark circled delta indicates a data constraint, i.e., the variable s_j is constrained to have a distribution of the form δ(s_j − ŝ_j).
Note that the resulting message μ_{jb}^*(s_j) to node b does not depend on messages from node c, as would be the case for a sum-product update. By the symmetry of Theorem 3 for the subgraph G(c, j), (A32) identifies the reverse message μ_{jc}^*(s_j) = δ(s_j − ŝ_j) as well. This implies that messages incoming to a data constraint (such as the message from node c) are not further propagated through the data constraint. The data constraint thus effectively introduces a conditional independence between the variables of neighboring factors (conditioned on the shared constrained variable). Interestingly, this is similar to the notion of an intervention [43], where a decision variable is externally forced to a realization. Data constraints allow information from data sets to be absorbed into the model. Essentially, (variational) Bayesian machine learning is an application of inference in a graph with data constraints. In our framework, data are a constraint, and machine learning via Bayes rule follows naturally from the minimization of the Bethe free energy (see also Appendix A).

Laplace Propagation
A second type of form constraint we consider is the Laplace constraint; see also [14]. Consider a second-order Taylor approximation of the local log-node function L_a(s_a) ≜ log f_a(s_a) around an approximation point ŝ_a, as

L_a(s_a; ŝ_a) = L_a(ŝ_a) + ∇L_a(ŝ_a)^T (s_a − ŝ_a) + ½ (s_a − ŝ_a)^T ∇²L_a(ŝ_a) (s_a − ŝ_a) .
From this approximation, we define the Laplace-approximated node function as

f̃_a(s_a; ŝ_a) ≜ exp(L_a(s_a; ŝ_a)) ,   (53)

which is substituted in the local free energy to obtain the Laplace-encoded local free energy as

F[q_a, f̃_a; ŝ_a] = ∫ q_a(s_a) log (q_a(s_a)/f̃_a(s_a; ŝ_a)) ds_a .   (54)
It follows that the Laplace-encoded optimization of the local free energy becomes the minimization of the Lagrangian (55), where the Lagrangian L_a imposes the marginalization and normalization constraints of (9) on (54).

Lemma 5.
Given a TFFG G " pV, E q, consider the node-induced subgraph Gpbq (Figure 12). The stationary points of the Laplace-approximated Lagrangian (55) as a functional of q b , where C b collects all terms that are independent of q b , which are of the form Proof. See Appendix D.9. Figure 12. The subgraph around a Laplace-approximated node b with indicated messages.
We can now formulate Laplace propagation as an iterative procedure, where the approximation point ŝ_b is chosen as the mode of the belief q_b(s_b).

Theorem 4 (Laplace Propagation). Given a TFFG G = (V, E), consider the induced subgraph G(b, j) (Figure 13) with the Laplace-encoded factor f̃_b as per (53). We write the model (1), with the Laplace-encoded factor f̃_b substituted for f_b, as f̃. Given the local polytope L(G(b, j)) of (14), the local stationary solutions to

{q_b^*, q_j^*} = arg min_{q_b, q_j ∈ L(G(b, j))} F[q, f̃]

are given by

q_j^*(s_j) = μ_{jb}^*(s_j) μ_{jc}^*(s_j) / ∫ μ_{jb}^*(s_j) μ_{jc}^*(s_j) ds_j ,

with ŝ_b and the messages μ_{jc}^*(s_j) the fixed points of the iterations

ŝ_b^{(k)} = arg max_{s_b} q_b^{(k)}(s_b) ,
μ_{jc}^{(k+1)}(s_j) = ∫ f̃_b(s_b; ŝ_b^{(k)}) ∏_{i∈E(b)∖j} μ_{ib}^{(k)}(s_i) ds_{b∖j} .

Proof. See Appendix D.10.

Figure 13. Visualization of a subgraph with indicated Laplace propagation messages. The node function f_b is denoted by f̃_b according to (53).
Laplace propagation was introduced in [14] as an algorithm that propagates mean and variance information when exact updates are expensive to compute. Laplace propagation has found applications in the context of Gaussian processes and support vector machines [14]. In the jointly normal case, Laplace propagation coincides with sum-product and expectation propagation [14,18].
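The following Julia sketch (a one-dimensional toy of our own; the non-Gaussian node function is an arbitrary choice) illustrates the fixed-point structure of Laplace propagation: the approximation point is located by Newton iterations on the log of the local belief, and the outgoing message is the Gaussian defined by the local curvature of the Laplace-encoded factor (53):

    function laplace_message(dL, d2L, m_in, v_in; iters=25)
        s_hat = m_in
        for _ in 1:iters
            g = dL(s_hat) - (s_hat - m_in) / v_in   # gradient of log q_b(s) = L(s) + log μ_jb(s) + const
            H = d2L(s_hat) - 1 / v_in               # curvature of log q_b
            s_hat -= g / H                          # Newton step toward the mode
        end
        W = -d2L(s_hat)                             # Laplace-encoded factor: Gaussian precision -L''(ŝ)
        ξ = W * s_hat + dL(s_hat)                   # corresponding linear natural parameter
        return ξ / W, 1 / W                         # outgoing Gaussian message μ_jc as (mean, variance)
    end

    # Arbitrary non-Gaussian node function f_b(s) = exp(L(s)) with L(s) = -s^4/4 + s,
    # and an incoming Gaussian message μ_jb = N(0.5, 2.0).
    dL(s) = -s^3 + 1
    d2L(s) = -3 * s^2
    m, v = laplace_message(dL, d2L, 0.5, 2.0)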

Expectation Propagation
Expectation propagation can be derived in terms of constraint manipulation by relaxing the marginalization constraints to expectation constraints. Expectation constraints are of the form

∫ q_a(s_a) T_i(s_i) ds_a = ∫ q_i(s_i) T_i(s_i) ds_i ,

for a given function (statistic) T_i(s_i). Technically, the statistic T_i(s_i) can be chosen arbitrarily. Nevertheless, the statistics are often chosen as sufficient statistics of an exponential family distribution. An exponential family distribution is defined by

p(s_i; η_i) = h(s_i) exp(η_i^T T_i(s_i)) / Z(η_i) ,

where η_i is the natural parameter, Z(η_i) is the partition function, T_i(s_i) is the sufficient statistic and h(s_i) is a base measure [24]. The statistic T_i(s_i) is called sufficient because, given observed values of the random variable s_i, the parameter η_i can be estimated using only T_i(s_i); that is, the estimator of η_i depends on the data only through these statistics. The idea behind expectation propagation [18] is to relax the marginalization constraints to moment-matching constraints by choosing sufficient statistics from exponential family distributions [12]. The relaxation allows approximating the marginals of the sum-product algorithm with exponential family distributions. By keeping the marginals within the exponential family, the complexity of the resulting computations is reduced.

Lemma 6. Given a TFFG G = (V, E), consider the node-induced subgraph G(b) (Figure 3). The stationary points of the Lagrangian with expectation constraints defined by the sufficient statistics T_j, where C_b collects all terms that are independent of q_b, are of the form

q_b(s_b) = f_b(s_b) ∏_{i∈E(b)} μ_{ib}(s_i) / ∫ f_b(s_b) ∏_{i∈E(b)} μ_{ib}(s_i) ds_b ,

with incoming exponential family message μ_{jb}(s_j) = exp(η_{jb}^T T_j(s_j)).
Proof. See Appendix D.11.

Lemma 7. Given a TFFG G = (V, E), consider an edge-induced subgraph G(j) (Figure 4). The stationary solutions of the Lagrangian

L_j[q_j] = H[q_j] + ψ_j [∫ q_j(s_j) ds_j − 1] + ∑_{a∈V(j)} η_{ja}^T [∫ q_j(s_j) T_j(s_j) ds_j − ∫ q_a(s_a) T_j(s_j) ds_a] + C_j ,

with sufficient statistics T_j(s_j), and where C_j collects all terms that are independent of q_j, are of the form

q_j(s_j) = exp([η_{jb} + η_{jc}]^T T_j(s_j)) / ∫ exp([η_{jb} + η_{jc}]^T T_j(s_j)) ds_j .
Theorem 5 (Expectation Propagation). Given a TFFG G = (V, E), consider the induced subgraph G(b, j), with the marginalization constraint on edge j relaxed to an expectation constraint with sufficient statistics T_j(s_j), and with μ_{jb}(s_j) = exp(η_{jb}^T T_j(s_j)) an exponential family message (from Lemma 6). Then, the local stationary solutions to (15) are given by

q_b^*(s_b) = f_b(s_b) ∏_{i∈E(b)} μ_{ib}^*(s_i) / ∫ f_b(s_b) ∏_{i∈E(b)} μ_{ib}^*(s_i) ds_b ,

with η_{jb}^*, η_{jc}^* and μ_{jc}^*(s_j) being the fixed points of the iterations

μ_{jc}^{(k)}(s_j) = ∫ f_b(s_b) ∏_{i∈E(b)∖j} μ_{ib}^{(k)}(s_i) ds_{b∖j} ,

followed by moment-matching of the hybrid belief q̃_j^{(k)}(s_j) ∝ μ_{jb}^{(k)}(s_j) μ_{jc}^{(k)}(s_j) onto the exponential family defined by T_j, which yields the updated natural parameters η_{jb}^{(k+1)} and η_{jc}^{(k+1)}.

Moment-matching can be performed by solving [24] (Proposition 3.1)

∇_{η_j} log Z_j(η_j) = ∫ q̃_j(s_j) T_j(s_j) ds_j

for η_j, where

Z_j(η_j) = ∫ exp(η_j^T T_j(s_j)) ds_j .
In practice, for a Gaussian approximation, the natural parameters can be obtained by converting the matched mean and variance of q̃_j(s_j) to the canonical form [18]. Computing the moments of q̃_j(s_j) is often challenging due to the lack of closed-form solutions for the normalization constant. In order to address the computation of moments in EP, ref. [44] proposes to evaluate challenging moments by quadrature methods. For multivariate random variables, moment-matching by spherical radial cubature is advantageous as it reduces the computational complexity [45]. Another popular way of evaluating the moments is through importance sampling [46] (Ch. 7) and [47]. Expectation propagation has been utilized in various applications ranging from time series estimation with Gaussian processes [48] to Bayesian learning with stochastic natural gradients [49]. When the likelihood functions for Gaussian process classification are not Gaussian, EP is often utilized [50] (Ch. 3). In [51], a message passing-based expectation propagation algorithm is developed for models that involve both continuous and discrete random variables. Perhaps the most prominent practical applications of EP are in the context of probabilistic programming [52], where it is heavily used in real-world settings.
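The following Julia sketch (a one-dimensional toy of our own, with a logistic site as the non-conjugate factor and brute-force quadrature as a stand-in for the cited cubature or sampling methods) performs a single EP-style update: it forms the hybrid belief, matches its first two moments, and divides out the cavity message in natural parameters:

    # Edge j carries a Gaussian cavity message μ_jb(s) = N(m, v) and meets a non-conjugate
    # node function f_b(s) = σ(s) (a logistic site; our own toy choice).
    σ(s) = 1 / (1 + exp(-s))
    m, v = 0.0, 4.0

    # Hybrid (tilted) belief q̃_j(s) ∝ f_b(s) μ_jb(s); match its first two moments
    # by quadrature on a grid.
    grid = range(-10, 10; length=2001)
    w = σ.(grid) .* exp.(-(grid .- m) .^ 2 ./ (2v))
    Z = sum(w)
    m_tilt = sum(w .* grid) / Z
    v_tilt = sum(w .* (grid .- m_tilt) .^ 2) / Z

    # Project onto the Gaussian family and divide out the cavity message in
    # natural parameters: μ_jc ∝ q_j / μ_jb (the EP message update).
    W_out = 1 / v_tilt - 1 / v
    ξ_out = m_tilt / v_tilt - m / v
    println("μ_jc ≈ N(", ξ_out / W_out, ", ", 1 / W_out, ")")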

Hybrid Constraints
In this section, we consider hybrid methods that combine factorization and form constraints, and formalize some well-known algorithms in terms of message passing.

Mean-Field Variational Laplace
Mean-field variational Laplace applies the mean-field factorization to the Laplace-approximated factor function. The appeal of this method is that all messages outbound from the Laplace-approximated factor can be represented by Gaussians.

Theorem 6 (Mean-Field Variational Laplace). Given a TFFG G = (V, E), consider the induced subgraph G(b, j) (Figure 13) with the Laplace-encoded factor f̃_b as per (53). We write the model (1), with the Laplace-encoded factor f̃_b substituted for f_b, as f̃. Furthermore, assume a naive mean-field factorization l(b) = {{i} for all i ∈ E(b)}. Let m ∈ l(b) be the cluster with m = {j}. Given the local polytope of (33), the local stationary solutions are given by the cluster beliefs q_b^m(s_b^m) and edge beliefs q_j^*(s_j), where μ_{jc}^* represents the fixed points of the naive variational iterations (43) applied to the Laplace-encoded factor f̃_b.

Proof. See Appendix D.14.
Conveniently, under these constraints, every outbound message from node b will be proportional to a Gaussian. Substituting the Laplace-approximated factor function in the variational message update, we obtain

μ_{jc}(s_j) ∝ exp( ∫ ∏_{i∈E(b)∖j} q_i(s_i) log f̃_b(s_b; ŝ_b) ds_{b∖j} ) .

Resolving this expectation yields a quadratic form in s_j, which after completing the square leads to a proportionally Gaussian message μ_{jc}(s_j). This argument holds for any edge adjacent to b, and therefore for all outbound messages from node b. Moreover, if the incoming messages are represented by Gaussians as well (e.g., because these are also computed under the mean-field variational Laplace constraint), then all beliefs on the edges adjacent to b will also be Gaussian. This significantly simplifies the procedure of computing the expectations, which illustrates the computational appeal of mean-field variational Laplace.
Mean-field variational Laplace is widely used in dynamic causal modeling [53] and more generally in cognitive neuroscience, partly because the resulting computations are deemed neurologically plausible [54][55][56].

Expectation Maximization
Expectation maximization (EM) can be viewed as a hybrid algorithm that combines a structured variational factorization with a Dirac-delta constraint, where the constrained value itself is optimized. Given a structured mean-field factorization l(a) ⊆ P(a), with a single-edge cluster m = {j}, expectation maximization considers local factorizations of the form

q_a(s_a) = δ(s_j − θ_j) ∏_{n∈l(a), n≠m} q_a^n(s_a^n) ,   (69)

where the belief for s_j is constrained by a Dirac-delta distribution, similar to Section 4.2.1. In (69), however, the variable s_j represents a random variable with (unknown) value θ_j ∈ R^d, where d is the dimension of the random variable s_j. We explicitly use the notation θ_j (as opposed to ŝ_j for the data constraint in Section 4.2.1) to clarify that this value is a parameter for the constrained belief over s_j that will be optimized; that is, θ_j does not represent a model parameter in itself. To make this distinction even more explicit, in the context of optimization, we will refer to Dirac-delta constraints as point-mass constraints. The factor-local free energy F[q_a, f_a; θ_j] then becomes a function of the parameter θ_j.
Theorem 7 (Expectation Maximization). Given a TFFG G = (V, E), consider the induced subgraph G(b, j) (Figure 14) with a structured mean-field factorization l(b) ⊆ P(b), with local clusters n ∈ l(b). Let m ∈ l(b) be the single-edge cluster with m = {j}, whose belief is point-mass constrained as in (69). Given the local polytope L(G(b, j)) = {q_b^n for all n ∈ l(b) s.t. (29a)}, the local stationary solutions combine the structured variational updates for the clusters n ≠ m with a maximization of the expected log node function over the point-mass location,

θ_j^* = arg max_{θ_j} ∫ ∏_{n∈l(b), n≠m} q_b^n(s_b^n) log f_b(s_b)|_{s_j = θ_j} ds_b^{∖m} .

Proof. See Appendix D.15.

Expectation maximization was formulated in [57] as an iterative method that optimizes log-expectations of likelihood functions, where each EM iteration is guaranteed to increase the expected log-likelihood. Moreover, under some differentiability conditions, the EM algorithm is guaranteed to converge [57] (Theorem 3). A detailed overview of EM for exponential families is available in [24] (Ch. 6). A formulation of EM in terms of message passing is given by [58], where message passing for EM is applied in a filtering and system identification context. In [58], the derivations are based on [57] (Theorem 1), whereas our derivations follow directly from variational principles.

Example 4.
Now suppose we do not know the angle θ for the state transition matrix A_t in Example 2 and would like to estimate the value of θ. Moreover, further suppose that we are interested in estimating the hyperparameters for the prior, m_{x_0} and V_{x_0}, as well as the precision matrix for the state transitions, Q_t. For this purpose, we changed the constraints of (25a) into EM constraints in accordance with Theorem 7, where we optimize θ̂, V̂_{x_0} and m̂_{x_0} with EM (V̂_{x_0} is further constrained to be positive definite during the optimization procedure). With the addition of the new EM constraints, the resulting FFG is given in Figure 15. The hybrid message passing algorithm consists of structured variational messages around the factor h_t, sum-product messages around w_t, n_t, m_t and r_t, and EM messages around g_0 and g_t. We used identical observations as in the previous examples. The results for the hybrid SVMP-EM-SP algorithm are given in Figure 16, along with the evaluated Bethe free energy over all iterations.
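The following Julia sketch (a toy example of our own, far simpler than the state space model of Example 4) shows the EM fixed-point structure of Theorem 7 for a single point-mass constrained parameter, here the prior variance of a latent variable:

    # EM for a point-mass constrained parameter θ (the prior variance of the latent x_i)
    # in the toy model x_i ~ N(0, θ), y_i = x_i + ε_i with ε_i ~ N(0, r); r is known.
    # E-step: q(x_i) = p(x_i | y_i; θ); M-step: θ ← argmax_θ E_q[ Σ_i log N(x_i; 0, θ) ].
    function em_variance(y, r; θ0=1.0, iters=50)
        θ = θ0
        for _ in 1:iters
            v = 1 / (1 / θ + 1 / r)            # E-step: posterior variance of each x_i
            m = v .* y ./ r                    #         posterior means
            θ = sum(m .^ 2 .+ v) / length(y)   # M-step: θ ← (1/N) Σ_i E_q[x_i²]
        end
        return θ
    end

    r = 0.5
    y = sqrt(2.0) .* randn(200) .+ sqrt(r) .* randn(200)   # synthetic data with true θ = 2
    θ_hat = em_variance(y, r)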

Overview of Message Passing Algorithms
In Sections 4.1-4.3, following a high-level recipe pioneered by [15], we presented first-principles derivations of some of the popular message passing-based inference algorithms by manipulating the local constraints of the Bethe free energy. The results are summarized in Table 1.
Crucially, the method of constrained BFE minimization goes beyond the reviewed algorithms. Through creating a new set of local constraints and following similar derivations based on variational calculus, one can obtain new message passing-based inference algorithms that better match the specifics of the generative model or application.

Scoring Models by Minimized Variational Free Energy
As discussed in Section 2.2, the variational free energy is an important measure of model performance. In Sections 5.1 and 5.2, we discuss some problems that occur when evaluating the BFE on a TFFG. In Section 5.3, we propose an algorithm that evaluates the constrained BFE as a summation of local contributions on the TFFG.

Evaluation of the Entropy of Dirac-Delta Constrained Beliefs
For continuous variables, data and point-mass constraints, as discussed in Sections 4.2.1 and 4.3.2 and Appendix A, collapse the information density to infinity, which leads to singularities in entropy evaluation [59]. More specifically, for a continuous variable s_j, the entropies for beliefs of the form q_j(s_j) = δ(s_j − ŝ_j) and q_a(s_a) = q_{a|j}(s_{a∖j}|s_j) δ(s_j − ŝ_j) both evaluate to −∞.
In variational inference, it is common to define the VFE only with respect to the latent (unobserved) variables [2] (Section 10.1). In contrast, in this paper, we explicitly define the BFE in terms of an iteration over all nodes and edges (7), which also includes non-latent beliefs in the BFE definition. Therefore, we define

q_j(s_j) = δ(s_j − ŝ_j) ⇒ H[q_j] ≜ 0 ,
q_a(s_a) = q_{a|j}(s_{a∖j}|s_j) δ(s_j − ŝ_j) ⇒ H[q_a] ≜ H[q_{a∖j}] ,

where q_{a|j}(s_{a∖j}|s_j) indicates the conditional belief and q_{a∖j}(s_{a∖j}) is the joint belief over the remaining variables. These definitions effectively remove the entropies for observed variables from the BFE evaluation. Note that although q_{a∖j}(s_{a∖j}) is technically not a part of our belief set (7), it can be obtained by marginalization of q_a(s_a) via (9b).

Evaluation of Node-Local Free Energy for Deterministic Nodes
Another difficulty arises with the evaluation of the node-local free energy F[q_a, f_a] for factors of the form

f_a(s_a) = δ(h_a(s_a)) .
This type of node function reflects deterministic operations; e.g., h(x, y, z) = z − x − y corresponds to the summation z = x + y. In this case, directly evaluating F[q_a, f_a] again leads to singularities.
There are (at least) two strategies available in the literature that resolve this issue. The first strategy "softens" the Dirac-delta by re-defining the node function with a narrow distribution of width ε, with 0 < ε ≪ 1 [17]. A drawback of this approach is that it may alter the model definition in a numerically unstable way, leading to a different inference solution and variational free energy than originally intended. The second strategy combines the deterministic factor f_a with a neighboring stochastic factor f_b into a new composite factor f_c, by marginalizing over a shared variable s_j, leading to [60]

f_c(s_c) ≜ ∫ δ(h_a(s_a)) f_b(s_b) ds_j ,

where s_c = {s_a ∪ s_b}∖s_j. This procedure has drawbacks for models that involve many deterministic factors; namely, the convenient model modularity and the resulting distributed compatibility are lost when large groups of factors are compacted into model-specific composite factors. We propose here a third strategy.
Theorem 8. Let f_a(s_a) = δ(h_a(s_a)), with h_a(s_a) = s_j − g_a(s_{a∖j}), and node-local belief q_a(s_a) = q_{j|a}(s_j|s_{a∖j}) q_{a∖j}(s_{a∖j}). Then, the node-local free energy evaluates to

F[q_a, f_a] = −H[q_{a∖j}]  if q_{j|a}(s_j|s_{a∖j}) = δ(s_j − g_a(s_{a∖j})) ,
F[q_a, f_a] = ∞  otherwise.
An example that evaluates the node-local free energy for a non-trivial deterministic node can be found in Appendix C.
The equality node is a special case of a deterministic node, with a node function of the form (3). The argument of Theorem 8 does not directly apply to this node; as the equality node function comprises two Dirac-delta functions, it cannot be written in the form of Theorem 8. However, we can still reduce the node-local free energy contribution.
Theorem 9. Let f_a(s_a) = δ(s_j − s_i) δ(s_j − s_k), with node-local belief q_a(s_a) = q_{ik|j}(s_i, s_k|s_j) q_j(s_j). Then, the node-local free energy evaluates to

F[q_a, f_a] = −H[q_j]  if q_{ik|j}(s_i, s_k|s_j) = δ(s_j − s_i) δ(s_j − s_k) ,
F[q_a, f_a] = ∞  otherwise.

Evaluating the Variational Free Energy
We propose here an algorithm that evaluates the BFE on a TFFG representation of a factorized model. The algorithm is based on the following results:
• The definitions for the computation of data-constrained entropies ensure that only variables with associated stochastic beliefs count towards the Bethe entropy. This makes the BFE evaluation consistent with Theorems 3 and 7, where the single-variable beliefs for observed variables are excluded from the BFE definition;
• We assume that a local mean-field factorization l(a) is available for each a ∈ V (Section 4.1). If the mean-field factorization is not explicitly defined, we assume l(a) = {a} is the unfactored set;
• Deterministic nodes are accounted for by Theorem 8, which reduces the joint entropy to an entropy over the "inbound" edges. Although the belief over the "inbounds" q_{a∖j}(s_{a∖j}) is not a term in the Bethe factorization (8), it can simply be obtained by marginalization of q_a(s_a);
• The equality node is a special case, where we let the node entropy discount the degree of the associated variable in the original model definition. While the BFE definition on a TFFG (7) does not explicitly account for edge degrees, this mechanism implicitly corrects for "double-counting" [17]. In this case, the edge selection for counting is arbitrary, because all associated edges are (by definition) constrained to share the same belief (Section 2.1, Theorem 9).
The decomposition of (7) shows that the BFE can be computed by an iteration over the nodes and edges of the graph. As some contributions to the BFE might cancel each other, the algorithm first tracks counting numbers u_a for the average energies

U_a[q_a] = −∫ q_a(s_a) log f_a(s_a) ds_a ,

and counting numbers h_k for the (joint) entropies

H[q_k] = −∫ q_k(s_k) log q_k(s_k) ds_k ,

which are ultimately combined and evaluated. We use an index k to indicate that the entropy computation may include not only the edges but a generic set of variables. We give the definition of the set that k belongs to in Algorithm 1.
Algorithm 1 Evaluation of the Bethe free energy on a Terminated Forney-style factor graph.
given a TFFG G " pV, E q given a local mean-field factorization lpaq for all a P V define q j ps j q " δps j´ŝj q ñ Hrq j s fi 0 Ź Ignore entropy of Dirac-delta constrained beliefs define q a ps a q " q a|j ps azj |s j qδps j´ŝj q ñ Hrq a s fi Hrq azj s Ź Reduce entropy of Dirac-delta constrained joint beliefs define K " ta, azi, n, for all a P V, i P E paq, n P lpaqu the set of (joint) belief indices initialize counting numbers u a " 0 for all a P V, h k " 0 for all k P K for all nodes a P V do if a is a stochastic node then u a`" 1 Ź Count the average energy for all clusters n P lpaq do h n`" 1 Ź Count the (joint) cluster entropy end for else if a is an equality node then Select an edge j P E paq h j`" 1 Ź Count the variable entropy else Ź Deterministic node a Obtain the node function f a ps a q " δps j´ga ps azj qq h azj`" 1 Ź Count the (joint) entropy of the inbounds end if end for for all edges i P E do h i´" 1 Ź Discount the variable entropy end for U " ř aPV u a U a rq a s H " ř kPK h k Hrq k s return F " U´H

Implementation of Algorithms and Simulations
We have developed a probabilistic programming toolbox, ForneyLab.jl, in the Julia language [61,62]. The majority of the algorithms reviewed in Table 1 have been implemented in ForneyLab, along with a variety of demos (https://github.com/biaslab/ForneyLab.jl/tree/master/demo, accessed on 23 June 2021). ForneyLab is extendable and supports postulating new local constraints of the BFE for the creation of custom message passing-based inference algorithms.
In order to limit the length of this paper, we refer the reader to the demonstration folder of ForneyLab and to several of our previous papers with code. For instance, our previous work in [63] implemented mean-field variational Laplace propagation for the hierarchical Gaussian filter (HGF) [64]. In the follow-up work [65], inference results were improved by changing to structured factorization and moment-matching local constraints. In that case, the modification of local constraints created a hybrid EP-VMP algorithm that better suited the model. Moreover, in [13], we formulated the idea of chance constraints in the form of violation probabilities, leading to a new message passing algorithm that supports goal-directed behavior within the context of active inference. A similar line of reasoning led to improved inference procedures for auto-regressive models [66].

Related Work
Our work is inspired by the seminal work [17], which discusses the equivalence between the fixed points of the belief propagation algorithm [32] and the stationary points of the Bethe free energy. This equivalence is established through a Lagrangian formalism, which allows for the derivation of Generalized Belief Propagation (GBP) algorithms by introducing region-based graphs and the region-based (Kikuchi) free energy [16].
Region graph-based methods allow for overlapping clusters (Section 4.1) and thus offer a more generic message passing approach. The selection of appropriate regions (clusters), however, proves to be difficult, and the resulting algorithms may grow prohibitively complex. In this context, ref. [67] addresses how to manipulate regions and manage the complexity of GBP algorithms. Furthermore, ref. [68] also establishes a connection between GBP and expectation propagation (EP) by introducing structured region graphs.
The inspirational work of [15] derives message passing algorithms by minimization of α-divergences. The stationary points of α-divergences are obtained by a fixed point projection scheme. This projection scheme is reminiscent of the minimization scheme of the expectation propagation (EP) algorithm [18]. Compared to [15], our work focuses on a single divergence objective (namely, the VFE). The work of [12] derives the EP algorithm by manipulating the marginalization and factorization constraints of the Bethe free energy objective (see also Section 4.2.3). The EP algorithm is, however, not guaranteed to converge to a minimum of the associated divergence metric.
To address the convergence properties of the algorithms that are obtained by region graph methods, the outstanding work of [33] derives conditions on the region counting numbers that guarantee the convexity of the underlying objective. In general, however, the constrained Bethe free energy is not guaranteed to be convex and therefore the derived message passing updates are not guaranteed to converge.

Discussion
The key message in this paper is that a (variational) Bayesian model designer may tune the tractability-accuracy trade-off for evidence and posterior evaluation through constraint manipulation. It is interesting to note that the technique to derive message passing algorithms is always the same. We followed the recipe pioneered in [15] to derive a large variety of message passing algorithms solely through minimizing constrained Bethe free energy. This minimization leads to local fixed-point equations, which we can interpret as message passing updates on a (terminated) FFG. The presented lemmas showed how the constraints affect the Lagrangians locally. The presented theorems determined the stationary solutions of the Lagrangians and obtained the message passing equations. Thus, if a designer proposes a new set of constraints, then the first place to start is to analyze the effect on the Lagrangian. Once the effect of the constraint on the Lagrangian is known, then variational optimization may result in stationary solutions that can be obtained by a fixed-point iteration scheme.
In this paper, we selected the Forney-style factor graph framework to illustrate our ideas. FFGs are mathematically comparable to the more common bi-partite factor graphs that associate round nodes with variables and square nodes with factors [20]. Bi-partite factor graphs require two distinct types of message updates (one leaving variable nodes and one leaving factor nodes), while message passing on a (T)FFG requires only a single type of message update [69]. The (T)FFG paradigm thus substantially simplifies the derivations and resulting message passing update equations.
The message passing update rules in this paper are presented without guarantees on convergence of the (local) minimization process. In practice, however, algorithm convergence can be easily checked by evaluating the BFE (Algorithm 1) after each belief update.
In future work, we plan on extending the treatment of constraints to formulate sampling-based algorithms such as importance sampling and Hamiltonian Monte Carlo in a message passing framework. While introducing SVMP, we have limited the discussion to local clusters that are not overlapping. We plan to extend variational algorithms to include local clusters that are overlapping without altering the underlying free-energy objective or the graph structure [16,67].

Conclusions
In this paper, we formulated a message-passing approach to probabilistic inference by identifying local stationary solutions of a constrained Bethe free energy objective (Sections 3 and 4). The proposed framework constructs a graph for the generative model and specifies local constraints for variational optimization in a local polytope. The constraints are then imposed on the variational objective by a Lagrangian construct. Unconstrained optimization of the Lagrangian then leads to local expressions of stationary points, which can be obtained by iterative execution of the resulting fixed point equations, which we identify with message passing updates.
Furthermore, we presented an approach to evaluate the BFE on a (terminated) Forneystyle factor graph (Section 5). This procedure allows an algorithm designer to readily assess the performance of algorithms and models.
We have included detailed derivations of message passing updates (Appendix D) and hope that the presented formulation inspires the discovery of novel and customized message passing algorithms.

Funding: This work was partly financed by GN Hearing A/S.

Acknowledgments:
The authors would like to thank the BIASlab team members for many very interesting discussions.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

VFE    Variational free energy
NLE    Negative log-evidence
KLD    Kullback-Leibler divergence
FFG    Forney-style factor graph
TFFG   Terminated Forney-style factor graph
BFE    Bethe free energy
BP     Belief propagation
VMP    Variational message passing
SVMP   Structured variational message passing
EP     Expectation propagation
EM     Expectation maximization
GBP    Generalized belief propagation
HGF    Hierarchical Gaussian filter

Appendix A. A Pedagogical Example of Variational Inference

In this section, we present a pedagogical example of inductive inference. After we establish an intuition, we apply the same principles to a more general context in the further sections. We follow Caticha [42,70], who showed that a constrained free energy functional can be interpreted as a principled objective measure for inductive reasoning; see also [71,72]. The calculus of variations offers a principled method for optimizing this free energy functional.
In this section, we assume an example model with observed variables y and a single parameter θ.
We define the (variational) free energy (VFE) as

F[q, f] = ∬ q(y, θ) log (q(y, θ)/f(y, θ)) dy dθ .

The goal is to find a posterior

q* = arg min_{q∈Q} F[q, f]   (A3)

that minimizes the free energy subject to some pre-specified constraints. These constraints may include form or factorization constraints on q (to be discussed later) or relate to observations of a signal y.
As an example, assume that we obtained some measurements y = ŷ and wish to obtain a posterior marginal belief q*(θ) over the parameter. We can then incorporate the data in the form of a data constraint

∫ q(y, θ) dθ = δ(y − ŷ) ,

where δ denotes a Dirac-delta. The constrained free energy can be rewritten by including Lagrange multipliers as

L[q, f] = F[q, f] + γ (∬ q(y, θ) dy dθ − 1) + ∫ λ(y) (∫ q(y, θ) dθ − δ(y − ŷ)) dy ,   (A5)

where the first term specifies the (to be minimized) free energy objective, the second term a normalization constraint, and the third term the data constraint. Optimization of (A5) can be performed using variational calculus. Variational calculus considers the impact of a variation in q(y, θ) on the Lagrangian L[q, f]. We define the variation as

δq(y, θ) ≜ ε φ(y, θ) ,

where ε → 0, and φ is a continuous and differentiable "test" function. The fundamental theorem of variational calculus states that the stationary solutions q* are obtained by requiring that the resulting variation of the Lagrangian vanishes for all admissible φ. Note that, since (A7c) has been written in a similar form as (A6), it is easy to identify the functional derivative. This procedure is one of many ways to obtain the functional derivatives [73].
We find that (A9) is always satisfied, since Z = ∬ exp(−λ(y)) f(y, θ) dy dθ by definition. Note, however, that the computation of the normalization constant still depends on the undetermined Lagrange multiplier λ(y).
Working out the data constraint then yields the stationary solution for q*, which we recognize as the Bayes rule. Note that the Bayes rule was derived here as a special case of constrained variational free energy minimization when data constraints are present. This derivation of the Bayes rule seems unnecessarily tedious, but the value of this approach to inductive inference is that the same principle applies when other (non-data) constraints on q are present.
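Written out for this example (a reconstruction in our notation, assuming the model is specified by the joint factor f(y, θ)), the resulting stationary solution reads

q*(y, θ) = δ(y − ŷ) f(ŷ, θ) / ∫ f(ŷ, θ) dθ ,  so that  q*(θ) = f(ŷ, θ) / ∫ f(ŷ, θ) dθ = p(θ | y = ŷ) ,

which is the Bayesian posterior over the parameter given the observed data.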

Appendix B. Lagrangian Optimization and the Dual Problem
With the addition of Lagrange multipliers to the Bethe functional, the resulting Lagrangian depends both on the variational distribution q(s) and the Lagrange multipliers Ψ(s). Formally, the introduction of the Lagrange multipliers allows us to rewrite the constrained optimization on the local polytope as an unconstrained optimization. We follow [33] and consider the saddle-point problem max_Ψ min_q L[q, Ψ]. The minimization with respect to q then yields a solution that depends on the Lagrange multipliers, as

q*(s; Ψ) = arg min_q L[q, Ψ] .

For any given q, the Lagrangian is concave in Ψ. Therefore, substituting q* in the Lagrangian, the maximization of L[q*, Ψ] over Ψ yields the unique solution

Ψ* = arg max_Ψ L[q*( · ; Ψ), Ψ] .

In the current paper, we consider factorized q's (e.g., (8)), and consider variations with respect to the individual factors. We then need to show that the combined stationary points of the individual factors also constitute a stationary point of the total objective.