Dynamics of Coordinate Ascent Variational Inference: A Case Study in 2D Ising Models

Variational algorithms have gained prominence over the past two decades as a scalable computational framework for Bayesian inference. In this article, we use tools from the dynamical systems literature to study the convergence of coordinate ascent algorithms for mean field variational inference. Focusing on the Ising model defined on two nodes, we fully characterize the dynamics of the sequential coordinate ascent algorithm and its parallel version. We observe that in the regime where the objective function is convex, both algorithms are stable and converge to the unique fixed point. Our analyses reveal interesting discordances between the two versions of the algorithm in the regime where the objective function is non-convex: the parallel version exhibits a periodic oscillatory behavior that is absent in the sequential version. Drawing intuition from the Markov chain Monte Carlo literature, we empirically show that a parameter expansion of the Ising model, popularly called the Edward–Sokal coupling, leads to an enlargement of the regime of convergence to the global optima.


Introduction
Variational Bayes (VB) is now a standard tool to approximate computationally intractable posterior densities. Traditionally this computational intractability has been circumvented using sampling techniques such as Markov chain Monte Carlo (MCMC). MCMC techniques can be computationally expensive for high dimensional and complex hierarchical Bayesian models, which are prolific in modern applications. VB methods, on the other hand, typically provide answers orders of magnitude faster, as they are based on optimization. Introductions to VB can be found in Chapter 10 of [1] and Chapter 33 of [2]. Excellent recent surveys can be found in [3,4].
The objective of VB is to find the best approximation to the posterior distribution from a more tractable class of distributions on the latent variables that is well-suited to the problem at hand. The best approximation is found by minimizing a divergence between the posterior distribution of interest and members of a computationally tractable class of distributions. The most popular choices for the discrepancy and the approximating class are the Kullback-Leibler (KL) divergence and the class of product distributions, respectively. This combination is popularly known as mean field variational inference, originating from mean field theory in physics [5]. Mean-field inference has percolated through a wide variety of disciplines, including statistical mechanics, electrical engineering, information theory, neuroscience, cognitive sciences [6] and, more recently, deep neural networks [7]. While computing the KL divergence is intractable for a large class of distributions, reframing the minimization problem as maximization of the evidence lower bound (ELBO) leads to efficient algorithms. In particular, for conditionally conjugate-exponential family models, the optimal distribution for mean field variational inference can be computed by iterating closed form updates. These updates form a coordinate ascent algorithm known as coordinate ascent variational inference (CAVI) [1].
Research into the theoretical properties of variational Bayes has exploded in the last few years. Recent theoretical work focuses on statistical risk bounds for variational estimates obtained from VB [8][9][10][11], asymptotic normality of VB posteriors [12] and extensions to model misspecification [8,13]. While much of the recent theoretical work focuses on statistical optimality guarantees, there has been less work studying the convergence of the CAVI algorithms employed in practice. Convergence of CAVI to the global optimum is known only in special cases that depend heavily on model structure: normal mixture models [14,15]; stochastic block models [16][17][18][19]; topic models [20]; and, under special restrictions of the parameter regime, Ising models [21,22]. The convergence properties of the CAVI algorithm still largely constitute an open problem.
The goal of this work is to suggest a general systematic framework for studying convergence properties of CAVI algorithms. By viewing CAVI as a discrete time dynamical system, we can leverage dynamical systems theory to analyze the convergence behavior of the algorithm and bifurcation theory to study the types of changes that solutions can undergo as the various parameters are varied. For the sake of concreteness, we focus on the 2D Ising model. While dynamical systems theory possesses the tools [23][24][25] necessary to analyze higher dimensional systems, they were mainly developed for non-sequential systems. The general theory for n-dimensional discrete dynamical systems depends on having the evolution function in the form x_{n+1} = F(x_n). Deriving this F is typically not possible for densely connected higher dimensional sequential systems. The 2D Ising model has the special property that both the sequential and parallel updates in the two-variable case can be written as two separate one-variable dynamical systems, allowing for a simplified analysis. Our contributions to the literature are as follows: We provide a complete classification of the dynamical properties of the traditional sequential-update CAVI algorithm, and of a parallelized version of the algorithm, using dynamical systems and bifurcation theory on the Ising model. Our findings show that the sequential CAVI algorithm and the parallelized version have different convergence properties. Additionally, we numerically investigate the convergence of the CAVI algorithm on the Edward-Sokal coupling, a generalization of the Ising model. Our findings suggest that couplings/parameter expansions may provide a powerful way of controlling the convergence behavior of the CAVI algorithm, beyond the immediate example considered here.

Mean-Field Variational Inference and the Coordinate Ascent Algorithm
In this section, we briefly introduce mean-field variational inference for a target distribution in the form of a Boltzmann distribution,

p(x) = exp(Ψ(x)) / Z,

with potential function Ψ, where Z denotes the intractable normalizing constant. The above representation encapsulates both posterior distributions that arise in Bayesian inference, where Ψ is the log-posterior up to constants, and probabilistic graphical models such as the Ising and Potts models. For instance, Ψ(x) = β ∑_{u∼v} J_{uv} x_u x_v + β ∑_u h_u x_u for the Ising model; see the next section for more details. Many of the complications in inference arise from the intractability of the normalizing constant Z, which is closely related to the free energy in probabilistic graphical models, and is known as the marginal likelihood or evidence in Bayesian statistics. Variational inference aims to mitigate this problem by using optimization to find the best approximation

q* = argmin_{q ∈ F} D(q || p)

to the target density p from a class F of variational distributions over the parameter vector x, where D(q || p) denotes the Kullback-Leibler (KL) divergence between q and p. The complexity of this optimization problem is largely determined by the choice of the variational family F. The objective function of the above optimization problem is intractable because it also involves the evidence Z. We can work around this issue by rewriting the KL divergence as

D(q || p) = E_q[log q(x)] − E_q[Ψ(x)] + log Z,

where E_q denotes the expectation with respect to q(x). Rearranging terms,

log Z = ELBO(q) + D(q || p) ≥ ELBO(q),  where ELBO(q) := E_q[Ψ(x)] − E_q[log q(x)].    (2)

The acronym ELBO stands for evidence lower bound, and the nomenclature is now apparent from the above inequality. Notice from Equation (2) that, since log Z does not depend on q, maximizing the ELBO is equivalent to minimizing the KL divergence. By maximizing the ELBO we can solve the original variational problem while bypassing the computational intractability of the evidence.
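The identity log Z = ELBO(q) + D(q || p) can be checked numerically by brute-force enumeration for a small discrete target. The following sketch is our own illustration (not code from the paper); it uses the two-spin Ising potential Ψ(x) = β x₁x₂ with the function names being our own:

```python
import itertools
import math

STATES = list(itertools.product([-1, 1], repeat=2))

def psi(x, beta):
    # Two-spin Ising potential with J12 = 1 and h = 0: Psi(x) = beta * x1 * x2.
    return beta * x[0] * x[1]

def log_Z(beta):
    # Exact evidence by enumeration of the four spin configurations.
    return math.log(sum(math.exp(psi(x, beta)) for x in STATES))

def q_prob(x, zeta, xi):
    # Mean-field product distribution with q1(x1 = 1) = zeta, q2(x2 = 1) = xi.
    return (zeta if x[0] == 1 else 1 - zeta) * (xi if x[1] == 1 else 1 - xi)

def elbo(zeta, xi, beta):
    # ELBO(q) = E_q[Psi(x)] - E_q[log q(x)].
    return sum(q_prob(x, zeta, xi) * (psi(x, beta) - math.log(q_prob(x, zeta, xi)))
               for x in STATES)

def kl(zeta, xi, beta):
    # D(q || p) with p(x) = exp(Psi(x)) / Z.
    return sum(q_prob(x, zeta, xi)
               * (math.log(q_prob(x, zeta, xi)) - psi(x, beta) + log_Z(beta))
               for x in STATES)
```

For any interior (ζ, ξ), log_Z(beta) equals elbo(zeta, xi, beta) + kl(zeta, xi, beta) up to floating point error, so maximizing the ELBO is the same as minimizing the KL divergence.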
As mentioned above, the choice of variational family controls both the complexity and the accuracy of the approximation. Using a more flexible family achieves a tighter lower bound, but at the cost of having to solve a more complex optimization problem. A popular choice of family that balances both flexibility and computability is the mean-field family. Mean-field variational inference refers to the situation when q is restricted to the product family of densities over the parameters,

F_MF = { q : q(x) = ∏_{j=1}^n q_j(x_j) }.

The coordinate ascent variational inference (CAVI) algorithm (refer to Algorithm 1) is a learning algorithm that optimizes the ELBO over the mean-field family F_MF. At each time step t ≥ 1, the CAVI algorithm iteratively updates the current mean field marginal distribution q_j^{(t)}(x_j) by maximizing the ELBO over that marginal while keeping the other marginals {q_ℓ^{(t)}(x_ℓ)}_{ℓ≠j} fixed at their current values.
The objective function ELBO(q_1 ⊗ · · · ⊗ q_n) is concave in each of its arguments individually (although it is rarely jointly concave), so these individual maximization problems have unique solutions. The optimal update for the jth mean field variational component of the model has the closed form

q_j^*(x_j) ∝ exp( E_{−j}[Ψ(x)] ),

where the expectation E_{−j} is taken with respect to the distribution ∏_{i≠j} q_i(x_i). Furthermore, the update step of the algorithm is monotone, as each step of the CAVI increases the objective function. For parametric models, the sequential update of the variational marginal distributions in the CAVI algorithm reduces to a sequential update of the variational parameters of these distributions. The CAVI updates for parametric models thus induce a discrete time dynamical system on the parameters, and convergence of the CAVI algorithm can be framed in terms of this induced discrete time dynamical system. As discussed before, the ELBO is generally a non-convex function, and hence the CAVI algorithm is only guaranteed to converge to a local optimum. It is also not clear how many local optima (or fixed points) the system has, nor whether the algorithm always settles on a single fixed point, diverges away from a fixed point, or cycles between multiple fixed points. These questions translate to questions about the existence and stability of fixed points of the induced dynamical system. We are also interested in how the behavior of the CAVI algorithm changes as the model parameters are varied; this translates to questions about the possible bifurcations of the induced dynamical system. In Section 3, we formally introduce the Ising model and its mean-field variational inference.

CAVI in Ising Model
We first briefly review the definition of an Ising model. The Ising model was first introduced as a model for magnetization in statistical physics, but has found many applications in other fields; see [26] and references therein. The Ising model is a probability distribution on the hypercube {±1}^n given by

p(x) = Z(β)^{−1} exp( β ∑_{u∼v} J_{uv} x_u x_v + β ∑_u h_u x_u ),    (6)

where the interaction matrix J is a symmetric real n × n matrix with zeros on the diagonal, h is a real n-vector that represents the external magnetic field, and β is the inverse temperature parameter. The model is said to be ferromagnetic if J_{uv} ≥ 0 for all u, v and anti-ferromagnetic if J_{uv} < 0 for all u, v. The normalizing constant, or partition function, of the Ising model is

Z(β) = ∑_{x ∈ {±1}^n} exp( β ∑_{u∼v} J_{uv} x_u x_v + β ∑_u h_u x_u ).

Refer to Chapter 31 of [2] for an excellent review of Ising models.
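For small n, the partition function can be computed exactly by summing over all 2^n configurations. The brute-force helper below is our own illustration (the function name is an assumption, not from the paper); it is exponential in n, but handy as exact ground truth in small experiments:

```python
import itertools
import math

def ising_log_partition(J, h, beta):
    """Brute-force log Z(beta) by summing over all 2^n spin configurations.

    J is a symmetric n x n matrix (list of lists) with zero diagonal, and h is a
    length-n external field vector. Only feasible for small n.
    """
    n = len(h)
    total = 0.0
    for x in itertools.product([-1, 1], repeat=n):
        # Each unordered pair {u, v} is counted once.
        pair = sum(J[u][v] * x[u] * x[v] for u in range(n) for v in range(u + 1, n))
        field = sum(h[u] * x[u] for u in range(n))
        total += math.exp(beta * (pair + field))
    return math.log(total)
```

For n = 2 with J_12 = 1 and h = 0, this reproduces the closed form Z(β) = 2e^β + 2e^{−β}.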

Mean Field Variational Inference in Ising Model
Here we provide a derivation of the CAVI update function for the Ising model, focusing on the two nodes (n = 2) case for simplicity and analytic tractability.
In this case, we have the Ising model on two spins with x = (x_1, x_2), interaction matrix J with off-diagonal term J_12, and external magnetic field h = (h_1, h_2) = (0, 0). From the general framework in Section 2, the CAVI updates are given by

q_1^*(x_1) ∝ exp( β J_12 x_1 E_{q_2}[x_2] ),  q_2^*(x_2) ∝ exp( β J_12 x_2 E_{q_1}[x_1] ).    (8)

Equivalently, the same updates are obtained by setting the gradient of the ELBO, viewed as a function of (x_1, x_2), equal to the (0, 0) vector. Illustrations of the ELBO and the gradient functions for various values of β are in Figures 1 and 2, respectively. Figure 1. A contour plot of the ELBO as a function of x_1 and x_2 for β = 0.7 (left) and β = 1.2 (right), together with the optimal update functions for x_1 (orange) and x_2 (blue) given in Equation (8). For β = 0.7 the ELBO is a convex function and has exactly one stationary point, the global maximum, at (0.5, 0.5). For β = 1.2 the ELBO is nonconvex and has three stationary points, at (0.5, 0.5), (0.17071, 0.17071) and (0.82928, 0.82928).

Figure 2.
A contour plot of the ELBO as a function of x_1 and x_2 for β = −0.7 (left) and β = −1.2 (right), together with the optimal update functions for x_1 (orange) and x_2 (blue) given in Equation (8). For β = −0.7 the ELBO is a convex function and has exactly one stationary point, the global maximum, at (0.5, 0.5). For β = −1.2 the ELBO is nonconvex and has three stationary points, at (0.5, 0.5), (0.17071, 0.82928) and (0.82928, 0.17071).
Since q_1^* and q_2^* are two-point distributions, it is sufficient to keep track of the mass assigned to 1. Let ζ_k (resp. ξ_k) denote the kth iterate of q_1(x_1 = 1) (resp. q_2(x_2 = 1)) from the CAVI algorithm. To succinctly represent these updates, define the logistic sigmoid function

σ(x, β) = 1 / (1 + e^{−βx}).    (7)

With this notation, we have, for any k ∈ Z_+,

ζ_{k+1} = σ(2ξ_k − 1, 2βJ_12),  ξ_{k+1} = σ(2ζ_{k+1} − 1, 2βJ_12).

Without loss of generality we henceforth set J_12 = 1. Under this choice the model is in the ferromagnetic regime for β > 0 and the anti-ferromagnetic regime for β < 0.
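These updates are straightforward to implement. The sketch below is our own illustration (function names are assumptions): it encodes the sigmoid of Equation (7) and one sequential CAVI sweep for the two-node model with J_12 = 1:

```python
import math

def sigmoid(x, beta):
    # Two-argument logistic sigmoid of Equation (7): sigma(x, beta) = 1 / (1 + exp(-beta * x)).
    return 1.0 / (1.0 + math.exp(-beta * x))

def sequential_cavi(zeta, xi, beta, n_sweeps):
    # Run n_sweeps sequential CAVI sweeps for the two-node Ising model (J12 = 1, h = 0).
    for _ in range(n_sweeps):
        zeta = sigmoid(2 * xi - 1, 2 * beta)   # update zeta from the current xi
        xi = sigmoid(2 * zeta - 1, 2 * beta)   # update xi from the *new* zeta
    return zeta, xi
```

Inside the Dobrushin regime (|β| < 1) the iterates converge to (0.5, 0.5) from any initialization; for β = 1.2 they converge to one of the two symmetric optima shown in Figure 1, depending on which side of 1/2 the initialization of ξ falls.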

Why the Ising Model: A Summary of Our Contributions
There are exactly two cases of the Ising model that have a full analytic solution for the free energy. They are (i) the one-dimensional line graph, solved by Ernst Ising in his thesis [27], and (ii) the two-dimensional case on the anisotropic square lattice with magnetic field h = 0, solved by [28]. Comparison with the mean field solution for the same models highlights the poor approximation quality of the mean field solution in low dimensions. To the best of our knowledge, there are no results in the literature detailing the properties of the mean field solution to the anti-ferromagnetic Ising model. Readers not familiar with the physics may wonder why this is the case. To explain this, there are two cases in the anti-ferromagnetic regime: one is equivalent to the ferromagnetic case, and in the other the mean field approximation is not a good approximation of the system. The first case occurs on a bipartite graph, where a transformation of variables makes the anti-ferromagnetic regime equivalent to the ferromagnetic one [29]. The other case can be seen on the triangle graph. Fixing the spin of one vertex as 1 and another as −1 leaves the third vertex geometrically frustrated: neither choice of spin lowers the energy level of the system, and the two configurations are equivalent [30]. In this case the mean field approximation gives a completely incorrect answer and does not merit further investigation from a qualitative point of view. The physics literature is primarily concerned with using the mean field solutions to the Ising model to estimate important physical constants of the systems. These constants are only meaningful when the mean field solution provides a good approximation to the behavior of the system in large dimensions. It is known, however, that under certain conditions the mean field approximation does indeed converge to the true free energy of the system as the dimension increases [21,31].
Our work is focused on providing a rigorous methodology to analyze the dynamics of the CAVI algorithm that can be applied to any model structure. All of the interesting behaviors exhibited by the CAVI algorithm fit into the classical mathematical framework of discrete dynamical systems and bifurcation theory. Specifically, we use the Ising model as a simple and yet rich example to illustrate the potential of dynamical systems theory to analyze CAVI updates for mean field variational inference. The bifurcation of the ferromagnetic Ising model at the boundary of the Dobrushin regime is known [2,26]; however, a rigorous proof in terms of dynamical systems theory is missing in the literature.
There are several features that make the CAVI algorithm on the Ising model a nontrivial example worth investigating. The optimization problem arising from mean field variational inference on the Ising model is, in general, non-convex [21]. However, it is straightforward to obtain sufficient conditions that guarantee the existence of a unique global optimum. One such condition is that the inverse temperature β lies inside the Dobrushin regime, |β| < 1 [21]. Inside the Dobrushin regime, the CAVI update equations form a contraction mapping, guaranteeing a unique global optimum [21]. Outside of this regime the behavior of the CAVI algorithm is nontrivial. The CAVI solution to the Ising model with zero external magnetic field exhibits multiple local optima outside of the Dobrushin regime [2].
Our contributions to the literature are as follows. We utilize tools from dynamical systems theory to rigorously classify the full behavior of the Ising model over the full parameter regime in dimension n = 2 for both the sequential and parallel versions of the CAVI algorithm. We show that the dynamical behavior of the sequential CAVI is not equivalent to the behavior of the parallel CAVI. Lastly, we derive a variational approximation to the Edward-Sokal parameter expansion of the Potts and Random Cluster models and numerically study its convergence behavior under the CAVI algorithm. Our numerical results reveal that the parameter expansion leads to an enlargement of the regime of convergence to the global optima. In particular, the Dobrushin regime is strictly contained in the expanded regime. This is compatible with the analogous results in the Markov chain literature. See the introduction of [32] for a well-written summary of Markov chain mixing in the Ising model.

Statistical Significance of Our Results
Although mean-field variational inference has been routinely used in applications [3] for computational efficiency, it may not yield statistically optimal estimators. A statistically optimal estimator should correctly recover the statistical properties of the true distribution. Ideally, we would like the estimate to recover the true mean and true covariance of the distribution. It is well known that mean-field variational inference produces estimators that underestimate the posterior covariance [14]. More recently, it was shown that the mean-field estimators for certain topic models and stochastic block models may not even be correlated with the true distribution [17,20]. For these reasons, it is important to see if the mean field estimators can at least recover the true mean for various β ∈ R.
Mean field inference approximates the joint probability mass function in (6) for n = 2 by a product of two distributions on {−1, 1}, in the sense of Kullback-Leibler divergence. As discussed in Section 3, minimizing this divergence is equivalent to maximizing an objective function, called the Evidence Lower Bound (ELBO). Our objective is to better understand the relation between the CAVI estimate and the global maximum of the ELBO for the model in (6) when n = 2 and h = 0. Ideally, we want the global maximum of the ELBO to be a statistically reliable estimate. To fix notation, let ⟨1, −1; p⟩ denote the distribution of 2 × Bernoulli(p) − 1, that is, the two-point distribution assigning mass p to 1 and mass 1 − p to −1. As the marginal distributions of (6) are both equal to ⟨1, −1; 0.5⟩, we want the ELBO to be maximized at this value. From an algorithmic perspective, we would like to ensure that the CAVI iterates converge to this global maximum. The synergy of these two phenomena leads to a successful variational inference method. We show in this article that both these conditions can be violated in a certain regime of the parameter space in the context of the Ising model on two nodes. Inside the Dobrushin regime (−1 ≤ β ≤ 1), the global optimum of the ELBO obtained from mean field inference occurs at (⟨1, −1; 0.5⟩, ⟨1, −1; 0.5⟩), which is qualitatively the optimal solution. In this regime, the CAVI system converges to this global optimum irrespective of where the system is initialized. Thus, in the Dobrushin regime, mean field inference yields the statistically optimal estimate. Additionally, the CAVI algorithm is stable and convergent at this value. Unfortunately, this property deteriorates outside of the Dobrushin regime. Outside of the regime, the global maxima occur at two symmetric points that are different from (⟨1, −1; 0.5⟩, ⟨1, −1; 0.5⟩). These two symmetric points are equivalent under label switching. For example, when β = 1.2 one of the optima is (⟨1, −1; 0.17071⟩, ⟨1, −1; 0.17071⟩) and the other is (⟨1, −1; 0.82928⟩, ⟨1, −1; 0.82928⟩).
Notice this second optimum is equivalent to the sign-swapped version (⟨−1, 1; 0.17071⟩, ⟨−1, 1; 0.17071⟩).
The original optimum (⟨1, −1; 0.5⟩, ⟨1, −1; 0.5⟩) is actually a local minimum of the ELBO outside the Dobrushin regime. We illustrate in our theory that the CAVI system returns one of the two global maxima of the objective function depending on the initialization of the algorithm. Although it is widely known that the statistical quality of mean field inference is poor outside the regime, we show in addition that the algorithm itself exhibits erratic behavior and may not converge to the global maximizer of the ELBO for all initializations. Interestingly, outside the Dobrushin regime, the statistically optimal solution (⟨1, −1; 0.5⟩, ⟨1, −1; 0.5⟩) is a repelling fixed point of the CAVI system. This means that as the system is iterated, the current value of the system is pushed away from (⟨1, −1; 0.5⟩, ⟨1, −1; 0.5⟩) and toward one of the global maxima.
A common technique to further improve computational time is the use of block updates in the CAVI algorithm, meaning that groups of parameters are updated simultaneously. We refer to this as the parallelized CAVI algorithm. This has been shown to work well in certain models [17], but has not been investigated in a general setting. However, it turns out that block updating in the Ising model can lead to new problematic behaviors. Outside the Dobrushin regime, block updates can exhibit non-convergence in the form of cycling: as the system updates, it eventually switches back and forth between two points that yield the same value of the objective function.
Parameter expansion (coupling) is another method for improving the convergence properties of algorithms. In the Markov chain theory for Ising models, it is well-known that mixing and convergence time are typically improved by using the Edward-Sokal coupling, a parameter expansion of the ferromagnetic Ising model [33]. Our preliminary investigation reveals that the convergence properties of the CAVI algorithm exhibit a similar phenomenon.

Main Results
In this section, we analyze the behavior of the dynamical systems that one can form using the CAVI update equations and show that the behaviors of the systems differ. Our results are heavily dependent on well-known techniques in dynamical systems. For readers unfamiliar with some of the technical terminology below, we have included a primer on the basics of dynamical systems in Appendix A.
Recall the system of sequential updates, which are the updates used in CAVI:

ζ_{k+1} = σ(2ξ_k − 1, 2β),  ξ_{k+1} = σ(2ζ_{k+1} − 1, 2β),    (9)

and the parallel updates:

ζ_{k+1} = σ(2ξ_k − 1, 2β),  ξ_{k+1} = σ(2ζ_k − 1, 2β).    (10)

We will show that these two systems are not topologically conjugate. We first state and prove some results on the dynamics of the sigmoid function (7). These results will be used as building blocks to study the dynamics of (9) and (10). Phase-change behavior of dynamical systems using the sigmoid and ReLU activation functions is known in the literature in the context of the generalization performance of deep neural networks [34,35]. In this section we present a complete proof of the bifurcation analysis of non-linear dynamical systems involving the sigmoid activation function, despite its connections with [34,35]. Our results in Section 5.1 provide a more complete picture of the behavior of the dynamics in all regimes and can be readily exploited to analyze the dynamics of (9) and (10).

Sigmoid Function Dynamics
In this section we provide a full classification of the dynamics of the following sigmoid map and its second iterate:

x_{k+1} = σ(2x_k − 1, 2β),    (11)

x_{k+1} = σ(2σ(2x_k − 1, 2β) − 1, 2β).    (12)

We could not find a formal proof of the full classification of the dynamics of the sigmoid function (or its second iterate) for all β ∈ R in the literature. Additionally, this analysis provides an introductory example that demonstrates the concepts and techniques of dynamical systems. We begin by using numerical techniques to determine the number of fixed points of the system and its possible periodic behavior. We then provide a formal proof of the full dynamical properties of (11) in Theorem 1 and of (12) in Theorem 2.
Using numerical techniques, we solve for the number of fixed points of the system. The number of fixed points of the map (11) depends on the magnitude of the parameter. For β > 0, there is no periodic behavior, so there are no additional fixed points of (12) that are not fixed points of (11).
For β < 0, we see periodic behavior in the system; there are fixed points of (12) that are not fixed points of (11). For β < −1, the map (11) has one fixed point at x* = 1/2 and a periodic 2-cycle {c_0, c_1}; the points c_0 and c_1 are fixed points of (12), and, because (12) is an even function of β, they are the same fixed points found in the β > 1 regime. Table 1 reports the values of the derivatives at the fixed point 1/2 for β = ±1. Table 1. Partial derivatives of (11) and (12) at the fixed point x* = 1/2 for parameter values β = ±1. The derivatives of the function (11) are denoted using σ and the derivatives of (12) are denoted using σ².
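The numerical observations above can be reproduced with a few lines of code. The sketch below is our own illustration (function names are assumptions): it iterates the map (11) and exhibits the global attractor for |β| < 1, the two extra stable fixed points for β > 1, and the stable 2-cycle for β < −1:

```python
import math

def g(x, beta):
    # The map of Equation (11): x -> sigma(2x - 1, 2 * beta).
    return 1.0 / (1.0 + math.exp(-2.0 * beta * (2.0 * x - 1.0)))

def orbit(x, beta, n):
    # Iterate the map n times starting from x.
    for _ in range(n):
        x = g(x, beta)
    return x
```

By the symmetry g(1 − x, β) = 1 − g(x, β) (a consequence of h = 0), the two stable fixed points for β > 1 satisfy c_0 + c_1 = 1, and for β < −1 the 2-cycle alternates between x and 1 − x.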
We now have enough information to provide a complete classification of the dynamics of the sigmoid function.
Theorem 1 (Dynamics of sigmoid function). Consider the discrete dynamical system generated by (11). The full dynamics of the system (11) are as follows. 1. For −1 < β < 1, the system has a single hyperbolic fixed point x* = 1/2 which is a global attractor, and there are no p-periodic points for p ≥ 2. 2. For β > 1, the system has one repelling hyperbolic fixed point x* = 1/2 and two stable hyperbolic fixed points c_0 and c_1. There are no p-periodic points for p ≥ 2. 3. For β < −1, the system has one unstable hyperbolic fixed point x* = 1/2, and a stable 2-cycle {c_0, c_1}. There are no p-periodic points for p > 2.
4. For |β| = 1, the system has one non-hyperbolic fixed point at x * = 1/2 which is asymptotically stable and attracting.
The system undergoes a period-doubling (PD) bifurcation at β = −1 and a pitchfork bifurcation at β = 1.
Proof. We will break the proof up into three parts. The first part of the proof is a linear stability analysis of the system, the second part is a stability analysis of the periodic points in the system, and the third part is an analysis of the bifurcations of the system. We begin with a linear stability analysis of the system at each fixed point. For β ≤ 1 the system has one fixed point x* = 1/2, and for β > 1 the system has three fixed points c_0, 1/2, c_1. Fixed point x* = 1/2: The Jacobian of the system at the fixed point is σ_x(2x* − 1, 2β). For β ≠ ±1 the fixed point x* = 1/2 is hyperbolic, and for β = ±1 the fixed point is non-hyperbolic. We classify the stability of the hyperbolic fixed point x* = 1/2 using Theorem A2. For |β| < 1 the fixed point x* = 1/2 is globally attracting, as |σ_x(2x* − 1, 2β)| < 1, and for |β| > 1 the fixed point x* = 1/2 is repelling, as |σ_x(2x* − 1, 2β)| > 1. For β = ±1 we invoke Theorem A3 to check the stability of the fixed point. At β = −1 we have σ_x(2x* − 1, 2β) = −1 and we need to check the Schwarzian derivative; the Schwarzian derivative is negative at x* = 1/2, so the fixed point is asymptotically stable. Fixed points c_0, c_1: These fixed points have the same behavior, so we have grouped them together in the analysis. When β > 1 there are two additional fixed points c_0, c_1 of the system; both are attracting fixed points by Theorem A2, as |σ_x(2c_i − 1, 2β)| < 1 for each i = 0, 1 and all β > 1. The stable sets are W^s(c_0) = [0, 1/2) and W^s(c_1) = (1/2, 1]. Periodic points: For β < −1 we see the 2-cycle C = {c_0, c_1}. Notice σ(2c_0 − 1, 2β) = c_1 and σ(2c_1 − 1, 2β) = c_0. This 2-cycle is stable since c_0 and c_1 are both stable fixed points of (12).
At (x*, β*) = (1/2, 1) the system undergoes a pitchfork bifurcation, as it satisfies the conditions in Theorem A5. Similarly, at (x*, β*) = (1/2, −1) the system undergoes a period-doubling bifurcation, as it satisfies the conditions in Theorem A4. We can fully classify the dynamics of (12) using the above theorem. We omit the proof, as it is similar to the proof of Theorem 1.
The system undergoes a pitchfork bifurcation at β = ±1. There are no p-periodic points for p ≥ 2.

Sequential Dynamics
To fully understand the dynamics of the equations defining the updates to q_1^* and q_2^*, it suffices to track the evolution of the points q_1^*(1) = ζ and q_2^*(1) = ξ. The CAVI algorithm updates the terms sequentially, using the new values of the variables to calculate the others. We initialize the CAVI algorithm at the points ζ_0, ξ_0. The CAVI algorithm is a dynamical system formed by sequential iterations of σ(2x − 1, 2β) starting from ζ_0, ξ_0. We can decouple the CAVI updates for ξ_k and ζ_k by looking at the second iterates. This decoupling is visualized in the diagram (14) below. The system formed by the sequential updates is equivalent to the following decoupled system:

ζ_1 = σ(2ξ_0 − 1, 2β),  ζ_{k+1} = σ²(ζ_k) for k ≥ 1,  ξ_{k+1} = σ²(ξ_k) for k ≥ 0,    (13)

where σ² denotes the second-iterate map in (12). We propose to investigate the dynamics of the sequential system (9) by studying the dynamics of the individual subsequences ζ_{k+1} and ξ_{k+1} of the decoupled system (13). The dynamical properties of the individual subsequences follow from a combination of Theorem 1, Theorem 2 and other methods from Appendix A.
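The decoupling can be verified numerically: iterating the sequential updates directly, and iterating the second-iterate map seeded by ξ_0 alone, produce identical trajectories. A minimal sketch (our own illustration, with an arbitrary initialization and β outside the Dobrushin regime):

```python
import math

def g(x, beta):
    # The sigmoid map of Equation (11).
    return 1.0 / (1.0 + math.exp(-2.0 * beta * (2.0 * x - 1.0)))

def g2(x, beta):
    # The second-iterate map of Equation (12): g composed with itself.
    return g(g(x, beta), beta)

beta, n_steps = 1.2, 10

# Direct sequential CAVI iterates (zeta_k, xi_k) for k = 1..n_steps.
direct = []
zeta, xi = 0.9, 0.4
for _ in range(n_steps):
    zeta = g(xi, beta)   # zeta_{k+1} uses the current xi
    xi = g(zeta, beta)   # xi_{k+1} uses the *new* zeta
    direct.append((zeta, xi))

# Decoupled form: zeta_k = g(xi_{k-1}) and xi_k = g2(xi_{k-1}); zeta_0 never enters.
decoupled = []
xi_prev = 0.4
for _ in range(n_steps):
    decoupled.append((g(xi_prev, beta), g2(xi_prev, beta)))
    xi_prev = g2(xi_prev, beta)
```

Because the decoupled recursion performs exactly the same floating point operations in the same order, the two trajectories agree bit for bit, confirming that ζ_0 plays no role in the sequential dynamics.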
Illustrations of the evolution of the dynamics of the sequential updates for various initializations and values of β are in Figures 3-6.

Parallel Updates
The system of parallel updates is defined by the one-step map F:

F(ζ, ξ) = ( σ(2ξ − 1, 2β), σ(2ζ − 1, 2β) ).    (15)

The dynamics of the parallel system are similar to the system studied in [36]. As we shall show below, the parallel system exhibits periodic behavior that the sequential system does not, and it follows as a corollary that the systems are not locally topologically conjugate.
The parallelized CAVI algorithm is a dynamical system formed by iterations of F defined in (15). We shall decouple the parallelized CAVI updates for the sequences ξ_k and ζ_k by looking at iterations of (12) acting on the sequences individually. This decoupling is visualized in diagram form, where each cross is an application of F. The system formed by the parallel updates is equivalent to the following decoupled systems of even subsequences and odd subsequences. The even subsequences are

ζ_{2(k+1)} = σ²(ζ_{2k}), starting from ζ_0,    (17)

ξ_{2(k+1)} = σ²(ξ_{2k}), starting from ξ_0.    (18)

The odd subsequences are

ζ_{2k+3} = σ²(ζ_{2k+1}), starting from ζ_1 = σ(2ξ_0 − 1, 2β),    (19)

ξ_{2k+3} = σ²(ξ_{2k+1}), starting from ξ_1 = σ(2ζ_0 − 1, 2β).    (20)

Following a similar approach to the one used to study the sequential dynamics, we investigate the dynamics of the parallel system (15) through these four subsequences. We now present the main result for the parallel dynamics. 1. For β < −1, the system has two locally asymptotically stable fixed points (c_0, c_1) and (c_1, c_0), and one unstable fixed point (1/2, 1/2), where {c_0, c_1} is the stable 2-cycle of (11). The stable sets are W^s(c_0, c_1) = W^s(c_0) × W^s(c_1) and W^s(c_1, c_0) = W^s(c_1) × W^s(c_0). 2. For −1 ≤ β ≤ 1, the system has a globally attracting fixed point (1/2, 1/2). 3. For β > 1, the system has two locally asymptotically stable fixed points (c_0, c_0) and (c_1, c_1), and one locally asymptotically unstable fixed point (1/2, 1/2), where c_0 and c_1 are the fixed points of (11).
The system has no p-periodic points for p > 2. The system undergoes a PD bifurcation at β = −1 and a pitchfork bifurcation at β = 1.
Proof. The dynamics of the system defined by F in (15) are equivalent to the dynamics of the system generated by the subsequences (17)-(20). The dynamics of each of these subsequences are governed by the maps (11) and (12). By Theorem 1, we have the behavior for each of the subsequences (17)-(20). For |β| < 1, (11) has a globally stable fixed point at x* = 1/2, and thus the only fixed point of the parallel system is (1/2, 1/2), which must be globally stable. For β = ±1, the fixed point x* = 1/2 is asymptotically stable by Theorem A3.
For β > 1, (11) bifurcates. We have the unstable fixed point x* = 1/2, and the two locally stable fixed points, c_0 with stable set W^s(c_0) = [0, 1/2), and c_1 with stable set W^s(c_1) = (1/2, 1]. Returning to the system generated by F, if we consider the initialization (ζ_0, ξ_0) = (c_0, c_0), then by the sequence construction of ζ_n, given in (17) and (19), we see that ζ_n = c_0 for n ≥ 1, as c_0 is a fixed point of (11) for β > 1. Similarly, using the sequence construction of ξ_n, given in (18) and (20), we see that ξ_n = c_0 for n ≥ 1, as c_0 is a fixed point of (11) for β > 1. Therefore, (c_0, c_0) is a fixed point. An analogous argument shows that (c_1, c_1) is also a fixed point. The parallel system has the stable fixed points (c_0, c_0) with stable set W^s(c_0, c_0) = W^s(c_0) × W^s(c_0) and (c_1, c_1) with stable set W^s(c_1, c_1) = W^s(c_1) × W^s(c_1). After the bifurcation at β = 1 the parallel system also contains 2-cycles. Using the sequence construction we see that the orbit of the initialization (ζ_0, ξ_0) = (c_0, c_1) alternates between (c_0, c_1) and (c_1, c_0), forming a 2-cycle. The dynamics of F lack any p-periodic points and cycles for p > 2 as a consequence of its construction from (12).
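The 2-cycle in the proof is easy to observe numerically. In the sketch below (our own illustration; function names are assumptions), the parallel map is iterated from an initialization whose coordinates lie on opposite sides of 1/2; after burn-in, the state swaps its two coordinates at every step:

```python
import math

def g(x, beta):
    # The sigmoid map of Equation (11).
    return 1.0 / (1.0 + math.exp(-2.0 * beta * (2.0 * x - 1.0)))

def parallel_step(zeta, xi, beta):
    # One parallel (block) CAVI update: both marginals use the previous values.
    return g(xi, beta), g(zeta, beta)

beta = 1.2
zeta, xi = 0.1, 0.9   # coordinates on opposite sides of 1/2
for _ in range(500):  # burn in to the attracting orbit
    zeta, xi = parallel_step(zeta, xi, beta)
```

From the same initialization, the sequential updates of (9) converge to a single fixed point, so this oscillation is a genuine difference between the two implementations.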
This completes the characterization of the dynamics of F for β ∈ R.
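The fixed-point structure in Theorem 4 can be checked numerically. The sketch below assumes the coordinate update function f(x) = 1/(1 + e^{−2β(2x−1)}), our reading of (11); this is an assumption, but it is consistent with the fixed-point values c_0 = 0.17071 and c_1 = 0.82928 quoted later in the section for β = 1.2.

```python
import math

def f(x, beta):
    # Assumed CAVI update for the two-node Ising model: a sigmoid in 2x - 1.
    return 1.0 / (1.0 + math.exp(-2.0 * beta * (2.0 * x - 1.0)))

def iterate(x0, beta, n=500):
    # Repeatedly apply the update to locate an attracting fixed point.
    x = x0
    for _ in range(n):
        x = f(x, beta)
    return x

# |beta| < 1: every initialization is driven to the unique fixed point 1/2.
print(iterate(0.9, beta=0.5))   # ~0.5

# beta > 1: 1/2 is repelling; initializations below/above 1/2 reach c_0/c_1.
print(iterate(0.3, beta=1.2))   # ~0.17071 (c_0)
print(iterate(0.7, beta=1.2))   # ~0.82928 (c_1)
```

The pitchfork structure at β = 1 is visible here: below the critical value the two branches collapse onto 1/2, above it the iterates split according to which side of 1/2 they start on.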

A Comparison of the Dynamics
We end the section by comparing the dynamical properties of the sequential system in Theorem 3 and the parallel system in Theorem 4. The main difference between the two is the presence of 2-cycles in the parallel system when |β| > 1. This behavior stems from the difference between the sequential and parallel implementations of CAVI. Looking closely at the update diagrams for the decoupled sequential system and the decoupled parallel system reveals the key difference that produces these 2-cycles: how the individual update sequences begin. Notice that ζ_0 plays no role in updating the sequential system, as both the ζ_k update sequence and the ξ_k update sequence depend only on the choice of ξ_0. Even after rewriting the sequential updates in terms of individual sequences, the system is not truly decoupled, as both sequences depend on a common starting point. This precisely prescribes the behavior that we see in the system relative to the sigmoid-function dynamics in Theorems 1 and 2. Compare this to the parallel system. Here ζ_0 is involved in updating both the odd subsequence ξ_{2k+1} and the even subsequence ζ_{2k}. Furthermore, ξ_0 remains involved by controlling the updates for the even subsequence ξ_{2k} and the odd subsequence ζ_{2k+1}. This additional flexibility allows the parallel system to develop periodic behavior outside of the Dobrushin regime (−1 ≤ β ≤ 1).
As an example, we compare initializations of the sequential and parallel algorithms for β = 1.2. We begin with the sequential algorithm. For β = 1.2, consider initializing the sequential system at (ζ_0, ξ_0) = (0.7, 0.3). The sequential updates are fully determined by ξ_0, so for ξ_0 = 0.3 it follows from Theorem 1 that an application of the function (11) gives ζ_1 ∈ W^s(c_0). At this point the system evolves by applying (12) to the individual sequences for ζ and ξ as given in (13), so the dynamics are now controlled by the function (12). From this initialization the system converges to the fixed point (c_0, c_0) = (0.17071, 0.17071), as shown in Figure 6.
Contrast this with the behavior of the parallel system, in which the updates are determined by both ξ_0 and ζ_0. For β = 1.2, consider initializing the parallel system at (ζ_0, ξ_0) = (0.7, 0.3). It follows from Theorem 1 that an application of the function (11) gives ζ_1 ∈ W^s(c_0) and ξ_1 ∈ W^s(c_1). Successive updates cause the sequences ζ_k and ξ_k to flip back and forth between the domains W^s(c_0) and W^s(c_1), until the system settles into the 2-cycle C_1 = {(c_0, c_1), (c_1, c_0)} = {(0.17071, 0.82928), (0.82928, 0.17071)}, as seen in Figure 10.
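This contrast can be reproduced in a few lines. The sketch below again assumes the update function f(x) = 1/(1 + e^{−2β(2x−1)}), an assumption consistent with the fixed-point values quoted above, and runs both schedules from (ζ_0, ξ_0) = (0.7, 0.3) at β = 1.2.

```python
import math

def f(x, beta=1.2):
    # Assumed two-node Ising CAVI update (sigmoid in 2x - 1).
    return 1.0 / (1.0 + math.exp(-2.0 * beta * (2.0 * x - 1.0)))

def sequential(zeta, xi, steps=100):
    for _ in range(steps):
        zeta = f(xi)              # zeta update uses the latest xi
        xi = f(zeta)              # xi update uses the freshly computed zeta
    return zeta, xi

def parallel(zeta, xi, steps=100):
    for _ in range(steps):
        zeta, xi = f(xi), f(zeta) # both updates use the previous state
    return zeta, xi

print(sequential(0.7, 0.3))          # ~(0.17071, 0.17071): converges to (c_0, c_0)
print(parallel(0.7, 0.3, steps=100)) # even step:  ~(0.82928, 0.17071)
print(parallel(0.7, 0.3, steps=99))  # odd step:   ~(0.17071, 0.82928)
```

The sequential iterates settle at (c_0, c_0), while the parallel iterates keep alternating between the two points of C_1 indefinitely.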
This simple example highlights the danger of naively parallelizing the CAVI algorithm. The convergence properties of a parallel version of the CAVI algorithm depend heavily on the model's CAVI update equations. In the case of the Ising model, we have demonstrated that in certain parameter regimes the parallel implementation can fail to converge because the algorithm depends on both ζ_0 and ξ_0.

Edward-Sokal Coupling
One method of improving convergence in Markov chains is the use of probabilistic couplings. The Edward-Sokal (ES) coupling is a coupling of two statistical physics models, the random cluster model and the Potts model (a generalization of the Ising model) [37]. Running a Markov chain on the ES coupling leads to improved mixing properties compared to the equivalent Potts and random cluster models [33]. Motivated by these findings in the Markov chain literature, we ask a similar question: can the convergence properties of mean-field VI be improved by using the ES coupling in place of the Ising model? In this section we investigate this idea numerically. We first introduce the Edward-Sokal coupling following [37]. We then introduce a variational family for the Edward-Sokal coupling and derive the variational updates for this model. Our findings suggest that the variational updates converge to a unique solution over a larger parameter range than the Dobrushin regime of the corresponding Ising measure.

Random Cluster Model
Let G = (V, E) be a finite graph, and let e = {x, y} ∈ E denote an edge in G with endpoints x, y ∈ V. Let Σ = {1, 2, . . . , q}^V, let Ω = {0, 1}^E, and let F denote the power set of Ω. The random cluster model is a two-parameter probability measure on (Ω, F), with an edge-weight parameter p ∈ [0, 1] and a cluster-weight parameter q ∈ {2, 3, . . .}, given by

φ_{p,q}(ω) = (1/Z_{RC}) ∏_{e ∈ E} p^{ω(e)} (1 − p)^{1 − ω(e)} q^{κ(ω)},

where κ(ω) denotes the number of connected components in the subgraph corresponding to ω. The partition function for the random cluster model is

Z_{RC} = Σ_{ω ∈ Ω} ∏_{e ∈ E} p^{ω(e)} (1 − p)^{1 − ω(e)} q^{κ(ω)}.

For q = 2 the random cluster model reduces to the Ising model on G. The Edward-Sokal coupling is a probability measure µ on Σ × Ω given by

µ(σ, ω) ∝ ∏_{e ∈ E} [(1 − p) δ_{ω(e),0} + p δ_{ω(e),1} δ_e(σ)],

where δ_{a,b} = 1(a = b) and δ_e(σ) = 1(σ_x = σ_y) for e = {x, y} ∈ E.
It is well known that in the special case p = 1 − e^{−β} and q = 2, the Σ-marginal of the ES coupling is the Ising model and the Ω-marginal is the random cluster model [37]. We are interested in how the convergence of the CAVI algorithm on the ES coupling compares to that of the CAVI algorithm on the Ising model.
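On two nodes joined by a single edge, the marginal identity can be verified by direct enumeration. The sketch below uses the standard ES weight ∏_e [(1 − p) δ_{ω(e),0} + p δ_{ω(e),1} δ_e(σ)] with q = 2 and p = 1 − e^{−β}, and compares the normalized Σ-marginal with the normalized Ising weights exp(β Σ_e δ_e(σ)).

```python
import itertools
import math

beta = 0.8
p = 1.0 - math.exp(-beta)   # Edward-Sokal parametrization
q = 2                       # q = 2 recovers the Ising model

# Two vertices, one edge e = (0, 1); sigma in {1,...,q}^2, omega in {0, 1}.
spins = list(itertools.product(range(1, q + 1), repeat=2))
weights = {}
for sigma in spins:
    delta_e = 1.0 if sigma[0] == sigma[1] else 0.0
    weights[(sigma, 0)] = 1.0 - p            # closed edge
    weights[(sigma, 1)] = p * delta_e        # open edge forces agreement
Z = sum(weights.values())

# Sigma-marginal of the ES coupling.
marginal = {s: (weights[(s, 0)] + weights[(s, 1)]) / Z for s in spins}

# Ising weights exp(beta * delta_e(sigma)), normalized.
ising = {s: math.exp(beta * (1.0 if s[0] == s[1] else 0.0)) for s in spins}
ZI = sum(ising.values())
ising = {s: v / ZI for s, v in ising.items()}

for s in spins:
    print(s, marginal[s], ising[s])   # the two columns agree
```

Algebraically, the agreement follows because (1 − p) + p δ_e(σ) = e^{−β} + (1 − e^{−β}) δ_e(σ) is proportional to e^{β δ_e(σ)}.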

VI Objective Function
To calculate the VI updates for each variable we make use of the alternative characterization of the ES coupling,

µ(σ, ω) ∝ ψ(σ) φ_{p,1}(ω) 1((σ, ω) ∈ F),

where ψ is the uniform measure on Σ, φ_{p,1}(ω) is a product measure on Ω, and F is the set of configurations (σ, ω) in which every open edge joins agreeing spins. The variational family that we optimize over is the corresponding product family with parameters (x_1, x_2, y), restricted by the indicator on F. We have added the indicator on the set F to eliminate the configurations (σ, ω) that are not well defined in the variational objective. We use the convention that 0 log(0) = 0.

VI Updates
Write the ELBO that corresponds to the variational family (24) as a function of (x_1, x_2, y, p). Taking the derivative of the ELBO with respect to x_1 and simplifying gives ELBO_1; taking the derivative with respect to x_2 gives ELBO_2; and taking the derivative with respect to y gives ELBO_y. The absence of closed-form updates for any of the variables limits our ability to study the convergence of the system with classical dynamical systems techniques. Instead, we look at the long-run behavior of the system by plotting 100 iterations of the CAVI updates, which are generated from the following system:

x_1(t + 1) = argmin_{z ∈ (0,1)} |ELBO_1(z, x_2(t), y(t), p)|,
x_2(t + 1) = argmin_{z ∈ (0,1)} |ELBO_2(x_1(t + 1), z, y(t), p)|,
y(t + 1) = argmin_{z ∈ (0,1)} |ELBO_y(x_1(t + 1), x_2(t + 1), z, p)|.
We compute the argmin over the free variable z by a line search with step size ∆ = 10^{−6}. Running these simulations, we find that the iterates x_1(t), x_2(t), y(t) converge to a global solution within about T = 20 time steps from any initialization x_1(0), x_2(0), y(0) ∈ (0, 1) and any β > 0. Thus, using the ES coupling, we obtain global convergence of the algorithm outside the Dobrushin regime of the corresponding paramagnetic Ising model. The figures depicting the simulated convergence of the variational inference algorithm on the Edward-Sokal coupling can be found below in Figures 13-16.
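Since the stationarity conditions have no closed form, each coordinate is updated by a grid line search that drives the corresponding derivative toward zero. A minimal sketch of that inner step is below; the callable `grad` is a hypothetical stand-in for the derivatives ELBO_1, ELBO_2, ELBO_y (which we do not reproduce here), and we use a coarser grid than the ∆ = 10^{−6} of the simulations to keep the example fast.

```python
import numpy as np

def line_search_argmin(grad, delta=1e-4):
    """Return the grid point z in (0, 1) minimizing |grad(z)|.

    grad: vectorized callable giving the partial derivative of the
    objective in the free coordinate, other coordinates held fixed.
    """
    zs = np.arange(delta, 1.0, delta)          # interior grid of (0, 1)
    return zs[np.argmin(np.abs(grad(zs)))]     # closest-to-zero derivative

# Hypothetical stand-in derivative with a root at z = 0.3.
z_star = line_search_argmin(lambda z: np.log(z / (1.0 - z)) - np.log(0.3 / 0.7))
print(z_star)   # ~0.3
```

One full CAVI sweep then calls this routine three times, once per coordinate, plugging the freshly updated values into the next derivative, exactly as in the update system above.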

Conclusions
This paper demonstrates the use of classical dynamical systems and bifurcation theory to study the convergence properties of the CAVI algorithm for the Ising model on two nodes. In this simple model we are able to provide the complete dynamical behavior of the algorithm. Interestingly, we find that the sequential CAVI algorithm and the parallelized CAVI algorithm are not topologically conjugate, owing to the presence of periodic behavior in the parallelized CAVI. This behavior originates from the added flexibility of the initialization in the parallelized CAVI compared to the sequential CAVI. The erratic behavior we see in the Ising model for |β| > 1 is due to a combination of the existence of multiple fixed points of the system's update function and the instability of these fixed points. In this parameter regime, the fixed point that produces the optimal solution, (0.5, 0.5), is repelling: unless we initialize the algorithm exactly at (0.5, 0.5), the CAVI system cannot converge to this point. The other two suboptimal fixed points are both asymptotically stable. This suggests that the main problem the CAVI algorithm experiences centers on the existence of multiple fixed points. Recent work on stochastic block models (SBM) and topic models (TM) shows that mean field VI leads to suboptimal estimators [17-20]. It is not clear whether this property comes from mean field variational inference's construction using product distributions or whether it is a consequence of structure among the latent variables. A minor difference between the SBM or TM and the Ising model is that the former contain parameters (e.g., the cluster labels) that are identifiable only up to permutations. That said, in the SBM or TM, if the cluster means are not well separated, then it is not possible to identify the labels even up to permutations.
This is somewhat related to having multiple fixed points of the objective function and we conjecture similar behavior to what we have found in the Ising model will be exhibited in the SBM or TM outside the Dobrushin regime. Interestingly, a close look at the BCAVI updates in [17,18] reveals a similar sigmoid update function 1/(1 + e −x ). Applying the tools and techniques from dynamical systems theory to study the CAVI algorithm in the SBM, TM and other models will provide a better understanding of the issues that come with using mean field variational inference and is important to developing better variational inference techniques.
Most of the research into the theoretical properties of variational inference has focused on the mean field family due to its computational simplicity. This computational simplicity comes at the cost of limited expressive power. Can we make do with this limited expressive power in practical applications? More specifically, is there an equivalent of the Dobrushin regime (−1 ≤ β ≤ 1) for other similar models, like the SBM and TM, inside which CAVI produces statistically optimal estimators? The answer to this question would provide researchers with stable parameter regimes for the model. The non-existence of such a region would indicate the need for more expressive variational methods beyond mean field. Recent work [19,20] suggests that adding some structure to the algorithms may fix the problems that arise from mean field VI. How much structure is needed to recover statistically optimal estimators? Could adding a simple pairwise-dependence structure to mean field VI in the Ising model, similar to [19], be enough to recover the optimal estimator outside of the Dobrushin regime? Is the amount of additional structure that is needed somehow related to the latent structure of the models? Tools from dynamical systems theory can be used to study these questions.
Using dynamical systems to study the convergence properties of the CAVI algorithm is not without its challenges. While dynamical systems theory can provide answers to many of the above questions, applying these tools to higher-dimensional sequential systems is a challenging problem. As mentioned previously, the general theory for n-dimensional discrete dynamical systems depends on writing the evolution function in the form x_{n+1} = F(x_n). Deriving this F is typically not possible for densely connected, higher-dimensional sequential systems like the n-dimensional Ising model CAVI. This is not the only challenging aspect of the problem. These systems typically possess multiple fixed points, which can often only be found numerically, and multiple fixed points lead to more complicated partitions of the space into domains of attraction. Furthermore, higher-dimensional systems can possess bifurcations of multiple codimensions, which are significantly more difficult to study; bifurcations of codimension 3 are so exotic that they are not well studied [23,24], and software to handle such calculations has only recently been developed [24]. In practical terms this means that the convergence properties can only be studied numerically for models with a small number of parameters. Moreover, most numerical techniques assume differentiability of the evolution operator and fail to apply to many systems of practical interest in statistics, such as the Edward-Sokal CAVI. Applying tools from dynamical systems to the study of variational inference algorithms will require developing new theory for high-dimensional, densely connected sequential dynamical systems.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A.1. Stability of Fixed Points
A fixed point x^* of f is locally stable if, for every ε > 0, there exists δ > 0 such that |x − x^*| < δ implies |f^n(x) − x^*| < ε for all n ≥ 0; it is locally unstable otherwise. A fixed point x^* is locally attracting if all points in a small neighborhood converge to the fixed point as we let the system evolve; formally, if there exists an η > 0 such that |x − x^*| < η implies f^n(x) → x^* as n → ∞. A fixed point x^* is locally asymptotically stable if it is both locally stable and attracting. A fixed point x^* is locally semi-asymptotically stable from the right if it is both locally semi-stable from the right and lim_n f^n(x) = x^* for 0 < x − x^* < η for some η > 0. It is globally asymptotically stable if the point is attracting for all x in the state space.
A cycle is a periodic orbit of distinct points C = {x_0, x_1, . . . , x_{K−1}} with x_0 = f(x_{K−1}) for some K > 0; the minimal K generating the cycle is called the period of the cycle. A subset S ⊂ R^n is called invariant if f^k(S) ⊂ S for all k ∈ Z. An invariant set S is called asymptotically stable if there exists a neighborhood U of S such that any point in U eventually enters S. The stable set of S ⊂ R^n is W^s(S) = {x ∈ R^n : lim_{k→∞} f^k(x) ∈ S}. If f is invertible, we define the unstable set of S ⊂ R^n as W^u(S) = {x ∈ R^n : lim_{k→∞} f^{−k}(x) ∈ S}. The unstable set of S for the forward system f^k, k > 0, is the stable set of S for the backward system f^{−k}, k > 0; it is thus possible to study the behavior of points that diverge by studying points that converge under the inverse map. We can also classify the stability of K-cycles: the stability of a cycle is classified by treating it as a fixed point of the map f^K.
Consider a discrete-time dynamical system defined by a diffeomorphism f : R × R → R, with state x and parameter α,

x → f(x, α), x ∈ R, α ∈ R. (A1)

Let x^* be a fixed point of f(·, α) and consider a nearby point x with |x − x^*| = ε. Taking a Taylor expansion of the system about the fixed point gives

f(x, α) = x^* + f_x(x^*, α)(x − x^*) + O(|x − x^*|^2).

If f_x(x^*, α) does not have modulus one and ε is small enough, then the contribution of the O(|x − x^*|^2) terms is negligible, in which case the behavior of the system is governed by the behavior of the linearization f_x(x^*, α). We now introduce the idea of a hyperbolic fixed point. Assume that the Jacobian A := f_x(x^*, α) of the system (A1) at a fixed point x^* is non-singular. The fixed point x^* is called hyperbolic if |f_x(x^*, α)| ≠ 1 and non-hyperbolic if |f_x(x^*, α)| = 1. The notion of hyperbolic and non-hyperbolic fixed points generalizes to higher dimensions, where it involves the eigenvalues of the Jacobian; see [23,25,38] for more details.
Near a hyperbolic fixed point, a non-linear dynamical system behaves like its first-order Taylor approximation (also known as the linearization of the system). To make this argument rigorous we need to discuss what it means for two dynamical systems to be equivalent. Two systems are topologically equivalent if we can map orbits of one system to orbits of the other in a continuous way that preserves the order of time. The dynamical system (A1) is called topologically equivalent to the system

y → g(y, β), y ∈ R^n, β ∈ R^p, (A2)

if there exists a homeomorphism of the parameter space h_p : R^p → R^p, β = h_p(α), and a parameter-dependent state-space homeomorphism h : R^n × R^p → R^n, continuous in the first argument, with y = h(x, α), mapping orbits of the system (A1) at parameter value α onto orbits of the system (A2) at parameter β = h_p(α), preserving the direction of time. If h is a diffeomorphism, the systems are called smoothly equivalent. Let (A1) and (A2) be two topologically equivalent invertible dynamical systems, and consider the orbit orb(x; f, α) of the system under the mapping f(x, α) and the orbit orb(y; g, β) of the system g(y, β). Topological equivalence means that the homeomorphism (h(x, α), h_p(α)) maps orb(x; f, α) to orb(y; g, β) preserving the order of time, which gives a commutative diagram relating f, g, and h. The Schwarzian derivative controls the higher-order behavior in oscillatory systems.
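As a concrete illustration of topological (indeed smooth) equivalence at a fixed parameter, the full logistic map f(x) = 4x(1 − x) is conjugate to the tent map T through the homeomorphism h(y) = sin²(πy/2): the identity h(T(y)) = f(h(y)) is exactly the commutativity of the diagram above. This standard example is our addition, and it can be checked numerically:

```python
import numpy as np

def f(x):
    return 4.0 * x * (1.0 - x)                           # logistic map at mu = 4

def T(y):
    return np.where(y <= 0.5, 2.0 * y, 2.0 * (1.0 - y))  # tent map

def h(y):
    return np.sin(np.pi * y / 2.0) ** 2                  # conjugating homeomorphism

y = np.linspace(0.0, 1.0, 10001)
err = np.max(np.abs(h(T(y)) - f(h(y))))   # h maps tent orbits onto logistic orbits
print(err)   # ~machine precision: the diagram commutes
```

The identity follows from f(h(y)) = 4 sin²(πy/2) cos²(πy/2) = sin²(πy), which equals h(T(y)) on both halves of [0, 1].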

Appendix A.3. Codimension 1 Bifurcations
Until now we have kept the parameter of the system fixed. The study of the change in behavior of a dynamical system as its parameters are varied is called bifurcation theory. A bifurcation occurs when the dynamics of the system at a parameter value α_1 differ from the dynamics at a different parameter value α_2. Changing the parameter in a system may cause a stable fixed point to become unstable, a fixed point to split into multiple fixed points, or a new orbit to form. Each of these is an example of a bifurcation, although they are not the only possibilities. The point at which a bifurcation occurs is called a bifurcation point. More formally, the parameter α^* is called a bifurcation point if arbitrarily close to it there is an α such that x → f(x, α), x ∈ R^n, is not topologically equivalent to x → f(x, α^*), x ∈ R^n, in some domain U ⊂ R^n.
A necessary, but not sufficient, condition for a fixed point to bifurcate is that the fixed point be non-hyperbolic. Theorem A1, together with the implicit function theorem, shows that in a sufficiently small neighborhood of a hyperbolic fixed point (x^*, α^*), for each α there is a unique fixed point with the same stability properties as (x^*, α^*); hyperbolic fixed points therefore do not undergo local bifurcations. In the context of discrete systems, a local bifurcation can occur at a fixed point (x^*, α^*) only when the Jacobian of the system at (x^*, α^*) has an eigenvalue with modulus one.
Perhaps surprisingly, there are only three types of generic bifurcations that can happen in a discrete system with one parameter: the limit point (LP), period-doubling (PD), and Neimark-Sacker (NS) bifurcations. The reason is fairly simple. For each of these bifurcations there is a generic system, called the topological normal form, which undergoes the bifurcation at the origin in the (x, α)-plane, and for any other system that undergoes the same bifurcation and satisfies certain non-degeneracy conditions, there is a local change of coordinates that transforms the system into the topological normal form.
In general, the types of bifurcations that can occur are connected to the number of parameters in the system. The minimal number of parameters that must be varied for a particular bifurcation to occur in f(x, α) is called the codimension of the bifurcation. A bifurcation is called local if it can be detected in any small neighborhood of the fixed point; otherwise it is called global. Global bifurcations are much harder to analyze, and since we do not investigate them in this paper, we will not expand on them further. More detailed results on bifurcations of codimension 1 and 2 can be found in [23,24].
We now formally state sufficient conditions for a system to undergo a period-doubling or a pitchfork bifurcation. The period-doubling bifurcation occurs when a system with a non-hyperbolic fixed point with multiplier λ_1 = −1 satisfies certain non-degeneracy conditions. There are two types of PD bifurcations. In the super-critical case, a stable 2-cycle is generated when a fixed point becomes unstable. In the sub-critical case, a stable fixed point turns unstable when it coalesces with an unstable 2-cycle. (The same holds for a general k-cycle: in the super-critical case, a stable 2k-cycle is generated when a k-cycle becomes unstable; in the sub-critical case, a stable k-cycle turns unstable when it coalesces with an unstable 2k-cycle.) The conditions for a PD bifurcation are as follows.

Theorem A4 (Period-Doubling Bifurcation). Suppose that a one-dimensional system x → f(x, α), x, α ∈ R, with smooth f, has at α = 0 the fixed point x^* = 0, and let λ = f_x(0, 0) = −1. Assume the following non-degeneracy conditions are satisfied:
1. (1/2)(f_xx(0, 0))^2 + (1/3) f_xxx(0, 0) ≠ 0;
2. f_xα(0, 0) ≠ 0.
Then there are smooth invertible coordinate and parameter changes transforming the system into

η → −(1 + β)η ± η^3 + O(η^4).

A classical example of a period-doubling bifurcation can be seen in the logistic map f(x, µ) = µx(1 − x) for x ∈ [0, 1]. The bifurcation occurs at the point (x^*, µ^*) = (2/3, 3). The logistic map has two fixed points, one at x = 0 and the other at x = (µ − 1)/µ. We ignore the fixed point at x = 0, since it is repelling for µ > 1, and look at the behavior of the system in a small neighborhood of µ^* = 3. For µ = 2.9, the fixed point x^* = (µ − 1)/µ is a hyperbolic attracting fixed point, since |f_x(x^*, 2.9)| = |2 − µ| < 1. For µ = 3, the fixed point x^* = (µ − 1)/µ is non-hyperbolic, since f_x(x^*, 3) = 2 − µ = −1; checking the Schwarzian derivative shows that this fixed point is asymptotically stable.
For µ = 3.1, x^* = (µ − 1)/µ becomes a repelling fixed point, and the points in (0, x^*) ∪ (x^*, 1) converge to the attracting 2-cycle C = {0.558014, 0.7645665}. A super-critical period-doubling bifurcation has occurred in the system formed by the logistic map: as the parameter µ increases, a stable fixed point degenerates and a stable 2-cycle is formed.
Figure A1. The above plots are cobweb diagrams for the logistic map f(x, µ) = µx(1 − x), x ∈ [0, 1], with parameters µ = 2.9, µ = 3, and µ = 3.1, respectively. For µ = 2.9 the system has one stable fixed point x^* = (µ − 1)/µ. For µ = 3 the system has one non-hyperbolic fixed point x^* = (µ − 1)/µ, which is asymptotically stable and attracting; the plot was not iterated long enough to see convergence. For µ = 3.1 the system has a hyperbolic repelling fixed point x^* = (µ − 1)/µ and an asymptotically stable attracting 2-cycle C = {0.558014, 0.7645665}.
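The transition described in Figure A1 is easy to reproduce without cobweb plots: iterating the logistic map just below and just above µ = 3 shows the fixed point giving way to the attracting 2-cycle.

```python
def logistic(x, mu):
    return mu * x * (1.0 - x)

def orbit_tail(mu, x0=0.2, burn=2000, keep=2):
    # Discard a long transient, then record the next `keep` iterates.
    x = x0
    for _ in range(burn):
        x = logistic(x, mu)
    tail = []
    for _ in range(keep):
        x = logistic(x, mu)
        tail.append(x)
    return tail

print(orbit_tail(2.9))  # both entries ~ (mu-1)/mu = 0.6551...: stable fixed point
print(orbit_tail(3.1))  # alternates between ~0.558014 and ~0.7645665: the 2-cycle
```

At µ = 2.9 the tail repeats a single value; at µ = 3.1 consecutive iterates alternate between the two points of C, the signature of the super-critical PD bifurcation.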
The second iterate of a map that undergoes a PD bifurcation undergoes a bifurcation known as the pitchfork bifurcation. A system undergoes a super-critical pitchfork bifurcation when a stable fixed point becomes unstable and two stable fixed points appear in the system. A system undergoes a sub-critical pitchfork bifurcation when two stable fixed points coalesce with an unstable fixed point, and the unstable fixed point becomes stable as the parameter crosses the bifurcation point. Below we present extra details on the period-doubling bifurcation and its relation to the pitchfork bifurcation.
Consider the one-dimensional system x → f(x, α) = −(1 + α)x + x^3. The map f(x, α) is invertible in a small neighborhood of (0, 0). The system has a fixed point at x^* = 0 for all α, with eigenvalue −(1 + α). For small α < 0 the fixed point is hyperbolic stable, and for small α > 0 it is hyperbolic unstable. For α = 0 the fixed point is non-hyperbolic but asymptotically stable.
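Taking the standard period-doubling normal form x → −(1 + α)x + x^3 as a concrete instance of such a system (the cubic term is our choice of sign), the three regimes can be observed directly: geometric convergence to 0 for small α < 0, slow but sure convergence at the non-hyperbolic value α = 0, and, for α > 0, an attracting 2-cycle at ±√α born from the period doubling.

```python
def f(x, alpha):
    # Period-doubling normal form; the +x^3 cubic term is our assumption.
    return -(1.0 + alpha) * x + x ** 3

def iterate(x0, alpha, n):
    x = x0
    for _ in range(n):
        x = f(x, alpha)
    return x

print(abs(iterate(0.3, -0.1, 200)))    # ~0: hyperbolic stable, geometric rate 0.9
print(abs(iterate(0.3, 0.0, 10000)))   # small but nonzero: non-hyperbolic, slow decay
print(iterate(0.2, 0.1, 2000),
      iterate(0.2, 0.1, 2001))         # ~ +sqrt(0.1), -sqrt(0.1): attracting 2-cycle
```

The 2-cycle location follows from f(c, α) = −c at c² = α, and its multiplier f_x(c)f_x(−c) = (2α − 1)² < 1 confirms it is attracting for small α > 0.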