Article

Dynamics of Coordinate Ascent Variational Inference: A Case Study in 2D Ising Models

Department of Statistics, Texas A&M University, College Station, TX 77843, USA
* Author to whom correspondence should be addressed.
Entropy 2020, 22(11), 1263; https://doi.org/10.3390/e22111263
Submission received: 3 September 2020 / Revised: 26 October 2020 / Accepted: 3 November 2020 / Published: 6 November 2020
(This article belongs to the Special Issue Approximate Bayesian Inference)

Abstract:
Variational algorithms have gained prominence over the past two decades as a scalable computational environment for Bayesian inference. In this article, we explore tools from the dynamical systems literature to study the convergence of coordinate ascent algorithms for mean field variational inference. Focusing on the Ising model defined on two nodes, we fully characterize the dynamics of the sequential coordinate ascent algorithm and its parallel version. We observe that in the regime where the objective function is convex, both algorithms are stable and exhibit convergence to the unique fixed point. Our analyses reveal interesting discordances between these two versions of the algorithm in the regime where the objective function is non-convex. In fact, the parallel version exhibits a periodic oscillatory behavior which is absent in the sequential version. Drawing intuition from the Markov chain Monte Carlo literature, we empirically show that a parameter expansion of the Ising model, popularly called the Edward–Sokal coupling, leads to an enlargement of the regime of convergence to the global optima.

1. Introduction

Variational Bayes (VB) is now a standard tool to approximate computationally intractable posterior densities. Traditionally this computational intractability has been circumvented using sampling techniques such as Markov chain Monte Carlo (MCMC). MCMC techniques tend to be computationally expensive for high dimensional and complex hierarchical Bayesian models, which are prolific in modern applications. VB methods, on the other hand, typically provide answers orders of magnitude faster, as they are based on optimization. Introductions to VB can be found in Chapter 10 of [1] and Chapter 33 of [2]. Excellent recent surveys can be found in [3,4].
The objective of VB is to find the best approximation to the posterior distribution from a more tractable class of distributions on the latent variables that is well-suited to the problem at hand. The best approximation is found by minimizing a divergence between the posterior distribution of interest and a class of distributions that are computationally tractable. The most popular choices for the discrepancy and the approximating class are the Kullback–Leibler (KL) divergence and the class of product distributions, respectively. This combination is popularly known as mean field variational inference, originating from mean field theory in physics [5]. Mean-field inference has percolated through a wide variety of disciplines, including statistical mechanics, electrical engineering, information theory, neuroscience, cognitive sciences [6] and, more recently, deep neural networks [7]. While computing the KL divergence is intractable for a large class of distributions, reframing the minimization problem as maximization of the evidence lower bound (ELBO) leads to efficient algorithms. In particular, for conditionally conjugate-exponential family models, the optimal distribution for mean field variational inference can be computed by iterating closed form updates. These updates form a coordinate ascent algorithm known as coordinate ascent variational inference (CAVI) [1].
Research into the theoretical properties of variational Bayes has exploded in the last few years. Recent theoretical work focuses on statistical risk bounds for variational estimates obtained from VB [8,9,10,11], asymptotic normality of VB posteriors [12] and extensions to model misspecification [8,13]. While much of the recent theoretical work focuses on statistical optimality guarantees, there has been less work studying the convergence of the CAVI algorithms employed in practice. Convergence of CAVI to the global optimum is known only in special cases that depend heavily on model structure: normal mixture models [14,15]; stochastic block models [16,17,18,19]; topic models [20]; and, under special restrictions of the parameter regime, Ising models [21,22]. The convergence properties of the CAVI algorithm still largely constitute an open problem.
The goal of this work is to suggest a general systematic framework for studying convergence properties of CAVI algorithms. By viewing CAVI as a discrete time dynamical system, we can leverage dynamical systems theory to analyze the convergence behavior of the algorithm and bifurcation theory to study the types of changes that solutions can undergo as the various parameters are varied. For the sake of concreteness, we focus on the 2D Ising model. While dynamical systems theory possesses the tools [23,24,25] necessary to analyze higher dimensional systems, they were mainly developed for non-sequential systems. The general theory for $n$-dimensional discrete dynamical systems depends on having the evolution function in the form $x_{n+1} = F(x_n)$. Deriving this $F$ is typically not possible for densely connected higher dimensional sequential systems. The 2D Ising model has the special property that both the sequential and parallel updates in the two-variable case can be written as two separate one-variable dynamical systems, allowing for a simplified analysis. Our contributions to the literature are as follows: we provide a complete classification of the dynamical properties of the traditional sequential-update CAVI algorithm, and of a parallelized version of the algorithm, using dynamical systems and bifurcation theory on the Ising model. Our findings show that the sequential CAVI algorithm and the parallelized version have different convergence properties. Additionally, we numerically investigate the convergence of the CAVI algorithm on the Edward–Sokal coupling, a generalization of the Ising model. Our findings suggest that couplings/parameter expansions may provide a powerful way of controlling the convergence behavior of the CAVI algorithm, beyond the immediate example considered here.

2. Mean-Field Variational Inference and the Coordinate Ascent Algorithm

In this section, we briefly introduce mean-field variational inference for a target distribution in the form of a Boltzmann distribution with potential function Ψ ,
$$p(x) = \frac{\exp\{\Psi(x)\}}{Z}, \qquad x \in \mathcal{X},$$
where $Z$ denotes the intractable normalizing constant. The above representation encapsulates both posterior distributions that arise in Bayesian inference, where $\Psi$ is the log-posterior up to constants, and probabilistic graphical models such as the Ising and Potts models. For instance, $\Psi(x) = \beta \sum_{u \sim v} J_{uv} x_u x_v + \beta \sum_u h_u x_u$ for the Ising model; see the next section for more details. Many of the complications in inference arise from the intractability of the normalizing constant $Z$, which is commonly referred to as the free energy in probabilistic graphical models, and the marginal likelihood or evidence in Bayesian statistics. Variational inference aims to mitigate this problem by using optimization to find the best approximation $q^*$ to the target density $p$ from a class $\mathcal{F}$ of variational distributions over the parameter vector $x$,
$$q^* = \operatorname*{arg\,min}_{q \in \mathcal{F}} D(q\,\|\,p),$$
where $D(q\,\|\,p)$ denotes the Kullback–Leibler (KL) divergence between $q$ and $p$. The complexity of this optimization problem is largely determined by the choice of variational family $\mathcal{F}$. The objective function of the above optimization problem is intractable because it also involves the evidence $Z$. We can work around this issue by rewriting the KL divergence as
$$D(q\,\|\,p) = \mathbb{E}_q[\log q] - \mathbb{E}_q[\Psi] + \log Z,$$
where $\mathbb{E}_q$ denotes the expectation with respect to $q(x)$. Rearranging terms,
$$\log Z = D(q\,\|\,p) + \mathbb{E}_q[\Psi] - \mathbb{E}_q[\log q] \geq \mathbb{E}_q[\Psi] - \mathbb{E}_q[\log q] =: \mathrm{ELBO}(q).$$
The acronym ELBO stands for evidence lower bound, and the nomenclature is now apparent from the above inequality. Notice from Equation (2) that maximizing the ELBO is equivalent to minimizing the KL divergence. By maximizing the ELBO we can solve the original variational problem while bypassing the computational intractability of the evidence.
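Numerically, the identity $\log Z = D(q\,\|\,p) + \mathrm{ELBO}(q)$ is easy to sanity-check on a tiny state space. The snippet below is our own illustration (the potential and the variational parameters are arbitrary choices), not code from the paper.

```python
import numpy as np
from itertools import product

# Check log Z = D(q || p) + ELBO(q) on the two-spin state space {-1, 1}^2.
def psi(x1, x2, beta=0.8):
    return beta * x1 * x2  # an Ising-type potential with J12 = 1, h = 0

states = list(product([-1, 1], repeat=2))
Z = sum(np.exp(psi(*s)) for s in states)

# A product (mean field) distribution q(x) = q1(x1) q2(x2);
# zeta and xi are the probabilities that x1 = 1 and x2 = 1.
zeta, xi = 0.3, 0.6
def q(x1, x2):
    return (zeta if x1 == 1 else 1 - zeta) * (xi if x2 == 1 else 1 - xi)

kl = sum(q(*s) * np.log(q(*s) * Z / np.exp(psi(*s))) for s in states)
elbo = sum(q(*s) * (psi(*s) - np.log(q(*s))) for s in states)
print(np.log(Z), kl + elbo)  # the two numbers agree
```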
As mentioned above, the choice of variational family controls both the complexity and accuracy of approximation. Using a more flexible family achieves a tighter lower bound but at the cost of having to solve a more complex optimization problem. A popular choice of family that balances both flexibility and computability is the mean-field family. Mean-field variational inference refers to the situation when q is restricted to the product family of densities over the parameters,
$$\mathcal{F}_{\mathrm{MF}} := \Big\{ q(x) = q_1(x_1) \cdots q_n(x_n) \ \text{for probability measures } q_j,\ j = 1, \ldots, n \Big\}.$$
The coordinate ascent variational inference (CAVI) algorithm (refer to Algorithm 1) is a learning algorithm that optimizes the ELBO over the mean-field family $\mathcal{F}_{\mathrm{MF}}$. At each time step $t \geq 1$, the CAVI algorithm iteratively updates the current mean field marginal distribution $q_j^{(t)}(x_j)$ by maximizing the ELBO over that marginal while keeping the other marginals $\{q_\ell^{(t)}(x_\ell)\}_{\ell \neq j}$ fixed at their current values. Formally, we update the current distribution $q^{(t)}(x)$ to $q^{(t+1)}(x)$ by the updates
$$\begin{aligned} q_1^{(t+1)}(x_1) &= \operatorname*{arg\,max}_{q_1} \mathrm{ELBO}\big(q_1\, q_2^{(t)} \cdots q_n^{(t)}\big) \\ q_2^{(t+1)}(x_2) &= \operatorname*{arg\,max}_{q_2} \mathrm{ELBO}\big(q_1^{(t+1)}\, q_2\, q_3^{(t)} \cdots q_n^{(t)}\big) \\ &\ \ \vdots \\ q_n^{(t+1)}(x_n) &= \operatorname*{arg\,max}_{q_n} \mathrm{ELBO}\big(q_1^{(t+1)} \cdots q_{n-1}^{(t+1)}\, q_n\big). \end{aligned}$$
Algorithm 1 Coordinate ascent variational inference (CAVI).
[Algorithm 1 pseudocode figure not reproduced here.]
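Since the pseudocode figure is not reproduced here, the following minimal Python sketch conveys the structure of the CAVI loop; the function `coordinate_update` is a hypothetical placeholder for the model-specific closed-form coordinate update derived below.

```python
import numpy as np

def cavi(coordinate_update, params0, max_iter=100, tol=1e-10):
    """Generic CAVI loop (a sketch of Algorithm 1).

    coordinate_update(j, params) must return the optimal variational
    parameter for coordinate j with all other coordinates held fixed --
    the model-specific closed form q_j* discussed in the text.
    """
    params = np.array(params0, dtype=float)
    for _ in range(max_iter):
        old = params.copy()
        for j in range(len(params)):  # sweep the coordinates sequentially
            params[j] = coordinate_update(j, params)
        if np.max(np.abs(params - old)) < tol:  # stop when a sweep barely moves
            break
    return params
```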
The objective function ELBO ( q 1 q n ) is concave in each of the arguments individually (although it is rarely jointly concave), so these individual maximization problems have unique solutions. The optimal update for the jth mean field variational component of the model has the closed form,
$$q_j^*(x_j) \propto \exp\big\{ \mathbb{E}_{-j}[\Psi(x)] \big\},$$
where the expectations $\mathbb{E}_{-j}$ are taken with respect to the distribution $\prod_{i \neq j} q_i(x_i)$. Furthermore, the update steps of the algorithm are monotone, as each step of the CAVI increases the objective function:
$$\mathrm{ELBO}\big(q_1^{(t+1)} q_2^{(t+1)} \cdots q_n^{(t+1)}\big) \geq \mathrm{ELBO}\big(q_1^{(t+1)} q_2^{(t+1)} \cdots q_{n-1}^{(t+1)} q_n^{(t)}\big) \geq \cdots \geq \mathrm{ELBO}\big(q_1^{(t)} q_2^{(t)} \cdots q_n^{(t)}\big).$$
For parametric models, the sequential updates of the variational marginal distributions in the CAVI algorithm reduce to sequential updates of the variational parameters of these distributions. The CAVI updates for parametric models thus induce a discrete time dynamical system on the parameters. Clearly, convergence of the CAVI algorithm can be framed in terms of this induced discrete time dynamical system. As discussed before, the ELBO is generally a non-convex function, and hence the CAVI algorithm is only guaranteed to converge to a local optimum. It is also not clear how many local optima (or fixed points) the system has, nor whether the algorithm always settles on a single fixed point, diverges away from the fixed points, or cycles between multiple fixed points. These questions translate to questions about the existence and stability of fixed points of the induced dynamical system. We are also interested in how the behavior of the CAVI algorithm can change as we vary the parameters of the model. This translates to questions about the possible bifurcations of the induced dynamical system. In Section 3, we formally introduce the Ising model and its mean-field variational inference.

3. CAVI in Ising Model

We first briefly review the definition of an Ising model. The Ising model was first introduced as a model for magnetization in statistical physics, but has found many applications in other fields; see [26] and references therein. The Ising model is a probability distribution on the hypercube { ± 1 } n given by
$$p(x) \propto \exp\Big\{ \beta \sum_{u \sim v} J_{uv} x_u x_v + \beta \sum_u h_u x_u \Big\},$$
where the interaction matrix $J$ is a symmetric real $n \times n$ matrix with zeros on the diagonal, $h$ is a real $n$-vector that represents the external magnetic field, and $\beta$ is the inverse temperature parameter. The model is said to be ferromagnetic if $J_{uv} \geq 0$ for all $u, v$ and anti-ferromagnetic if $J_{uv} < 0$ for all $u, v$. The normalizing constant or the partition function of the Ising model is
$$Z = \sum_{x \in \{\pm 1\}^n} \exp\Big\{ \beta \sum_{u \sim v} J_{uv} x_u x_v + \beta \sum_u h_u x_u \Big\}.$$
Refer to Chapter 31 of [2] for an excellent review of Ising models.
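For intuition about why $Z$ is called intractable, note that a brute-force evaluation is possible only for very small $n$; the following snippet (our own illustration, not from the paper) enumerates all $2^n$ spin configurations.

```python
import numpy as np
from itertools import product

def ising_partition(J, h, beta):
    """Partition function of the Ising model by exhaustive enumeration.

    J: symmetric (n, n) interaction matrix with zero diagonal.
    h: external field vector of length n.
    Cost is O(2^n), feasible only for small n -- which is exactly why
    variational approximations are needed.
    """
    n = len(h)
    Z = 0.0
    for x in product([-1, 1], repeat=n):
        x = np.array(x)
        # sum over unordered pairs u < v, matching the u ~ v convention
        energy = beta * (x @ np.triu(J, 1) @ x) + beta * (h @ x)
        Z += np.exp(energy)
    return Z

J = np.array([[0.0, 1.0], [1.0, 0.0]])
print(ising_partition(J, h=np.zeros(2), beta=0.5))
```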

Mean Field Variational Inference in Ising Model

Here we provide a derivation of the CAVI update function for the Ising model, focusing on the two-node ($n = 2$) case for simplicity and analytic tractability.
Notice $\log p(x) := \beta H(x) = \beta \sum_{u \sim v} J_{uv} x_u x_v + \beta \sum_u h_u x_u$ up to an additive constant. In this case, we have the Ising model on two spins with $x = (x_1, x_2)$, influence matrix $J$ with off-diagonal term $J_{12}$, and external magnetic field $h = (h_1, h_2) = (0, 0)$; we carry the symbols $h_1, h_2$ through the derivation for generality. From the general framework in Section 2, the CAVI updates are given by
$$q_j^*(x_j) \propto \exp\big\{ \mathbb{E}_{-j}[\beta(J_{12} x_1 x_2 + h_1 x_1 + h_2 x_2)] \big\}.$$
Equivalently, the same updates are obtained by setting the gradient of the ELBO, as a function of $(x_1, x_2)$, equal to the $(0, 0)$ vector. Illustrations of the ELBO and the gradient functions for various values of $\beta$ are in Figure 1 and Figure 2, respectively.
Since $q_1^*$ and $q_2^*$ are two-point distributions, it is sufficient to keep track of the mass assigned to $1$. Simplifying,
$$\begin{aligned} q_1^*(x_1) &\propto \exp\big\{ \mathbb{E}_2[\log p(x_1, x_2)] \big\} \\ &= \exp\big\{ \beta H(x_1, x_2 = 1)\, q_2(x_2 = 1) + \beta H(x_1, x_2 = -1)\, q_2(x_2 = -1) \big\} \\ &= \exp\big\{ (\beta J_{12} x_1 + \beta h_1 x_1 + \beta h_2)\,\xi + (-\beta J_{12} x_1 + \beta h_1 x_1 - \beta h_2)(1 - \xi) \big\} \\ &= \exp\big\{ (2\xi - 1)(\beta J_{12} x_1 + \beta h_2) + \beta h_1 x_1 \big\}, \end{aligned}$$
where $\xi = q_2(x_2 = 1)$. Therefore
$$q_1^*(x_1 = 1) = \frac{\exp\{(2\xi - 1)(\beta J_{12} + \beta h_2) + \beta h_1\}}{\exp\{(2\xi - 1)(\beta J_{12} + \beta h_2) + \beta h_1\} + \exp\{(2\xi - 1)(-\beta J_{12} + \beta h_2) - \beta h_1\}} = \frac{1}{1 + \exp\{-2\beta J_{12}(2\xi - 1) - 2\beta h_1\}}.$$
Similarly, denoting $\zeta = q_1(x_1 = 1)$,
$$q_2^*(x_2 = 1) = \frac{\exp\{(2\zeta - 1)(\beta J_{12} + \beta h_1) + \beta h_2\}}{\exp\{(2\zeta - 1)(\beta J_{12} + \beta h_1) + \beta h_2\} + \exp\{(2\zeta - 1)(-\beta J_{12} + \beta h_1) - \beta h_2\}} = \frac{1}{1 + \exp\{-2\beta J_{12}(2\zeta - 1) - 2\beta h_2\}}.$$
Let $\zeta_k$ (resp. $\xi_k$) denote the $k$th iterate of $q_1(x_1 = 1)$ (resp. $q_2(x_2 = 1)$) from the CAVI algorithm. To succinctly represent these updates, define the logistic sigmoid function
$$\sigma(u, \beta) = \frac{1}{1 + e^{-\beta u}}, \qquad u \in [-1, 1],\ \beta \in \mathbb{R}.$$
With this notation, we have, for any $k \in \mathbb{Z}_+$,
$$\zeta_{k+1} = \sigma\big(J_{12}(2\xi_k - 1) + h_1,\ 2\beta\big), \qquad \xi_{k+1} = \sigma\big(J_{12}(2\zeta_{k+1} - 1) + h_2,\ 2\beta\big).$$
Without loss of generality, we henceforth set $J_{12} = 1$. Under this choice the model is in the ferromagnetic regime for $\beta > 0$ and the anti-ferromagnetic regime for $\beta < 0$.
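The updates in (8) are simple enough to implement directly. The following sketch is our own illustration; the parameter values in the two example calls are arbitrary.

```python
import numpy as np

def sigma(u, beta):
    """Two-argument logistic sigmoid from Equation (7)."""
    return 1.0 / (1.0 + np.exp(-beta * u))

def cavi_ising_2d(beta, h1=0.0, h2=0.0, J12=1.0, zeta0=0.5, xi0=0.5, n_iter=50):
    """Sequential CAVI for the two-node Ising model, Equation (8).

    zeta and xi track q1(x1 = 1) and q2(x2 = 1); xi is updated with the
    freshly computed zeta, which is what makes the scheme sequential.
    """
    zeta, xi = zeta0, xi0
    for _ in range(n_iter):
        zeta = sigma(J12 * (2 * xi - 1) + h1, 2 * beta)
        xi = sigma(J12 * (2 * zeta - 1) + h2, 2 * beta)
    return zeta, xi

print(cavi_ising_2d(beta=0.8, xi0=0.3))  # inside Dobrushin regime: -> (0.5, 0.5)
print(cavi_ising_2d(beta=1.2, xi0=0.3))  # outside: -> (0.17071..., 0.17071...)
```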

4. Why the Ising Model: A Summary of Our Contributions

There are exactly two cases of the Ising model that have a full analytic solution for the free energy. They are (i) the one dimensional line graph solved by Ernst Ising in his thesis [27] and (ii) the two dimensional case on the anisotropic square lattice when the magnetic field $h = 0$, solved in [28]. Comparison with the mean field solution for the same models highlights the poor approximation quality of the mean field solution in low dimensions. To the best of the authors' knowledge, there are no results in the literature detailing the properties of the mean field solution to the anti-ferromagnetic Ising model. Readers not familiar with the physics may wonder why this is the case. There are two cases in the anti-ferromagnetic regime: one is equivalent to the ferromagnetic case, and in the other the mean field approximation is not a good approximation of the system. The first case occurs on a bipartite graph, where a transformation of variables makes the anti-ferromagnetic regime equivalent to the ferromagnetic one [29]. The other case can be seen on the triangle graph. Fixing the spin of one vertex as $1$ and another as $-1$ leaves the third vertex geometrically frustrated: neither choice of spin lowers the energy level of the system, and the two configurations are equivalent [30]. In this case the mean field approximation gives a completely incorrect answer and does not merit further investigation from a qualitative point of view. The physics literature is primarily concerned with using the mean field solutions to the Ising model to estimate important physical constants of the systems. These constants are only meaningful when the mean field solution provides a good approximation to the behavior of the system in large dimensions. It is known, however, that under certain conditions the mean field approximation does indeed converge to the true free energy of the system as the dimension increases [21,31].
Our work is focused on providing a rigorous methodology to analyze dynamics of the CAVI algorithm that can be applied to any model structure. All of the interesting behaviors exhibited by the CAVI algorithm fit into the classical mathematical framework of discrete dynamical systems and bifurcation theory. Specifically, we use the Ising model as a simple and yet rich example to illustrate the potential of dynamical systems theory to analyze CAVI updates for mean field variational inference. The bifurcation of the ferromagnetic Ising model at the boundary of the Dobrushin regime is known [2,26]; however, a rigorous proof in terms of dynamical systems theory is missing in the literature.
There are several features that make the CAVI algorithm on the Ising model a nontrivial example worth investigating. The optimization problem arising from mean field variational inference on the Ising model is, in general, non-convex [21]. However, it is straightforward to obtain sufficient conditions that guarantee the existence of a global optimum. One such condition is that the inverse temperature $\beta$ lies inside the Dobrushin regime, $|\beta| < 1$ [21]. Inside the Dobrushin regime, the CAVI update equations form a contraction mapping, guaranteeing a unique global optimum [21]. Outside of this regime the behavior of the CAVI algorithm is nontrivial. The CAVI solution to the Ising model with zero external magnetic field exhibits multiple local optima outside of the Dobrushin regime [2].
Our contributions to the literature are as follows. We utilize tools from dynamical systems theory to rigorously classify the full behavior of the Ising model over the full parameter regime in dimension $n = 2$, for both the sequential and parallel versions of the CAVI algorithm. We show that the dynamical behavior of the sequential CAVI is not equivalent to the behavior of the parallel CAVI. Lastly, we derive a variational approximation to the Edward–Sokal parameter expansion of the Potts and random cluster models and numerically study its convergence behavior under the CAVI algorithm. Our numerical results reveal that the parameter expansion leads to an enlargement of the regime of convergence to the global optimum; in particular, the Dobrushin regime is strictly contained in the expanded regime. This is compatible with analogous results in the Markov chain literature; see the introduction of [32] for a well written summary of Markov chain mixing in the Ising model.

Statistical Significance of Our Results

Although mean-field variational inference has been routinely used in applications [3] for computational efficiency, it may not yield statistically optimal estimators. A statistically optimal estimator should correctly recover the statistical properties of the true distribution. Ideally, we would like the estimate to recover the true mean and true covariance of the distribution. It is well known that mean-field variational inference produces estimators that underestimate the posterior covariance [14]. More recently, it was shown that the mean-field estimators for certain topic models and stochastic block models may not even be correlated with the true distribution [17,20]. For these reasons, it is important to see if the mean field estimators can at least recover the true mean for various $\beta \in \mathbb{R}$.
Mean field inference approximates the joint probability mass function in (6) for $n = 2$ by a product of two distributions on $\{-1, 1\}$, in the sense of Kullback–Leibler divergence. As discussed in Section 2, minimizing this divergence is equivalent to maximizing an objective function, called the evidence lower bound (ELBO). Our objective is to better understand the relation between the CAVI estimate and the global maximum of the ELBO for (6) when $n = 2$ and $h = 0$. Ideally, we want the global maximum of the ELBO to be a statistically reliable estimate. To understand this, let us denote the distribution of $2 \times \mathrm{Bernoulli}(p) - 1$ by $\langle -1, 1; p \rangle$. As the marginal distributions of (6) are both equal to $\langle -1, 1; 0.5 \rangle$, we want the ELBO to be maximized at this value. From an algorithmic perspective, we would like to ensure that the CAVI iterates converge to this global maximum. The synergy of these two phenomena leads to a successful variational inference method. We show in this article that both these conditions can be violated in a certain regime of the parameter space in the context of the Ising model on two nodes. Inside the Dobrushin regime ($-1 \leq \beta \leq 1$), the global optimum of the ELBO obtained from mean field inference occurs at $(\langle -1, 1; 0.5 \rangle, \langle -1, 1; 0.5 \rangle)$, which is qualitatively the optimal solution. In this regime, the CAVI system converges to this global optimum irrespective of where the system is initialized. Thus, in the Dobrushin regime, mean field inference yields the statistically optimal estimate. Additionally, the CAVI algorithm is stable and convergent at this value. Unfortunately, this property deteriorates outside of the Dobrushin regime. Outside the regime, the global maxima occur at two symmetric points which are different from $(\langle -1, 1; 0.5 \rangle, \langle -1, 1; 0.5 \rangle)$. These two symmetric points are equivalent under label switching. For example, when $\beta = 1.2$ one of the optima is $(\langle -1, 1; 0.17071 \rangle, \langle -1, 1; 0.17071 \rangle)$ and the other is $(\langle -1, 1; 0.82928 \rangle, \langle -1, 1; 0.82928 \rangle)$. Notice this second optimum is equivalent to the sign-swapped version $(\langle 1, -1; 0.17071 \rangle, \langle 1, -1; 0.17071 \rangle)$.
The original optimum $(\langle -1, 1; 0.5 \rangle, \langle -1, 1; 0.5 \rangle)$ is actually a local minimum of the ELBO outside the Dobrushin regime. We illustrate in our theory that the CAVI system returns one of the two global maxima of the objective function depending on the initialization of the algorithm. Although it is widely known that the statistical quality of mean field inference is poor outside the regime, we show in addition that the algorithm itself exhibits erratic behavior and may not converge to the global maximizer of the ELBO for all initializations. Interestingly, outside the Dobrushin regime, the statistically optimal solution $(\langle -1, 1; 0.5 \rangle, \langle -1, 1; 0.5 \rangle)$ is a repelling fixed point of the CAVI system. This means that as the system is iterated, the current value of the system is pushed away from $(\langle -1, 1; 0.5 \rangle, \langle -1, 1; 0.5 \rangle)$ toward one of the global maxima.
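These values are straightforward to reproduce numerically. For $n = 2$, $J_{12} = 1$, $h = 0$ the mean-field ELBO reduces to $\beta(2\zeta - 1)(2\xi - 1)$ plus the entropies of the two Bernoulli marginals; the grid evaluation below is our own sanity check, not code from the paper.

```python
import numpy as np

def elbo(zeta, xi, beta):
    """Mean-field ELBO for the two-node Ising model with J12 = 1, h = 0:
    E_q[Psi] = beta * (2*zeta - 1) * (2*xi - 1) plus the entropies of the
    two Bernoulli marginals."""
    def ent(p):
        return -(p * np.log(p) + (1 - p) * np.log(1 - p))
    return beta * (2 * zeta - 1) * (2 * xi - 1) + ent(zeta) + ent(xi)

beta = 1.2
grid = np.linspace(1e-4, 1 - 1e-4, 2001)
Z, X = np.meshgrid(grid, grid)
vals = elbo(Z, X, beta)
i, j = np.unravel_index(np.argmax(vals), vals.shape)
print(grid[j], grid[i])      # a maximizer: approx (0.1707, 0.1707) or (0.8293, 0.8293)
print(vals.max(), elbo(0.5, 0.5, beta))  # (0.5, 0.5) is stationary but not maximal
```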
A common technique to further improve computational time is the use of block updates in the CAVI algorithm, meaning groups of parameters are updated simultaneously. We refer to this as the parallelized CAVI algorithm. This has been shown to work well in certain models [17], but has not been investigated in a general setting. However, it turns out that block updating in the Ising model can lead to new problematic behaviors. Outside the Dobrushin regime, block updates can exhibit non-convergence in the form of cycling. As the system updates, it eventually switches back and forth between two points that yield the same value in the objective function.
Parameter expansion (coupling) is another method of improving the convergence properties of algorithms. In the Markov chain theory for Ising models, it is well known that mixing and convergence time are typically improved by using the Edward–Sokal coupling, a parameter expansion of the ferromagnetic Ising model [33]. Our preliminary investigation reveals that the convergence properties of the CAVI algorithm exhibit a similar phenomenon.

5. Main Results

In this section, we analyze the behavior of the dynamical systems that one can form using the CAVI update equations and show that the behaviors of the systems differ. Our results rely heavily on well-known techniques in dynamical systems. For readers unfamiliar with some of the technical terminology below, we have included a primer on the basics of dynamical systems in Appendix A.
Recall the system of sequential updates, which are the updates used in CAVI:
$$\zeta_{k+1} = \sigma(2\xi_k - 1, 2\beta), \qquad \xi_{k+1} = \sigma(2\zeta_{k+1} - 1, 2\beta),$$
and the parallel updates:
$$\zeta_{k+1} = \sigma(2\xi_k - 1, 2\beta), \qquad \xi_{k+1} = \sigma(2\zeta_k - 1, 2\beta).$$
We will show that these two systems are not topologically conjugate. We first state and prove some results on the dynamics of the sigmoid function (7). These results will be used as building blocks to study the dynamics of (9) and (10). Phase change behavior of dynamical systems using the sigmoid and ReLU activation functions is known in the literature in the context of generalization performance of deep neural networks [34,35]. In this section we present a complete proof of the bifurcation analysis of non-linear dynamical systems involving the sigmoid activation function, despite its connections with [34,35]. Our results in Section 5.1 provide a more complete picture of the behavior of the dynamics in all regimes and can be readily exploited to analyze the dynamics of (9) and (10).

5.1. Sigmoid Function Dynamics

In this section we provide a full classification of the dynamics of the following sigmoid map and its second iterate,
$$x \mapsto \sigma(2x - 1, 2\beta),$$
$$x \mapsto \sigma\big(2\sigma(2x - 1, 2\beta) - 1, 2\beta\big).$$
To the best of our knowledge, a formal proof of the full classification of the dynamics of the sigmoid map (or its second iterate) for all $\beta \in \mathbb{R}$ is absent from the literature. Additionally, this classification provides an introductory example to demonstrate the concepts and techniques of dynamical systems. We begin by using numerical techniques to determine the number of fixed points in the system and its possible periodic behavior. We then proceed by providing a formal proof of the full dynamical properties of (11) in Theorem 1 and the full dynamical properties of (12) in Theorem 2.
Using numerical techniques, we solve for the number of fixed points of the system. The number of fixed points of the map (11) depends on the magnitude of the parameter. For $\beta > 0$ there is no periodic behavior, so there are no additional fixed points of (12) that are not fixed points of (11). For $-1 \leq \beta \leq 1$ there is a single fixed point at $x^* = 1/2$, and for $\beta > 1$ there are three fixed points $c_0(\beta), 1/2, c_1(\beta)$ in the interval $[0, 1]$. These fixed points satisfy $0 < c_0(\beta) < 1/2 < c_1(\beta) < 1$, with $c_0(\beta) \to 0$ and $c_1(\beta) \to 1$ as $\beta \to \infty$. For $\beta < 0$ we see periodic behavior in the system; there are fixed points of (12) that are not fixed points of (11). For $\beta < -1$, the map (11) has one fixed point at $x^* = 1/2$ and a periodic cycle $C = \{c_0(\beta), c_1(\beta)\}$. Both $c_0(\beta)$ and $c_1(\beta)$ are fixed points of (12), and these are the same fixed points as in the $\beta > 1$ regime, since (12) is an even function with respect to $\beta$.
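The fixed points $c_0(\beta)$ and $c_1(\beta)$ have no closed form but are easy to compute numerically. The sketch below is our own illustration, solving $\sigma(2x - 1, 2\beta) = x$ with a bracketing root-finder.

```python
import numpy as np
from scipy.optimize import brentq

def sigma(u, beta):
    return 1.0 / (1.0 + np.exp(-beta * u))

def fixed_points(beta):
    """Solve sigma(2x - 1, 2*beta) = x on [0, 1]. For beta > 1 the roots
    c0 < 1/2 < c1 are bracketed away from the repeller at 1/2."""
    g = lambda x: sigma(2 * x - 1, 2 * beta) - x
    eps = 1e-9
    c0 = brentq(g, eps, 0.5 - 1e-6)
    c1 = brentq(g, 0.5 + 1e-6, 1 - eps)
    return c0, c1

print(fixed_points(1.2))  # approx (0.17071, 0.82928), matching Section 4

# For beta < -1 the same two values form the 2-cycle C = {c0, c1} of (11),
# i.e., they are fixed points of the second iterate (12):
beta = -1.2
x = 0.17071
print(sigma(2 * x - 1, 2 * beta))  # maps c0 to c1 = 0.82928...
```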
Table 1 lists the values of the derivatives at the fixed point $1/2$ for $\beta = \pm 1$.
We now have enough information to provide a complete classification of the dynamics of the sigmoid function.
Theorem 1 
(Dynamics of the sigmoid map). Consider the discrete dynamical system generated by (11),
$$x \mapsto \sigma(2x - 1, 2\beta) = \frac{1}{1 + e^{-2\beta(2x - 1)}}.$$
The full dynamics of the system (11) are as follows:
1.
For $-1 < \beta < 1$, the system has a single hyperbolic fixed point $x^* = 1/2$ which is a global attractor, and there are no $p$-periodic points for $p \geq 2$.
2.
For $\beta > 1$, the system has one repelling hyperbolic fixed point $x^* = 1/2$ and two stable hyperbolic fixed points $c_0, c_1$, with $0 < c_0 < 1/2 < c_1 < 1$, and stable sets $W^s(c_0) = [0, 1/2)$, $W^s(c_1) = (1/2, 1]$. There are no $p$-periodic points for $p \geq 2$.
3.
For $\beta < -1$, the system has one unstable hyperbolic fixed point $x^* = 1/2$, and a stable 2-cycle $C = \{c_0, c_1\}$ with stable set $W^s(C) = [0, 1/2) \cup (1/2, 1]$, where $0 < c_0 < 1/2 < c_1 < 1$. There are no $p$-periodic points for $p > 2$.
4.
For $|\beta| = 1$, the system has one non-hyperbolic fixed point at $x^* = 1/2$ which is asymptotically stable and attracting.
The system undergoes a period-doubling (PD) bifurcation at $\beta = -1$ and a pitchfork bifurcation at $\beta = 1$.
Proof. 
We will break the proof up into three parts. The first part of the proof is a linear stability analysis of the system, the second part is a stability analysis of the periodic points in the system and the third part is an analysis of the bifurcations of the system. We begin with a linear stability analysis of the system at each fixed point. For $\beta \leq 1$ the system has one fixed point $x^* = 1/2$, and for $\beta > 1$ the system has three fixed points $c_0, 1/2, c_1$. The derivative of $\sigma(2x - 1, 2\beta)$ with respect to $x$ is $\sigma_x(2x - 1, 2\beta) = 4\beta\,\sigma(2x - 1, 2\beta)\big(1 - \sigma(2x - 1, 2\beta)\big)$.
Fixed point $x^* = 1/2$: The Jacobian of the system at the fixed point $x^* = 1/2$ is $\sigma_x(2x^* - 1, 2\beta) = \beta$. For $\beta \neq \pm 1$ the fixed point $x^* = 1/2$ is hyperbolic, and for $\beta = \pm 1$ the fixed point is non-hyperbolic. We classify the stability of the hyperbolic fixed point $x^* = 1/2$ using Theorem A2. For $|\beta| < 1$ the fixed point $x^* = 1/2$ is globally attracting, as $|\sigma_x(2x^* - 1, 2\beta)| < 1$, and for $|\beta| > 1$ the fixed point $x^* = 1/2$ is globally repelling, as $|\sigma_x(2x^* - 1, 2\beta)| > 1$. For $\beta = \pm 1$ we invoke Theorem A3 to check stability of the fixed point. At $\beta = -1$ we have $\sigma_x(2x^* - 1, 2\beta) = -1$ and we need to check the Schwarzian derivative; the fixed point $x^* = 1/2$ is asymptotically stable for $\beta = -1$ by Theorem A3, as $S\sigma(2x - 1, 2\beta)\big|_{x = x^*} = -8$. For $\beta = 1$ we have $\sigma_x(2x^* - 1, 2\beta) = 1$ and we need to check the second and third derivatives at the fixed point; the fixed point $x^* = 1/2$ is asymptotically stable when $\beta = 1$ by Theorem A3, as $\sigma_{xx}(2x^* - 1, 2\beta) = 0$ and $\sigma_{xxx}(2x^* - 1, 2\beta) = -8$.
Fixed points $c_0, c_1$: These fixed points have the same behavior, so we have grouped them together in the analysis. When $\beta > 1$ there are two additional fixed points $c_0, c_1$ of the system; both are attracting fixed points by Theorem A2, as $|\sigma_x(2c_i - 1, 2\beta)| < 1$ for each $i = 0, 1$ and all $\beta > 1$. The stable sets are $W^s(c_0) = [0, 1/2)$ and $W^s(c_1) = (1/2, 1]$.
Periodic points: For $\beta < -1$ we see the 2-cycle $C = \{c_0, c_1\}$. Notice $\sigma(2c_0 - 1, 2\beta) = c_1$ and $\sigma(2c_1 - 1, 2\beta) = c_0$. This 2-cycle is stable since $c_0$ and $c_1$ are both stable fixed points of (12). The stable set is $W^s(C) = [0, 1/2) \cup (1/2, 1]$, with $0 < c_0 < 1/2 < c_1 < 1$.
At $(x^*, \beta^*) = (1/2, 1)$ the system undergoes a pitchfork bifurcation, as it satisfies the conditions in Theorem A5:
$$\sigma(2x^* - 1, 2\beta^*) = 1/2, \quad \sigma_x(2x^* - 1, 2\beta^*) = 1, \quad \sigma_{xx}(2x^* - 1, 2\beta^*) = 0,$$
$$\sigma_\beta(2x^* - 1, 2\beta^*) = 0, \quad \sigma_{x\beta}(2x^* - 1, 2\beta^*) \neq 0, \quad \sigma_{xxx}(2x^* - 1, 2\beta^*) \neq 0.$$
Similarly, at $(x^*, \beta^*) = (1/2, -1)$ the system undergoes a period-doubling bifurcation, as it satisfies the conditions in Theorem A4:
$$\sigma(2x^* - 1, 2\beta^*) = 1/2, \quad \sigma_x(2x^* - 1, 2\beta^*) = -1, \quad \sigma_{xx}(2x^* - 1, 2\beta^*) = 0,$$
$$\sigma_\beta(2x^* - 1, 2\beta^*) = 0, \quad \sigma_{x\beta}(2x^* - 1, 2\beta^*) \neq 0, \quad \sigma_{xxx}(2x^* - 1, 2\beta^*) \neq 0.$$
 □
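The derivative computations used in the proof can be double-checked symbolically. The snippet below is our own verification of $\sigma_x = \beta$, $\sigma_{xx} = 0$ and $\sigma_{xxx} = -8\beta^3$ at $x^* = 1/2$, and of the Schwarzian derivative value at $\beta = -1$.

```python
import sympy as sp

x, beta = sp.symbols('x beta')
f = 1 / (1 + sp.exp(-2 * beta * (2 * x - 1)))  # the map (11)

at_half = {x: sp.Rational(1, 2)}
print(sp.simplify(sp.diff(f, x).subs(at_half)))     # beta
print(sp.simplify(sp.diff(f, x, 2).subs(at_half)))  # 0
print(sp.simplify(sp.diff(f, x, 3).subs(at_half)))  # -8*beta**3

# Schwarzian derivative S f = f'''/f' - (3/2)(f''/f')^2 at x = 1/2, beta = -1
f1, f2, f3 = [sp.diff(f, x, k).subs(at_half).subs(beta, -1) for k in (1, 2, 3)]
print(sp.simplify(f3 / f1 - sp.Rational(3, 2) * (f2 / f1) ** 2))  # -8
```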
We can fully classify the dynamics of (12) using the above theorem. We omit the proof as it is similar to the proof of Theorem 1.
Theorem 2. 
The full dynamics of the system (12) are as follows:
1.
For $-1 < \beta < 1$, the system has a single hyperbolic fixed point $x^* = 1/2$ which is a global attractor, and there are no $p$-periodic points for $p \geq 2$.
2.
For $|\beta| > 1$, the system has one repelling hyperbolic fixed point $x^* = 1/2$ and two stable hyperbolic fixed points $c_0, c_1$, with $0 < c_0 < 1/2 < c_1 < 1$, and stable sets $W^s(c_0) = [0, 1/2)$, $W^s(c_1) = (1/2, 1]$.
3.
For | β | = 1 , the system has one non-hyperbolic fixed point at x * = 1 / 2 which is asymptotically stable and attracting.
The system undergoes a pitchfork bifurcation at $\beta = \pm 1$. There are no $p$-periodic points for $p \geq 2$.

5.2. Sequential Dynamics

To fully understand the dynamics of the equations defining the updates to $q_1^*$ and $q_2^*$ it suffices to track the evolution of the points $q_1^*(1) = \zeta$ and $q_2^*(1) = \xi$. The CAVI algorithm updates the terms sequentially, using the new values of the variables to calculate the others. We initialize the CAVI algorithm at points $\zeta_0, \xi_0$. The CAVI algorithm is a dynamical system formed by sequential iterations of $\sigma(2x - 1, 2\beta)$ starting from $\zeta_0, \xi_0$. We can decouple the CAVI updates for $\xi_k$ and $\zeta_k$ by looking at the second iterations. This decoupling is visualized in diagram (14) below. The system formed by the sequential updates is equivalent to the following decoupled system
$$\zeta_1 = \sigma(2\xi_0 - 1, 2\beta), \qquad \zeta_{k+1} = \sigma\big(2\sigma(2\zeta_k - 1, 2\beta) - 1, 2\beta\big), \ k \geq 1, \qquad \xi_{k+1} = \sigma\big(2\sigma(2\xi_k - 1, 2\beta) - 1, 2\beta\big), \ k \geq 0.$$
We propose to investigate the dynamics of the sequential system (9) by studying the dynamics of the individual subsequences $\zeta_{k+1}$ and $\xi_{k+1}$ of the decoupled system (13). The dynamical properties of the individual subsequences follow from a combination of Theorem 1, Theorem 2 and other methods from Appendix A.
[Diagram (14): the sequential updates unrolled into the $\zeta$ and $\xi$ subsequences; figure not reproduced here.]
Illustrations of the evolution of the dynamics of the sequential updates for various initializations and values of β are in Figure 3, Figure 4, Figure 5 and Figure 6.
Theorem 3 
(CAVI dynamics). The dynamics of the CAVI system (9) are given by:
1.
For $\beta < -1$, the system has one locally asymptotically unstable fixed point $(1/2, 1/2)$ and two locally asymptotically stable fixed points $(c_1, c_0)$ and $(c_0, c_1)$, with stable sets $W^s((c_1, c_0)) = [0, 1] \times [0, 1/2)$ and $W^s((c_0, c_1)) = [0, 1] \times (1/2, 1]$, respectively.
2.
For $|\beta| \leq 1$, the system has a globally asymptotically stable fixed point $(1/2, 1/2)$.
3.
For $\beta > 1$, the system has one locally asymptotically unstable fixed point $(1/2, 1/2)$ and two locally asymptotically stable fixed points $(c_0, c_0)$ and $(c_1, c_1)$, with domains of attraction $W^s((c_0, c_0)) = [0, 1] \times [0, 1/2)$ and $W^s((c_1, c_1)) = [0, 1] \times (1/2, 1]$, respectively,
where $0 < c_0 < 1/2 < c_1 < 1$ are the fixed points of (11) in $[0, 1]$. The system undergoes a super-critical pitchfork bifurcation at $\beta = 1$ and again at $\beta = -1$. Furthermore, the system has no $p$-periodic points for $p \geq 2$.
Proof. 
We will proceed to construct the dynamics of the system (9) by tracing the behavior of the dynamics in the equivalent system (13). The dynamics of each of these subsequences are governed by the maps (11) and (12) and depend on the initialization $\xi_0$. The behavior of the subsequence $\xi_{k+1}$, $k \geq 0$, is governed by Theorem 2. Similarly the behavior of the subsequence $\zeta_{k+1}$, $k \geq 1$, is governed by Theorem 2, with the additional point $\zeta_1 = \sigma(2\xi_0 - 1, 2\beta)$ governed by Theorem 1. For $|\beta| < 1$, (11) has a globally stable fixed point at $x^* = 1/2$ and thus, for all $\xi_0$, $\zeta_1 = \sigma(2\xi_0 - 1, 2\beta) \in W^s(1/2)$. It now follows from Theorem 2 that the only fixed point of the sequential system is $(1/2, 1/2)$, which must be globally stable. For $\beta = \pm 1$, the fixed point $x^* = 1/2$ is asymptotically stable by Theorem A3. The system undergoes a super-critical pitchfork bifurcation at $\beta = 1$ and again at $\beta = -1$ as a consequence of its relation to (12). For $\beta > 1$, (11) bifurcates. We have the unstable fixed point $x^* = 1/2$ and the two locally stable fixed points: $c_0$ with stable set $W^s(c_0) = [0, 1/2)$, and $c_1$ with stable set $W^s(c_1) = (1/2, 1]$. For $\xi_0 \in W^s(c_0)$ we have $\zeta_1 \in W^s(c_0)$ and $\xi_1 \in W^s(c_0)$. It now follows from Theorem 2 that the system converges to $(c_0, c_0)$ and that $W^s((c_0, c_0)) = [0, 1] \times [0, 1/2)$. A similar argument shows the system converges to $(c_1, c_1)$ for $\xi_0 \in W^s(c_1)$, with $W^s((c_1, c_1)) = [0, 1] \times (1/2, 1]$. Lastly, $(1/2, 1/2)$ is a repelling fixed point of the system since $x^* = 1/2$ is a repelling fixed point for both (11) and (12). For $\beta < -1$, (11) bifurcates. We have the unstable fixed point $x^* = 1/2$ and the stable 2-cycle $C = \{c_0, c_1\}$ with stable set $W^s(C) = [0, 1/2) \cup (1/2, 1]$. For any $\xi_0 < 1/2$ we have $\zeta_1 > 1/2$ and $\xi_1 < 1/2$. It now follows from Theorem 2 that the system converges to $(c_1, c_0)$ and that $W^s((c_1, c_0)) = [0, 1] \times [0, 1/2)$. A similar argument shows the system converges to $(c_0, c_1)$ for $\xi_0 > 1/2$, with $W^s((c_0, c_1)) = [0, 1] \times (1/2, 1]$. Lastly, $(1/2, 1/2)$ is a repelling fixed point of the system since $x^* = 1/2$ is a repelling fixed point for both (11) and (12). The dynamics of (13) lack any $p$-periodic points and cycles for $p \geq 2$ as a consequence of its construction from (12). □

5.3. Parallel Updates

The system of parallel updates is defined by the one-step map $F : \mathbb{R}^2 \to \mathbb{R}^2$,
$$\begin{pmatrix} \zeta \\ \xi \end{pmatrix} \mapsto F(\zeta, \xi) = \begin{pmatrix} \sigma(2\xi - 1, 2\beta) \\ \sigma(2\zeta - 1, 2\beta) \end{pmatrix}.$$
The dynamics of the parallel system are similar to the system studied in [36]. As we shall show below, the parallel system exhibits periodic behavior that the sequential system does not, and it follows as a corollary that the systems are not locally topologically conjugate.
The parallelized CAVI algorithm is a dynamical system formed by iterations of $F$ defined in (15). We shall decouple the parallelized CAVI updates for the sequences $\xi_k$ and $\zeta_k$ by looking at iterations of (12) acting on the sequences individually. This decoupling is visualized in diagram form
[Diagram (16): the parallel updates unrolled into even and odd subsequences; figure not reproduced here.]
where each cross is an application of $F$. The system formed by the parallel updates is equivalent to the following decoupled systems of even and odd subsequences. The even subsequences are
$$\zeta_{2k} = \sigma\big(2\sigma(2\zeta_{2(k-1)} - 1, 2\beta) - 1, 2\beta\big), \quad k \geq 1,$$
$$\xi_{2k} = \sigma\big(2\sigma(2\xi_{2(k-1)} - 1, 2\beta) - 1, 2\beta\big), \quad k \geq 1.$$
The odd subsequences are
$$\zeta_{2k+1} = \begin{cases} \sigma(2\xi_0 - 1, 2\beta), & k = 0, \\ \sigma\big(2\sigma(2\zeta_{2k-1} - 1, 2\beta) - 1, 2\beta\big), & k \geq 1, \end{cases}$$
$$\xi_{2k+1} = \begin{cases} \sigma(2\zeta_0 - 1, 2\beta), & k = 0, \\ \sigma\big(2\sigma(2\xi_{2k-1} - 1, 2\beta) - 1, 2\beta\big), & k \geq 1. \end{cases}$$
Following a similar approach to the one used to study the sequential dynamics, we investigate the dynamics of the parallel system (15) by studying the dynamics of the four individual subsequences (17)–(20) of the decoupled system given by diagram (16). The dynamical properties of the individual subsequences follow from a combination of Theorem 1, Theorem 2 and other methods from Appendix A. Illustrations of the evolution of the dynamics of the parallel updates for various initializations and values of $\beta$ are in Figure 7, Figure 8, Figure 9, Figure 10, Figure 11 and Figure 12.
We now present the main result for the parallel dynamics.
Theorem 4 
(Parallel dynamics). The dynamics of the parallel system (10) are as follows:
1.
For $\beta < -1$, the system has two locally asymptotically stable fixed points $(c_1, c_0)$ and $(c_0, c_1)$, and one locally asymptotically unstable fixed point $(1/2, 1/2)$, where $c_0$ and $c_1$ are the fixed points of (11). Furthermore, the system exhibits periodic behavior in the form of 2-cycles: the asymptotically stable 2-cycle $C_1 = \{(c_0, c_0), (c_1, c_1)\}$ and the asymptotically unstable 2-cycles
$$C_2 = \{(1/2, c_1), (c_0, 1/2)\} \quad \text{and} \quad C_3 = \{(1/2, c_0), (c_1, 1/2)\}.$$
The stable sets are
$$\begin{aligned} W^s((c_0, c_1)) &= [0, 1/2) \times (1/2, 1], \\ W^s((c_1, c_0)) &= (1/2, 1] \times [0, 1/2), \\ W^s(C_1) &= \big([0, 1/2) \times [0, 1/2)\big) \cup \big((1/2, 1] \times (1/2, 1]\big), \\ W^s(C_2) &= \big([0, 1/2) \times \{1/2\}\big) \cup \big(\{1/2\} \times (1/2, 1]\big), \\ W^s(C_3) &= \big((1/2, 1] \times \{1/2\}\big) \cup \big(\{1/2\} \times [0, 1/2)\big). \end{aligned}$$
2.
For $-1 \leq \beta \leq 1$, the system has a globally attracting fixed point $(1/2, 1/2)$.
3.
For $\beta > 1$, the system has two locally asymptotically stable fixed points $(c_0, c_0)$ and $(c_1, c_1)$, and one locally asymptotically unstable fixed point $(1/2, 1/2)$, where $c_0$ and $c_1$ are the fixed points of (11). Furthermore, the system exhibits periodic behavior in the form of 2-cycles: the asymptotically stable 2-cycle $C_3 = \{(c_0, c_1), (c_1, c_0)\}$ and the asymptotically unstable 2-cycles $C_4 = \{(1/2, c_0), (c_0, 1/2)\}$ and $C_5 = \{(1/2, c_1), (c_1, 1/2)\}$. The stable sets are
$$\begin{aligned} W^s((c_0, c_0)) &= [0, 1/2) \times [0, 1/2), \\ W^s((c_1, c_1)) &= (1/2, 1] \times (1/2, 1], \\ W^s(C_3) &= \big([0, 1/2) \times (1/2, 1]\big) \cup \big((1/2, 1] \times [0, 1/2)\big), \\ W^s(C_4) &= \big([0, 1/2) \times \{1/2\}\big) \cup \big(\{1/2\} \times [0, 1/2)\big), \\ W^s(C_5) &= \big(\{1/2\} \times (1/2, 1]\big) \cup \big((1/2, 1] \times \{1/2\}\big). \end{aligned}$$
The system has no $p$-periodic points for $p > 2$. The system undergoes a PD bifurcation at $\beta = -1$ and a pitchfork bifurcation at $\beta = 1$.
Proof. 
The dynamics of the system defined by $F$ in (15) are equivalent to the dynamics of the system generated by the subsequences (17)–(20). The dynamics of each of these subsequences are governed by the maps (11) and (12). By Theorems 1 and 2, we have the behavior of each of the subsequences (17)–(20). For $|\beta| < 1$, (11) has a globally stable fixed point at $x^* = 1/2$, and thus the only fixed point of the parallel system is $(1/2, 1/2)$, which must be globally stable. For $\beta = \pm 1$, the fixed point $x^* = 1/2$ is asymptotically stable by Theorem A3.
For $\beta > 1$, (11) bifurcates. We have the unstable fixed point $x^* = 1/2$ and the two locally stable fixed points: $c_0$ with stable set $W^s(c_0) = [0, 1/2)$, and $c_1$ with stable set $W^s(c_1) = (1/2, 1]$. Returning to the system generated by $F$, if we consider the initialization $(\zeta_0, \xi_0) = (c_0, c_0)$, then by the sequence construction of $\zeta_n$, given in (17) and (19), we see that $\zeta_n = c_0$ for $n \geq 1$, as $c_0$ is a fixed point of (11) for $\beta > 1$. Similarly, using the sequence construction of $\xi_n$, given in (18) and (20), we see that $\xi_n = c_0$ for $n \geq 1$. Therefore, $(c_0, c_0)$ is a fixed point. An analogous argument shows that $(c_1, c_1)$ is also a fixed point. The parallel system has the stable fixed points $(c_0, c_0)$ with stable set $W^s((c_0, c_0)) = W^s(c_0) \times W^s(c_0)$ and $(c_1, c_1)$ with stable set $W^s((c_1, c_1)) = W^s(c_1) \times W^s(c_1)$. After the bifurcation at $\beta = 1$ the parallel system also contains 2-cycles. Using the sequence construction we see that $C_3 = \{(c_1, c_0), (c_0, c_1)\}$ is an asymptotically stable 2-cycle in the parallel system, with stable set $W^s(C_3) = \big((1/2, 1] \times [0, 1/2)\big) \cup \big([0, 1/2) \times (1/2, 1]\big)$. Additionally, we have two asymptotically unstable 2-cycles $C_4 = \{(c_0, 1/2), (1/2, c_0)\}$ and $C_5 = \{(c_1, 1/2), (1/2, c_1)\}$. Perturbing the $1/2$ coordinate in an unstable cycle pushes it into the basin of attraction of one of the fixed points or of the asymptotically stable 2-cycle. The stable sets are $W^s(C_4) = \big([0, 1/2) \times \{1/2\}\big) \cup \big(\{1/2\} \times [0, 1/2)\big)$ and $W^s(C_5) = \big(\{1/2\} \times (1/2, 1]\big) \cup \big((1/2, 1] \times \{1/2\}\big)$. The dynamics of $F$ lack any $p$-periodic points and cycles for $p > 2$ as a consequence of its construction from (12).
For $\beta < -1$, (11) bifurcates. We have the unstable fixed point $x^* = 1/2$ and the stable 2-cycle $C = \{c_0, c_1\}$ with stable set $W^s(C) = [0, 1/2) \cup (1/2, 1]$. Returning to the system generated by $F$, if we consider the initialization $(\zeta_0, \xi_0) = (c_0, c_1)$, then by the sequence construction of $\zeta_n$, given in (17) and (19), we see that $\zeta_n = c_0$ for $n \geq 1$, as $C$ is a 2-cycle of (11) for $\beta < -1$. Similarly, using the sequence construction of $\xi_n$, given in (18) and (20), we see that $\xi_n = c_1$ for $n \geq 1$. Therefore, $(c_0, c_1)$ is a fixed point. An analogous argument shows that $(c_1, c_0)$ is also a fixed point. The parallel system has the stable fixed points $(c_0, c_1)$ with stable set $W^s((c_0, c_1)) = W^s(c_0) \times W^s(c_1)$ and $(c_1, c_0)$ with stable set $W^s((c_1, c_0)) = W^s(c_1) \times W^s(c_0)$, where $W^s(c_0) = [0, 1/2)$ and $W^s(c_1) = (1/2, 1]$. After the bifurcation at $\beta = -1$ the parallel system also contains 2-cycles. Using the sequence construction we see that $C_1 = \{(c_0, c_0), (c_1, c_1)\}$ is an asymptotically stable 2-cycle in the parallel system, with stable set $W^s(C_1) = \big(W^s(c_0) \times W^s(c_0)\big) \cup \big(W^s(c_1) \times W^s(c_1)\big)$. Additionally, we have two asymptotically unstable 2-cycles $C_2 = \{(c_0, 1/2), (1/2, c_1)\}$ and $C_3 = \{(c_1, 1/2), (1/2, c_0)\}$. Perturbing the $1/2$ coordinate in an unstable cycle pushes it into the basin of attraction of one of the fixed points or of the asymptotically stable 2-cycle. The stable sets are $W^s(C_2) = \big([0, 1/2) \times \{1/2\}\big) \cup \big(\{1/2\} \times (1/2, 1]\big)$ and $W^s(C_3) = \big((1/2, 1] \times \{1/2\}\big) \cup \big(\{1/2\} \times [0, 1/2)\big)$. The dynamics of $F$ lack any $p$-periodic points and cycles for $p > 2$ as a consequence of its construction from (12).
This completes the characterization of the dynamics of F for β R . □

5.4. A Comparison of the Dynamics

We end the section by providing a comparison of the dynamical properties of the sequential system in Theorem 3 and the parallel system in Theorem 4. The main difference between the sequential system and the parallel system is the presence of two-cycles that can be found in the parallel system when | β | > 1 . This behavior stems from the difference between the sequential and parallel implementations of the CAVI. Looking closely at the update diagrams for the two systems reveals the key difference that produces these two-cycles. The decoupled sequential system is
[Diagram: the decoupled sequential system; both subsequences are driven by $\xi_0$. Figure not reproduced here.]
and the decoupled parallel system is
[Diagram: the decoupled parallel system; the subsequences are driven by both $\zeta_0$ and $\xi_0$. Figure not reproduced here.]
The major difference between these diagrams is how the individual update sequences begin. Notice $\zeta_0$ plays no role in updating the sequential system, as both the $\zeta_k$ update sequence and the $\xi_k$ update sequence depend only on the choice of $\xi_0$. Even after rewriting the sequential updates in terms of individual sequences, the system is not truly decoupled, as both sequences depend on a common starting point. This precisely prescribes the behavior that we see in the system relative to the sigmoid dynamics in Theorems 1 and 2. Compare this to the parallel system. Here $\zeta_0$ is involved in updating both the odd $\xi_{2k+1}$ subsequence and the even $\zeta_{2k}$ subsequence. Furthermore, $\xi_0$ remains involved by controlling the updates for the even $\xi_{2k}$ subsequence and the odd $\zeta_{2k+1}$ subsequence. This additional flexibility allows the parallel system to develop periodic behavior outside of the Dobrushin regime ($-1 \leq \beta \leq 1$).
As an example, we compare initializing the sequential algorithm and the parallel algorithm at the same point for $\beta = 1.2$. We begin with the sequential algorithm. For $\beta = 1.2$, consider initializing the sequential system at $(\zeta_0, \xi_0) = (0.7, 0.3)$. The sequential system updates are fully determined by $\xi_0$, so for $\xi_0 = 0.3$ it follows from Theorem 1 that an application of the map (11) gives $\zeta_1 \in W^s(c_0)$. At this point, the system can be evolved by applying (12) to the independent sequences for $\zeta$ and $\xi$ as given in (13). The dynamics of the system are now controlled by the map (12). From this initialization the system converges to the fixed point $(c_0, c_0) = (0.17071, 0.17071)$, as shown in Figure 6.
Contrast this with the behavior of the parallel system, in which the updates are determined by both $\xi_0$ and $\zeta_0$. For $\beta = 1.2$, consider initializing the parallel system at $(\zeta_0, \xi_0) = (0.7, 0.3)$. It follows from Theorem 1 that an application of the map (11) gives $\zeta_1 \in W^s(c_0)$ and $\xi_1 \in W^s(c_1)$. Successive updates cause the sequences $\zeta_k$ and $\xi_k$ to flip back and forth between the domains $W^s(c_0)$ and $W^s(c_1)$, until the system settles into the 2-cycle $C_3 = \{(c_0, c_1), (c_1, c_0)\} = \{(0.17071, 0.82928), (0.82928, 0.17071)\}$, as seen in Figure 10.
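The contrast between the two schemes is immediate in simulation. The following sketch (our own) runs both update rules from $(\zeta_0, \xi_0) = (0.7, 0.3)$ at $\beta = 1.2$.

```python
import numpy as np

def sigma(u, beta):
    return 1.0 / (1.0 + np.exp(-beta * u))

beta = 1.2
zeta_s, xi_s = 0.7, 0.3   # sequential state
zeta_p, xi_p = 0.7, 0.3   # parallel state

for _ in range(30):
    # sequential (CAVI): xi is updated with the *new* zeta
    zeta_s = sigma(2 * xi_s - 1, 2 * beta)
    xi_s = sigma(2 * zeta_s - 1, 2 * beta)
    # parallel (block) update: both coordinates use the *old* state
    zeta_p, xi_p = sigma(2 * xi_p - 1, 2 * beta), sigma(2 * zeta_p - 1, 2 * beta)

print(zeta_s, xi_s)  # converges to (0.17071..., 0.17071...)
print(zeta_p, xi_p)  # after an even number of steps: approx (0.829, 0.171)
# one more application of the parallel map lands on the other cycle point
print(sigma(2 * xi_p - 1, 2 * beta), sigma(2 * zeta_p - 1, 2 * beta))
```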
This simple example highlights the danger of naively parallelizing the CAVI algorithm. The convergence properties of a parallel version of the CAVI algorithm will depend heavily on the model's CAVI update equations. In the case of the Ising model, we have demonstrated that for certain parameter regimes the parallel implementation of the algorithm can fail to converge due to the dependence of the algorithm on both $\zeta_0$ and $\xi_0$.

6. Edward–Sokal Coupling

One method of improving convergence in Markov chains is through the use of probabilistic couplings. The Edward–Sokal (ES) coupling is a coupling of two statistical physics models, the random cluster model and the Potts model (a generalization of the Ising model) [37]. Running a Markov chain on the ES coupling leads to improved mixing properties compared to the equivalent Potts and random cluster models [33]. Motivated by these findings in the Markov chain literature, we ask a similar question: can the convergence properties of mean-field VI be improved by using the ES coupling in place of the Ising model? In this section we investigate this idea numerically. We first introduce the Edward–Sokal coupling following [37]. We then introduce a variational family for the Edward–Sokal coupling and derive the variational updates for this model. Our findings suggest the variational updates converge to a unique solution over a larger range than the equivalent Dobrushin regime for the corresponding Ising measure.

6.1. Random Cluster Model

Let $G = (V, E)$ be a finite graph, and let $e = \langle x, y \rangle \in E$ denote an edge in $G$ with endpoints $x, y \in V$. Let $\Sigma = \{1, 2, \ldots, q\}^V$, $\Omega = \{0, 1\}^E$, and let $\mathcal{F}$ denote the power set of $\Omega$. The random cluster model is a two-parameter probability measure on $(\Omega, \mathcal{F})$, with an edge weight parameter $p \in [0, 1]$ and a cluster weight parameter $q \in \{2, 3, \ldots\}$, given by
$$\phi_{p,q}(\omega) \propto \prod_{e \in E} p^{\omega(e)} (1 - p)^{1 - \omega(e)} \, q^{\kappa(\omega)},$$
where $\kappa(\omega)$ denotes the number of connected components in the subgraph corresponding to $\omega$. The partition function for the random cluster model is
$$Z_{RC} = \sum_{\omega \in \Omega} \prod_{e \in E} p^{\omega(e)} (1 - p)^{1 - \omega(e)} \, q^{\kappa(\omega)}.$$
For $q = 2$ the random cluster model reduces to the Ising model on $G$.
The Edward–Sokal coupling is a probability measure $\mu$ on $\Sigma \times \Omega$ given by
$$\mu(\sigma, \omega) \propto \prod_{e \in E} \Big[ (1 - p)\, \delta_{\omega(e), 0} + p\, \delta_{\omega(e), 1}\, \delta_e(\sigma) \Big],$$
where $\delta_{a,b} = \mathbf{1}(a = b)$ and $\delta_e(\sigma) = \mathbf{1}(\sigma_x = \sigma_y)$ for $e = \langle x, y \rangle \in E$.
It is well known that in the special case $p = 1 - e^{-\beta}$ and $q = 2$, the $\Sigma$-marginal of the ES coupling is the Ising model and the $\Omega$-marginal is the random cluster model [37]. We are interested in better understanding how the convergence of the CAVI algorithm on the ES coupling compares to the convergence of the CAVI algorithm on the Ising model.
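On the two-node graph, the marginal identity can be verified by direct enumeration. The check below is our own illustration; it uses the Potts-convention spin weight $\exp\{\beta \sum_e \delta_e(\sigma)\}$, so agreement is up to a constant ratio.

```python
import numpy as np
from itertools import product

beta, q = 0.8, 2
p = 1 - np.exp(-beta)

# Two nodes joined by a single edge; sigma in {1,...,q}^2, omega in {0,1}.
def mu(sigma, omega):
    """Unnormalized Edward-Sokal weight for one edge."""
    same = 1.0 if sigma[0] == sigma[1] else 0.0
    return (1 - p) if omega == 0 else p * same

for sigma in product(range(1, q + 1), repeat=2):
    marginal = sum(mu(sigma, w) for w in (0, 1))
    potts = np.exp(beta * (1.0 if sigma[0] == sigma[1] else 0.0))
    # the ratio is the same constant for every sigma, so
    # marginal is proportional to exp{beta * delta_e(sigma)}
    print(sigma, marginal, potts, marginal / potts)
```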

6.2. VI Objective Function

To calculate the VI updates for each variable we make use of the alternative characterization of the ES coupling,
$$\mu(\sigma, \omega) \propto \psi(\sigma)\, \phi_{p,1}(\omega)\, \mathbf{1}_F(\sigma, \omega),$$
where $\psi$ is the uniform measure on $\Sigma$ and $\phi_{p,1}(\omega)$ is a product measure on $\Omega$,
$$\phi_{p,1}(\omega) = \prod_{e \in E} p^{\omega(e)} (1 - p)^{1 - \omega(e)},$$
and
$$F = \big\{ (\sigma, \omega) : \omega(e) = 1 \implies \delta_e(\sigma) = 1 \ \text{for all } e \in E \big\},$$
that is, the set of configurations in which every open edge joins two equal spins.
The variational family that we will be optimizing over is
$$q(\sigma, \omega) = q_1(\sigma_1)\, q_2(\sigma_2)\, q_0(\omega)\, \mathbf{1}_F(\sigma, \omega).$$
We have added the indicator on the set F to eliminate the configurations ( σ , ω ) that are not well defined in the variational objective. We will use the convention that 0 log ( 0 ) = 0 .

6.3. VI Updates

The ELBO that corresponds to the variational family (24), written in terms of $x_1 = q_1(\sigma_1 = 1)$, $x_2 = q_2(\sigma_2 = 1)$ and $y = q_0(\omega = 0)$, is
$$\begin{aligned} \mathrm{ELBO}(x_1, x_2, y, p) ={}& x_1 x_2 y \log(x_1 x_2 y) - x_1 x_2 y \log(1 - p) \\ &+ (1 - x_1) x_2 y \log((1 - x_1) x_2 y) - (1 - x_1) x_2 y \log(1 - p) \\ &+ x_1 (1 - x_2) y \log(x_1 (1 - x_2) y) - x_1 (1 - x_2) y \log(1 - p) \\ &+ (1 - x_1)(1 - x_2) y \log((1 - x_1)(1 - x_2) y) - (1 - x_1)(1 - x_2) y \log(1 - p) \\ &+ x_1 x_2 (1 - y) \log(x_1 x_2 (1 - y)) - x_1 x_2 (1 - y) \log(p) \\ &+ (1 - x_1)(1 - x_2)(1 - y) \log((1 - x_1)(1 - x_2)(1 - y)) - (1 - x_1)(1 - x_2)(1 - y) \log(p). \end{aligned}$$
Taking the derivative with respect to x 1 and simplifying gives us
$$\begin{aligned} \mathrm{ELBO}_1(x_1, x_2, y, p) ={}& y \log\frac{x_1}{1 - x_1} + (1 - y) \log\frac{1}{1 - x_1} + x_2 (1 - y) \log\big(x_1(1 - x_1)\big) \\ &+ x_2 (1 - y) \log\frac{x_2 (1 - x_2)(1 - y)^2}{p^2} + (1 - y) \log\frac{p}{(1 - x_2)(1 - y)} + (2x_2 - 1)(1 - y). \end{aligned}$$
Taking the derivative with respect to x 2 and simplifying gives us
$$\begin{aligned} \mathrm{ELBO}_2(x_1, x_2, y, p) ={}& y \log\frac{x_2}{1 - x_2} + (1 - y) \log\frac{1}{1 - x_2} + x_1 (1 - y) \log\big(x_2(1 - x_2)\big) \\ &+ x_1 (1 - y) \log\frac{x_1 (1 - x_1)(1 - y)^2}{p^2} + (1 - y) \log\frac{p}{(1 - x_1)(1 - y)} + (2x_1 - 1)(1 - y). \end{aligned}$$
Taking the derivative with respect to y and simplifying gives us
$$\begin{aligned} \mathrm{ELBO}_y(x_1, x_2, y, p) ={}& x_1 x_2 \log\frac{y}{1 - y} + x_1 x_2 \log\frac{p}{1 - p} + (1 - x_1)(1 - x_2) \log\frac{y}{1 - y} + (1 - x_1)(1 - x_2) \log\frac{p}{1 - p} \\ &+ (1 - x_1) x_2 \log\frac{(1 - x_1) x_2 y}{1 - p} + x_1 (1 - x_2) \log\frac{x_1 (1 - x_2) y}{1 - p} + (1 - x_1) x_2 + x_1 (1 - x_2). \end{aligned}$$
The absence of closed-form updates for any of the variables limits our ability to study the convergence of the system with classical dynamical systems techniques. Instead, we look at the long-run behavior of the system by plotting 100 iterations of the CAVI updates, which are generated from the following system:
$$\begin{aligned} x_1^{(t+1)} &= \operatorname*{arg\,min}_{z \in (0,1)} \big| \mathrm{ELBO}_1(z, x_2^{(t)}, y^{(t)}, p) \big|, \\ x_2^{(t+1)} &= \operatorname*{arg\,min}_{z \in (0,1)} \big| \mathrm{ELBO}_2(x_1^{(t+1)}, z, y^{(t)}, p) \big|, \\ y^{(t+1)} &= \operatorname*{arg\,min}_{z \in (0,1)} \big| \mathrm{ELBO}_y(x_1^{(t+1)}, x_2^{(t+1)}, z, p) \big|. \end{aligned}$$
We generate the argmin over the free variable $z$ by a line search with step size $\Delta = 10^{-6}$. Running these simulations, we find that the iterates $x_1^{(t)}, x_2^{(t)}, y^{(t)}$ converge to a global solution within about $T = 20$ time steps from any initialization $x_1^{(0)}, x_2^{(0)}, y^{(0)} \in (0, 1)$ and any $\beta > 0$. It is evident that, using the ES coupling, we get global convergence of the algorithm outside of the Dobrushin regime of the corresponding Ising model. The figures depicting the simulation results on convergence of the variational inference algorithm for the Edward–Sokal coupling can be found in Figure 13, Figure 14, Figure 15 and Figure 16.
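For completeness, the following sketch (our own) implements the coordinate updates (25) with the derivative expressions above; to keep the run fast it uses a coarser line-search step than the paper's $\Delta = 10^{-6}$, and the initialization values are arbitrary.

```python
import numpy as np

def elbo1(x1, x2, y, p):
    """Derivative of the ELBO with respect to x1 (display above)."""
    return (y * np.log(x1 / (1 - x1))
            + (1 - y) * np.log(1 / (1 - x1))
            + x2 * (1 - y) * np.log(x1 * (1 - x1))
            + x2 * (1 - y) * np.log(x2 * (1 - x2) * (1 - y) ** 2 / p ** 2)
            + (1 - y) * np.log(p / ((1 - x2) * (1 - y)))
            + (2 * x2 - 1) * (1 - y))

def elbo2(x1, x2, y, p):
    return elbo1(x2, x1, y, p)  # the expression is symmetric in (x1, x2)

def elboy(x1, x2, y, p):
    """Derivative of the ELBO with respect to y (display above)."""
    return (x1 * x2 * (np.log(y / (1 - y)) + np.log(p / (1 - p)))
            + (1 - x1) * (1 - x2) * (np.log(y / (1 - y)) + np.log(p / (1 - p)))
            + (1 - x1) * x2 * np.log((1 - x1) * x2 * y / (1 - p))
            + x1 * (1 - x2) * np.log(x1 * (1 - x2) * y / (1 - p))
            + (1 - x1) * x2 + x1 * (1 - x2))

def es_cavi(beta, x1=0.9, x2=0.1, y=0.5, n_iter=100, step=1e-4):
    """Coordinate updates (25) via a crude grid line search over (0, 1)."""
    p = 1 - np.exp(-beta)
    grid = np.arange(step, 1.0, step)
    for _ in range(n_iter):
        x1 = grid[np.argmin(np.abs(elbo1(grid, x2, y, p)))]
        x2 = grid[np.argmin(np.abs(elbo2(x1, grid, y, p)))]
        y = grid[np.argmin(np.abs(elboy(x1, x2, grid, p)))]
    return x1, x2, y

print(es_cavi(beta=2.4))  # iterates settle quickly (cf. the discussion above)
```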

7. Conclusions

This paper demonstrates the use of classical dynamical systems and bifurcation theory to study the convergence properties of the CAVI algorithm for the Ising model on two nodes. In this simple setting we are able to provide the complete dynamical behavior of the algorithm. Interestingly, we find that the sequential CAVI algorithm and the parallelized CAVI algorithm are not topologically conjugate, owing to the presence of periodic behavior in the parallelized CAVI. This behavior originates from the added flexibility of the initialization in the parallelized CAVI when compared to the sequential CAVI. The erratic behavior we see in the Ising model for $|\beta| > 1$ is due to a combination of the existence of multiple fixed points of the system's update function and the instability of these fixed points. In this parameter regime, the fixed point that produces the optimal solution $(0.5, 0.5)$ is a repelling fixed point. Unless we initialize the algorithm exactly at $(0.5, 0.5)$, the CAVI system cannot converge to this point. The other two suboptimal fixed points are both asymptotically stable. This suggests that the main problem the CAVI algorithm experiences is centered around the existence of multiple fixed points. Recent work on stochastic block models (SBMs) and topic models (TMs) shows that mean field VI can lead to suboptimal estimators [17,18,19,20]. It is not clear if this property comes from mean field variational inference's construction using product distributions or if it is a consequence of structure among the latent variables. A minor difference of the SBM or TM from the Ising model is that the former contain parameters (e.g., the cluster labels) that are identifiable only up to permutations. That being said, in the SBM or TM, if the cluster means are not well-separated, then it is not possible to identify the labels even up to permutations. This is somewhat related to having multiple fixed points of the objective function, and we conjecture that behavior similar to what we have found in the Ising model will be exhibited in the SBM or TM outside the Dobrushin regime. Interestingly, a close look at the BCAVI updates in [17,18] reveals a similar sigmoid update function $1/(1 + e^{-x})$. Applying the tools and techniques from dynamical systems theory to study the CAVI algorithm in the SBM, TM and other models will provide a better understanding of the issues that come with using mean field variational inference, and is important to developing better variational inference techniques.
Most of the research into the theoretical properties of variational inference has focused on the mean field family due to its computational simplicity. This computational simplicity comes at the cost of limited expressive power. Can we make do with this limited expressive power in practical applications? More specifically, is there an equivalent of the Dobrushin regime ($-1 \leq \beta \leq 1$) for similar models like the SBM and TM inside which CAVI produces statistically optimal estimators? The answer to this question would provide researchers with stable parameter regimes for the model. The non-existence of such a region would indicate the need for more expressive variational methods beyond mean field. Recent work [19,20] suggests that adding some structure to these algorithms may fix the problems that arise from mean field VI. How much structure is needed to recover statistically optimal estimators? Could adding a simple structure of pair-wise dependence to the mean field VI in the Ising model, similarly to [19], be enough to recover the optimal estimator outside of the Dobrushin regime? Is the amount of additional structure that is needed somehow related to the latent structure of the models? Tools from dynamical systems theory can be used to study these questions.
Using dynamical systems to study the convergence properties of the CAVI algorithm is not without its challenges. While dynamical systems theory can provide answers to many of the above questions, applying these tools to higher dimensional sequential systems is a challenging problem. As mentioned previously, the general theory for $n$-dimensional discrete dynamical systems depends on writing the evolution function in the form $x_{n+1} = F(x_n)$. Deriving this $F$ is typically not possible for densely connected, higher dimensional sequential systems like the $n$-dimensional Ising model CAVI. This is not the only challenging aspect of the problem. These systems typically possess multiple fixed points which can only be found numerically, and multiple fixed points lead to more complicated partitions of the space into domains of attraction. Furthermore, higher dimensional systems can possess bifurcations of multiple codimensions, which are significantly more difficult to study. Bifurcations of codimension 3 are so exotic that they are not well studied [23,24], and software to handle such calculations has only recently been developed [24]. In practical terms this means that the convergence properties can only be studied numerically for models with a small number of parameters. Furthermore, most of the numerical techniques work under the assumption of differentiability of the evolution operator and fail to be applicable to many systems of practical interest in statistics, such as the Edward–Sokal CAVI. Applying tools from dynamical systems to the study of variational inference algorithms will require developing new theory for high dimensional, well-connected sequential dynamical systems.

Author Contributions

Conceptualization, S.P., D.P. and A.B.; formal analysis, S.P.; supervision, D.P. and A.B.; writing—original draft, S.P., D.P. and A.B. All authors have read and agreed to the published version of the manuscript.

Funding

Pati and Bhattacharya acknowledge support from NSF DMS (1854731, 1916371) and NSF CCF 1934904 (HDR-TRIPODS). In addition, Bhattacharya acknowledges the NSF CAREER 1653404 award for supporting this project.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. An Overview of One Dimensional Dynamical Systems

The main focus of discrete dynamical systems is the asymptotic behavior of iterated systems such as (8). Bifurcation theory studies how the dynamical behavior of a system changes as a parameter, here $J_{12}$, changes. We study the convergence behavior of the CAVI algorithm by studying the autonomous discrete time dynamical system formed by the update Equation (8). This allows us to utilize tools from dynamical systems theory to study the behavior of the algorithm with respect to its parameters. In this section we provide a brief overview of the dynamical systems and bifurcation theory in dimension 1 used in Section 5.

Appendix A.1. Notation

Our focus will be on parametric dynamical systems defined by a function $f : \mathbb{R}^n \times \mathbb{R}^p \to \mathbb{R}^n$. We call elements $x \in \mathbb{R}^n$ points of the state space (phase space) and elements $\alpha \in \mathbb{R}^p$ parameters. We denote real numbers $x \in \mathbb{R}$ and real vectors $\mathbf{x} = (x_1, \ldots, x_n) \in \mathbb{R}^n$ in bold. We denote the inverse of an invertible function $f$ by $f^{-1}$. The $k$-fold composition of a function $f$ with itself at a point $(x, \alpha)$ is denoted $f^k(x, \alpha)$, and the $k$-fold composition of the inverse function $f^{-1}$ is denoted $f^{-k}$. The identity function is denoted $\mathrm{id}$, with the convention $f^0 = \mathrm{id}$. We denote the tensors of derivatives of $f$ by $f_x(x, \alpha) = (\partial f_i / \partial x_j)$, $f_{xx}(x, \alpha) = (\partial^2 f_i / \partial x_j \partial x_k)$, $f_{xxx}(x, \alpha) = (\partial^3 f_i / \partial x_j \partial x_k \partial x_\ell)$ and $f_\alpha(x, \alpha) = (\partial f_i / \partial \alpha_j)$.
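As a minimal illustration of this notation, the $k$-fold composition $f^k(x)$ can be computed by repeated application of $f$; a short sketch:

```python
def iterate(f, x, k):
    """Compute the k-fold composition f^k(x) by applying f repeatedly."""
    for _ in range(k):
        x = f(x)
    return x

# Example: f(x) = x/2 gives f^3(8) = 1.
print(iterate(lambda x: x / 2, 8.0, 3))  # 1.0
```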

Appendix A.2. Dynamical Systems

Dynamical systems theory is a classical approach to studying the convergence properties of non-linear iterative systems. These systems can be continuous in time, for example differential equations, or discrete in time, for example iterations of a function from an initial point. A dynamical system is called autonomous if the function governing the system is independent of time, and non-autonomous otherwise. The coordinate ascent variational inference algorithm for the Ising model is a discrete-time autonomous dynamical system. Before giving a complete proof of the dynamical properties of the CAVI algorithm for the Ising model in dimension 2, we first give a basic introduction to the theory of discrete time dynamical systems and bifurcations, following [23,24,25,38].
Formally, a dynamical system is a triple $\{T, X, \phi^t\}$ where $T$ is a time set, $X$ is the state space and $\phi^t : X \to X$ is a family of evolution operators parameterized by $t \in T$ satisfying $\phi^0 = \mathrm{id}$ and $\phi^{s+t} = \phi^t \circ \phi^s$ for all $s, t \in T$. For a discrete time system the evolution operator is fully specified by the one-step map $\phi^1 = f$, since the composition rule then gives $\phi^k = f^k$ for $k \in \mathbb{Z}$. We restrict the further discussion to discrete time dynamical systems defined by the one-step map
$$x \mapsto f(x, \alpha), \qquad x \in \mathbb{R}^n, \ \alpha \in \mathbb{R}^p, \tag{A1}$$
where $f$ is a diffeomorphism (a smooth function with smooth inverse) of the state space $\mathbb{R}^n$ and $\alpha$ are the parameters of the system.
The basic geometric objects of a dynamical system are its orbits in the state space and its phase portrait. The orbit starting at a point $x$ is the ordered subset of the state space $\mathbb{R}^n$ given by $\mathrm{orb}(x) = \{f^k(x) : k \in \mathbb{Z}\}$, and the phase portrait is the partition of the state space induced by the orbits. There are two special types of orbits, fixed points and cycles, defined below.
A fixed point $x^*$ of the system is a point that remains fixed under the evolution of the system, that is, one satisfying $x^* = f(x^*)$. We can classify fixed points by studying the local behavior of the system near them, considering small perturbations of the system near the fixed point. A fixed point $x^*$ is said to be locally stable if points near the fixed point do not move too far away from it as the system evolves: formally, if for any $\varepsilon > 0$ there exists $\delta > 0$ such that for all $x$ with $|x - x^*| < \delta$ we have $|f^k(x) - x^*| < \varepsilon$ for all $k > 0$. A fixed point is called semi-stable from the right if for any $\varepsilon > 0$ there exists $\delta > 0$ such that for all $x$ with $0 < x - x^* < \delta$ we have $|f^k(x) - x^*| < \varepsilon$ for all $k > 0$ (semi-stable from the left is defined analogously). It is said to be locally unstable otherwise. A fixed point $x^*$ is locally attracting if all points in a small neighborhood converge to the fixed point as the system evolves: formally, if there exists $\eta > 0$ such that $|x - x^*| < \eta$ implies $f^n(x) \to x^*$ as $n \to \infty$. A fixed point $x^*$ is locally asymptotically stable if it is both locally stable and attracting. A fixed point $x^*$ is locally semi-asymptotically stable from the right if it is both locally semi-stable from the right and $\lim_{n \to \infty} f^n(x) = x^*$ for all $0 < x - x^* < \eta$ for some $\eta > 0$. It is globally asymptotically stable if it is attracting for all $x$ in the state space.
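A toy numerical illustration of local attraction (our own example, not part of the analysis above): the map $f(x) = 0.5x + x^2$ has $|f_x(0)| = 0.5 < 1$, and orbits started near $x^* = 0$ converge to it.

```python
# x* = 0 is locally attracting for f(x) = 0.5*x + x**2: |f_x(0)| = 0.5 < 1.
f = lambda x: 0.5 * x + x ** 2
for x0 in (-0.1, 0.05, 0.1):
    x = x0
    for _ in range(100):
        x = f(x)
    print(x0, x)  # every orbit started within eta = 0.1 settles at 0
```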
A cycle is a periodic orbit of distinct points $C = \{x_0, x_1, \ldots, x_{K-1}\}$ with $x_{k+1} = f(x_k)$ and $x_0 = f(x_{K-1})$ for some $K > 0$. The minimal $K$ generating the cycle is called the period of the cycle. A subset $S \subset \mathbb{R}^n$ is called invariant if $f^k(S) \subset S$ for all $k \in \mathbb{Z}$. An invariant set $S$ is called asymptotically stable if there exists a neighborhood $U$ of $S$ such that every point in $U$ is eventually inside the set $S$. The stable set of $S \subset \mathbb{R}^n$ is $W^s(S) = \{x \in \mathbb{R}^n : \lim_{k \to \infty} f^k(x) \in S\}$. If $f$ is invertible, we define the unstable set of $S \subset \mathbb{R}^n$ as $W^u(S) = \{x \in \mathbb{R}^n : \lim_{k \to \infty} f^{-k}(x) \in S\}$. The unstable set of $S$ for the forward system $f^k$, $k > 0$, is the stable set of $S$ for the backward system $f^{-k}$, $k > 0$; it is thus possible to study the behavior of points that diverge by studying points that converge under the inverse map. We can also classify the stability of $K$-cycles: the stability of a cycle is classified as that of a fixed point of the map $f^K$.
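As an example of classifying a cycle through $f^K$, the following sketch checks the stability of the logistic-map 2-cycle that reappears in Appendix A.3, via the chain-rule multiplier of $f^2$:

```python
# Stability of a 2-cycle via the multiplier of f^2: by the chain rule,
# (f^2)_x(x1) = f_x(x1) * f_x(x2). Example: the logistic-map 2-cycle
# at mu = 3.1 used in Appendix A.3, where f_x(x) = mu*(1 - 2x).
mu = 3.1
x1, x2 = 0.5580141, 0.7645665
f_x = lambda x: mu * (1.0 - 2.0 * x)
multiplier = f_x(x1) * f_x(x2)
print(multiplier)  # ~0.59, modulus < 1, so the 2-cycle is attracting
```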
Consider a discrete time dynamical system defined by a diffeomorphism $f : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$. Let $x^*$ be a fixed point of $f(x, \alpha)$ and consider a nearby point $x$ with $|x - x^*| = \epsilon$. Taking a Taylor expansion of the system about the fixed point gives us
$$f(x, \alpha) - x^* = f_x(x^*, \alpha)(x - x^*) + \tfrac{1}{2} f_{xx}(x^*, \alpha)(x - x^*)^2 + O(|x - x^*|^3).$$
If the Jacobian does not have modulus one and $\epsilon$ is small enough, then the contribution of the terms of order $O(|x - x^*|^2)$ is negligible, in which case the behavior of the system is governed by the behavior of its linearization $f_x(x^*, \alpha)$. We now introduce the idea of a hyperbolic fixed point. Assume that the Jacobian $A := f_x(x^*, \alpha)$ of the system (A1) at a fixed point $x^*$ is non-singular. The fixed point $x^*$ is called hyperbolic if $|f_x(x^*, \alpha)| \neq 1$ and non-hyperbolic if $|f_x(x^*, \alpha)| = 1$. The notion of hyperbolic and non-hyperbolic fixed points generalizes to higher dimensions, where it involves the eigenvalues of the Jacobian; see [23,25,38] for more details.
Near a hyperbolic fixed point a non-linear dynamical system behaves like its first-order Taylor approximation (also known as the linearization of the system). To make this argument rigorous we need to discuss what it means for two dynamical systems to be equivalent. Two systems are topologically equivalent if we can map orbits of one system to orbits of the other in a continuous way that preserves the order of time. The dynamical system (A1) is called topologically equivalent to the system
$$y \mapsto g(y, \beta), \qquad y \in \mathbb{R}^n, \ \beta \in \mathbb{R}^p, \tag{A2}$$
if there exists a homeomorphism of the parameter space $h_p : \mathbb{R}^p \to \mathbb{R}^p$, $\beta = h_p(\alpha)$, and a parameter-dependent homeomorphism of the state space $h : \mathbb{R}^n \times \mathbb{R}^p \to \mathbb{R}^n$, continuous in the first argument, with $y = h(x, \alpha)$, mapping orbits of the system (A1) at parameter value $\alpha$ onto orbits of the system (A2) at parameter value $\beta = h_p(\alpha)$ while preserving the direction of time. If $h$ is a diffeomorphism then the systems are called smoothly equivalent.
Let (A1) and (A2) be two topologically equivalent invertible dynamical systems. Consider the orbit of the first system under the mapping $f(x, \alpha)$, $\mathrm{orb}(x; f, \alpha)$, and the orbit of the second system under $g(y, \beta)$, $\mathrm{orb}(y; g, \beta)$. Topological equivalence means that the pair of homeomorphisms $(h(x, \alpha), h_p(\alpha))$ maps $\mathrm{orb}(x; f, \alpha)$ onto $\mathrm{orb}(y; g, \beta)$ preserving the order of time. This gives us the following commutative diagram:
$$\begin{array}{ccc} x & \xrightarrow{\;f(\cdot,\,\alpha)\;} & f(x, \alpha) \\ \big\downarrow h & & \big\downarrow h \\ y & \xrightarrow{\;g(\cdot,\,\beta)\;} & g(y, \beta) \end{array}$$
Commutativity of the diagram means that computing the orbit of $x$ directly under $f(\cdot, \alpha)$ produces the same orbit as first mapping $x$ to $y = h(x, \alpha)$, computing the orbit of $y$ under $g(\cdot, \beta)$, and mapping back by $h^{-1}$; that is, $f(x, \alpha) = (h^{-1} \circ g \circ h)(x, \alpha)$. We shall primarily be interested in the behavior of the system in a small neighborhood of an equilibrium point. A system (A1) is called locally topologically equivalent near an equilibrium $x^*$ to a system (A2) near an equilibrium $y^*$ if there exists a homeomorphism $h : \mathbb{R}^n \to \mathbb{R}^n$ defined in a small neighborhood $U$ of $x^*$ with $y^* = h(x^*)$ that maps orbits of (A1) in $U$ onto orbits of (A2) in $V = h(U)$, preserving the direction of time.
We now have enough terminology to introduce the following theorem, which shows that the dynamics of a smooth system in the neighborhood of a hyperbolic fixed point are equivalent to the dynamics of the linearization of the system,
Theorem A1 
(Grobman–Hartman). Consider a smooth map
$$x \mapsto Ax + F(x), \qquad x \in \mathbb{R}^n, \tag{A3}$$
where $A$ is an $n \times n$ matrix and $F(x) = O(\|x\|^2)$. If $x^* = 0$ is a hyperbolic fixed point of (A3), then (A3) is topologically equivalent near this point to its linearization
$$x \mapsto Ax, \qquad x \in \mathbb{R}^n.$$
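A toy numerical illustration of Theorem A1 (our own example): near a hyperbolic fixed point, the orbits of a nonlinear map and of its linearization display the same qualitative behavior.

```python
# Near the hyperbolic fixed point 0, the nonlinear map x -> A*x + F(x)
# with F(x) = x**2 = O(x^2) behaves like its linearization x -> A*x
# (here A = 0.5, so both orbits contract toward 0).
A = 0.5
F = lambda x: x ** 2
x_nonlinear, x_linear = 0.1, 0.1
for _ in range(10):
    x_nonlinear = A * x_nonlinear + F(x_nonlinear)
    x_linear = A * x_linear
print(x_nonlinear, x_linear)  # both ~1e-4: the same local contraction
```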
Note that Theorem A1 holds for a general $n$-dimensional system. It provides sufficient conditions to determine the stability of a hyperbolic fixed point of a general discrete time system,
Theorem A2. 
Consider a discrete time dynamical system (A1) where $f$ is a smooth map, and let $x^*$ be a fixed point. If the eigenvalues $\lambda$ of the Jacobian $f_x(x^*, \alpha)$ all satisfy $|\lambda| < 1$, then the fixed point is stable. If instead the eigenvalues of the Jacobian $f_x(x^*, \alpha)$ all satisfy $|\lambda| > 1$, then the fixed point is unstable.
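The following sketch applies Theorem A2 numerically to the logistic map discussed below, estimating the multiplier by finite differences (analytically $f_x(x^*, \mu) = 2 - \mu$ at the nontrivial fixed point):

```python
def multiplier(f, x_star, alpha, h=1e-6):
    """Central-difference estimate of the multiplier f_x at a fixed point."""
    return (f(x_star + h, alpha) - f(x_star - h, alpha)) / (2.0 * h)

logistic = lambda x, mu: mu * x * (1.0 - x)
for mu in (2.9, 3.1):
    x_star = (mu - 1.0) / mu          # nontrivial fixed point of the logistic map
    lam = multiplier(logistic, x_star, mu)
    print(mu, lam, "stable" if abs(lam) < 1 else "unstable")
# mu=2.9: lambda ~ -0.9 (stable); mu=3.1: lambda ~ -1.1 (unstable).
```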
The linearization of the system near a non-hyperbolic fixed point is not sufficient to determine the stability of the fixed point, and we need to investigate higher order terms. The following theorem provides sufficient conditions to check the stability of a smooth one dimensional system at a non-hyperbolic fixed point,
 Theorem A3. 
Let $f : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$. Suppose that $f(\cdot, \alpha) \in C^3(\mathbb{R}; \mathbb{R})$ and $x^*$ is a non-hyperbolic fixed point of $f$, $x^* = f(x^*, \alpha)$. We have the following cases:
Case 1: If $f_x(x^*, \alpha) = 1$, then
1. if $f_{xx}(x^*, \alpha) \neq 0$, then $x^*$ is semi-asymptotically stable from the left if $f_{xx}(x^*, \alpha) > 0$ and semi-asymptotically stable from the right if $f_{xx}(x^*, \alpha) < 0$;
2. if $f_{xx}(x^*, \alpha) = 0$ and $f_{xxx}(x^*, \alpha) < 0$, then $x^*$ is asymptotically stable;
3. if $f_{xx}(x^*, \alpha) = 0$ and $f_{xxx}(x^*, \alpha) > 0$, then $x^*$ is unstable.
Case 2: If $f_x(x^*, \alpha) = -1$, then
1. if $Sf(x^*, \alpha) < 0$, then $x^*$ is asymptotically stable;
2. if $Sf(x^*, \alpha) > 0$, then $x^*$ is unstable;
where $Sf(x, \alpha)$ is the Schwarzian derivative of $f$,
$$Sf(x, \alpha) = \frac{f_{xxx}(x, \alpha)}{f_x(x, \alpha)} - \frac{3}{2}\left(\frac{f_{xx}(x, \alpha)}{f_x(x, \alpha)}\right)^2.$$
The Schwarzian derivative controls the higher order behavior in oscillatory systems.
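As a worked instance of Case 2, the Schwarzian test at the logistic map's non-hyperbolic fixed point $(x^*, \mu^*) = (2/3, 3)$, used again in Appendix A.3, can be carried out directly:

```python
# Schwarzian test for f(x, mu) = mu*x*(1-x) at (x*, mu*) = (2/3, 3),
# where f_x(x*, 3) = 2 - mu = -1. Derivatives:
# f_x = mu*(1 - 2x), f_xx = -2*mu, f_xxx = 0.
mu, x = 3.0, 2.0 / 3.0
f_x, f_xx, f_xxx = mu * (1.0 - 2.0 * x), -2.0 * mu, 0.0
Sf = f_xxx / f_x - 1.5 * (f_xx / f_x) ** 2
print(Sf)  # -54.0 < 0, so x* = 2/3 is asymptotically stable at mu = 3
```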

Appendix A.3. Codimension 1 Bifurcations

Until now we have kept the parameter of the system fixed. The study of the change in behavior of a dynamical system as the parameters are varied is called bifurcation theory. A bifurcation occurs when the dynamics of the system at a parameter value $\alpha_1$ differ from the dynamics at a different parameter value $\alpha_2$. Changing the parameter in a system may cause a stable fixed point to become unstable, the fixed point may split into multiple fixed points, or a new orbit may form. Each of these is an example of a bifurcation, although they are not the only possibilities. The point at which a bifurcation occurs is called a bifurcation point. More formally, the parameter $\alpha^*$ is called a bifurcation point if arbitrarily close to it there is $\alpha$ such that $x \mapsto f(x, \alpha)$, $x \in \mathbb{R}^n$, is not topologically equivalent to $x \mapsto f(x, \alpha^*)$, $x \in \mathbb{R}^n$, in some domain $U \subset \mathbb{R}^n$.
A necessary, but not sufficient, condition for a bifurcation of a fixed point to occur is that the fixed point be non-hyperbolic. Theorem A1 together with the implicit function theorem shows that in a sufficiently small neighborhood of a hyperbolic fixed point $(x^*, \alpha^*)$, for each nearby $\alpha$ there is a unique fixed point with the same stability properties as $(x^*, \alpha^*)$. So hyperbolic fixed points do not undergo local bifurcations. In the context of discrete systems, a local bifurcation can occur at a fixed point $(x^*, \alpha^*)$ only when the Jacobian of the system at $(x^*, \alpha^*)$ has an eigenvalue with modulus one.
Perhaps surprisingly, there are only three types of generic bifurcations that can happen in a discrete system with one parameter: the limit point (LP), period doubling (PD) and Neimark–Sacker (NS) bifurcations. The reason for this is fairly simple. It turns out that for each of these bifurcations there is a generic system, called the topological normal form, which undergoes the bifurcation at the origin in the $(x, \alpha)$-plane. For any other system that undergoes the same bifurcation and satisfies certain non-degeneracy conditions, there is a local change of coordinates that transforms the system into the topological normal form.
In general the types of bifurcations that can occur are connected to the number of parameters in the system. The minimal number of parameters that must be changed in order for a particular bifurcation to occur in $f(x, \alpha)$ is called the codimension of the bifurcation. A bifurcation is called local if it can be detected in any small neighborhood of the fixed point; otherwise it is called global. Global bifurcations are much harder to analyze, and since we do not attempt to investigate them in this paper we will not expand upon them further. More detailed results on bifurcations in codimension 1 and 2 can be found in [23,24].
We will now formally define the sufficient conditions for a system to undergo a period doubling or a pitchfork bifurcation. The period doubling bifurcation occurs when a system with a non-hyperbolic fixed point with multiplier $\lambda_1 = -1$ satisfies certain non-degeneracy conditions. There are two types of PD bifurcations. In the super-critical case, a stable 2-cycle is generated when a fixed point becomes unstable. In the sub-critical case, a stable fixed point turns unstable when it coalesces with an unstable 2-cycle. (This holds for a general $k$-cycle: in the super-critical case, a stable $2k$-cycle is generated when a $k$-cycle becomes unstable; in the sub-critical case, a stable $k$-cycle turns unstable when it coalesces with an unstable $2k$-cycle.) The conditions for a PD bifurcation to occur are given as follows.
 Theorem A4 
(Period Doubling Bifurcation). Suppose that a one-dimensional system
$$x \mapsto f(x, \alpha), \qquad x, \alpha \in \mathbb{R},$$
with smooth $f$ has at $\alpha = 0$ the fixed point $x^* = 0$, and let $\lambda = f_x(0, 0) = -1$. Assume the following non-degeneracy conditions are satisfied:
1. $\tfrac{1}{2}(f_{xx}(0, 0))^2 + \tfrac{1}{3} f_{xxx}(0, 0) \neq 0$;
2. $f_{x\alpha}(0, 0) \neq 0$.
Then there are smooth invertible coordinate and parameter changes transforming the system into
$$\eta \mapsto -(1 + \beta)\eta \pm \eta^3 + O(\eta^4).$$
A classical example of a period doubling bifurcation can be seen in the logistic map $f(x, \mu) = \mu x(1 - x)$ for $x \in [0, 1]$. The bifurcation occurs at the point $(x^*, \mu^*) = (2/3, 3)$. The logistic map has two fixed points, one at $x = 0$ and the other at $x = (\mu - 1)/\mu$. We ignore the fixed point at $x = 0$ since it is repelling for $\mu > 1$. We look at the behavior of the system in a small neighborhood of $\mu^* = 3$. For $\mu = 2.9$, the fixed point $x^* = (\mu - 1)/\mu$ is a hyperbolic attracting fixed point since $|f_x(x^*, 2.9)| = |2 - \mu| < 1$. For $\mu = 3$ the fixed point $x^* = (\mu - 1)/\mu$ is non-hyperbolic since $f_x(x^*, 3) = 2 - \mu = -1$. Checking the Schwarzian derivative shows that the fixed point is asymptotically stable. For $\mu = 3.1$, $x^* = (\mu - 1)/\mu$ becomes a repelling fixed point, and the points in $(0, x^*) \cup (x^*, 1)$ converge to the attracting 2-cycle $C = \{0.558014, 0.7645665\}$. A super-critical period doubling bifurcation has occurred in the system formed by the logistic map: as the parameter $\mu$ increases through $\mu^* = 3$, the stable fixed point loses stability and a stable 2-cycle forms.
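This transition is easy to reproduce numerically; a short sketch iterating the map and printing the post-transient iterates:

```python
def logistic_tail(mu, x0=0.2, n_transient=500, n_keep=4):
    """Iterate f(x) = mu*x*(1-x), discard a transient, return the next few iterates."""
    x = x0
    for _ in range(n_transient):
        x = mu * x * (1.0 - x)
    tail = []
    for _ in range(n_keep):
        x = mu * x * (1.0 - x)
        tail.append(round(x, 7))
    return tail

print(logistic_tail(2.9))  # settles at the fixed point (mu-1)/mu ~ 0.6551724
print(logistic_tail(3.1))  # alternates ~0.5580141, ~0.7645665: the 2-cycle
```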
Figure A1. Cobweb diagrams for the logistic map $f(x, \mu) = \mu x(1 - x)$, $x \in [0, 1]$, with parameters $\mu = 2.9$, $\mu = 3$ and $\mu = 3.1$, respectively. For $\mu = 2.9$ the system has one stable fixed point $x^* = (\mu - 1)/\mu$. For $\mu = 3$, the system has one non-hyperbolic fixed point $x^* = (\mu - 1)/\mu$ which is asymptotically stable and attracting; the plot was not iterated long enough to see convergence. For $\mu = 3.1$, the system has a hyperbolic repelling fixed point $x^* = (\mu - 1)/\mu$ and an asymptotically stable attracting 2-cycle $C = \{0.558014, 0.7645665\}$.
The second iterate of a map that undergoes a PD bifurcation undergoes a bifurcation known as the pitchfork bifurcation. A system undergoes a super-critical pitchfork bifurcation when a stable fixed point becomes unstable and two stable fixed points appear. A system undergoes a sub-critical pitchfork bifurcation when two stable fixed points coalesce with an unstable fixed point, which becomes stable as the parameter crosses the bifurcation point. Below we present further details on the period doubling bifurcation and its relation to the pitchfork bifurcation.
Consider the one-dimensional system
$$x \mapsto -(1 + \alpha)x + x^3 = f(x, \alpha).$$
The map $f(x, \alpha)$ is invertible in a small neighborhood of $(0, 0)$. The system has a fixed point at $x^* = 0$ for all $\alpha$, with eigenvalue $-(1 + \alpha)$. For small $\alpha < 0$ the fixed point is hyperbolic stable, and for $\alpha > 0$ it is hyperbolic unstable. For $\alpha = 0$ the fixed point is non-hyperbolic, but asymptotically stable.
Consider the second iterate of $f(x, \alpha)$:
$$f^2(x, \alpha) = -(1 + \alpha) f(x, \alpha) + (f(x, \alpha))^3 = (1 + \alpha)^2 x - (1 + \alpha)(2 + 2\alpha + \alpha^2) x^3 + O(x^5).$$
The second iterate has a trivial fixed point at $x^* = 0$, and for $\alpha > 0$ it has two non-trivial stable fixed points $x_1 = -(\sqrt{\alpha} + O(\alpha))$ and $x_2 = \sqrt{\alpha} + O(\alpha)$ that form a 2-cycle of $f$:
$$x_2 = f(x_1, \alpha), \qquad x_1 = f(x_2, \alpha).$$
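For this particular normal form the 2-cycle can be verified directly: if $x^2 = \alpha$ then $x^3 = \alpha x$ and $f(x, \alpha) = -x$, so $\{\sqrt{\alpha}, -\sqrt{\alpha}\}$ is exactly a 2-cycle. A one-line numerical check:

```python
import numpy as np

# For f(x, a) = -(1 + a)*x + x**3, points with x**2 = a satisfy f(x) = -x,
# so {+sqrt(a), -sqrt(a)} is exactly a 2-cycle. Check with a = 0.01:
a = 0.01
f = lambda x: -(1.0 + a) * x + x ** 3
x1 = np.sqrt(a)
print(f(x1), f(f(x1)))  # ~ -0.1 and 0.1: x1 -> -x1 -> x1
```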
The conditions for a generic pitchfork bifurcation can be found in [25].
Theorem A5
(Pitchfork Bifurcation). A system
$$x \mapsto f(x, \alpha), \qquad x, \alpha \in \mathbb{R},$$
having a non-hyperbolic fixed point at $x^* = 0$, $\alpha^* = 0$ with $f_x(0, 0) = 1$ undergoes a pitchfork bifurcation at $(x^*, \alpha^*) = (0, 0)$ if
$$f_\alpha(0, 0) = 0, \quad f_{xx}(0, 0) = 0, \quad f_{xxx}(0, 0) \neq 0, \quad f_{x\alpha}(0, 0) \neq 0.$$
The pitchfork bifurcation is super-critical if $f_{xxx}(x^*, \alpha^*) f_{\alpha x}(x^*, \alpha^*) > 0$ and sub-critical if $f_{xxx}(x^*, \alpha^*) f_{\alpha x}(x^*, \alpha^*) < 0$.
An example of a pitchfork bifurcation can be seen in the second iterate of the logistic map, $f^2(x, \mu) = \mu^2 x(1 - x)(1 - \mu x(1 - x))$ for $x \in [0, 1]$. The bifurcation occurs at the point $(x^*, \mu^*) = (2/3, 3)$. For $\mu \leq 3$, the second iterate of the logistic map has the same fixed points as the first iterate: one at $x = 0$ and the other at $x = (\mu - 1)/\mu$. We ignore the fixed point at $x = 0$ since it is repelling for $\mu > 1$. We look at the behavior of the system in a small neighborhood of $\mu^* = 3$. For $\mu = 2.9$, the fixed point $x^* = (\mu - 1)/\mu$ is a hyperbolic attracting fixed point since $|f^2_x(x^*, 2.9)| < 1$. For $\mu = 3$ the fixed point $x^* = (\mu - 1)/\mu$ is non-hyperbolic since $f^2_x(x^*, 3) = (2 - \mu)^2 = 1$. Checking the higher order derivatives shows that the fixed point is asymptotically stable. For $\mu = 3.1$, $x^* = (\mu - 1)/\mu$ becomes a repelling fixed point. Using numerical methods we find two additional fixed points, $x_1 = 0.558014$ and $x_2 = 0.7645665$, both of which are attracting. A super-critical pitchfork bifurcation has occurred in the system formed by the second iterate of the logistic map: as the parameter $\mu$ increases through $\mu^* = 3$, the stable fixed point becomes unstable and two stable fixed points appear.
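The two additional fixed points can be located numerically with a simple sign-change scan of $f^2(x, \mu) - x$; a sketch:

```python
import numpy as np

# Locate the fixed points of the second iterate f^2 at mu = 3.1 by
# scanning f^2(x) - x for sign changes on a fine grid.
f = lambda x, mu: mu * x * (1.0 - x)
f2 = lambda x, mu: f(f(x, mu), mu)

mu = 3.1
grid = np.linspace(0.01, 0.99, 20001)
g = f2(grid, mu) - grid
sign_change = np.sign(g[:-1]) != np.sign(g[1:])   # brackets each root
print(np.round(grid[:-1][sign_change], 4))
# ~[0.558, 0.6774, 0.7646]: the two new attracting fixed points and the
# repelling fixed point (mu - 1)/mu ~ 0.6774 between them.
```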
Figure A2. Cobweb diagrams for the second iterate of the logistic map $f(x, \mu) = \mu x(1 - x)$, $x \in [0, 1]$, with parameters $\mu = 2.9$ and $\mu = 3.1$, respectively. For $\mu = 2.9$ the system has one stable fixed point $x^* = (\mu - 1)/\mu$. For $\mu = 3.1$, the system has a hyperbolic repelling fixed point $x^* = (\mu - 1)/\mu$ and two asymptotically stable attracting fixed points $x_1 = 0.558014$ and $x_2 = 0.7645665$.

References

1. Bishop, C. Pattern Recognition and Machine Learning; Information Science and Statistics; Springer: Berlin/Heidelberg, Germany, 2006.
2. MacKay, D.J.C. Information Theory, Inference and Learning Algorithms; Cambridge University Press: Cambridge, UK, 2003.
3. Blei, D.M.; Kucukelbir, A.; McAuliffe, J.D. Variational Inference: A Review for Statisticians. J. Am. Stat. Assoc. 2017, 112, 859–877.
4. Zhang, C.; Bütepage, J.; Kjellström, H.; Mandt, S. Advances in variational inference. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 2008–2026.
5. Parisi, G. Statistical Field Theory; Frontiers in Physics; Addison-Wesley: Boston, MA, USA, 1988.
6. Opper, M.; Saad, D. Advanced Mean Field Methods: Theory and Practice; MIT Press: Cambridge, MA, USA, 2001.
7. Gabrié, M. Mean-field inference methods for neural networks. J. Phys. A Math. Theor. 2020, 53, 223002.
8. Alquier, P.; Ridgway, J.; Chopin, N. On the properties of variational approximations of Gibbs posteriors. J. Mach. Learn. Res. 2016, 17, 1–41.
9. Pati, D.; Bhattacharya, A.; Yang, Y. On statistical optimality of variational Bayes. In Proceedings of the International Conference on Artificial Intelligence and Statistics, Canary Islands, Spain, 9–11 April 2018; pp. 1579–1588.
10. Yang, Y.; Pati, D.; Bhattacharya, A. α-Variational inference with statistical guarantees. Ann. Stat. 2020, 48, 886–905.
11. Chérief-Abdellatif, B.E.; Alquier, P. Consistency of variational Bayes inference for estimation and model selection in mixtures. Electron. J. Stat. 2018, 12, 2995–3035.
12. Wang, Y.; Blei, D.M. Frequentist consistency of variational Bayes. J. Am. Stat. Assoc. 2019, 114, 1147–1161.
13. Wang, Y.; Blei, D.M. Variational Bayes under Model Misspecification. arXiv 2019, arXiv:1905.10859.
14. Wang, B.; Titterington, D. Inadequacy of Interval Estimates Corresponding to Variational Bayesian Approximations; AISTATS; Citeseer: Princeton, NJ, USA, 2005.
15. Wang, B.; Titterington, D. Convergence properties of a general algorithm for calculating variational Bayesian estimates for a normal mixture model. Bayesian Anal. 2006, 1, 625–650.
16. Zhang, A.Y.; Zhou, H.H. Theoretical and Computational Guarantees of Mean Field Variational Inference for Community Detection. arXiv 2017, arXiv:1710.11268.
17. Mukherjee, S.S.; Sarkar, P.; Wang, Y.R.; Yan, B. Mean field for the stochastic blockmodel: Optimization landscape and convergence issues. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2018; pp. 10694–10704.
18. Sarkar, P.; Wang, Y.; Mukherjee, S.S. When random initializations help: A study of variational inference for community detection. arXiv 2019, arXiv:1905.06661.
19. Yin, M.; Wang, Y.X.R.; Sarkar, P. A Theoretical Case Study of Structured Variational Inference for Community Detection. Proc. Mach. Learn. Res. 2020, 108, 3750–3761.
20. Ghorbani, B.; Javadi, H.; Montanari, A. An Instability in Variational Inference for Topic Models. arXiv 2018, arXiv:1802.00568.
21. Jain, V.; Koehler, F.; Mossel, E. The Mean-Field Approximation: Information Inequalities, Algorithms, and Complexity. arXiv 2018, arXiv:1802.06126.
22. Koehler, F. Fast Convergence of Belief Propagation to Global Optima: Beyond Correlation Decay. arXiv 2019, arXiv:1905.09992.
23. Kuznetsov, Y. Elements of Applied Bifurcation Theory; Applied Mathematical Sciences; Springer: New York, NY, USA, 2008.
24. Kuznetsov, Y.; Meijer, H. Numerical Bifurcation Analysis of Maps; Cambridge Monographs on Applied and Computational Mathematics; Cambridge University Press: Cambridge, UK, 2019.
25. Wiggins, S. Introduction to Applied Nonlinear Dynamical Systems and Chaos; Texts in Applied Mathematics; Springer: New York, NY, USA, 2003.
26. Friedli, S.; Velenik, Y. Statistical Mechanics of Lattice Systems: A Concrete Mathematical Introduction; Cambridge University Press: Cambridge, UK, 2017.
27. Ising, E. Beitrag zur Theorie des Ferromagnetismus. Zeitschrift für Physik 1925, 31, 253–258.
28. Onsager, L. Crystal Statistics. I. A Two-Dimensional Model with an Order-Disorder Transition. Phys. Rev. 1944, 65, 117–149.
29. Toda, M.; Saito, N.; Kubo, R. Statistical Physics I: Equilibrium Statistical Mechanics; Springer Series in Solid-State Sciences; Springer: Berlin/Heidelberg, Germany, 2012.
30. Moessner, R.; Ramirez, A.P. Geometrical frustration. Phys. Today 2006, 59, 24.
31. Basak, A.; Mukherjee, S. Universality of the mean-field for the Potts model. Probab. Theory Relat. Fields 2017, 168, 557–600.
32. Blanca, A.; Chen, Z.; Vigoda, E. Swendsen–Wang dynamics for general graphs in the tree uniqueness region. Random Struct. Algorithms 2019, 56, 373–400.
33. Guo, H.; Jerrum, M. Random cluster dynamics for the Ising model is rapidly mixing. Ann. Appl. Probab. 2018, 28, 1292–1313.
34. Oostwal, E.; Straat, M.; Biehl, M. Hidden Unit Specialization in Layered Neural Networks: ReLU vs. Sigmoidal Activation. arXiv 2019, arXiv:1910.07476.
35. Çakmak, B.; Opper, M. A Dynamical Mean-Field Theory for Learning in Restricted Boltzmann Machines. arXiv 2020, arXiv:2005.01560.
36. Blum, E.; Wang, X. Stability of fixed points and periodic orbits and bifurcations in analog neural networks. Neural Netw. 1992, 5, 577–587.
37. Grimmett, G. The Random-Cluster Model; Grundlehren der Mathematischen Wissenschaften; Springer: Berlin/Heidelberg, Germany, 2006.
38. Elaydi, S. Discrete Chaos: With Applications in Science and Engineering; CRC Press: New York, NY, USA, 2007.
Figure 1. A contour plot of the ELBO as a function of $x_1$ and $x_2$ for $\beta = 0.7$ (left) and $\beta = 1.2$ (right), together with the optimal update functions for $x_1$ (orange) and $x_2$ (blue) given in Equation (8). For $\beta = 0.7$ the ELBO is a convex function with exactly one optimum, the global maximum, at $(0.5, 0.5)$. For $\beta = 1.2$ the ELBO is a nonconvex function with three optima, at $(0.5, 0.5)$, $(0.17071, 0.17071)$ and $(0.82928, 0.82928)$.
Figure 2. A contour plot of the ELBO as a function of $x_1$ and $x_2$ for $\beta = 0.7$ (left) and $\beta = 1.2$ (right), together with the optimal update functions for $x_1$ (orange) and $x_2$ (blue) given in Equation (8). For $\beta = 0.7$ the ELBO is a convex function with exactly one optimum, the global maximum, at $(0.5, 0.5)$. For $\beta = 1.2$ the ELBO is a nonconvex function with three optima, at $(0.5, 0.5)$, $(0.17071, 0.82928)$ and $(0.82928, 0.17071)$.
Figure 3. The first 20 iterations of the CAVI algorithm at various initializations for $\beta = 1.2$. In each plot the $\zeta$ updates are black and the $\xi$ updates are red. The upper left plot is an initialization of $\zeta_0 = 0.3$ and $\xi_0 = 0.3$; $\zeta_k$ converges to the local fixed point $c_1(1.2) = 0.82928$ and $\xi_k$ converges to the local fixed point $c_0(1.2) = 0.17071$. The upper right is an initialization of $\zeta_0 = 0.3$ and $\xi_0 = 0.7$; $\zeta_k$ converges to $c_0(1.2) = 0.17071$ and $\xi_k$ converges to $c_1(1.2) = 0.82928$. The lower left is an initialization of $\zeta_0 = 0.7$ and $\xi_0 = 0.3$; $\zeta_k$ converges to $c_1(1.2) = 0.82928$ and $\xi_k$ converges to $c_0(1.2) = 0.17071$. The lower right plot is an initialization of $\zeta_0 = 0.7$ and $\xi_0 = 0.7$; $\zeta_k$ converges to $c_0(1.2) = 0.17071$ and $\xi_k$ converges to $c_1(1.2) = 0.82928$.
Figure 4. The first 20 iterations of the CAVI algorithm at various initializations for $\beta = 0.7$. In each plot the $\zeta$ updates are black and the $\xi$ updates are red. The upper left plot is an initialization of $\zeta_0 = 0.3$ and $\xi_0 = 0.3$; both converge to the global fixed point $1/2$. The upper right is an initialization of $\zeta_0 = 0.3$ and $\xi_0 = 0.7$; this initialization converges to the global fixed point $1/2$. The lower left is an initialization of $\zeta_0 = 0.7$ and $\xi_0 = 0.3$; this initialization converges to the global fixed point $1/2$. The lower right plot is an initialization of $\zeta_0 = 0.7$ and $\xi_0 = 0.7$; both converge to the global fixed point $1/2$.
Figure 5. The first 20 iterations of the CAVI algorithm at various initializations for $\beta = 0.7$. In each plot the $\zeta$ updates are black and the $\xi$ updates are red. The upper left plot is an initialization of $\zeta_0 = 0.3$ and $\xi_0 = 0.3$; both converge to the global fixed point $1/2$. The upper right is an initialization of $\zeta_0 = 0.3$ and $\xi_0 = 0.7$; this initialization converges to the global fixed point $1/2$. The lower left is an initialization of $\zeta_0 = 0.7$ and $\xi_0 = 0.3$; this initialization converges to the global fixed point $1/2$. The lower right plot is an initialization of $\zeta_0 = 0.7$ and $\xi_0 = 0.7$; both converge to the global fixed point $1/2$.
Figure 6. The first 20 iterations of the CAVI algorithm at various initializations for $\beta = 1.2$. In each plot the $\zeta$ updates are black and the $\xi$ updates are red. The upper left plot is an initialization of $\zeta_0 = 0.3$ and $\xi_0 = 0.3$; both converge to the local fixed point $c_0(1.2) = 0.17071$. The upper right is an initialization of $\zeta_0 = 0.3$ and $\xi_0 = 0.7$; this initialization converges to the local fixed point $c_1(1.2) = 0.82928$. The lower left is an initialization of $\zeta_0 = 0.7$ and $\xi_0 = 0.3$; this initialization converges to the local fixed point $c_0(1.2) = 0.17071$. The lower right plot is an initialization of $\zeta_0 = 0.7$ and $\xi_0 = 0.7$; both converge to the local fixed point $c_1(1.2) = 0.82928$.
Figure 7. The first 20 iterations of the parallel update CAVI algorithm at various initializations for $\beta = 1.2$. In each plot the $\zeta$ updates are black and the $\xi$ updates are red. The upper left is an initialization of $\zeta_0 = 0.3$ and $\xi_0 = 0.7$; this initialization converges to the two-cycle $C_0 = \{(c_0, c_0), (c_1, c_1)\}$. The upper right plot is an initialization of $\zeta_0 = 0.3$ and $\xi_0 = 0.3$; both converge to $c_0(1.2) \approx 0.17071$. The lower left plot is an initialization of $\zeta_0 = 0.7$ and $\xi_0 = 0.7$; both converge to $c_1(1.2) \approx 0.82928$. The lower right is an initialization of $\zeta_0 = 0.7$ and $\xi_0 = 0.3$; this initialization converges to the two-cycle $C_0 = \{(c_0, c_0), (c_1, c_1)\}$.
Figure 8. The first 20 iterations of the parallel update CAVI algorithm at various initializations for $\beta = 0.7$. In each plot the $\zeta$ updates are black and the $\xi$ updates are red. The upper left plot is an initialization of $\zeta_0 = 0.3$ and $\xi_0 = 0.3$; both converge to the global fixed point $1/2$. The upper right is an initialization of $\zeta_0 = 0.3$ and $\xi_0 = 0.7$; this initialization converges to the global fixed point $1/2$. The lower left is an initialization of $\zeta_0 = 0.7$ and $\xi_0 = 0.3$; this initialization converges to the global fixed point $1/2$. The lower right plot is an initialization of $\zeta_0 = 0.7$ and $\xi_0 = 0.7$; both converge to the global fixed point $1/2$.
Figure 9. The first 20 iterations of the parallel update CAVI algorithm at various initializations for $\beta = 0.7$. In each plot the $\zeta$ updates are black and the $\xi$ updates are red. The upper left plot is an initialization of $\zeta_0 = 0.3$ and $\xi_0 = 0.3$; both converge to the global fixed point $1/2$. The upper right is an initialization of $\zeta_0 = 0.3$ and $\xi_0 = 0.7$; this initialization converges to the global fixed point $1/2$. The lower left is an initialization of $\zeta_0 = 0.7$ and $\xi_0 = 0.3$; this initialization converges to the global fixed point $1/2$. The lower right plot is an initialization of $\zeta_0 = 0.7$ and $\xi_0 = 0.7$; both converge to the global fixed point $1/2$.
Figure 10. The first 20 iterations of the parallel update CAVI algorithm at various initializations for $\beta = 1.2$. In each plot the $\zeta$ updates are black and the $\xi$ updates are red. The upper left plot is an initialization of $\zeta_0 = 0.3$ and $\xi_0 = 0.3$; both converge to $c_0(1.2) \approx 0.17071$. The upper right is an initialization of $\zeta_0 = 0.3$ and $\xi_0 = 0.7$; this initialization converges to the two-cycle $C_1 = \{(c_1, c_0), (c_0, c_1)\}$. The lower left is an initialization of $\zeta_0 = 0.7$ and $\xi_0 = 0.3$; this initialization converges to the two-cycle $C_1 = \{(c_1, c_0), (c_0, c_1)\}$. The lower right plot is an initialization of $\zeta_0 = 0.7$ and $\xi_0 = 0.7$; both converge to $c_1(1.2) \approx 0.82928$.
Figure 11. The first 20 iterations of the parallel update CAVI algorithm at various initializations for $\beta = 1.2$. In each plot the $\zeta$ updates are black and the $\xi$ updates are red. The upper left plot is an initialization of $\zeta_0 = 0.3$ and $\xi_0 = 0.5$; this converges to the two-cycle $C_2 = \{(c_0, 1/2), (1/2, c_1)\}$. The upper right is an initialization of $\zeta_0 = 0.5$ and $\xi_0 = 0.3$; this initialization converges to the two-cycle $C_3 = \{(c_1, 1/2), (1/2, c_0)\}$. The lower left is an initialization of $\zeta_0 = 0.7$ and $\xi_0 = 0.5$; this initialization converges to the two-cycle $C_3$. The lower right plot is an initialization of $\zeta_0 = 0.5$ and $\xi_0 = 0.7$; this converges to the two-cycle $C_2$.
Figure 12. The first 20 iterations of the parallel update CAVI algorithm at various initializations for $\beta = 1.2$. In each plot the $\zeta$ updates are black and the $\xi$ updates are red. The upper left plot is an initialization of $\zeta_0 = 0.3$ and $\xi_0 = 0.5$; this converges to the two-cycle $C_4 = \{(c_0, 1/2), (1/2, c_0)\}$. The upper right is an initialization of $\zeta_0 = 0.5$ and $\xi_0 = 0.3$; this initialization converges to the two-cycle $C_4$. The lower left is an initialization of $\zeta_0 = 0.7$ and $\xi_0 = 0.5$; this initialization converges to the two-cycle $C_5 = \{(c_1, 1/2), (1/2, c_1)\}$. The lower right plot is an initialization of $\zeta_0 = 0.5$ and $\xi_0 = 0.7$; this converges to the two-cycle $C_5$.
Figure 13. A plot of 20 iterations of the ES updates for $p = 1 - e^{-5}$ from a uniformly random initialization. Each line represents a different parameter: the solid line is $x_1$, the dashed line is $x_2$ and the dotted line is $y$. We see convergence to a unique fixed point for each of the variables.
Figure 14. A plot of the ELBO of the ES coupling for $p = 1 - e^{-5}$. The red line denotes the global minimum ELBO value.
Figure 15. A plot of 20 iterations of the ES updates for $p = 1 - e^{-0.1}$ from a uniformly random initialization. Each line represents a different parameter: the solid line is $x_1$, the dashed line is $x_2$ and the dotted line is $y$. We see convergence to a unique fixed point.
Figure 16. A plot of the ELBO of the ES coupling for $p = 1 - e^{-0.1}$. The red line denotes the global minimum ELBO value.
Table 1. Partial derivatives of (11) and (12) at the fixed point x* = 1/2 for parameter values β = ±1. The derivatives of the function (11) are denoted using σ and the derivatives of (12) using σ².

         σ_x   σ_xx   σ_xxx   σ_β   σ_βx   σ²_x   σ²_xx   σ²_xxx   σ²_β   σ²_βx
β = 1     1     0     −8      0     1/2     1      0      −16       0      1
β = −1   −1     0      8      0     1/2     1      0      −16       0     −1