Variational Bayes for Regime-Switching Log-Normal Models

The power of projection using divergence functions is a major theme in information geometry. One version of this is the variational Bayes (VB) method. This paper looks at VB in the context of other projection-based methods in information geometry. It also describes how to apply VB to the regime-switching log-normal model and how it provides a computationally fast solution for quantifying the uncertainty in the model specification. The results show that the method can recover the model structure exactly, gives reasonable point estimates and is computationally very efficient. Potential problems of the method in quantifying parameter uncertainty are discussed.


Introduction
While, in principle, the calculation of the posterior distribution is mathematically straightforward, in practice, the computation of many of its features, such as posterior densities, normalizing constants and posterior moments, is a major challenge in Bayesian analysis. Such computations typically involve high dimensional integrals, which often have no analytical or tractable forms. The variational Bayes (VB) method was developed to generate tractable approximations to these quantities. This method provides analytic approximations to the posterior distribution by minimizing the Kullback-Leibler (KL) divergence from the approximation to the actual posterior and has been demonstrated to be computationally very fast.
VB gains its computational advantages by making simplifying assumptions about the posterior dependence structure. For example, in its simplest form, it assumes posterior independence between selected sets of parameters. Under these assumptions, the resulting approximate posterior is either known analytically or can be computed by a simple iterative algorithm similar to the Expectation-Maximization (EM) algorithm. In this paper, we show that, as well as having advantages of computational speed, the VB algorithm does an excellent job of model selection, in particular in finding the appropriate number of regimes.
While the simplification of the dependence structure gives computational advantages, it also comes at a cost. For example, we found that the posterior variance may be underestimated. In [1], we propose a novel method to compute the true posterior covariance matrix using only the information obtained from the VB approximations.
The use of projections onto particular families is, of course, not new to information geometry (IG). In [2], we find the famous Pythagorean results concerning projection using α-divergences onto α-families, and other important results on projections based on divergences can be found in [3] and [4] (Chapter 7).

Variational Bayes
Suppose, in a Bayesian inference problem, that we use q(τ) to approximate the posterior p(τ|y), where y is the data and τ = {τ_1, ..., τ_p} the model parameter vector. The KL divergence between them is defined as:

KL[q(τ) || p(τ|y)] = ∫ q(τ) log [q(τ)/p(τ|y)] dτ,   (1)

provided the integral exists. We want to balance two things: having the discrepancy between p and q small, while keeping q tractable. Hence, we seek the q(τ) that minimizes Equation (1), while keeping q(τ) in an analytically tractable form. First, note that the evaluation of Equation (1) requires p(τ|y), which may be unavailable, since in the general Bayesian problem, its normalizing constant is one of the main intractable integrals. However, we note that:

log p(y) = ∫ q(τ) log [p(τ, y)/q(τ)] dτ + KL[q(τ) || p(τ|y)].   (2)

Thus, minimizing Equation (1) is equivalent to maximizing the first term on the right-hand side of Equation (2). The key computational point is that, often, the term p(τ, y) is available even when the full posterior p(τ, y)/∫ p(τ, y) dτ is not.
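On a finite parameter space, where the integrals become sums, the decomposition above can be verified numerically. The following sketch uses arbitrary illustrative numbers for the joint p(τ, y), not anything from this paper:

```python
import numpy as np

# Toy check of the identity log p(y) = F(q) + KL[q || p(tau|y)] on a discrete
# parameter space, so all integrals become sums.
joint = np.array([0.10, 0.25, 0.05, 0.20])   # p(tau, y) for 4 parameter values
evidence = joint.sum()                        # p(y), the normalizing constant
posterior = joint / evidence                  # p(tau | y)

q = np.array([0.4, 0.3, 0.2, 0.1])            # any approximating distribution

kl = np.sum(q * np.log(q / posterior))        # KL[q || p(tau|y)], Equation (1)
F = np.sum(q * np.log(joint / q))             # free energy; uses only p(tau, y)

assert np.isclose(F + kl, np.log(evidence))   # the decomposition, Equation (2)
```

Since KL ≥ 0, the first term F(q) is a lower bound on log p(y), which is why maximizing it over q minimizes the divergence.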
Definition 1. Let F(q) = ∫ q(τ) log [p(τ, y)/q(τ)] dτ and:

q* = argmax_{q ∈ Q} F(q),   (3)

where Q is a predetermined set of probability density functions over the parameter space. Then, q* is called the variational approximation or variational posterior distribution, and functions of q* (such as the mean, variance, etc.) are called variational parameters.
Some of the power of Definition 1 comes when we assume that all elements of Q have tractable posteriors. In that case, all variational parameters will also be tractable when the optimization can be achieved. A prime example of a choice for Q is the set of all densities that factorize as q(τ) = ∏_{i=1}^p q_i(τ_i). This reduces the computational problem from computing one high dimensional integral to computing a number of one-dimensional ones. Furthermore, as we see in the example of this paper, it is often the case that the variational families are standard exponential families (since they are often 'maximum entropy models' in some sense), and the optimisation problem (3) can be solved by simple iterative methods with very fast convergence.
The core of the method builds on the principle of variational free energy minimization in physics, which is concerned with finding the maxima and minima of a functional over a class of functions, and the method gains its name from this root. Early developments of the method can be found in machine learning, especially in applications to neural networks [5,6]. The method has been successfully applied in many different disciplines and domains, for example, in independent component analysis [7,8], graphical models [9,10], information retrieval [11] and factor analysis [12].
In the statistical literature, an early application of the variational principle can be found in the work of [13] to construct Bayes estimators. In recent years, the method has received more attention from both the applied and theoretical perspectives, for example [14-18].

Regime-Switching Models
In this paper, we illustrate the strengths and weaknesses of VB through a detailed case study. In particular, we look at a model used in finance, risk management and actuarial science, the so-called regime-switching log-normal (RSLN) model proposed, in this context, by [19].
Switching between different states, or regimes, is a common phenomenon in many time series, and regime-switching models, originally proposed by [20], have been used to model these switching processes. As demonstrated in [21], maximum likelihood estimation (MLE) does not give a simple method to deal with parameter uncertainty, and the asymptotic normality of maximum likelihood estimators may not apply for sample sizes commonly found in practice. Hence, to understand parameter uncertainty, [21] considered the RSLN model in a Bayesian framework using the Metropolis-Hastings algorithm. Furthermore, model uncertainty, in particular selecting the correct number of regimes, is a major issue. Hence, model selection criteria have to be used to choose the "best" model. Hardy [19] found that a two-regime RSLN model maximized the Bayes information criterion (BIC) [22] for both monthly TSE 300 total return data and S&P 500 total return data; however, according to the Akaike information criterion (AIC) [23], a three-regime model was optimal for the S&P data. To account for the model uncertainty associated with the number of regimes, [24] offered a trans-dimensional model using reversible jump MCMC [25]. We note that the BIC is not necessarily ideal for model selection with state space models [26], although it is still commonly used in the literature.
MCMC methods make possible the computation of all posterior quantities; however, there are a number of practical issues associated with their implementation. A primary concern is determining whether the generated chain has, in fact, "converged". In practice, MCMC practitioners have to resort to convergence diagnostic techniques. Furthermore, the computational cost can be a concern. Other implementation issues include the difficulty of making good initialisation choices and deciding whether to run the MCMC algorithm as one long chain or as several shorter chains in parallel. Detailed discussions can be found in [27].
One of the main contributions of this paper is to apply the variational Bayes (VB) method to the RSLN model and present a solution for quantifying the uncertainty in the model specification. The VB method provides analytical approximations to posterior quantities and, in practice, is demonstrated to be a much faster alternative to MCMC methods.

Variational Bayes and Informational Geometry
In this section, we explore the relationship between VB and IG, in particular the statistical properties of divergence-based projections onto exponential families. Here, we use the IG of [2], in particular the ±1 dual affine parameters for exponential families. One of the most striking results from [2] is the Pythagorean property of these dual affine coordinate systems. This is illustrated in Figure 1, which shows a schematic representing a model space containing the distribution f_0(x) and an exponential family f(x; θ). The Pythagorean result comes from using the KL divergence to project onto the exponential family:

θ* = argmin_θ KL[f_0(x) || f(x; θ)].

All distributions that project to the same point form a −1-flat space, defined by all distributions f(x) with the same mean of the sufficient statistic, i.e., E_f[s(x)] = E_{f(x;θ*)}[s(x)], and, further, it is Fisher orthogonal to the +1-flat family f(x; θ). The statistical interpretation of this concerns the behaviour of a model f(x; θ) when the data generation process does not lie in the model.
In contrast to this, we have the VB method, which uses the reverse KL divergence for the projection, i.e.,

θ* = argmin_θ KL[f(x; θ) || f_0(x)].

This also results in a Fisher orthogonal projection, shown in Figure 1, but now along a +1-flat family. This does not have the property that the mean of s(x) is constant but, as we shall see, it does have nice computational properties when used in the context of Bayesian analysis.
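The contrast between the two projection directions can be made concrete in a Gaussian example (not from the paper): when a correlated bivariate normal is projected onto the family of independent normals, the forward KL minimizer matches the true marginal variances, while the reverse (VB) direction gives factor variances 1/Λ_ii, the inverse of the diagonal of the precision matrix, a standard closed-form result:

```python
import numpy as np

# Projecting a correlated bivariate normal onto independent normals under the
# two KL directions. Forward KL matches the true marginals (moment matching);
# reverse KL, the VB direction, yields variances 1/Lambda_ii and therefore
# underestimates the marginal variances whenever rho != 0.
rho = 0.8
Sigma = np.array([[1.0, rho], [rho, 1.0]])     # true covariance
Lambda = np.linalg.inv(Sigma)                  # precision matrix

forward_var = np.diag(Sigma)                   # forward KL: [1, 1]
reverse_var = 1.0 / np.diag(Lambda)            # reverse KL: 1 - rho**2 each

print(forward_var, reverse_var)                # reverse is approximately 0.36
```

This is the same variance-underestimation effect reported for VB later in the paper.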
In order to investigate the information geometry of VB, we consider two examples. The first, in Section 3.1, is selected to maximally illustrate the underlying geometric issues and to get some understanding of the quality of the VB approximation. The second, in Section 3.2, shows an important real-world application from actuarial science and is illustrated with simulated and real data.

Geometric Foundation
We consider the simplest model that shows dependence. Let X_1, X_2 be two binary random variables, with distribution π := (π_00, π_10, π_01, π_11), where P(X_1 = i, X_2 = j) = π_ij, i, j ∈ {0, 1}. Further, let the marginal distributions be denoted by π_1 = P(X_1 = 1) and π_2 = P(X_2 = 1). We want to consider the geometry of the VB projection from a general distribution to the family of independent distributions. This represents the way that VB gains its computational advantages by simplifying the posterior dependence structure.
The model space is illustrated in Figure 2, where π is represented by a point in the three simplex, and the independence surface, where π_00 π_11 = π_10 π_01, is also shown. Both the interior of the simplex and the independence surface are exponential families, and it is convenient to use the natural parameters for the interior of the simplex:

ξ_1 = log(π_10/π_00), ξ_2 = log(π_01/π_00), ξ_3 = log(π_00 π_11/(π_10 π_01)),

where the independence surface is given by ξ_3 = 0. The independence surface can also be parameterised by the marginal distributions π_1, π_2 or the corresponding natural parameters ξ_i^ind := log(π_i/(1 − π_i)). Any distribution π, represented in natural parameters by (ξ_1, ξ_2, ξ_3), has its VB approximation defined implicitly by the simultaneous equations:

ξ_1^ind = ξ_1 + π_2 ξ_3,   (4)
ξ_2^ind = ξ_2 + π_1 ξ_3.   (5)

These can be solved, as is typical with VB methods, by iterating updated estimates of π_1 and π_2 across the two equations. We show this in a realistic example in the following section.
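The iteration across the two simultaneous equations can be sketched directly. The joint distribution below is an arbitrary dependent example for illustration; each update sets the log-odds of one factor to the expectation, under the other factor, of the conditional log-odds of π:

```python
import numpy as np

# VB projection of a 2x2 joint distribution onto the independence surface,
# by iterating the two fixed-point (mean-field) equations.
pi = np.array([[0.30, 0.10],    # pi[i, j] = P(X1 = i, X2 = j)
               [0.15, 0.45]])

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

pi1, pi2 = 0.5, 0.5             # initial marginals of q
for _ in range(100):
    # log q1(1)/q1(0) = E_{q2}[log pi(1, X2) - log pi(0, X2)]
    xi1 = (1 - pi2) * np.log(pi[1, 0] / pi[0, 0]) + pi2 * np.log(pi[1, 1] / pi[0, 1])
    pi1 = sigmoid(xi1)
    xi2 = (1 - pi1) * np.log(pi[0, 1] / pi[0, 0]) + pi1 * np.log(pi[1, 1] / pi[1, 0])
    pi2 = sigmoid(xi2)

print(pi1, pi2)   # VB marginals; note they differ from the true marginals
                  # pi[1, :].sum() = 0.60 and pi[:, 1].sum() = 0.55
```

At convergence the pair (pi1, pi2) satisfies both fixed-point equations simultaneously, which is the VB projection of π onto the independence surface.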
Having seen the VB solution in this simple model, we can investigate the quality of the approximation. If we were using the forward KL projection, as proposed by [2], then the mean would be preserved by the approximation, while, of course, the variance structure would be distorted. In the case of the reverse KL projection, as used by VB, the mean is not preserved, but in this example, we can investigate the distortion explicitly. Let (ξ_1(α), ξ_2(α), ξ_3(α)) be a +1-geodesic that cuts the independence surface orthogonally and is parameterised by α, where α = 0 corresponds to the independence surface. In this example, all such geodesics can be computed explicitly. Figure 3 shows the distortion associated with the VB approximation. In the left-hand panel, we show the mean, which is the marginal probability P(X_1 = 1), for all points on the orthogonal geodesic. We see, as expected, that this is not constant, but it is locally constant at α = 0, showing that the distortion of the mean can be small near the independence surface. The right-hand panel shows the dependence, as measured by the log-odds, for points on the geodesic. As expected, VB does not preserve the dependence structure; indeed, it is designed to exploit the simplification of the dependence structure.

Variational Bayes for the RSLN Model
The regime-switching log-normal model [19] with a fixed finite number, K, of regimes can be described as a bivariate discrete time process with the observed data sequence w_1:T = {w_t}_{t=1}^T and the unobserved regime sequence S_1:T = {S_t}_{t=1}^T, where S_t ∈ {1, ..., K} and T is the number of observations. The logarithm of w_t, denoted by y_t = log w_t, is assumed to be normally distributed, with mean μ_i and variance σ_i² both dependent on the hidden regime S_t. The sequence S_1:T is assumed to follow a first order Markov chain with transition probabilities A = (a_ij) and probabilities π = (π_i)_{i=1}^K of starting in each regime. The RSLN model is a special case of more general state-space models, which were studied in detail by [28]. In this paper, we use this model, with simulated and real data, to illustrate the VB method in practice. We also calibrate its performance by referring to [24], which used MCMC methods to fit the same model to the same data. Here, we regard the MCMC analysis as a form of "gold standard", but one that is orders of magnitude slower than VB in computational time.
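To make the data-generating process concrete, the following sketch simulates an RSLN path. The parameter values here are illustrative only, not the paper's fitted estimates:

```python
import numpy as np

# Simulating a K-regime RSLN path: a Markov chain S_t and log-returns
# y_t = log(w_t) drawn from N(mu[S_t], sigma[S_t]**2).
rng = np.random.default_rng(0)

K, T = 2, 671
pi0 = np.array([0.9, 0.1])                    # initial regime probabilities
A = np.array([[0.96, 0.04],                   # transition matrix a_ij
              [0.21, 0.79]])
mu = np.array([0.012, -0.016])                # regime means of y_t
sigma = np.array([0.035, 0.078])              # regime standard deviations

s = np.empty(T, dtype=int)
s[0] = rng.choice(K, p=pi0)
for t in range(1, T):
    s[t] = rng.choice(K, p=A[s[t - 1]])       # first order Markov dynamics

y = rng.normal(mu[s], sigma[s])               # observed log-returns
w = np.exp(y)                                 # the return index itself
```

The inference problem is then to recover K, A, π and the regime parameters from w_1:T alone, with S_1:T unobserved.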
In the Bayesian framework, we use a symmetric Dirichlet prior for π, that is, π ~ Dirichlet(C_π, ..., C_π). Let a_i denote the i-th row vector of A. The prior for A is chosen as a_i ~ Dirichlet(C_A, ..., C_A), independently for each row. For the regime parameters, we use the conjugate prior μ_i | σ_i² ~ N(γ, σ_i²/η²), with an inverse gamma prior on σ_i² having shape α and scale β. In the above setting, C_π, C_A, γ, η², α and β are hyper-parameters. Thus, the joint posterior distribution P(π, A, {μ_i, σ_i²}_{i=1}^K, S_1:T | y_1:T) is proportional to:

p(π) p(A) [∏_{i=1}^K p(μ_i, σ_i²)] π_{S_1} [∏_{t=2}^T a_{S_{t−1} S_t}] [∏_{t=1}^T φ(y_t; μ_{S_t}, σ²_{S_t})].   (6)

This posterior distribution and its corresponding marginal posterior distributions are analytically intractable. In VB, we seek an approximation of Equation (6), denoted by q(π, A, {μ_i, σ_i²}_{i=1}^K, S_1:T), for which we want to balance two things: having the discrepancy between Equation (6) and q small, while keeping q tractable. In general, there are two ways to choose q. The first is to specify a particular distributional family for q, for example, the multivariate normal distribution. The other is to choose q with a simpler dependency structure than that of Equation (6); for example, we choose q, which factorizes as:

q(π, A, {μ_i, σ_i²}_{i=1}^K, S_1:T) = q(π) [∏_{i=1}^K q(a_i)] [∏_{i=1}^K q(μ_i | σ_i²) q(σ_i²)] q(S_1:T).   (7)

The Kullback-Leibler (KL) divergence [29] can be used as the measure of dissimilarity between Equations (6) and (7). For succinctness, we denote τ = (π, A, {μ_i, σ_i²}_{i=1}^K, S_1:T); thus, the KL divergence is defined as:

KL[q(τ) || p(τ | y_1:T)] = ∫ q(τ) log [q(τ)/p(τ | y_1:T)] dτ.   (8)

Note that the evaluation of Equation (8) requires p(τ | y_1:T), which is unavailable. However, as in Equation (2), minimizing Equation (8) is equivalent to maximizing F(q) = ∫ q(τ) log [p(τ, y_1:T)/q(τ)] dτ. Given the factorization Equation (7), this can be written as a sum of terms, each involving a single factor of q. Consider first the q(π) term. The right-hand side can be rearranged as:

KL[q(τ) || p(τ | y_1:T)] = KL[q(π) || q̃(π)] + terms not depending on q(π),   (9)

where:

log q̃(π) = E_{q(τ∖π)}[log p(τ, y_1:T)] − log Z_π,

and Z_π is a normalizing term. The first term of Equation (9) is the only term that depends on q(π). Thus, the minimum value of KL[q(τ) || p(τ | y)] with respect to q(π) is achieved when this term equals zero. Hence, we obtain:

q(π) = q̃(π) ∝ exp{E_{q(τ∖π)}[log p(τ, y_1:T)]}.   (10)

Given the joint distribution p(π, A, {μ_i, σ_i²}_{i=1}^K, s_1:T, y_1:T) in the form of Equation (6), straightforward evaluation of Equation (10) results in:

q(π) = Dirichlet(C_π + E_{q(S_1:T)}[S_{1,1}], ..., C_π + E_{q(S_1:T)}[S_{1,K}]),

where S_{1,i} = 1, if the process is in state i at time 1, and zero otherwise. Similarly, we can rearrange Equation (9) with respect to {q(a_i)}_{i=1}^K, {q(μ_i | σ_i²)}_{i=1}^K, {q(σ_i²)}_{i=1}^K and q(S_1:T), respectively, and, using the same arguments, obtain the remaining factors: each q(a_i) is a Dirichlet distribution, each q(μ_i | σ_i²) q(σ_i²) is normal-inverse-gamma and q(S_1:T) retains the Markov-chain structure of the prior, with sufficient statistics E_{q(S_1:T)}[S_{t,i}] and E_{q(S_1:T)}[S_{t,i} S_{t+1,j}], where S_{t,i} = 1, if the process is in state i at time t, and zero otherwise. Here, ψ is the digamma function, φ is the normal density function and the exact functional forms used in the updates are shown in Algorithm 1.
Algorithm 1 Variational Bayes algorithm for the regime-switching log-normal (RSLN) model.

Initialize the variational parameters: the Dirichlet weights w of q(π) and q(A) and the normal-inverse-gamma parameters of each q(μ_i | σ_i²) q(σ_i²).
while π*_i(t), a*_ij(t) and θ*_{i,t}(t) do not converge do
    update the normal-inverse-gamma parameters (including β_i(t)) of q(μ_i | σ_i²) and q(σ_i²) at step t;
    update the Dirichlet weights of q(π) and q(A), and hence π*_i(t) and a*_ij(t), at step t;
    compute E_{q(S_1:T)}[S_{t,i}] and E_{q(S_1:T)}[S_{t,i} S_{t+1,j}] by the forward-backward algorithm;
end while

The VB method proceeds, as was shown with the simple Equations (4) and (5), by iteratively updating the variational parameters to solve a set of simultaneous equations. In this example, the update equations for π, A, {μ_i, σ_i²}_{i=1}^K and S_1:T are given explicitly by Algorithm 1. For the initialisation, we choose symmetric values for most of the parameters and random values for the others, as appropriate. For this example, this worked very satisfactorily, although we note that, for more general state space models, [28] reports that finding good initial values can be non-trivial.
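Because the priors are conjugate, each parameter update in the iteration has a closed form. The following sketch shows one sweep of standard conjugate updates, given the smoothed expectations r[t, i] = E_q[S_{t,i}] and x[t, i, j] = E_q[S_{t,i} S_{t+1,j}] (e.g., from forward-backward); the parameterisation is illustrative and need not match the paper's exact notation:

```python
import numpy as np

def update_variational_parameters(y, r, x, C_pi, C_A, gamma, eta2, alpha, beta):
    """One sweep of conjugate VB updates, given smoothed regime expectations.

    r[t, i] = E_q[S_{t,i}]; x[t, i, j] = E_q[S_{t,i} S_{t+1,j}].
    Returns the Dirichlet weights of q(pi) and q(A) and the
    normal-inverse-gamma parameters of q(mu_i | sigma_i^2) q(sigma_i^2).
    """
    w_pi = C_pi + r[0]                       # Dirichlet weights of q(pi)
    w_A = C_A + x.sum(axis=0)                # Dirichlet weights of q(A), row i

    N = r.sum(axis=0)                        # expected time spent in each regime
    ybar = (r * y[:, None]).sum(axis=0) / N  # regime-weighted sample means
    S = (r * (y[:, None] - ybar) ** 2).sum(axis=0)  # weighted sums of squares

    eta2_n = eta2 + N                        # updated precision scale
    gamma_n = (eta2 * gamma + N * ybar) / eta2_n
    alpha_n = alpha + N / 2.0
    beta_n = beta + 0.5 * S + 0.5 * eta2 * N * (ybar - gamma) ** 2 / eta2_n
    return w_pi, w_A, gamma_n, eta2_n, alpha_n, beta_n
```

Each sweep alternates with a forward-backward pass that recomputes r and x from the current variational parameters, until convergence.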

Interpretation of Results
First, all approximating distributions above turn out to lie in well-known parametric families. The only unknown quantities are the parameters of these distributions, which are often called the variational parameters.
The evaluation of the parameters of q(π), q(A), q(μ_i | σ_i²) and q(σ_i²) requires knowledge of q(S_1:T), and, likewise, the evaluation of π*_i, a*_ij and θ*_{i,t} requires knowledge of q(π), q(A), q(μ_i | σ_i²) and q(σ_i²). This structure leads to the iterative updating scheme described in Algorithm 1.
The main computational effort in Algorithm 1 is computing E_{q(S_1:T)}[S_{t,i}] and E_{q(S_1:T)}[S_{t,i} S_{t+1,j}], which have no simple tractable forms. We note that the distributional form of q(S_1:T) has a very similar structure to that of the conditional distribution p(S_1:T | Y_1:T, τ), for which the forward-backward algorithm [30] is commonly used to compute E[S_{t,i} | Y_1:T, τ] and E[S_{t,i} S_{t+1,j} | Y_1:T, τ]. Therefore, we also use the forward-backward algorithm to compute E_{q(S_1:T)}[S_{t,i}] and E_{q(S_1:T)}[S_{t,i} S_{t+1,j}].
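A generic scaled forward-backward routine of the kind used here can be sketched as follows. The inputs p0, P and B stand for the (possibly unnormalized) initial, transition and per-time emission weights of a chain-structured distribution, which is exactly the structure q(S_1:T) takes; in the VB updates, these would come from exponentiated expected log-parameters under the other variational factors:

```python
import numpy as np

def forward_backward(p0, P, B):
    """Smoothed expectations for a chain-structured distribution.

    Returns r[t, i] = E[S_{t,i}] and x[t, i, j] = E[S_{t,i} S_{t+1,j}],
    using the standard scaled forward-backward recursions.
    """
    T, K = B.shape
    a = np.empty((T, K))                 # scaled forward variables
    c = np.empty(T)                      # per-step normalizers
    a[0] = p0 * B[0]; c[0] = a[0].sum(); a[0] /= c[0]
    for t in range(1, T):
        a[t] = (a[t - 1] @ P) * B[t]
        c[t] = a[t].sum(); a[t] /= c[t]
    b = np.ones((T, K))                  # scaled backward variables
    for t in range(T - 2, -1, -1):
        b[t] = (P @ (B[t + 1] * b[t + 1])) / c[t + 1]
    r = a * b                            # marginal regime probabilities
    x = (a[:-1, :, None] * P[None] * (B[1:] * b[1:])[:, None, :]
         / c[1:, None, None])            # pairwise transition probabilities
    return r, x
```

The scaling constants c also yield the normalizer of q(S_1:T) via sum(log c), which is useful for monitoring convergence of the free energy.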

Simulated Data
In this section, we apply the VB solution to four sets of simulated data, which were used in [24]. Through these simulation studies, we test the performance of VB in detecting the number of regimes and compare it with that of the BIC and the MCMC methods [24]. For this paper, we present only an initial study with a relatively small number of datasets. The results are highly promising, but more extensive studies are needed to draw comprehensive conclusions. Furthermore, see [28] for general results on VB in hidden state space models.
To estimate the number of regimes, we construct a matrix, called the relative magnitude matrix (RMM), defined as Â = (â_ij), where â_ij is the weight w^A_ij of q(A) normalized by the total weight, Σ_{k,l} w^A_{kl}. Our model selection procedure is to fit a VB with a large number of regimes and to examine the rows and columns of the RMM. If the values of the entries in the i-th row and the i-th column of Â are all equal to the common prior-induced value, C_A/(K(T − 1) + C_A × K), then we declare regime i nonexistent. This method is validated by the following observations. It can be shown that the data-dependent part v^s_ij of w^A_ij is equal to the expected number of times the process leaves regime i and enters regime j. Therefore, for the i-th regime, values of zero for all of v^s_ji and v^s_ij with j = 1, ..., K indicate that there is no transition process entering or leaving regime i.
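The regime-pruning rule can be sketched as follows. The weights below are illustrative: row and column 2 (0-indexed 1) carry only the prior mass C_A, as they would if no expected transitions enter or leave that regime, and we test each regime against the common prior-only value:

```python
import numpy as np

# Relative magnitude matrix (RMM) from the fitted Dirichlet weights of q(A),
# followed by pruning of regimes whose row and column stay at the prior value.
C_A, K = 0.1, 3
w_A = np.array([[310.2, C_A, 45.7],      # illustrative fitted weights
                [C_A,   C_A, C_A ],      # an "empty" regime: prior mass only
                [44.9,  C_A, 268.8]])

a_hat = w_A / w_A.sum()                  # relative magnitudes, sum to 1

prior_only = C_A / w_A.sum()             # value an unused entry takes
active = [i for i in range(K)
          if not (np.allclose(a_hat[i], prior_only)
                  and np.allclose(a_hat[:, i], prior_only))]
print(active)   # -> [0, 2]: regime 1 (0-indexed) is declared nonexistent
```

In practice, one fits a deliberately over-specified K and reads the surviving number of regimes off the RMM in this way.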
Table 1 specifies the parameters for the four cases, and we generate 671 observations for each case (equal to the number of months from January 1956 to September 2011). The parameters used in Case 1 are identical to the maximum likelihood estimates for TSX monthly return data from 1956 to 1999 [19]. Case 2 has only one regime present. Case 3 is similar to Case 1, but the two regimes have the same mean. Case 4 adds a third regime. For each case, we use MLE to fit one-regime, two-regime, three-regime and four-regime RSLN models and report the corresponding BIC and log-likelihood scores. We then misspecify the number of regimes and run a four-regime VB algorithm. Table 2 shows the number of iterations that VB takes to converge in each case and the corresponding computational time (on a MacBook, 2 GHz processor). On average, VB converges after about a hundred iterations and takes about one minute. On the same computer, a 10^4-iteration reversible jump MCMC (RJMCMC) run takes about 10 h to finish. Using diagnostics, this seemed to be enough for convergence, while not being an "unfair" comparison in terms of time with VB. We can see that computational efficiency is a very attractive feature of the VB method. The results of the BIC with the log-likelihood (in parentheses), the relative magnitude matrices and the posterior probabilities for the models with different numbers of regimes estimated by MCMC (cited from Hartman and Heaton [24]) are given in Table 3. In Case 1, the BIC favors the two-regime model. The posterior probability estimated by MCMC for the one-regime model is the largest, but there is still a large probability for the two-regime model. Note that the prior specification for the number of regimes can affect these numbers and is always an issue with these forms of multidimensional MCMC. The relative magnitude matrix clearly shows that there are only two regimes whose â_ij are not negligible. This implies that VB removes excess transition and emission processes and discovers the exact number of hidden regimes. In Case 2 and Case 3, both VB and the BIC select the correct number of regimes, and the posterior probability for the one-regime model estimated by MCMC is again the largest. In Case 4, VB does not detect the third regime. The transition probability into this regime is only 0.01, and the means and standard deviations of Regime 1 mean that the rare data from Regime 3 are easily merged with the data from Regime 1. From Table 3, it is clear that, for all of the cases, the log-likelihood always increases as the number of regimes increases.

Real Data
In this section, we apply the VB solution to the TSX monthly total return index for the period from January 1956 to December 1999 (528 observations in total, studied in [19,21]).
A four-regime VB is implemented first. VB converges after 100 iterations in about 34.284 s (on a MacBook, 2 GHz processor). The relative magnitude matrix, given in Table 4, clearly shows that VB identifies two regimes. This matches both the BIC- and AIC-based results [19]. Based on these results, we then fit a two-regime VB, which converges after 83 iterations in about 14.241 s. Table 5 gives the marginal distributions for all of the parameters. Figure 4 presents the corresponding density functions, where we can see that all of the plots show a symmetric, bell-shaped pattern. Table 6 (the upper part) gives the maximum likelihood estimates (cited from [19]), mean parameters computed by the MCMC method (cited from [21]) and mean parameters computed by VB. It clearly shows that the point estimates by VB are very close to those by MLE and MCMC. The numbers in parentheses in Table 6 are the standard deviations computed by the three methods, respectively. It is worth noting that all of the variances estimated by VB are smaller than those by the MLE or MCMC methods. In fact, other researchers have also reported the underestimation of posterior variance in other VB applications, for example [31,32]. In [1], we look at some diagnostic methods that can assess how well the VB approximates the true posterior, particularly with regard to its covariance structure. The methods proposed there also allow us to generate simple corrections when the approximation error is large.

Conclusions
Variational Bayes can be thought of, in terms of information geometry, as a projection-based approximation technique; it provides a framework for approximating posteriors. We applied this method to the regime-switching log-normal model and provided solutions to account for both model uncertainty and parameter uncertainty. The numerical results show that our method can recover the number of regimes exactly and gives reasonable point estimates. The VB method is also demonstrated to be computationally very efficient.
The application to the TSX monthly total return index data for the period from January 1956 to December 1999 confirms similar results in the literature regarding the number of regimes.

Figure 2 .
Figure 2. Space of distributions with independence surface: marginal probability and dependence.

Table 1 .
Parameters of the simulated data.

Table 2 .
Computational efficiency of VB.

Table 3 .
The estimated number of regimes by VB, BIC and MCMC.

Table 4 .
Estimations of the number of regimes for TSX data.

Table 5 .
The marginal distributions of the parameters estimated by VB.

Table 6 .
Estimates and standard deviations by VB, MLE and MCMC.