Variational Bayesian Variable Selection for High-Dimensional Hidden Markov Models

The Hidden Markov Model (HMM) is a crucial probabilistic modeling technique for sequence-data processing and statistical learning that has been extensively utilized in various engineering applications. Traditionally, the EM algorithm is employed to fit HMMs, but academics and practitioners now show growing enthusiasm for Bayesian inference. In the Bayesian context, Markov Chain Monte Carlo (MCMC) methods are commonly used for inferring HMMs, but they can be computationally demanding for high-dimensional covariate data. As a rapid substitute, variational approximation has become a noteworthy and effective approximate inference approach, particularly in recent years for representation learning in deep generative models. However, there has been limited exploration of variational inference for HMMs with high-dimensional covariates. In this article, we develop a mean-field Variational Bayesian method with the double-exponential shrinkage prior to fit high-dimensional HMMs whose hidden states are discrete. The proposed method offers the advantage of simultaneously fitting the model and investigating the specific factors that affect the response variable. In addition, because the proposed method is based on the Variational Bayesian framework, it avoids the huge memory and intensive computational cost typical of traditional Bayesian methods. In simulation studies, we demonstrate that the proposed method can quickly and accurately estimate the posterior distributions of the parameters. We also analyze the Beijing Multi-Site Air-Quality data and predict the PM2.5 values via the fitted HMMs.


Introduction
Hidden Markov Models (HMMs) are statistical models used to describe the evolution of observable events that depend on internal factors, called hidden states, which are not directly observable. Each hidden state can transition to another hidden state, including itself, with a certain probability; although we cannot observe the states directly, we infer their presence and the transitions between them from observable outputs. HMMs have been widely used in various applications, including speech recognition, bioinformatics, natural language processing, and financial markets. In practice, HMMs often face high-dimensional issues; that is, a large number of covariates (high-dimensional covariates) and multiple states result in high-dimensional parameters in the HMMs. This high dimensionality may cause the HMMs to overfit. The challenge then becomes identifying the important variables or parameters in different hidden states. Thus, efficient parameter estimation and recovery of the hidden Markov chain are significant for high-dimensional HMMs.
Currently, there are many methods for parameter estimation and recovery of hidden Markov chains, including recursive algorithms [1][2][3][4] and traditional Bayesian methods [5,6]. Bayesian inference [7,8] is a versatile framework that utilizes sophisticated hierarchical data models for learning and consistently quantifies uncertainty in unknown parameters through the posterior distribution [9]. However, computing the posterior is demanding even for moderately intricate models and frequently necessitates approximation. Moreover, traditional Bayesian methods for HMMs (e.g., Markov Chain Monte Carlo (MCMC)) are often considered black-box methods by many statisticians due to their reliance on simulation to produce computable results, which is generally inefficient. Traditional MCMC methods may also exhibit slow convergence and extended running times, as documented in prior studies [10][11][12].
The Variational Bayesian approach is an alternative to traditional MCMC algorithms for high-dimensional problems. Variational inference (VI) in the Bayesian setting [13][14][15][16] can approximate posterior distributions quickly [17]: VI uses the Kullback-Leibler (KL) divergence to measure the difference between the variational posterior and the true posterior, and transforms the statistical inference problem into a mathematical optimization problem by minimizing the KL divergence. The variational approximation is therefore as close as possible, in KL divergence, to the true posterior distribution. Wang and Blei [17] proved that the variational posterior is consistent with the true posterior distribution. Moreover, many efficient optimization algorithms exist for approximating complex probability densities, such as coordinate ascent (CAVI) [18] and gradient-based methods [15,19]. There are many VI studies for HMMs [20][21][22][23][24]. For example, MacKay [20] was the pioneering proponent of applying variational methods to HMMs, with a focus only on cases with discrete observations. Despite limited comprehension of the state-removal phenomenon (removing certain states from an HMM to simplify it while preserving its essential statistical properties does not significantly affect the ability of the HMM to represent the underlying stochastic process), variational methods are gaining popularity for HMMs within the machine learning community. McGrory [21] extended the deviance information criterion for Bayesian model selection within the Hidden Markov Model framework, utilizing a variational approximation.
Foti et al. [22] devised an SVI algorithm to learn HMM parameters in settings of time-dependent data. Since VI can be seen as a special instance of the EM algorithm [23], Gruhl [23] integrates both approaches and uses the multivariate Gaussian output distribution of VI to train the HMM. Ding [24] employed variational inference to investigate nonparametric Bayesian Hidden Markov Models built on Dirichlet processes; these allow an unbounded number of hidden states and adopt an infinite number of Gaussian components to handle continuous observations. However, variational inference has not been fully explored for HMMs, especially HMMs with high-dimensional covariates. It is important to note that real-world research frequently involves high-dimensional covariate datasets. When high-dimensional HMMs contain a large number of parameters, there is a risk of overfitting the given data set. The challenge then becomes identifying which covariates have a substantial impact on the interpretation of observations and state shifts in each hidden state of the model, and which covariates have a negligible effect. This process is important for improving the predictive accuracy of the model, reducing the risk of overfitting, and improving its interpretability.
In this article, we develop a Variational Bayesian method for variable selection. We utilize the double-exponential shrinkage prior [25][26][27][28] as the prior of the coefficients in each hidden-state model, to screen the vital variables that affect each hidden state and obtain hidden Markov regression models. We use mean-field variational inference to identify variational densities for approximating complex posterior densities, by minimizing the difference between the approximate probability density and the actual posterior probability density. Moreover, we adopt the Monte Carlo Coordinate Ascent VI (MC-CAVI) [29] algorithm to compute the necessary expectations within CAVI. Since variational inference is a fast alternative to MCMC and avoids the large memory and intensive computational cost of traditional Bayesian methods, the proposed approach inherits the good properties of variational inference and can quickly and accurately estimate the posterior distributions and the unknown parameters. In simulation studies and real data analysis, the proposed method outperforms common methods in terms of variable selection and prediction.
The main contributions of this article are as follows. First, the proposed method performs variable selection for high-dimensional HMMs, offering the advantage of simultaneously fitting the model and investigating the specific factors that affect the response variable. Since the method uses the double-exponential shrinkage prior, which can screen out important variables, it can simultaneously select the variables important to the response variable and estimate the corresponding parameters. Second, since the method is based on the Variational Bayesian framework, it avoids the huge memory and intensive computational cost of traditional Bayesian methods, especially for high-dimensional problems. Finally, we demonstrate in simulation studies that the proposed method quickly and accurately estimates the posterior distributions of the parameters. Moreover, we analyze the Beijing Multi-Site Air-Quality data and predict the PM2.5 values well via the fitted HMMs.
The rest of the article is organized as follows. Section 2 introduces the Hidden Markov Model with high-dimensional covariates and shrinkage priors in Bayesian inference. In Section 3, we propose an efficient Variational Bayesian estimation method with the double-exponential shrinkage prior for variable selection in high-dimensional HMMs (HDVBHMM). In Section 4, we conduct simulation studies to investigate the finite-sample performance of the proposed method. In Section 5, the Beijing Multi-Site Air-Quality data are analyzed and the efficiency of the proposed method is verified. Section 6 concludes our work. Technical details are presented in Appendix A.

Hidden Markov Model
In this section, we first introduce Hidden Markov Models (HMMs). An HMM is a doubly stochastic process over discrete time intervals that includes observations y_t and latent states z_t. In a traditional Hidden Markov Model without covariates, the observation y_t depends only on the current latent state z_t. The conditional distribution of the observation y_t given the latent state z_t = k can be expressed as

y_t | z_t = k ~ F_k(θ_k),

where F_k(θ_k) denotes a certain family of distributions, such as the normal distribution N(μ_k, σ²_k). Extended HMM models can include covariates x_t ∈ R^p; that is, the set of observations is y = (y_1, ..., y_T) and x = (x_1, ..., x_T). Specifically, for y = (y_1, ..., y_T), z = (z_1, ..., z_T) and x = (x_1, ..., x_T), the model expression is

y_t | z_t = k, x_t ~ N(x_t^⊤ β_k, σ²),

where N represents the normal distribution, σ² denotes the variance of y_t, and β = (β_1, ..., β_K)^⊤ collects the covariate coefficients at all hidden states. In this article, we consider the high-dimensional setting for the covariate. We denote the dummy variable corresponding to z_t by v_t = (v_{t1}, ..., v_{tK})^⊤, where v_{tk} = 1 and the other elements are zero if z_t = k. Thus,

P(y_t | z_t, x_t, β, σ²) = ∏_{k=1}^K N(y_t | x_t^⊤ β_k, σ²)^{v_{tk}}.

In hidden Markov chains, each hidden state z_t is conditionally independent of z_1, ..., z_{t−2} and z_{t+1}, ..., z_T given z_{t−1}. Therefore, we assume that the probability distribution of z_1 is given by z_1 ~ P(z_1 | π) with π = (π_1, π_2, ..., π_K)^⊤, where ∑_{k=1}^K π_k = 1 and π_k > 0. The conditional probability of z_t given z_{t−1} is assumed to be

P(z_t = j | z_{t−1} = i) = A_{ij},

where A is the transition matrix with elements A_{ij} for i, j = 1, ..., K, ∑_{j=1}^K A_{ij} = 1 and A_{ij} > 0, and A_{ij} represents the probability of transitioning from state i to state j. Thus, the joint distribution is

P(y, z | x, π, A, β, σ²) = P(z_1 | π) ∏_{t=2}^T P(z_t | z_{t−1}, A) ∏_{t=1}^T P(y_t | z_t, x_t, β, σ²). (1)
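As a concrete illustration of the generative model above, the following Python sketch simulates a hidden Markov chain and draws observations y_t ~ N(x_t^⊤ β_{z_t}, σ²). All parameter values here are hypothetical and chosen only for illustration; this is a sketch of the model, not the paper's implementation.

```python
import numpy as np

def simulate_hmm(T, pi, A, beta, sigma2, X, rng):
    """Simulate hidden states z_t and observations y_t ~ N(x_t^T beta_{z_t}, sigma^2)."""
    K = len(pi)
    z = np.empty(T, dtype=int)
    y = np.empty(T)
    for t in range(T):
        # sample the hidden chain: z_1 from pi, then z_t from row z_{t-1} of A
        z[t] = rng.choice(K, p=pi) if t == 0 else rng.choice(K, p=A[z[t - 1]])
        # emission: normal with state-specific regression mean
        y[t] = X[t] @ beta[z[t]] + rng.normal(0.0, np.sqrt(sigma2))
    return z, y

# illustrative (hypothetical) parameter values
rng = np.random.default_rng(0)
T, p, K = 200, 5, 3
pi = np.array([0.5, 0.3, 0.2])
A = np.array([[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]])
beta = rng.normal(size=(K, p))
X = rng.normal(size=(T, p))
z, y = simulate_hmm(T, pi, A, beta, sigma2=0.25, X=X, rng=rng)
```

Simulated data of this form are what the variational procedure of Section 3 is later fitted to.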

Prior Selection in the HMMs
To perform Variational Bayesian inference, we need to specify the priors of the parameters π, A, β and σ². Based on the characteristics of π = (π_1, π_2, ..., π_K), ∑_{k=1}^K π_k = 1 and π_k > 0, a Dirichlet distribution is applied as the prior of π:

π ~ Dirichlet(α^(π)), (2)

where the elements of α^(π) are positive. In the model, A denotes the transition matrix of the hidden state z. Like Nicholas [22], we specify the prior of the jth row of the transition matrix A as

A_j ~ Dirichlet(α_{j1}, ..., α_{jK}), (3)

where α_{jk} > 0. Since σ² is the variance of y, we specify a conjugate inverse-gamma prior for σ². (4) In a high-dimensional and sparse setting, we consider the double-exponential shrinkage prior [25,27] as the prior of β:

β_{mk} | τ²_m ~ N(0, τ²_m), τ²_m ~ Exp(λ²/2), λ² ~ Γ(r, δ), (5)

where Γ represents the gamma distribution and Exp(·) represents the exponential distribution. The above prior can select important variables of the HMMs in each hidden state. Bayesian approaches can be used to solve the parameter estimation problem with the above prior information. However, with high-dimensional data, traditional Bayesian methods (e.g., MCMC) require huge memory and intensive computational cost. The Variational Bayesian approach is an alternative to the traditional MCMC algorithm in the high-dimensional setting. Next, we introduce the proposed Variational Bayesian inference for high-dimensional HMMs.
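To make the prior hierarchy concrete, here is a small Python sketch that draws one sample from priors of this form (a Park-Casella-style double-exponential hierarchy). The hyperparameter values, the rate parameterization of the gamma draw, and the exact scaling of the normal slab are illustrative assumptions, not the paper's exact specification.

```python
import numpy as np

def sample_priors(p, K, alpha_pi, alpha_A, r, delta, rng):
    """Draw one sample from an (assumed) prior hierarchy for the HMM parameters."""
    pi = rng.dirichlet(alpha_pi)                                   # Dirichlet prior on pi
    A = np.vstack([rng.dirichlet(alpha_A[j]) for j in range(K)])   # row-wise Dirichlet on A
    lam2 = rng.gamma(r, 1.0 / delta)                               # Gamma(r, delta), rate form assumed
    tau2 = rng.exponential(2.0 / lam2, size=p)                     # Exp(lam2 / 2): mean 2 / lam2
    beta = rng.normal(0.0, np.sqrt(tau2), size=(K, p))             # conditionally normal coefficients
    return pi, A, lam2, tau2, beta

rng = np.random.default_rng(1)
pi, A, lam2, tau2, beta = sample_priors(
    p=6, K=3, alpha_pi=np.ones(3), alpha_A=np.ones((3, 3)), r=1.0, delta=1.0, rng=rng)
```

Marginalizing the exponential scale τ²_m out of the normal slab yields the double-exponential (Laplace) density on each β_{mk}, which is what induces the shrinkage toward zero.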

Mean Field Variational Inference
Mean-field Variational Bayesian inference is a prevalent approach in variational inference; it aims to identify an approximate density by minimizing the difference, measured by the Kullback-Leibler divergence, between the approximate probability density and the actual posterior probability density. In this subsection, we propose mean-field variational inference for HMMs with high-dimensional covariates.
Let D be an observed data set, D = {y, x}, with response set y = {y_i | i = 1, ..., n} and covariate set x = {x_i | i = 1, ..., n}, and let θ = {π, A, β, σ², τ²_m, λ²}. Together, θ and z include all parameters in the HMMs. We focus on the posterior distribution of the parameters θ and the hidden states z_t. Assume that there is an approximate density family Q containing possible densities over the parameters θ, z. Minimizing the Kullback-Leibler (KL) divergence between a member q(θ, z) of the family and the true posterior P(θ, z | D) yields the optimal density approximation of the true posterior; variational inference thus prioritizes optimization rather than sampling. The KL divergence is:

KL(q(θ, z) ∥ P(θ, z | D)) = ∫ q(θ, z) log [ q(θ, z) / P(θ, z | D) ] d(θ, z).
The KL divergence can be further written as

KL(q(θ, z) ∥ P(θ, z | D)) = E_q[log q(θ, z)] − E_q[log P(θ, z, D)] + log P(D),

where log P(D) is a constant and E_q denotes the expectation with θ and z drawn from the distribution q. Thus, minimizing the KL divergence is equivalent to maximizing the following evidence lower bound (ELBO):

ELBO(q) = E_q[log P(θ, z, D)] − E_q[log q(θ, z)]. (6)

From another perspective, the ELBO comprises the negative KL divergence plus log P(D).
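The decomposition above can be written out as a single identity, which makes the equivalence of the two optimization views explicit:

```latex
\log P(\mathcal{D})
  = \underbrace{\mathbb{E}_q\!\left[\log P(\theta, z, \mathcal{D})\right]
      - \mathbb{E}_q\!\left[\log q(\theta, z)\right]}_{\mathrm{ELBO}(q)}
  \;+\;
  \underbrace{\mathrm{KL}\!\left(q(\theta, z)\,\middle\|\,P(\theta, z \mid \mathcal{D})\right)}_{\;\ge\; 0}
```

Since the KL term is non-negative and log P(D) does not depend on q, the ELBO is a lower bound on the evidence, and maximizing the ELBO over q is the same as minimizing the KL divergence.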
According to the mean-field variational framework [30,31], the parameters are assumed to be a posteriori independent of each other, each controlled by a separate factor in the variational density. In the HMMs, q(θ, z) is decomposed as

q(θ, z) = q(π) q(A) q(σ²) ∏_{m=1}^p q(τ²_m) q(λ²) ∏_{t=1}^T q(z_t).
Each parameter θ_i and latent state z is governed by its own variational factor. The forms of q(θ_i) and q(z) are unknown, but the form of the hypothesized factorization is fixed. In the optimization process, the optimal solutions for these variational factors q(θ_i) and q(z) are obtained by maximizing the ELBO of Equation (6) with the coordinate ascent method. Based on the consistency of the Variational Bayesian approach [17], the variational densities over the mean-field family remain consistent with the posterior densities, even though the mean-field approximating family can be a crude approximation. More generally, one can consider structured variational distributions involving partial factorizations that correspond to tractable substructures of parameters [32]. In this article, we only consider the mean-field framework. To express the variational posterior formulas concisely, we define ϕ = {θ, z} and rewrite q(θ, z) as q(ϕ).

The Coordinate Ascent Algorithm for Optimizing the ELBO
Based on the variational density decomposition, we can obtain each factor of the variational density by maximizing the ELBO. Let q_i(ϕ_i) for i = 1, 2, ..., b be the ith factor of the variational density decomposition. The common approaches to maximizing the ELBO include Coordinate Ascent Variational Inference (CAVI) and gradient-based approaches [33]. The CAVI approach sequentially optimizes each factor of the mean-field variational density, while keeping the others fixed, to obtain a local maximizer of the ELBO. Based on the CAVI approach, we obtain the optimal variational density q*_i(ϕ_i) as

q*_i(ϕ_i) ∝ exp{ E_{−i}[ log P(ϕ_{i−}, ϕ_i, ϕ_{i+}, D) ] }, (8)

where i− (or i+) refers to the ordered indexes less than (or greater than) i. Let ϕ_{−i} := (ϕ_{i−}, ϕ_{i+}); the vector ϕ_{−i} represents the vector ϕ with the ith component ϕ_i removed. E_{−i} denotes the expectation with respect to ϕ_{−i}.
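To make the CAVI recursion concrete, the following Python sketch applies coordinate ascent not to the HMM itself but to a simple conjugate toy model (a univariate Normal-Gamma model, x_i ~ N(μ, 1/τ) with μ | τ ~ N(μ_0, (λ_0 τ)^{-1}) and τ ~ Gamma(a_0, b_0)): each factor is updated in closed form while the other is held fixed, exactly as in Formula (8). All hyperparameter values are illustrative.

```python
import numpy as np

def cavi_normal_gamma(x, mu0=0.0, lam0=1.0, a0=1.0, b0=1.0, iters=100):
    """CAVI for x_i ~ N(mu, 1/tau) under the factorization q(mu, tau) = q(mu) q(tau)."""
    N, xbar = len(x), float(np.mean(x))
    E_tau = a0 / b0  # initial guess for E_q[tau]
    for _ in range(iters):
        # coordinate update of q(mu) = N(mu_N, 1/lam_N), holding q(tau) fixed
        mu_N = (lam0 * mu0 + N * xbar) / (lam0 + N)
        lam_N = (lam0 + N) * E_tau
        # coordinate update of q(tau) = Gamma(a_N, b_N), holding q(mu) fixed
        a_N = a0 + (N + 1) / 2.0
        E_sq = np.sum((x - mu_N) ** 2) + N / lam_N        # E_q[sum_i (x_i - mu)^2]
        E_prior = (mu_N - mu0) ** 2 + 1.0 / lam_N         # E_q[(mu - mu0)^2]
        b_N = b0 + 0.5 * (E_sq + lam0 * E_prior)
        E_tau = a_N / b_N
    return mu_N, lam_N, a_N, b_N

rng = np.random.default_rng(0)
x = rng.normal(2.0, 1.0, size=500)
mu_N, lam_N, a_N, b_N = cavi_normal_gamma(x)
```

Each pass through the loop is one sweep of Formula (8) over the two factors; in the HMM of this article the same scheme cycles over q(π), q(A), q(σ²), the q(τ²_m), q(λ²), and the q(z_t).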
Based on the joint distribution (1), the priors (2)-(5) and Formula (8), we can derive all variational posteriors (see Appendix A for details). The variational posterior of π is a Dirichlet distribution with updated parameter α^(π)* = E(z_1) + α^(π). The variational posterior of each row A_j of the transition matrix is likewise a Dirichlet distribution. The variational posterior of β_k is a multivariate normal distribution, and the variational posteriors of σ², τ²_m and λ² follow from the corresponding conjugate updates; the updated parameters of all these distributions are derived in Appendix A. Based on the dependencies of the hidden states, we divide the posterior of z into three parts. The variational posterior of z_1 is

q*(z_1) = Mult(1, P^(z_1)),

where Mult represents the multinomial distribution and P^(z_1) = (P^(z_1)_1, ..., P^(z_1)_K)^⊤. The variational posteriors of z_t for t = 2, ..., T − 1 and of z_T take the same multinomial form with probability vectors P^(z_t) = (P^(z_t)_1, ..., P^(z_t)_K)^⊤ and P^(z_T) = (P^(z_T)_1, ..., P^(z_T)_K)^⊤, respectively; these probabilities, given in (15)-(17) of Appendix A, involve the expectation E[log P(y_t | x_t, β_k, σ²)]. Note that the expectation part of some parameter posterior formulas is difficult to derive analytically. One feasible method is to use Monte Carlo (MC) sampling to approximate the expectations that cannot be derived analytically, that is, the Monte Carlo Coordinate Ascent VI (MC-CAVI) [29] algorithm. The MC-CAVI recursion has been proved to converge to the maximizer of the ELBO with arbitrarily high probability under regularity conditions. In this article, we also use MC-CAVI to obtain the intractable expectations.

Implementation
Assume that the expectations E_{−i}[log P(ϕ_{i−}, ϕ_i, ϕ_{i+}, D)] can be analytically obtained across all updates of the variational density q*(ϕ) for i ∈ I within an index set I, and cannot be analytically obtained for i ∉ I. In the MC-CAVI method, the intractable integrals for i ∉ I are approximated by MC sampling. Specifically, for i ∉ I, N ≥ 1 samples ϕ^(n)_{−i} are drawn from the current q*_{−i}(ϕ_{−i}) to obtain the expectation estimate

Ê_{−i}[log P(ϕ_{i−}, ϕ_i, ϕ_{i+}, D)] = (1/N) ∑_{n=1}^N log P(ϕ^(n)_{i−}, ϕ_i, ϕ^(n)_{i+}, D).

Algorithm 1 summarizes the implementation of MC-CAVI, where q_{i,k}(ϕ_i) denotes the density of the ith factor after its kth update, and q_{−i,k}(ϕ_{−i}) refers to the density of all factors except the ith one after the kth updates of the factors preceding the ith factor and the (k − 1)th updates of those following it. Required: E_{−i}[log P(ϕ_{i−}, ϕ_i, ϕ_{i+}, D)] in closed form for i ∈ I.
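A minimal sketch of the Monte Carlo step above: the intractable expectation is replaced by an average of the integrand over N draws from the current variational factor. The toy integrand here (E[ϕ²] under a standard normal, which equals 1) is purely illustrative and stands in for the log-joint term of the MC-CAVI update.

```python
import numpy as np

def mc_expectation(integrand, sample_q, N, rng):
    """Approximate E_{phi ~ q}[integrand(phi)] with N Monte Carlo draws."""
    draws = sample_q(rng, N)              # N samples from the current q*_{-i}
    return float(np.mean(integrand(draws)))

rng = np.random.default_rng(0)
# toy check: E[phi^2] = 1 for phi ~ N(0, 1)
est = mc_expectation(lambda phi: phi ** 2,
                     lambda rng, N: rng.normal(0.0, 1.0, size=N),
                     N=100_000, rng=rng)
```

The estimator's standard error shrinks at the usual O(N^{-1/2}) Monte Carlo rate, which is why a moderate N suffices inside each CAVI sweep.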
If i ∉ I, we obtain N samples from the current q*_{−i}(ϕ_{−i}) and approximate the required expectation by MC sampling as above. Combining with the MC-CAVI algorithm, we summarize the implementation of the variational posteriors for all parameters in Algorithm 2. Based on Algorithm 2, we adopt the variational posterior means of the parameters as the estimators.

Simulation Studies
In this section, we carry out simulation studies to investigate the finite-sample performance of the proposed method, denoted HDVBHMM. To evaluate the prediction performance, we compare the proposed method with some commonly used and popular methods, including the Back Propagation Neural Network (BP), Long Short-Term Memory (LSTM), and Random Forest. The experimental code is available on GitHub (https://github.com/LiuWei-hub/VBHDHMM, accessed on 23 March 2024).
To assess the predictive performance, we use the samples in the last m time intervals as the testing set and the samples in the first T − m time intervals as the training set. In addition, we use four criteria:

(1) MAPE = (1/m) ∑_{t=1}^m | ( ŷ_t − y_t ) / y_t |;
(2) RMSE = sqrt( (1/m) ∑_{t=1}^m ( ŷ_t − y_t )² );
(3) MAE = (1/m) ∑_{t=1}^m | ŷ_t − y_t |;
(4) R² = 1 − ∑_{t=1}^m ( ŷ_t − y_t )² / ∑_{t=1}^m ( ȳ − y_t )²,

where ȳ represents the sample mean, ∑_{t=1}^m ( ŷ_t − y_t )² is the error caused by the prediction, and ∑_{t=1}^m ( ȳ − y_t )² is the error caused by the mean. The smaller the MAPE, RMSE and MAE values, the better the performance of the method; the larger the R², the better the performance. To evaluate the performance of the parameter estimation, we use two criteria: (1) the root mean square error RMSE(θ̂) = sqrt( (1/n) ∑_{i=1}^n ( θ̂_i − θ )² ), where n is the number of repeated experiments, θ̂_i is the estimate of the parameter obtained in the ith experiment, and θ is the true parameter value; and (2) Bias(θ̂) = (1/n) ∑_{i=1}^n ( θ̂_i − θ ). RMSE and Bias values closer to zero imply better performance. We repeat each simulation example 10 times and report the average values of the above metrics for each method.
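The four prediction criteria can be computed directly from the testing-set values; this small sketch follows the standard definitions described above.

```python
import numpy as np

def prediction_metrics(y_true, y_pred):
    """Return (MAPE, RMSE, MAE, R^2) for predictions on a testing set."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    mape = float(np.mean(np.abs(err / y_true)))      # assumes y_true has no zeros
    rmse = float(np.sqrt(np.mean(err ** 2)))
    mae = float(np.mean(np.abs(err)))
    # R^2: one minus prediction error over error of the mean
    r2 = 1.0 - float(np.sum(err ** 2) / np.sum((np.mean(y_true) - y_true) ** 2))
    return mape, rmse, mae, r2
```

For a perfect prediction the function returns MAPE = RMSE = MAE = 0 and R² = 1, and R² decreases toward (and below) zero as the prediction becomes no better than the sample mean.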

Experiment 1
In experiment 1, we consider different dimensions p = 20, 30 and 40, and fix the state transition matrix A. Since K = 3, there are three regression coefficient vectors β_1, β_2, β_3; their first four rows are nonzero and all other elements are zero. We set the number of discrete time intervals T = 300 and the sample size of the testing set m = 10. In addition, the hyperparameters r, δ in the HDVBHMM method are set to 1. The results are shown in Tables 1 and 2.
In Table 1, smaller MAPE, RMSE, and MAE values and a larger R² value indicate better algorithm performance; bold indicates the optimal result in each scenario. Our method is optimal in all cases, for p = 20, 30, and 40. In the small-sample case, the prediction performance of the LSTM method decreases significantly as the dimensionality of the covariates increases, and the prediction performance of the Random Forest and BP methods is not stable with increasing covariate dimension. Although the performance of our method also decreases as the covariate dimension increases, it remains significantly better than that of the other methods. Table 2 shows the RMSE and Bias of the estimates of β and A. From Table 2, we can see that the proposed method performs well: both metrics are small when the covariate dimension is 20 or 30. When the dimension increases to 40, the RMSE increases but is still within an acceptable range. To better illustrate the performance of the parameter estimation, Figure 1 shows box plots of the estimates of A, β_1, β_2, β_3 for p = 30, where the horizontal coordinate is the index of the variables and the vertical coordinate is the value of the estimates. The corresponding figures for p = 20 and p = 40 are shown in Appendix A.2. For the estimates of β_1, β_2 and β_3, the first four elements are estimated close to the true values, and the remaining values are estimated close to zero; this implies that the proposed method achieves good variable selection performance. In addition, all elements of the state transition matrix A are estimated close to the true values, which also confirms the good performance of our method.
In addition, to examine the sensitivity of the algorithm to the choice of the hyperparameters r and δ, we conduct experiments on data with a covariate dimension of 30. We consider three settings: first r = 0.5, δ = 0.5; second r = 1.0, δ = 1.0; and third r = 1.5, δ = 1.5. The experimental results show that the estimates are not sensitive to the choice of the two hyperparameters r and δ. The densities of the Gamma distributions under the three different hyperparameter settings are very similar in shape, which may explain why, within a certain range of r and δ values, the model's performance is not sensitive to these hyperparameters. The results are shown in Appendix A.4.

Experiment 2
In experiment 2, we consider higher-dimensional cases: p = 60, 90, 120. We use the same A as in experiment 1 and set the coefficients such that the first four rows are nonzero and the other elements are zero. We set the number of discrete time intervals T = 600 and the sample size of the testing set m = 10. In addition, the hyperparameters r and δ in the HDVBHMM method are set to 1. The results are shown in Tables 3 and 4.
As can be seen in Table 4, when the covariate dimension is increased and the sample size reaches 600, our method still performs well among the four methods. It should be noted that when the covariate dimension is 90 or 120, the MAPE of the Random Forest method is slightly smaller than that of our method. In addition, as the covariate dimension increases from p = 60 to p = 120, the performance of the LSTM method decreases significantly and is the worst among the four methods, which shows that LSTM does not perform well on such small-sample high-dimensional datasets. As the dimensionality of the covariates increases, although the BP and Random Forest methods show better prediction performance than the LSTM method, they are still poorer than the HDVBHMM method. Overall, our method outperforms the other three methods in terms of prediction performance as the dimensionality increases, suggesting that our method performs better on small-sample high-dimensional datasets.
Figure 2 shows box plots of the estimates of A, β_1, β_2, β_3 for p = 90; the corresponding figures for p = 60 and p = 120 are shown in Appendix A.3. From Figure 2, we can see that the regression coefficients β_1, β_2, and β_3 are accurately estimated: the first four elements are estimated close to the true values, and the remaining values are estimated close to zero. This implies that the proposed method successfully achieves variable screening even as the covariate dimension increases. In addition, all elements of the state transition matrix A are estimated close to the true values, which also confirms the good performance of the proposed method. In the LSTM method, the learning rate is lr = 0.001, the number of training epochs is set to 50, and the size of the hidden layer is set to 10. The hidden layer of the BP method consists of 20 neurons, and the maximum number of iterations is set to 10,000. In the Random Forest regression model, the number of trees is set to 100. The bold results are the optimal ones among the four methods.

Application to Real Datasets
In this section, we focus on the Beijing Multi-Site Air-Quality data, which include 6 major air pollutants and 6 related meteorological variables at multiple locations in Beijing. These air-quality measurements were created by the Beijing Municipal Environmental Monitoring Center. In addition, the meteorological data at each air-quality location are paired with those of the nearest weather station provided by the China Meteorological Administration. The data span from 1 March 2013 to 28 February 2017. In our study, we consider PM2.5 concentration as the response variable, and PM10 concentration, SO2 concentration, NO2 concentration, CO concentration, O3 concentration, temperature (TEMP), pressure (PRES), dew point temperature (DEWP), precipitation (RAIN), and wind speed (WSPM) as covariates; that is, p = 10. In order to study the performance on small-sample datasets, we delete the missing values in the data and select the data samples in the first 200 time intervals from the Shunyi observation point in Beijing in 2017 for analysis. To assess the predictive performance, we use the first 140 samples as the training set, and the remaining 60 samples as the testing set. We compare the proposed method with the BP neural network, LSTM and Random Forest methods, similar to Section 4.
One of the main challenges in implementing the HMM is determining the optimal number of hidden states. The Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) are two common model-selection techniques, which select the best model by balancing fitting accuracy against model complexity. When selecting the number of hidden states for a Hidden Markov Model, both AIC and BIC evaluate multiple models containing different numbers of states: multiple HMMs are trained separately using different numbers of hidden states, then the AIC or BIC value is calculated for each model, and finally the model with the smallest AIC or BIC value is selected [34]. Similar to the work of Dofadar et al. [34], we use AIC and BIC to select the number of hidden states. The AIC used in this study is given by AIC = 2k − 2L, where k is the number of free parameters in the model and L is the log-likelihood value. The formula for k used in this research is k = n² + 2n − 1, where n is the current number of hidden states. The BIC used in this study is expressed as BIC = ln(T)k − 2L, where T is the total number of observations. To find the best number of hidden states, we calculate the AIC and BIC values for different numbers of hidden states: 2, 3, 4, and 5. The results are shown in Figure 3, which shows that when the number of hidden states is 3, the AIC and BIC values are the smallest, indicating that choosing 3 hidden states is closest to the real model. Therefore, we set the number of hidden states K = 3.
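The selection rule above can be sketched directly. The log-likelihood values in the example call below are hypothetical, chosen only to illustrate the mechanics; k = n² + 2n − 1 is the free-parameter count used in the text.

```python
import numpy as np

def select_num_states(log_liks, T):
    """Choose the number of hidden states minimizing AIC and BIC.

    log_liks maps each candidate state count n to the fitted log-likelihood L.
    """
    aic, bic = {}, {}
    for n, L in log_liks.items():
        k = n ** 2 + 2 * n - 1            # free-parameter count used in the text
        aic[n] = 2 * k - 2 * L            # AIC = 2k - 2L
        bic[n] = np.log(T) * k - 2 * L    # BIC = ln(T)k - 2L
    return min(aic, key=aic.get), min(bic, key=bic.get)

# hypothetical log-likelihoods for n = 2..5 candidate state counts, T = 200 observations
best_aic, best_bic = select_num_states({2: -520.0, 3: -450.0, 4: -445.0, 5: -443.0}, T=200)
```

Because BIC's penalty ln(T)k grows with the sample size while AIC's stays at 2k, BIC tends to favor fewer states than AIC on long series; in this hypothetical example both criteria agree.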
Similar to Section 4, we calculate MAPE, RMSE, MAE and R² to evaluate the predictive performance. Since the time series data are positively skewed, MAE and MASE are the best evaluation metrics for evaluating model performance [35]. The results are shown in Table 5. From Table 5, we can see that the MAPE and MAE of the proposed HDVBHMM method are smaller than those of the other methods, and the R² value of the proposed method is larger than those of the other methods, indicating that the HDVBHMM method performs best. Among the three competing methods, the MAPE and MAE values of the Random Forest method are the lowest, but they are still much larger than those of the proposed method. The BP method performs worst among the four methods, with MAE = 27.570 and MAPE = 1.025.
To better illustrate the predictive performance, Figure 4 shows the true data and the predicted values of the four methods on the testing set. From Figure 4, we can see that in the first 30 time points, the proposed method fits the true values very well. In the second set of 30 time points, as the prediction horizon increases, the predicted values exhibit a slight error, but they are still better than those of the other methods. Overall, the prediction accuracy of the proposed method is much better than that of the other methods over both short and long time periods. In the LSTM method, the learning rate is lr = 0.001, the number of training epochs is set to 50, and the size of the hidden layer is set to 40; the BP method contains a hidden layer of 12 neurons, and the maximum number of iterations is set to 10,000. The hyperparameters r, δ in the HDVBHMM method were set to 1.0. The bold results are the optimal ones among the four methods.
The estimated values of β corresponding to the three states are shown in Table 6. From Table 6, we can see that PM10, SO2, TEMP (temperature), DEWP (dew point temperature), RAIN (precipitation), and WSPM (wind speed) have the greatest influence on PM2.5 emissions in state 1. PM10, SO2, TEMP, and DEWP are the four factors that have a negative effect on PM2.5 emissions in the area: as these four factors increase, PM2.5 emissions decrease. Meanwhile, RAIN and WSPM have a positive effect on PM2.5 in the area. Rainfall and high wind speed may increase PM2.5 concentrations through physical effects (such as windblown dust). The prediction formula of PM2.5 in state 1 is as follows: In addition, PM10, SO2, NO2, TEMP, DEWP, RAIN, and WSPM have the largest effect on PM2.5 in state 2. The results show that in state 2, some chemical reactions led to the depletion of gases such as SO2 and NO2, which reduced the production of PM2.5, and rainfall also reduced the production of PM2.5. The high wind speed led to an increase in the PM2.5 concentration, probably because it increased the diffusion and transport of particulate matter. The prediction formula of PM2.5 in this state is as follows: PM10, SO2, NO2, O3, TEMP, PRES (pressure), DEWP, RAIN, and WSPM have the greatest impact on PM2.5 in state 3. It is worth noting that increases in variables such as SO2 and NO2 lead to an increase in the PM2.5 concentration. In addition, the significantly positive coefficient for temperature indicates that higher temperatures promote the formation of PM2.5, which may be related to the acceleration of certain chemical reactions at high temperatures. The increases in SO2 and NO2 may promote the formation of secondary particulate matter, which in turn increases the PM2.5 concentration. Wind speed increases particulate dispersion, and rainfall may also promote the formation of secondary particulate matter from some soluble substances. The prediction formula of PM2.5 in this state is as follows: In summary, the regression coefficients for the three states reflect the effects of different environmental factors on PM2.5 concentrations. The signs and magnitudes of these coefficients can suggest how to manage and predict PM2.5 concentrations by controlling these environmental factors under different environmental conditions. In particular, the fact that temperature, rainfall, and wind speed have different effects on PM2.5 concentrations in different states suggests that PM2.5 management needs to take into account complex meteorological conditions and interactions between air pollutants.
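The three state-specific prediction formulas all share the same linear form: given the decoded state ẑ_t and the state-specific coefficients, the prediction is ŷ_t = x_tᵀ β_{ẑ_t}. A minimal sketch of this state-switching prediction, where the coefficients and covariates are hypothetical placeholders, not the fitted values reported in Table 6:

```python
# State-dependent linear prediction: y_hat_t = x_t . beta[z_t].
# All numbers below are illustrative placeholders, not the Table 6 estimates.
beta = {
    1: [-0.4, -0.2, 0.3],   # state 1 coefficients (e.g. PM10, TEMP, RAIN)
    2: [-0.1, -0.5, 0.2],   # state 2 coefficients
    3: [0.6, 0.4, 0.1],     # state 3 coefficients
}

def predict(x_seq, z_seq):
    """Predict the response at each time from covariates and decoded states."""
    return [sum(b * x for b, x in zip(beta[z], x_t))
            for x_t, z in zip(x_seq, z_seq)]

x_seq = [[1.0, 0.5, 0.0], [0.2, 1.0, 0.3]]  # hypothetical covariate vectors
z_seq = [1, 3]                              # decoded hidden states
print(predict(x_seq, z_seq))                # one prediction per time point
```

The switching of β with the hidden state is exactly what lets the same covariate (e.g. temperature) carry a different sign in different states.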

Conclusions
In this paper, variable selection for high-dimensional HMMs is studied based on variational inference. We develop a Variational Bayesian method with the double-exponential shrinkage prior for variable selection. The proposed method can quickly and accurately estimate the posterior distributions and the unknown parameters. In the simulation studies and real data analysis, the proposed method outperforms the common methods in terms of variable selection and prediction. In the Beijing Multi-Site Air-Quality analysis, we select the optimal number of hidden states based on the AIC and BIC, and fit the HMMs of the response variable PM2.5. In the current research work, we investigate variational inference for linear HMMs with high-dimensional covariates; that is, the mean of the response variable is linear in the high-dimensional covariates. Many relationships between variables in practical applications may not be linear, so variational inference for nonlinear HMMs is worth studying. In addition, this study assumes that the observation variances are the same across the hidden states, but in practical applications, heteroskedasticity may be more in line with real-world data characteristics. For that reason, the heteroskedasticity issue for HMMs is also worth exploring deeply. Moreover, Ivan Gorynin's work [36] verifies that the Pairwise Markov Model (PMM) outperforms the traditional HMM in terms of accuracy when the observed variable y is highly autocorrelated or when the hidden chain is not Markovian. Unlike the HMM, which assumes that the hidden chain z is Markovian, the PMM assumes that the pair (z, y) is Markovian. Since hidden chains are not necessarily Markovian in the PMM, it is more general than the HMM. Parameter estimation for PMMs via Variational Bayesian methods is carried out in the work of Katherine Morales [37]. However, the effect of including the covariates x on the target variable y was not considered in their work. Therefore, as an extension of the proposed method, which replaces the HMM with the PMM, the inclusion of high-dimensional covariates in the PMM may yield more accurate predictions.
We derive the conditional posterior distribution of A as: for j = 1, . . ., K. According to Equation (10), the variational posterior distribution of A_j is given by: Similarly, we derive the conditional posterior distribution of σ² as: The variational posterior distribution of σ² is given by: We derive the conditional posterior distribution of λ² as: The variational posterior distribution of λ² is given by: We derive the conditional posterior distribution of τ²_m. Note that since the variational posterior of τ²_m is difficult to obtain, we derive the variational posterior of 1/τ²_m as: The variational posterior distribution of τ²_m is given by: where We derive the conditional posterior distribution of β as: The variational posterior distribution of β is given by: where We derive the conditional posterior distribution of π as: The variational posterior distribution of π is given by: Finally, we derive the variational posterior of z. Based on the dependencies of the hidden states, we divide the variational posterior of z into the following three parts.
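The 1/τ²_m step above rests on the standard scale-mixture representation of the double-exponential shrinkage prior: a Laplace density is an exponential mixture of zero-mean normals, (λ/2)e^{−λ|β|} = ∫ N(β; 0, τ²) · (λ²/2)e^{−λ²τ²/2} dτ². This identity can be checked numerically; a minimal sketch with illustrative values of λ and β and simple trapezoidal quadrature:

```python
import math

lam = 1.3   # shrinkage parameter lambda (illustrative value)
b = 0.7     # point at which to evaluate the density

# Laplace (double-exponential) density: (lam / 2) * exp(-lam * |b|).
laplace = 0.5 * lam * math.exp(-lam * abs(b))

# Mixture: beta | tau2 ~ N(0, tau2), tau2 ~ Exp(rate = lam^2 / 2).
def integrand(tau2):
    normal = math.exp(-b * b / (2.0 * tau2)) / math.sqrt(2.0 * math.pi * tau2)
    expo = 0.5 * lam * lam * math.exp(-0.5 * lam * lam * tau2)
    return normal * expo

# Trapezoidal quadrature over tau2 in (0, 60]; the integrand vanishes at 0.
n, hi = 200000, 60.0
h = hi / n
total = 0.5 * integrand(hi) * h
for i in range(1, n):
    total += integrand(i * h) * h

print(laplace, total)  # the two density values agree closely
```

This mixture form is what makes the conditional posteriors of β and 1/τ²_m tractable within the mean-field updates.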
We derive the conditional posterior distribution of z_1 as: The variational posterior distribution of z_1 is given by: We derive the conditional posterior distribution of z_t for t = 2, . . ., T − 1 as: The variational posterior distribution of z_t for t = 2, . . ., T − 1 is given by: We derive the conditional posterior distribution of z_T as: The variational posterior distribution of z_T is given by:
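The three-part structure of the q(z) update can be sketched as one coordinate pass: z_1 couples only forward to z_2, interior states couple to both neighbors, and z_T couples only backward. In the sketch below, Elogpi, ElogA, and logB stand in for the variational expectations of the log initial probabilities, log transition probabilities, and log emission terms; the numbers are illustrative, not the model's actual expectations:

```python
import math

def normalize(logs):
    """Exponentiate and normalize a vector of log-weights (stably)."""
    m = max(logs)
    w = [math.exp(v - m) for v in logs]
    s = sum(w)
    return [v / s for v in w]

def update_q_z(Elogpi, ElogA, logB, q):
    """One coordinate pass over q(z_1), ..., q(z_T) given the current q."""
    T, K = len(logB), len(Elogpi)
    # q(z_1): initial distribution, emission, and expected link to z_2.
    q[0] = normalize([Elogpi[k] + logB[0][k]
                      + sum(q[1][j] * ElogA[k][j] for j in range(K))
                      for k in range(K)])
    # q(z_t), 1 < t < T: expected links to both neighbors plus emission.
    for t in range(1, T - 1):
        q[t] = normalize([sum(q[t - 1][j] * ElogA[j][k] for j in range(K))
                          + logB[t][k]
                          + sum(q[t + 1][j] * ElogA[k][j] for j in range(K))
                          for k in range(K)])
    # q(z_T): expected link from z_{T-1} plus emission only.
    q[T - 1] = normalize([sum(q[T - 2][j] * ElogA[j][k] for j in range(K))
                          + logB[T - 1][k]
                          for k in range(K)])
    return q

# Toy run with K = 2 states and T = 4 time points (all values illustrative).
Elogpi = [math.log(0.6), math.log(0.4)]
ElogA = [[math.log(0.7), math.log(0.3)], [math.log(0.2), math.log(0.8)]]
logB = [[-0.5, -2.0], [-0.4, -1.8], [-2.2, -0.3], [-2.5, -0.2]]
q = [[0.5, 0.5] for _ in range(4)]
q = update_q_z(Elogpi, ElogA, logB, q)
print(q)  # each row is a distribution over the K states
```

Each row of q remains a proper probability vector after the pass, which is what the three normalized variational posteriors above guarantee.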

Appendix A.4. Sensitivity Analysis Results for Different Hyperparameter Settings
To further understand whether the algorithm is sensitive to the choice of the hyperparameters r and δ, we conduct experiments on simulated data with a sample size of 300 and a covariate dimension of 30, similar to Section 4. We consider three different hyperparameter settings: the first r = 0.5, δ = 0.5; the second r = 1.0, δ = 1.0; and the third r = 1.5, δ = 1.5. Table A1 reports the mean of the four metrics under these three settings, with standard deviations in parentheses, based on 10 simulations under T = 300 and p = 30. The experimental results show that the estimation results are not sensitive to the choice of the two hyperparameters r and δ. The densities of the Gamma distributions for the three hyperparameter settings are very similar in shape. This implies that the performance of the proposed method is not sensitive to these hyperparameters within a certain range of variations in the r and δ values.
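The Gamma(r, δ) densities under the three settings can be tabulated directly to compare their shapes. A minimal sketch (all three settings share mean r/δ = 1, which is one way to see why the resulting fits are close):

```python
import math

def gamma_pdf(x, r, delta):
    """Density of the Gamma distribution with shape r and rate delta."""
    return (delta ** r) * (x ** (r - 1)) * math.exp(-delta * x) / math.gamma(r)

settings = [(0.5, 0.5), (1.0, 1.0), (1.5, 1.5)]
grid = [0.1 * i for i in range(1, 61)]  # evaluation points on (0, 6]
for r, d in settings:
    vals = [round(gamma_pdf(x, r, d), 3) for x in grid[:5]]
    print(f"Gamma(r={r}, delta={d}): mean={r / d}, density near 0: {vals}")
```

Plotting these three curves reproduces the density comparison referred to in the sensitivity analysis.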

Algorithm 1
Main iteration steps of MC-CAVI. Input: the number of iteration cycles T and the number of Monte Carlo samples N.
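Within each of the T MC-CAVI cycles, the N Monte Carlo samples approximate variational expectations that have no closed form. A generic sketch of this inner step, where the Gaussian factor and the target function are illustrative stand-ins rather than the model's actual factors:

```python
import random

random.seed(0)  # fixed seed so the run is reproducible

def mc_expectation(sample_q, f, N):
    """Approximate E_q[f(x)] with N Monte Carlo draws from the factor q."""
    return sum(f(sample_q()) for _ in range(N)) / N

# Illustrative factor q = N(mu, sigma^2) and target f(x) = x^2,
# whose exact expectation is mu^2 + sigma^2.
mu, sigma, N = 1.0, 0.5, 100000
estimate = mc_expectation(lambda: random.gauss(mu, sigma),
                          lambda x: x * x, N)
print(estimate)  # close to mu^2 + sigma^2 = 1.25
```

In MC-CAVI, such Monte Carlo averages replace the exact expectations inside the otherwise standard coordinate-ascent updates.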

Figure 1 .
Figure 1. Box plots of the estimator values of A, β1, β2, and β3 based on 10 experiments under p = 30 and T = 300. The horizontal coordinate is the index of the variables, and the vertical coordinate is the value of the estimators.

Figure 2 .
Figure 2. Box plots of the estimator values of A, β1, β2, and β3 based on 10 experiments under p = 90 and T = 600. The horizontal coordinate is the index of the variables, and the vertical coordinate is the value of the estimators.

Figure 3 .
Figure 3. AIC and BIC values when the number of hidden states is 2, 3, 4, and 5 on the real dataset.

Figure A1 .
Figure A1. Box plots of the estimator values of A, β1, β2, and β3 based on 50 experiments under p = 20 and T = 200. The horizontal coordinate is the index of the variables, and the vertical coordinate is the value of the estimators.

Figure A2 .
Figure A2. Box plots of the estimator values of A, β1, β2, and β3 based on 50 experiments under p = 40 and T = 200. The horizontal coordinate is the index of the variables, and the vertical coordinate is the value of the estimators.

Appendix A.3. Figure A3 .
Figure A3. Box plots of the estimator values of A, β1, β2, and β3 based on 10 experiments under p = 60 and T = 600. The horizontal coordinate is the index of the variables, and the vertical coordinate is the value of the estimators.

Figure A4 .
Figure A4. Box plots of the estimator values of A, β1, β2, and β3 based on 10 experiments under p = 120 and T = 600. The horizontal coordinate is the index of the variables, and the vertical coordinate is the value of the estimators.

Table 1 .
Average values of the four metrics of all approaches, with the standard deviation in parentheses, based on 10 simulations under T = 300. In the Long Short-Term Memory (LSTM) method, the learning rate is lr = 0.001, the number of training cycles (epochs) is set to 50, and the size of the hidden layer is set to 10. The hidden layer of the BP method consists of 20 neurons, and the maximum number of iterations is set to 10,000. In the random forest regression model, the number of trees is set to 100. The bold results are the optimal ones among the four methods.

Table 2 .
Average values of the RMSE and Bias of A and β based on 10 replications in Experiment 1.

Table 3 .
Average values of the four metrics of all approaches, with the standard deviation in parentheses, based on 10 simulations under T = 600.

Table 4 .
Average values of the RMSE and Bias of A and β based on 10 replications in Experiment 2.

Table 5 .
Prediction performance of the four methods on the testing data.

Table 6 .
Estimates of the regression coefficients β for each hidden state.