Bayesian Feature Extraction for Two-Part Latent Variable Model with Polytomous Manifestations

Semi-continuous data are very common in the social sciences and economics. In this paper, a Bayesian variable selection procedure is developed to assess the influence of observed and/or unobserved exogenous factors on semi-continuous data. Our formulation is based on a two-part latent variable model with polytomous responses. We consider two schemes for penalizing the regression coefficients and factor loadings: a Bayesian spike and slab bimodal prior and a Bayesian lasso prior. Within the Bayesian framework, we implement a Markov chain Monte Carlo sampling method to conduct posterior inference. To facilitate posterior sampling, we recast the logistic model from Part One as a normal-mixture model with latent Pólya-Gamma weights. A Gibbs sampler is designed to draw observations from the posterior. Our empirical results show that, with suitable values of the hyperparameters, the spike and slab bimodal method slightly outperforms the Bayesian lasso in the current analysis. Finally, a real example related to the Chinese Household Financial Survey is analyzed to illustrate the application of the methodology.


Introduction
Semi-continuous data, which are characterized by excessive zeros, are very common in the fields of social sciences and economics. A typical example is given by [1] in an analysis of medical expenditures, in which the zeros correspond to a subpopulation of patients who do not use health services, while the positive values describe the actual levels of expenditures among users. For understanding this type of data structure, a two-part model [2] is a widely appreciated statistical method. The basic assumption of a two-part model is that the overall model consists of two processes: a binary process (Part One) and a continuous positive-valued process (Part Two). The binary process, usually formulated within a logistic or probit regression model, indicates whether the items have been responded to or not, while the continuous process, conditional on the binary process, describes the actual levels of the responses (see, e.g., [3]). By combining the two processes into one, a two-part model provides a unified and flexible way to describe various relationships underlying semi-continuous data. Two-part models have now been widely used for health services [4-6], medical expenditures [1,7-10], household finances [11], substance use studies [12,13], and genome analysis [14].
A traditional two-part model usually formulates exogenous explanatory factors as fixed and observed. However, in many real applications, especially in social surveys, many unobserved/latent and random factors also have important impacts on the outcome variable(s). This fact was revealed by [15] in a study of children's aggressive behavior. Ref. [15] noted that two factors, the propensity to engage in aggressive behavior and the propensity to have highly aggressive activity levels, had significant influence on children's aggressive behavior. The authors incorporated these two latent factors into their analysis and established a two-component two-part mixture model to identify the heterogeneity of the population. Ref. [16] noticed that in China, the financial literacy of a family had a non-ignorable effect on the desire to hold financial debts and also affected the amount of financial debt being held. They suggested conducting a joint analysis of latent factors and observed covariates in a two-part regression model, with the latent factors further manifested by multiple binary measurements via a factor analysis model. Ref. [17] incorporated a two-part regression model into a general latent variable model framework and analyzed the internal relationships between multiple factors longitudinally. These methods have brought significant attention to two-part models in behavioral science, economics, psychology, and medicine in recent years: see, for example, [14,18,19] and references therein for further developments of two-part models.
In an analysis of semi-continuous data, an important issue is to determine which explanatory factors are helpful for improving model fit. This issue is especially pressing when the number of exogenous factors is large, since the commonly used forward and backward regression procedures are extremely time-consuming. Lassos and their extensions [20-27] are currently the most commonly used methods for feature extraction. A typical feature of these methods is to put suitable penalties on the coefficients and shrink many coefficients to zero, thus performing variable selection. Recently, these penalization/regularization approaches have been applied widely for prediction and prognosis (see, for example, [28,29]). Though appealing, lasso-type regularization methods suffer from some limitations. For example, most contributions are developed within the frequentist framework, and their performance heavily depends on large sample theory (see, for example, [20,21,26,27] and references therein). This also readily leads to computational difficulty in the analysis of mixed data. An alternative is to conduct variable selection within the Bayesian framework. Statisticians have introduced hierarchical models with mixed spike and slab priors that can adaptively determine the amount of shrinkage [30,31]. The spike and slab prior is the fundamental basis for most Bayesian variable selection approaches and has proved remarkably successful [30-36]. Recently, Bayesian spike and slab priors have been applied to predictive modeling and variable selection in large-scale genomic studies: see [37] for a brief review. Nevertheless, model selection has never been considered in a two-part regression model with latent variables. In this study, we combine a spike and slab prior and a Bayesian lasso with a two-part latent variable model, which, to our knowledge, is a first attempt for this model.
Our formulation is more along the lines of the spike and slab bimodal prior in [34] and the Bayesian lasso in [38]. We formulate the problem by specifying a normal distribution with mean zero for the regression coefficient or factor loading of interest. The probability of the related variable being excluded or included is governed by the variance. To model the shrinkage of coefficients properly, we consider two schemes for the variance parameter. One is a two-point mixture model with one component located at a point close to zero and the other component situated at a point far away from zero; the mixing proportion is governed by a beta distribution with suitable hyperparameters. The other scheme uses a Bayesian lasso, for which the variance is specified via a gamma distribution that is scaled by the penalty parameters. The two schemes are unified into a hierarchical mixture model. Within the Bayesian paradigm, we develop a fully Bayesian selection procedure for the two-part latent variable model. We resort to the Markov chain Monte Carlo sampling method: a Gibbs sampler is used to draw observations from the posterior, and we obtain all full conditional distributions. Posterior analysis is carried out based on the simulated observations. We investigate the performance of the proposed methods via a simulation study and a real example. Our empirical results show that the two schemes lead to similar variable selection results, but the spike and slab bimodal prior with suitable hyperparameters slightly outperforms the Bayesian lasso in terms of the correct rate.
The remainder of this paper is organized as follows. Section 2 introduces the proposed model for semi-continuous data with latent variables. Section 3 develops an MCMC sampling algorithm for the proposed model; Bayesian inference procedures, including parameter estimation and model assessment, are also presented in this section. In Section 4, we present the results of a simulation study to assess the performance of the proposed methodology and illustrate the practical value of our proposed model by analyzing household finance debt data. Section 5 concludes the paper with a discussion. Some technical details are given in Appendix A.

Model Description
In Section 2.1, a basic formulation for analyzing semi-continuous data with latent variables is presented. Section 2.2 presents a Bayesian procedure for feature extraction.

Two-Part Latent Variable Model
Suppose that for i = 1, ..., n, s_i is a semi-continuous outcome variable that takes a value in [0, ∞), and x_i is a generic vector of r fixed covariates representing the collection of observed explanatory factors of interest. We assume that each x_ij in x_i is standardized in the sense that ∑_{i=1}^n x_ij = 0 and ∑_{i=1}^n x_ij² = 1 for j = 1, ..., r. Moreover, we include m latent/unobserved variables ω_i = (ω_i1, ..., ω_im)^T in the analysis to account for the unobserved heterogeneity of responses. Conceptually, these latent variables can be covariates that are not directly observed or a synthesis of some highly correlated explanatory items with noise. Inclusion of latent variables can improve model fit and strengthen the power of model interpretation: see [39] for more discussion of latent variables in a general setting. To deal with the spike of s_i at zero, we follow the common routine in the literature (see, for example, [10,12]) and identify s_i with two surrogate variables: u_i = I{s_i > 0} and z_i = log(s_i | s_i > 0), where I(A) denotes the indicator function of the set A. That is, we separate the whole dataset into two parts: one part is the binary dataset that corresponds to the response-to-nonresponse indicators of the subjects, and the other part is the set of logarithms of the positive values. Our interest focuses on exploring the effects of the exogenous factors on the two parts.
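As a concrete illustration of this decomposition, the following pure-Python sketch standardizes a covariate column to the stated convention and splits a semi-continuous outcome into the surrogate pair (u_i, z_i); the helper names are ours, not the paper's:

```python
import math

def standardize(col):
    """Center to mean zero and scale so the squared entries sum to one,
    matching the convention sum_i x_ij = 0 and sum_i x_ij^2 = 1."""
    n = len(col)
    mean = sum(col) / n
    centered = [v - mean for v in col]
    norm = math.sqrt(sum(v * v for v in centered))
    return [v / norm for v in centered]

def two_part_split(s):
    """Identify each semi-continuous outcome s_i with the surrogate pair
    (u_i, z_i): u_i = I{s_i > 0}, and z_i = log(s_i) only when s_i > 0."""
    u = [1 if v > 0 else 0 for v in s]
    z = [math.log(v) for v in s if v > 0]
    return u, z

s = [0.0, 3.5, 0.0, 1.0, 12.2]
u, z = two_part_split(s)
# u carries the binary part; z holds log-amounts for the positive part only.
```

Note that z has fewer entries than u: the continuous part is defined only for the responders, which is exactly why the likelihood in Part Two runs over the index set {i : u_i = 1}.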
We assume that u_i and z_i satisfy the following sampling models:

	u_i | x_i, ω_i ∼ Bernoulli(p_i), logit(p_i) = η_i^u = α + β_x^T x_i + β_ω^T ω_i, (1)
	z_i | x_i, ω_i ∼ N(γ + ψ_x^T x_i + ψ_ω^T ω_i, σ²), (2)

in which α and γ are scalar intercept parameters, β_x and ψ_x are vectors of regression coefficients, β_ω and ψ_ω are vectors of factor loadings, σ² is the scale, and 'T' is the transpose operator of a vector or matrix. For compactness, we write w_i = (x_i^T, ω_i^T)^T for the complete vector of explanatory variables. Note that Equation (1) can be represented as

	p(u_i | η_i^u) = exp(u_i η_i^u) / (1 + exp(η_i^u)), (3)

and we refer to it as the logistic model. The involvement of latent variables apparently complicates the model and readily leads to model identification problems [40,41]. This is especially true when the dimension of ω_i is high. In this case, auxiliary information is required to manifest ω_i further. Among various available constructs, we consider a latent variable (LV) [40,41] approach. A basic assumption of the LV approach is that there exist, say, p manifestations y_i = (y_i1, ..., y_ip)^T, of which each y_ij may be continuous, counted, or categorical, and we assume that they satisfy the following link equation:

	y_i = F(ω_i, ϵ_i; φ), (4)

where F is a known and fixed link function, ϵ_i is a vector of errors used to identify the idiosyncratic part of y_i that cannot be explained by ω_i, and φ is a vector of unknown parameters used to quantify the uncertainty of the model. The information about ω_i is manifested by y_i via F. In this paper, in view of real applications, we consider p ordered categorical variables y_i = (y_i1, ..., y_ip)^T, for which y_ij takes a value in {0, 1, ..., c_j} (c_j > 1) and satisfies the following link model:

	y_ij = ℓ if δ_{j,ℓ} ≤ y*_ij < δ_{j,ℓ+1}, ℓ = 0, 1, ..., c_j, (5)

where δ_{j,0} < δ_{j,1} < ... < δ_{j,c_j} < δ_{j,c_j+1} are threshold parameters satisfying δ_{j,0} = −∞ and δ_{j,c_j+1} = +∞, and y*_i = (y*_i1, ..., y*_ip)^T is the vector of latent responses satisfying the factor analysis model:

	y*_i = µ + Λω_i + ϵ_i, ω_i ∼ N_m(0, Φ), ϵ_i ∼ N_p(0, I_p), (6)

where µ is the p-dimensional intercept vector, Λ is the p × m factor loading matrix, and I_p is an identity matrix of order p. We assume that, conditional on ω_i, s_i and y_i are independent.
We refer to the model specified by (1), (2), and (5) associated with (6) as the two-part latent variable model with polytomous responses. It provides a unified framework to explore the dependence among binary, continuous, and categorical data simultaneously. The dependence between them results from sharing common factors or latent variables. If ω_i is degenerate at zero or the factor loadings are taken as zeros, the dependence among them disappears, and the overall model reduces to a traditional two-part model and an ordinal regression model.
To facilitate efficient calculation, and motivated by the key identity in [42] (see Equation (2) in their seminal paper), we express model (3) as a mixture model of the form

	p(u_i | η_i^u) = (1/2) exp(κ_i η_i^u) ∫_0^∞ exp(−u*_i (η_i^u)²/2) p_PG(u*_i) du*_i,

where κ_i = u_i − 0.5 and p_PG(u) is the standard Pólya-Gamma probability density function. If we introduce auxiliary variables u*_i and augment them with u_i, then Equation (3) can be considered the marginal density of the joint distribution

	p(u_i, u*_i | η_i^u) = (1/2) exp(κ_i η_i^u − u*_i (η_i^u)²/2) p_PG(u*_i).

Note that the exponential part is the kernel of a normal density function with respect to η_i^u. Hence, it admits conjugate full conditional distributions for all regression coefficients, factor loadings, and factor variables, leading to straightforward Bayesian computation.
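The Pólya-Gamma weights entering this augmentation can be simulated, for illustration, from the truncated infinite-sum representation of the PG(1, c) distribution. This is only a sketch: the truncation level K is an arbitrary choice of ours, and a production Gibbs sampler would use an exact method such as the one in [42].

```python
import math, random

def sample_pg1(c, K=200, rng=random):
    """Approximate draw from the Polya-Gamma PG(1, c) distribution via its
    truncated infinite-sum representation:
      omega = (1 / (2 pi^2)) * sum_k g_k / ((k - 1/2)^2 + c^2 / (4 pi^2)),
    with g_k ~ Exp(1).  (Exact samplers exist; this sketch truncates at K.)"""
    total = 0.0
    for k in range(1, K + 1):
        g = rng.expovariate(1.0)
        total += g / ((k - 0.5) ** 2 + c ** 2 / (4 * math.pi ** 2))
    return total / (2 * math.pi ** 2)

random.seed(1)
c = 1.0
draws = [sample_pg1(c) for _ in range(4000)]
approx_mean = sum(draws) / len(draws)
# Known moment: E[PG(1, c)] = tanh(c/2) / (2c), a useful sanity check.
exact_mean = math.tanh(c / 2) / (2 * c)
```

In the Gibbs sampler, one such weight u*_i would be drawn per observation at each sweep, with c set to the current value of |η_i^u|.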
Let U = {u_i}_{i=1}^n, Z = {z_i}_{i=1}^n, and Y = {y_i}_{i=1}^n be the sets of observed variables. We write Ω = {ω_i}_{i=1}^n for the collection of factor variables, and U* = {u*_i}_{i=1}^n and Y* = {y*_i}_{i=1}^n for the sets of latent response variables. The complete data likelihood is then available in closed form, in which I = {i : u_i = 1} is the set of indices of the positive responses, δ = {δ_jℓ} is the set of threshold parameters, and θ = {α, β, γ, ψ, σ², µ, Λ, Φ, δ} is the vector of unknown parameters. For the moment, we assume that the components θ_j of θ are all free.

Bayesian Feature Selection
Generally speaking, the regression variables x_i and factor variables ω_i may not impact u_i and z_i simultaneously, and some redundant variables may exist. The presence of redundant variables not only decreases the model fit but also weakens the power of model interpretation. Therefore, it is necessary to determine which regression coefficients or factor loadings are significantly far from zero. In the context of frequentist inference, this issue is generally tackled via stepwise regression, in which each variable is excluded or included according to the model fit. However, the situation becomes complex when the number of independent variables is large. In this paper, we pursue a Bayesian variable selection procedure. To this end, we follow [38] and assume

	β ∼ N_q(0, diag{γ²_βk}), ψ ∼ N_q(0, diag{γ²_ψk}), (10)

in which we use diag{a_k} to represent a diagonal matrix with k-th diagonal element a_k, and we let q = r + m. That is, we assume that each β_k in β (and similarly each ψ_k) is centered at zero (or equivalently, each w_ik is excluded from w_i), but the probability is governed by the variance γ²_βk. If γ²_βk is close to zero, then the probability of β_k taking zero increases and w_ik tends to be excluded; conversely, if γ²_βk is large, then the probability of β_k being zero is small and w_ik tends to be retained. As a result, the value of γ²_βk plays a key role in determining whether w_k is relevant enough to be selected in Part One. With this in mind, a reasonable assumption about γ²_βk and γ²_ψk is that

	γ²_βk ∼ w_β δ_{η²_βk}(·) + (1 − w_β) δ_{ν_β0 η²_βk}(·), (11)
	γ²_ψk ∼ w_ψ δ_{η²_ψk}(·) + (1 − w_ψ) δ_{ν_ψ0 η²_ψk}(·), (12)

where δ_a(·) is the Dirac measure concentrated at point a, w_β is a random weight used to measure the degree of similarity between γ²_βk and η²_βk, η²_βk is a hyperparameter used to represent how far β_k is away from zero (the 'slab'), and ν_β0 is a previously specified small positive value used to identify the 'spike' of β_k at zero. In other words, every γ²_βk is assumed to be equal to η²_βk with probability w_β and equal to ν_β0 η²_βk with probability 1 − w_β. The same holds for w_ψ, η²_ψk, and ν_ψ0. To model w_β and w_ψ properly, we assign the following beta distributions to them:

	w_β ∼ Beta(a_β, b_β), w_ψ ∼ Beta(a_ψ, b_ψ), (13)

where a_β, a_ψ, b_β, and b_ψ are hyperparameters used to control the shape of the beta density, that is, to determine the magnitude of the weights in (0, 1). For example, if a_β in Equation (13) is small and b_β is large, then Equation (13) encourages w_β to take a small value with high probability. Conversely, a large a_β and a small b_β encourage w_β to take a large value in (0, 1). In the case a_β = b_β = 1.0, Equation (13) reduces to a uniform distribution on (0, 1), so that every value in (0, 1) is equally likely for w_β. In real applications, if no information is available, one can assign these values in a manner that ensures the beta distribution is diffuse enough. Finally, to measure the magnitudes of the 'slab' in the distributions of β_k and ψ_k, we specify gamma distributions for η^{−2}_βk and η^{−2}_ψk, or equivalently,

	η²_βk ∼ IG(τ_β0, ζ_β0), η²_ψk ∼ IG(τ_ψ0, ζ_ψ0), (14)

where 'IG(τ, ζ)' signifies the inverse gamma distribution with mean ζ/(τ − 1) for τ > 1 and variance ζ²/((τ − 1)²(τ − 2)) for τ > 2; τ_β0, ζ_β0, τ_ψ0, and ζ_ψ0 are hyperparameters, which are treated as fixed and known. Similarly, one can assign values to them to ensure that (14) is dispersed enough. For example, we can follow the routines in [34] for ordinary regression analysis and set τ_β0 = τ_ψ0 = 1.0 and ζ_β0 = ζ_ψ0 = 0.05 to obtain dispersed priors. Note that Equations (11) and (12) can be formulated as a hierarchy as follows: for k = 1, ..., q,

	γ²_βk = f_βk η²_βk + (1 − f_βk) ν_β0 η²_βk, f_βk ∼ Bernoulli(w_β),
	γ²_ψk = f_ψk η²_ψk + (1 − f_ψk) ν_ψ0 η²_ψk, f_ψk ∼ Bernoulli(w_ψ),

where f_βk and f_ψk are latent binary variables. Such a formulation aims to separate η²_βk and η²_ψk from the distributions (11) and (12) to facilitate posterior sampling. It is instructive to compare the proposed method with the Bayesian lasso [38], in which the variance parameters γ²_βk and γ²_ψk in Equation (10) are specified via exponential distributions as follows:

	γ²_βk ∼ Exp(λ²_βk/2), (17)
	γ²_ψk ∼ Exp(λ²_ψk/2), (18)

where λ²_β = (λ²_β1, ..., λ²_βq)^T and λ²_ψ = (λ²_ψ1, ..., λ²_ψq)^T are the shrinkage/penalty parameters used to control the amount of shrinkage of β_k and ψ_k toward zero.
Modeling γ²_βk and γ²_ψk as in Equations (17) and (18) leads to marginal distributions of β_k and ψ_k that are Laplace distributions with location zero and rate λ_k (density proportional to exp(−λ_k|β_k|)). The penalty parameters λ²_βk and λ²_ψk are rather crucial for determining the amount of shrinkage of the parameters. Figure 1 presents plots of the densities of the Laplace distribution LA(λ) (λ > 0) across various choices of λ. It can be seen that the larger the value of λ is, the more peaked the density is, indicating a heavier penalty on the regression coefficient. Because of the key roles of λ²_β and λ²_ψ in Equations (17) and (18), we assign the following gamma priors to them:

	λ²_βk ∼ Ga(a_k0, b_k0), λ²_ψk ∼ Ga(c_k0, d_k0),

where 'Ga(ν, λ)' denotes a gamma distribution with mean ν/λ. As previously discussed, the values of a_k0, b_k0, c_k0, and d_k0 should be selected with care since they relate to the shrinkage directly. Similar to [43], one can set a_k0 = c_k0 = 1 and b_k0 = d_k0 = 0.05 to enhance the robustness of the inference. This routine is followed in our empirical study. Let F*_β = {f_βk}_{k=1}^q, F*_ψ = {f_ψk}_{k=1}^q, η²_β = {η²_βk}_{k=1}^q, and η²_ψ = {η²_ψk}_{k=1}^q. We treat ν_β0 and ν_ψ0 as known hyperparameters. Note that γ²_β and γ²_ψ are totally determined by F*_β, F*_ψ and η²_β, η²_ψ. In the following, we abbreviate the spike and slab bimodal prior as SS and the Bayesian lasso as BaLsso.
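The contrast between the two variance schemes can be sketched in a few lines. The function names and the particular values (w = 0.3, ν_0 = 0.001, λ² = 4) are illustrative assumptions only, not the paper's settings:

```python
import random

def draw_ss_variance(w, eta2, nu0, rng=random):
    """Spike-and-slab draw of gamma^2: with probability w take the slab eta2,
    otherwise the spike nu0 * eta2 (nu0 a small constant such as 0.001)."""
    return eta2 if rng.random() < w else nu0 * eta2

def draw_lasso_variance(lam2, rng=random):
    """Bayesian-lasso draw of gamma^2 ~ Exponential(rate = lam2 / 2); mixing
    a zero-mean normal over this variance yields a Laplace marginal for the
    coefficient."""
    return rng.expovariate(lam2 / 2.0)

random.seed(2)
ss = [draw_ss_variance(w=0.3, eta2=1.0, nu0=0.001) for _ in range(5000)]
la = [draw_lasso_variance(lam2=4.0) for _ in range(5000)]
slab_rate = sum(v == 1.0 for v in ss) / len(ss)   # should be near w = 0.3
lasso_mean = sum(la) / len(la)                    # E[gamma^2] = 2 / lam2
```

The draws make the qualitative difference visible: SS produces a two-point variance pattern (clearly "in" or "out"), while BaLsso produces a continuum of variances whose overall level is set by the penalty parameter.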

Prior Specification and MCMC Sampling
In view of the model complexity, we consider Bayesian inference. Priors for the unknown parameters are required to complete the Bayesian model specification. Following the usual convention, it is natural to assume that the parameters involved in the different models are independent.
For α, γ, and σ² in Parts One and Two, we assume that they are mutually independent and satisfy

	α ∼ N(α_0, σ²_α0), γ ∼ N(γ_0, σ²_γ0), σ² ∼ IG(a_0, b_0),

where α_0, σ²_α0, γ_0, σ²_γ0, a_0, and b_0 are fixed hyperparameters. Lastly, for the threshold parameters δ, without loss of generality, we assume that c_j, the number of categories of y_ij, is invariant across the subscript j and equals c. Moreover, we assume that p(δ) = ∏_{j=1}^p p(δ_j), where δ_j = (δ_jk) is the j-th row vector of δ. In the following, we suppress the subscript j in δ_jk for notational simplicity and write δ for δ_j.
Let F_0(·) be any strictly monotonically increasing and differentiable function on R with F_0(+∞) = 1 and F_0(−∞) = 0. For example, one can take F_0 = Φ(·/τ_0) for some τ_0 > 0, or the distribution function of Student's t-distribution with ν_0 degrees of freedom, where Φ(·) is the standard normal distribution function. To specify a prior for δ, we follow [45] and let p_ℓ = F_0(δ_ℓ) − F_0(δ_{ℓ−1}) for ℓ = 1, ..., c + 1. It is easily shown that this transformation is invertible. We first consider the following Dirichlet distribution for p = (p_1, ..., p_c)^T:

	p ∼ Dirichlet(η_1, ..., η_{c+1}),

where B(η_1, ..., η_{c+1}) is the multivariate beta function evaluated at η_1, ..., η_{c+1}, and η_j > 0. Then, by the inverse transformation formula, the joint distribution of δ is given by

	p(δ) ∝ ∏_{ℓ=1}^{c+1} [F_0(δ_ℓ) − F_0(δ_{ℓ−1})]^{η_ℓ − 1} ∏_{ℓ=1}^{c} f_0(δ_ℓ), (25)

where f_0(x) is the derivative of F_0(x) with respect to x. We call (25) the transformed Dirichlet prior and use it as the prior of δ. An advantage of working with (25) is that, conditional upon δ_{j−1} and δ_{j+1}, the transformed distribution of δ_j is a beta distribution, which allows componentwise Gibbs updates.
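A draw from the transformed Dirichlet construction can be sketched by sampling p via normalized gammas and pushing the cumulative sums through F_0^{-1}, here taken as the standard normal quantile function. This is an illustrative sketch under those assumptions, not the paper's implementation:

```python
import random
from statistics import NormalDist

def draw_thresholds(eta, rng=random):
    """Draw ordered cut-points via the transformed Dirichlet prior: sample
    p ~ Dirichlet(eta) (via normalized gammas), then map the cumulative sums
    through F0^{-1}, here the standard normal quantile function."""
    g = [rng.gammavariate(e, 1.0) for e in eta]
    total = sum(g)
    p = [v / total for v in g]
    cum, delta, f0_inv = 0.0, [], NormalDist().inv_cdf
    for pj in p[:-1]:          # the last increment is absorbed by F0(+inf) = 1
        cum += pj
        delta.append(f0_inv(cum))
    return delta               # delta_1 < delta_2 < ... by construction

random.seed(3)
delta = draw_thresholds([1.0] * 5)   # uniform Dirichlet, c = 4 interior cuts
```

Because the cumulative sums are strictly increasing in (0, 1) and F_0^{-1} is monotone, the resulting cut-points automatically satisfy the ordering constraint, which is the practical appeal of this prior.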

MCMC Sampling
With the priors given above, inference about θ is based on the posterior p_o(θ | U, Z, Y), which has no closed form. Motivated by the key idea in [46], we treat the latent quantities as missing data and augment them to the observed data to form the complete data; statistical inference is then carried out based on the complete data likelihood. To this end, apart from Ω, U*, and Y* mentioned before, we further let Q* be the collection of latent quantities involved in the specifications of β and ψ: Q* = {F*_β, F*_ψ, η²_β, η²_ψ, w_β, w_ψ} under SS and Q* = {λ²_β, λ²_ψ} under BaLsso. Rather than working with the posterior p_o directly, we consider the joint distribution p_joint(θ, Ω, U*, Y*, Q* | U, Z, Y), of which p_o can be considered the marginal. We use the Markov chain Monte Carlo (MCMC) [47,48] sampling method to simulate observations from this target distribution. In particular, a Gibbs sampler is implemented to draw observations iteratively from the full conditional distributions of the components of (θ, Ω, U*, Y*, Q*). Upon convergence, the posterior is approximated by the empirical distribution of the simulated observations. The convergence of the algorithm can be monitored by plotting the traces of the estimates for different starting values or by observing the values of the EPSR [49] of the unknown parameters. The technical details for implementing the MCMC sampling are given in Appendix A.
Simulated observations obtained from the blocked Gibbs sampler can be used for statistical inference via a straightforward analysis procedure. For example, the joint Bayesian estimates of the unknown parameters can be obtained via sample averaging as follows:

	θ̂ = (1/M) ∑_{m=1}^M θ^(m),

where {θ^(m) : m = 1, ..., M} are the simulated observations from the posterior. Consistent estimates of the covariance matrices of the estimates can be obtained via the sample covariance matrices.
The main purpose of introducing SS and BaLsso is to screen the variables in w_i. Unlike frequentist inference, Bayesian variable selection does not produce estimates of β and ψ that are exactly equal to zero, and hence it is necessary to determine which components can be treated as zero. This can be accomplished via posterior confidence intervals (PCIs) of β_j and ψ_j with confidence level 100(1 − α)%, where α is any previously specified value in (0, 1). Calculation of the PCI can be achieved via the Monte Carlo method. For example, let {β̂_j^(k) : k = 1, ..., K} be the K observations generated from the posterior distribution. Then the PCI of β_j with confidence level 100(1 − α)% is given by [β̂_{j,(Kα/2)}, β̂_{j,(K(1−α/2))}], where β̂_{j,(k)} is the k-th order statistic.
Another choice for variable determination in SS is based on the posterior probabilities of f_βj = 1 and f_ψj = 1, which can be approximated by the empirical frequencies f̄_βj = (1/K) ∑_k f^(k)_βj and f̄_ψj = (1/K) ∑_k f^(k)_ψj, where f^(k)_βj and f^(k)_ψj (k = 1, ..., K) are the K observations drawn from the posterior distribution via the Gibbs sampler. The variable w_j is selected in Part One if f̄_βj > 0.5 and in Part Two if f̄_ψj > 0.5.
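Both selection rules are easy to apply to stored MCMC draws. The following sketch (helper names and the simple order-statistic indexing are ours) illustrates the PCI rule and the inclusion-probability rule:

```python
def pci(draws, alpha=0.05):
    """100(1 - alpha)% posterior confidence interval from MCMC draws of one
    coefficient, via the empirical alpha/2 and 1 - alpha/2 order statistics."""
    s = sorted(draws)
    k = len(s)
    lo = s[max(0, int(k * alpha / 2) - 1)]
    hi = s[min(k - 1, int(k * (1 - alpha / 2)) - 1)]
    return lo, hi

def select_by_pci(draws, alpha=0.05):
    """Treat a coefficient as nonzero when its PCI excludes zero."""
    lo, hi = pci(draws, alpha)
    return not (lo <= 0.0 <= hi)

def select_by_inclusion(f_draws, cutoff=0.5):
    """SS alternative: keep the variable when the posterior frequency of the
    binary slab indicator f = 1 exceeds the cutoff (0.5 in the text)."""
    return sum(f_draws) / len(f_draws) > cutoff

draws = [0.8 + 0.01 * i for i in range(41)]   # draws concentrated near 1
keep = select_by_pci(draws)                   # PCI [0.8, 1.18] excludes zero
```

The two rules usually agree; the inclusion-probability rule is specific to SS because only SS carries the binary indicators f_βj and f_ψj.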

Simulation Study
In this section, a simulation study is conducted to assess the performance of the proposed method. The main objective is to assess the accuracy of the estimates and the correct rate of variable selection. We consider one semi-continuous variable s_i, two factor variables ω_i1 and ω_i2, and six categorical variables y_ij (j = 1, ..., 6). We assume that s_i, ω_ij, and y_ij satisfy Equations (1), (2), and (5) associated with (6), respectively, in which the number of fixed covariates is set to five. We generate x_i1 and x_i2 from the standard normal distribution, x_i3 and x_i4 from a binomial distribution with a probability of success of 0.3, and x_i5 from a uniform distribution on (0, 1). All covariates are standardized to unify the scales. For the ordered categorical variables, we take c_j = c = 4; that is, each y_ij belongs to {0, 1, 2, 3, 4}.
For the Bayesian analysis, we consider the following inputs for the hyperparameters. For the parameters involved in the measurement model, we take µ_0 = 0_6 and Σ_0 = 100.0 × I_6; the elements in Λ_0 corresponding to the free parameters in Λ are set at zero, and H_0k = I_2 for k = 1, ..., 6; ρ_0 = 10.0 and R_0^{−1} = 6.0 × I_3. For the threshold parameters δ, we take η_1 = ... = η_5 = 1.0, which corresponds to the uniform distribution of p on the simplex in R^5. For the intercept parameters α, γ, and the scale σ² in the two-part model, we set α_0 = γ_0 = 0, σ²_α0 = σ²_γ0 = 100, and a_0 = b_0 = 2.0. The hyperparameters involved in the formulation of β and ψ are set as before. Note that these values ensure that the corresponding priors are diffuse enough; hence, we can expect them to enhance the robustness of the inference. In addition, we set ν_β0 = ν_ψ0 = 0.001 in Equations (11) and (12) to guarantee that the spike components of β_k and ψ_k are sufficiently concentrated at zero.
The MCMC algorithm described in Section 3 is implemented to obtain the estimates of the unknown parameters θ. Before the formal implementation, a few test runs are conducted as a pilot study to monitor the convergence of the Gibbs sampler. We plot the values of the EPSR of the unknown parameters against the number of iterations under three different starting values. For SS, Figure 2 presents the plots of the EPSR of the unknown parameters under three different starting values with sample size n = 400. It can be seen that the convergence of the estimates is fast, and all values of EPSR are less than 1.2 within about 300 iterations. To be conservative, we remove the first 2000 observations as the burn-in phase and collect a further 3000 observations for calculating the bias (BIAS), root mean square (RMS), and standard deviation (SD) of the estimates across 100 replications. The BIAS and RMS of the j-th component θ_j of θ are defined respectively as

	BIAS(θ_j) = (1/100) ∑_{r=1}^{100} (θ̂_j^(r) − θ_j^0),  RMS(θ_j) = [(1/100) ∑_{r=1}^{100} (θ̂_j^(r) − θ_j^0)²]^{1/2},

where θ̂_j^(r) is the estimate in the r-th replication and θ_j^0 is the j-th element of the population parameter vector θ^0. The summaries of the estimates of the main parameters for the two scenarios are reported in Tables 1 and 2, where the sums of the SDs and RMSs across the estimates are presented in the last rows. Examination of Tables 1 and 2 gives the following findings: (i) Both methods produce satisfactory results. The performance of SS is slightly superior to that of BaLsso in terms of the sums of RMS and SD: for n = 400, the total RMS and SD are 1.870 and 1.975, respectively, under SS, and amount to 2.016 and 2.035, respectively, under BaLsso. (ii) Except for ψ_2 in Table 1 and ψ_4 in Table 2, for the regression coefficients and factor loadings with true values of zero, the RMSs and SDs produced by SS are uniformly smaller than those produced by BaLsso. This indicates that SS imposes heavier penalties than BaLsso on the regression coefficients and factor loadings and shrinks them more strongly toward zero. (iii) For non-zero β_j and ψ_j, the situation is more complex. For example, the RMSs and SDs of the estimates of β_5, β_7, ψ_1, ψ_5, and ψ_6 in Table 1 show that SS performs somewhat worse than BaLsso; this is also true for β_3 and ψ_3 in Table 2. The underlying reason perhaps lies in the fact that the penalties imposed on the non-zero regression coefficients and factor loadings by SS are similar to those of BaLsso, and neither is overwhelmingly superior to the other. (iv) As expected, increasing the sample size improves the accuracy of the estimates for both SS and BaLsso.
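The BIAS and RMS summaries can be computed directly from the replication-level estimates; a minimal sketch (helper name ours):

```python
import math

def bias_and_rms(estimates, truth):
    """Monte Carlo summaries across replications: BIAS is the average signed
    error of the estimates, RMS is the root of the average squared error."""
    R = len(estimates)
    errs = [e - truth for e in estimates]
    bias = sum(errs) / R
    rms = math.sqrt(sum(v * v for v in errs) / R)
    return bias, rms

# Four hypothetical replications of an estimate whose true value is 1.0:
b, r = bias_and_rms([1.1, 0.9, 1.2, 0.8], truth=1.0)
```

In this toy case the signed errors cancel (BIAS ≈ 0) while the RMS remains positive, which is why the paper reports both quantities: BIAS captures systematic over- or under-estimation, RMS the total estimation error.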
Another simulation is conducted to assess the performance of the proposed method for variable selection when the covariates and latent variables are correlated. In this setting, we generate the covariates and latent factors jointly from the multivariate normal distribution with mean zero and 7 × 7 covariance matrix Σ with Σ_jk = ρ^{|j−k|}, where Σ_jk is the (j, k)-th entry of Σ. We consider three scenarios for ρ: (i) ρ = 0.1, (ii) ρ = 0.5, and (iii) ρ = 0.8, which represent weak, mild, and strong dependence, respectively. The values of β and ψ are taken as (1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0)^T and (1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0)^T, respectively, and the sample size is taken as n = 1000. The remainder of the model is set up as before. We implement the MCMC sampling and collect 3000 observations after removing the first 2000 observations for posterior inference. We follow [43] and treat a regression coefficient as zero if the absolute value of its estimate is less than 0.1. Table 3 gives the summary of variable selection across 100 replications. Based on Table 3, it can be found that (i) for nonzero regression coefficients, the two methods exhibit satisfactory performance, with both having 100% correct rates across all situations; (ii) for zero regression coefficients, there are differences between the two methods, and SS uniformly outperforms BaLsso. The underlying reason perhaps is that for SS, the variances of the estimates are set small enough to ensure that the coefficients are close to zero, while for BaLsso, the variances are controlled by the shrinkage parameters, which may not impose enough shrinkage to ensure this; (iii) as the strength of dependence increases, the correct rates of both methods decrease. As an illustration, Figure 3 gives the plots of the correct rates of the selected variables for the three scenarios; the values on the x-axis are the true values of β_1 to ψ_7 in the order given in Table 3.
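Vectors with the correlation structure Σ_jk = ρ^{|j−k|} can be generated without an explicit Cholesky factorization, because this covariance is exactly that of a stationary AR(1) process. A small sketch with an illustrative Monte Carlo check (the sample size 20000 is our choice):

```python
import math, random

def draw_ar1_vector(q, rho, rng=random):
    """One draw of (x_1, ..., x_q) with Cov(x_j, x_k) = rho**|j - k|, using
    the AR(1) recursion x_j = rho * x_{j-1} + sqrt(1 - rho^2) * eps_j with
    standard normal innovations eps_j and x_1 ~ N(0, 1)."""
    x = [rng.gauss(0.0, 1.0)]
    scale = math.sqrt(1.0 - rho * rho)
    for _ in range(q - 1):
        x.append(rho * x[-1] + scale * rng.gauss(0.0, 1.0))
    return x

random.seed(4)
rho = 0.5
sample = [draw_ar1_vector(7, rho) for _ in range(20000)]
# Empirical lag-1 covariance between the first two coordinates (should be
# close to rho, since both coordinates have unit variance):
m = len(sample)
c01 = sum(v[0] * v[1] for v in sample) / m
```

The recursion produces exactly the target covariance because each coordinate has unit variance and correlations decay geometrically with the index gap, matching ρ^{|j−k|}.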

Chinese Household Financial Survey Data
To demonstrate the usefulness of the proposed methodology, in this section, a small portion of Chinese household finance debt data is analyzed. The dataset is collected from the Chinese Household Financial Survey (CHFS), conducted by a non-profit institute organized by the Southwestern University of Finance and Economics in Chengdu, Sichuan Province, China. The survey covers a series of questions that touch on various aspects of a household's financial situation. In this study, we focus only on the measurement 'gross debts per household (DEB)': the amount of secured and unsecured debt of the household under investigation. We extracted the data from a survey of Zhejiang Province in 2013. Due to some uncertain factors, some measurements of DEB are missing; the missing proportion is about 2.7%. We remove the subjects with missing entries, and the ultimate sample size is 884. A preliminary data analysis shows that the DEB measurements contain excessive zeros, with a proportion of zeros of about 72.58%. Naturally, we treat this variable as the outcome variable s_i and identify it with u_i and z_i. Figure 4 presents the histogram of DEB as well as that of the logarithms of the positive values. It can be seen clearly that the dataset exhibits strong heterogeneity. The skewness and kurtosis of DEB are 1.1042 and 2.3361, respectively, which indicates that a single parametric model for DEB may be inappropriate.
We include the following measurements as potential explanatory factors to interpret the variability of DEB: gender (x_1), age (x_2), marital status (x_3), health condition (x_4), educational experience (x_5), employment status of the household head (x_6), the number of family members aged over 16 (x_7), and the household annual income (x_8). Table 4 gives the descriptive summary of these measurements. To unify the scales, all covariates were standardized. * Note: Superscripts are used to indicate values raised to the power of 10 (thus, a^b = a × 10^b). The measurement is taken as the middle value of the range in the questionnaire.
Besides the observed factors mentioned above, we also include family culture η as a latent factor in the current analysis. It is well known that China is an ancient civilization with a long history, and Confucian culture is deeply rooted in its social development. Economic activity and social development cannot be independent of cultural development. Hence, it is of practical interest to investigate how family culture affects behavior related to household finance debt. Based on the design of the questionnaire, we select the following three measurements as manifestations for η: (i) the preference for boys (BP, y_1): a three-category measurement coded as 0, 1, and 2, corresponding to the attitudes 'opposed', 'doesn't matter', and 'strongly support'; (ii) the attitude toward having a single child (SC): coded as 0, 1, and 2 according to the level of support; (iii) the importance of family in one's life: this measurement was originally based on a six-point scale (0 to 5) according to the support level; however, because the frequencies in the last three groups were small, we merged them and recoded the measurement into three categories: 0 (does not matter), 1 (important), and 2 (very important). In addition, as some manifestations are missing, we treat the missing data as missing at random and ignorable [50] and ignore the specific mechanism that produces the missingness.
Let U = {u_i}, Z = {z_i}, and Y = {Y_obs, Y_mis}, where Y_obs is the collection of observed data and Y_mis is the set of missing data. We formulate U, Z, and Y within Equations (1), (2), and (6) and assume that η_i iid ∼ N(0, 1). The inputs of the hyperparameters in the priors are taken as follows: Λ_j0 = 0.0, H_j0 = 1, and η_j1 = η_j2 = η_j3 = 2.0. The values of the other hyperparameters are taken to be the same as those in the simulation study. To implement the MCMC sampling algorithm, we need to impute the missing data in Y. This is done by drawing y_ij,mis from the conditional distribution p(y_ij,mis | θ, Y_obs) = N(µ_j,mis + Λ_j,mis η_i, 1), where µ_j,mis and Λ_j,mis are the components of µ and Λ, respectively, that correspond to the missing entries y_ij,mis in y_i. In addition, to identify the model and scale the factor, we set Λ_1 = 1. We also adopt the method of [51], developed in the context of a latent variable model with polytomous data, and fix δ_j1 at Φ^{-1}(f_j1/n_j), where n_j is the number of observed entries in y_obs,ij and f_j1 is the observed frequency of 0 in y_obs,ij. To assess the convergence of the algorithm, for SS, we plot the traces of the estimates under three different initial values (see Figure 5). It can be seen that the algorithm converges at about 3000 iterations. To be conservative, we collect 6000 observations after discarding the initial 4000 for calculating the estimates and their standard deviations. Table 5 summarizes the two sets of estimates of the unknown parameters in the two-part model and the factor analysis model. Examination of Table 5 shows that most of the estimates under the two methods are very close, but there exist differences in the estimates of β_4, β_5, β_7, β_8, ψ_2, ψ_7, and ψ_8. For example, the estimates of β_4, β_5, and β_7 under SS are 0.428, 0.577, and 0.747 with standard deviations of 0.062, 0.070, and 0.072, respectively, while they equal 0.072, 0.082, and 0.092 with standard deviations of 0.07, 0.081, and 0.092 under
BaLsso. These differences reflect the fact that the two methods impose different penalties on the regression coefficients during variable selection.
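As a rough illustration of the missing-data imputation step described above, the following stdlib-only Python sketch draws each y_ij,mis from N(µ_j + Λ_j η_i, 1); all function names and numerical values here are hypothetical, not taken from the paper's implementation.

```python
import random

def impute_missing(y_row, mu, lam, eta_i, missing):
    """Draw each missing entry y_ij from N(mu_j + lam_j * eta_i, 1),
    keeping observed entries unchanged."""
    out = list(y_row)
    for j in missing:
        out[j] = random.gauss(mu[j] + lam[j] * eta_i, 1.0)
    return out

# Toy usage with hypothetical values; Lambda_1 is fixed at 1 for identification.
mu = [0.0, 0.5, -0.3]
lam = [1.0, 0.8, 0.6]
eta_i = 0.4
y_i = [1.2, None, None]  # the last two manifestations are missing
y_imp = impute_missing(y_i, mu, lam, eta_i, missing=[1, 2])
```

Within the Gibbs sampler, this draw is repeated at every iteration with the current values of µ, Λ, and η_i, so the imputed entries vary across iterations and their uncertainty propagates into the posterior.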
To see this more clearly, Table 6 gives the variables selected by SS and BaLsso. It can be seen that (i) for Part One, both methods give the same results for the selection of the factors 'gender', 'age', 'marital status', 'employment', 'number of adults', and 'family culture'. The two methods favor 'age', 'marital status', and 'number of adults' as helpful for improving the model fit, while 'gender' and 'family culture' have less influence on the probability of holding household finance debt. However, the two methods reach contradictory conclusions when selecting 'health condition', 'education', and 'income'. (ii) For Part Two, except for the factors 'age' and 'number of adults', the two methods give the same results. In particular, both methods support that 'family culture' is relevant to the amount of household finance debt being held. This fact was also revealed by [18] in an analysis of the CHFS using a two-part nonlinear latent variable model. Further interpretation is omitted to save space.

Discussion
A two-part latent variable model can be considered an extension of the traditional two-part model to situations where latent variables are included to identify the unobserved heterogeneity of a population resulting from the absence of observed covariates. When analyzing such a model, an important issue is to determine which factors are relevant to the outcome variable. This is especially true when the number of exogenous factors is large, because the usual model selection/comparison procedures are extremely time-consuming. In this paper, we resort to a Bayesian variable selection method and develop a fully Bayesian variable selection procedure for semi-continuous data. Our formulation follows the lines of a spike and slab bimodal prior and recasts the distribution of the regression coefficients and factor loadings as a hierarchy of priors over the parameter and model space. The selected variables are identified as those with a high posterior probability of occurrence. We also consider an adaptive Bayesian lasso (BaLsso) for reference. To facilitate computation, we recast the logistic regression model in Part One as a form of normal mixture model by introducing latent Pólya-Gamma variables. This admits conjugate full-conditional distributions for all regression coefficients, factor loadings, and factor variables.
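For reference, the data-augmentation identity of Polson, Scott, and Windle that underlies this recasting is

```latex
\frac{(e^{\psi})^{a}}{(1+e^{\psi})^{b}}
  \;=\; 2^{-b}\, e^{\kappa \psi}
  \int_{0}^{\infty} e^{-\omega \psi^{2}/2}\, p(\omega)\, d\omega,
\qquad \kappa = a - \tfrac{b}{2},
```

where p(ω) is the density of the Pólya-Gamma PG(b, 0) distribution. With a = u_i ∈ {0, 1}, b = 1, and ψ = α + β^T w_i, the logistic likelihood contribution is, conditionally on ω, proportional to a normal kernel in ψ, which is what yields the conjugate normal full conditionals for the coefficients.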
Although Bayesian variable selection has unique advantages, there are still some limitations that need to be considered with care. Firstly, its computational complexity is high. The Bayesian SS method requires Monte Carlo sampling to estimate the posterior distribution, which can lead to slow computation, especially when working with high-dimensional datasets. Secondly, the method is sensitive to the hyperparameters and to the distributional assumptions on the data. The choice of hyperparameters in the prior distributions, such as the ratio of spike to slab and the lasso penalty parameters, as well as the distributional assumptions themselves, can have a great impact on the results; when the data do not conform to the model's assumptions, the performance of the method is poor. Therefore, these issues need to be considered carefully in real applications to ensure that the Bayesian SS method can be applied effectively to the specific dataset.
The proposed method can be applied to more general latent variable models, including multilevel SEMs [51] and longitudinal dynamic variable models with discrete variables [17,52].These extensions are left for further study.

Full conditional of p(Ω| • • • )
By some algebra, it can be shown that the full conditional of each ω i is the normal distribution given in (A1). Hence, the drawing of Ω can be obtained by simulating the ω i independently from the normal distribution (A1).
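Since the explicit form of (A1) is not reproduced here, the following is only a generic illustration of such a conjugate normal draw for a one-factor model with unit error variances; the mean and variance expressions below are the standard textbook result, not necessarily the paper's exact (A1).

```python
import math
import random

def draw_factor_score(y_i, mu, lam):
    """Conjugate update for a one-factor model y_i = mu + lam * eta_i + eps,
    with eps ~ N(0, I) and prior eta_i ~ N(0, 1):
        eta_i | y_i ~ N(v * lam'(y_i - mu), v),   v = 1 / (1 + lam'lam)."""
    v = 1.0 / (1.0 + sum(l * l for l in lam))
    mean = v * sum(l * (y - m) for l, y, m in zip(lam, y_i, mu))
    return random.gauss(mean, math.sqrt(v))
```

In a Gibbs sweep this draw is made independently for each subject i, which is what makes the factor-score update embarrassingly parallel across observations.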

Full conditional of p(U * | • • • )
Following a derivation similar to that in [42], it can be shown that, given U, Ω, and θ, the full conditional of U * is a Pólya-Gamma distribution obtained through exponential tilting of the standard Pólya-Gamma density, with tilting parameter η i = α + β T w i . Drawing u * i from this distribution can be achieved via a rejection sampling method; see [42] or [53] for more details on this issue.
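The exact rejection sampler of [42] is the method of choice; as a simpler, self-contained alternative, PG(b, c) can also be drawn approximately by truncating its representation as an infinite convolution of gammas. The sketch below is an assumption-laden illustration, not the paper's sampler.

```python
import math
import random

def rand_pg(b, c, K=200):
    """Approximate draw from PG(b, c) via a K-term truncation of
        omega = (1 / (2*pi^2)) * sum_k g_k / ((k - 1/2)^2 + c^2 / (4*pi^2)),
    with g_k ~ Gamma(b, 1) independent.  The exact rejection sampler of [42]
    is preferable in practice; this sketch only illustrates the distribution."""
    total = 0.0
    for k in range(1, K + 1):
        g = random.gammavariate(b, 1.0)
        total += g / ((k - 0.5) ** 2 + c * c / (4.0 * math.pi ** 2))
    return total / (2.0 * math.pi ** 2)
```

A quick sanity check uses E[PG(b, c)] = (b/(2c)) tanh(c/2), which equals 1/4 for PG(1, 0).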

Full conditional of p(Y * | • • • )
Hence, given Ω, the full conditional of Y * depends only on µ, Λ, Y, and Ω and is a truncated normal distribution, from which draws can be obtained via an inverse distribution sampling method; see, for example, [54].
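Inverse distribution sampling from a truncated normal amounts to drawing a uniform variate between the CDF values at the truncation bounds and mapping it back through the inverse CDF. A minimal stdlib-only sketch (the function name and parameterization are ours, not the paper's):

```python
import random
from statistics import NormalDist

_std = NormalDist()  # standard normal distribution

def rand_truncnorm(mean, sd, a, b):
    """Inverse-distribution sampling from N(mean, sd^2) truncated to (a, b):
    draw u ~ Uniform(F(a), F(b)) and map it back through the inverse CDF."""
    Fa = _std.cdf((a - mean) / sd)
    Fb = _std.cdf((b - mean) / sd)
    u = random.uniform(Fa, Fb)
    return mean + sd * _std.inv_cdf(u)
```

Because the uniform draw is confined to (F(a), F(b)), every returned value lies inside the truncation interval by construction, with no rejection step needed.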

Full conditional of p(θ| • • • )
Recall that θ consists of α, β, γ, ψ, σ 2 , µ, Λ, Φ, and δ. Hence, the drawing of θ can be accomplished by drawing α from p(α| • • • ), . . . , and δ from p(δ| • • • ) sequentially, one component at a time. Note that, given U * , Y * , and Ω, the models (2), (6), and (9) reduce to ordinary regression models and, hence, most of the full conditionals, paralleling those of the regression coefficients and variances/covariances in Bayesian regression analysis, are standard distributions such as the normal, gamma, inverse gamma, and Wishart distributions. As a matter of fact, by some tedious but non-trivial calculations, explicit expressions can be obtained, in which Y * * is the n × p matrix with i-th row y * T i − µ T . As a result, we can draw δ kℓ by first generating δ * kℓ from the truncated beta distribution (A8) and then transforming it to δ kℓ via the inverse transformation δ kℓ = F −1 0 (δ * kℓ [F 0 (δ k,ℓ+1 ) − F 0 (δ k,ℓ−1 )] + F 0 (δ k,ℓ−1 )). A drawing from the truncated beta distribution can be obtained by implementing an inverse distribution sampling method.

Full conditional of p(Q * | • • • )
First of all, it is noted that Q * consists of F β , F ψ , w β , w ψ , η −2 β , and η −2 ψ under SS, and of γ 2 β , γ 2 ψ , λ 2 β , and λ 2 ψ under BaLsso. Similar to θ, we update Q * by drawing observations from the full conditional of each component sequentially.
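The inverse transformation for the threshold δ kℓ can be sketched as follows. Here we take F 0 to be the standard normal CDF (an assumption based on the use of Φ elsewhere in the paper), and we draw δ * from a plain Beta as a stand-in for the truncated beta (A8); both choices are illustrative only.

```python
import random
from statistics import NormalDist

F0 = NormalDist()  # assumption: F0 is the standard normal CDF

def draw_threshold(d_lo, d_up, a=2.0, b=2.0):
    """Sketch of the threshold update: generate delta* on (0, 1) -- here from
    a plain Beta(a, b), standing in for the truncated beta (A8) -- and map it
    into (d_lo, d_up) via
        delta = F0^{-1}( delta* * [F0(d_up) - F0(d_lo)] + F0(d_lo) ).
    The result always respects the ordering d_lo < delta < d_up."""
    d_star = random.betavariate(a, b)
    p = d_star * (F0.cdf(d_up) - F0.cdf(d_lo)) + F0.cdf(d_lo)
    return F0.inv_cdf(p)
```

The transformation guarantees the monotone ordering of adjacent thresholds at every iteration, which is what keeps the polytomous measurement model well defined throughout the Gibbs run.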

Figure 1. Plot of the densities of the Laplace distribution for different choices of λ.

Figure 2. Plot of the EPSR values of the unknown parameters under three different starting values, in which the colored solid lines represent the trajectories of the EPSR of the estimates against the number of iterations: simulation study and n = 400.

Figure 3. Plots of the correct rates of the selected variables under three scenarios: simulation study and n = 1000.

Figure 4. Histograms of DEB and the logarithms of its positive values: Chinese Household Financial Survey data. (Left) panel corresponds to DEB and (right) panel corresponds to log(DEB | DEB > 0).

Figure 5. Trace plots of the estimates of the unknown parameters against the number of iterations under the SS prior, in which the colored solid lines represent the traces of the estimates under three starting values: CHFS data.

Table 1. Summary of the estimates of unknown parameters under SS and BaLsso: simulation study and n = 400.

Table 2. Summary of the estimates of unknown parameters under SS and BaLsso: simulation study and n = 1000.

Table 3. Number of correctly selected variables in the two-part model on the simulated datasets.

Table 4. Descriptive statistics of the explanatory variables: CHFS data.

Table 5. Estimates and standard deviation estimates of the unknown parameters under SS and BaLsso: CHFS data.

Table 6. The selected variables in the CHFS data (0: excluded; 1: included).