Model Selection for Exponential Power Mixture Regression Models

Finite mixture of linear regression (FMLR) models are among the most widely used statistical tools for handling heterogeneous data. In this paper, we introduce a new procedure to simultaneously determine the number of components and perform variable selection within each component regression for FMLR models with an exponential power error distribution, which includes normal distributions and Laplace distributions as special cases. Under some regularity conditions, the consistency of order selection and of variable selection is established, and the asymptotic normality of the estimators of the non-zero parameters is investigated. In addition, an efficient modified expectation-maximization (EM) algorithm and a majorization-maximization (MM) algorithm are proposed to solve the resulting optimization problem. Furthermore, numerical simulations demonstrate the finite-sample performance of the proposed methodology. Finally, we apply the proposed approach to analyze a baseball salary data set. Results indicate that our proposed method obtains a smaller BIC value than the existing method.

There are two important statistical problems in FMLR models: order selection and variable selection for the different regressions. However, order selection should be the first issue discussed for FMLR models, and there is a substantial literature on this problem. For example, Ref. [9] introduced a penalized likelihood method for mixtures of univariate location distributions. Ref. [10] proposed a penalized likelihood method to select the number of mixing components in finite multivariate Gaussian mixture models. For variable selection within each regression component, Ref. [11] applied subset-selection approaches such as the Akaike information criterion (AIC) and the Bayesian information criterion (BIC) to perform variable selection for each component in a finite mixture of Poisson regression models. To avoid the drawbacks of subset selection, Ref. [12] introduced a penalized likelihood method for variable selection in FMLR models. Ref. [13] proposed a robust variable selection procedure to estimate and select relevant covariates for FMLR models.
The methods above do not jointly perform order selection and variable selection in FMLR models. In fact, this is a challenging issue, although some literature exists on the problem. Ref. [14] introduced MR-Lasso for FMLR models to simultaneously perform order selection and identify significant variables; however, they did not study the large-sample properties of the proposed method. Ref. [15] proposed a robust mixture regression estimator via an asymmetric exponential power distribution, and Ref. [16] studied component selection for exponential power mixture models, but neither considered a variable selection procedure. Ref. [17] applied penalties to both the number of components and the regression coefficients to conduct model selection for FMLR models, but assumed normally distributed errors; the resulting method is therefore very sensitive to heavy-tailed error distributions.
In this paper, motivated by [10,18], we propose a new model selection procedure for FMLR models via an exponential power distribution, which includes normal distributions and Laplace distributions as special cases. Under some regularity conditions, we investigate the asymptotic properties of the proposed method. In addition, we introduce an expectation-maximization (EM) algorithm [19] and a majorization-maximization (MM) algorithm [20] to solve the proposed optimization problem. The finite-sample performance of the proposed method is illustrated via numerical simulations. Results indicate that the proposed method is more robust to heavy-tailed distributions than the existing method.
The rest of this paper is organized as follows. In Section 2, we present the finite mixture of regression models with an exponential power distribution and a penalized likelihood-based model selection approach, and investigate the asymptotic properties of the resulting estimators. In Section 3, a modified EM algorithm and an MM algorithm are developed to maximize the penalized likelihood. In Section 4, we propose a data-driven procedure to select the tuning parameters. In Section 5, simulation studies are conducted to evaluate the finite-sample performance of the proposed method. In Section 6, a real data set is analyzed to compare the proposed method with some existing methods. We conclude with some remarks in Section 7. Technical conditions and proofs are given in Appendix A.

Methodology
The density function of an exponential power (EP) distribution is defined as follows, where p > 0 is the shape parameter, σ > 0 is the scale parameter, and Γ(·) is the Gamma function. When 0 < p < 2, the EP distribution is heavy-tailed, which indicates that it can provide protection against outliers. The EP density function is a flexible and general density class that includes several important statistical density functions as special cases, e.g., the Gaussian density (p = 2) and the Laplace density (p = 1). Meanwhile, the EP distribution has a wide range of applications, particularly in business applications [21].
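The EP density itself did not survive extraction. One standard parameterization, which may differ in minor details from the authors' exact form, is:

```latex
f_p(x; \mu, \sigma) \;=\; \frac{1}{2\, p^{1/p}\, \Gamma(1 + 1/p)\, \sigma}
  \exp\!\left\{ -\frac{|x - \mu|^p}{p\, \sigma^p} \right\},
  \qquad p > 0,\ \sigma > 0.
```

Setting p = 2 recovers the N(μ, σ²) density, and p = 1 recovers the Laplace density with scale σ, consistent with the special cases named in the text.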
Based on the EP density function, we study the FMLR models. Let Z be a latent class variable with P(Z = j | x) = π_j for j = 1, 2, ..., m, where X is a p-dimensional covariate vector. Given Z = j, suppose that the response Y depends on X in a linear way, where β_j is a p-dimensional coefficient vector and ϵ_j is a random error with EP density f_{p_j}(x; 0, σ_j). Then the conditional density of Y given X can be written as a mixture. To deal with the model selection problem, following [10], we consider an objective function with a penalty function p_λ(·), where p_λ(·) is non-negative and non-decreasing, and λ_1 > 0 and λ_2 > 0 are two penalty parameters. Thus, we obtain the estimator θ̂_n of θ as θ̂_n = arg max_θ Q_n(θ).
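The model equations and the objective function Q_n(θ) were lost in extraction. The following is a hedged reconstruction consistent with the surrounding text; in particular, the exact form of the mixing-proportion penalty depends on [10] and is shown here only schematically:

```latex
% Component-wise linear model, given Z = j:
Y = \mathbf{x}^{\top}\boldsymbol{\beta}_j + \epsilon_j,
  \qquad \epsilon_j \sim f_{p_j}(\,\cdot\,;\, 0, \sigma_j).
% Conditional mixture density of Y given X = x:
f(y \mid \mathbf{x}; \boldsymbol{\theta})
  = \sum_{j=1}^{m} \pi_j\, f_{p_j}\!\left(y - \mathbf{x}^{\top}\boldsymbol{\beta}_j;\, 0, \sigma_j\right).
% Penalized log-likelihood (schematic):
Q_n(\boldsymbol{\theta})
  = \sum_{i=1}^{n} \log f(y_i \mid \mathbf{x}_i; \boldsymbol{\theta})
  \;-\; n \sum_{j=1}^{m}\sum_{t=1}^{p} p_{\lambda_1}\!\left(|\beta_{jt}|\right)
  \;-\; n \sum_{j=1}^{m} p_{\lambda_2}(\pi_j).
```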
To derive theoretical properties of the estimator θ̂_n, we first introduce some notation, where p′_λ(h) and p″_λ(h) are the first and second derivatives of p_λ(h) with respect to h. To establish the asymptotic properties of the proposed estimators, we assume the following regularity conditions: (C1) For any λ, p_λ(0) = 0, and p_λ(·) is non-negative and symmetric. Furthermore, it is non-decreasing and twice differentiable on (0, ∞) with at most a few exceptions.
(C4) The joint density f(z, θ) of Z = (X, Y) has third partial derivatives with respect to θ for almost all z. (C5) For each θ_0, there exist functions R_1(z) and R_2(z) such that, for θ in a neighborhood N(θ_0) of the true parameter θ_0, ∫ R_1(z) dz < ∞ and ∫ R_2(z) f(z; θ) dz < ∞. (C6) The Fisher information matrix Q(θ) is finite and positive definite at θ = θ_0, where Q(θ) is defined accordingly. Here c_1 is a positive constant, and c_2 and c_3 are some large constants.
Remark 1. Conditions (C1)-(C3) are assumptions about the penalty function and ensure that the variable selection of the proposed estimators is consistent. Similar conditions are used in [22]. Condition (C5) ensures that the main term dominates the remainder in the Taylor expansion. Conditions (C4)-(C6) are also used in [17]. Condition (C7) ensures the concavity of the likelihood function, since the log-likelihood of a random sample from an EP distribution is concave when p > 1. Condition (C8) ensures the compactness of the parameter space. Conditions (C7) and (C8) are applied similarly in Wang and Feng [16].
In the following, we state two theorems, with proofs given in Appendix A.

Algorithm
In this section, we apply a modified EM algorithm and an MM algorithm to solve the proposed optimization problem (3). Let z_ij be the indicator variable showing whether the i-th observation arises from the j-th component (treated as missing data), and let p_ij be the posterior probability that the i-th observation belongs to the j-th component. The expected complete-data log-likelihood function is then obtained, and the objective function (2) is rewritten accordingly. Next, we apply a modified EM algorithm to maximize the objective function (4). The detailed procedure is as follows. Step 1. Given the l-th approximation of the parameters, we calculate the classification probabilities p_ij.
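The E-step above can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: `ep_pdf` assumes the standard EP parameterization in which p = 2 gives the normal density and p = 1 the Laplace density, and all parameter names are illustrative.

```python
import math
import numpy as np

def ep_pdf(x, mu, sigma, p):
    # Exponential power density (one common parameterization):
    # p = 2 recovers the normal density, p = 1 the Laplace density.
    c = 1.0 / (2.0 * p ** (1.0 / p) * math.gamma(1.0 + 1.0 / p) * sigma)
    return c * np.exp(-np.abs(x - mu) ** p / (p * sigma ** p))

def e_step(y, X, pi, beta, sigma, p):
    # p_ij: posterior probability that observation i belongs to component j,
    # proportional to pi_j * f_{p_j}(y_i - x_i' beta_j; 0, sigma_j).
    dens = np.column_stack([
        pi[j] * ep_pdf(y - X @ beta[j], 0.0, sigma[j], p[j])
        for j in range(len(pi))
    ])
    return dens / dens.sum(axis=1, keepdims=True)
```

Each row of the returned matrix sums to one, giving the classification probabilities used in the subsequent M-step updates.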
Step 2. We first update {π_1, ..., π_m}. We use a Lagrange multiplier δ to account for the constraint ∑_{j=1}^m π_j = 1. In (5), we apply the local linear approximation [23] to the log-penalty term, and π_j can then be updated by straightforward calculations. Then, the resulting estimator of the remaining parameters is obtained. Finally, we update {β_1, ..., β_m}. Ignoring terms that do not involve β_j, we obtain L(β_j). We apply an MM algorithm to the first term of L(β_j), and a local quadratic approximation [22] to p_{λ_1}(|β_jt|). Thus, for each β_j, j = 1, 2, ..., m, we only need to solve a simple minimization problem, which yields the update β_j^(l+1). Step 3. Repeat Steps 1 and 2 until convergence.

Choice of the Tuning Parameters
The selection of tuning parameters is a vital part of the order selection and variable selection procedure. To guarantee that the true model can be chosen correctly, we should select proper tuning parameters λ_1 and λ_2 in practice. There are many methods for selecting λ_1 and λ_2, such as cross-validation (CV), generalized cross-validation (GCV), AIC, and BIC.
As suggested in [24], we introduce a data-driven procedure to choose the tuning parameters λ_1 and λ_2 by minimizing a modified Bayesian information criterion with degrees of freedom df = 3m̂ − 1 + M_β̂, where m̂ denotes the estimate of the number of components.
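The criterion display was lost in extraction; a hedged reconstruction of the usual form of such a modified BIC, consistent with the degrees-of-freedom count stated above, is:

```latex
\mathrm{BIC}(\lambda_1, \lambda_2)
  = -2\,\ell_n\!\left(\hat{\boldsymbol{\theta}}_{\lambda_1, \lambda_2}\right)
    + df \cdot \log n,
  \qquad df = 3\hat{m} - 1 + M_{\hat{\beta}},
```

where ℓ_n is the log-likelihood of the fitted model and M_β̂ is presumably the number of non-zero estimated regression coefficients; the 3m̂ − 1 term counts the component-specific parameters subject to the constraint on the mixing weights.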

Simulation
In this section, we use numerical simulations to illustrate the finite-sample performance of the proposed method. For the penalty function, we use the SCAD penalty [22], where λ is a tuning parameter and a > 2; following the suggestion in Fan and Li [22], we set a = 3.7, which minimizes a Bayes risk criterion. The datasets are generated via a three-component FMLR model (7), where the components of x are generated independently from the 7-dimensional standard normal distribution; in detail, we generate random samples of each component from a linear model. We simulate 100 datasets from the FMLR model (7) with sample sizes n = 200, 600, 800, 1000, under the following four scenarios:
Scenario 1. β_2 = (…, 0, 0, 0)^T, β_3 = (5, 6, 7, 8, 0, 0, 0)^T, with π_1 = 0.4, π_2 = 0.3, π_3 = 0.3, and random error ϵ ∼ N(0, 1);
Scenario 2. The same setting as in Scenario 1, except that the error term follows a t-distribution with 2 degrees of freedom;
Scenario 3. The same setting as in Scenario 1, except that the error term follows a mixture of t distributions: ϵ ∼ 0.5t(1) + 0.5t(3);
Scenario 4. The same setting as in Scenario 1, except that the error term follows a mixture of normal distributions: ϵ ∼ 0.95N(0, 1) + 0.05N(0, 5²).
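The SCAD penalty display was also lost in extraction. The following sketch gives the standard piecewise form from Fan and Li [22]; the function name and interface are illustrative.

```python
def scad(t, lam, a=3.7):
    """SCAD penalty p_lambda(|t|) of Fan and Li (2001), with a > 2."""
    t = abs(t)
    if t <= lam:
        return lam * t                          # linear near zero (lasso-like)
    if t <= a * lam:
        # quadratic transition region between lam and a*lam
        return -(t ** 2 - 2.0 * a * lam * t + lam ** 2) / (2.0 * (a - 1.0))
    return (a + 1.0) * lam ** 2 / 2.0           # constant: no bias for large |t|
```

The penalty is continuous at the breakpoints t = λ and t = aλ, and is constant beyond aλ, which is what makes SCAD nearly unbiased for large coefficients.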
We compare our proposed method with the method proposed in [17]. To assess the finite-sample performance, we consider four different measures: (1) RMSE_{π_j}: the root mean square error of π̂_j when the order is correctly estimated, where M* is the number of simulations with correct estimation of the order; (2) RMSE_{β_c}: the root mean square error of β̂_j, calculated analogously to RMSE_{π_j}.
(3) NCZ (the number of correct zeros): the number of parameters whose true value is zero and which are correctly estimated as zero, where #{A} denotes the number of elements of A. (4) NIZ (the number of incorrect zeros): the number of parameters whose true value is non-zero but which are incorrectly estimated as zero.
In the simulation studies, we suppose that the data are known to come from a mixture regression model with at most five components, but the true number of components must be estimated. For each scenario, the simulation is repeated 100 times. The corresponding results are shown in Tables 1-8, in which M1 and M2 denote the method of [17] and our proposed method, respectively. Table 1 shows the simulation results of order selection. Columns labeled "Underfitted" give the proportion of fitted models with fewer than three components in the 100 simulations; "Correctly fitted" and "Overfitted" are interpreted similarly. From Table 1, we find that the two methods behave very similarly, and the accuracy rate of order selection reaches more than 98% for both M1 and M2 when n ≥ 600. Table 2 presents the results of variable selection and parameter estimation for each component. From Table 2, we observe that the finite-sample performances of the two methods are again very similar for n ≥ 600. Therefore, when the error term follows a normal distribution, the two methods perform comparably for sufficiently large sample sizes.
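The NCZ and NIZ counts defined above can be computed for a single replication as in this small sketch; the `tol` threshold for declaring an estimate zero is an assumption, not taken from the paper.

```python
import numpy as np

def ncz_niz(beta_true, beta_hat, tol=1e-8):
    # NCZ: truly-zero coefficients correctly estimated as zero.
    # NIZ: truly-nonzero coefficients incorrectly estimated as zero.
    bt_zero = np.abs(np.asarray(beta_true, dtype=float).ravel()) < tol
    bh_zero = np.abs(np.asarray(beta_hat, dtype=float).ravel()) < tol
    ncz = int(np.sum(bt_zero & bh_zero))
    niz = int(np.sum(~bt_zero & bh_zero))
    return ncz, niz
```

Averaging these counts over the simulation replications gives the table entries described in the text.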
Tables 3 and 4 present the results of Scenario 2, a heavy-tailed setting. We observe from Table 3 that M1 yields an underfitted or overfitted model in about 20% of the replications, while our method remains robust and maintains 98% accuracy for n ≥ 600. In Table 4, M1 performs poorly in variable selection: it often has non-zero NIZ values, whereas the NIZ of our method is zero for all n ≥ 600. Meanwhile, the NCZ of our proposed method increases as n increases, and our proposed method has a smaller RMSE than M1. In Table 5, the order selection performance of M1 is worse than that of M2. The proportion of correctly fitted models remains above 98% with our method for n ≥ 600, while M1 tends to overfit the number of components. In Table 6, the NCZ value of M1 is slightly better than that of M2, but comparing RMSE_{β_c}, our method is consistently better than M1.
In general, our method is better than M1 in order selection, variable selection, and parameter estimation.

Real Data Analysis
In this section, we apply the proposed methodology to analyze a baseball salary data set, which contains information about Major League Baseball players. The response variable is their 1992 salaries (measured in thousands of dollars). In addition, there are 16 performance measures for 337 MLB players who participated in at least one game in both the 1991 and 1992 seasons. This data set has been analyzed before, e.g., in [12,17]. We use our method to study how the performance measures affect salaries.
The performance measures are batting average (x_1), on-base percentage (x_2), runs (x_3), hits (x_4), doubles (x_5), triples (x_6), home runs (x_7), runs batted in (x_8), walks (x_9), strikeouts (x_10), stolen bases (x_11), and errors (x_12); and indicators of free-agency eligibility (x_13), free agent in 1991/2 (x_14), arbitration eligibility (x_15), and arbitration in 1991/2 (x_16). The four dummy variables x_13-x_16 indicate how free each player was to move to another team. As suggested in [25], the interaction effects between the dummy variables x_13-x_16 and the quantitative variables x_1, x_3, x_7, and x_8 should also be considered. We therefore obtain a set of 32 potential covariates affecting each player's salary. Ref. [12] fitted a mixture of linear regression models with two or three components to capture the overlaid shape of the histogram of log(salary), and concluded that a two-component mixture regression model labeled MIXSCAD fitted the data well. Ref. [17] used an FMLR model based on the normal distribution, also with two components.
As advocated by [12], we use log(salary) as the response variable. We first fit a linear model via stepwise regression; the results are shown in Table 9, denoted as β̂_ols. Based on [17], we consider a four-component mixture model, where Y = log(salary) and X is a 33 × 1 vector containing the 32 covariates plus an intercept term. To implement the proposed modified EM algorithm, we set the initial values as follows, where ϵ_j ∼ N(0, I), j = 1, 2, 3, 4. The results are reported in Tables 9 and 10. From Table 9, we find that both M1 and M2 choose two components. Furthermore, we observe from Table 10 that M2 has a smaller BIC value than M1, which indicates that our proposed method fits this dataset better than M1. It is of interest to explain how the performance measures affect salaries by interpreting the fitted model, although such interpretation can be controversial; intuitively, many performance measures should be positively correlated with a player's salary. M1 and M2 give coefficients of the same sign and similar magnitude for x_0, x_13, x_15, and the interaction of x_8 and x_14. Recall that x_1 and x_7 measure individual performance, while x_13, x_15, and x_16 are three dummy variables indicating how freely players can change teams. For example, the effect of x_1 * x_16 suggests that, for players in arbitration in 1991/2, individual batting ability (x_1) is associated with a lower salary, but the team-contribution measure (x_8) is not.
The main differences between the two models lie in the interaction effects x_1 * x_14 and x_1 * x_15. M1 disregards the effect of x_1 * x_14, whereas M2 indicates that this effect acts in two directions and assigns it substantial importance.

Discussion
In this paper, we introduced FMLR models with an exponential power error distribution. Under some regularity conditions, the asymptotic properties of the proposed estimators were established. Meanwhile, a modified EM algorithm and an MM algorithm were applied to solve the proposed optimization problem. Furthermore, the merits of the proposed methodology were illustrated through numerical simulations and a real data analysis. Simulation studies showed that the proposed method performs better than the existing methods under different error distributions. In the analysis of a baseball salary dataset, our proposed method attained a smaller BIC value than the method proposed in [17].
where u 1 is a subvector of u with the corresponding non-zero coefficients.
By conditions (C4), (C5), (C7), and (C8) and Taylor's expansion, we obtain the first bound. By condition (C1), Taylor's expansion, the triangle inequality, and the Cauchy-Schwarz inequality, we obtain the second bound, where m_0 is the number of true non-zero mixing weights, t = max_k √t_k, and t_k is the number of true non-zero regression coefficients in the k-th component.

Proof of Theorem 2. We first show that πk
To prove (a), it is sufficient to show that, with probability tending to 1 as n → ∞, the stated inequality holds for any π_k satisfying the given bound, where C is a positive constant and δ is a Lagrange multiplier. We first consider k ≤ m_0. By the law of large numbers, the first and third terms in Equation (A2) are dominated by the second term. Thus, Equation (A1) holds, which completes the proof of (a).

Table 1 .
Order selection results in Scenario 1.

Table 2 .
Variable selection and parameter estimation results in Scenario 1.

Table 3 .
Order selection results in Scenario 2.

Table 4 .
Variable selection and parameter estimation results in Scenario 2.

Table 5 .
Order selection results in Scenario 3.

Table 6 .
Variable selection and parameter estimation results in Scenario 3.

Tables 7 and 8 present the results of Scenario 4. M1 almost never selects the correct number of components; in contrast, our method selects the correct number with 98% accuracy for n ≥ 600. In Table 8, M1 is better than M2 in NCZ, but M1 is unstable in NIZ. Comparing RMSE_{β_c}, M1's error is larger than M2's.

Table 7 .
Order selection results in Scenario 4.

Table 8 .
Variable selection and parameter estimation results in Scenario 4.

Table 9 .
Parameter estimates for baseball salary data.

Table 10 .
Model parameter estimates for baseball salary data.