1. Introduction
With the expansion of datasets, selecting factors that truly affect the response variable from enormous predictors has been a topic of interest for statisticians for many years. However, the response variable commonly contains heavy-tailed errors or outliers in practice. In such a situation, traditional variable selection techniques may fail to produce robust sparse solutions. In this paper, a new estimator for the heavy-tailed distribution data is suggested as a way to deal with this problem.
In the past two decades, Tibshirani [
1] first combined ordinary least square (OLS) with
penalty and proposed a new variable selection method named least absolute shrinkage and selection operator (lasso). Lasso is a convex regularization method by adding
norm, which avoids the influence of the sign of OLS on the prediction results. The method can also perform simultaneously model selection and shrinkage estimation in high-dimensional data. However, lasso is sensitive in the case of heavy tails on the model distribution, which arises from the problem of heterogeneity due to the data coming from different sets [
2]. So, any small changes in the data can cause the solution path of lasso to contain many irrelevant noise variables. The above instability can also occur when a single relevant covariate is randomly selected. It means that applying lasso to the same data may generate widely different results [
3]. In addition, the convergence speed of lasso can be affected by the rapid growth of noise variables, and the convergence speed itself is slow. Relaxed lasso was proposed to overcome the influence of noise variables and perform variable selection at a faster and more stable speed. Meinshausen [
4] defined the relaxed lasso estimator for
and
as
where
,
p is the number of variables with nonzero coefficients selected into the model, and
is an indicator function, that is
, for all
. Hastie et al. [
5] extended the work of Bertsimas et al. by comparing the lasso, forward stepwise and relaxed lasso methods with different signal-to-noise ratios (SNRs) scenarios. The results show that relaxed lasso has an overall outstanding performance at any SNR level. This superiority is reflected in the relaxation parameter
. By appropriately modifying the parameter
, relaxed lasso ensures that the resulting model is consistent with the true model, neither favoring excessive compression that would result in the exclusion of essential variables nor selecting redundant noise variables. This serves as the main reason why we add the relaxation parameter
to the lad lasso. Compared to lasso, relaxed lasso greatly reduces the number of false positives while also achieving a trade-off between low computational complexity and fast convergence rates [
6]. From the perspective of the closed-form solution, Mentch and Zhou [
7] indicate that the relaxed lasso estimator can be expressed as a weighted average of lasso and least squares. When the weight of the lasso is increased, it provides a greater amount of regularization, hence reducing the degree of freedom of the variables in the final model to achieve sparse solutions. Bloise et al. [
8] demonstrated that relaxed lasso has higher predictive power because it is able to avoid overfitting by tuning two separate parameters. He [
9] concluded that relaxed lasso improves prediction accuracy since it avoids selecting unimportant variables and excessively removing informative variables. Extensive research has demonstrated that relaxed lasso has advantages in terms of variable selection, prediction accuracy, convergence speed, and computational complexity. However, relaxed lasso, like OLS cannot produce reliable solutions when the response variable contains heavy-tailed errors or outliers.
In order to solve the problem of poor fitting results of relaxed lasso to the heavy-tailed distribution or outliers, least absolute deviation (LAD) based on robust regression is introduced. It estimates coefficients by minimizing the sum of the absolute values of the prediction errors. The traditional squared loss in the objective function used by classic regularization methods is unsuitable for heavy-tailed distributions and outliers, but LAD performs admirably in these situations. Gao [
10] showed that the LAD loss could provide a powerful alternative to the squared loss. In recent years, some researchers have combined robust regression with popular penalty regularization methods. The most typical method is the lad lasso of Wang et al. [
11], which combines lad and adaptive lasso so that the model can perform robust variable selection. Then, the theoretical properties of lad lasso under large samples have been systematically studied by Gao and Huang [
12] and Xu and Ying [
13]. Arslan [
14] proposed a weighted lad lasso to mitigate the effect of outliers on explanatory and response variables. In addition, lad lasso also has a wide range of practical applications. For example, Rahardiantoro and Kurnia [
15] showed that lad lasso has a more minor standard error than lasso in the presence of outliers in high-dimensional data via simulation. Zhou and Liu [
16] also applied the lad lasso to the double-truncated data and showed that it is more accurate to select the real model than the best subset selection procedure. Li and Wang [
17] applied lad lasso to the change point problem in fields such as statistics and econometrics. Thanks to the superior performance of lad lasso, we consider proposing a new estimator that can not only perform variable selection but is also insensitive to the heavy-tailed distribution or outliers in the response variable.
In this article, we combine lad lasso and the relaxation parameter of relaxed lasso to propose relaxed lad lasso and study its asymptotic properties in the case of large samples. It integrates the advantages of relaxed lasso and lad lasso methods into the following three points. Firstly, the relaxed lad lasso estimator has the same consistency property as the lad lasso, i.e., the method selects important variables with a high probability of convergence to one. Secondly, since relaxed lasso has a closed-form solution, solving relaxed lad lasso is eventually equivalent to solving the LAD program, so we can employ a simple and efficient algorithm. Thirdly, relaxed lad lasso possesses the robustness of lad lasso to heavy-tailed errors or outliers in the response variable. In theory, we prove the -consistency of relaxed lad lasso under some mild assumptions and illustrate its advantages in convergence speed. Although the convergence speed of the relaxed lad lasso is slower than that of relaxed lasso , our method handles outliers and heavy-tailed errors well because it is not affected by the rapid growth of noise variables. The simulation shows that relaxed lad lasso has the highest prediction accuracy and probability of the correct selection of important variables under heavy-tailed distributions compared to other methods. We also apply relaxed lad lasso to financial data and obtain the same results as the simulation regarding prediction accuracy.
However, our method has room for improvement, as LAD cannot handle the presence of outliers in the explanatory variables and is sensitive to leverage points [
18]. Hence, our method suffers from the same problem. Under the framework of LAD regression, researchers have proposed many new methods to improve robustness by reducing the weight of leverage points. Giloni et al. [
19] proposed a weighted least absolute deviation process (WLAD) to overcome the shortcomings of the LAD method. However, as the proportion of outliers increases, the robustness of the WLAD estimator significantly decreases [
20]. To obtain a high robustness estimator and abnormal information of observations, Gao and Feng [
21] proposed a penalized weighted least absolute deviation (PWLAD) regression method. Jiang et al. [
22] combined the PWLAD estimator and the lasso method to detect outliers and select variables robustly. However, it is worth noting that these methods mainly address the robustness problem when there are leverage points or outliers in the explanatory variables. Still, our method is suitable in situations with heavy-tailed errors or outliers in the response variable. In the simulation, we assume that the model error follows a heavy-tailed distribution such as the t-distribution. Therefore, we do not compare relaxed lad lasso with the above methods due to the different application scenarios. More specific details can be found in
Section 4.
The remainder of the paper is organized as follows:
Section 2 defines the estimator of relaxed lad lasso and interprets the parameters in the model. In addition, we give the detailed procedure of the algorithm.
Section 3 describes the asymptotic properties of the loss function and provides the theorems’ assumptions.
Section 4 compares the performance of relaxed lad lasso with conventional lasso methods (such as classical lasso, adaptive lasso, and relaxed lasso) through simulations under different heavy-tailed distribution scenarios.
Section 5 analyzes empirical data to confirm the robustness of the proposed method to heavy-tailed distributions.
Section 6 summarizes the advantages of the new method as well as suggestions for further research. The proofs of the theorems are given in
Appendix A,
Appendix B,
Appendix C,
Appendix D,
Appendix E.
3. The Asymptotic Properties of Relaxed Lad Lasso
Before obtaining the asymptotic properties, we must first set certain conditions. Regarding the covariance matrix
, we consider the settings in Fu and Knight [
23] and Meinshausen [
4] and put forward the first hypothesis:
Assumption 1. For all , the covariance matrix is diagonally dominant. According to the setting of Fu and Knight [23]:and then, it can be deduced to obtain Obviously, the default precondition for diagonal dominance of the covariance matrix is that the covariance matrix exists. When the strong condition of diagonal dominance is satisfied, the covariance matrix is positive definite, and the hidden condition is that its inverse matrix still exists.
Assumption 2. There exist constants and such that the number of predictors p grows exponentially with the number of observed variables n. It can be written as Assumption 2 sets the growth mode of p to satisfy the requirement that relaxed lad lasso still retains a better convergence speed in variable selection.
Assumption 3. Define the range of the penalty parameter λ. For a constant , we have Assumption 3 sets the range of penalty parameters necessary to prove consistency.
Assumption 4. The random error term does not follow any distribution and has a median of 0.
In other variable selection models, such as lasso and adaptive lasso, the random error term usually obeys the normal distribution. However, for the study of relaxed lad lasso in this paper, the distribution conditions for the random error term are relaxed, and only the median is imposed. All of the above assumptions are necessary for proving the consistency of relaxed lad lasso.
Lemma 1. Let with . be an empirical loss function of , where is its sample size. Then, under Assumptions 1–4, Lemma 1 will be used to prove the key conclusion in Theorem 4.
According to Lemma 1 of Wang et al. [
11], lad lasso’s oracle property is dependent on the
-consistency, that is,
. Therefore,
is in a sequence with
as
. The lad lasso model in this article uses a fixed
because
is the largest
in the nonzero parameters; then, you can obtain
.
Theorem 1. In order to describe the loss under the lad lasso estimator when , according to Assumptions 1–4, we have: Theorem 1 first proves the convergence rate of lad lasso. Lad lasso uses the
loss function. According to Pesme and Flammarion [
24], it is shown that the
loss function is non-strongly convex. Since the loss function does not have non-derivable points, we can still think that the algorithm is convex. The
loss’s non-strong convexity can guarantee
convergence speed before and after iteration, and smoothness has no effect on the above conclusion, which indirectly proves that our conclusion is correct.
Theorem 2. In order to describe the loss under the relaxed lad lasso estimator when , according to Assumptions 1–4, we have: One of the main contributions of our paper is to prove that the convergence speed of relaxed lad lasso is equivalent to that of lad lasso, even although adding the relaxation parameter does not improve the convergence speed of relaxed lad lasso. However, when the number of variables p grows exponentially with the sample n, the number of potential noise variables likewise increases significantly, but this will not slow down the convergence speed of relaxed lad lasso. Although the convergence speed of relaxed lad lasso is not ideally as fast as relaxed lasso, it still outperforms lasso due to the existence of loss and relaxation parameter , which offers good stability.
Theorem 3. Under the condition that the design matrix is positive definite and the prediction error is continuous and has a positive density at the origin, when , the estimator of relaxed lad lasso is -consistency for , which is Another major contribution of this paper is to prove that the the relaxed lad lasso estimator is consistent, where the conclusion of Lemma 1 of Wang et al. [
11] in lad lasso is an essential precondition that the important variable’s penalty parameter can converge to 0 faster than
. It guarantees the consistency of lad lasso, and our proof is also based on this conclusion.
Theorem 4. Let be the loss of the relaxed lad lasso estimate and chosen by K-fold cross-validation with . Under assumptions of relaxed lasso, it holds that We still use K-fold cross-validation when choosing the penalty parameters; that is, we select the optimal penalty parameters
and
by minimizing the empirical loss function of cross-validation. First, define the empirical loss function on a different observation set from
as
where each partition of
R consists of
observations and
is the empirical loss of the response variable.