Variable Selection and Regularization in Quantile Regression via Minimum Covariance Determinant Based Weights

The importance of variable selection and regularization procedures in multiple regression analysis cannot be overemphasized. These procedures are adversely affected by predictor space data aberrations as well as outliers in the response space. To counter the latter, robust statistical procedures such as quantile regression, which generalizes the well-known least absolute deviation procedure to all quantile levels, have been proposed in the literature. Quantile regression is robust to response variable outliers but very susceptible to outliers in the predictor space (high leverage points), which may alter the eigen-structure of the predictor matrix. High leverage points that alter the eigen-structure of the predictor matrix by creating or hiding collinearity are referred to as collinearity influential points. In this paper, we suggest generalizing the penalized weighted least absolute deviation to all quantile levels, i.e., to penalized weighted quantile regression, using the RIDGE, LASSO, and elastic net penalties as a remedy against collinearity influential points and high leverage points in general. To maintain robustness, we make use of very robust weights based on the computationally intensive high breakdown minimum covariance determinant. Simulations and applications to well-known data sets from the literature show an improvement in variable selection and regularization due to the robust weighting formulation.


Introduction
Variable selection and robust estimation procedures are an important consideration of multiple regression analysis in the presence of predictor space data aberrations (high leverage points (outliers in the X-space) and multicollinearity) as well as response variable (Y-space) outliers. It is well known that the least squares (LS) procedure is susceptible to both data aberrations in the predictor space and response variable (Y-space) outliers. To counter the influence of Y-space outliers, alternative robust procedures have been developed in the literature. One such attractive robust procedure is quantile regression (QR) [1]. In addition to being robust, QR is more versatile than the LS. This is due to the fact that, while the LS procedure models the conditional mean $E(Y\mid X)$ (the center of the distribution), the QR procedure is able to detect heterogeneous effects of predictors at different quantile levels of the outcome, as it models the conditional quantiles $Q_{Y\mid X}(\tau)$ for $0 < \tau < 1$ of the response variable $Y$ given the predictors $X$ over the entire range of quantiles in $(0, 1)$ [2]. Regression quantiles (RQs) are optimal solutions to an optimization problem obtained using linear programming (LP) algorithms [3]. An RQ at the $\tau = 0.5$ quantile level corresponds to the well-known least absolute deviation (LAD) estimator, i.e., the $\ell_1$ estimator. Although RQs are robust to Y-space outliers, they are susceptible to high leverage points, as their influence functions are bounded in the Y-space but unbounded in the X-space. Amongst the numerous sources of multicollinearity, some high leverage observations tend to influence the eigen-structure of the predictor matrix, thereby creating multicollinearity or hiding it [4]. Such high leverage points are referred to as collinearity influential points. However, not all high leverage observations are collinearity influential points. As remedies to the influence of high leverage points, the weighted LAD (WLAD) [5] and weighted QR (WQR) have been suggested in the literature [6].
In both variable selection and regularization, in order to enhance the prediction accuracy and interpretability of statistical models, the RIDGE-type [7] and least absolute shrinkage and selection operator (LASSO)-type [8-10] penalties, a hybrid of these two penalties, viz., the elastic net (E-NET) penalty [11,12], as well as the smoothly clipped absolute deviation (SCAD) penalty [13,14] have been suggested in the literature. Notable extensions of the LASSO penalty in the literature are the adaptive LASSO proposed by [15], the fused LASSO [16], and the group LASSO [17]. In high-dimensional sparse models, where ordinary quantile regression is not consistent, [18] proposed the $\ell_1$-penalized QR. The author of [14] built upon the procedure of [19] on model selection in composite quantile regression (CQR) and suggested weighted CQR (WCQR), a procedure based on data-driven efficient weights, as the former researcher's equal-weight procedure lacked optimality. To mitigate against the undesirable effects of high leverage points in variable selection in the $\ell_1$ estimator ($Q_{Y\mid X}(0.5)$), the weighted LAD-LASSO (WLAD-LASSO) procedure has been suggested [20,21]. Few variable selection procedures based on WQR have been suggested in the QR framework in different settings. We generalize the WLAD-LASSO approach to penalized WQR to mitigate against collinearity influential points and high leverage points in general, and maintain robustness via the use of very robust weights.
In summary, the motivations of this study are premised on the following:
• The generalization of the WLAD-LASSO [20,21] procedure (in addition, we also include the RIDGE and the E-NET penalties) to the QR framework, i.e., to penalized WQR, as each RQ (including the LAD estimator) is a local measure, unlike the LS estimator, which is a global one.
• Rather than carrying out an "omnibus" study of penalized WQR as in [20,21], we carry out a detailed study by distinguishing different types of high leverage points, viz.,
  - Collinearity influential points, which comprise collinearity inducing and collinearity hiding points;
  - High leverage points which are not collinearity influential.
• Taking advantage of high computing power, we make use of very robust weights based on the computationally intensive high breakdown minimum covariance determinant (MCD) method, rather than the well-known classical Mahalanobis distance or any LS based weights as in [20], which are susceptible to outliers.
The remainder of this article is structured as follows. Some preliminaries on QR and variable selection in QR are discussed in Section 2. Variable selection in WQR, as well as the motivation for our choice of weights, is detailed in Section 3, while simulation studies are presented in Section 4. In Section 5, applications to two well-known data sets from the literature are given. Lastly, Section 6 concludes the paper.

Quantile Regression
Consider the linear regression model given by

$$y_i = x_i^{\top}\beta + \epsilon_i, \quad i = 1, \ldots, n, \qquad (1)$$

where $y_i$ denotes the value of the response variable vector $Y$ for the $i$th observation, $x_i = (x_{i1}, \ldots, x_{ip})^{\top}$ denotes the vector of $p$ predictor variables from the $n \times p$ design matrix $X$ excluding the intercept, $\beta = (\beta_1, \ldots, \beta_p)^{\top}$ denotes a $p \times 1$ vector of yet to be estimated unknown regression coefficients (parameters), and $\epsilon_i$ denotes the value of the $i$th random error term, with cumulative distribution function $F$ ($\epsilon_i \sim F$).
QR is based on an optimization problem which can be solved by linear programming techniques, viz.,

$$\hat{\beta}(\tau) = \arg\min_{\beta} \sum_{i=1}^{n} \rho_\tau\!\left(y_i - x_i^{\top}\beta\right), \qquad (2)$$

where $\rho_\tau(u) = u[\tau - I(u < 0)]$ is the check (loss) function and $\hat{\beta}(\tau)$ denotes the $\tau$th RQ.
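For illustration, the minimization in Equation (2) is available in R through the quantreg package; the following is a minimal sketch with illustrative data names of our own choosing:

```r
# Minimal sketch: regression quantiles via the quantreg package.
library(quantreg)

set.seed(1)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rt(n, df = 1)           # heavy-tailed (t_1) errors

# The check function rho_tau(u) = u * (tau - I(u < 0)) of Equation (2)
rho <- function(u, tau) u * (tau - (u < 0))

# RQs at the median (the LAD estimator) and the lower quartile
fit50 <- rq(y ~ x, tau = 0.50)
fit25 <- rq(y ~ x, tau = 0.25)
coef(fit50); coef(fit25)
sum(rho(residuals(fit50), tau = 0.5))    # objective value at the optimum
```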

Variable Selection in Quantile Regression
In this section, we discuss QR variable selection. We specifically present variable selection using LS-RIDGE [7], LASSO [9], adaptive LASSO [15], and E-NET [11] in a QR scenario. Firstly, we consider QR penalized with the RIDGE penalty [7], denoted by QR-RIDGE. The QR-RIDGE estimator is given by the minimization problem

$$\min_{\beta} \sum_{i=1}^{n} \rho_\tau\!\left(y_i - x_i^{\top}\beta\right) + \lambda \sum_{j=1}^{p} \beta_j^2, \qquad (3)$$

where $\lambda$ is a positive ridge parameter in the range $0 < \lambda < 1$ and other terms are as defined in Equation (2). Many variations of $\lambda$ have been used in the literature (see [7,22-26]). QR with the RIDGE (squared $\ell_2$) penalty has been proposed as a remedy to the multicollinearity problem [25]. The presence of multicollinearity results in unduly large sample variances, resulting in unreliable inference and prediction. Secondly, we consider the QR variable selection procedure which uses the LASSO ($\ell_1$) penalty [9], denoted by QR-LASSO. The QR-LASSO is then given by the minimization problem

$$\min_{\beta} \sum_{i=1}^{n} \rho_\tau\!\left(y_i - x_i^{\top}\beta\right) + \lambda \sum_{j=1}^{p} |\beta_j|, \qquad (4)$$

where $\lambda$ is the tuning parameter that shrinks some beta coefficients towards zero, the second term is the penalty term, and other terms are as defined in Equation (2). This $\ell_1$-penalized QR may be superior to the squared-$\ell_2$-penalized QR in Equation (3) in some instances.
Considering the more recent adaptive LASSO penalty of [15] in a penalized QR scenario, the tuning parameter is no longer constant ($\lambda$) but $\lambda_j$ for $j = 1, 2, \ldots, p$. The minimization problem becomes

$$\min_{\beta} \sum_{i=1}^{n} \rho_\tau\!\left(y_i - x_i^{\top}\beta\right) + \sum_{j=1}^{p} \lambda_j |\beta_j|, \qquad (5)$$

where $\lambda_j$ is the $j$th tuning parameter that shrinks some predictor variables to zero and other unknowns are as defined in Equation (4). We also present a penalized QR procedure that uses the elastic net penalty from [11] (E-NET-penalized QR, which is best suited for applications with unidentified groups of predictors). The E-NET-penalized QR is given by

$$\min_{\beta} \sum_{i=1}^{n} \rho_\tau\!\left(y_i - x_i^{\top}\beta\right) + \lambda\left[\alpha \sum_{j=1}^{p} |\beta_j| + \frac{1-\alpha}{2} \sum_{j=1}^{p} \beta_j^2\right], \qquad (6)$$

where $\alpha \in [0, 1]$ is the mixing parameter and $\lambda$ is the tuning parameter. Note that, for $\alpha = 0$, the E-NET penalty reduces to the RIDGE penalty while, for $\alpha = 1$, it reduces to the LASSO penalty. In some instances, the E-NET based procedure performs better than its RIDGE and LASSO counterparts [11]. Since QR is prone to outliers in the predictor space, weighted QR has been proposed as a remedy to high leverage points [6]. In the subsequent section, we motivate the choice of weights used in robust variable selection in the QR framework.
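These penalized QR fits can be obtained in R, for instance with the hqreg package used later in our simulations; the following is a minimal sketch with illustrative data (we assume the installed hqreg version accepts α = 0 for a pure ridge penalty):

```r
# Sketch: RIDGE-, LASSO-, and E-NET-penalized QR paths via hqreg.
library(hqreg)

set.seed(2)
n <- 50; p <- 8
X <- matrix(rnorm(n * p), n, p)
beta <- c(3, 1.5, 0, 0, 2, 0, 0, 0)     # illustrative sparse coefficients
y <- drop(X %*% beta) + rt(n, df = 6)

# method = "quantile" selects the check loss; alpha mixes the penalties
fit_ridge <- hqreg(X, y, method = "quantile", tau = 0.5, alpha = 0)    # QR-RIDGE
fit_lasso <- hqreg(X, y, method = "quantile", tau = 0.5, alpha = 1)    # QR-LASSO
fit_enet  <- hqreg(X, y, method = "quantile", tau = 0.5, alpha = 0.5)  # QR-E-NET
```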

Choice of Weights for Downweighting High Leverage Observations: Motivation
In the LS case, statistics of the hat (projection) matrix $h_i = x_i^{\top}(X^{\top}X)^{-1}x_i$ [4] have been used as standard tools to generate weights for weighted LS (WLS) estimation. Although this approach is both mathematically and practically tractable, such estimators have a breakdown point of only $1/n$, so a single high leverage point can completely dominate the ensuing estimates. Furthermore, such weights may suffer from the masking and swamping effects associated with the LS. Permitting contamination in both the predictor and response variables results in the breakdown point of the LAD (and hence QR) estimator being the same as that of the LS, $1/n$ (see, e.g., [27]). To circumvent the undesirable effects of both outliers and high leverage points, Ref. [5] proposed a weighted version of the LAD (WLAD) estimator. The weights of this estimator are based on the computationally intensive high breakdown Minimum Covariance Determinant (MCD) method [28], downweighting observations with large robust distances

$$RD(x_j) = \sqrt{(x_j - \hat{\mu})^{\top}\, \hat{\Sigma}^{-1}\, (x_j - \hat{\mu})}, \qquad (7)$$

a modification of the classical Mahalanobis distance, where $\hat{\mu}$ is the center of the smallest ellipsoid (whose classical covariance matrix has the lowest possible determinant) containing half (or $h$ observations as defined by the user) of the observations of the design matrix $X$, and $\hat{\Sigma}$ is their covariance matrix multiplied by a consistency factor [29]. To sidestep the huge computational load associated with this weight, Ref. [30] suggested weights based on the $i$th leverage value relative to a clean subset $X_c$ (without high leverage points). However, in this study, we make use of $RD(x_j)$, due to the existing computer power (efficient algorithms) as well as its robustness, to generalize the WLAD concept to the whole set of weighted conditional quantiles, i.e., weighted regression quantiles (WRQs).
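The robust distances in Equation (7) can be computed in R with the robustbase package. The following is a minimal sketch; the downweighting rule shown (a chi-squared cutoff) is one common choice and is our assumption rather than the paper's exact weight:

```r
# Sketch: MCD-based robust distances and leverage weights.
library(robustbase)

set.seed(3)
n <- 50; p <- 8
X <- matrix(rnorm(n * p), n, p)
X[1, ] <- X[1, ] + 10          # plant one high leverage point

mcd <- covMcd(X)               # high-breakdown MCD location/scatter
rd  <- sqrt(mahalanobis(X, mcd$center, mcd$cov))   # RD(x_i), Eq. (7)

# One common downweighting rule (an assumption, not necessarily Eq. (7)):
cutoff <- sqrt(qchisq(0.975, df = p))
w <- pmin(1, (cutoff / rd)^2)  # w_i = 1 for clean points, < 1 for leverage points
round(w[1:5], 3)
```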
Generalizing the WLAD regression estimator of [20] and making use of the MCD based weights, we suggest a WQR estimator as a minimization problem given by

$$\hat{\beta}_{WQR}(\tau) = \arg\min_{\beta} \sum_{i=1}^{n} \omega_i\, \rho_\tau\!\left(y_i - x_i^{\top}\beta\right), \qquad (8)$$

where $\rho_\tau(\cdot)$ and $\omega_i$ are as defined in Equations (2) and (7), respectively.
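In R, the minimization in Equation (8) can be carried out directly, since quantreg's rq() accepts case weights; a short sketch reusing X and the MCD-based weights w from the fragment above (the coefficient vector is illustrative):

```r
# Sketch: weighted regression quantiles (Equation (8)) via quantreg,
# reusing X and the MCD-based weights w from the previous fragment.
library(quantreg)

beta_true <- c(3, 1.5, 0, 0, 2, 0, 0, 0)      # illustrative coefficients
y <- drop(X %*% beta_true) + rt(nrow(X), df = 1)

dat <- data.frame(y, X)
fit_wqr <- rq(y ~ ., data = dat, tau = 0.5, weights = w)  # weighted LAD
coef(fit_wqr)
```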

Penalized Weighted Quantile Regression
Building on Equation (8), we suggest penalized weighted QR (WQR) variable selection procedures using the MCD based weight in Equation (7) to counter the undesirable influences of high leverage observations. We achieve robustness of WQR in the X-space due to the robustness of these MCD based weights, chosen appropriately as in the LS case (see also [5,30]). The suggested penalized WQR variable selection procedures are based on three penalty functions, viz., the RIDGE, LASSO, and E-NET penalties, bringing about the WQR-RIDGE, WQR-LASSO, and WQR-E-NET estimators, respectively.
Let $\omega_i$ be a robust weight. We use $\omega_i$ in our proposed quantile variable selection procedures. The incorporation of the WLAD-LASSO formulation in the procedure inherits the X-space robustness property as in [20]. The weight $\omega_i$ was discussed in Section 3. First, we propose the WQR-RIDGE estimator given by

$$\min_{\beta} \sum_{i=1}^{n} \omega_i\, \rho_\tau\!\left(y_i - x_i^{\top}\beta\right) + \lambda \sum_{j=1}^{p} \beta_j^2, \qquad (9)$$

where the terms are as already defined. The tuning parameter $\lambda$ shrinks the coefficients of the predictor variables towards zero.
We take advantage of the properties of the WLAD-LASSO [20] and propose a weighted quantile variable selection procedure called WQR-LASSO, which is expected to be robust and superior to its penalized QR counterpart, QR-LASSO. The WQR-LASSO is the solution of the minimization problem

$$\min_{\beta} \sum_{i=1}^{n} \omega_i\, \rho_\tau\!\left(y_i - x_i^{\top}\beta\right) + \lambda \sum_{j=1}^{p} |\beta_j|, \qquad (10)$$

where the tuning parameter $\lambda$ is constant. In the literature, this procedure is found to perform better than the WQR-RIDGE procedure under deviations from the Normality assumption. Lastly, we apply the E-NET penalty to WQR variable selection to bring about the WQR-E-NET procedure. This weighted penalized QR procedure has both the LASSO and RIDGE penalty properties inherent in it [11]. The WQR-E-NET is the solution to the minimization problem

$$\min_{\beta} \sum_{i=1}^{n} \omega_i\, \rho_\tau\!\left(y_i - x_i^{\top}\beta\right) + \lambda\left[\alpha \sum_{j=1}^{p} |\beta_j| + \frac{1-\alpha}{2}\sum_{j=1}^{p} \beta_j^2\right], \qquad (11)$$

where $\alpha$ and $\lambda$ are as in Equation (6).
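Since the check function is positively homogeneous, $\omega_i \rho_\tau(y_i - x_i^{\top}\beta) = \rho_\tau(\omega_i y_i - (\omega_i x_i)^{\top}\beta)$ for $\omega_i > 0$, Equations (9)-(11) can in principle be fitted with any penalized QR solver by rescaling each case by its weight. The following sketch uses hqreg; this is our own device, not necessarily the implementation used in the paper, and hqreg still fits its own unweighted intercept, which is innocuous when $\beta_0 = 0$ as in our simulations:

```r
# Sketch: penalized WQR via case rescaling. For w_i > 0 the check loss
# satisfies w_i * rho_tau(u) = rho_tau(w_i * u), so weighting the loss is
# equivalent to fitting penalized QR on (w_i * x_i, w_i * y_i).
library(hqreg)

Xw <- X * w            # each row of X scaled by its weight (w from Section 3)
yw <- y * w

fit_wqr_lasso <- hqreg(Xw, yw, method = "quantile", tau = 0.5, alpha = 1)    # WQR-LASSO
fit_wqr_enet  <- hqreg(Xw, yw, method = "quantile", tau = 0.5, alpha = 0.5)  # WQR-E-NET
```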

Asymptotic Properties
We conveniently decompose the regression coefficient vector as $\beta = (\beta_1^{\top}, \beta_2^{\top})^{\top}$, with the true regression coefficient $\beta_1$ corresponding to the nonzero coefficients and $\beta_2 = 0$.
To establish asymptotic normality, suppose for a suitable choice of $\lambda_n$ that the following two theoretical conditions, as stated in [13], hold:
(i) The regression errors $\epsilon_i$'s are i.i.d., with $\tau$th quantile zero and continuous, positive density $f(\cdot)$ in a neighborhood of zero (see [31]).
(ii) Let $W = \mathrm{diag}(\omega_1, \omega_2, \ldots, \omega_n)$, where $\omega_i$ for $i = 1, 2, \ldots, n$ are known positive values that satisfy $\max\{\omega_i\} = O(1)$.
We first give the following Theorem 1 (oracle property) for the i.i.d. error terms case ($W = I_n$) before we consider the non-i.i.d. error terms case which concerns this study.

Theorem 1. Let $\hat{\beta}$ be the penalized WQR minimizer in (12) satisfying conditions (i) and (ii) (with $W = I_n$). If $\sqrt{n}\lambda_n \to 0$ and $n^{(\gamma+1)/2}\lambda_n \to \infty$, then we have (1) Sparsity: $\hat{\beta}_2 = 0$; (2) Asymptotic normality: $\sqrt{n}(\hat{\beta}_1 - \beta_1)$ converges in distribution to a normal law.

In order to extend the conclusions of the i.i.d. error terms case to the non-i.i.d. error terms case, we invoke the following assumptions from [32]: the random error terms $\epsilon_i$'s are independent, with $F_i(t) = P(\epsilon_i \leq t)$ the distribution function of $\epsilon_i$. We assume $F_i(\cdot)$ is locally linear near zero (with a positive slope) and $F_i(0) = \tau$, with the integrated function $\int_0^t F_i(s)\,ds$ denoting a convex function for each $n$ and $i$.

Simulation Study
In this section, we perform a simulation study to investigate the finite-sample performance of penalized WQR under the RIDGE, the LASSO, and the E-NET (for $\alpha = 0.5$) penalty functions, in terms of variable selection and regularization of the regression parameters, making use of Equations (9)-(11) and the robust MCD based weight $\omega_i$ given in Equation (7), in comparison to their unweighted versions. For brevity, we consider the $\tau = 0.5$ (the LAD estimator) and $\tau = 0.25$ RQ levels. We summarize the simulation results in terms of the percentage of correctly estimated regression models, the average number of correct zero coefficients ($\beta_3$, $\beta_4$, $\beta_6$, $\beta_7$, and $\beta_8$), and the average number of incorrect zero coefficients, along with the median of the test error and its respective measure of dispersion, $\mathrm{MAD} = 1.4826 \times \mathrm{median}_i\,|e_i - \mathrm{median}_j(e_j)|$, where the constant 1.4826 is a correction factor which makes the MAD consistent at Normal distributions (see, e.g., [29]). The simulation study is designed according to the following scenarios, with predictor matrices of size $n \times p$, $p = 8$ and $n = 50, 100$ (but we only give results for $n = 50$ for brevity); the construction of D1-D5 is sketched in code after the list below:
• D1 - This predictor matrix is obtained from orthogonalization such that $X^{\top}X = nI$.
Using the singular value decomposition (SVD), we write $W = UDV^{\top}$, where $w_{ij} \sim N(0, 1)$ for $i = 1, \ldots, n$ and $j = 1, \ldots, p$; $U$ and $V$ are orthogonal, with the diagonal entries of $D$ giving the singular values of $W$. Then, $X = \sqrt{n}U$ is such that $X^{\top}X = nI$ due to the orthonormality of the columns of $U$.
• D2 - has a collinearity inducing point: D1, but with the observation having the largest Euclidean distance from the center of the design space moved 10 units in the X-space.
• D3 - has a collinearity hiding point: D1, but with the observations having the largest and second largest Euclidean distances from the center of the design space moved 10 units in the X-space.
• D4 - has a collinearity inducing point: as D2, but with the observation moved 100 units in the X-space.
• D5 - has a collinearity hiding point: as D3, but with the observations moved 100 units in the X-space.
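A minimal sketch of how D1-D5 can be constructed under these descriptions follows; the direction of the shift is our assumption, as the paper does not specify it:

```r
# Sketch: constructing the simulation designs D1-D5.
set.seed(4)
n <- 50; p <- 8

# D1: orthogonalized design with X'X = nI via the SVD W = U D V'
W  <- matrix(rnorm(n * p), n, p)
U  <- svd(W)$u
D1 <- sqrt(n) * U
stopifnot(all.equal(crossprod(D1), n * diag(p)))   # check X'X = nI

# Shift the k most extreme rows by 'shift' units (direction assumed:
# along the unit vector from the center of the design to the observation)
shift_extreme <- function(X, k, shift) {
  ctr <- colMeans(X)
  d   <- sqrt(rowSums(sweep(X, 2, ctr)^2))         # distances from center
  idx <- order(d, decreasing = TRUE)[1:k]
  for (i in idx) X[i, ] <- X[i, ] + shift * (X[i, ] - ctr) / d[i]
  X
}

D2 <- shift_extreme(D1, k = 1, shift = 10)    # collinearity inducing
D3 <- shift_extreme(D1, k = 2, shift = 10)    # collinearity hiding pair
D4 <- shift_extreme(D1, k = 1, shift = 100)
D5 <- shift_extreme(D1, k = 2, shift = 100)
```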
The regression coefficients are sparse, with zero intercept ($\beta_0 = 0$) and with zero coefficients in positions 3, 4, 6, 7, and 8 (the set $\{\beta_3, \beta_4, \beta_6, \beta_7, \beta_8\}$ above). For the error terms, we consider the Normal distribution as well as t distributions ($t_6$ and $t_1$). The design matrix D1 is used as a baseline, and a schematic representation of the D2-D5 departures from it is shown in Figure 1; D2 and D4 have a high leverage point that induces collinearity, while D3 and D5 each have a pair of high leverage points, viz., one inducing the collinearity and the other hiding it. On the other hand, D6 has both multicollinearity, due to the covariance structure $V$ of the submatrix $X_1$, and high leverage points, due to the mean shift in $X_2$. For D1-D6, the response variable $Y = (Y_1^{\top}, Y_2^{\top})^{\top}$ is generated as in [20]. The number of simulation runs is 200, while 100-fold cross-validations are employed to obtain the tuning parameters $\lambda$. Instead of using cross-validation error metrics, independent tuning data sets and testing data sets of size $n$ and $100n$ were generated in exactly the same way as the training data sets (see, e.g., [13]). This simulation study explores these different scenarios in both QR and WQR variable selection procedures using the R add-on package hqreg, in which the solution path over the regularization parameter lambda is fit using a semismooth Newton coordinate descent algorithm. See [34] for details.
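A sketch of the tuning and evaluation loop just described, under our reading of it: fit the $\lambda$ path on the training data, pick $\lambda$ on an independent tuning set, and report the median and MAD of absolute test errors (gen_data below is a hypothetical helper standing in for whichever D-scenario generator is in use):

```r
# Sketch: choose lambda on an independent tuning set and evaluate on a
# large test set. Each of train/tune/test is a list with elements X and y,
# e.g. produced by a hypothetical gen_data() for the current D-scenario.
library(hqreg)

evaluate <- function(train, tune, test, tau = 0.5, alpha = 1) {
  fit  <- hqreg(train$X, train$y, method = "quantile", tau = tau, alpha = alpha)

  # quantile (check) loss on the tuning set for every lambda on the path
  pred <- predict(fit, tune$X)
  chk  <- function(u) u * (tau - (u < 0))
  loss <- colMeans(chk(tune$y - pred))
  best <- fit$lambda[which.min(loss)]

  err <- test$y - predict(fit, test$X, lambda = best)
  c(lambda = best, med = median(abs(err)), MAD = mad(err))  # mad() uses 1.4826
}
# Usage (hypothetical): evaluate(gen_data(n), gen_data(n), gen_data(100 * n))
```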
A schematic representation of collinearity influential points is given in Figure 1 below, showing scatter plots of the predictor matrices D2-D5. The extreme observation in D2 and D4 creates collinearity, while the second extreme observation in D3 and D5 obscures it. We only consider the Normal distribution at the well-behaved orthogonal predictor matrix D1; from there on, we only consider the t distribution, as QR is designed to handle distributions with heavier tails than the Normal one, which is handled best by the LS.

D1 SCENARIO
As a point of departure, we consider the well-behaved predictor matrix D1 under the Normal distribution, as contrasted with D1 under the t distribution on 1 d.f. (implying outliers). The results given in Table 1 are as expected. At D1 under the Normal distribution, variable (model) selection performs best under the LASSO penalty, followed by the E-NET penalty, across all models, with no marked differences in the median and MAD test error measures, indicating the robustness of the QR-LASSO procedure under the t distribution. This is further illustrated graphically in Figures 2 and 3, where the QR-LASSO procedure correctly shrinks the zero coefficients $\{\beta_3, \beta_4, \beta_6, \beta_7, \beta_8\}$ to zero more often than not.

Remark 2.
The five zero coefficients correspond to the set $\{\beta_3, \beta_4, \beta_6, \beta_7, \beta_8\}$; hence, the maximum average of correctly/incorrectly selected (shrunk) coefficients is 5, while the set of correctly selected models is given as a proportion, i.e., a percentage.

D2 AND D4 SCENARIOS
We consider the introduction of collinearity inducing points in both D2 and D4. The RIDGE penalty performs the worst in every scenario, in both penalized QR and penalized WQR, in model/variable selection. The model/variable selection pattern at D2 and D4 under both the $t_6$ and $t_1$ distributions is shown in Figure 4. The dominance of WQR-LASSO, followed by WQR-E-NET, is clearly depicted. However, the performance of WQR-E-NET is generally better under the $t_1$ distribution than under the $t_6$ distribution. In Figure 4 (lower panels), both the median absolute test error and its MAD measure (in the line graph) show that the unweighted versions generally outperform the weighted ones.

D3 AND D5 SCENARIOS
We consider the introduction of collinearity hiding points in both D3 and D5. The performances under the $t_6$ and $t_1$ distributions are shown graphically in Figure 5. In model/variable selection, throughout all the scenarios, the weighted penalized versions outperform the unweighted penalized versions under the LASSO and E-NET penalties. Amongst the penalized weighted versions, WQR-RIDGE performs the worst, while WQR-LASSO performs better than WQR-E-NET. The prediction pattern under the t distributions exhibited at D3 and D5 in Figure 5 (lower panel) differs from that exhibited at D2 and D4 in that the MAD error measure is more erratic at D3 and D5, while the absolute median error is less erratic. In fact, the prediction pictures of QR-LASSO and WQR-LASSO at $\tau = 0.5$ based on the absolute median error are similar at D3 and D5, whereas, at D2 and D4, WQR-LASSO performs better with respect to this measure.

D6 SCENARIO
The D6-scenario model/variable selection performance outcomes under the t distribution are given in Figure 6, in which the dominance of the LASSO penalty in both QR and weighted QR is clearly depicted, with WQR-LASSO far outperforming QR-LASSO in model/variable selection. Overall, on average, the zero $\beta$s are incorrectly selected most often under the $t_1$ distribution and at $\sigma = 1$. The prediction pattern under the t distribution exhibited under D6 in Figure 6 (lower panel) is slightly poorer under the $t_1$ distribution compared to that exhibited under the $t_6$ distribution.

Examples
In this section, we consider two data sets often used to illustrate the efficacy of robust methodologies in mitigating against high leverage points in general, as well as collinearity influential points in particular, viz., the [35] and [36] data sets. In both data sets, the response is generated based on the $t_1$ error term distribution, in line with testing the efficacy of robust procedures like QR, as in [20].

Remark 3.
The LS procedure is adversely affected by both high leverage points and outliers in the Y-space; hence, it consistently gives the worst performance, as expected. On the other hand, QR is not affected by the latter data aberration. Consequently, we mainly focus on the efficacy of penalized WQR at quantile levels $\tau = 0.5$ and $\tau = 0.25$ in addressing the problem of high leverage points, with the intercept $F^{-1}(\tau) + \beta_0$ corresponding to 0 and $-1$, respectively, under the $t_1$ error term distribution.
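These intercept values follow directly from the $t_1$ (Cauchy) quantiles with $\beta_0 = 0$, as a quick check in R confirms:

```r
# t_1 (Cauchy) quantiles: F^{-1}(0.5) = 0 and F^{-1}(0.25) = -1
qt(c(0.50, 0.25), df = 1)   # returns 0 and -1
```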

Hawkins, Bradu, and Kass Data Set
We firstly consider the [35] data set, which consists of 75 observations with three predictor variables. The first 14 of the 75 observations are high leverage points, with the first 13 observations inducing the collinearity, while the 14th observation greatly affects the collinearity structure on its own, although it is also a collinearity inducing point (see [37]). Figure 7 shows the leverage structure for the predictor variables of this data set based on the robust MCD based distance as well as the classical one. We give results for the full data set and reduced data sets without observations 1-14.
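The leverage structure in Figure 7 can be reproduced in R, where this data set ships with the robustbase package as hbk; a brief sketch:

```r
# Sketch: classical vs. MCD-based robust distances for the
# Hawkins-Bradu-Kass predictors (hbk ships with robustbase).
library(robustbase)

X  <- as.matrix(hbk[, c("X1", "X2", "X3")])
md <- sqrt(mahalanobis(X, colMeans(X), cov(X)))    # classical distances
mcd <- covMcd(X)
rd  <- sqrt(mahalanobis(X, mcd$center, mcd$cov))   # robust distances

cutoff <- sqrt(qchisq(0.975, df = ncol(X)))
which(rd > cutoff)    # flags (at least) observations 1-14
```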
The response variable is generated as $Y_2 = X_2\beta_2$ for the first 10 observations and $Y_1 = X_1\beta_1 + \epsilon$, $\epsilon \sim t_1$, for the remainder of the data, where $\beta_2 = (2, 2, 0)^{\top}$. The results for the full data set are given in Table 2. Similarly poor performances of penalized QR at both $\tau$-levels are exhibited across all penalty functions. However, penalized WQR exhibits a drastic improvement at $\tau = 0.5$, where penalized WQR is equivalent to unpenalized WQR under both the RIDGE and the E-NET penalties, as $\lambda = 0$. At $\tau = 0.25$, where $\lambda \neq 0$, WQR tends to be too "greedy" under the LASSO and E-NET penalties, shrinking all the parameters to zero.
Results for the reduced [35] data set with a "clean" predictor matrix (without observations 1-14) are given in Table 3. As expected, both QR and WQR select unpenalized models (models with tuning parameter λ = 0) except WQR at τ = 0.25. QR performs considerably well at both τ levels and across penalty functions. However, there is a marginal improvement in using penalized WQR at τ = 0.5 while, at τ = 0.25, the LASSO and E-NET penalties are too "greedy".

Hocking and Pendleton Data Set
While the [35] data set is an example of high leverage points that are collinearity inducing points, the [36] data set is an example of high leverage points that are collinearity hiding points. This data set has 26 observations with three predictor variables, $X_1$, $X_2$, and $X_3$, whereby $X_3$ is created as a linear combination of $X_1$ and $X_2$. The response variable is generated as $Y_1 = X_1\beta_1 + \epsilon$, $\epsilon \sim t_1$, for the first 22 observations and $Y_2 = X_2\beta_2$ for the remainder of the data, where $\beta_1 = (3, -2, 0)^{\top}$ and $\beta_2 = (1, 1, 0)^{\top}$ such that $Y = (Y_1^{\top}, Y_2^{\top})^{\top}$. Figure 8 shows the leverage structure for the predictor variables of this data set based on the robust MCD based distance as well as the classical one. We give results for the full data set and a reduced data set (without the collinearity hiding observation 24).
The results for the full data set are given in Table 4. Under penalized unweighted QR, $\beta$ is estimated best at $\tau = 0.25$, while, under penalized WQR, it is estimated best at $\tau = 0.5$, across all penalty functions. Under penalized WQR, the model with tuning parameter $\lambda = 0$ is selected, indicating that unpenalized WQR exhibits the optimal model. There is an improvement in adopting unpenalized WQR compared to penalized QR at both $\tau$-levels. However, the best parameter estimation is exhibited under unpenalized WQR at $\tau = 0.5$. The results for the reduced data set without observation 24, which leaves only one mild leverage point (observation 8), are given in Table 5. Cross-validation results consistently chose unpenalized QR and unpenalized WQR models as the optimal models, except for QR-LASSO, QR-E-NET, and WQR-LASSO at $\tau = 0.5$, where $\lambda \neq 0$. The best results are exhibited under unpenalized WQR at $\tau = 0.5$, followed by unpenalized WQR at $\tau = 0.25$.

Conclusions
This paper suggested penalized WQR procedures making use of robust weights based on the computationally intensive high breakdown MCD method, rather than the well-known classical Mahalanobis distance or any other LS based weights, as in [20], which are susceptible to outliers. As penalty functions, the RIDGE, LASSO, and E-NET penalties were used, yielding the procedures WQR-RIDGE, WQR-LASSO, and WQR-E-NET, respectively. The efficacy of these procedures as a remedy to high leverage points and collinearity influential points was investigated via simulations and applications to well-known data sets from the literature.
Simulation studies show that, generally, the robustly weighted penalized QR versions perform better than their unweighted counterparts, with WQR-LASSO generally performing the best, although only marginally so at D2 and D4 under the Normal distribution. However, there are a few exceptions: at D2 and D4 under the Normal distribution, with respect to model/variable selection, WQR-LASSO and WQR-E-NET alternately dominate each other, while, with respect to prediction, penalized QR performs better than penalized WQR. The occasional dominance of WQR-E-NET over WQR-LASSO is expected (see, e.g., [11]).
Applications to well-known data sets from the literature show that, in some cases, applying the MCD based robust weight alone is adequate, i.e., WQR (tuning parameter $\lambda = 0$) performs better than penalized WQR. The best performance is mostly at quantile level $\tau = 0.5$, while WQR-LASSO and WQR-E-NET are too "greedy" at $\tau = 0.25$, shrinking all parameters to zero. Thus, overall, simulations and applications to the [35] and [36] data sets show an improvement in variable selection and regularization due to the robust weighting formulation.

Acknowledgments: The authors would like to thank the editor and the two referees for carefully reading the manuscript and for the comments which greatly improved the article.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: