Article

An Efficient Method for Variable Selection Based on Diagnostic-Lasso Regression

by Shokrya Saleh Alshqaq 1 and Ali H. Abuzaid 2,*
1 Department of Mathematics, Jazan University, Jazan 45142, Saudi Arabia
2 Department of Mathematics, Al-Azhar University-Gaza, Gaza P.O. Box 1277, Palestine
* Author to whom correspondence should be addressed.
Symmetry 2023, 15(12), 2155; https://doi.org/10.3390/sym15122155
Submission received: 10 September 2023 / Revised: 14 November 2023 / Accepted: 22 November 2023 / Published: 4 December 2023
(This article belongs to the Section Mathematics)

Abstract

In contemporary statistical methods, robust regression shrinkage and variable selection have gained paramount significance due to the prevalence of datasets characterized by contamination and an abundance of variables, often categorized as ‘high-dimensional data’. The Least Absolute Shrinkage and Selection Operator (Lasso) is frequently employed in this context for both model fitting and variable selection. However, despite its power and widespread practical use, regression diagnostic measures have not previously been applied to Lasso regression. This work introduces a combined Lasso and diagnostic technique to enhance Lasso regression modeling for high-dimensional datasets with multicollinearity and outliers, using a diagnostic Lasso estimator (D-Lasso). The breakdown point of the proposed method is also discussed. Finally, simulation examples and analyses of real data are provided to support the conclusions. The results of the numerical examples demonstrate that the D-Lasso approach performs as well as, if not better than, the robust Lasso method based on the MM-estimator.

1. Introduction

The usual estimators are inapplicable in cases where the matrix $X^T X$ is singular. In practice, when estimating $\beta$ or pursuing variable selection, high empirical correlations between two or more covariates (multicollinearity) lead to unstable outcomes. This study uses the least absolute shrinkage and selection operator (Lasso) estimation method in order to circumvent this issue.
Lasso estimation is a type of variable selection. It was first described by Tibshirani [1] and was further studied by Fan and Li [2]. They explored a class of penalized likelihood approaches to address these types of problems, including the Lasso problem.
In contrast, when outliers are present in the sample, classical least squares and maximum likelihood estimation methods often fail to produce reliable results. In such situations, there is a need for an estimation method that can effectively handle multicollinearity and is robust in the presence of outliers.
Zou [3] introduced the concept of assigning adaptive weights that penalize different coefficients to different degrees. Because this penalty is convex, it typically leads to convex optimization problems, ensuring that the estimators do not suffer from local minima issues. These adaptive weights in the penalty term also allow the estimator to enjoy oracle properties.
To create a robust Lasso estimator, the authors of [4] proposed combining the least absolute deviation (LAD) loss with an adaptive Lasso penalty (LAD-Lasso). This approach results in an estimator that is robust against outliers and proficient at variable selection. Nevertheless, it is important to note that the LAD loss is not designed for handling small errors; it penalizes small residuals severely. Consequently, this estimator may be less accurate than the classic Lasso when the error distribution lacks heavy tails or outliers.
In a different approach, Lambert-Lacroix, S. and Zwald, L. [5] introduced a novel estimator by combining Huber’s criterion with an adaptive Lasso penalty. This estimator demonstrates resilience to heavy-tailed errors and outliers in the response variable.
Additionally, ref. [6] proposed the Sparse-LTS estimator, a least-trimmed-squares estimator with an $L_1$ penalty, and demonstrated that it is robust to contamination in both the response and predictor variables.
Furthermore, ref. [7] combined MM-estimators with an adaptive $L_1$ penalty, yielding lower bounds on the breakdown points of the MM-Lasso and adaptive MM-Lasso estimators.
Recently, ref. [8] introduced c-lasso, a Python tool, while [9] proposed robust multivariate Lasso regression with covariance estimation.
While recent years have seen a significant focus on outlier detection using direct approaches, a substantial portion of this research has centered on single-case diagnostics (as seen in [10,11,12,13]). For high-dimensional data, ref. [12] introduced a method for identifying multiple influential observations in linear regression models.
This article introduces a novel Lasso estimator named D-Lasso. D-Lasso is grounded in diagnostic techniques and involves the creation of a clean subset of data, free from outliers, before calculating Lasso estimates for the clean samples. We anticipate that these modified Lasso estimates will exhibit greater robustness against the presence of outliers. Moreover, they are expected to yield minimized sums of squares of residuals and possess a breakdown point of 50%. This is achieved through the elimination of outlier influence, as well as addressing multicollinearity and variable selection via Lasso regression.
The paper’s structure is as follows: Section 2 provides a review of both classical and robust Lasso-type techniques. Section 3 introduces the diagnostic-Lasso estimator. Regression diagnostic measures are presented and discussed in Section 4. Section 5 offers a comparison of the proposed method’s performance against existing approaches, while Section 6 presents an analysis of the Los Angeles ozone data as an illustrative example. Finally, a concluding remark is presented in Section 7.

2. The Lasso Technique

2.1. Classical Lasso Estimator

Consider the situation in which the observed data are realizations of $\{(X_i, y_i)\}_{i=1}^{n}$, with a p-dimensional vector of covariates $X_i \in \mathbb{R}^p$ and univariate continuous response variables $y_i \in \mathbb{R}$. A basic regression model has the form $y_i = X_i^T \beta + \epsilon_i$, where $\beta$ is the vector of regression coefficients and $\epsilon_i$ is the ith error component. Tibshirani [1] assumed that X was normalized such that each covariate $X_k$, $k = 1, \ldots, p$, has mean 0 and variance 1. Letting $\hat{\beta} = (\hat{\beta}_1, \ldots, \hat{\beta}_p)$, the classical Lasso estimate $\hat{\beta}$ is defined by
$$ \hat{\beta}_{Lasso} = \arg\min_{\beta} \sum_{i=1}^{n} \left( y_i - X_i^T \beta \right)^2 + \lambda_{Lasso} \sum_{j=1}^{p} |\beta_j|, \qquad (1) $$
where $\lambda_{Lasso} > 0$ is a Lasso tuning parameter. The following adaptive Lasso (adl-Lasso) criterion, which is a modified Lasso criterion, is proposed by Zou [3]:
$$ \hat{\beta}_{adl\text{-}Lasso} = \arg\min_{\beta} \sum_{i=1}^{n} \left( y_i - X_i^T \beta \right)^2 + \lambda_{adl} \sum_{j=1}^{p} \hat{w}_j^{adl} |\beta_j|, $$
where $\hat{w}^{adl} = (\hat{w}_1^{adl}, \ldots, \hat{w}_p^{adl})$ is a known vector of weights.
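For concreteness, both criteria above can be fitted with off-the-shelf software. The sketch below (illustrative, not the authors' code) uses the glmnet package in R, with the adaptive weights $\hat{w}_j^{adl}$ taken as the reciprocal absolute ridge coefficients, which is one common choice; the data are simulated purely for illustration.

```r
library(glmnet)   # coordinate-descent solver for (adaptive) Lasso problems

set.seed(1)
n <- 100; p <- 10
X <- matrix(rnorm(n * p), n, p)
beta_true <- c(3, -2, 1.5, rep(0, p - 3))
y <- as.numeric(X %*% beta_true + rnorm(n))

# Classical Lasso: lambda chosen by 5-fold cross-validation
cv_lasso <- cv.glmnet(X, y, alpha = 1, nfolds = 5)
coef(cv_lasso, s = "lambda.min")

# Adaptive Lasso: per-coefficient weights w_j = 1 / |beta_ridge_j| (one common choice)
cv_ridge <- cv.glmnet(X, y, alpha = 0)
w <- 1 / abs(as.numeric(coef(cv_ridge, s = "lambda.min"))[-1])
cv_adl <- cv.glmnet(X, y, alpha = 1, penalty.factor = w, nfolds = 5)
coef(cv_adl, s = "lambda.min")
```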

2.2. Robust Lasso Estimator

When there are outliers in the data, the standard least squares (LS) method fails to generate accurate estimates. The Lasso is likewise not robust against outliers, since it is a particular instance of penalized least squares: the LS loss function combined with the $L_1$ penalty [14].
In these situations, we need a Lasso estimation method that can be used under multicollinearity and that performs adequately when outliers are present. The LAD-Lasso regression method proposed by [4], given below, combines the least absolute deviation (LAD) and Lasso methods; it is resistant only to outliers in the response variable, as shown by [14]:
$$ \hat{\beta}_{LAD\text{-}Lasso} = \arg\min_{\beta} \sum_{i=1}^{n} \left| y_i - X_i^T \beta \right| + \lambda_{LAD\text{-}Lasso} \sum_{j=1}^{p} |\beta_j|. $$
Combining Huber’s criterion with the adl-Lasso penalty, Ref. [5] produced another estimator that is resilient to heavy-tailed errors or outliers in the response. The Huber-Lasso estimator is written as follows:
$$ \hat{\beta}_{Huber\text{-}Lasso} = \arg\min_{\beta} \sum_{i=1}^{n} H_M\!\left( \frac{y_i - X_i^T \beta}{s} \right) + \lambda_{Huber\text{-}Lasso} \sum_{j=1}^{p} \hat{w}_j^{adl} |\beta_j|, $$
where $s > 0$ is a scale parameter of the error distribution and $H_M(\cdot)$ is Huber's criterion, used as the loss function, as introduced in [15]. For each positive real M, $H_M(z)$ is given by
$$ H_M(z) = \begin{cases} z^2, & \text{for } |z| \leq M, \\ 2M|z| - M^2, & \text{for } |z| > M. \end{cases} $$
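As an illustration of how $H_M$ tempers large residuals relative to the squared loss, the following small R function (a sketch; the conventional tuning constant M = 1.345 is assumed here rather than taken from the paper) implements the definition above:

```r
# Huber's criterion: quadratic for |z| <= M, linear (slope 2M) beyond M,
# so large residuals are penalized far less severely than under squared loss.
huber_loss <- function(z, M = 1.345) {
  ifelse(abs(z) <= M, z^2, 2 * M * abs(z) - M^2)
}

z <- c(-4, -2, -1, 0, 1, 2, 4)
cbind(z, squared = z^2, huber = huber_loss(z))
```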
The Sparse-LTS estimator proposed by [6] combines a least-trimmed-squares estimator with an $L_1$ penalty as follows:
$$ \hat{\beta}_{sparse\text{-}LTS} = \arg\min_{\beta} \sum_{i=1}^{h} \left( r^2(\beta) \right)_{i:n} + \lambda \sum_{j=1}^{p} |\beta_j|, $$
where $(r^2(\beta))_{1:n} \leq \cdots \leq (r^2(\beta))_{n:n}$ are the order statistics of the squared residuals, $h \leq n$, and $r^2(\beta) = (r_1^2, \ldots, r_n^2)^T$ denotes the vector of squared residuals with $r_i^2 = (y_i - X_i^T \beta)^2$. Ref. [6] showed in a simulation study that the Sparse-LTS can be robust against contamination in both the response and predictor variables. Ref. [7] proposed the MM-Lasso estimator, which combines MM-estimators with an adaptive $L_1$ penalty, and obtained lower bounds on the breakdown points of the MM-Lasso and adaptive MM-Lasso estimators. The MM-Lasso estimator is given as follows:
$$ \hat{\beta}_{MM\text{-}Lasso} = \arg\min_{\beta} \sum_{i=1}^{n} \rho_1\!\left( \frac{r_i(\beta)}{s_n(r(\hat{\beta}_0))} \right) + \lambda \sum_{j=1}^{p} \hat{w}_j^{adl} |\beta_j|, $$
where $\rho_1$ is a $\rho$-function (such as $H_M$), $r_i(\beta) = y_i - X_i^T \beta$, $\hat{\beta}_0$ is an initial consistent and high-breakdown-point estimate of $\beta$, $r_i(\hat{\beta}_0) = y_i - X_i^T \hat{\beta}_0$, and $s_n(r(\hat{\beta}_0))$ is the M-estimate of scale of the residuals of $\hat{\beta}_0$.

3. Diagnostic-Lasso Estimator (D-Lasso)

This section proposes a new Lasso estimator based on the regression diagnostic method.

3.1. D-Lasso Estimator Formulation

The general idea behind D-Lasso is first to create a clean, outlier-free subset of the data. Let R represent the collection of observation indices in the outlier-free subset, let $y_i^R$ and $X_i^R$ denote the observations indexed by R, and let $\hat{\beta}_R$ denote the regression coefficients estimated by fitting the model to the set R.
Let $SSE_R = \sum_{i \in R} \left( y_i^R - (X_i^R)^T \beta_R \right)^2$ be the corresponding sum of squared residuals, so that the estimates correspond to the clean samples having the smallest sum of squared residuals; as expected, the breakdown point can then reach 50%. When the number of observation indices in R equals n, $\hat{\beta}_R = \hat{\beta}$. This study suggests using $SSE_R$ in Lasso regression by replacing $\sum_{i=1}^{n} (y_i - X_i^T \beta)^2$ in Equation (1) with $SSE_R$. Thus, the D-Lasso can be expressed as follows:
$$ \hat{\beta}_{D\text{-}Lasso} = \arg\min_{\beta} \; SSE_R + \lambda_{D\text{-}Lasso} \sum_{j=1}^{p} |\beta_j|. $$
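The following R sketch outlines the two-stage recipe implied by the definition above; it is illustrative rather than the authors' implementation. It screens influential observations with Cook's distance and the 4/n cutoff (the paper's own screening, based on GDFFITS, is described in Section 4.2), then runs a cross-validated Lasso on the clean subset only. It also assumes n > p so that the pilot least-squares fit exists.

```r
library(glmnet)

d_lasso_sketch <- function(X, y, nfolds = 5) {
  # Step 1: pilot least-squares fit and single-case diagnostics (assumes n > p)
  pilot <- lm(y ~ X)
  cd    <- cooks.distance(pilot)

  # Step 2: clean index set R = observations not flagged as influential
  R <- which(cd < 4 / length(y))

  # Step 3: Lasso on the clean subset only, i.e. minimizing SSE_R + lambda * sum|beta_j|
  cvfit <- cv.glmnet(X[R, , drop = FALSE], y[R], alpha = 1, nfolds = nfolds)
  list(coef = coef(cvfit, s = "lambda.min"), clean_idx = R)
}
```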

3.2. Breakdown Point

The replacement finite-sample breakdown point is the most commonly used measure of an estimator’s robustness.

3.2.1. Definition

Rousseeuw and Yohai [16,17] introduced the breakdown point of an estimator, which is defined as the minimum fraction of outliers that may take the estimator beyond any bound. In other words, the breakdown point of an estimate β ^ shows the effects of replacing several data values by outliers. The breakdown point for the regression estimator β ^ of the sample Z = ( X , y ) is defined as
$$ \varepsilon^*(\hat{\beta}; Z) = \min\left\{ \frac{m}{n} : \sup_{\tilde{Z}} \| \hat{\beta}(\tilde{Z}) \|_2 = \infty \right\}, $$
where Z ˜ are contaminated data obtained from Z by replacing m of the original n data by outliers.

3.2.2. Breakdown Point of D-Lasso Estimator

The D-Lasso estimator's breakdown point for subsets of size $n_R \leq n$ is given by
$$ \varepsilon^*(\hat{\beta}_R; Z_R) = \frac{n - n_R + 1}{n}. \qquad (8) $$
This study suggests taking $n_R$ equal to a fraction $\alpha$ of the sample size, with $\alpha = 0.75$, so that the final estimate is based on a sufficiently large number of observations. This ensures a high enough statistical efficiency. The resulting breakdown point is around $(1 - \alpha) \cdot 100\% = 25\%$ (see [18]). Notice that the breakdown point does not depend on the dimension p; it is therefore guaranteed even when the number of predictor variables exceeds the sample size. Applying Equation (8) to the classical Lasso, for which $n - n_R = 0$, yields a finite-sample breakdown point of $\varepsilon^*(\hat{\beta}_{Lasso}; Z) = 1/n$; hence, the classical Lasso is very sensitive to the presence of even a single outlier.

4. Regression Diagnostic Measures

4.1. Influential Observations in Regression

In this section, we introduce a different way of finding the clean subset R. A large body of literature is available [10,11,12,13] on the identification of influential observations in linear regression. Cook's distance [19] and the difference in fits (DFFITS) [20] are two of the many influence measures currently in use. The ith Cook's distance is defined by [13] as:
$$ CD_i = \frac{ (\hat{\beta}_{(i)} - \hat{\beta})^T (X^T X) (\hat{\beta}_{(i)} - \hat{\beta}) }{ p \hat{\sigma}^2 }, \quad i = 1, \ldots, n, \qquad (9) $$
where $\hat{\beta}_{(i)}$ is the estimate of $\beta$ with the ith observation deleted. The suggested cutoff point is $4/n$. Equation (9) can be expressed as:
$$ CD_i = \frac{1}{p+1} \, r_{s_i}^2 \, \frac{h_{ii}}{1 - h_{ii}}, $$
where $r_{s_i}$ is the ith standardized Pearson residual, defined as:
$$ r_{s_i} = \frac{ y_i - X_i^T \hat{\beta} }{ \sqrt{ \hat{\sigma}_i^2 (1 - h_{ii}) } }, \quad i = 1, \ldots, n, $$
where $h_{ii}$ is the ith leverage value, which is in fact the ith diagonal element of the hat matrix $H = X (X^T X)^{-1} X^T$, and $\hat{\sigma}_i$ is an appropriate estimate of $\sigma$.
The DFFITS measure was introduced in [20] and is defined as:
$$ DFFITS_i = \frac{ \hat{y}_i - \hat{y}_{i(i)} }{ \hat{\sigma}_{(i)} \sqrt{h_{ii}} }, \quad i = 1, \ldots, n, $$
where $\hat{y}_{i(i)}$ and $\hat{\sigma}_{(i)}$ are, respectively, the ith fitted response and the estimated standard error with the ith observation deleted. The relationship between $CD_i$ and DFFITS is given by
$$ CD_i = \frac{ \hat{\sigma}_{(i)}^2 }{ p \hat{\sigma}^2 } \, DFFITS_i^2. $$
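These identities can be checked numerically with base R's built-in diagnostics; in the sketch below (illustrative only), p is taken as the number of estimated coefficients including the intercept, which is the convention R itself uses.

```r
set.seed(2)
n <- 40
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)

cd  <- cooks.distance(fit)     # CD_i
dff <- dffits(fit)             # DFFITS_i
s   <- summary(fit)$sigma      # sigma_hat from the full fit
s_i <- influence(fit)$sigma    # deletion estimates sigma_hat_(i)
p   <- length(coef(fit))       # number of coefficients, intercept included

# CD_i = (sigma_(i)^2 / (p * sigma^2)) * DFFITS_i^2  (should be ~0 up to rounding)
max(abs(cd - (s_i^2 / (p * s^2)) * dff^2))

# Cases flagged by the 4/n cutoff for Cook's distance
which(cd > 4 / n)
```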
Many other influence measures are available in the literature [11,21].

4.2. Identification of Multiple Influential Observations

The diagnostic tools discussed so far are designed for the identification of a single influential observation and are ineffective when masking and/or swamping occur. Therefore, we need detection techniques that are free from these problems. Ref. [12] introduced a group-deleted version of the residuals and weights in regression. Assume that d observations among a set of n observations are deleted. Let us denote the set of cases ‘remaining’ in the analysis by R and the set of cases ‘deleted’ by D. Therefore, R contains $(n - d)$ cases after d cases are deleted. Without loss of generality, assume that these observations are the last d rows of X, y, and $\sigma$ (the variance–covariance matrix), so that
$$ X = \begin{pmatrix} X^R \\ X^D \end{pmatrix}, \qquad y = \begin{pmatrix} y^R \\ y^D \end{pmatrix}, \qquad \sigma = \begin{pmatrix} \sigma^R & 0 \\ 0 & \sigma^D \end{pmatrix}. $$
The generalized DFFITS (GDFFITS) proposed in [12] is defined for the entire data set as:
$$ GDFFITS_i = \begin{cases} \dfrac{ \hat{y}_i^{(D)} - \hat{y}_i^{(D+i)} }{ \hat{\sigma}_{(D+i)} \sqrt{ h_{ii}^{(D)} } }, & \text{for } i \in R, \\[2ex] \dfrac{ \hat{y}_i^{(D+i)} - \hat{y}_i^{(D)} }{ \hat{\sigma}_{(D)} \sqrt{ h_{ii}^{(D+i)} } }, & \text{for } i \in D, \end{cases} $$
where $\hat{y}_i^{(D)}$ is the fitted response when the set of observations indexed by D is omitted, and
$$ h_{ii}^{(D+i)} = X_i^T \left( X_R^T X_R + X_i X_i^T \right)^{-1} X_i = \frac{ h_{ii}^{(D)} }{ 1 + h_{ii}^{(D)} }, $$
where $h_{ii}^{(D)} = X_i^T (X_R^T X_R)^{-1} X_i$. The author considers observations as influential if $|GDFFITS_i| \geq 3 \sqrt{ p / (n - d) }$.
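A compact group-deletion screen in the spirit of GDFFITS is sketched below. It is a simplified proxy rather than Imon's exact statistic: the initial deletion set D is formed from a single-case diagnostic, the model is refit on R, and every observation is then scored by a scaled prediction residual relative to that clean fit, using the cutoff $3\sqrt{p/(n-d)}$.

```r
gdffits_screen <- function(X, y, c0 = 4) {
  n   <- length(y)
  dat <- data.frame(y = y, X)

  # Initial suspect set D from a single-case diagnostic (Cook's distance, cutoff c0/n)
  fit_all <- lm(y ~ ., data = dat)
  D <- which(cooks.distance(fit_all) > c0 / n)
  R <- setdiff(seq_len(n), D)

  # Refit on the remaining cases R and score every case against this clean fit
  fit_R <- lm(y ~ ., data = dat[R, , drop = FALSE])
  sig_R <- summary(fit_R)$sigma
  Xall  <- model.matrix(fit_all)                       # design rows for all n cases
  XtXi  <- solve(crossprod(Xall[R, , drop = FALSE]))   # (X_R' X_R)^{-1}
  hD    <- rowSums((Xall %*% XtXi) * Xall)             # h_ii^(D) = x_i'(X_R'X_R)^{-1}x_i
  res   <- y - as.numeric(Xall %*% coef(fit_R))
  stat  <- res / (sig_R * sqrt(1 + hD))                # scaled prediction residuals

  p <- ncol(Xall); d <- length(D)
  list(deleted = D, flagged = which(abs(stat) >= 3 * sqrt(p / (n - d))), stat = stat)
}
```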

4.3. Tuning D-LASSO Parameter Estimation

Robust 5-fold cross-validation was used to select the $\lambda_{D\text{-}Lasso}$ penalization parameter from a set of candidate values, with a $\tau$-scale of the residuals serving as the objective function. The $\tau$-scale was introduced by [22] to estimate the magnitude of the residuals in a regression model in a robust and efficient manner.
To find a set of candidate values for $\lambda$, we selected 30 equally spaced points between 0 and $\lambda_{max}$, where $\lambda_{max}$ is approximately the smallest $\lambda$ for which all the coefficients of $\hat{\beta}_{D\text{-}Lasso}$ are zero except the intercept.
To estimate $\lambda_{max}$, we first used bivariate winsorization [23] to robustly estimate the maximal correlation between the response and the covariates. This estimate was used as an initial guess for $\lambda_{max}$, and a binary search was then used to improve it. If $p > n$, then 0 is excluded from the candidate set.
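The tuning step can be sketched as follows. This is an illustrative stand-in, not the authors' code: $\lambda_{max}$ is approximated by the standard $\max_j |X_j^T y| / n$ formula for standardized covariates (the paper instead uses bivariate winsorization and a binary search), and the median absolute prediction error replaces the $\tau$-scale as a simple robust cross-validation objective.

```r
library(glmnet)

robust_cv_lambda <- function(X, y, K = 5, n_lambda = 30) {
  Xs <- scale(X)                 # standardized covariates
  yc <- y - median(y)            # robustly centered response

  # Grid of 30 equally spaced candidates in [0, lambda_max]; drop 0 when p > n
  lambda_max <- max(abs(crossprod(Xs, yc))) / nrow(Xs)
  grid <- seq(0, lambda_max, length.out = n_lambda)
  if (ncol(Xs) > nrow(Xs)) grid <- grid[grid > 0]

  # K-fold cross-validation with a robust error summary (median absolute error)
  folds <- sample(rep(seq_len(K), length.out = nrow(Xs)))
  cv_err <- sapply(grid, function(lam) {
    errs <- unlist(lapply(seq_len(K), function(k) {
      tr  <- folds != k
      fit <- glmnet(Xs[tr, , drop = FALSE], yc[tr], alpha = 1, lambda = lam)
      abs(yc[!tr] - as.numeric(predict(fit, Xs[!tr, , drop = FALSE])))
    }))
    median(errs)
  })
  grid[which.min(cv_err)]        # selected penalization parameter
}
```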

5. Simulation Study

In this section, a simulation study comparing the performance of the proposed D-Lasso estimator with that of several other Lasso estimators is described. Six estimators are included: (i) D-Lasso, (ii) MM-Lasso, (iii) Sparse-LTS, (iv) Huber-Lasso, (v) LAD-Lasso, and (vi) classical Lasso. Three different simulation scenarios are considered:
Simulation 1: In the first simulation, a multiple linear regression model is considered with a sample size of 50 (n = 50) and 25 variables (p = 25), where the covariates are drawn from a joint Gaussian distribution with pairwise correlation ρ = 0.5.
The true regression parameters are set to $\beta = (\underbrace{1, 2, 3, 4, 5}_{5}, \underbrace{0, 0, \ldots, 0}_{20})$. The random errors e are generated from the following contamination model: $F(e) = [(1 - \varepsilon) N(0,1) + \varepsilon H(0,2)] \times \sigma$, where $\varepsilon$ is the contamination ratio, $\sigma$ is the signal-to-noise scale, chosen to be 3, $N(0,1)$ is the standard normal distribution, and $H$ is the Cauchy distribution, used to create a heavy-tailed error distribution. The response variables are then calculated as follows:
$$ y_{(50 \times 1)} = X_{(50 \times 25)} \left( \underbrace{1, 2, 3, 4, 5}_{5}, \underbrace{0, 0, \ldots, 0}_{20} \right)^T + F(e)_{(50 \times 1)}. $$
The percentage of zero coefficients (Z.coef) equals 80%, and the percentage of non-zero coefficients (N.Z.coef) equals 20%.
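A sketch of this data-generating process in R is given below (illustrative only; the exact contamination mechanics of the paper, such as how the replaced observations are chosen, may differ in detail). It draws equicorrelated Gaussian covariates with MASS::mvrnorm and mixes standard normal and Cauchy errors.

```r
library(MASS)   # mvrnorm for correlated Gaussian covariates

set.seed(123)
n <- 50; p <- 25; eps <- 0.10; sigma <- 3          # contamination ratio and scale
Sigma <- matrix(0.5, p, p); diag(Sigma) <- 1       # pairwise correlation 0.5
X     <- mvrnorm(n, mu = rep(0, p), Sigma = Sigma)
beta  <- c(1, 2, 3, 4, 5, rep(0, 20))

# Errors from (1 - eps) * N(0, 1) + eps * Cauchy(0, 2), scaled by sigma
from_cauchy <- rbinom(n, 1, eps) == 1
e <- ifelse(from_cauchy, rcauchy(n, location = 0, scale = 2), rnorm(n)) * sigma
y <- as.numeric(X %*% beta) + e
```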
Simulation 2: The second simulation process is similar to the first, except that the p and n values are different (p = 50, n = 150), and the response variables are calculated as follows:
$$ y_{(150 \times 1)} = X_{(150 \times 50)} \left( \underbrace{1, 0, 0, 0, 5}_{5}, \underbrace{0, 1, 0, 0, 5}_{5}, \underbrace{0, 1, 0, 0, 0}_{5}, \underbrace{0, 0, \ldots, 0}_{35} \right)^T + F(e)_{(150 \times 1)}. $$
The percentage of true zero coefficients (Z.coef) equals 90% and the percentage of true non-zero coefficients (N.Z.coef) equals 10%.
Simulation 3: The third simulation is similar to the second, but n is increased to 500 and $\beta = (\underbrace{1, 0, 0, 0, 5, 0}_{6}, \underbrace{0, 0, 2, 0, 0}_{5}, \underbrace{1, 0, 0, 0, 0, 5}_{6}, \underbrace{0, 0, \ldots, 0}_{33})$. The response variables are then calculated as follows:
$$ y_{(500 \times 1)} = X_{(500 \times 50)} \, \beta^T + F(e)_{(500 \times 1)}. $$
The percentage of true zero coefficients (Z.coef) equals 90% and the percentage of true non-zero coefficients (N.Z.coef) equals 10%.
To assess how well the approaches withstand outliers and leverage points, three data conditions are considered: (a) uncontaminated data; (b) vertical contamination (outliers in the response variables); (c) bad leverage points (outliers in the covariates).
The response variables and covariates are contaminated by certain ratios ( ε = 0.05, 0.10, 0.15, and 0.20) of vertical and high leverage points; these are created by randomly replacing some original observations with large values equal to 15 [24].
The simulations were performed in the statistical software R for all six estimators: D-Lasso, MM-Lasso, Sparse-LTS, Huber-Lasso, LAD-Lasso, and classical Lasso. For D-Lasso, the GDFFITS measure suggested by [12] was used to determine which observations were influential, and $\lambda_{D\text{-}Lasso}$ was chosen as described in Section 4.3. For MM-Lasso, we used the functions available in the GitHub repository https://github.com/esmucler/mmlasso (accessed on 25 October 2017) [7].
The Sparse-LTS estimator was calculated using the sparseLTS() function from the robustHD package in R, and $\lambda_{Sparse\text{-}LTS}$ was chosen using a BIC criterion, as advocated in [6]. Huber-Lasso was computed with the MTE package, and the LAD-Lasso estimator was calculated using the flare package. The Lasso estimator was calculated using the lars() function from the lars package [25], where $\lambda_{Lasso}$ was chosen based on 5-fold cross-validation. $\lambda_{LAD\text{-}Lasso}$, $\lambda_{Huber\text{-}Lasso}$, and $\lambda_{MM\text{-}Lasso}$ were chosen by applying the classical BIC.
Each simulation run comprised 1000 replications. Four criteria are considered to evaluate the performance of the six methods, namely: (1) the percentage of zero coefficients (Z.coef), (2) the percentage of non-zero coefficients (N.Z.coef), (3) the average of the mean squared errors ($\overline{mse}$), and (4) the median of the mean squared errors, Med(mse). A good method for Simulation 1 is one whose percentage of Z.coef is close to 80% and whose percentage of N.Z.coef is close to 20%; a good method for Simulations 2 and 3 is one whose Z.coef and N.Z.coef percentages are reasonably close to 90% and 10%, respectively. In addition, a good method has the smallest $\overline{mse}$ and Med(mse) values.
Several interesting points appear from the results of Table 1, Table 2 and Table 3.
The results clearly show the merit of D-Lasso. It can be observed from Table 1, Table 2 and Table 3 that D-Lasso has the smallest values of $\overline{mse}$ and Med(mse) compared to the other methods.
In the case of no contamination, Table 1 shows that both classical and D-Lasso methods perform well in model selection ability. For example, in the scenario of Simulation 1, the classical Lasso successfully selected 80% of Z.coef and 20% N.Z.coef, followed by D-Lasso, which selected 77.6% and 22.4% for Z.coef and N.Z.coef, respectively.
The performance of the other robust Lasso methods is also good, but they have larger $\overline{mse}$ and Med(mse) values than these two methods. Furthermore, none of the methods suffers from falsely selected variables.
In the case of vertical outliers and leverage points, the classical Lasso is clearly influenced by the outliers, as reflected in its much higher $\overline{mse}$ and Med(mse). Furthermore, it tended to select more variables in the final model (overfitting) when the percentage of contamination increased to 20%.
On the other hand, in the case of vertical outliers, the robust Lasso methods (MM-Lasso, Sparse-LTS, Huber-Lasso, and LAD-Lasso) clearly maintain their excellent behavior, although Sparse-LTS shows a considerable tendency toward false selection when the percentage of contamination increases to 20%. Table 3 shows that the robust Lasso methods (Sparse-LTS, Huber-Lasso, and LAD-Lasso) were affected by the presence of leverage points in the data; the effect was worse with a higher percentage of bad leverage points.
The results of D-Lasso and MM-Lasso are consistent across all percentages of contamination, but MM-Lasso has a larger $\overline{mse}$ than D-Lasso, which indicates that D-Lasso is more efficient than the other methods. For further illustration, the next section analyzes two real data sets.

6. Application to Real Data

6.1. Ozone Data

To assess the performance of D-Lasso in comparison with other Lasso methods, we analyzed the Los Angeles ozone pollution data, as originally studied by [26], which is available in the R package ‘cosso’. The Ozone dataset comprises 330 observations, each representing daily measurements of nine meteorological variables. The ozone reading serves as the predicted variable, while the remaining eight covariates are temperature (temp), inversion base height (invHt), pressure (press), visibility (vis), millibar pressure height (milPress), humidity (hum), inversion base temperature (invTemp), and wind speed (wind).
Figure 1 displays the correlation matrix of the meteorological variables, revealing strong correlations between the following pairs: (temp, invTemp), (invHt, invTemp), (milPress, invTemp), and (milPress, temp). Additionally, the Variance Inflation Factor, $VIF_j = \frac{1}{1 - R_j^2}$, quantifies the increase in variance due to correlations among explanatory variables, where $R_j^2$ is the unadjusted coefficient of determination obtained by regressing the jth independent variable on the remaining covariates [27]. A commonly used default VIF cutoff value is 5; only variables with a VIF below 5 are included in the model. If one or more variables in a regression exhibit high VIF values, this indicates collinearity.
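The VIF computation can be reproduced directly from its definition, as in the sketch below (the variable names and synthetic data are placeholders, not the Ozone columns): each covariate is regressed on the others and $VIF_j = 1/(1 - R_j^2)$ is returned.

```r
vif_manual <- function(X) {
  X <- as.data.frame(X)
  sapply(names(X), function(v) {
    # Regress covariate v on all remaining covariates and apply 1 / (1 - R^2)
    r2 <- summary(lm(reformulate(setdiff(names(X), v), response = v), data = X))$r.squared
    1 / (1 - r2)
  })
}

# Example with synthetic covariates: a and b share a common component, so their VIFs inflate
set.seed(7)
z  <- rnorm(100)
df <- data.frame(a = z + rnorm(100, sd = 0.2),
                 b = z + rnorm(100, sd = 0.2),
                 c = rnorm(100))
round(vif_manual(df), 3)
```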
The VIF values of the predictors for the Ozone data are provided in Table 4. It is evident that invTemp has the highest VIF value, followed by temp.
To identify potential outliers in the Ozone data, we used the “outlier” function in R, employed boxplots of the variables, and evaluated Cook's distance in a multiple regression model. The Ozone dataset reveals the presence of one vertical outlier at row 38 and three leverage points, as illustrated in Figure 2.
We utilized the GDFFITS measure to detect outliers for the proposed D-Lasso estimator, while applying other Lasso methods for comparative purposes. We also calculated the Root Mean Squared Error (RMSE) and R-squared values for these methods.
Table 5 presents the results of the various Lasso methods. Notably, classical Lasso and LAD-Lasso selected six non-zero coefficients, albeit with higher RMSE values. In contrast, Huber-Lasso, Sparse-LTS, and MM-Lasso selected four non-zero coefficients, and D-Lasso selected three non-zero coefficients (intercept, temp, and hum). The RMSE value of D-Lasso is smaller than those of Huber-Lasso, Sparse-LTS, and MM-Lasso, signifying the greater reliability of D-Lasso for this dataset.
Furthermore, Table 5 demonstrates that across different Lasso criteria, the R-squared values of D-Lasso, Sparse-LTS, and MM-Lasso estimators are notably more acceptable than the R-squared values of other Lasso estimators.

6.2. Prostate Cancer Data

The Prostate cancer dataset encompasses 97 observations from male patients aged between 41 and 79 years. This dataset was originally sourced from a study conducted by [28] and is accessible through the R package ’genridge’. The response variable is the log(prostate specific antigen), denoted as lpsa, while the explanatory variables include log(cancer volume) (lcavol), log(prostate weight) (lweight), age, log(benign prostatic hyperplasia amount) (lbph), seminal vesicle invasion (svi), log(capsular penetration) (lcp), Gleason score (gleason), and percentage of Gleason scores 4 or 5 (pgg45).
Figure 3 presents the correlation matrix of these variables, highlighting significant correlations between the following pairs: (pgg45, gleason) and (lcp, lcavol). Furthermore, the Variance Inflation Factor (VIF) values for the predictors in the Prostate cancer data are provided in Table 6, with pgg45 exhibiting the highest VIF value, followed by lcp.
To identify potential outliers in the Prostate cancer data, we used the “outlier” function in R, employed boxplots of the variables, and assessed Cook's distance in a multiple regression model. As illustrated in Figure 4, the Prostate cancer dataset contains two vertical outliers and five leverage points.
For the identification of outliers in the proposed D-Lasso estimator, we utilized the GDFFITS measure, while other Lasso methods were applied for comparative analysis. We also calculated the Root Mean Squared Error (RMSE) and R-squared values for these methods.
Table 7 presents the results of the different Lasso methods. Classical Lasso, Huber-Lasso, and MM-Lasso each select six non-zero coefficients, albeit with higher RMSE values. In contrast, LAD-Lasso selects seven non-zero coefficients with an RMSE of 3.1248, while Sparse-LTS selects five non-zero coefficients with an RMSE of 2.4334.
D-Lasso, on the other hand, sets only two coefficients to zero (lcp and pgg45), and its RMSE value is smaller than that of the other methods. Consequently, D-Lasso is considered more reliable for this dataset.

7. Conclusions

The classical Lasso technique is often utilized for creating regression models, but it can be influenced by the presence of vertical and high leverage points, leading to potentially misleading results. A robust version of the Lasso estimator is commonly derived by replacing the ordinary squared residuals ( L S ) function with a robust alternative.
This article aims to introduce robust Lasso methods that utilize regression diagnostic tools to detect suspected outliers and high leverage points. Subsequently, the D-Lasso is computed following diagnostic checks.
To assess the effectiveness of our newly proposed approaches, we conducted comparisons with the classical Lasso and existing robust Lasso methods based on LAD, Huber, Sparse-LTS, and MM estimators using both simulations and real datasets.
In this article, D-Lasso regression serves as the primary variable selection technique. Future endeavors may delve into exploring the asymptotic theoretical aspects and establishing the oracle properties of D-Lasso.

Author Contributions

Conceptualization, S.S.A. and A.H.A.; methodology, S.S.A. and A.H.A.; software, S.S.A.; validation, S.S.A. and A.H.A.; formal analysis, S.S.A.; writing—review and editing, S.S.A. and A.H.A.; visualization, S.S.A. and A.H.A.; supervision, S.S.A.; project administration, S.S.A. and A.H.A.; funding acquisition, S.S.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Deputyship for Research & Innovation, Ministry of Education in Saudi Arabia project number ISP-2024.

Data Availability Statement

The two datasets used in this study are available in R packages: the Ozone data in ‘cosso’ and the Prostate cancer data in ‘genridge’.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B (Methodol.) 1996, 58, 267–288. [Google Scholar] [CrossRef]
  2. Fan, J.; Li, R. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 2001, 96, 1348–1360. [Google Scholar] [CrossRef]
  3. Zou, H. The adaptive lasso and its oracle properties. J. Am. Stat. Assoc. 2006, 101, 1418–1429. [Google Scholar] [CrossRef]
  4. Wang, H.; Li, G.; Jiang, G. Robust regression shrinkage and consistent variable selection through the LAD-Lasso. J. Bus. Econ. Stat. 2007, 25, 347–355. [Google Scholar] [CrossRef]
  5. Lambert-Lacroix, S.; Zwald, L. Robust regression through the Huber’s criterion and adaptive lasso penalty. Electron. J. Stat. 2011, 5, 1015–1053. [Google Scholar] [CrossRef]
  6. Alfons, A.; Croux, C.; Gelper, S. Sparse least trimmed squares regression for analyzing high-dimensional large data sets. Ann. Appl. Stat. 2013, 7, 226–248. [Google Scholar] [CrossRef]
  7. Smucler, E.; Yohai, V.J. Robust and sparse estimators for linear regression models. Comput. Stat. Data Anal. 2017, 111, 116–130. [Google Scholar] [CrossRef]
  8. Simpson, L.; Combettes, P.L.; Müller, C.L. C-lasso—A Python package for constrained sparse and robust regression and classification. J. Open Source Softw. 2020, 6, 2844. [Google Scholar] [CrossRef]
  9. Chang, L.; Welsh, A.H. Robust Multivariate Lasso Regression with Covariance Estimation. J. Comput. Graph. Stat. 2022, 32, 961–973. [Google Scholar] [CrossRef]
  10. Atkinson, A.C.; Riani, M.; Riani, M. Robust Diagnostic Regression Analysis, 2nd ed.; Springer: New York, NY, USA, 2000. [Google Scholar]
  11. Chatterjee, S.; Hadi, A.S. Regression Analysis by Example; John Wiley & Sons. Inc.: Hoboken, NJ, USA, 2006. [Google Scholar]
  12. Rahmatullah Imon, A. Identifying multiple influential observations in linear regression. J. Appl. Stat. 2005, 32, 929–946. [Google Scholar] [CrossRef]
  13. Ryan, T.P. Modern Regression Methods; John Wiley & Sons: Hoboken, NJ, USA, 2008; p. 655. [Google Scholar]
  14. Alshqaq, S.S.A. Robust Variable Selection in Linear Regression Models. Doctoral Dissertation, Institut Sains Matematik, Fakulti Sains, Universiti Malaya, Kuala Lumpur, Malaysia, 2015. [Google Scholar]
  15. Huber, P.J. Robust Statistics; Wiley: New York, NY, USA, 1981. [Google Scholar]
  16. Maronna, R.A.; Martin, R.D.; Yohai, V.J. Robust Statistics: Theory and Methods; Wiley Series in Probability and Statistics; John Wiley & Sons: Hoboken, NJ, USA, 2006. [Google Scholar]
  17. Rousseeuw, P.; Yohai, V. Robust regression by means of s-estimators. In Robust and Nonlinear Time Series Analysis; Springer: Berlin/Heidelberg, Germany, 1984; pp. 256–272. [Google Scholar]
  18. Saleh, S.; Abuzaid, A.H. Alternative Robust Variable Selection Procedures in Multiple Regression. Stat. Inf. Comput. 2019, 7, 816–825. [Google Scholar] [CrossRef]
  19. Cook, R.D. Detection of influential observation in linear regression. Technometrics 1977, 19, 15–18. [Google Scholar]
  20. Belsley, D.A.; Kuh, E.; Welsch, R.E. Regression Diagnostics: Identifying Influential Data and Sources of Collinearity; John Wiley & Sons: Hoboken, NJ, USA, 2005. [Google Scholar]
  21. Hadi, A.S. A new measure of overall potential influence in linear regression. Comput. Stat. Data Anal. 1992, 14, 1–27. [Google Scholar] [CrossRef]
  22. Yohai, V.J.; Zamar, R.H. High breakdown-point estimates of regression by means of the minimization of an efficient scale. J. Am. Stat. Assoc. 1988, 83, 406–413. [Google Scholar] [CrossRef]
  23. Khan, J.A.; Van Aelst, S.; Zamar, R.H. Robust linear model selection based on least angle regression. J. Am. Stat. Assoc. 2007, 102, 1289–1299. [Google Scholar] [CrossRef]
  24. Uraibi, H.; Midi, H. Robust variable selection method based on huberized LARS-Lasso regression. Econ. Comput. Econ. Cybern. Stud. Res. 2020, 54, 145–160. [Google Scholar]
  25. Hastie, T.; Efron, B. Lars: Least Angle Regression, Lasso and Forward Stagewise; R Package; Version 1.2. Available online: https://CRAN.R-project.org/package=lars (accessed on 15 November 2018).
  26. Breiman, L.; Friedman, J. Estimating Optimal Transformations for Multiple Regression and Correlation. J. Am. Stat. Assoc. 1985, 80, 580–598. [Google Scholar] [CrossRef]
  27. Fox, J. Regression Diagnostics: An Introduction; Sage Publications: New York, NY, USA, 2019. [Google Scholar]
  28. Stamey, T.A.; Kabalin, J.N.; McNeal, J.E.; Johnstone, I.M.; Freiha, F.; Redwine, E.A.; Yang, N. Prostate specific antigen in the diagnosis and treatment of adenocarcinoma of the prostate. II. Radical prostatectomy treated patients. J. Urol. 1989, 141, 1076–1083. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Correlation matrix of variables in the Ozone data set. The colors reflect the sign of the correlation: red indicates negative correlation and blue indicates positive correlation.
Figure 2. Boxplot of predictors (left) and Cook's distance (observations above the red line have too much influence) (right) for the Ozone data set.
Figure 3. Correlation matrix of variables in the Prostate cancer data set. The colors reflect the sign of the correlation: red indicates negative correlation and blue indicates positive correlation.
Figure 4. Boxplot of predictors (left) and Cook's distance (observations above the red line have too much influence; red numbers show the row numbers of influential observations in the data) (right) for the Prostate cancer data set.
Table 1. The results of three scenarios of simulation study for uncontaminated data.
Lasso EstimatorsZ.cofN.Z.cof m s e ¯ Med(mse)
Simulation 1D-Lasso77.6%22.4%45.177
MM-Lasso80.6%19.4%79.4102.6
Sparse-LTS78.6%21.4%74128.2
Huber-LTS82.1%17.9%83.6119.0
LAD-Lasso76.2%23.8%88.281.3
classical Lasso80%20%26.922.1
Simulation 2D-Lasso90.6%9.4%25.117
MM-Lasso85.6%14.4%69.4102.3
Sparse-LTS85.6%14.4%74128.2
Huber-LTS90.1%9.9%91.661.0
LAD-Lasso92.2%7.8%98.991.8
classical Lasso89.9%10.1%16.912.1
Simulation 3D-Lasso91.9%8.1%25.117
MM-Lasso88.9%11.1%56122.3
Sparse-LTS87.8%12.2%74138.2
Huber-LTS90%10%96106.0
LAD-Lasso92.1%7.9%5998.8
classical Lasso90.2%9.8%6.62.1
Table 2. The results of three scenarios of simulation study for data with ε of vertical contamination.
Lasso EstimatorsZ.cofN.Z.cof m s e ¯ Med(mse)
ε = 0.05 Simulation 1D-Lasso81%19%3.83.1
MM-Lasso80.4%19.6%97226
Sparse-LTS78%22%47122
Huber-LTS82%18%68111
LAD-Lasso76%24%8485
classical Lasso44%56%87.3308.4
Simulation 2D-Lasso89%11%10.42.06
MM-Lasso85%15%96123
Sparse-LTS85%15%76121
Huber-LTS90%10%2018
LAD-Lasso92%8%9998
classical Lasso78%22%143.9348.2
Simulation 3D-Lasso88.8%12.2%15.54.7
MM-Lasso88%12%65133
Sparse-LTS87%13%77121
Huber-LTS90%10%98102
LAD-Lasso92%8%98102
classical Lasso75%25%168.1374.2
ε = 0.10 Simulation 1D-Lasso85%15%6.65.5
MM-Lasso80.6%19.4%79.5106
Sparse-LTS78.6%21.4%78123
Huber-LTS82.2%17.8%86196
LAD-Lasso76.2%23.8%8884
classical Lasso41%59%167.5321.3
Simulation 2D-Lasso90%10%15.48.8
MM-Lasso85.6%14.4%69101
Sparse-LTS85.6%14.4%78123
Huber-LTS90.1%9.9%1665
LAD-Lasso92.2%7.8%98.991.8
classical Lasso68%32%184.9384.2
Simulation 3D-Lasso90%10%40.910.9
MM-Lasso99.9%11.1%67130
Sparse-LTS78.8%12.2%77121
Huber-LTS90%10%9797
LAD-Lasso92.1%7.9%9387
classical Lasso65%35%105482.4
ε = 0.15 Simulation 1D-Lasso83%17%6.76.0
MM-Lasso76%24%8883
Sparse-LTS82%18%83118
Huber-LTS78%22%76113
LAD-Lasso80%20%74101
classical Lasso38%62%349.6323.1
Simulation 2D-Lasso90%10%14.67.1
MM-Lasso92%8%9198
Sparse-LTS90%10%1914
Huber-LTS85%15%78128
LAD-Lasso85%15%96103
classical Lasso60%40%287.2544.1
Simulation 3D-Lasso90%10%34.814.7
MM-Lasso92%8%9898
Sparse-LTS90%10%93121
Huber-LTS87%13%78134
LAD-Lasso88%12%69131
classical Lasso57%43%374.6448.0
ε = 0.20 Simulation 1D-Lasso81%19%7.76.8
MM-Lasso80%20%99129
Sparse-LTS78.6%21.4%94103
Huber-LTS82.1%17.9%8698
LAD-Lasso76.2%23.8%8998
classical Lasso36%65%246.9562.6
Simulation 2D-Lasso90%10%48.517.7
MM-Lasso85.6%14.4%69%132
Sparse-LTS85.6%14.4%74114
Huber-LTS90.1%9.9%1923
LAD-Lasso92.2%7.8%9198
classical Lasso58%42%316.8471.5
Simulation 3D-Lasso90%10%25.010.8
MM-Lasso88.9%11.1%98131
Sparse-LTS87.8%12.2%8989
Huber-LTS90%10%99109
LAD-Lasso92.1%7.9%9899
classical Lasso55%45%523.1380.2
Table 3. The results of three scenarios of simulation study for data with ε of leverage points contamination.
Lasso EstimatorsZ.cofN.Z.cof m s e ¯ Med(mse)
ε = 0.05 Simulation 1D-Lasso80%20%4.94.2
MM-Lasso81%19%6.26.4
Sparse-LTS44%56%178318
Huber-LTS40%60%176318
LAD-Lasso16%84%138308
classical Lasso45%55%88.4219.5
Simulation 2D-Lasso89%11%11.53.2
MM-Lasso88%12%12.75.4
Sparse-LTS78%22%143318
Huber-LTS75%25%114287
LAD-Lasso0%100%104262
classical Lasso79%21%154.1359.3
Simulation 3D-Lasso90%10%16.65.8
MM-Lasso89%11%18.87.9
Sparse-LTS78%22%168374
Huber-LTS77%23%155490
LAD-Lasso0%100%155474
classical Lasso76%24%179.2385.3
ε = 0.10 Simulation 1D-Lasso81%187.76.5
MM-Lasso80%20%9.18.7
Sparse-LTS44%56%457332
Huber-LTS43%57%309213
LAD-Lasso15%85%168554
classical Lasso40%58%178.6132.4
Simulation 2D-Lasso90%10%16.59.9
MM-Lasso89%11%18.711.1
Sparse-LTS78%22%184.9384
Huber-LTS77%23%545461
LAD-Lasso0%100%554888
classical Lasso69%31%295.1495.3
Simulation 3D-Lasso90%10%41.111.1
MM-Lasso89%11%62.322.3
Sparse-LTS78%22%310428
Huber-LTS78%22%247411
LAD-Lasso0%100%409392
classical Lasso66%34%216395.5
ε = 0.15 Simulation 1D-Lasso80%20%7.87.0
MM-Lasso89%11%9.19.2
Sparse-LTS45%55%334321
Huber-LTS48%52%281322
LAD-Lasso17%83%672602
classical Lasso39%61%351.7234.2
Simulation 2D-Lasso90%10%15.68.2
MM-Lasso90%10%17.810.4
Sparse-LTS78%22%287144
Huber-LTS79%21%178772
LAD-Lasso0%100%156716
classical Lasso61%39%298.3155.2
Simulation 3D-Lasso90%10%45.915.8
MM-Lasso88%12%67.227.1
Sparse-LTS78%22%372281
Huber-LTS76%24%365482
LAD-Lasso0%100%348274
classical Lasso58%42%385.7159.1
ε = 0.20 Simulation 1D-Lasso80%20%8.87.9
MM-Lasso79%21%10.19.1
Sparse-LTS44%56%246262
Huber-LTS52%48%225251
LAD-Lasso19%81%775687
classical Lasso37%64%257.1273.7
Simulation 2D-Lasso90%10%59.428.8
MM-Lasso89%11%71.641.1
Sparse-LTS78%22%516517
Huber-LTS81%19%205207
LAD-Lasso1%99%485377
classical Lasso59%41%327.9182.6
Simulation 3D-Lasso90%10%36.011.9
MM-Lasso89%11%58.222.3
Sparse-LTS41%59%311280
Huber-LTS55%45%299249
LAD-Lasso0%100%250208
classical Lasso54%44%534.2191.3
Table 4. The VIF for the covariates of the multiple regression model for the Ozone data set.
Variable | Temp | invHt | Press | Vis | milPress | Hum | invTemp | Wind
VIF | 8.626 | 4.319 | 2.658 | 1.459 | 5.372 | 2.271 | 18.037 | 1.283
Table 5. The six Lasso estimator methods for the Ozone data set.
Variables | Classical-Lasso | LAD-Lasso | Huber-Lasso | Sparse-LTS | MM-Lasso | D-Lasso
intercept | −1.6758 | −1.8422 | −0.0124 | 1.8786 | 3.3786 | −0.9643
temp | 17.9711 | 18.5883 | 19.8520 | 16.8129 | 15.2957 | 18.3841
invHt | −2.9138 | −3.2466 | −4.3513 | −2.8686 | −3.4073 | 0
press | 0 | 0 | 0 | 0 | 0 | 0
vis | −1.5399 | 0 | −1.4379 | 0 | 0 | 0
milPress | 0 | 0 | 0 | 0 | 0 | 0
hum | 5.4193 | 5.8341 | 5.0834 | 2.5183 | 1.7916 | 3.0894
invTemp | 5.5373 | 5.2498 | 0 | 0 | 0 | 0
wind | 0 | 0 | 0 | 0 | 0 | 0
R² | 0.5623 | 0.6198 | 0.7070 | 0.7594 | 0.7988 | 0.9220
RMSE | 5.9123 | 5.1248 | 4.5277 | 4.4334 | 4.4578 | 3.7297
Z.coef | 3 | 3 | 5 | 5 | 5 | 6
NZ.coef | 6 | 6 | 4 | 4 | 4 | 3
Table 6. The VIF for the covariates of the multiple regression model for the Prostate cancer data set.
Variable | lcavol | lweight | age | lbph | svi | lcp | gleason | pgg45
VIF | 2.102 | 1.453 | 1.336 | 1.385 | 1.955 | 5.097 | 2.468 | 5.974
Table 7. The six Lasso estimator methods for the Prostate cancer data set.
Variables | Classical-Lasso | LAD-Lasso | Huber-Lasso | Sparse-LTS | MM-Lasso | D-Lasso
intercept | 0.4339 | 0 | 0.3280 | 0.0677 | 2.4472 | 0.5809
lcavol | 0.5113 | 0.5482 | 0.5131 | 0.5628 | 0.5568 | 0.3418
lweight | 0.3292 | 0.4772 | 0.3531 | 0.4161 | 0 | 0.3388
age | 0 | −0.0228 | 0 | 0 | −0.0138 | −0.1239
lbph | 0.0421 | 0.1604 | 0.0525 | 0.0139 | 0.1303 | 0.0767
svi | 0.5436 | 0.7777 | 0.5571 | 0.5993 | 0.6551 | 0.6361
lcp | 0 | −0.0938 | 0 | 0 | 0 | 0
gleason | 0 | 0.1826 | 0 | 0 | 0 | 0.0020
pgg45 | 0.0012 | 0 | 0.0016 | 0 | 0.0014 | 0
R² | 0.5623 | 0.3619 | 0.7070 | 0.7594 | 0.5921 | 0.8932
RMSE | 6.912 | 3.1248 | 4.5277 | 2.4334 | 4.4578 | 1.7297
Z.coef | 3 | 2 | 3 | 4 | 3 | 2
NZ.coef | 6 | 7 | 6 | 5 | 6 | 7
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
