Kibria–Lukman-Type Estimator for Regularization and Variable Selection with Application to Cancer Data

Abstract: Following the ideas behind the elastic-net and Liu-LASSO estimators, we propose a new penalized estimator based on the Kibria–Lukman estimator with an L1-norm penalty to perform both regularization and variable selection. We define a coordinate descent algorithm for the new estimator and compare its performance with that of existing machine learning techniques, namely the least absolute shrinkage and selection operator (LASSO), the elastic-net, Liu-LASSO, the GO estimator and the ridge estimator, through simulation studies and real-life applications, in terms of test mean squared error (TMSE), coefficient mean squared error (β-MSE), false-positive (FP) coefficients and false-negative (FN) coefficients. Our results reveal that the new penalized estimator performs well for both the simulated low- and high-dimensional data. The two real-life applications also show that the new method predicts the target variable better than the existing ones under the test RMSE metric.


Introduction
The linear regression model describes the response variable as a linear combination of one or more predictors, and it is widely applied in fields such as medicine, biological science and economics. The model is defined as follows:

y = Xβ + ε, (1)

where y = (y₁, . . . , yₙ)ᵀ is an n × 1 vector of the dependent (response) variable; X = (x₁, . . . , xₙ)ᵀ, xᵢ ∈ Rᵖ, i = 1, . . . , n, is the known n × p matrix of predictors; β = (β₁, . . . , βₚ)ᵀ is a p × 1 vector of regression coefficients; and ε = (ε₁, . . . , εₙ)ᵀ is the random error vector such that E(εᵢ) = 0 and Var(εᵢ) = σ² ∈ R⁺, i = 1, . . . , n.
The model parameters are usually estimated using the least squares (LS) estimator, which minimizes the residual sum of squares (RSS) of model (1) and yields the following expression:

β̂_LS = argmin_β ‖y − Xβ‖²₂ = (XᵀX)⁻¹Xᵀy, (2)

where ‖·‖₂ denotes the Euclidean norm. High-dimensional quantitative structure–activity relationship (QSAR) modeling is a popular application of the linear regression model. A high-dimensional problem arises when a dataset has more features (variables) than observations (n). The popular ordinary least squares estimator (OLSE) does not give a unique solution in high-dimensional QSAR modeling, since the number of molecular descriptors exceeds the number of compounds [1,2]. Likewise, the performance of the OLSE suffers when there is linear dependency among the features (predictors), a condition called multicollinearity [3–6]. The consequences include high standard errors, insignificant t-tests, wrong conclusions, etc. [6–8].
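As a sketch of the closed-form LS solution, the following compares (XᵀX)⁻¹Xᵀy with a library least squares solver on hypothetical toy data (all data values here are illustrative):

```python
import numpy as np

# Hypothetical toy data; values are purely illustrative.
rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))
beta_true = np.array([3.0, 1.5, 0.0])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

beta_ls = np.linalg.solve(X.T @ X, X.T @ y)       # closed-form LS solution
beta_chk = np.linalg.lstsq(X, y, rcond=None)[0]   # library solver, same answer
```

With n > p and a well-conditioned design, the two solutions agree to numerical precision and recover the true coefficients up to noise.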
Shrinkage estimators were mainly adopted to account for linear dependency among the predictors. They include the ridge estimator [9], the Liu estimator [10], the Liu-type estimator [11], the two-parameter estimator [12], the modified ridge-type estimator [6], the modified Liu estimator [13], the Kibria–Lukman (KL) estimator [14], the modified KL estimator [15], and the Dawoud–Kibria (DK) estimator [16]. These shrinkage estimators handle multicollinearity better than the OLSE. The ridge estimator minimizes the RSS subject to an L2-norm penalty on the coefficients, and it is defined as follows:

β̂_ridge = argmin_β ‖y − Xβ‖²₂ + λ‖β‖²₂, (3)

where λ > 0 is the penalty parameter. The estimator produces more stable estimates and better predictions than the LS estimator. However, in high-dimensional settings the ridge estimator only performs regularization and not variable selection: it fails to shrink any of its coefficients exactly to zero. Hence, ridge regression models become difficult to interpret in high-dimensional settings.
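A minimal numerical illustration of the ridge behavior described above, on hypothetical data: the penalized solution has a strictly smaller norm than the LS solution, yet no coefficient is exactly zero.

```python
import numpy as np

# Hypothetical data; values are purely illustrative.
rng = np.random.default_rng(1)
n, p = 40, 5
X = rng.normal(size=(n, p))
y = X @ np.ones(p) + rng.normal(size=n)

lam = 10.0
beta_ls = np.linalg.solve(X.T @ X, X.T @ y)
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Ridge shrinks the coefficient vector toward zero but not exactly to zero.
assert np.linalg.norm(beta_ridge) < np.linalg.norm(beta_ls)
assert np.all(beta_ridge != 0)
```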
Variable selection forms a crucial aspect of modeling high-dimensional problems. It is worth stating that removing important features (predictors) produces more biased regression estimates and predictions. Likewise, including irrelevant predictors degrades estimation efficiency and leads to imprecise predictions [17]. Variable selection therefore seeks to select the most relevant variables in a large dataset and remove irrelevant and redundant ones. Moreover, it improves predictive performance and reduces computational cost. The problem of variable selection has been studied extensively in the literature [17–22].
The least absolute shrinkage and selection operator (LASSO) [18] is a well-known variable selection technique obtained by subjecting the RSS to an L1-norm penalty, and it is defined as follows:

β̂_LASSO = argmin_β ‖y − Xβ‖²₂ + λ‖β‖₁, (4)

where λ > 0 is the penalty parameter. LASSO retains the regularization property of the ridge and performs variable selection owing to the L1-norm penalty. It also improves model prediction by eliminating irrelevant variables. However, the estimator often chooses only one variable from a group of pairwise correlated predictors, and it is dominated by the ridge estimator when the multicollinearity level is high [1,8,18,22].
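The selection behavior of the L1 penalty is driven by the soft-thresholding operator S(z, λ) = sgn(z)(|z| − λ)₊: under an orthonormal design, the LASSO solution is simply the soft-thresholded LS estimate, so small coefficients are set exactly to zero. A small sketch with illustrative values:

```python
import numpy as np

def soft_threshold(z, lam):
    """S(z, lam) = sgn(z) * max(|z| - lam, 0): shrinks z toward zero and
    sets it exactly to zero when |z| <= lam."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

# Under an orthonormal design, the LASSO estimate is the soft-thresholded
# LS estimate: the two small coefficients below are zeroed out.
beta_ls = np.array([3.0, 1.5, 0.2, -0.1])
beta_lasso = soft_threshold(beta_ls, 0.5)
```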
The elastic-net (E-net) approach [1] is another popular variable selection technique, obtained by subjecting the RSS to the L1- and L2-norm penalties simultaneously. It retains the features of both the ridge and LASSO estimators and is defined as follows:

β̂_Enet = argmin_β ‖y − Xβ‖²₂ + λ₂‖β‖²₂ + λ₁‖β‖₁, (5)

where λ₁, λ₂ > 0, ‖·‖₂ is the L2-norm, and ‖·‖₁ is the L1-norm. A drawback of this method is that the L2-penalty term introduces extra bias that increases its variance [23].
Recently, the GO estimator [23] was developed as an improvement on the elastic-net model. It combines the features of the two-parameter estimator [24] and the LASSO and is defined as follows:

β̂_GO = argmin_β ‖y − Xβ‖²₂ + λ₂‖β − dβ̂‖²₂ + λ₁‖β‖₁, (6)

where λ₁, λ₂, d > 0 and β̂ is an initial estimate. The Liu-LASSO [21] is another E-net-type estimator, defined as follows:

β̂_LL = argmin_β ‖y − Xβ‖²₂ + ‖β − dβ̂‖²₂ + λ₁‖β‖₁, (7)

where λ₁, d > 0. Both the GO estimator and the Liu-LASSO build their quadratic penalty on shrinkage estimators other than the ridge (the two-parameter and Liu estimators, respectively). Research has shown that these shrinkage estimators compete well with the ridge estimator.
Recently, the Kibria–Lukman (KL) estimator [14] has shown a competitive advantage over the ridge and Liu estimators. Thus, the KL estimator is adopted here, in a manner similar to the Liu-LASSO, to form a new variable selection technique. The goal is to produce a new estimator in the class of those mentioned above and assess its ability to deal with multicollinearity and high-dimensional problems. The outline of the article is as follows: We define the new estimator in Section 2 and derive its properties. We analyze low-dimensional data and high-dimensional QSAR data to compare the proposed estimator with the existing ones, as shown in Section 4. The simulation study and conclusion are given in Sections 3 and 5, respectively.

Proposed Estimator (KL1)
The Kibria–Lukman estimator solves the following optimization problem:

β̂_KL = argmin_β ‖y − Xβ‖²₂ + λ₂‖β + β̂_LS‖²₂ = (XᵀX + λ₂I)⁻¹(XᵀX − λ₂I)β̂_LS, (8)

where λ₂ > 0. The KL estimator possesses a smaller bias than the ridge and Liu estimators. Therefore, in line with the Liu-LASSO estimator, we propose the KL1 estimator, defined as follows:

β̂_KL1 = argmin_β ‖y − Xβ‖²₂ + λ₂‖β + β̂_LS‖²₂ + λ₁‖β‖₁, (9)

where λ₁, λ₂ > 0. The estimator retains the regularization property and, due to the L1-norm penalty, automatically performs variable selection.
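As a sanity check on hypothetical data, the sketch below verifies numerically that the closed form of the KL estimator from [14], (XᵀX + λ₂I)⁻¹(XᵀX − λ₂I)β̂_LS, coincides with the stationary point of the penalized objective RSS + λ₂‖β + β̂_LS‖²₂ (this correspondence follows from XᵀXβ̂_LS = Xᵀy; variable names are illustrative):

```python
import numpy as np

# Hypothetical data; values are purely illustrative.
rng = np.random.default_rng(2)
n, p = 60, 4
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

lam2 = 2.0
G = X.T @ X
beta_ls = np.linalg.solve(G, X.T @ y)

# Closed form of the KL estimator.
beta_kl = np.linalg.solve(G + lam2 * np.eye(p), (G - lam2 * np.eye(p)) @ beta_ls)

# Stationary point of RSS + lam2 * ||beta + beta_ls||^2 gives the same vector.
beta_pen = np.linalg.solve(G + lam2 * np.eye(p), X.T @ y - lam2 * beta_ls)
assert np.allclose(beta_kl, beta_pen)
```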

Coordinate Descent for KL1
We assume that the predictor variables are standardized (so that ∑ᵢ x²_ij = 1) and the response variable is centered. We seek the β̂_j that minimizes the following objective function:

f(β_j) = (1/2) ∑ᵢ (r_i^{(j)} − x_ij β_j)² + (λ₂/2)(β_j + β̂_j)² + λ₁|β_j|, (10)

where r_i^{(j)} = y_i − ∑_{k≠j} x_ik β_k and β̂_j is the j-th component of the anchor estimate β̂ in (9). The objective function (10) is differentiable everywhere except at β_j = 0. The first derivative of f(β_j) for β_j ≠ 0 is given as follows:

∂f/∂β_j = −∑ᵢ x_ij (r_i^{(j)} − x_ij β_j) + λ₂(β_j + β̂_j) + λ₁ sgn(β_j), (11)

where sgn(β_j) = β_j/|β_j|. The soft-threshold estimate of β_j is obtained by setting Equation (11) to zero. Hence,

β̂_j = S(∑ᵢ x_ij r_i^{(j)} − λ₂β̂_j, λ₁) / (1 + λ₂), (12)

where r_i^{(j)} is the partial residual for fitting β_j, and S(z, λ₁) is the soft-thresholding operator defined by [22] as follows:

S(z, λ₁) = sgn(z)(|z| − λ₁)₊ = { z − λ₁ if z > 0 and λ₁ < |z|; z + λ₁ if z < 0 and λ₁ < |z|; 0 if λ₁ ≥ |z| }. (13)
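A minimal coordinate descent sketch for the KL1 objective, assuming unit-norm standardized predictor columns and an anchor estimate taken to be the LS fit; function and parameter names are illustrative, not the authors' code:

```python
import numpy as np

def soft_threshold(z, lam):
    """S(z, lam) = sgn(z) * max(|z| - lam, 0)."""
    return np.sign(z) * max(abs(z) - lam, 0.0)

def kl1_coordinate_descent(X, y, lam1, lam2, anchor, max_iter=200, tol=1e-8):
    """Cyclic coordinate descent for
       1/2 ||y - X b||^2 + (lam2/2) ||b + anchor||^2 + lam1 ||b||_1,
    assuming each column of X has unit squared norm."""
    n, p = X.shape
    beta = np.zeros(p)
    r = y.copy()                          # residual at beta = 0
    for _ in range(max_iter):
        max_step = 0.0
        for j in range(p):
            r_j = r + X[:, j] * beta[j]   # partial residual excluding beta_j
            z = X[:, j] @ r_j - lam2 * anchor[j]
            b_new = soft_threshold(z, lam1) / (1.0 + lam2)
            max_step = max(max_step, abs(b_new - beta[j]))
            r = r_j - X[:, j] * b_new
            beta[j] = b_new
        if max_step < tol:
            break
    return beta

# Illustrative use: unit-norm standardized columns, LS fit as the anchor.
rng = np.random.default_rng(3)
X = rng.normal(size=(80, 8))
X = X - X.mean(axis=0)
X = X / np.linalg.norm(X, axis=0)
beta_true = np.array([3, 1.5, 0, 0, 2, 0, 0, 0], dtype=float)
y = X @ beta_true + 0.1 * rng.normal(size=80)
y = y - y.mean()
anchor = np.linalg.lstsq(X, y, rcond=None)[0]
beta_hat = kl1_coordinate_descent(X, y, lam1=0.05, lam2=0.1, anchor=anchor)
```

Since each coordinate update exactly minimizes the objective along that coordinate, the objective is non-increasing over sweeps, which is what makes this simple stopping rule on the step size safe.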

Simulation Studies
In this section, we designed three different cases to explore the performance of the proposed method (KL1) and the other estimators. The simulation model is defined as follows:

y = Xβ + ε, ε ∼ N(0, σ²I), (14)

The predictor matrix X is generated such that X ∼ N_p(0, Σ), where Σ_kj = ρ^{|k−j|} and ρ is the correlation between the predictors. Each independently simulated dataset is divided into three parts, a training set, a validation set and a test set, at a ratio of 60%:20%:20%. The training data were used to fit the model, while the validation data were used to predict the response while tuning the model's hyperparameters. The tuning parameters used in the coordinate descent algorithm (Algorithm 1) are chosen over a grid on the interval [0, 1]. The test data provide an unbiased evaluation of the final model fit on the training dataset. The simulation covers two dimensionality settings: low-dimensional and high-dimensional data.
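The design above (AR(1)-type correlation Σ_kj = ρ^{|k−j|} and a 60%:20%:20% split) can be sketched as follows; the specific seed and parameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, rho, sigma = 100, 8, 0.8, 5.0

# AR(1)-type predictor correlation: Sigma[k, j] = rho ** |k - j|.
idx = np.arange(p)
Sigma = rho ** np.abs(idx[:, None] - idx[None, :])
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
beta = np.array([3, 1.5, 0, 0, 2, 0, 0, 0], dtype=float)
y = X @ beta + rng.normal(scale=sigma, size=n)

# 60% / 20% / 20% train / validation / test split.
perm = rng.permutation(n)
train, valid, test = perm[:60], perm[60:80], perm[80:]
```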
The low-dimensional simulation was carried out for sample sizes of n = 100, 200, and 400, with predictor correlations ρ = 0.5, 0.8, and 0.99 and standard deviations σ = 5 and 10. We simulated 8 true βs with 50 datasets for each of the following cases:

Case 1: β = (3, 1.5, 0, 0, 2, 0, 0, 0);
Case 2: β = (3, 0, 0, 0, 0, 0, 0, 0);
Case 3: β = (3, 1.5, 2, 1.6, 2, 1.5, 1.5, 1.5).

The high-dimensional simulation was carried out for sample sizes of n = 150 and 200, again with ρ = 0.5, 0.8, and 0.99 and σ = 5 and 10, and we simulated 100 true βs with 50 datasets for the corresponding cases. Model performance was measured by the mean squared error: the mean squared error of the predictions on the test data was estimated as given in (15), and the mean squared error for estimating the true coefficients (β-MSE) through the training model as given in (16). The means and standard errors of these MSEs were recorded and reported. We also report the medians of the false positives (FP) and false negatives (FN) incurred in estimating the true coefficients of the model. By a false positive, we mean a coefficient that is zero in the simulation model specification but returns a nonzero estimate; a false negative is a truly nonzero coefficient whose estimate is shrunk to zero. This allows us to see how well each estimator reports the significant coefficients when it performs dimension reduction.
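The evaluation metrics can be computed as in the following sketch, using the selection reading of FP (truly zero coefficient estimated nonzero) and FN (truly nonzero coefficient shrunk to zero); the data values and function name are illustrative:

```python
import numpy as np

def selection_metrics(beta_true, beta_hat, X_test, y_test):
    """TMSE, beta-MSE, and FP/FN counts for a fitted coefficient vector.
    FP: truly-zero coefficient estimated as nonzero;
    FN: truly-nonzero coefficient estimated as exactly zero."""
    tmse = np.mean((y_test - X_test @ beta_hat) ** 2)
    bmse = np.mean((beta_hat - beta_true) ** 2)
    fp = int(np.sum((beta_true == 0) & (beta_hat != 0)))
    fn = int(np.sum((beta_true != 0) & (beta_hat == 0)))
    return tmse, bmse, fp, fn

beta_true = np.array([3.0, 1.5, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0])
beta_hat  = np.array([2.8, 1.2, 0.0, 0.3, 1.9, 0.0, 0.0, 0.0])
X_test = np.eye(8)
y_test = X_test @ beta_true
tmse, bmse, fp, fn = selection_metrics(beta_true, beta_hat, X_test, y_test)
# One truly-zero coefficient (index 3) was selected: FP = 1, FN = 0.
```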
The simulation results of the low-dimensional and high-dimensional simulations are reported in Supplementary Materials (Sections S1 and S2).We also report the spread of the TMSEs for the test prediction and βMSE for true coefficients for the high-dimensional cases in Supplementary Materials (Section S3).
A general observation of the simulation results (Supplementary Materials, Sections S1 and S2) is that the MSEs of the estimators increase as the standard deviation (σ) and the degree of multicollinearity (ρ) increase, and decrease as the sample size increases. Although our main focus is feature reduction in high-dimensional data, we observe that the proposed estimator predicts better than the other estimators.
For the low-dimensional results in case 1 (Supplementary Materials, Section S1, Tables S1–S3), we observed that the KL1 and GO estimators performed better than the other estimators: the KL1 estimator was better at predicting the test sample based on the MSE criterion, while the GO estimator had the minimum MSE for estimating the beta coefficients. In case 2 (Supplementary Materials, Section S1, Tables S4–S6), the KL1 estimator performed better at both model prediction and beta coefficient estimation. The performance of the KL1 estimator in case 3 (Supplementary Materials, Section S1, Tables S7–S9) mirrored that in case 1. In the high-dimensional simulations, the KL1 estimator performed better in model prediction, beta coefficient estimation and sensitivity in detecting the right coefficients as measured by FN and FP (Supplementary Materials, Section S2, Tables S10–S15).
Furthermore, although the GO estimator may compete with the KL1 estimator in β-MSE for the low-dimensional data, it is more appropriate to also examine the FP and FN values. These show that the KL1 estimator is preferable to the GO estimator where feature reduction is needed, that is, in cases 1 and 2.

Real-Life Application
In this section, we analyze two datasets, the prostate cancer dataset and the asphalt binder dataset, to evaluate the performance of the following estimators: the ridge, LASSO (L), E-net (E), GO, Liu-LASSO (LL) and the proposed (KL1) estimators. Each dataset was split into training (80%) and test (20%) data. Model fitting and parameter tuning were carried out via k-fold cross-validation with k = 10. The tuning parameters were selected from a grid on the interval [0, 1]: we computed the average error over the folds for each candidate value and chose the tuning parameters that minimized the mean squared error. The coefficients from the best model were then used to predict the response in the test sample, and model performance was judged via the test mean squared error (TMSE), defined as follows:

TMSE = (1/n_test) ∑ᵢ (yᵢ − ŷᵢ)²,

Figure 1 below gives the general framework of the analytical process employed in analyzing the real-life datasets. This process is followed for all the estimation techniques outlined in Section 2.
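The tuning procedure (10-fold cross-validation over a grid on [0, 1]) can be sketched generically; a ridge fit stands in here for any of the compared estimators, and all names and data are illustrative:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Stand-in estimator: ridge closed form (any fit function could be used)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def kfold_cv(X, y, fit, lambdas, k=10, seed=0):
    """Return the penalty value minimizing average validation MSE over k folds."""
    n = len(y)
    folds = np.array_split(np.random.default_rng(seed).permutation(n), k)
    avg_mse = []
    for lam in lambdas:
        errs = []
        for fold in folds:
            mask = np.ones(n, dtype=bool)
            mask[fold] = False            # hold out this fold
            b = fit(X[mask], y[mask], lam)
            errs.append(np.mean((y[fold] - X[fold] @ b) ** 2))
        avg_mse.append(np.mean(errs))
    return float(lambdas[int(np.argmin(avg_mse))])

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 5))
y = X @ np.ones(5) + rng.normal(size=100)
best_lam = kfold_cv(X, y, ridge_fit, np.linspace(0.0, 1.0, 11))
```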


Dataset I (Prostate Cancer Data)
The prostate data are employed to model the relationship between the level of prostate-specific antigen (lpsa) and a number of clinical measures derived from men who were about to receive a radical prostatectomy [1,21,22]. The model is defined as follows:

yᵢ = β₁x_i1 + β₂x_i2 + · · · + β₈x_i8 + εᵢ,

where the response yᵢ is the level of prostate-specific antigen, x_i1 is the logarithm of cancer volume, x_i2 is the logarithm of prostate weight, x_i3 is age in years, x_i4 is the natural logarithm of the amount of benign prostatic hyperplasia, x_i5 is seminal vesicle invasion, x_i6 is the logarithm of capsule penetration, x_i7 is the Gleason score, and x_i8 is the percentage of Gleason scores 4 or 5. Table 1 gives the statistical summary of the dataset, and Figure 2 shows the correlation plot. The response variable was centered, while the predictors were standardized.
The following estimators were compared: the ridge estimator, the LASSO estimator, the Liu-LASSO estimator, the GO estimator, the elastic-net (E-net), and the KL1 estimator. The estimated TMSE and the regression coefficients are provided in Table 2. The results show that the LASSO, Liu-LASSO and E-net estimators each selected an active set of six variables. The GO estimator failed to shrink any of the variables to zero for this particular dataset, and the ridge estimator, as expected, only performed regularization. The proposed KL1 estimator selected an active set of five variables and possessed the lowest test mean squared error, making it a useful fit for this dataset.

Dataset 2 (Asphalt Binder Data)
The data were employed to model the effects of some chemical compositions on surface free energy [8]. The model was defined as follows:

yᵢ = β₁x_i1 + β₂x_i2 + · · · + β₁₂x_i12 + εᵢ,

where yᵢ denotes surface free energy, x_i1 denotes saturates, x_i2 aromatics, x_i3 resins, x_i4 asphaltenes, x_i5 wax, x_i6 carbon, x_i7 hydrogen, x_i8 oxygen, x_i9 nitrogen, x_i10 sulfur, x_i11 nickel and x_i12 vanadium. The summary statistics are presented in Table 3 and the correlation plot in Figure 3. The response variable was centered, and the predictors were standardized to have zero mean and unit standard deviation before model fitting. The following estimators were compared: the ridge estimator, the LASSO estimator, the Liu-LASSO estimator, the GO estimator, the elastic-net (E-net), and the KL1 estimator. The estimated TMSE and the regression coefficients are provided in Table 4. The ridge estimator did not shrink any of its coefficients, as expected. The LASSO, Liu-LASSO, E-net and KL1 estimators each selected an active set of ten variables. Of all the estimators, the KL1 estimator possessed the smallest test mean squared error.

Table 1.
Summary statistics of the prostate cancer dataset.

Table 3.
Summary statistics for asphalt dataset.


Table 4.
Regression coefficients and MSE_y.