Abstract
Multicollinearity is a common issue in regression analyses that occurs when some predictor variables are highly correlated, leading to unstable least squares estimates of model parameters. Various estimation strategies have been proposed to address this problem. In this study, we enhanced a ridge-type estimator by incorporating pretest and shrinkage techniques. We conducted an analytical comparison to evaluate the performance of the proposed estimators in terms of their bias, quadratic risk, and numerical performance using both simulated and real data. Additionally, we assessed several penalization methods and three machine learning algorithms to facilitate a comprehensive comparison. Our results demonstrate that the proposed estimators outperformed the standard ridge-type estimator with respect to the mean squared error of the simulated data and the mean squared prediction error of two real data applications.
MSC:
62J05; 62J07; 62H12; 92B20
1. Introduction
A regression analysis provides answers to questions concerning the dependence of a variable, known as the response, on one or a set of independent variables, known as predictors. These include problems such as predicting the response value for a given collection of predictors or identifying the most significant group of predictors with a plausible impact on the response variable. A commonly used method to estimate the functional relationship between the response variable and the set of predictors is the ordinary least squares (OLS) method. By the Gauss–Markov theorem, this method provides the best linear unbiased estimator (BLUE) of the vector of regression coefficients. This result holds provided that the columns of the design matrix, which represent the independent variables, are not correlated. However, the OLS estimator becomes less efficient if a strong or near-to-strong linear relationship, known as multicollinearity, exists among the columns of the design matrix.
The primary objectives of this research were threefold. The initial goal of this study was to improve the ridge estimator (RE) of the regression vector, proposed by Kibria and Lukman [1], by integrating pretest and shrinkage approaches in the presence of multicollinearity among the regressor variables. This enhancement primarily focuses on situations where particular regression coefficients are considered insignificant. The second purpose was to examine the analytical findings on the bias and quadratic risks of the estimators for the newly proposed set. The third step involved evaluating the performance of the estimators using a numerical simulation, specifically by measuring the mean squared error. Additionally, the estimators were assessed using real-world data to determine the prediction error.
The contributions of this research are as follows: (i) the formulation of novel ridge-type pretest and shrinkage estimators in the context of multicollinearity, (ii) the analytical derivation of their bias and quadratic risks, and (iii) extensive Monte Carlo simulations to evaluate their efficacy against established estimators, in parallel with an application to a real-data example in conjunction with penalization and machine learning techniques to evaluate the predictive accuracy.
The subsequent sections of this paper are structured in accordance with our objectives. Section 2 contains a discussion of relevant work. In Section 3, we provide the regression model and the KL estimator. The strategy improvements for the RE estimator are outlined in Section 4. In Section 5, we study some analytical properties of the array of estimators—specifically, the bias and quadratic risks. Some penalizing techniques are presented in Section 6, and we present three machine learning algorithms in Section 7. In Section 8, we conduct a comparison of the array of estimators using both Monte Carlo simulations and two real-data examples. Section 9 provides concluding remarks, and a Supplementary Material is appended at the conclusion of the manuscript, which includes the proofs of the analytical findings.
2. Related Work
The multicollinearity issue lessens the accuracy of the OLS-estimated coefficients; the coefficient estimates can also exhibit significant fluctuations depending on the inclusion of different independent variables in the model, in addition to a high degree of sensitivity to even minor changes in the regression model. Several estimation methods have been developed to improve the OLS estimator when multicollinearity exists. For instance, Hoerl and Kennard [2] proposed the ridge regression estimator and analytically and numerically demonstrated the superiority of this new estimator over the OLS estimator. Later, Liu [3] introduced a novel biased estimator and analytically and numerically verified its superiority over the OLS estimator, known as the Liu-type estimator. Recently, Kibria and Lukman [1] developed a new ridge-type estimator for the regression parameter vector. They showed that the new estimator performed better than the ridge and Liu estimators in terms of the MSE criterion.
The OLS technique uses the sample information obtained from the data set to draw inference about the unknown population regression vector of parameters, which refers to the classical inference method. However, in the Bayesian context, the sample information is combined with non-sample information (NSI) to make an inference about the regression parameters. Such NSI may not be available at all times. However, model selection and building procedures, such as the Akaike information criterion (AIC), Bayesian information criterion (BIC), penalization methods, and machine learning algorithms, can still be utilized to yield NSI. Bancroft [4] was one of the first to try estimating regression coefficients by merging NSI with sample information and produced what is known as the pretest estimator. The pretest estimator relies on evaluating the statistical significance of certain regression coefficients. After the determination is made, the pretest selects either the estimate of the full model or the estimator of the revised model, known as the sub-model, which has fewer coefficients. The pretest algorithm chooses either the full or sub-model estimators using binary weights. A modified version of the pretest estimator, known as the shrinkage estimator, was formulated by Stein [5]. This estimator utilizes smooth weights to merge the estimations from both the overall and sub-model estimators. The regression coefficients are modified to converge towards a desired value that is influenced by the NSI. Nevertheless, the enhanced shrinkage estimator sometimes encounters an excessive reduction. Following that, a more improved version of this estimator was proposed by Stein [6] to efficiently address the issue of excessive shrinkage, known as the positive shrinkage estimator.
Many studies have been drawn to the idea of using shrinkage and pretest estimation approaches. For instance, Al-Momani et al. [7] proposed a novel approach utilizing pretest and shrinkage approaches to accurately estimate the regression coefficient vector of the marginal model for multinomial responses. Al-Momani [8] presented Liu-type pretest, shrinkage, and positive shrinkage estimators for the conditional autoregressive regression model’s large-scale effect parameter vector, and demonstrated that these estimators are more efficient than Liu-type estimators. By employing the concept of shrinkage, Arashi et al. [9] proposed an enhanced version of the Liu-type estimator. The proposed method’s superiority was demonstrated using analytical and numerical data. Subsequently, Arashi et al. [10] proposed a rank-based Liu regression estimator that improves the robustness against non-normal errors and outliers while addressing multicollinearity issues in multiple linear regression. Yüzbaşı et al. [11] proposed the use of pretest and shrinkage ridge estimation techniques for the linear regression model. They demonstrated the advantages of employing the suggested estimators alongside specific penalty estimators. Yüzbaşı et al. [12] introduced pretest and shrinkage approaches that use generalized ridge regression estimation to address issues related to multicollinearity and high-dimensional situations. Al-Momani et al. [13] introduced effective shrinkage and penalty estimators for regression coefficients in spatial error models. This showcases the efficiency improvements using asymptotic and numerical studies. In a recent study, Al-Momani and Arashi [14] showed that the ridge-type pretest and shrinkage estimators outperformed the maximum likelihood estimator in the presence of multicollinearity in a spatial error regression model. To obtain further information on shrinkage estimators, please consult with Saleh et al. [15], Saleh et al. [16], and Nkurunziza et al. 
[17], among other relevant sources.
3. Statistical Model and New Ridge Estimation
To explain the problem, let us consider the linear regression model given below:

$$y = X\beta + \varepsilon, \qquad (1)$$

where $y$ is an $n \times 1$ vector of random responses; $X$ is an $n \times p$ full-rank design matrix with $p < n$; $\beta$ is a $p \times 1$ vector of unknown, but fixed, regression parameters; and $\varepsilon$ is an $n \times 1$ random error vector such that the expected value and the variance of $\varepsilon$ are, respectively, $E(\varepsilon) = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2 I_n$. The ordinary least squares (OLS) estimator of $\beta$ is given by

$$\hat{\beta} = S^{-1}X'y, \qquad (2)$$

where $S = X'X$. Based on the Gauss–Markov theorem, $\hat{\beta}$ is the best linear unbiased estimator (BLUE) of $\beta$, which is the case when the columns of the design matrix are not correlated, with $\mathrm{Var}(\hat{\beta}) = \sigma^2 S^{-1}$. However, $\hat{\beta}$ becomes less efficient if a strong or near-to-strong linear relationship exists among the columns of $X$, which is known as multicollinearity. The ridge-type estimator proposed by [1], which is an estimator of the regression coefficient $\beta$, is denoted by $\hat{\beta}^{\mathrm{RE}}$ and given by

$$\hat{\beta}^{\mathrm{RE}} = \big(S + kI_p\big)^{-1}\big(S - kI_p\big)\hat{\beta},$$

where $I_p$ is a $p \times p$ identity matrix and $k > 0$ is known as a biasing parameter, to be estimated using the data.
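As a concrete illustration, the following numpy sketch computes the OLS estimator and the Kibria–Lukman ridge-type estimator on synthetic collinear data; the value k = 0.5 and the data-generating choices are illustrative assumptions, not the data-driven estimate of k discussed later.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 4
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + 0.05 * rng.normal(size=n)     # induce near-collinear columns
beta = np.array([1.0, 2.0, 0.0, 0.0])
y = X @ beta + rng.normal(size=n)

S = X.T @ X
beta_ols = np.linalg.solve(S, X.T @ y)            # OLS estimator

k = 0.5                                           # illustrative biasing parameter
I_p = np.eye(p)
# Kibria-Lukman ridge-type estimator: (S + kI)^(-1) (S - kI) beta_OLS
beta_kl = np.linalg.solve(S + k * I_p, (S - k * I_p) @ beta_ols)
```

Since every eigenvalue of $(S + kI_p)^{-1}(S - kI_p)$ lies strictly inside $(-1, 1)$ for $k > 0$, the ridge-type estimate always has a smaller Euclidean norm than the OLS estimate.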
The incorporation of NSI in a model often involves the inclusion of a hypothesized restriction on the model parameters, resulting in the emergence of potential sub-models. Bayesian statistical approaches have been developed in response to the need to incorporate NSI into models fitted to objective sample data. This allows for the consideration of the uncertainty introduced by both sources of information. If the NSI asserts that some of the regression coefficients are irrelevant, then it is possible to integrate this information into the estimation process via testing a linear hypothesis with the following form:

$$H_0: H\beta = \mathbf{0}, \qquad (3)$$

where $H$ is a known $q \times p$ matrix of rank $q$, and $\mathbf{0}$ is a $q \times 1$ vector of zeros.
It is important to emphasize that the NSI represented by the linear hypothesis in Equation (3) is not obtained from the same data; instead, it represents external or prior knowledge about the original model. In several practical fields, such knowledge may be obtained from previous empirical studies, theoretical models, or, as in some cases, a researcher with sufficient expertise about the important covariates in the model. For instance, in medical frameworks, previous studies may indicate that specific regression coefficients are insignificant, comparable, or functionally related to the response variable. In such cases, the pretest and shrinkage estimation procedures will provide a methodological basis for incorporating NSI into the estimation process.
Traditional variable selection methods (such as forward, backward, or best subset) are usually employed just to illustrate the construction of a possible sub-model used for comparison purposes; they should not be interpreted as the theoretical foundation of the NSI, but rather as a practical demonstration that a researcher can formulate a sub-model in the absence of robust prior information. Therefore, the pretest and shrinkage estimators aim to combine the NSI with the sample information to obtain better estimators with respect to a specific metric.
The new sub-model often incorporates a reduced number of regression variables, facilitating interpretation and mitigating the complexity associated with a large number of irrelevant variables. A candidate sub-model may be obtained by using some known variable selection methods, such as the AIC or BIC, or by penalization techniques such as ridge estimation; the least absolute shrinkage and selection operator (LASSO); an elastic net, which is a regularized method that combines the penalties of both the LASSO and ridge regression; smoothly clipped absolute deviation (SCAD); or an adaptive LASSO, among others. Under the restriction given by the hypothesis in Equation (3), and by using Lagrange multipliers, the sub-model estimator of $\beta$, denoted by $\hat{\beta}_{\mathrm{R}}$, is given by the following:

$$\hat{\beta}_{\mathrm{R}} = \hat{\beta} - S^{-1}H'\big(HS^{-1}H'\big)^{-1}H\hat{\beta}.$$

Obviously, if the restriction in Equation (3) is correct, then $\hat{\beta}_{\mathrm{R}}$ is an unbiased estimator for $\beta$. On the other hand, when the restriction is not true, $\hat{\beta}_{\mathrm{R}}$ becomes less efficient than $\hat{\beta}$, particularly in cases where there is ongoing multicollinearity among the columns of $X$. This can be diminished by employing the Kibria and Lukman estimation method to calculate the sub-model regression coefficients, denoted by $\hat{\beta}^{\mathrm{RRE}}$ and given as follows:

$$\hat{\beta}^{\mathrm{RRE}} = \big(S + kI_p\big)^{-1}\big(S - kI_p\big)\hat{\beta}_{\mathrm{R}}.$$
4. Strategy Improvements for the RE Estimator
Given that a sub-model was acquired via the use of penalization methods or machine learning algorithms, we explored various shrinkage and pretest procedures to combine the estimations from both the full and sub-models. First, we established the definition of a test statistic, designated as $\mathcal{F}$, which was used to test the hypothesis stated in Equation (3) as follows:

$$\mathcal{F} = \frac{\big(H\hat{\beta}\big)'\big(HS^{-1}H'\big)^{-1}\big(H\hat{\beta}\big)}{q\,\hat{\sigma}^2}, \qquad \hat{\sigma}^2 = \frac{\big(y - X\hat{\beta}\big)'\big(y - X\hat{\beta}\big)}{n - p},$$

where $\hat{\sigma}^2$ is an estimator of $\sigma^2$. Assuming the null hypothesis in Equation (3), the test statistic is asymptotically distributed according to an F-distribution with $(q, n - p)$ degrees of freedom. However, if the alternative hypothesis holds, the test statistic follows a non-central F-distribution with a non-centrality parameter $\Delta$. An improved set of shrinkage estimators using the technique in [1] can be formulated as follows:

$$\hat{\beta}^{g} = \hat{\beta}^{\mathrm{RRE}} + g(\mathcal{F})\big(\hat{\beta}^{\mathrm{RE}} - \hat{\beta}^{\mathrm{RRE}}\big),$$

where $g(\mathcal{F})$ is a Borel measurable function of $\mathcal{F}$, and certain selections of $g$ may provide shrinkage estimators that are both plausible and helpful.
4.1. Preliminary and Shrinkage Estimators
The preliminary test estimator can be obtained when the function $g(\mathcal{F}) = I\big(\mathcal{F} > F_{q,n-p}(\alpha)\big)$; it is denoted by $\hat{\beta}^{\mathrm{REPT}}$ and given by the following:

$$\hat{\beta}^{\mathrm{REPT}} = \hat{\beta}^{\mathrm{RRE}} + I\big(\mathcal{F} > F_{q,n-p}(\alpha)\big)\big(\hat{\beta}^{\mathrm{RE}} - \hat{\beta}^{\mathrm{RRE}}\big),$$

where $I(\cdot)$ is the indicator function, $\alpha$ is the level of significance, and $F_{q,n-p}(\alpha)$ represents the upper $\alpha$-critical value of the F-distribution with $(q, n - p)$ degrees of freedom. Clearly, if the indicator function is zero, then $\hat{\beta}^{\mathrm{REPT}} = \hat{\beta}^{\mathrm{RRE}}$; otherwise, $\hat{\beta}^{\mathrm{REPT}} = \hat{\beta}^{\mathrm{RE}}$. One limitation of this estimator is its reliance on a binary choice between the two estimators, which is influenced by the level of significance $\alpha$. A more refined method of assigning weights may be attained by setting $g(\mathcal{F}) = 1 - d\,\mathcal{F}^{-1}$, which yields the James–Stein shrinkage estimator. It can be formally stated as follows:

$$\hat{\beta}^{\mathrm{RES}} = \hat{\beta}^{\mathrm{RRE}} + \big(1 - d\,\mathcal{F}^{-1}\big)\big(\hat{\beta}^{\mathrm{RE}} - \hat{\beta}^{\mathrm{RRE}}\big), \qquad d = \frac{(q-2)(n-p)}{q(n-p+2)}.$$
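The pretest rule described above can be sketched as follows, under the simplifying assumption that the sub-model sets the last q coefficients to zero; the biasing parameter k = 0.5, the significance level, and the data are all illustrative.

```python
import numpy as np
from scipy import stats

def kl_estimator(X, y, k):
    """Kibria-Lukman ridge-type estimator (S + kI)^(-1) (S - kI) beta_OLS."""
    S = X.T @ X
    b_ols = np.linalg.solve(S, X.T @ y)
    I_p = np.eye(X.shape[1])
    return np.linalg.solve(S + k * I_p, (S - k * I_p) @ b_ols)

rng = np.random.default_rng(1)
n, p, q, k, alpha = 60, 5, 2, 0.5, 0.05
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, 1.5, 1.0, 0.0, 0.0]) + rng.normal(size=n)

full = kl_estimator(X, y, k)                 # full-model RE estimate
sub = np.zeros(p)
sub[:p - q] = kl_estimator(X[:, :p - q], y, k)   # sub-model: last q coefs = 0

# F statistic comparing the full and restricted OLS fits
rss = lambda Z: np.sum((y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]) ** 2)
F = ((rss(X[:, :p - q]) - rss(X)) / q) / (rss(X) / (n - p))
crit = stats.f.ppf(1 - alpha, q, n - p)

beta_pt = sub if F <= crit else full         # binary pretest choice
```

The estimator equals the sub-model estimate whenever the F statistic fails to reject the restriction, and the full-model estimate otherwise.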
The shrinkage estimator experiences over-shrinkage, resulting in negative coordinates when $\mathcal{F} < d$. The positive James–Stein estimator solves this issue. It is denoted by $\hat{\beta}^{\mathrm{REPS}}$, and can be obtained by setting

$$g(\mathcal{F}) = \big(1 - d\,\mathcal{F}^{-1}\big)I\big(\mathcal{F} > d\big),$$

which can be simplified as follows:

$$\hat{\beta}^{\mathrm{REPS}} = \hat{\beta}^{\mathrm{RES}} - \big(1 - d\,\mathcal{F}^{-1}\big)I\big(\mathcal{F} \le d\big)\big(\hat{\beta}^{\mathrm{RE}} - \hat{\beta}^{\mathrm{RRE}}\big).$$
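The James–Stein and positive-part weights can be sketched in a few lines; the shrinkage constant d = (q-2)(n-p)/{q(n-p+2)} is the standard choice from this literature and, together with the example vectors, is our assumption here.

```python
import numpy as np

def shrinkage_pair(full, sub, F, q, n, p):
    """James-Stein-type and positive-part combinations of the full- and
    sub-model estimates; d is the standard shrinkage constant (assumed)."""
    d = (q - 2) * (n - p) / (q * (n - p + 2))
    js = sub + (1.0 - d / F) * (full - sub)          # James-Stein-type weight
    ps = sub + max(1.0 - d / F, 0.0) * (full - sub)  # positive-part weight
    return js, ps

full = np.array([1.0, 2.0, 0.3, -0.2])   # hypothetical full-model estimate
sub = np.array([1.1, 1.9, 0.0, 0.0])     # hypothetical sub-model estimate
js_small, ps_small = shrinkage_pair(full, sub, F=0.1, q=4, n=30, p=6)
js_large, ps_large = shrinkage_pair(full, sub, F=100.0, q=4, n=30, p=6)
```

For a small test statistic (F below d), the positive-part version collapses to the sub-model estimate, avoiding the sign reversal that the plain weight 1 - d/F would produce; for a large F, both versions coincide and lean toward the full-model estimate.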
4.2. Modified Preliminary and Shrinkage Estimators
Another way to enhance the regression estimate when the null hypothesis in Equation (3) is true, which could utilize any type of NSI, is to consider the linear shrinkage estimator of the sub- and full-model RE estimators. Ahmed S. E. [18] provided evidence that this estimator is competitive when the NSI is correct. The estimator is denoted by $\hat{\beta}^{\mathrm{LSRE}}$ and given by the following:

$$\hat{\beta}^{\mathrm{LSRE}} = \lambda\,\hat{\beta}^{\mathrm{RRE}} + (1 - \lambda)\,\hat{\beta}^{\mathrm{RE}},$$

where $\lambda \in [0, 1]$ serves as a tuning parameter, which is selected to minimize the estimator's mean squared error. Obviously, when $\lambda = 1$, $\hat{\beta}^{\mathrm{LSRE}} = \hat{\beta}^{\mathrm{RRE}}$, and when $\lambda = 0$, $\hat{\beta}^{\mathrm{LSRE}} = \hat{\beta}^{\mathrm{RE}}$. Additionally, we can employ the idea proposed by Ahmed S. E. [19] to produce a new RE estimator that is often referred to as the shrinkage pretest estimator. It is denoted by $\hat{\beta}^{\mathrm{SPTRE}}$ and given as follows:

$$\hat{\beta}^{\mathrm{SPTRE}} = \hat{\beta}^{\mathrm{RE}} - \lambda\big(\hat{\beta}^{\mathrm{RE}} - \hat{\beta}^{\mathrm{RRE}}\big)I\big(\mathcal{F} \le F_{q,n-p}(\alpha)\big).$$
5. Analytical Properties
This section presents the bias and quadratic risk functions of the proposed RE estimators. For this purpose, assume $\hat{\beta}^{*}$ is any of these RE-type estimators; consequently, the bias of $\hat{\beta}^{*}$ is $B(\hat{\beta}^{*}) = E(\hat{\beta}^{*}) - \beta$. The bias expressions are given in the following Theorem 1.
Theorem 1.
Let and ; then,
- 1.
- 2.
- 3.
- 4.
- 5.
- 6.
- 7.
where $G_{\nu_1,\nu_2}(\cdot\,; \Delta)$ is the non-central cumulative distribution function of an F random variable with $(\nu_1, \nu_2)$ degrees of freedom, $F_{q,n-p}(\alpha)$ is the $\alpha$-critical value for the F-distribution, and $\Delta$ is a non-centrality parameter. For the proof of the above theorem, see Appendix A.
In order to obtain the quadratic risk expressions, we used the quadratic loss function of any RE-type estimator $\hat{\beta}^{*}$, which is defined for any positive definite matrix $W$ as follows:

$$L\big(\hat{\beta}^{*}; W\big) = \big(\hat{\beta}^{*} - \beta\big)'W\big(\hat{\beta}^{*} - \beta\big).$$

The quadratic risk function of any estimator $\hat{\beta}^{*}$ is denoted by $R\big(\hat{\beta}^{*}; W\big)$ and defined as $R\big(\hat{\beta}^{*}; W\big) = E\big[L\big(\hat{\beta}^{*}; W\big)\big]$. The quadratic risk expressions are given in the following Theorem 2.
Theorem 2.
Let $W$ be a positive definite matrix; then,
For the proof of the above theorem, see Appendix A. In the following section, we enumerate some penalization methods from the literature that can be used to produce a sub-model.
6. Commonly Used Penalizing Techniques
This section provides a brief overview of some common penalization techniques—ridge, LASSO, elastic net, SCAD, and adaptive LASSO—that perform concurrent parameter estimation and variable selection. These techniques are incorporated for comparative analyses and to develop sub-models to support the proposed ridge-type pretest and shrinkage estimators. Utilizing these common techniques underscores the practical significance and relative benefits of the methods presented in this study. Penalty estimators are produced by applying a penalty to the least squares equation, so that model selection and parameter estimation are carried out simultaneously. Some of these penalty techniques are described below.
6.1. Ridge Estimator
The ridge estimator proposed by Hoerl and Kennard [2] efficiently addresses the issue of multicollinearity by using a technique known as coefficient shrinkage, which reduces the magnitudes of the coefficients associated with strongly correlated variables. This approach aids in achieving model stability and mitigating the influence of multicollinearity on the estimation of coefficients. It is denoted by $\hat{\beta}^{\mathrm{ridge}}$, and can be obtained by minimizing the penalized residual sum of squares. It is given by the following:

$$\hat{\beta}^{\mathrm{ridge}} = \arg\min_{\beta}\Big\{\|y - X\beta\|^2 + k\|\beta\|^2\Big\} = \big(X'X + kI_p\big)^{-1}X'y,$$

where $k > 0$ is a constant known as a biasing parameter.
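The penalized minimization above has the familiar closed form, which the following sketch verifies against scikit-learn's `Ridge` (whose `alpha` plays the role of the biasing parameter k); the data are illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 3))
y = X @ np.array([1.0, -1.0, 0.5]) + 0.1 * rng.normal(size=40)

k = 1.0
# Ridge minimizes ||y - X b||^2 + alpha * ||b||^2 (no intercept here)
fit = Ridge(alpha=k, fit_intercept=False).fit(X, y)
closed_form = np.linalg.solve(X.T @ X + k * np.eye(3), X.T @ y)
```

The fitted coefficients agree with the closed-form solution $(X'X + kI)^{-1}X'y$ up to numerical tolerance.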
6.2. LASSO Estimator
The LASSO, or least absolute shrinkage and selection operator, was proposed by Tibshirani [20]. It is a method for selecting variables and estimating parameters in linear models. In the usual least squares estimation of regression coefficients, the LASSO algorithm employs the $\ell_1$ norm of the coefficient vector to define a penalty term. The LASSO estimator is denoted by $\hat{\beta}^{\mathrm{LASSO}}$ and given by the following:

$$\hat{\beta}^{\mathrm{LASSO}} = \arg\min_{\beta}\Big\{\|y - X\beta\|^2 + \lambda\sum_{j=1}^{p}|\beta_j|\Big\},$$

where $\lambda \ge 0$ is the tuning parameter to be estimated from the data. However, it is known that the LASSO technique may not be the most ideal approach when dealing with a set of columns in a design matrix that exhibit significant levels of correlation. As a solution, Zou et al. [21] introduced the elastic net (ELNT) method, which combines the $\ell_1$ and $\ell_2$ penalty terms in a linear manner.
6.3. Elastic Net Estimator
The elastic net estimator is denoted by $\hat{\beta}^{\mathrm{ELNT}}$ and can be obtained as follows:

$$\hat{\beta}^{\mathrm{ELNT}} = \arg\min_{\beta}\Big\{\|y - X\beta\|^2 + \lambda_1\sum_{j=1}^{p}|\beta_j| + \lambda_2\sum_{j=1}^{p}\beta_j^2\Big\},$$

where $\lambda_1$ and $\lambda_2$ are, respectively, the LASSO and the ridge tuning parameters. A variable selection process is said to possess the oracle property when it successfully identifies the correct subset of zero coefficients inside the regression model being examined. Additionally, the estimators of the remaining non-zero coefficients demonstrate consistency and asymptotic normality. The LASSO estimator does not enjoy this property. However, Fan and Li [22] and Zou [23] developed two approaches that possess the oracle property. In the following two subsections, we establish the definitions for these methods.
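The LASSO and elastic net fits can be sketched with scikit-learn as follows; the tuning parameters and the synthetic data are illustrative (scikit-learn's `l1_ratio` mixes the two penalties rather than exposing them separately).

```python
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(3)
n, p = 100, 6
X = rng.normal(size=(n, p))
y = X @ np.array([3.0, 1.5, 0.0, 0.0, 2.0, 0.0]) + rng.normal(size=n)

lasso = Lasso(alpha=0.5).fit(X, y)                    # L1 penalty: exact zeros
elnt = ElasticNet(alpha=0.5, l1_ratio=0.5).fit(X, y)  # mixed L1/L2 penalty
```

Unlike ridge regression, the L1 part of both penalties sets some coefficients exactly to zero, which is what makes these methods usable for sub-model selection.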
6.4. SCAD Estimator
The SCAD, or smoothly clipped absolute deviation estimator, is denoted by $\hat{\beta}^{\mathrm{SCAD}}$ and obtained as follows:

$$\hat{\beta}^{\mathrm{SCAD}} = \arg\min_{\beta}\Big\{\|y - X\beta\|^2 + \sum_{j=1}^{p}p_{a,\lambda}\big(|\beta_j|\big)\Big\},$$

where $p_{a,\lambda}(t)$ is a continuous function of t. Its derivative is given by the following:

$$p'_{a,\lambda}(t) = \lambda\Big\{I(t \le \lambda) + \frac{(a\lambda - t)_{+}}{(a-1)\lambda}\,I(t > \lambda)\Big\}, \qquad t \ge 0,$$

where $\lambda \ge 0$ and $a > 2$ are known as tuning parameters. Note that, when $a = \infty$, the function $p_{a,\lambda}$ is equivalent to the $\ell_1$ penalty.
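The piecewise form of the SCAD penalty derivative is easy to sketch directly; a = 3.7 is the value commonly recommended in the literature.

```python
import numpy as np

def scad_deriv(t, lam, a=3.7):
    """Derivative of the SCAD penalty for t >= 0: constant (L1-like) below
    lam, linearly decaying on (lam, a*lam], and zero beyond a*lam."""
    t = np.asarray(t, dtype=float)
    return lam * ((t <= lam) + np.maximum(a * lam - t, 0.0)
                  / ((a - 1) * lam) * (t > lam))
```

Small coefficients are penalized like the LASSO, while coefficients larger than a*lam receive no penalty at all, which is how SCAD avoids over-shrinking large effects.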
6.5. Adaptive LASSO Estimator
Using adaptive weights on the penalties of regression coefficients, the adaptive LASSO modifies the LASSO penalty. Theoretically, it has been demonstrated that the adaptive LASSO estimator enjoys the oracle property. It is denoted by $\hat{\beta}^{\mathrm{aLASSO}}$ and obtained as follows:

$$\hat{\beta}^{\mathrm{aLASSO}} = \arg\min_{\beta}\Big\{\|y - X\beta\|^2 + \lambda\sum_{j=1}^{p}\hat{w}_j|\beta_j|\Big\},$$

where $\hat{w}_j$ is a weight function defined as $\hat{w}_j = 1/|\hat{\beta}^{*}_j|^{\gamma}$ for some $\gamma > 0$, and $\hat{\beta}^{*}$ is a root-n consistent estimator of $\beta$. The minimization process for the adaptive LASSO solution poses no computational challenges and can be solved easily. One possible choice for $\hat{\beta}^{*}$ is the OLS estimate of $\beta$. Next, we will describe some machine learning algorithms that are used in this manuscript.
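The adaptive LASSO reduces to an ordinary LASSO after rescaling the columns by the weights, as the following sketch shows with OLS-based weights and gamma = 1; the tuning parameter and data are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)
n, p = 100, 5
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, 0.0, 1.0, 0.0, 0.0]) + rng.normal(size=n)

b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
w = 1.0 / np.abs(b_ols)               # adaptive weights with gamma = 1

# Substituting c_j = w_j * b_j turns the weighted problem into a plain LASSO
Xw = X / w                            # divide column j by w_j
fit = Lasso(alpha=0.1).fit(Xw, y)
b_alasso = fit.coef_ / w              # transform back to the original scale
```

Coefficients with small OLS estimates receive large weights and are pushed to exactly zero, while strong effects are penalized only lightly.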
7. Machine Learning Algorithms
This section presents three machine learning algorithms—random forest, K-nearest neighbors, and neural networks—employed to complement the penalization strategies and to further evaluate the predictive effectiveness of the suggested estimators. Integrating these algorithms allows for a more comprehensive comparison between traditional statistical estimators and contemporary data-driven techniques.
The area of regression analysis has been significantly transformed by the advent of machine learning, which provides robust methodologies for extracting meaningful insights and achieving precise predictions from intricate data sets. Machine learning has expanded the scope of regression analyses, allowing us to use the predictive capacities of algorithms to unveil latent patterns and make informed, data-driven judgments. The three machine learning algorithms are briefly described in the following subsections.
7.1. Random Forest
Ho [24] suggested a technique for creating tree-based classifiers that may be expanded as needed to improve the accuracy on both training and unseen data, which is known as random forest. Random forest is one of the most effective and flexible algorithms that can be used for both classification and regression. It is a kind of ensemble learning that combines the results of many different decision trees to produce more reliable predictions. It builds several decision trees on distinct subsets of the training data and allows them to generate their own predictions. Randomization is introduced via bootstrapping, i.e., sampling data points from the training set with replacement, and via selecting a random subset of features for each tree. Later on, Ho [25] discussed the problem of overfitting and obtaining the most accurate results by building a decision-tree-based predictor that stays as accurate as possible on the training data and becomes more accurate as it becomes more complicated. In a random forest, the final prediction is derived by aggregating the predictions of all the individual trees, which in a regression setting means averaging them. This ensemble method reduces overfitting and increases the algorithm's precision and stability. More details about the extensions and developments of the algorithm can be found in the studies by Breiman [26] and Liaw et al. [27], among others.
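A minimal regression sketch with scikit-learn's random forest, on hypothetical nonlinear data; the number of trees and the data-generating function are illustrative choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(5)
X = rng.uniform(size=(200, 3))
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.normal(size=200)

# 100 bootstrapped trees; the forest prediction averages over the trees
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
pred = rf.predict(X[:5])
```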
7.2. K-Nearest Neighbors
K-nearest neighbors (KNN) is an effective machine learning algorithm used for classification and regression, like the random forest algorithm. It was initially proposed by Fix and Hodges [28], and later on, Cover and Hart [29] made modifications to it. The KNN method employs supervised learning in a non-parametric fashion. In the context of economic forecasting, Imandoust and Bolandraftar [30] examined the application of the KNN method and showed that it exhibits a greater efficacy compared to alternative methodologies. Lubis et al. [31] examined the implementation of the Euclidean distance formula in KNN in comparison to the normalized Euclidean distance, Manhattan, and normalized Manhattan distances. Cunningham et al. [32] provided a comprehensive examination of several methodologies employed in nearest neighbor classification.
KNN uses instance-based, non-parametric learning. The method does not rely heavily on assumptions about the underlying data distribution. However, it produces predictions via evaluating similarities between the data points. In a regression context, the method computes the average or weighted average of the response value of the K-nearest neighbors to predict the response value for the new data point. The KNN algorithm is often used for initial machine learning applications due to its simplicity in comprehension and implementation. Yet, the performance of the system may be influenced by the selection of K and the specific distance measure used. Moreover, the computational cost associated with this approach is high, since it necessitates the calculation of distances for every data point in the data set during the prediction stage.
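The averaging rule described above can be sketched as follows, using a distance-weighted average of the five nearest neighbors; K, the weighting scheme, and the data are illustrative.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(6)
X = rng.uniform(size=(150, 2))
y = X[:, 0] + X[:, 1] ** 2 + 0.05 * rng.normal(size=150)

# Predict via a distance-weighted average of the 5 nearest neighbours
knn = KNeighborsRegressor(n_neighbors=5, weights="distance").fit(X, y)
```

There is no training beyond storing the data; all the computational cost is paid at prediction time, when distances to the stored points must be evaluated.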
7.3. Neural Network
A neural network is a computer model that draws inspiration from the anatomical and functional characteristics of the human brain. One of the first attempts at this method was proposed by McCulloch and Pitts [33], who established the foundation for comprehending the ability of basic computational units, resembling neurons, to execute intricate logical operations. Subsequently, other improvements and enhancements have been incorporated into this approach. Masri et al. [34] introduced a detection approach based on a neural network method for identifying changes in the properties of systems with unknown structures. Angelini et al. [35] presented a case study showcasing the effective operation of neural networks in the evaluation of credit risk. Du [36] provided a thorough examination of clustering neural network approaches based on competitive learning. Adeli [37] provided an overview of academic articles on neural networks that have been published in archival research journals from 1989 to 2000.
The purpose of a neural network is to effectively process information and acquire knowledge via the analysis of data. Neural networks are composed of linked nodes, which are arranged in layers. These networks have an exceptional proficiency in addressing intricate issues, notably in domains like pattern recognition, picture and audio recognition, natural language processing, and other related fields.
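A small feed-forward regression network can be sketched with scikit-learn's `MLPRegressor`; the single hidden layer of 32 units, the `lbfgs` solver, and the quadratic target are all illustrative choices for a small data set.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(7)
X = rng.uniform(-1, 1, size=(300, 1))
y = X[:, 0] ** 2                      # simple nonlinear target

# One hidden layer of 32 units; lbfgs suits small, low-dimensional problems
nn = MLPRegressor(hidden_layer_sizes=(32,), solver="lbfgs",
                  max_iter=2000, random_state=0).fit(X, y)
```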
After obtaining a sub-model estimator in addition to the full-model estimator, which contains all the available variables, it becomes possible to assess the accuracy of the sub-model using different criteria. The objective of this study was to identify estimators that are functions of both the full- and sub-model estimators, with the purpose of mitigating the potential risks associated with either of these estimators over a wide range of parameter values. This objective was accomplished by developing the pretest and shrinkage estimators presented in Section 4.
8. Numerical Illustrations
In this section, we investigate the finite-sample properties of the proposed estimators using Monte Carlo simulation experiments and two real-data examples.
8.1. Simulation Experiments
The design of the simulation experiment is determined by the properties of the estimators under investigation and the criteria applied to evaluate the findings. Following [1], the following equation was used to produce the regressor variables:

$$x_{ij} = \big(1 - \gamma^2\big)^{1/2} z_{ij} + \gamma z_{i(p+1)}, \qquad i = 1, \dots, n, \quad j = 1, \dots, p,$$

where $z_{ij}$ is an independent and identically distributed standard normal random variable and $\gamma^2$ is the correlation between $x_{ij}$ and $x_{ij'}$ for $j \ne j'$. The response variable was then obtained via the following equation:

$$y_i = x_{i1}\beta_1 + x_{i2}\beta_2 + \dots + x_{ip}\beta_p + \varepsilon_i, \qquad i = 1, \dots, n,$$
where is , the regression vector is partitioned as , and varies over the set , which represents the degree of deviation from the null hypothesis in Equation (3). In this simulation, we chose the value of k as follows: Following [1,38], we rewrote the model in Equation (1) in its canonical form as , where , , and are the eigenvalues of and is an orthogonal matrix whose columns are the corresponding eigenvectors. Thus, we have ; then, we estimated the value k as , and is the estimated value of obtained by fitting the regression model to the generated data. The biasing parameter was determined by utilizing the KL heuristic outlined in Appendix A.2—Algorithm A1. We set , as suggested by [18], and . The correlation coefficient was chosen to vary over the set , , and . We set for testing the hypothesis in Equation (3). It was seen that the performance of all the estimators had a similar pattern when the values of , and q were varied. To save space, we chose , , and ; then, we ran the simulation over iterations for . For each estimator, we computed the mean squared error (MSE) as follows:
$$\mathrm{MSE}\big(\hat{\beta}^{*}\big) = \frac{1}{N}\sum_{r=1}^{N}\big(\hat{\beta}^{*}_{(r)} - \beta\big)'\big(\hat{\beta}^{*}_{(r)} - \beta\big),$$

where $\hat{\beta}^{*}$ is any of the proposed estimators in this study and $\hat{\beta}^{*}_{(r)}$ denotes its estimate in the $r$-th of the $N$ simulation replications.
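The regressor-generation scheme above can be sketched as follows; the sample size, dimension, and gamma are illustrative values, and the empirical correlation matrix is computed to confirm that each pair of regressors has correlation close to gamma squared.

```python
import numpy as np

rng = np.random.default_rng(8)
n, p, gamma = 1000, 4, 0.9

# x_ij = sqrt(1 - gamma^2) z_ij + gamma z_i(p+1): shared component induces
# pairwise correlation gamma^2 between any two regressor columns
Z = rng.normal(size=(n, p + 1))
X = np.sqrt(1 - gamma ** 2) * Z[:, :p] + gamma * Z[:, [p]]

emp_corr = np.corrcoef(X, rowvar=False)   # empirical correlation matrix
```

With gamma = 0.9 the off-diagonal correlations concentrate near 0.81, producing the severe multicollinearity the simulation is designed to study.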
For the purpose of comparison, we used the simulated relative efficiency (SRE) of the mean squared error with respect to $\hat{\beta}^{\mathrm{RE}}$, which is defined as follows:

$$\mathrm{SRE}\big(\hat{\beta}^{*}\big) = \frac{\mathrm{MSE}\big(\hat{\beta}^{\mathrm{RE}}\big)}{\mathrm{MSE}\big(\hat{\beta}^{*}\big)}.$$
The complete simulation framework used to calculate the MSEs and relative efficiencies of all estimators is provided in Appendix A.2—Algorithm A2. An SRE value greater than one indicates the superiority of the corresponding estimator over $\hat{\beta}^{\mathrm{RE}}$, and vice versa. Figure 1 shows the graphs for the cases we considered. The following conclusions can be drawn:
Figure 1.
SRE of the suggested estimators with respect to the RE estimator for the parameter configurations described in the text. Note: The abbreviations used in this figure denote the following ridge-type estimators: RRE = restricted (sub-model) RE estimator; REPT = RE pretest estimator; RES = RE shrinkage estimator; REPS = RE positive shrinkage estimator; LSRE = linear shrinkage RE estimator; SPTRE = shrinkage pretest RE estimator.
- The sub-model estimator consistently outperformed all other estimators when the null hypothesis in Equation (3) was true or approximately true. However, its relative efficiency decreased and eventually approached zero as the deviation from the null hypothesis increased. Moreover, all the proposed estimators outperformed the regular RE estimator in terms of the mean squared error across all degrees of deviation.
- For all degrees of deviation, the RE positive shrinkage estimator dominated all other estimators except when the sub-model was true, in which case the RE sub-model and pretest estimators outperformed it.
- The relative efficiencies exhibited a consistent pattern when the remaining design parameters were held constant for both sample sizes used in this simulation.
8.2. Example 1: Biomass Production in the Cape Fear Estuary
In this section, we consider a case study discussed by Rawlings et al. [39] in two different chapters of their book, also freely available in the "VisCollin" R package by Friendly [40]. The data were originally studied by Rick [41]. His goal was to detect the key soil factors that affect the aerial biomass in the marsh grass Spartina alterniflora in the Cape Fear Estuary of North Carolina at three different locations. At each location, three types of Spartina vegetation areas were sampled, namely devegetated "dead" areas, "short" Spartina areas, and "tall" Spartina areas. Five samples of the soil substrate were collected from different sites within each location and vegetation type. These samples were then evaluated for 14 different physico-chemical parameters of the soil monthly for several months, resulting in a total of 45 samples. Thus, there were 45 observations in this data set covering the following 17 variables: the location loc; area type type; hydrogen sulfide H2S; salinity in percentage SAL; ester hydrolase Eh7; soil acidity in water pH; buffer acidity at pH 6.6 BUF; the concentrations of phosphorus P, potassium K, calcium Ca, magnesium Mg, sodium Na, manganese Mn, zinc Zn, copper Cu, and ammonium NH4; and aerial biomass BIO as the response variable.
One main objective was to employ the set of variables that can accurately predict the response variable. As a first investigation of multicollinearity, we constructed a correlation plot among these variables, which is given below in Figure 2.
Figure 2.
Correlation matrix plot for Biomass data.
The plot shows that there are many significant relationships between these variables, and some of these relations are very strong and highly significant. This indicates that multicollinearity exists among these variables, which can be easily detected from the variance inflation factor plot given below in Figure 3.
Figure 3.
Variance inflation factor plot for Biomass data.
The VIF plot reveals a serious multicollinearity problem, as many of the VIF values exceed 5. Hence, the RE estimator should mitigate this problem and produce better estimates of the parameters. Moreover, the proposed estimators offer a further improvement in estimating and predicting the target response variable BIO.
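To make the diagnostic concrete, the VIF of predictor j is 1/(1 − R_j²), where R_j² is the coefficient of determination from regressing the j-th column of the design matrix on the remaining predictors. The following minimal Python sketch (illustrative only; the paper's analysis was carried out in R, and the data here are simulated) shows how near-collinear columns drive the VIF past the usual threshold of 5:

```python
import numpy as np

def vif(X):
    """Variance inflation factors: VIF_j = 1 / (1 - R_j^2), where R_j^2 is the
    R^2 from regressing column j on the remaining columns (with an intercept)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        Z = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        resid = y - Z @ beta
        r2 = 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
        out[j] = 1.0 / (1.0 - r2)
    return out

# Two nearly collinear columns inflate each other's VIF well past 5,
# while an independent column stays near 1.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
X = np.column_stack([x1, x1 + 0.05 * rng.normal(size=200), rng.normal(size=200)])
print(vif(X))
```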
In the absence of prior knowledge, the restriction on the parameters is established either through expert judgment or through existing variable selection methodologies, such as the Akaike information criterion (AIC), forward selection (FW), backward elimination (BE), best subset selection (BS), or the Bayesian information criterion (BIC), or through penalization algorithms, such as the LASSO and adaptive LASSO, to produce a sub-model. In this example, we first employed the forward, backward, and best subset selection methods to produce sub-models and then obtained the RE, pretest, and shrinkage estimators. Second, we applied random forest, K-nearest neighbors, and a neural network as machine learning algorithms to compare their prediction errors with those of the seven proposed estimators. The sub-models selected by the forward, backward, and best subset selection methods are summarized in Table 1 below.
Table 1.
Full and sub-models for Biomass data.
In our analysis, we examined two sub-models: the forward sub-model, which included the variables pH, Ca, Mg, and Cu along with the intercept, and the backward/best-subset sub-model, which included the variables SAL, K, Zn, Eh7, Mg, Cu, and NH4 along with the intercept. The two sub-models are designated Sub.1 and Sub.2, respectively.
We fit the full model with all the available variables as well as the selected sub-models. The full model yielded an estimated value of . Kibria and Lukman [1] presented several methods for estimating the biasing parameter , and we chose the one with the lowest MSE, which was also provided by [38]. The estimated value of was determined by writing the model in Equation (1) in its canonical form as follows:
where , , and are the eigenvalues of , and is an orthogonal matrix whose columns are the eigenvectors corresponding to the eigenvalues in . In this case, , and the estimated value of k is given by the following:
Using the previous approach, we found . The RE-type estimators were then calculated. To ensure full reproducibility and methodological transparency, all machine learning models, namely the neural network (NN), random forest (RF), and K-nearest neighbors (KNN), were trained and tuned using the train function of the caret package in R, with the hyperparameter tuning and training settings described below.
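The canonical-form computation described above can be illustrated as follows. Since the paper's exact formula for the estimated k is not reproduced here, this Python sketch substitutes the classical Hoerl–Kennard-type choice k = σ̂² / max_i α̂_i² purely as a stand-in, on simulated data; it is not the formula of [1,38]:

```python
import numpy as np

def canonical_k(X, y):
    """Canonical form: X'X = T Lambda T', with alpha = T'beta the coefficients
    in canonical coordinates. Returns a Hoerl-Kennard-type biasing parameter
    k = sigma2_hat / max(alpha_hat^2) as an illustrative stand-in."""
    n, p = X.shape
    lam, T = np.linalg.eigh(X.T @ X)      # eigenvalues Lambda, eigenvectors T
    beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_ols
    sigma2 = (resid @ resid) / (n - p)    # residual variance estimate
    alpha = T.T @ beta_ols                # canonical-coordinate coefficients
    return float(sigma2 / np.max(alpha ** 2))

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 4))
y = X @ np.array([1.0, 0.5, 0.0, 0.0]) + rng.normal(size=60)
k = canonical_k(X, y)
print(k)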
For the NN model, the nnet implementation was used. A grid search was applied over two main hyperparameters: the number of hidden neurons (size) and the weight-decay parameter (decay). The grid systematically explored hidden-layer sizes from 1 to 10 and decay values from 0.1 to 0.5 in increments of 0.1, balancing model complexity and generalization ability, while the weight-decay term controlled overfitting through regularization. The RF model was trained using the randomForest method; the number of predictors randomly selected at each split (mtry) was optimized through internal cross-validation, and the total number of trees (ntree) was fixed at 500 to maintain model stability and ensure comparability across repeated runs. For the KNN regression model, a grid search was conducted over the number of neighbors (k), covering odd integers from 1 to 19 to prevent ties in distance-based voting. Prior to model fitting, all the predictor variables were standardized (centered to zero mean and scaled to unit variance) to ensure fair distance computation. All the models were trained under identical resampling conditions using 5-fold CV, as specified through the trainControl function in caret, and model performance was assessed via the root mean squared error, which served as the common evaluation metric across all algorithms. This fully specified setup guaranteed both the reproducibility and the fair comparability of the RF, KNN, and NN models in the present study.
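As an illustration of this tuning protocol, the KNN part can be reproduced in Python with scikit-learn (an analogue of the caret setup; the original study used R, and the data below are simulated): predictors are standardized inside a pipeline, and odd neighbor counts from 1 to 19 are searched under 5-fold CV with RMSE as the metric.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Simulated stand-in data (the paper tunes on the biomass data).
rng = np.random.default_rng(2)
X = rng.normal(size=(90, 5))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.3, size=90)

# Standardize predictors, then grid-search odd k in 1..19 under 5-fold CV
# with RMSE as the evaluation metric, mirroring the caret setup above.
pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsRegressor())])
grid = {"knn__n_neighbors": list(range(1, 20, 2))}
cv = GridSearchCV(pipe, grid,
                  cv=KFold(n_splits=5, shuffle=True, random_state=0),
                  scoring="neg_root_mean_squared_error")
cv.fit(X, y)
print(cv.best_params_["knn__n_neighbors"])
```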
To evaluate the estimators' performance, we implemented a bootstrap method (see Ahmed et al. [42]) and calculated the mean squared prediction error as follows:
- Select, with replacement, a sample of size from the data set K times, say .
- Partition each sample from Step 1 into separate training and testing sets at a 70%/30% ratio. Then, fit the full and sub-models using the training data set and obtain the values of all the RE-type estimators.
- Evaluate the predicted response values using each estimator based on the testing data set as follows:
where is the matrix of the other variables in the model, and is any of the proposed RE estimators.
- Find the prediction error of each estimator for each sample as follows:where .
- Calculate the average prediction error of all the estimators as follows:
- Finally, calculate the relative efficiency of the prediction error with respect to as follows:
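The bootstrap procedure above can be sketched in Python as follows, using OLS and a simple ridge fit as two stand-in estimators (the paper compares the RE-type estimators, and the relative-efficiency convention PE(reference)/PE(candidate) is an assumption of this sketch; all data are simulated):

```python
import numpy as np

def boot_mspe(X, y, fit, K=200, train_frac=0.7, seed=0):
    """Bootstrap mean squared prediction error (Steps 1-5 in the text):
    resample (X, y) with replacement K times, split each resample 70/30
    into train/test, fit on the training part, and average the test-set
    squared prediction error."""
    rng = np.random.default_rng(seed)
    n = len(y)
    errs = []
    for _ in range(K):
        idx = rng.integers(0, n, size=n)      # sample with replacement
        Xb, yb = X[idx], y[idx]
        m = int(train_frac * n)               # 70% train, 30% test
        beta = fit(Xb[:m], yb[:m])
        pred = Xb[m:] @ beta
        errs.append(np.mean((yb[m:] - pred) ** 2))
    return float(np.mean(errs))

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

def ridge(k):
    # a simple ridge fit, standing in for the RE-type estimators
    return lambda X, y: np.linalg.solve(X.T @ X + k * np.eye(X.shape[1]), X.T @ y)

rng = np.random.default_rng(3)
z = rng.normal(size=120)
X = np.column_stack([z, z + 0.05 * rng.normal(size=120), rng.normal(size=120)])
y = X @ np.array([1.0, 1.0, 0.5]) + rng.normal(size=120)

pe_ols = boot_mspe(X, y, ols)
pe_ridge = boot_mspe(X, y, ridge(1.0))
rel_eff = pe_ols / pe_ridge   # >1 would favor the ridge fit under our convention
print(pe_ols, pe_ridge, rel_eff)
```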
The findings shown in Table 2 align with the outcomes of the simulations discussed in the preceding subsection.
Table 2.
The relative efficiency of the prediction error of the RE shrinkage estimators for the biomass data.
Next, we employed several penalization methods and three machine learning algorithms to analyze the biomass data, with the objective of determining the prediction error. To avoid any differences in variable units, we first scaled all the variables, including the response (BIO), and then applied the penalization methods or machine learning algorithms. The results of our investigation are summarized in Table 3.
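A minimal Python sketch of this scale-then-penalize protocol, using scikit-learn (the study used R; the data and tuning grids below are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.preprocessing import StandardScaler

# Simulated stand-in data; the paper applies this protocol to the biomass data.
rng = np.random.default_rng(5)
X = rng.normal(size=(80, 6))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.4, size=80)

# Standardize the predictors AND the response so that the penalty strength
# is not distorted by differences in variable units, then fit penalized models.
Xs = StandardScaler().fit_transform(X)
ys = (y - y.mean()) / y.std()

lasso = LassoCV(cv=5).fit(Xs, ys)
ridge = RidgeCV(alphas=np.logspace(-3, 3, 25)).fit(Xs, ys)
print(lasso.alpha_, ridge.alpha_)
```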
As shown in Table 3, every machine learning algorithm outperformed the traditional RE estimator. However, the performance of the ridge, LASSO, and SCAD penalization methods was inferior to that of the RE estimator. This discrepancy may be attributed, in part, to the presence of multicollinearity among the predictor variables.
Table 3.
Relative efficiency of the prediction error for penalized and machine learning algorithms (MLMs) for biomass data.
Overall, the RE-type shrinkage estimators exhibited the highest relative efficiency, outperforming both the penalized and ML models. The penalized estimators achieved moderate gains under correlated predictors, while the ML algorithms captured nonlinear interactions well but remained less efficient in structured linear settings. The superior performance of the RE-type shrinkage estimators can be attributed to their ability to incorporate valid linear restrictions and simultaneously apply shrinkage, which reduces the estimator variance under multicollinearity. When the underlying data structure is approximately linear, these estimators achieve a favorable bias–variance trade-off, whereas the penalized and ML models, despite their flexibility, may lose efficiency due to over-regularization or overfitting in such structured settings.
A careful examination of the numerical results shows that the relative efficiencies of the prediction error differed from one method to another. To better understand the range of values and potential outliers in our predictions, we examined the associated prediction errors using box plots based on 1000 replications.
The box plots in Figure 4 clearly illustrate the distribution of the prediction errors, and closer inspection reveals signs of possible outliers. The extended whiskers and isolated data points beyond the usual range indicate variability and occurrences that diverge from the general pattern. Furthermore, applying a shrinkage technique to the RE estimator had a pronounced effect in suppressing these outliers. Shrinkage not only improves the estimation process but also helps identify and reduce the impact of outliers, resulting in a more flexible and dependable representation of the underlying data structure.
Figure 4.
Box plots of the prediction error for all estimators and algorithms for biomass data.
8.3. Example 2: Air Pollution Data Set
The air pollution data initially utilized by McDonald and Schwing [43] were subsequently employed by [12,44] to demonstrate universal ridge shrinkage methods. The data can be accessed freely at Carnegie Mellon University's StatLib (https://lib.stat.cmu.edu/datasets/, accessed on 17 November 2025) and comprise 15 covariates pertaining to air pollution and socio-economic and meteorological observations, with the mortality rate as the dependent variable for 60 US cities in 1960. These variables are the average annual precipitation in inches Precip, annual average percentage of relative humidity at 1:00 pm Humid, average January temperature in degrees F JanTemp, average July temperature in degrees F JulyTemp, percentage of people aged 65 or older Over65, population per household House, median number of school years completed by persons over 22 years Educ, percentage of housing units that are sound and with all facilities Sound, population density per square mile in urbanized areas Density, percentage of non-white population NonWhite, percentage of people employed in white-collar occupations WhiteCol, percentage of families with an income less than USD 3000 Poor, relative hydrocarbon pollution potential HC, relative nitric oxide pollution potential NOX, relative sulphur dioxide pollution potential SO2, and total age-adjusted mortality rate per 100,000 MORT.
As in the first example, to examine multicollinearity among the explanatory variables, we created the correlation matrix plot given in Figure 5 below, which reveals significant relationships among some of the variables. This result is also supported by the VIF plot given in Figure 6.
Figure 5.
Correlation matrix plot for the air pollution data.
Figure 6.
Variance inflation factor plot for the air pollution data.
Using the ols_step_best_subset function from the olsrr package applied to the pollution data set, the best subset selection procedure identified the variables Precip, JanTemp, JulyTemp, Educ, NonWhite, and SO2 based on the Cp criterion, whereas the variables Precip, JanTemp, JulyTemp, House, Educ, NonWhite, and SO2 were selected according to the AIC.
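The best-subset search used here (via olsrr in R) can be mimicked with a small exhaustive search in Python; the Gaussian AIC scoring below is an illustrative assumption, not olsrr's exact implementation, and the data are simulated:

```python
import itertools
import numpy as np

def best_subset_aic(X, y):
    """Exhaustive best-subset search scored by Gaussian AIC,
    AIC = n*log(RSS/n) + 2*(q + 1) for a model with q predictors plus an
    intercept. A sketch of what ols_step_best_subset does; criterion
    details are assumptions."""
    n, p = X.shape
    best_score, best_set = np.inf, ()
    for q in range(1, p + 1):
        for S in itertools.combinations(range(p), q):
            Z = np.column_stack([np.ones(n), X[:, S]])
            beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
            rss = float(np.sum((y - Z @ beta) ** 2))
            aic = n * np.log(rss / n) + 2 * (q + 1)
            if aic < best_score:
                best_score, best_set = aic, S
    return best_set

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 5))
y = 2 * X[:, 0] - X[:, 3] + rng.normal(scale=0.5, size=100)
sel = best_subset_aic(X, y)
print(sel)  # the truly active predictors 0 and 3 should be included
```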
Following the methods employed in the first example and detailed in Section 8.2, the relative efficiencies of the prediction error for the proposed estimators are presented below in Table 4.
Table 4.
The relative efficiency of the prediction error for RE shrinkage estimators for the air pollution data.
Table 4 reiterates the conclusions derived in Example 1 regarding the preference for the proposed estimators. We then applied the same penalization approaches and machine learning algorithms used in Example 1; a summary of our results is displayed in Table 5.
Table 5.
Relative efficiency of the prediction error for penalized and machine learning algorithms (MLMs) for the air pollution data.
The results of the penalization methods and machine learning algorithms matched those obtained in Example 1. Furthermore, the box plots in Figure 7 again depict the distribution of the prediction errors for the second data set. The figure indicates the presence of probable outliers in the data set; hence, utilizing shrinkage estimation methods mitigates the influence of these outliers, enhances the estimation process, and yields a more adaptable and reliable representation of the underlying data structure.
Figure 7.
Box plots of the prediction error for all estimators and algorithms for air pollution data.
9. Conclusions
This article introduced pretest and shrinkage versions of the new ridge-type estimator for multiple linear regression models with multicollinearity among the predictor variables. Since the pretest and shrinkage techniques rely on prior information, we formulated a linear hypothesis to assess the significance of certain regressor variables in the regression model and incorporated the result into our estimation process. We then built on Kibria and Lukman's [1] idea, implementing the novel estimation approach to mitigate the extent of the multicollinearity.
For comparison, we implemented several penalization approaches along with three machine learning algorithms and assessed each method's prediction error relative to that of the usual RE estimator.
Our findings demonstrated that the utilization of the shrinkage estimation technique resulted in a significant enhancement in the mean squared error when applied to simulated data under various configurations of the correlation coefficient among the predictor variables. Furthermore, these findings were supported by two real data illustrations in which we employed the relative efficiency of the prediction errors as a benchmark for comparison.
The suggested estimation techniques can be expanded to encompass various different types of regression models, such as Poisson, logistic, and beta regression models, among others. In addition, the analysis can consider the case of high-dimensional data, evaluating the effectiveness of the suggested estimators in comparison to penalization methods and machine learning algorithms.
Supplementary Materials
The codes and supplementary files related to this paper can be downloaded at https://github.com/byuzbasi/RT-Shrinkage-Paper-Codes (accessed on 17 November 2025).
Author Contributions
Conceptualization, M.A.-M.; Methodology, M.A.-M. and B.Y.; Software, M.A.-M., B.Y., R.A. and A.M.; Validation, M.A.-M., B.Y., M.S.B. and R.A.; Formal analysis, M.A.-M. and R.A.; Investigation, M.A.-M., B.Y., M.S.B. and A.M.; Resources, M.A.-M.; Data curation, M.A.-M., B.Y. and R.A.; Writing—original draft, M.A.-M.; Writing—review & editing, B.Y., M.S.B., R.A. and A.M.; Supervision, M.A.-M.; Project administration, M.A.-M.; Funding acquisition, M.A.-M. and M.S.B. All authors have read and agreed to the published version of the manuscript.
Funding
This research was self-funded by the authors.
Data Availability Statement
The first data set is available for free within the R-package VisCollin, https://cran.r-project.org/web/packages/VisCollin/index.html (accessed on 17 November 2025), and the second data example is available at Carnegie Mellon University’s StatLib (https://lib.stat.cmu.edu/datasets/, accessed on 17 November 2025).
Acknowledgments
We would like to sincerely express our appreciation to the reviewers for their valuable comments, insightful suggestions, and the time they devoted to evaluating our manuscript. Their feedback significantly contributed to improving the quality of this work. We also express our gratitude to the editor for the careful handling of our submission and for offering valuable guidance throughout the review process.
Conflicts of Interest
We, the authors, declare that we have no affiliations with or involvement in any organization, and no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Abbreviations
The following abbreviations are used in this manuscript:
| OLS | Ordinary Least Squares |
| RE | Ridge Estimator |
| KL | Kibria–Lukman Estimator |
| MSE | Mean Squared Error |
| NSI | Non-Sample Information |
| AIC | Akaike Information Criterion |
| BIC | Bayesian Information Criterion |
| LASSO | Least Absolute Shrinkage and Selection Operator |
| SCAD | Smoothly Clipped Absolute Deviation |
| ALASSO | Adaptive LASSO |
| MLM | Machine Learning Method |
| VIF | Variance Inflation Factor |
Appendix A.
Appendix A.1.
Proof of Theorem A1.
1. Note that , where . Therefore,
2. Note that
Hence,
For the proofs of (3)–(7), we can rely on the following general proof.
Note that ; then, by Equation (6),
Using the result of Theorem 1 of Appendix B in Judge and Bock [45], we have
. Therefore,
Using the suitable function , the bias functions of , and can be easily obtained. □
Proof of Theorem A2.
1.
Let ; then,
where
Hence,
and
2.
Let ; then,
where
similarly,
Using a little algebra, we can add the terms , and to obtain the expression , so that
Now, for the proofs of (3)–(7), we can perform the general proof for any and then use the suitable function to obtain the quadratic risk expression, as we did in the previous theorem’s proof.
Let ; then,
where
Using the result of Theorems (1)–(3) of Appendix B in [45], we have
Similarly,
By using a little algebra and combining the four expected values, we can obtain the value of . Now,
Therefore, to obtain the quadratic risk expressions for (3)–(7), we employ the appropriate function , just as in the proof of the previous theorem. □
Appendix A.2. Selecting the Biasing Parameter k and Relative Efficiency Algorithms
| Algorithm A1 Selection of the biasing parameter k (KL heuristic). |
| Algorithm A2 Simulation and relative efficiencies with respect to KL (using Equation (18)). |
References
- Kibria, B.M.G.; Lukman, A.F. A New Ridge-Type Estimator for the Linear Regression Model: Simulations and Applications. Scientifica 2020, 2020, 9758378. [Google Scholar] [CrossRef]
- Hoerl, A.E.; Kennard, R.W. Ridge Regression: Applications to Nonorthogonal Problems. Technometrics 1970, 12, 69–82. [Google Scholar] [CrossRef]
- Liu, K. A New Class of Biased Estimate in Linear Regression. Commun. Stat. Theory Methods 1993, 22, 393–402. [Google Scholar]
- Bancroft, T.A. On Biases in Estimation Due to the Use of Preliminary Tests of Significance. Ann. Math. Statist. 1944, 15, 190–204. [Google Scholar] [CrossRef]
- Stein, C. Inadmissibility of the Usual Estimator for the Mean of a Multivariate Normal Distribution. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability; University of California Press: Berkeley, CA, USA, 1956; Volume 1, pp. 197–206. [Google Scholar]
- Stein, C. An Approach to the Recovery of Inter-Block Information in Balanced Incomplete Block Designs. In Research Papers in Statistics; Wiley: New York, NY, USA, 1966; pp. 351–366. [Google Scholar]
- Al-Momani, M.; Riaz, M.; Saleh, M.F. Pretest and Shrinkage Estimation of the Regression Parameter Vector of the Marginal Model with Multinomial Responses. Stat. Pap. 2022, 64, 2101–2117. [Google Scholar] [CrossRef]
- Al-Momani, M. Liu-Type Pretest and Shrinkage Estimation for the Conditional Autoregressive Model. PLoS ONE 2023, 18, e0283339. [Google Scholar] [CrossRef]
- Arashi, M.; Kibria, B.M.G.; Norouzirad, M.; Nadarajah, S. Improved Preliminary Test and Stein-Rule Liu Estimators for the Ill-Conditioned Elliptical Linear Regression Model. J. Multivar. Anal. 2014, 126, 53–74. [Google Scholar] [CrossRef]
- Arashi, M.; Norouzirad, M.; Ahmed, S.E.; Bahadir, Y. Rank-Based Liu Regression. Comput. Stat. 2018, 33, 53–74. [Google Scholar] [CrossRef]
- Yüzbaşı, B.; Ahmed, S.E.; Güngör, M. Improved Penalty Strategies in Linear Regression Models. REVSTAT Stat. J. 2017, 15, 251–276. [Google Scholar]
- Yüzbaşı, B.; Arashi, M.; Ahmed, S.E. Shrinkage Estimation Strategies in Generalised Ridge Regression Models: Low/High-Dimension Regime. Int. Stat. Rev. 2020, 88, 229–251. [Google Scholar] [CrossRef]
- Al-Momani, M.; Hussein, A.A.; Ahmed, S.E. Penalty and Related Estimation Strategies in the Spatial Error Model. Stat. Neerl. 2016, 71, 4–30. [Google Scholar] [CrossRef]
- Al-Momani, M.; Arashi, M. Ridge-Type Pretest and Shrinkage Estimation Strategies in Spatial Error Models with an Application to a Real Data Example. Mathematics 2024, 12, 390. [Google Scholar] [CrossRef]
- Saleh, A.K.M.E. Theory of Preliminary Test and Stein-Type Estimation with Applications; Wiley: New York, NY, USA, 2006. [Google Scholar] [CrossRef]
- Ehsanes, S.A.K.M.; Arashi, M.; Saleh, R.A.; Norouzirad, M. Theory of Ridge Regression Estimation with Applications; Wiley: Hoboken, NJ, USA, 2019. [Google Scholar]
- Nkurunziza, S.; Al-Momani, M.; Lin, E.Y. Shrinkage and Lasso Strategies in High-Dimensional Heteroscedastic Models. Commun. Stat. Theory Methods 2016, 45, 4454–4470. [Google Scholar] [CrossRef]
- Ahmed, S.E. Penalty, Shrinkage and Pretest Strategies: Variable Selection and Estimation; Springer: Cham, Switzerland, 2014. [Google Scholar]
- Ahmed, S.E. Shrinkage Preliminary Test Estimation in Multivariate Normal Distributions. J. Stat. Comput. Simul. 1992, 43, 177–195. [Google Scholar] [CrossRef]
- Tibshirani, R. Regression Shrinkage and Selection via the Lasso. J. R. Stat. Soc. Ser. B (Methodol.) 1996, 58, 267–288. [Google Scholar] [CrossRef]
- Zou, H.; Hastie, T. Regularization and Variable Selection via the Elastic Net. J. R. Stat. Soc. Ser. B (Methodol.) 2005, 67, 301–320. [Google Scholar] [CrossRef]
- Fan, J.; Li, R. Variable Selection via Nonconcave Penalized Likelihood and Its Oracle Properties. J. Am. Stat. Assoc. 2001, 96, 1348–1360. [Google Scholar] [CrossRef]
- Zou, H. The Adaptive Lasso and Its Oracle Properties. J. Am. Stat. Assoc. 2006, 101, 1418–1429. [Google Scholar] [CrossRef]
- Ho, T.K. Random Decision Forests. In Proceedings of the 3rd International Conference on Document Analysis and Recognition, Montreal, QC, Canada, 14–16 August 1995; IEEE: Montreal, QC, Canada, 1995; pp. 278–282. [Google Scholar] [CrossRef]
- Ho, T.K. The Random Subspace Method for Constructing Decision Forests. IEEE Trans. Pattern Anal. Mach. Intell. 1998, 20, 832–844. [Google Scholar] [CrossRef]
- Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
- Liaw, A.; Wiener, M. Classification and Regression by RandomForest. R News 2002, 2, 18–22. Available online: https://CRAN.R-project.org/doc/Rnews/ (accessed on 4 November 2025).
- Fix, E.; Hodges, J.L. Discriminatory Analysis, Nonparametric Discrimination: Consistency Properties; Technical Report 4; USAF School of Aviation Medicine: Randolph Field, TX, USA, 1951. [Google Scholar]
- Cover, T.; Hart, P. Nearest Neighbor Pattern Classification. IEEE Trans. Inf. Theory 1967, 13, 21–27. [Google Scholar] [CrossRef]
- Imandoust, S.B.; Bolandraftar, M. Application of K-Nearest Neighbor (KNN) Approach for Predicting Economic Events: Theoretical Background. Int. J. Eng. Res. Appl. 2013, 3, 605–610. [Google Scholar]
- Lubis, A.R.; Lubis, M.; Al-Khowarizmi, A. Optimization of Distance Formula in K-Nearest Neighbor Method. Bull. Electr. Eng. Inform. 2020, 9, 326–338. [Google Scholar] [CrossRef]
- Cunningham, P.; Delany, S.J. K-Nearest Neighbour Classifiers—A Tutorial. ACM Comput. Surv. (CSUR) 2021, 54, 1–25. [Google Scholar] [CrossRef]
- McCulloch, W.S.; Pitts, W. A Logical Calculus of the Ideas Immanent in Nervous Activity. Bull. Math. Biophys. 1943, 5, 115–133. [Google Scholar] [CrossRef]
- Masri, S.F.; Nakamura, M.; Chassiakos, A.G.; Caughey, T.K. Neural Network Approach to Detection of Changes in Structural Parameters. J. Eng. Mech. 1996, 122, 350–360. [Google Scholar] [CrossRef]
- Angelini, E.; di Tollo, G.; Roli, A. A Neural Network Approach for Credit Risk Evaluation. Q. Rev. Econ. Financ. 2008, 48, 733–755. [Google Scholar] [CrossRef]
- Du, K.L. Clustering: A Neural Network Approach. Neural Netw. 2010, 23, 89–107. [Google Scholar] [CrossRef]
- Adeli, H. Neural Networks in Civil Engineering: 1989–2000. Comput.-Aided Civ. Infrastruct. Eng. 2001, 16, 126–142. [Google Scholar] [CrossRef]
- Özkale, M.R.; Kaçiranlar, S. The Restricted and Unrestricted Two-Parameter Estimators. Commun. Stat. Theory Methods 2007, 36, 2707–2725. [Google Scholar] [CrossRef]
- Rawlings, J.O.; Pantula, S.G.; Dickey, D.A. Applied Regression Analysis: A Research Tool; Springer: New York, NY, USA, 1998. [Google Scholar]
- Friendly, M. VisCollin: Visualizing Collinearity Diagnostics, R Package Version 0.1.2. 2023. Available online: https://CRAN.R-project.org/package=VisCollin (accessed on 4 November 2025).
- Linthurst, R. Aeration, Nitrogen, pH and Salinity as Factors Affecting Spartina Alterniflora Growth and Dieback. Ph.D. Thesis, North Carolina State University, Raleigh, NC, USA, 1979. [Google Scholar]
- Ahmed, S.E.; Ahmed, F.; Yüzbaşı, B. Post-Shrinkage Strategies in Statistical and Machine Learning for High Dimensional Data; CRC Press: Boca Raton, FL, USA, 2023. [Google Scholar]
- McDonald, G.C.; Schwing, R.C. Instabilities of regression estimates relating air pollution to mortality. Technometrics 1973, 15, 463–481. [Google Scholar] [CrossRef]
- Yüzbaşı, B.; Asar, Y.; Ahmed, S.E. Liu-type shrinkage estimations in linear models. Statistics 2022, 56, 396–420. [Google Scholar] [CrossRef]
- Judge, G.G.; Bock, M.E. The Statistical Implications of Pre-Test and Stein-Rule Estimators in Econometrics; North-Holland: Amsterdam, The Netherlands, 1978. [Google Scholar]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).