1. Introduction
Regression models play a crucial role in many scientific fields, including medicine, economics, psychology, environmental science, and engineering. They are used for various purposes, and many researchers distinguish between models for description, which aims to capture the data structure parsimoniously; prediction, which aims to predict the outcome for new observations; and explanation, which tests a causal hypothesis by assuming that a specific set of covariates causes an underlying effect on the outcome variable [1,2,3,4,5]. A regression model that closely approximates the true data-generating model can be used for both descriptive and predictive purposes [1,4].
In this article, we focus only on prediction models, which can be complex as they may include variables with both strong and weak effects, as well as some noise variables, provided the prediction accuracy is not compromised [4,6]. However, overly complex models may overfit the training data, meaning that the model fits idiosyncrasies in the data rather than generalizable patterns [1,3]. Consequently, they may generate extreme predictions when applied to new data [3,7]. Additionally, if the costs of collecting variables are high, such models may not be practically useful and may be quickly forgotten [8]. On the other hand, models with too few covariates may underfit the data, resulting in poor generalization [9,10]. To strike a balance between overfitting and underfitting, a good variable selection approach that produces a simpler and more accurate model is required [9,11,12]. Simpler prediction models are easier to interpret and provide insights into which variables may be important predictors of the outcome [9].
In practice, many variables are often measured, and a more parsimonious model may be preferred. Traditional methods for selecting variables have been in use for over five decades. Although many alternative approaches have been proposed in recent years, meaningful comparisons and a clear understanding of their properties remain limited [4,13]. Sauerbrei et al. [4] reviewed the state of the art in variable and functional form selection in multivariable analysis and categorized variable selection methods into five classes: traditional (classical) methods, change-in-estimate procedures, penalized likelihood approaches, boosting, and resampling-based methods. They concluded that further comparative research is needed to define the state of the art and provide evidence-supported guidance. Among the seven key issues they identified for further investigation, issue 1 is the “Investigation and comparison of the properties of variable selection strategies”. To address this gap, Kipruto and Sauerbrei [12] designed a simulation study in the context of linear regression to compare selected traditional and penalized variable selection procedures, with a particular focus on the role of shrinkage in model selection and prediction—a key component of penalized likelihood approaches. The study was designed following the principles of neutral comparison studies, which aim to provide objective insights into the properties of methods without favoring any specific approach [14]. The simulation design borrowed information from published simulation studies with related investigations to inform its structure and to gain insight into the weaknesses and strengths of alternative designs, as summarized in their Supplementary Materials (Figures S1–S8). In addition, they published their study protocol prior to conducting the analysis to mitigate potential issues of publication bias. To ensure the reproducibility of our results, we will make the programs available.
We focus on evaluating prediction models for continuous outcomes in the context of low-dimensional data, assuming linear effects for all signal variables and no interactions. We consider three classical variable selection methods (best subset selection (BSS), backward elimination (BE), and forward selection (FS)) and four penalized regression methods: nonnegative garrote (NNG) [9], lasso [11], adaptive lasso (ALASSO) [15], and relaxed lasso (RLASSO) [16]. The prediction accuracy and model complexity of NNG and ALASSO depend on the choice of initial estimates [15,17,18,19]. Therefore, our first objective (O1) is to evaluate the effect of three initial estimates on prediction accuracy and model complexity: those from ordinary least squares (OLS), ridge [20], and lasso. The tuning parameters of ridge and lasso are selected via a cross-validation scheme aimed at optimizing prediction performance. Our second objective (O2) is to compare the prediction accuracy and model complexity of NNG and ALASSO against the models that generated the initial estimates.
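To make the two-stage structure concrete, the following minimal sketch (not the code used in this study; the function name, tuning grids, and the weight exponent gamma = 1 are illustrative assumptions) implements ALASSO via the standard rescaling trick, with OLS, ridge, or lasso initial estimates supplied in the first stage:

```python
# Minimal sketch of two-stage ALASSO with scikit-learn; assumes X is
# standardized and y is mean-centered, as described later in this section.
import numpy as np
from sklearn.linear_model import LinearRegression, RidgeCV, LassoCV

def adaptive_lasso(X, y, initial="ridge", cv=10):
    # Stage 1: initial estimates from OLS, CV-tuned ridge, or CV-tuned lasso.
    if initial == "ols":
        beta0 = LinearRegression().fit(X, y).coef_
    elif initial == "ridge":
        beta0 = RidgeCV(alphas=np.logspace(-3, 3, 50)).fit(X, y).coef_
    else:
        beta0 = LassoCV(cv=cv).fit(X, y).coef_
    w = np.abs(beta0)            # adaptive weights (gamma = 1)
    keep = w > 0                 # a zero initial estimate drops the covariate
    # Stage 2: ordinary lasso on covariates rescaled by the weights.
    fit = LassoCV(cv=cv).fit(X[:, keep] * w[keep], y)
    beta = np.zeros(X.shape[1])
    beta[keep] = w[keep] * fit.coef_   # back-transform to the original scale
    return beta
```

Rescaling each covariate by the absolute initial estimate turns the weighted penalty into an ordinary lasso problem, so any lasso solver can be reused in the second stage; NNG differs in that stage, where nonnegative shrinkage factors are estimated instead.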
Despite some criticism [3], classical variable selection methods are still popular and remain practically useful. They either retain or drop covariates from the model based on certain criteria, rather than gradually shrinking their coefficients to zero [11]. This discrete process can be problematic, as even slight changes in the data may lead to different models, resulting in unstable predictions [10,11,21,22,23].
Penalized regression methods are alternative techniques for variable selection that continuously shrink regression coefficients towards zero, while setting some coefficients exactly to zero [3,9,11]. They differ in their penalty functions. According to Fan and Li [23], an ideal penalty function should (i) produce nearly unbiased estimates for large coefficients to avoid estimation bias, (ii) exhibit continuous shrinkage to avoid instability in model predictions, and (iii) perform variable selection by forcing some regression coefficients to be exactly zero. The penalty functions of NNG and ALASSO possess these properties, as they shrink small and large coefficients differently, while the lasso penalty may produce biased estimates for large coefficients, which may increase prediction errors, especially in situations where minimal shrinkage is required [15]. To reduce over-shrinkage of large coefficients and achieve good prediction accuracy, the lasso, when tuned via cross-validation for optimal prediction, often selects many variables, leading to a relatively high rate of false positives [24,25].
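For reference, the criteria behind these penalties can be written in their standard forms from the cited literature, where $\tilde{\beta}$ denotes the initial estimates and $\gamma > 0$ the weight exponent:

$$\hat{\beta}^{\text{lasso}} = \arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda \sum_{j=1}^{p} |\beta_j|, \qquad \hat{\beta}^{\text{ALASSO}} = \arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda \sum_{j=1}^{p} \frac{|\beta_j|}{|\tilde{\beta}_j|^{\gamma}},$$

while the NNG sets $\hat{\beta}_j = \hat{c}_j \tilde{\beta}_j$ with

$$\hat{c} = \arg\min_{c \geq 0} \Big\| y - \sum_{j=1}^{p} c_j \tilde{\beta}_j x_j \Big\|_2^2 + \lambda \sum_{j=1}^{p} c_j.$$

Because the ALASSO weight $1/|\tilde{\beta}_j|^{\gamma}$ is small for covariates with large initial estimates, large coefficients are shrunken less, which is how the properties above are achieved.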
The continuous process of shrinking regression coefficients by penalized methods can lead to more stable predictions than classical methods, due to the bias–variance trade-off, especially in situations with small sample sizes or a low signal-to-noise ratio (SNR) [11,21,23]. For this reason, penalized methods have been recommended for prediction [26,27]. However, researchers have argued that they should not be viewed as a solution to small sample sizes or low SNR. Instead, they should be used when a sufficiently large training dataset is available and the SNR is moderate. These conditions help to reduce uncertainty in the estimation of tuning parameters, which control the quality of regression estimates and, consequently, the prediction accuracy of the models [7,28].
Model selection criteria such as cross-validation (CV), the Akaike information criterion (AIC) [29], and the Bayesian information criterion (BIC) [30] have been proposed for selecting tuning parameters in penalized methods [31,32], as well as for choosing the best models in classical methods [33,34,35]. In this study, we employed these three popular criteria, which target different types of models. CV and AIC aim to select the best model for prediction on new unseen data, while BIC aims to identify the true data-generating model or the model that is closest to it [35]. Some studies have shown that BIC may outperform AIC in prediction when there are a few large effects and all other covariates are noise. In such situations, BIC applies a heavier penalty for model complexity, favoring simpler models that retain only covariates with large effects, while AIC may retain many noise variables [36]. Conversely, AIC may perform better than BIC when there are a few covariates with large effects, followed by many covariates with smaller effects that gradually decrease. In such settings, AIC’s penalty for model complexity is less severe than BIC’s, allowing it to more effectively capture these smaller effects and achieve better predictive performance [35,36].
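For a Gaussian linear model with the error variance profiled out, the two information criteria reduce, up to additive constants, to the forms in the sketch below (an illustration under stated assumptions: `df` is taken as the number of estimated coefficients, which for the lasso is commonly the number of nonzero coefficients):

```python
# Minimal sketch of AIC/BIC for a fitted linear model; smaller is better.
import numpy as np

def aic_bic(y, y_hat, df):
    """df: model degrees of freedom (number of estimated coefficients)."""
    n = len(y)
    rss = np.sum((y - y_hat) ** 2)              # residual sum of squares
    aic = n * np.log(rss / n) + 2 * df          # AIC: penalty of 2 per parameter
    bic = n * np.log(rss / n) + np.log(n) * df  # BIC: penalty grows with ln(n)
    return aic, bic
```

For n ≥ 8, ln(n) > 2, so BIC penalizes complexity more heavily than AIC, which explains its preference for smaller models.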
Since CV is a widely used approach for selecting tuning parameters, our third objective (O3) is to compare the prediction accuracy and model complexity of classical and penalized regression methods tuned using this approach. Our fourth objective (O4) is to assess how the three model selection criteria (CV, AIC, and BIC) influence both the prediction accuracy and model complexity of variable selection methods. The performance of variable selection methods is known to be sensitive to the proportion of noise variables [9]. Therefore, our fifth objective (O5) is to evaluate the robustness of the approaches in settings with a high proportion of noise variables.
Throughout this paper, we standardized each covariate in the training data to have a mean of zero and unit variance. Additionally, we centered the response variable by its mean to omit the intercept from the model. All variables in the new dataset were standardized using the statistics derived from the training data.
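A minimal sketch of this preprocessing (synthetic data; the key point is that the test data are standardized with the training statistics):

```python
import numpy as np

rng = np.random.default_rng(1)
X_train, X_test = rng.normal(size=(100, 15)), rng.normal(size=(50, 15))
y_train = rng.normal(size=100)

mu, sd = X_train.mean(axis=0), X_train.std(axis=0)  # training statistics only
X_train_std = (X_train - mu) / sd                   # mean zero, unit variance
X_test_std = (X_test - mu) / sd                     # same mu and sd reused
y_train_c = y_train - y_train.mean()                # allows omitting the intercept
```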
The rest of the paper is organized as follows: Section 2 provides a summary of the simulation design, following the ADEMP structure (Table 1), which entails defining aims (A), data-generating mechanisms (D), estimands/targets of analysis (E), methods (M), and performance (P) measures [37]. In addition, it describes the methods used, including their tuning parameters and the initial estimates for the two-stage approaches (NNG and ALASSO). Section 3 presents the results from the simulation studies, including a detailed summary in a structured format (Table 2), organized according to the five objectives outlined in the “Aims” section of the ADEMP. Section 4 presents the results of a real data example. Section 5 contains the discussion, Section 6 provides the conclusions, and Section 7 outlines some directions for future research.
3. Results
This section presents the findings from our simulation study, structured according to the study’s objectives (see Table 1). For concreteness, we focused on specific scenarios where the true regression coefficients follow beta-type A, with low (C2) and high (C3) correlation settings across different sample sizes (100 and 400) and SNRs. The only exception is Section 3.5, which shows results for low-correlation type C1, used to assess the impact of many noise variables on the prediction accuracy of the models. Additional results for other beta-types and correlation types are available in the Supplementary Materials (see Figures S1–S8), and the findings are consistent with those reported here. A high-level summary of all simulation results is provided in Section 3.6.
3.1. Effects of Initial Estimates on the Prediction Accuracy of NNG and ALASSO
We investigated whether the choice of initial estimates (OLS, ridge, and lasso) influenced the prediction accuracy of the NNG and ALASSO models. To explore this, we used a Bland–Altman plot [56] to evaluate the agreement between the predictions, measured using RR. We observed that the effects of initial estimates on predictions for NNG and ALASSO were very similar; therefore, only the results for ALASSO are reported.
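A minimal sketch of such a plot (matplotlib; `err_a` and `err_b` stand for per-repetition prediction errors of two models and are illustrative names):

```python
import numpy as np
import matplotlib.pyplot as plt

def bland_altman(err_a, err_b, label_a="model A", label_b="model B"):
    mean_ = (err_a + err_b) / 2            # x-axis: average of the two errors
    diff = err_a - err_b                   # y-axis: their difference
    md, sd = diff.mean(), diff.std()
    plt.scatter(mean_, diff, s=10)
    plt.axhline(md)                        # mean difference (solid line)
    plt.axhline(md + 1.96 * sd, ls="--")   # upper limit of agreement
    plt.axhline(md - 1.96 * sd, ls="--")   # lower limit of agreement
    plt.xlabel(f"Mean of {label_a} and {label_b}")
    plt.ylabel(f"Difference ({label_a} - {label_b})")
    plt.show()
```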
Figure 1 compares the prediction errors of three ALASSO models: ALASSO (O, CV), ALASSO (R, CV), and ALASSO (L, CV). Specifically, it evaluates the agreement between the prediction errors of (i) ALASSO (O, CV) and ALASSO (R, CV), (ii) ALASSO (O, CV) and ALASSO (L, CV), and (iii) ALASSO (R, CV) and ALASSO (L, CV).
The first and second rows compare the three models in scenarios with limited information: a small sample size (100), low SNR (0.25) (R² of about 20%), and high correlation among covariates. These conditions provide little information for accurately estimating both the tuning parameters and the initial estimates, leading all initial estimates to produce models with different predictions. This is indicated by a nonzero mean difference (horizontal solid line) and some differences falling outside the wider limits of agreement (LOA) (dashed lines) in the Bland–Altman plot. On average, ridge initial estimates produced models with the best predictions, followed by lasso estimates, while OLS estimates performed the worst.
The third and fourth rows compare the three models in scenarios with moderate information: a small sample size (100), moderate SNR (1) (R² of about 50%), and high correlation among covariates. In this setting, predictions across the three initial estimates differed, but not as substantially as in the limited-information scenarios.
The fifth and sixth rows compare the three models in scenarios with sufficient information: a large sample size (400), high SNR (2.5) (R² of about 71%), and low correlation among covariates. In this setting, the data contained adequate information to estimate the parameters accurately, resulting in nearly identical predictions across the three initial estimates. Differences were close to zero, and the LOAs were narrow, suggesting that the three initial estimates can be used interchangeably. Lasso initial estimates may be preferred, as they yielded simpler models (see Figure A1 in Appendix A).
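The R² values quoted above follow from the usual population identity, assuming the SNR is defined as the ratio of signal variance to error variance:

$$R^2 = \frac{\mathrm{SNR}}{1 + \mathrm{SNR}},$$

so SNR values of 0.25, 1, and 2.5 correspond to R² of 20%, 50%, and about 71%, respectively.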
Overall, ridge initial estimates produced models with superior predictions. Therefore, we will compare the prediction accuracy of NNG and ALASSO using ridge initial estimates to those of other approaches (see Section 3.3).
3.2. Comparison of Prediction Accuracy of NNG and ALASSO with the Models That Generated the Initial Estimates
We investigated whether NNG and ALASSO, using optimal tuning parameters from CV, could yield better prediction accuracy than the models that generated the initial estimates: OLS, Ridge (CV), and Lasso (CV). Specifically, we compared three sets of models: (i) NNG (L, CV), ALASSO (L, CV), and Lasso (CV); (ii) NNG (R, CV), ALASSO (R, CV), and Ridge (CV); and (iii) NNG (O, CV), ALASSO (O, CV), and OLS. The first two sets showed similar prediction patterns (Figure 2 and Figure A2), while NNG (O, CV) and ALASSO (O, CV) consistently outperformed OLS models in the third set (see Figure A3 in Appendix A). Therefore, we focus on the first set.
Figure 2 compares the predictive performance of NNG (L, CV), ALASSO (L, CV), and Lasso (CV) across various sample sizes and SNRs, in low- (upper panel) and high-correlation (lower panel) settings. NNG and ALASSO outperformed lasso only in scenarios requiring minimal shrinkage, such as those with large sample sizes, low correlation, and high SNR (top-right panel). However, in scenarios where severe shrinkage is often required, such as small sample sizes with low SNR (top-left panel), or high-correlation settings (lower panel), lasso outperformed both NNG and ALASSO.
Overall, NNG and ALASSO produced simpler models than those used to generate their initial estimates (see Figure A4 in Appendix A). While they consistently outperformed OLS models in terms of prediction, they only outperformed lasso and ridge models in scenarios that required minimal shrinkage. When model simplicity is the primary goal, NNG and ALASSO may be preferred.
3.3. Comparison of Prediction Accuracy for All Variable Selection Approaches
We compared the prediction accuracy and the average number of selected variables across various variable selection methods, all tuned using optimal parameters from 10-fold CV. We investigated the influence of sample size, SNR, and correlation on the model performance. We began by comparing the prediction accuracy of classical methods (BSS, BE, and FS), followed by a comparison of classical and penalized methods. To clearly observe the patterns, we plotted the average of each metric as a function of SNR.
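As one concrete illustration, forward selection tuned by 10-fold CV can be sketched with scikit-learn as below (an approximation, not the study's implementation: the greedy stopping rule here differs from selecting the best model along the full FS/BE sequence, and the data are synthetic):

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 15))
y = X[:, :3] @ np.array([1.0, 0.5, 0.25]) + rng.normal(size=100)

fs = SequentialFeatureSelector(
    LinearRegression(), direction="forward",    # "backward" approximates BE
    n_features_to_select="auto", tol=1e-4,      # grow while CV MSE improves
    cv=10, scoring="neg_mean_squared_error",
).fit(X, y)
print(fs.get_support())                         # mask of selected covariates
```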
3.3.1. Comparison of Classical Variable Selection Methods
Figure A5 compares the predictive performance of BSS, BE, and FS across different sample sizes and SNRs in both low-correlation (upper panel) and high-correlation (lower panel) settings. In low-correlation settings, all three methods produced similar prediction accuracy. In high-correlation settings, BSS and BE produced nearly identical results, with only minor differences at large sample sizes and high SNR (bottom-right panel), while FS showed slightly better performance than both BSS and BE. Therefore, we compare the results of BE and FS with those of the penalized methods.
3.3.2. Comparison of Prediction Accuracy of Classical and Penalized Regression Methods
Figure 3 compares the prediction accuracy of BE, FS, NNG, ALASSO, RLASSO, and lasso in low-correlation (upper panel) and high-correlation (lower panel) settings.
Low-Correlation Settings
BE and FS performed similarly and showed inferior prediction accuracy compared to penalized methods in small sample sizes (top-left panel) and in large sample sizes with low SNR (top-right panel, lower end). However, in large sample sizes with high SNR (top-right panel, upper end), BE and FS outperformed lasso and were comparable to the other penalized methods.
Among the penalized methods, lasso performed best in small sample sizes with low SNR (top-left panel), but was outperformed by other methods in large sample sizes with high SNR (top-right panel). NNG, ALASSO, and RLASSO performed similarly, although RLASSO was slightly less accurate in scenarios with large sample sizes and low SNR (top-right panel, lower end).
High-Correlation Settings
BE and FS consistently showed inferior predictive performance to the penalized methods, regardless of sample size or SNR (lower panel). In scenarios with small sample sizes or low SNR, FS performed slightly better than BE—likely due to the fact that BE begins with a full model containing many parameters, which can increase variability in coefficient estimates and lead to suboptimal predictions under limited information. Lasso outperformed other approaches across different sample sizes and SNR levels (bottom-left panel). NNG, ALASSO, and RLASSO showed comparable performance across settings (lower panel).
Overall, lasso performed best in scenarios with small sample sizes, low SNR, and high correlation, while BE and FS performed the worst under these settings. Conversely, in scenarios with large sample sizes, low correlation, and high SNR, lasso performed poorly, whereas the other approaches, including BE and FS, produced better results. In terms of model complexity, lasso selected models with a larger number of variables, while BE and FS selected simpler models. The other penalized methods selected models of intermediate complexity, with only minor differences among them (Figure A6).
3.4. Influence of Model Selection Criteria on Predictions and Complexity of Selected Models
This section compares the prediction accuracy and the average number of variables selected by various variable selection approaches, each tuned using CV, AIC, and BIC criteria.
Figure 4 shows the prediction performance for small (upper panel) and large (lower panel) sample sizes under low-correlation settings across various values of SNR.
In small sample sizes, models tuned with BIC generally showed worse prediction accuracy than those tuned with AIC and CV across methods. An exception was observed with the BE and FS methods, where BIC and CV performed similarly. Overall, AIC and CV showed comparable prediction performance across most methods, with CV slightly outperforming AIC in RLASSO at low SNR, while AIC had a slight edge in the BE and FS methods.
In large sample sizes, interesting patterns emerged. When the SNR was low-to-moderate (SNR < 1), models tuned with BIC showed worse prediction accuracy than those tuned with AIC and CV. However, at high SNR, BIC provided models with better predictions, except for lasso, where BIC consistently resulted in inferior accuracy. AIC and CV produced very similar predictions across all penalized approaches, except at high SNR, where CV slightly outperformed AIC in RLASSO. In contrast, CV and AIC produced different results in the classical methods, where AIC performed better at low SNR, while CV was more effective at high SNR.
Regarding variable selection, BIC selected simpler models than AIC and CV, which selected a similar number of variables (see Figure A7 in Appendix A).
Figure A8 (Appendix A) shows the prediction performance under high-correlation settings. Again, AIC and CV showed comparable prediction accuracy in penalized methods, except in RLASSO, where CV outperformed AIC. In classical methods, AIC and CV produced similar predictions in small sample sizes, but AIC had a slight advantage in large sample sizes. Models tuned with BIC generally showed worse prediction accuracy compared to AIC and CV, particularly in large sample sizes (lower panel).
3.5. Impact of a High Proportion of Noise Variables on Prediction Performance of Approaches
The proportion of noise variables can influence the effectiveness of variable selection methods. Therefore, we evaluated the robustness of approaches under scenarios with a high proportion of noise variables (23 noise and 7 signal variables), focusing on their predictive accuracy and the complexity of the selected models in both small and large sample sizes within low-correlation settings. All methods were tuned using cross-validation.
Figure 5 shows the prediction accuracy (upper panel) and the average number of variables selected by each approach (lower panel). In scenarios with small sample sizes (top-left panel), classical methods (BE and FS) consistently exhibited inferior prediction accuracy across all SNR levels, with FS marginally better. In contrast, the lasso performed best overall, with its advantage most pronounced at low SNR levels. The predictive accuracy of NNG, RLASSO, and ALASSO was comparable, with ALASSO slightly less accurate. In large sample settings (top-right panel), the predictive performance varied with SNR. At low-to-moderate SNR levels (SNR < 1), classical methods were markedly outperformed by penalized methods, which exhibited similar accuracy. At high SNR levels, lasso performed worse, while classical methods achieved the best accuracy. Overall, classical methods selected simpler models, followed by RLASSO, ALASSO, and NNG. Conversely, lasso consistently selected a larger number of variables (lower panel).
3.6. A Summary of the Results for the Entire Simulation
Table 2 provides a structured overview of the key findings from the simulation study, organized according to the study’s objectives. A column titled “Figures” lists the corresponding figures that illustrate the results for each objective. Scenarios with small sample sizes, low SNR, and high correlation are described as having “limited information”, whereas those with large sample sizes, high SNR, and low correlation are considered to have “sufficient information”. These terms are used throughout the table and in the discussion section (Section 5).
Table 2. Summary of results according to the five objectives of the simulation study. Each entry lists the study objective, the key findings, and the corresponding figures.

Objective 1: Effects of initial estimates (OLS, ridge, and lasso) on the prediction accuracy of NNG and ALASSO.
Key findings: In scenarios with limited information, the three initial estimates yielded different predictions, with ridge estimates producing models with superior predictive performance. In scenarios with sufficient information, all three initial estimates produced models with similar predictions, indicating that they can be used interchangeably. In such cases, lasso estimates may be preferred, as they tend to yield simpler models, which are easier to interpret.
Figures: Prediction: Figure 1; model complexity: Figure A1.

Objective 2: Comparison of prediction accuracy of NNG and ALASSO over the models that generated the initial estimates (OLS, ridge, and lasso initial models).
Key findings: NNG and ALASSO consistently outperformed OLS models. They also outperformed ridge and lasso models in scenarios with sufficient information, where minimal shrinkage was required. However, with limited information, they performed worse than the ridge and lasso initial models. The main advantage of NNG and ALASSO is their ability to select simpler models than the initial models, which is beneficial for descriptive modeling, where understanding the relationships between the outcome and covariates is more important than prediction.
Figures: Prediction: Figure 2, Figure A2, and Figure A3; model complexity: Figure A4.

Objective 3: Comparison of prediction accuracy of classical and penalized regression methods.
Key findings: Classical methods (BSS, BE, and FS) showed similar prediction performance in low-correlation settings. In high-correlation settings, BSS and BE remained comparable, while FS had a slight advantage over both BE and BSS. All classical methods were inferior to penalized methods in scenarios with limited information, where shrinkage is often beneficial. Conversely, under sufficient-information scenarios, their predictions were better than those of lasso and comparable to NNG, RLASSO, and ALASSO. In these settings, shrinkage is less beneficial, as parameter estimates exhibit less variability. Among penalized methods, lasso consistently produced the best accuracy in scenarios with limited information but performed worse in sufficient-information scenarios, where NNG, ALASSO, and RLASSO produced superior accuracy. Overall, there was no clear winner among NNG, ALASSO, and RLASSO. The lasso consistently selected a larger number of variables compared to other approaches, while classical methods selected simpler models, especially in scenarios with limited information.
Figures: Prediction: Figure 3 and Figure A5; model complexity: Figure A6.

Objective 4: Influence of model selection criteria (CV, AIC, and BIC) on predictions and complexity of selected models.
Key findings: In scenarios with small sample sizes and high correlation, BIC-tuned models generally exhibited inferior prediction accuracy compared to AIC and CV. However, in scenarios with sufficient information, BIC produced better predictions, except for lasso, where it consistently resulted in inferior accuracy. While BIC performed better in scenarios with sufficient information, it performed worse when models contained several small effects (Figure A9). AIC and CV showed comparable predictions across most methods, with CV slightly outperforming AIC in RLASSO, while AIC had a slight edge in classical methods. BIC-tuned models selected fewer variables on average than AIC and CV across small and large sample sizes, whereas AIC and CV selected a similar number of variables.
Figures: Prediction: Figure 4, Figure A8, and Figure A9; model complexity: Figure A7.

Objective 5: Impact of a high proportion of noise variables on prediction performance and model complexity of approaches.
Key findings: In small sample sizes, classical methods consistently demonstrated poor prediction accuracy because several relevant variables were not selected, whereas lasso performed the best, particularly at low SNR. NNG, RLASSO, and ALASSO showed comparable performance, with ALASSO being slightly inferior at low SNR. In large sample sizes, classical methods yielded inferior predictions at low-to-moderate SNR (SNR < 1) but performed best at high SNR compared to penalized methods. This pattern was consistent with scenarios involving a moderate proportion of noise variables. At low-to-moderate SNR, penalized methods produced comparable predictions, with lasso having a slight edge. However, at high SNR, lasso performed poorly. Regarding variable selection, classical methods selected simpler models, followed closely by RLASSO, while lasso selected a large number of variables. The number of variables selected by NNG and ALASSO was comparable.
Figures: Prediction and model complexity: Figure 5.
4. Example: Respiratory Health Data
The ozone data originate from a study investigating the effects of ozone on school children’s lung growth. The study was conducted from February 1996 to October 1999 in Germany, involving 1101 school children who were initially in the first and second primary school classes (ages 6–8) [57]. Over the four years, lung function measurements were collected three times per year (spring, summer, and autumn), except for spring 1998 [57]. A subset of 496 children with complete data has been used in previous studies to investigate medical issues [57], and as an example in methodological papers to assess the stability of model-building strategies [57] and bootstrap model averaging [58]. The same subset is analyzed here to illustrate various issues relevant to our simulation study.
The outcome variable is forced vital capacity (FVC), which measures the amount of air that can be forcibly exhaled from the lungs after taking the deepest breath possible, with 24 covariates as shown in Table 3. For further details, see Ihorst et al. [57] and Buchholz et al. [59].
To assess the predictive accuracy of the models, we split the dataset into a training set (70%) and a test set (30%). This approach was chosen because it allows for the comparison of the number of variables selected by each method and their corresponding RMSE. The full OLS model with 24 covariates fitted to the training data yielded an R² of 65%, which we refer to as the adequate-information setting. To illustrate how the amount of information in the data affects variable selection and prediction accuracy, we conducted two analyses. The first analysis used the full set of candidate variables (Section 4.1). The second analysis (Section 4.2), representing the inadequate-information setting, removed the three most influential predictors (based on p-values in the full model) from the set of candidate variables, resulting in a training OLS model with 21 variables and an R² of 23%. This range of R² values is similar to our simulation setting, which ranged from 20% to 71%.
4.1. Adequate Information
Table 3 shows the p-values from the full OLS model fitted on the training data with 24 predictors. Three variables (x1–x3) are highly significant and explain about 62% of the total variation, close to the 63% explained by the full model. The table also reports the variance inflation factors (VIFs) for all variables (last column); only three variables (x8, x20, and x22) exhibit moderate multicollinearity, with VIF values between 5 and 10, while the remaining variables have low multicollinearity. In addition, Table 3 shows the variables selected by each method when tuned using CV and BIC, with the variables selected by BIC given in brackets.
All three classical methods (BE, FS, and BSS), as well as RLASSO, selected the same three variables (x1–x3), regardless of the tuning criterion. In contrast, variable selection by NNG, lasso, and ALASSO depended on the tuning method. When tuned using CV, each of these methods selected the same 13 variables, producing more complex models than the classical methods and RLASSO. This suggests that CV may not be suitable when simpler models are preferred. When BIC was used, all three methods selected simpler models with the same three variables (x1–x3) as the classical methods, except for lasso, which included one additional variable (x6). These findings are consistent with the results of our simulation study, where CV tended to favor more complex models, while BIC led to simpler models.
Prediction performance was also compared using RMSE on the test data under both CV (denoted RMSE (CV)) and BIC (denoted RMSE (BIC)) tuning criteria. Despite differences in model complexity, all methods achieved comparable RMSE values. This is not surprising, as the three highly significant variables (x1, x2, and x3) were consistently selected across all approaches, likely contributing to stable predictive accuracy.
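A minimal sketch of this evaluation scheme (with synthetic stand-in data of the same dimensions as the ozone subset; the seed and names are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(2)                   # stand-in for the ozone data:
X = rng.normal(size=(496, 24))                   # 496 children, 24 covariates
y = X[:, :3] @ np.array([0.8, 0.6, 0.4]) + rng.normal(size=496)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)
model = LassoCV(cv=10).fit(X_tr, y_tr)           # CV-tuned lasso on the 70% split
rmse = np.sqrt(np.mean((y_te - model.predict(X_te)) ** 2))
print(f"test RMSE: {rmse:.3f}, selected: {np.sum(model.coef_ != 0)} variables")
```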
4.2. Inadequate Information
As shown in Table 3, variables x1, x2, and x3 accounted for a large proportion of the total variation, with an adjusted R² of 62%. To assess model performance under conditions of low R², these variables were removed from the datasets, and the analysis was repeated. The results are summarized in Table 4. Removing these variables led to a substantial decrease in the adjusted R², from 63% to 18%, highlighting the impact of these covariates on model fit. Their removal also altered the significance of other variables—x15 and x16 became significant, whereas x4 became nonsignificant. In addition, the selected models differed substantially across methods and tuning criteria. A similar strategy, in which these three variables were removed one at a time to demonstrate how variables with strong effects can help stabilize the model selection process, has been used previously [57].
Several key observations were evident under this low-R² setting. First, unlike in the high-R² setting, the variables selected by the classical methods differed when tuned using CV, whereas tuning with BIC resulted in the selection of the same three variables. Second, among the penalized methods, RLASSO produced simpler models with two variables under both tuning criteria. In contrast, NNG and ALASSO each selected eight variables when tuned using CV and three variables when tuned using BIC. Lasso, when tuned using CV, selected nine variables, corresponding to the same eight selected by NNG and ALASSO, with the addition of variable x22. However, when tuned using BIC, lasso selected only two variables, which were identical to those selected by RLASSO.
Third, the prediction accuracy decreased, as indicated by the larger RMSE values reported in Table 4 compared to those in Table 3. This reduction in accuracy is expected, as the exclusion of strong-effect variables negatively affects model performance. The prediction accuracy of all selected models was very similar, with a slight advantage observed for models tuned with CV (RMSE ranging from 0.358 to 0.369 for CV, and from 0.363 to 0.384 for BIC). For practical application, simpler models are generally preferable (e.g., those selected by BIC with two or three variables), but the broader issue concerns the reliability of variable selection procedures when the available information is inadequate to select a sensible model. This issue was further explored in our simulation study, which considered a low-R² scenario with a value of 20%.
5. Discussion
In a large simulation study, we compared the prediction accuracy and model complexity of several popular variable selection methods. Our published protocol [12] included BSS and BE as representatives of long-established classical methods. Although FS was not part of the original protocol, we decided to include it during the analysis phase, as it is also a widely used classical method. We then compared their performance with that of the penalized methods: NNG, lasso, ALASSO, and RLASSO. While RLASSO was originally proposed for high-dimensional data, we included it to assess its performance in low-dimensional data. Our focus on low-dimensional data allowed us to better understand the results. The penalized methods considered can also be applied to high-dimensional data, but among the classical methods, only FS is suitable for such settings. Due to the complexity of the simulation design, which involved seven methods, three model selection criteria, five sets of true regression coefficients, four correlation structures, four SNRs, and two sample sizes, we structured our results into five objectives and summarized the key findings for each. Below, we briefly discuss the results of each objective before drawing our final conclusions.
5.1. Effects of Initial Estimates on the Prediction Accuracy of NNG and ALASSO
We investigated whether the choice of initial estimates (OLS, ridge, and lasso) affects the prediction accuracy of NNG and ALASSO in scenarios with limited and sufficient information. Under limited information, the three initial estimates produced models with different predictions, with ridge initial estimates performing best. The superiority of ridge estimates over lasso and OLS may be explained by the limitations of the latter two: lasso may exclude important variables through its variable selection in the first stage, whereas OLS often exhibits high variability. These drawbacks can adversely affect the prediction performance of the NNG and ALASSO models [11,15,17]. In contrast, when sufficient information was available, all three initial estimates produced models with comparable predictions, which agrees with the findings reported by Kipruto and Sauerbrei [19] for low-correlation settings with high R². This is because sufficient information enhances the estimation accuracy of both the initial estimates and the tuning parameters.
Our findings suggest that, if the primary aim of the analysis is prediction, ridge initial estimates may serve as an alternative to OLS initial estimates in both low- and high-correlation settings, aligning with the findings of Makalic and Schmidt [60]. However, when the goal is to obtain simpler, more interpretable models for descriptive purposes, lasso estimates may be preferable, especially when the data contain sufficient information to allow lasso to accurately select important variables in the first stage. This is crucial because, once a variable is eliminated in the first stage, it cannot be reintroduced into the NNG or ALASSO model in the second stage [49].
5.2. Comparison of Prediction Accuracy of NNG and ALASSO over Initial Models
We compared the prediction accuracies of the NNG and ALASSO models with those of the models used to generate the initial estimates (OLS, ridge, and lasso). The results showed that NNG and ALASSO consistently outperformed OLS models, which is not surprising given that OLS models estimate all parameters without shrinkage, increasing the risk of overfitting and leading to higher prediction errors on new data [3,10]. These findings are consistent with those of Yuan and Lin [17], who demonstrated that NNG is effective in improving on the initial estimator in terms of both variable selection and estimation accuracy.

However, when ridge and lasso estimates were used as initial estimates, NNG and ALASSO often produced worse predictions than the ridge and lasso initial models in scenarios with high correlation, as well as in small sample sizes with low-to-moderate SNR (SNR < 1). In such settings, distinguishing between relevant and irrelevant variables is challenging [3], and NNG and ALASSO may select models with too few variables, resulting in underfitting and poor generalization [10,17]. Conversely, in scenarios with sufficient information, where minimal shrinkage is required [6], NNG and ALASSO outperformed the ridge and lasso initial models. In these settings, ridge or lasso shrinkage can be excessive, leading to suboptimal predictions [15,40].
Overall, our findings suggest that, while NNG and ALASSO consistently outperformed OLS models, they do not always outperform ridge and lasso initial models in data with small sample sizes, high correlation, and low SNR. In such cases, any modeling strategy may be limited, and a more appropriate course of action may be to describe the data and recommend conducting a more informative study. However, when sufficient information is available, NNG and ALASSO enhance both the variable selection and prediction accuracy of initial models.
5.3. Comparison of Prediction Accuracy of Classical and Penalized Methods Using CV as Criterion
We evaluated the prediction accuracy of classical variable selection methods (BE, BSS, and FS) and penalized regression methods (NNG, lasso, ALASSO, and RLASSO), using CV as the model selection criterion.
BSS, BE, and FS performed quite similarly in most scenarios; however, in certain scenarios, such as large sample sizes with high correlation and high SNR, BE and FS had a slight edge (see Figure A5 in Appendix A). This difference can be attributed to the nature of BSS, which considers a much larger set of models than BE and FS. While this extensive search may seem advantageous, it can lead to overfitting and high variance in coefficient estimates, resulting in poor generalization [4,43]. FS outperformed BE in high-correlation settings with low-to-moderate SNR. This advantage is likely due to the sequential nature of FS, which begins with a simple model and gradually adds variables. In contrast, BE starts with a more complex model and removes variables, which may be less effective when information is limited and the risk of overfitting is high. Mantel [46] demonstrated the advantages of BE over FS and concluded that BE is more appropriate than FS. This is not consistent with our findings, likely because that evaluation focused on variable selection rather than prediction performance and did not consider a broader range of scenarios.
Overall, the predictive accuracy of classical methods was inferior to that of penalized methods, especially in settings with small sample sizes, high correlation among covariates, and low SNR (i.e., SNR < 1), which aligns with previous findings [40]. In such scenarios, classical methods are prone to instability and high variance due to their discrete nature of either retaining or discarding variables, unlike penalized methods, which apply continuous shrinkage to coefficient estimates [9,21,40]. Some researchers [41] have argued against the use of classical variable selection methods, citing their limitations in settings with a large number of potential predictors. However, our findings indicate that classical methods can outperform modern variable selection methods in both predictive accuracy and model simplicity under certain settings. This observation is supported by other studies [24,40,42], which have reported similar results under comparable conditions.
To improve the prediction performance of classical methods, stacking and post-estimation shrinkage have been proposed [42,61]. Instead of selecting a single best model for prediction, stacking combines linear predictors from various models with different numbers of covariates and estimates optimal weights for each linear predictor [10,61]. While this approach can improve prediction, it sacrifices interpretability compared to selecting a single model [10]. Alternatively, post-estimation shrinkage applies shrinkage (estimated via CV) to the regression estimates of the best-selected model [22,38,42,62]. Although stacking and post-estimation shrinkage both offer ways to enhance prediction accuracy, they are beyond the scope of this study.
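Although beyond the scope of the study, the post-estimation idea is simple to sketch. One common variant estimates a single global shrinkage factor from the cross-validated linear predictor of the selected model (an illustrative implementation, assuming `X_sel` holds only the selected covariates):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

def global_shrinkage_factor(X_sel, y, n_splits=10):
    """Estimate c by regressing y on the out-of-fold linear predictor."""
    lp = np.empty(len(y))
    for tr, te in KFold(n_splits, shuffle=True, random_state=0).split(X_sel):
        fit = LinearRegression().fit(X_sel[tr], y[tr])
        lp[te] = fit.predict(X_sel[te])          # cross-validated predictions
    c = LinearRegression().fit(lp.reshape(-1, 1), y).coef_[0]
    return c  # shrunken slopes: c * beta_ols; intercept re-estimated afterwards
```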
In settings with large sample sizes, low correlation, and high SNR, classical methods outperformed lasso, which is consistent with previous findings [24,40]. In these scenarios, classical methods produced predictions comparable to those of NNG, RLASSO, and ALASSO. Lasso selected too many variables (Figure A6), which, combined with its tendency to over-shrink large effects, led to inferior prediction accuracy despite the availability of sufficient information [15,24,40]. These results are in agreement with findings from [24,40,63].
Among the penalized methods, lasso consistently outperformed NNG, ALASSO, and RLASSO in scenarios with small sample sizes and low SNR, as well as in high-correlation settings. The performance of ALASSO and NNG depends on initial estimates, which are less reliable when the data contain limited information, thus negatively affecting their prediction accuracy. RLASSO has two tuning parameters, which also require sufficient information for accurate estimation. These factors may explain why lasso is often preferred in challenging conditions [51]. Conversely, in scenarios with low correlation and high SNR, NNG, ALASSO, and RLASSO outperformed the lasso, for the reasons previously explained.
In summary, lasso performed best in prediction accuracy in scenarios with limited information, whereas classical methods as well as NNG, ALASSO, and RLASSO performed best in scenarios with sufficient information. Additionally, NNG, ALASSO, and RLASSO outperformed classical methods in scenarios with low SNR. Among NNG, ALASSO, and RLASSO, no single approach consistently outperformed the others. In scenarios with sufficient information, NNG, ALASSO, RLASSO, or classical methods may be used, as they yield similar predictions. However, classical methods may be preferred due to their tendency to produce simpler models.
5.4. Influence of Model Selection Criteria on Predictions and Model Complexity
We compared the predictive performance of models tuned using three model selection criteria: CV, AIC, and BIC. Our findings showed that the performance of these criteria varies according to sample size, SNR, and the correlation among covariates.
In scenarios with limited information, BIC generally led to worse prediction accuracy than AIC and CV. This is because BIC’s strong penalty for model complexity often results in models with too few variables (Figure A7), leading to higher prediction errors. These findings corroborate those from other related studies on variable selection [64]. In contrast, when sufficient information was available, BIC often outperformed AIC and CV in accuracy. However, this was not the case for the lasso model, where BIC consistently resulted in inferior predictions. This likely stems from lasso’s use of a single tuning parameter to simultaneously control both variable selection and shrinkage [16]. Since BIC tends to favor larger tuning parameters, it can cause excessive shrinkage, leading to suboptimal model performance.
The initial estimates in ALASSO and NNG, as well as the auxiliary parameter in RLASSO, help reduce the amount of shrinkage applied to coefficients. This can improve both model selection and prediction accuracy [16,40]. Furthermore, in scenarios with several small effects (beta-type B), BIC-tuned models performed poorly even when the data contained sufficient information (Figure A9), as BIC tends to eliminate small effects and thereby underfit the model. This observation aligns with findings by Burnham and Anderson [35], who reported that BIC performs best when only large effects are present, whereas AIC is more effective when both large and small effects exist.
AIC and CV performed similarly in most scenarios, which is consistent with their shared objective of selecting models that generalize well to unseen data [35]. Stone [65] provided a theoretical basis, showing that under certain conditions, the model selected by leave-one-out CV (a special case of k-fold CV) asymptotically converges to the model selected by AIC.
Regarding model complexity, BIC generally selected models with fewer variables than AIC and CV, as expected due to its more stringent penalty for model complexity. AIC and CV selected a comparable number of variables.
It is important to note that there is a link between model selection criteria and model selection tests (i.e., hypothesis tests comparing different models). Teräsvirta and Mellin [66] demonstrated that using model selection criteria to select models is akin to selecting a model at a certain significance level. Under certain assumptions, they derived approximate significance levels for different model selection criteria. For AIC, the asymptotic significance level for including a variable in a model corresponds to 0.157, derived from the upper tail area of the chi-squared distribution with one degree of freedom, where the cutoff value is a = 2. In contrast, for BIC, the significance level is a function of the sample size, where a = ln(n); for example, the significance levels for n = 100 and n = 400 are approximately 0.032 and 0.014, respectively. These smaller significance levels explain BIC’s tendency to select simpler models compared to AIC.
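These quoted levels are easy to verify numerically from the upper tail of the chi-squared distribution with one degree of freedom:

```python
# Quick numerical check of the approximate significance levels quoted above.
import numpy as np
from scipy.stats import chi2

print(chi2.sf(2, df=1))             # AIC cutoff a = 2: ~0.157
print(chi2.sf(np.log(100), df=1))   # BIC, n = 100 (a = ln 100): ~0.032
print(chi2.sf(np.log(400), df=1))   # BIC, n = 400 (a = ln 400): ~0.014
```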
5.5. Impact of a High Proportion of Noise Variables on Prediction Accuracy and Model Complexity of Approaches
We assessed the robustness of variable selection methods in the presence of a relatively large number of noise variables. The results were similar to those obtained with a lower proportion of noise variables. In contrast, previous simulation studies [9,21] have reported that subset selection methods are sensitive to the proportion of noise variables, which does not align with our findings. This discrepancy is likely because we added only 15 noise variables, which may have been too few to show substantial differences in performance across methods. The number of noise variables added was constrained by the computational demands of best subset selection (BSS).
5.6. Application of Variable Selection Methods
Variable selection is an important topic in all scientific fields that involve empirical data analysis, including areas such as biomedical research, economics, epidemiology, sociology, and engineering [3]. The need for variable selection is especially relevant when analysts are faced with many potential predictors but lack sufficient subject-matter knowledge to prespecify which variables are most relevant for inclusion in the model [3]. While our background and example come from health research, the underlying concepts are broadly applicable across many other research fields.

In our example, we used the adjusted R² to obtain a rough estimate of the explained variation associated with the selected variables. We acknowledge that using this metric after variable selection is a longstanding concern in the statistical community due to the severe biases it can introduce [67]. Therefore, we used it only for model comparison.

To evaluate predictive accuracy, we employed a data-splitting approach rather than cross-validation or bootstrapping. While resampling methods have been increasingly advocated due to the limitations of data splitting [3], we preferred the latter. This choice allowed us to directly compare the number of variables selected, which is more challenging to assess through cross-validation, as different variables can be selected in each fold.
5.7. Limitation of the Simulation Study
A limitation of our work is that the simulation study did not cover all possible types of scenarios that may occur in real-world applications and did not incorporate coefficient settings from real-world data. In addition, we considered only classical linear models, assuming linearity, no interactions, no outliers or influential points, and homoscedasticity. Including these issues would have further improved the reliability of the method recommendations.
6. Conclusions
In this study, we compared the prediction accuracy of classical and penalized methods across a range of scenarios. Classical methods generally exhibited inferior accuracy, except in scenarios with sufficient information (large sample size, low correlation, and high SNR), where their predictions were comparable to those of penalized methods. Among the penalized approaches, lasso performed best when information was limited (small sample size, high correlation, and low SNR). In contrast, NNG, ALASSO, and RLASSO performed similarly and outperformed lasso in scenarios with sufficient information.
We also evaluated the effects of initial estimates (OLS, ridge, and lasso) on the prediction accuracy and model complexity of NNG and ALASSO. Ridge initial estimates produced superior predictive performance compared to OLS and lasso initial estimates, suggesting that ridge estimates may be preferable when prediction is the primary goal. Conversely, lasso initial estimates produced simpler models, making them suitable for descriptive purposes, particularly when data contain sufficient information.
Furthermore, we assessed model performance under different tuning criteria. Models tuned with AIC and CV produced comparable prediction accuracy and generally outperformed those tuned with BIC. However, BIC was most effective in scenarios with sufficient information (excluding scenarios with several small effects) across all approaches, except for lasso, where it consistently underperformed. BIC produced simpler models compared to AIC and CV, which is advantageous when simplicity and practical usability are key considerations.
Overall, our findings indicate that no single method performs best in all scenarios. The choice of method depends on the amount of information in the data and the criteria used for selecting tuning parameters. Lasso is the preferred choice in settings with limited information due to its superior predictive accuracy. However, when sufficient information is available, classical methods are preferred, as they provide better predictive performance and select simpler models.
7. Directions for Future Research
In this study, we focused only on low-dimensional settings, as certain methods, such as BE, are not applicable in high-dimensional settings. Future research should extend this comparison to include both classical (FS and BSS) and penalized methods in high-dimensional settings, particularly under conditions of both sufficient and limited information, to evaluate the generalizability of our findings.
Additionally, we did not examine the robustness of the methods to violations of standard model assumptions, such as heteroscedasticity and non-normal error distributions, nor did we assess the impact of influential points or outliers. In real-data analysis, it is highly relevant to check for influential points or extreme values, which can easily be identified and handled manually [68]. However, identifying influential points or outliers in simulation studies is challenging due to the large number of repetitions. Therefore, we followed a preventive strategy by truncating values at the 1st and 99th percentiles, as recommended in the literature [68]. Several robust extensions have been proposed, including the robust nonnegative garrote [69], robust adaptive lasso, and robust lasso [70]. These methods were not included in our evaluation. Evaluating the performance of both the standard and robust methods under such challenging conditions would offer valuable insights into their practical reliability and broader applicability.
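For completeness, the preventive truncation mentioned above amounts to per-covariate winsorization, sketched below (thresholds taken from the data themselves; the function name is illustrative):

```python
import numpy as np

def truncate_percentiles(X, lower=1, upper=99):
    """Clip each column of X to its 1st and 99th percentiles."""
    lo = np.percentile(X, lower, axis=0)
    hi = np.percentile(X, upper, axis=0)
    return np.clip(X, lo, hi)   # values outside [lo, hi] are set to the bounds
```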
Our simulations did not incorporate interaction terms or nonlinear functional forms for continuous variables. Future research could explore how these methods perform when such complexities are included in the data-generating process. A simulation study protocol addressing nonlinearity using methods also considered in our work has been published [71], and we are currently working on a paper investigating the linearity assumption and its influence on prediction accuracy in multivariable models.
Although the comparisons in this study were conducted within the framework of classical linear regression, the methods evaluated are also applicable in generalized linear models (e.g., for binary outcomes) and survival models. Therefore, further studies should investigate whether the findings hold across these alternative model types. Ongoing work by Ullmann et al. [71] is evaluating the performance of several variable selection methods, including some considered here, in both linear and logistic regression models. Such complementary research may provide additional evidence on the applicability and robustness of these methods in broader modeling contexts.
Finally, other variable selection methods, such as smoothly clipped absolute deviation (SCAD), elastic net, boosting, and minimax concave penalty (MCP), were not considered. Future research could compare their performance with the methods evaluated here.