1. Introduction
Regression models play a crucial role in many scientific fields, including medicine, economics, psychology, environmental science, and engineering. They are used for various purposes, and many researchers distinguish between models for description, which aims to capture the data structure parsimoniously; prediction, which aims to predict the outcome for new observations; and explanation, which tests a causal hypothesis by assuming that a specific set of covariates causes an underlying effect on the outcome variable [1,2,3,4,5]. A regression model that closely approximates the true data-generating model can be used for both descriptive and predictive purposes [1,4].
In this article, we focus only on prediction models, which can be complex as they may include variables with both strong and weak effects, as well as some noise variables, provided the prediction accuracy is not compromised [4,6]. However, overly complex models may overfit the training data, meaning that the model fits idiosyncrasies in the data rather than generalizable patterns [1,3]. Consequently, they may generate extreme predictions when applied to new data [3,7]. Additionally, if the costs of collecting variables are high, such models may not be practically useful and may be quickly forgotten [8]. On the other hand, models with too few covariates may underfit the data, resulting in poor generalization [9,10]. To strike a balance between overfitting and underfitting, a good variable selection approach that produces a simpler and more accurate model is required [9,11,12]. Simpler prediction models are easier to interpret and provide insights into which variables may be important predictors of the outcome [9].
In practice, many variables are often measured, and a more parsimonious model may be preferred. Traditional methods for selecting variables have been in use for over five decades. Although many alternative approaches have been proposed in recent years, meaningful comparisons and a clear understanding of their properties remain limited [4,13]. Sauerbrei et al. [4] reviewed the state of the art in variable and functional form selection in multivariable analysis and categorized variable selection methods into five classes: traditional (classical) methods, change-in-estimate procedures, penalized likelihood approaches, boosting, and resampling-based methods. They concluded that further comparative research is needed to define the state of the art and provide evidence-supported guidance. Among the seven key issues they identified for further investigation, issue 1 is the “Investigation and comparison of the properties of variable selection strategies”. To address this gap, Kipruto and Sauerbrei [12] designed a simulation study in the context of linear regression to compare selected traditional and penalized variable selection procedures, with a particular focus on the role of shrinkage in model selection and prediction—a key component of penalized likelihood approaches. The study was designed following the principles of neutral comparison studies, which aim to provide objective insights into the properties of methods without favoring any specific approach [14]. The simulation design borrowed information from published simulation studies with related investigations to inform its structure and to gain insight into the weaknesses and strengths of alternative designs, as summarized in their Supplementary Materials (Figures S1–S8). In addition, they published their study protocol prior to conducting the analysis to mitigate potential issues of publication bias. To ensure the reproducibility of our results, we will make the programs available.
We focus on evaluating prediction models for continuous outcomes in the context of low-dimensional data, assuming linear effects for all signal variables and no interactions. We consider three classical variable selection methods (best subset selection (BSS), backward elimination (BE), and forward selection (FS)) and four penalized regression methods: nonnegative garrote (NNG) [9], lasso [11], adaptive lasso (ALASSO) [15], and relaxed lasso (RLASSO) [16]. The prediction accuracy and model complexity of NNG and ALASSO depend on the choice of initial estimates [15,17,18,19]. Therefore, our first objective (O1) is to evaluate the effect of three initial estimates on prediction accuracy and model complexity: those from ordinary least squares (OLS), ridge [20], and lasso. The tuning parameters of ridge and lasso are selected via a cross-validation scheme aimed at optimizing prediction performance. Our second objective (O2) is to compare the prediction accuracy and model complexity of NNG and ALASSO against the models that generated the initial estimates.
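To make the two-stage structure concrete, the following minimal sketch (not the code used in this study; the function name, tuning grids, and the weight exponent gamma = 1 are illustrative assumptions) implements ALASSO via the standard rescaling trick, with OLS, ridge, or lasso initial estimates supplied in the first stage:

```python
# Minimal sketch of two-stage ALASSO with scikit-learn; assumes X is
# standardized and y is mean-centered, as described later in this section.
import numpy as np
from sklearn.linear_model import LinearRegression, RidgeCV, LassoCV

def adaptive_lasso(X, y, initial="ridge", cv=10):
    # Stage 1: initial estimates from OLS, CV-tuned ridge, or CV-tuned lasso.
    if initial == "ols":
        beta0 = LinearRegression().fit(X, y).coef_
    elif initial == "ridge":
        beta0 = RidgeCV(alphas=np.logspace(-3, 3, 50)).fit(X, y).coef_
    else:
        beta0 = LassoCV(cv=cv).fit(X, y).coef_
    w = np.abs(beta0)            # adaptive weights (gamma = 1)
    keep = w > 0                 # a zero initial estimate drops the covariate
    # Stage 2: ordinary lasso on covariates rescaled by the weights.
    fit = LassoCV(cv=cv).fit(X[:, keep] * w[keep], y)
    beta = np.zeros(X.shape[1])
    beta[keep] = w[keep] * fit.coef_   # back-transform to the original scale
    return beta
```

Rescaling each covariate by the absolute initial estimate turns the weighted penalty into an ordinary lasso problem, so any lasso solver can be reused in the second stage; NNG differs in that stage, where nonnegative shrinkage factors are estimated instead.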
Despite some criticism [3], classical variable selection methods are still popular and remain practically useful. They either retain or drop covariates from the model based on certain criteria, rather than gradually shrinking their coefficients to zero [11]. This discrete process can be problematic, as even slight changes in the data may lead to different models, resulting in unstable predictions [10,11,21,22,23].
Penalized regression methods are alternative techniques for variable selection that continuously shrink regression coefficients towards zero, while setting some coefficients exactly to zero [3,9,11]. They differ in their penalty functions. According to Fan and Li [23], an ideal penalty function should (i) produce nearly unbiased estimates for large coefficients to avoid estimation bias, (ii) exhibit continuous shrinkage to avoid instability in model predictions, and (iii) perform variable selection by forcing some regression coefficients to be exactly zero. The penalty functions of NNG and ALASSO possess these properties, as they shrink small and large coefficients differently, while the lasso penalty may produce biased estimates for large coefficients, which may increase prediction errors, especially in situations where minimal shrinkage is required [15]. To reduce over-shrinkage of large coefficients and achieve good prediction accuracy, the lasso, when tuned via cross-validation for optimal prediction, often selects many variables, leading to a relatively high rate of false positives [24,25].
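For reference, the criteria behind these penalties can be written in their standard forms from the cited literature, where $\tilde{\beta}$ denotes the initial estimates and $\gamma > 0$ the weight exponent:

$$\hat{\beta}^{\text{lasso}} = \arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda \sum_{j=1}^{p} |\beta_j|, \qquad \hat{\beta}^{\text{ALASSO}} = \arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda \sum_{j=1}^{p} \frac{|\beta_j|}{|\tilde{\beta}_j|^{\gamma}},$$

while the NNG sets $\hat{\beta}_j = \hat{c}_j \tilde{\beta}_j$ with

$$\hat{c} = \arg\min_{c \geq 0} \Big\| y - \sum_{j=1}^{p} c_j \tilde{\beta}_j x_j \Big\|_2^2 + \lambda \sum_{j=1}^{p} c_j.$$

Because the ALASSO weight $1/|\tilde{\beta}_j|^{\gamma}$ is small for covariates with large initial estimates, large coefficients are shrunken less, which is how the properties above are achieved.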
The continuous process of shrinking regression coefficients by penalized methods can lead to more stable predictions than classical methods, due to the bias–variance trade-off, especially in situations with small sample sizes or a low signal-to-noise ratio (SNR) [11,21,23]. For this reason, penalized methods have been recommended for prediction [26,27]. However, researchers have argued that they should not be viewed as a solution to small sample sizes or low SNR. Instead, they should be used when a sufficiently large training dataset is available and the SNR is moderate. These conditions help to reduce uncertainty in the estimation of tuning parameters, which control the quality of regression estimates and, consequently, the prediction accuracy of the models [7,28].
Model selection criteria such as cross-validation (CV), the Akaike information criterion (AIC) [29], and the Bayesian information criterion (BIC) [30] have been proposed for selecting tuning parameters in penalized methods [31,32], as well as for choosing the best models in classical methods [33,34,35]. In this study, we employed these three popular criteria, which target different types of models. CV and AIC aim to select the best model for prediction on new unseen data, while BIC aims to identify the true data-generating model or the model that is closest to it [35]. Some studies have shown that BIC may outperform AIC in prediction when there are a few large effects and all other covariates are noise. In such situations, BIC applies a heavier penalty for model complexity, favoring simpler models that retain only covariates with large effects, while AIC may retain many noise variables [36]. Conversely, AIC may perform better than BIC when there are a few covariates with large effects, followed by many covariates with smaller effects that gradually decrease. In such settings, AIC’s penalty for model complexity is less severe than BIC’s, allowing it to more effectively capture these smaller effects and achieve better predictive performance [35,36].
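For a Gaussian linear model with the error variance profiled out, the two information criteria reduce, up to additive constants, to the forms in the sketch below (an illustration under stated assumptions: `df` is taken as the number of estimated coefficients, which for the lasso is commonly the number of nonzero coefficients):

```python
# Minimal sketch of AIC/BIC for a fitted linear model; smaller is better.
import numpy as np

def aic_bic(y, y_hat, df):
    """df: model degrees of freedom (number of estimated coefficients)."""
    n = len(y)
    rss = np.sum((y - y_hat) ** 2)              # residual sum of squares
    aic = n * np.log(rss / n) + 2 * df          # AIC: penalty of 2 per parameter
    bic = n * np.log(rss / n) + np.log(n) * df  # BIC: penalty grows with ln(n)
    return aic, bic
```

For n ≥ 8, ln(n) > 2, so BIC penalizes complexity more heavily than AIC, which explains its preference for smaller models.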
Since CV is a widely used approach for selecting tuning parameters, our third objective (O3) is to compare the prediction accuracy and model complexity of classical and penalized regression methods tuned using this approach. Our fourth objective (O4) is to assess how the three model selection criteria (CV, AIC, and BIC) influence both the prediction accuracy and model complexity of variable selection methods. The performance of variable selection methods is known to be sensitive to the proportion of noise variables [9]. Therefore, our fifth objective (O5) is to evaluate the robustness of the approaches in settings with a high proportion of noise variables.
Throughout this paper, we standardized each covariate in the training data to have a mean of zero and unit variance. Additionally, we centered the response variable by its mean to omit the intercept from the model. All variables in the new dataset were standardized using the statistics derived from the training data.
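A minimal sketch of this preprocessing (synthetic data; the key point is that the test data are standardized with the training statistics):

```python
import numpy as np

rng = np.random.default_rng(1)
X_train, X_test = rng.normal(size=(100, 15)), rng.normal(size=(50, 15))
y_train = rng.normal(size=100)

mu, sd = X_train.mean(axis=0), X_train.std(axis=0)  # training statistics only
X_train_std = (X_train - mu) / sd                   # mean zero, unit variance
X_test_std = (X_test - mu) / sd                     # same mu and sd reused
y_train_c = y_train - y_train.mean()                # allows omitting the intercept
```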
The rest of the paper is organized as follows: Section 2 provides a summary of the simulation design, following the ADEMP structure (Table 1), which entails defining aims (A), data-generating mechanisms (D), estimands/targets of analysis (E), methods (M), and performance (P) measures [37]. In addition, it describes the methods used, including their tuning parameters and the initial estimates for the two-stage approaches (NNG and ALASSO). Section 3 presents the results from the simulation studies, including a detailed summary in a structured format (Table 2), organized according to the five objectives outlined in the “Aims” section of the ADEMP. Section 4 presents the results of a real data example. Section 5 contains the discussion, Section 6 provides the conclusions, and Section 7 outlines some directions for future research.
3. Results
This section presents the findings from our simulation study, structured according to the study’s objectives (see Table 1). For concreteness, we focused on specific scenarios where the true regression coefficients follow beta-type A, with low (C2) and high (C3) correlation settings across different sample sizes (100 and 400) and SNRs. The only exception is Section 3.5, which shows results for low-correlation type C1, used to assess the impact of many noise variables on the prediction accuracy of the models. Additional results for other beta-types and correlation types are available in the Supplementary Materials (see Figures S1–S8), and the findings are consistent with those reported here. A high-level summary of all simulation results is provided in Section 3.6.
3.1. Effects of Initial Estimates on the Prediction Accuracy of NNG and ALASSO
We investigated whether the choice of initial estimates (OLS, ridge, and lasso) influenced the prediction accuracy of the NNG and ALASSO models. To explore this, we used a Bland–Altman plot [56] to evaluate the agreement between the predictions, measured using RR. We observed that the effects of initial estimates on predictions for NNG and ALASSO were very similar; therefore, only the results for ALASSO are reported.
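A minimal sketch of such a plot (matplotlib; `err_a` and `err_b` stand for per-repetition prediction errors of two models and are illustrative names):

```python
import numpy as np
import matplotlib.pyplot as plt

def bland_altman(err_a, err_b, label_a="model A", label_b="model B"):
    mean_ = (err_a + err_b) / 2            # x-axis: average of the two errors
    diff = err_a - err_b                   # y-axis: their difference
    md, sd = diff.mean(), diff.std()
    plt.scatter(mean_, diff, s=10)
    plt.axhline(md)                        # mean difference (solid line)
    plt.axhline(md + 1.96 * sd, ls="--")   # upper limit of agreement
    plt.axhline(md - 1.96 * sd, ls="--")   # lower limit of agreement
    plt.xlabel(f"Mean of {label_a} and {label_b}")
    plt.ylabel(f"Difference ({label_a} - {label_b})")
    plt.show()
```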
Figure 1 compares the prediction errors of three ALASSO models: ALASSO (O, CV), ALASSO (R, CV), and ALASSO (L, CV). Specifically, it evaluates the agreement between the prediction errors of (i) ALASSO (O, CV) and ALASSO (R, CV), (ii) ALASSO (O, CV) and ALASSO (L, CV), and (iii) ALASSO (R, CV) and ALASSO (L, CV).
The first and second rows compare the three models in scenarios with limited information: a small sample size (100), low SNR (0.25) (R² of about 20%), and high correlation among covariates. These conditions provide little information for accurately estimating both the tuning parameters and the initial estimates, leading all initial estimates to produce models with different predictions. This is indicated by a nonzero mean difference (horizontal solid line) and some differences falling outside the wider limits of agreement (LOA) (dashed lines) in the Bland–Altman plot. On average, ridge initial estimates produced models with the best predictions, followed by lasso estimates, while OLS estimates performed the worst.
The third and fourth rows compare the three models in scenarios with moderate information: a small sample size (100), moderate SNR (1) (R² of about 50%), and high correlation among covariates. In this setting, predictions across the three initial estimates differed, but not as substantially as in the limited-information scenarios.
The fifth and sixth rows compare the three models in scenarios with sufficient information: a large sample size (400), high SNR (2.5) (R² of about 71%), and low correlation among covariates. In this setting, the data contained adequate information to estimate the parameters accurately, resulting in nearly identical predictions across the three initial estimates. Differences were close to zero, and the LOAs were narrow, suggesting that the three initial estimates can be used interchangeably. Lasso initial estimates may be preferred, as they yielded simpler models (see Figure A1 in Appendix A).
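The R² values quoted above follow from the usual population identity, assuming the SNR is defined as the ratio of signal variance to error variance:

$$R^2 = \frac{\mathrm{SNR}}{1 + \mathrm{SNR}},$$

so SNR values of 0.25, 1, and 2.5 correspond to R² of 20%, 50%, and about 71%, respectively.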
Overall, ridge initial estimates produced models with superior predictions. Therefore, we will compare the prediction accuracy of NNG and ALASSO using ridge initial estimates to those of other approaches (see Section 3.3).
3.2. Comparison of Prediction Accuracy of NNG and ALASSO with the Models That Generated the Initial Estimates
We investigated whether NNG and ALASSO, using optimal tuning parameters from CV, could yield better prediction accuracy than the models that generated the initial estimates: OLS, Ridge (CV), and Lasso (CV). Specifically, we compared three sets of models: (i) NNG (L, CV), ALASSO (L, CV), and Lasso (CV); (ii) NNG (R, CV), ALASSO (R, CV), and Ridge (CV); and (iii) NNG (O, CV), ALASSO (O, CV), and OLS. The first two sets showed similar prediction patterns (Figure 2 and Figure A2), while NNG (O, CV) and ALASSO (O, CV) consistently outperformed OLS models in the third set (see Figure A3 in Appendix A). Therefore, we focus on the first set.
Figure 2 compares the predictive performance of NNG (L, CV), ALASSO (L, CV), and Lasso (CV) across various sample sizes and SNRs, in low- (upper panel) and high-correlation (lower panel) settings. NNG and ALASSO outperformed lasso only in scenarios requiring minimal shrinkage, such as those with large sample sizes, low correlation, and high SNR (top-right panel). However, in scenarios where severe shrinkage is often required, such as small sample sizes with low SNR (top-left panel), or high-correlation settings (lower panel), lasso outperformed both NNG and ALASSO.
Overall, NNG and ALASSO produced simpler models than those used to generate their initial estimates (see Figure A4 in Appendix A). While they consistently outperformed OLS models in terms of prediction, they only outperformed lasso and ridge models in scenarios that required minimal shrinkage. When model simplicity is the primary goal, NNG and ALASSO may be preferred.
3.3. Comparison of Prediction Accuracy for All Variable Selection Approaches
We compared the prediction accuracy and the average number of selected variables across various variable selection methods, all tuned using optimal parameters from 10-fold CV. We investigated the influence of sample size, SNR, and correlation on the model performance. We began by comparing the prediction accuracy of classical methods (BSS, BE, and FS), followed by a comparison of classical and penalized methods. To clearly observe the patterns, we plotted the average of each metric as a function of SNR.
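As one concrete illustration, forward selection tuned by 10-fold CV can be sketched with scikit-learn as below (an approximation, not the study's implementation: the greedy stopping rule here differs from selecting the best model along the full FS/BE sequence, and the data are synthetic):

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 15))
y = X[:, :3] @ np.array([1.0, 0.5, 0.25]) + rng.normal(size=100)

fs = SequentialFeatureSelector(
    LinearRegression(), direction="forward",    # "backward" approximates BE
    n_features_to_select="auto", tol=1e-4,      # grow while CV MSE improves
    cv=10, scoring="neg_mean_squared_error",
).fit(X, y)
print(fs.get_support())                         # mask of selected covariates
```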
3.3.1. Comparison of Classical Variable Selection Methods
Figure A5 compares the predictive performance of BSS, BE, and FS across different sample sizes and SNRs in both low-correlation (upper panel) and high-correlation (lower panel) settings. In low-correlation settings, all three methods produced similar prediction accuracy. In high-correlation settings, BSS and BE produced nearly identical results, with only minor differences at large sample sizes and high SNR (bottom-right panel), while FS showed slightly better performance than both BSS and BE. Therefore, we compare the results of BE and FS with those of the penalized methods.
3.3.2. Comparison of Prediction Accuracy of Classical and Penalized Regression Methods
Figure 3 compares the prediction accuracy of BE, FS, NNG, ALASSO, RLASSO, and lasso in low-correlation (upper panel) and high-correlation (lower panel) settings.
Low-Correlation Settings
BE and FS performed similarly and showed inferior prediction accuracy compared to penalized methods in small sample sizes (top-left panel) and in large sample sizes with low SNR (top-right panel, lower end). However, in large sample sizes with high SNR (top-right panel, upper end), BE and FS outperformed lasso and were comparable to the other penalized methods.
Among the penalized methods, lasso performed best in small sample sizes with low SNR (top-left panel), but was outperformed by other methods in large sample sizes with high SNR (top-right panel). NNG, ALASSO, and RLASSO performed similarly, although RLASSO was slightly less accurate in scenarios with large sample sizes and low SNR (top-right panel, lower end).
High-Correlation Settings
BE and FS consistently showed inferior predictive performance to the penalized methods, regardless of sample size or SNR (lower panel). In scenarios with small sample sizes or low SNR, FS performed slightly better than BE—likely due to the fact that BE begins with a full model containing many parameters, which can increase variability in coefficient estimates and lead to suboptimal predictions under limited information. Lasso outperformed other approaches across different sample sizes and SNR levels (bottom-left panel). NNG, ALASSO, and RLASSO showed comparable performance across settings (lower panel).
Overall, lasso performed best in scenarios with small sample sizes, low SNR, and high correlation, while BE and FS performed the worst under these settings. Conversely, in scenarios with large sample sizes, low correlation, and high SNR, lasso performed poorly, whereas the other approaches, including BE and FS, produced better results. In terms of model complexity, lasso selected models with a larger number of variables, while BE and FS selected simpler models. The other penalized methods selected models of intermediate complexity, with only minor differences among them (Figure A6).
3.4. Influence of Model Selection Criteria on Predictions and Complexity of Selected Models
This section compares the prediction accuracy and the average number of variables selected by various variable selection approaches, each tuned using CV, AIC, and BIC criteria.
Figure 4 shows the prediction performance for small (upper panel) and large (lower panel) sample sizes under low-correlation settings across various values of SNR.
In small sample sizes, models tuned with BIC generally showed worse prediction accuracy than those tuned with AIC and CV across methods. An exception was observed with the BE and FS methods, where BIC and CV performed similarly. Overall, AIC and CV showed comparable prediction performance across most methods, with CV slightly outperforming AIC in RLASSO at low SNR, while AIC had a slight edge in the BE and FS methods.
In large sample sizes, interesting patterns emerged. When the SNR was low-to-moderate (SNR < 1), models tuned with BIC showed worse prediction accuracy than those tuned with AIC and CV. However, at high SNR, BIC provided models with better predictions, except for lasso, where BIC consistently resulted in inferior accuracy. AIC and CV produced very similar predictions across all penalized approaches, except at high SNR, where CV slightly outperformed AIC in RLASSO. In contrast, CV and AIC produced different results in the classical methods, where AIC performed better at low SNR, while CV was more effective at high SNR.
Regarding variable selection, BIC selected simpler models than AIC and CV, which selected a similar number of variables (see Figure A7 in Appendix A).
Figure A8 (Appendix A) shows the prediction performance under high-correlation settings. Again, AIC and CV showed comparable prediction accuracy in penalized methods, except in RLASSO, where CV outperformed AIC. In classical methods, AIC and CV produced similar predictions in small sample sizes, but AIC had a slight advantage in large sample sizes. Models tuned with BIC generally showed worse prediction accuracy compared to AIC and CV, particularly in large sample sizes (lower panel).
3.5. Impact of a High Proportion of Noise Variables on Prediction Performance of Approaches
The proportion of noise variables can influence the effectiveness of variable selection methods. Therefore, we evaluated the robustness of approaches under scenarios with a high proportion of noise variables (23 noise and 7 signal variables), focusing on their predictive accuracy and the complexity of the selected models in both small and large sample sizes within low-correlation settings. All methods were tuned using cross-validation.
Figure 5 shows the prediction accuracy (upper panel) and the average number of variables selected by each approach (lower panel). In scenarios with small sample sizes (top-left panel), classical methods (BE and FS) consistently exhibited inferior prediction accuracy across all SNR levels, with FS marginally better. In contrast, the lasso performed best overall, with its advantage most pronounced at low SNR levels. The predictive accuracy of NNG, RLASSO, and ALASSO was comparable, with ALASSO slightly less accurate. In large sample settings (top-right panel), the predictive performance varied with SNR. At low-to-moderate SNR levels (SNR < 1), classical methods were markedly outperformed by penalized methods, which exhibited similar accuracy. At high SNR levels, lasso performed worse, while classical methods achieved the best accuracy. Overall, classical methods selected simpler models, followed by RLASSO, ALASSO, and NNG. Conversely, lasso consistently selected a larger number of variables (lower panel).
3.6. A Summary of the Results for the Entire Simulation
Table 2 provides a structured overview of the key findings from the simulation study, organized according to the study’s objectives. A column titled “Figures” lists the corresponding figures that illustrate the results for each objective. Scenarios with small sample sizes, low SNR, and high correlation are described as having “limited information”, whereas those with large sample sizes, high SNR, and low correlation are considered to have “sufficient information”. These terms are used throughout the table and in the discussion section (Section 5).
Table 2. Summary of results according to the five objectives of the simulation study. Each entry lists the study objective, the key findings, and the corresponding figures.

Objective 1: Effects of initial estimates (OLS, ridge, and lasso) on the prediction accuracy of NNG and ALASSO.
Key findings: In scenarios with limited information, the three initial estimates yielded different predictions, with ridge estimates producing models with superior predictive performance. In scenarios with sufficient information, all three initial estimates produced models with similar predictions, indicating that they can be used interchangeably. In such cases, lasso estimates may be preferred, as they tend to yield simpler models, which are easier to interpret.
Figures: Prediction: Figure 1; model complexity: Figure A1.

Objective 2: Comparison of prediction accuracy of NNG and ALASSO over the models that generated the initial estimates (OLS, ridge, and lasso initial models).
Key findings: NNG and ALASSO consistently outperformed OLS models. They also outperformed ridge and lasso models in scenarios with sufficient information, where minimal shrinkage was required. However, with limited information, they performed worse than the ridge and lasso initial models. The main advantage of NNG and ALASSO is their ability to select simpler models than the initial models, which is beneficial for descriptive modeling, where understanding the relationships between the outcome and covariates is more important than prediction.
Figures: Prediction: Figure 2, Figure A2, and Figure A3; model complexity: Figure A4.

Objective 3: Comparison of prediction accuracy of classical and penalized regression methods.
Key findings: Classical methods (BSS, BE, and FS) showed similar prediction performance in low-correlation settings. In high-correlation settings, BSS and BE remained comparable, while FS had a slight advantage over both BE and BSS. All classical methods were inferior to penalized methods in scenarios with limited information, where shrinkage is often beneficial. Conversely, under sufficient-information scenarios, their predictions were better than those of lasso and comparable to NNG, RLASSO, and ALASSO. In these settings, shrinkage is less beneficial, as parameter estimates exhibit less variability. Among penalized methods, lasso consistently produced the best accuracy in scenarios with limited information but performed worse in sufficient-information scenarios, where NNG, ALASSO, and RLASSO produced superior accuracy. Overall, there was no clear winner among NNG, ALASSO, and RLASSO. The lasso consistently selected a larger number of variables compared to other approaches, while classical methods selected simpler models, especially in scenarios with limited information.
Figures: Prediction: Figure 3 and Figure A5; model complexity: Figure A6.

Objective 4: Influence of model selection criteria (CV, AIC, and BIC) on predictions and complexity of selected models.
Key findings: In scenarios with small sample sizes and high correlation, BIC-tuned models generally exhibited inferior prediction accuracy compared to AIC and CV. However, in scenarios with sufficient information, BIC produced better predictions, except for lasso, where it consistently resulted in inferior accuracy. While BIC performed better in scenarios with sufficient information, it performed worse when models contained several small effects (Figure A9). AIC and CV showed comparable predictions across most methods, with CV slightly outperforming AIC in RLASSO, while AIC had a slight edge in classical methods. BIC-tuned models selected fewer variables on average than AIC and CV across small and large sample sizes, whereas AIC and CV selected a similar number of variables.
Figures: Prediction: Figure 4, Figure A8, and Figure A9; model complexity: Figure A7.

Objective 5: Impact of a high proportion of noise variables on prediction performance and model complexity of approaches.
Key findings: In small sample sizes, classical methods consistently demonstrated poor prediction accuracy because several relevant variables were not selected, whereas lasso performed the best, particularly at low SNR. NNG, RLASSO, and ALASSO showed comparable performance, with ALASSO being slightly inferior at low SNR. In large sample sizes, classical methods yielded inferior predictions at low-to-moderate SNR (SNR < 1) but performed best at high SNR compared to penalized methods. This pattern was consistent with scenarios involving a moderate proportion of noise variables. At low-to-moderate SNR, penalized methods produced comparable predictions, with lasso having a slight edge. However, at high SNR, lasso performed poorly. Regarding variable selection, classical methods selected simpler models, followed closely by RLASSO, while lasso selected a large number of variables. The number of variables selected by NNG and ALASSO was comparable.
Figures: Prediction and model complexity: Figure 5.
4. Example: Respiratory Health Data
The ozone data originate from a study investigating the effects of ozone on school children’s lung growth. The study was conducted from February 1996 to October 1999 in Germany, involving 1101 school children who were initially in the first and second primary school classes (ages 6–8) [57]. Over the four years, lung function measurements were collected three times per year (spring, summer, and autumn), except for spring 1998 [57]. A subset of 496 children with complete data has been used in previous studies to investigate medical issues [57], and as an example in methodological papers to assess the stability of model-building strategies [57] and bootstrap model averaging [58]. The same subset is analyzed here to illustrate various issues relevant to our simulation study.
The outcome variable is forced vital capacity (FVC), which measures the amount of air that can be forcibly exhaled from the lungs after taking the deepest breath possible, with 24 covariates as shown in Table 3. For further details, see Ihorst et al. [57] and Buchholz et al. [59].
To assess the predictive accuracy of the models, we split the dataset into a training set (70%) and a test set (30%). This approach was chosen because it allows for the comparison of the number of variables selected by each method and their corresponding RMSE. The full OLS model with 24 covariates fitted to the training data yielded an R² of 65%, which we refer to as the adequate-information setting. To illustrate how the amount of information in the data affects variable selection and prediction accuracy, we conducted two analyses. The first analysis used the full set of candidate variables (Section 4.1). The second analysis (Section 4.2), representing the inadequate-information setting, removed the three most influential predictors (based on p-values in the full model) from the set of candidate variables, resulting in a training OLS model with 21 variables and an R² of 23%. This range of R² values is similar to our simulation setting, which ranged from 20% to 71%.
4.1. Adequate Information
Table 3 shows the p-values from the full OLS model fitted on the training data with 24 predictors. Three variables (x1–x3) are highly significant and explain about 62% of the total variation, close to the 63% explained by the full model. The table also reports the variance inflation factors (VIFs) for all variables (last column); only three variables (x8, x20, and x22) exhibit moderate multicollinearity, with VIF values between 5 and 10, while the remaining variables have low multicollinearity. In addition, Table 3 shows the variables selected by each method when tuned using CV and BIC, with the variables selected by BIC given in brackets.
All three classical methods (BE, FS, and BSS), as well as RLASSO, selected the same three variables (x1–x3), regardless of the tuning criterion. In contrast, variable selection by NNG, lasso, and ALASSO depended on the tuning method. When tuned using CV, each of these methods selected the same 13 variables, producing more complex models than the classical methods and RLASSO. This suggests that CV may not be suitable when simpler models are preferred. When BIC was used, all three methods selected simpler models with the same three variables (x1–x3) as the classical methods, except for lasso, which included one additional variable (x6). These findings are consistent with the results of our simulation study, where CV tended to favor more complex models, while BIC led to simpler models.
Prediction performance was also compared using RMSE on the test data under both CV (denoted RMSE (CV)) and BIC (denoted RMSE (BIC)) tuning criteria. Despite differences in model complexity, all methods achieved comparable RMSE values. This is not surprising, as the three highly significant variables (x1, x2, and x3) were consistently selected across all approaches, likely contributing to stable predictive accuracy.
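A minimal sketch of this evaluation scheme (with synthetic stand-in data of the same dimensions as the ozone subset; the seed and names are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(2)                   # stand-in for the ozone data:
X = rng.normal(size=(496, 24))                   # 496 children, 24 covariates
y = X[:, :3] @ np.array([0.8, 0.6, 0.4]) + rng.normal(size=496)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)
model = LassoCV(cv=10).fit(X_tr, y_tr)           # CV-tuned lasso on the 70% split
rmse = np.sqrt(np.mean((y_te - model.predict(X_te)) ** 2))
print(f"test RMSE: {rmse:.3f}, selected: {np.sum(model.coef_ != 0)} variables")
```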
4.2. Inadequate Information
As shown in Table 3, variables x1, x2, and x3 accounted for a large proportion of the total variation, with an adjusted R² of 62%. To assess model performance under conditions of low R², these variables were removed from the datasets, and the analysis was repeated. The results are summarized in Table 4. Removing these variables led to a substantial decrease in the adjusted R², from 63% to 18%, highlighting the impact of these covariates on model fit. Their removal also altered the significance of other variables—x15 and x16 became significant, whereas x4 became nonsignificant. In addition, the selected models differed substantially across methods and tuning criteria. A similar strategy, in which these three variables were removed one at a time to demonstrate how variables with strong effects can help stabilize the model selection process, has been used previously [57].
Several key observations were evident under this low-R² setting. First, unlike in the high-R² setting, the variables selected by the classical methods differed when tuned using CV, whereas tuning with BIC resulted in the selection of the same three variables. Second, among the penalized methods, RLASSO produced simpler models with two variables under both tuning criteria. In contrast, NNG and ALASSO each selected eight variables when tuned using CV and three variables when tuned using BIC. Lasso, when tuned using CV, selected nine variables, corresponding to the same eight selected by NNG and ALASSO, with the addition of variable x22. However, when tuned using BIC, lasso selected only two variables, which were identical to those selected by RLASSO.
Third, the prediction accuracy decreased, as indicated by the larger RMSE values reported in Table 4 compared to those in Table 3. This reduction in accuracy is expected, as the exclusion of strong-effect variables negatively affects model performance. The prediction accuracy of all selected models was very similar, with a slight advantage observed for models tuned with CV (RMSE ranging from 0.358 to 0.369 for CV, and from 0.363 to 0.384 for BIC). For practical application, simpler models are generally preferable (e.g., those selected by BIC with two or three variables), but the broader issue concerns the reliability of variable selection procedures when the available information is inadequate to select a sensible model. This issue was further explored in our simulation study, which considered a low-R² scenario with a value of 20%.
5. Discussion
In a large simulation study, we compared the prediction accuracy and model complexity of several popular variable selection methods. Our published protocol [12] included BSS and BE as representatives of long-established classical methods. Although FS was not part of the original protocol, we decided to include it during the analysis phase, as it is also a widely used classical method. We then compared their performance with that of the penalized methods: NNG, lasso, ALASSO, and RLASSO. While RLASSO was originally proposed for high-dimensional data, we included it to assess its performance in low-dimensional data. Our focus on low-dimensional data allowed us to better understand the results. The penalized methods considered can also be applied to high-dimensional data, but among the classical methods, only FS is suitable for such settings. Due to the complexity of the simulation design, which involved seven methods, three model selection criteria, five sets of true regression coefficients, four correlation structures, four SNRs, and two sample sizes, we structured our results into five objectives and summarized the key findings for each. Below, we briefly discuss the results of each objective before drawing our final conclusions.
5.1. Effects of Initial Estimates on the Prediction Accuracy of NNG and ALASSO
We investigated whether the choice of initial estimates (OLS, ridge, and lasso) affects the prediction accuracy of NNG and ALASSO in scenarios with limited and sufficient information. Under limited information, the three initial estimates produced models with different predictions, with ridge initial estimates performing best. The superiority of ridge estimates over lasso and OLS may be explained by the limitations of the latter two: lasso may exclude important variables through its variable selection in the first stage, whereas OLS often exhibits high variability. These drawbacks can adversely affect the prediction performance of the NNG and ALASSO models [11,15,17]. In contrast, when sufficient information was available, all three initial estimates produced models with comparable predictions, which agrees with the findings reported by Kipruto and Sauerbrei [19] for low-correlation settings with high R². This is because sufficient information enhances the estimation accuracy of both the initial estimates and the tuning parameters.
Our findings suggest that, if the primary aim of the analysis is prediction, ridge initial estimates may serve as an alternative to OLS initial estimates in both low- and high-correlation settings, aligning with the findings of Makalic and Schmidt [60]. However, when the goal is to obtain simpler, more interpretable models for descriptive purposes, lasso estimates may be preferable, especially when the data contain sufficient information to allow lasso to accurately select important variables in the first stage. This is crucial because, once a variable is eliminated in the first stage, it cannot be reintroduced into the NNG or ALASSO model in the second stage [49].
5.2. Comparison of Prediction Accuracy of NNG and ALASSO over Initial Models
We compared the prediction accuracies of the NNG and ALASSO models with those of the models used to generate the initial estimates (OLS, ridge, and lasso). The results showed that NNG and ALASSO consistently outperformed OLS models, which is not surprising given that OLS models estimate all parameters without shrinkage, increasing the risk of overfitting and leading to higher prediction errors on new data [3,10]. These findings are consistent with those of Yuan and Lin [17], who demonstrated that NNG is effective in improving on the initial estimator in terms of both variable selection and estimation accuracy.

However, when ridge and lasso estimates were used as initial estimates, NNG and ALASSO often produced worse predictions than the ridge and lasso initial models in scenarios with high correlation, as well as in small sample sizes with low-to-moderate SNR (SNR < 1). In such settings, distinguishing between relevant and irrelevant variables is challenging [3], and NNG and ALASSO may select models with too few variables, resulting in underfitting and poor generalization [10,17]. Conversely, in scenarios with sufficient information, where minimal shrinkage is required [6], NNG and ALASSO outperformed the ridge and lasso initial models. In these settings, ridge or lasso shrinkage can be excessive, leading to suboptimal predictions [15,40].
Overall, our findings suggest that, while NNG and ALASSO consistently outperformed OLS models, they do not always outperform ridge and lasso initial models in data with small sample sizes, high correlation, and low SNR. In such cases, any modeling strategy may be limited, and a more appropriate course of action may be to describe the data and recommend conducting a more informative study. However, when sufficient information is available, NNG and ALASSO enhance both the variable selection and prediction accuracy of initial models.
5.3. Comparison of Prediction Accuracy of Classical and Penalized Methods Using CV as Criterion
We evaluated the prediction accuracy of classical variable selection methods (BE, BSS, and FS) and penalized regression methods (NNG, lasso, ALASSO, and RLASSO), using CV as the model selection criterion.
BSS, BE, and FS performed quite similarly in most scenarios; however, in certain scenarios, such as large sample sizes with high correlation and high SNR, BE and FS had a slight edge (see Figure A5 in Appendix A). This difference can be attributed to the nature of BSS, which considers a much larger set of models than BE and FS. While this extensive search may seem advantageous, it can lead to overfitting and high variance in coefficient estimates, resulting in poor generalization [4,43]. FS outperformed BE in high-correlation settings with low-to-moderate SNR. This advantage is likely due to the sequential nature of FS, which begins with a simple model and gradually adds variables. In contrast, BE starts with a more complex model and removes variables, which may be less effective when information is limited and the risk of overfitting is high. Mantel [46] demonstrated the advantages of BE over FS and concluded that BE is more appropriate than FS. This is not consistent with our findings, likely because that evaluation focused on variable selection rather than prediction performance and did not consider a broader range of scenarios.
Overall, the predictive accuracy of classical methods was inferior to that of penalized methods, especially in settings with small sample sizes, high correlation among covariates, and low SNR (i.e., SNR < 1), which aligns with previous findings [40]. In such scenarios, classical methods are prone to instability and high variance due to their discrete nature of either retaining or discarding variables, unlike penalized methods, which apply continuous shrinkage to coefficient estimates [9,21,40]. Some researchers [41] have argued against the use of classical variable selection methods, citing their limitations in settings with a large number of potential predictors. However, our findings indicate that classical methods can outperform modern variable selection methods in both predictive accuracy and model simplicity under certain settings. This observation is supported by other studies [24,40,42], which have reported similar results under comparable conditions.
To improve the prediction performance of classical methods, stacking and post-estimation shrinkage have been proposed [42,61]. Instead of selecting a single best model for prediction, stacking combines linear predictors from various models with different numbers of covariates and estimates optimal weights for each linear predictor [10,61]. While this approach can improve prediction, it sacrifices interpretability compared to selecting a single model [10]. Alternatively, post-estimation shrinkage applies shrinkage (estimated via CV) to the regression estimates of the best-selected model [22,38,42,62]. Although stacking and post-estimation shrinkage both offer ways to enhance prediction accuracy, they are beyond the scope of this study.
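Although beyond the scope of the study, the post-estimation idea is simple to sketch. One common variant estimates a single global shrinkage factor from the cross-validated linear predictor of the selected model (an illustrative implementation, assuming `X_sel` holds only the selected covariates):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

def global_shrinkage_factor(X_sel, y, n_splits=10):
    """Estimate c by regressing y on the out-of-fold linear predictor."""
    lp = np.empty(len(y))
    for tr, te in KFold(n_splits, shuffle=True, random_state=0).split(X_sel):
        fit = LinearRegression().fit(X_sel[tr], y[tr])
        lp[te] = fit.predict(X_sel[te])          # cross-validated predictions
    c = LinearRegression().fit(lp.reshape(-1, 1), y).coef_[0]
    return c  # shrunken slopes: c * beta_ols; intercept re-estimated afterwards
```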
In settings with large sample sizes, low correlation, and high SNR, classical methods outperformed lasso, which is consistent with previous findings [24,40]. In these scenarios, classical methods produced predictions comparable to those of NNG, RLASSO, and ALASSO. Lasso selected too many variables (Figure A6), which, combined with its tendency to over-shrink large effects, led to inferior prediction accuracy despite the availability of sufficient information [15,24,40]. These results are in agreement with findings from [24,40,63].
Among the penalized methods, lasso consistently outperformed NNG, ALASSO, and RLASSO in scenarios with small sample sizes and low SNR, as well as in high-correlation settings. The performance of ALASSO and NNG depends on initial estimates, which are less reliable when the data contain limited information, thus negatively affecting their prediction accuracy. RLASSO has two tuning parameters, which also require sufficient information for accurate estimation. These factors may explain why lasso is often preferred in challenging conditions [51]. Conversely, in scenarios with low correlation and high SNR, NNG, ALASSO, and RLASSO outperformed the lasso, for the reasons previously explained.
In summary, lasso performed best in prediction accuracy in scenarios with limited information, whereas classical methods as well as NNG, ALASSO, and RLASSO performed best in scenarios with sufficient information. Additionally, NNG, ALASSO, and RLASSO outperformed classical methods in scenarios with low SNR. Among NNG, ALASSO, and RLASSO, no single approach consistently outperformed the others. In scenarios with sufficient information, NNG, ALASSO, RLASSO, or classical methods may be used, as they yield similar predictions. However, classical methods may be preferred due to their tendency to produce simpler models.
5.4. Influence of Model Selection Criteria on Predictions and Model Complexity
We compared the predictive performance of models tuned using three model selection criteria: CV, AIC, and BIC. Our findings showed that the performance of these criteria varies according to sample size, SNR, and the correlation among covariates.
In scenarios with limited information, BIC generally led to worse prediction accuracy than AIC and CV. This is because BIC’s strong penalty for model complexity often results in models with too few variables (Figure A7), leading to higher prediction errors. These findings corroborate those from other related studies on variable selection [64]. In contrast, when sufficient information was available, BIC often outperformed AIC and CV in accuracy. However, this was not the case for the lasso model, where BIC consistently resulted in inferior predictions. This likely stems from lasso’s use of a single tuning parameter to simultaneously control both variable selection and shrinkage [16]. Since BIC tends to favor larger tuning parameters, it can cause excessive shrinkage, leading to suboptimal model performance.
The initial estimates in ALASSO and NNG, as well as the auxiliary parameter in RLASSO, help reduce the amount of shrinkage applied to coefficients. This can improve both model selection and prediction accuracy [16,40]. Furthermore, in scenarios with several small effects (beta-type B), BIC-tuned models performed poorly even when the data contained sufficient information (Figure A9), as BIC tends to eliminate small effects and thereby underfit the model. This observation aligns with findings by Burnham and Anderson [35], who reported that BIC performs best when only large effects are present, whereas AIC is more effective when both large and small effects exist.
AIC and CV performed similarly in most scenarios, which is consistent with their shared objective of selecting models that generalize well to unseen data [35]. Stone [65] provided a theoretical basis, showing that under certain conditions, the model selected by leave-one-out CV (a special case of k-fold CV) asymptotically converges to the model selected by AIC.
Regarding model complexity, BIC generally selected models with fewer variables than AIC and CV, as expected due to its more stringent penalty for model complexity. AIC and CV selected a comparable number of variables.
It is important to note that there is a link between model selection criteria and model selection tests (i.e., hypothesis tests comparing different models). Teräsvirta and Mellin [66] demonstrated that using model selection criteria to select models is akin to selecting a model at a certain significance level. Under certain assumptions, they derived approximate significance levels for different model selection criteria. For AIC, the asymptotic significance level for including a variable in a model corresponds to 0.157, derived from the upper tail area of the chi-squared distribution with one degree of freedom, where the cutoff value is a = 2. In contrast, for BIC, the significance level is a function of the sample size, where a = ln(n); for example, the significance levels for n = 100 and n = 400 are approximately 0.032 and 0.014, respectively. These smaller significance levels explain BIC’s tendency to select simpler models compared to AIC.
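These quoted levels are easy to verify numerically from the upper tail of the chi-squared distribution with one degree of freedom:

```python
# Quick numerical check of the approximate significance levels quoted above.
import numpy as np
from scipy.stats import chi2

print(chi2.sf(2, df=1))             # AIC cutoff a = 2: ~0.157
print(chi2.sf(np.log(100), df=1))   # BIC, n = 100 (a = ln 100): ~0.032
print(chi2.sf(np.log(400), df=1))   # BIC, n = 400 (a = ln 400): ~0.014
```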
5.5. Impact of a High Proportion of Noise Variables on Prediction Accuracy and Model Complexity of Approaches
We assessed the robustness of variable selection methods in the presence of a relatively large number of noise variables. The results were similar to those obtained with a lower proportion of noise variables. In contrast, previous simulation studies [9,21] have reported that subset selection methods are sensitive to the proportion of noise variables, which does not align with our findings. This discrepancy is likely because we added only 15 noise variables, which may have been too few to show substantial differences in performance across methods. The number of noise variables added was constrained by the computational demands of best subset selection (BSS).
5.6. Application of Variable Selection Methods
Variable selection is an important topic in all scientific fields that involve empirical data analysis, including areas such as biomedical research, economics, epidemiology, sociology, and engineering [3]. The need for variable selection is especially relevant when analysts are faced with many potential predictors but lack sufficient subject-matter knowledge to prespecify which variables are most relevant for inclusion in the model [3]. While our background and example come from health research, the underlying concepts are broadly applicable across many other research fields.

In our example, we used the adjusted R² to obtain a rough estimate of the explained variation associated with the selected variables. We acknowledge that using this metric after variable selection is a longstanding concern in the statistical community due to the severe biases it can introduce [67]. Therefore, we used it only for model comparison.

To evaluate predictive accuracy, we employed a data-splitting approach rather than cross-validation or bootstrapping. While resampling methods have been increasingly advocated due to the limitations of data splitting [3], we preferred the latter. This choice allowed us to directly compare the number of variables selected, which is more challenging to assess through cross-validation, as different variables can be selected in each fold.
5.7. Limitation of the Simulation Study
A limitation of our work is that the simulation study did not cover all possible types of scenarios that may occur in real-world applications and did not incorporate coefficient settings from real-world data. In addition, we considered only classical linear models, assuming linearity, no interactions, no outliers or influential points, and homoscedasticity. Including these issues would have further improved the reliability of the method recommendations.
6. Conclusions
In this study, we compared the prediction accuracy of classical and penalized methods across a range of scenarios. Classical methods generally exhibited inferior accuracy, except in scenarios with sufficient information (large sample size, low correlation, and high SNR), where their predictions were comparable to those of penalized methods. Among the penalized approaches, lasso performed best when information was limited (small sample size, high correlation, and low SNR). In contrast, NNG, ALASSO, and RLASSO performed similarly and outperformed lasso in scenarios with sufficient information.
We also evaluated the effects of initial estimates (OLS, ridge, and lasso) on the prediction accuracy and model complexity of NNG and ALASSO. Ridge initial estimates produced superior predictive performance compared to OLS and lasso initial estimates, suggesting that ridge estimates may be preferable when prediction is the primary goal. Conversely, lasso initial estimates produced simpler models, making them suitable for descriptive purposes, particularly when data contain sufficient information.
Furthermore, we assessed model performance under different tuning criteria. Models tuned with AIC and CV produced comparable prediction accuracy and generally outperformed those tuned with BIC. However, BIC was most effective in scenarios with sufficient information (excluding scenarios with several small effects) across all approaches, except for lasso, where it consistently underperformed. BIC produced simpler models compared to AIC and CV, which is advantageous when simplicity and practical usability are key considerations.
Overall, our findings indicate that no single method performs best in all scenarios. The choice of method depends on the amount of information in the data and the criteria used for selecting tuning parameters. Lasso is the preferred choice in settings with limited information due to its superior predictive accuracy. However, when sufficient information is available, classical methods are preferred, as they provide better predictive performance and select simpler models.
7. Directions for Future Research
In this study, we focused only on low-dimensional settings, as certain methods, such as BE, are not applicable in high-dimensional settings. Future research should extend this comparison to include both classical (FS and BSS) and penalized methods in high-dimensional settings, particularly under conditions of both sufficient and limited information, to evaluate the generalizability of our findings.
Additionally, we did not examine the robustness of the methods to violations of standard model assumptions, such as heteroscedasticity and non-normal error distributions, nor did we assess the impact of influential points or outliers. In real-data analysis, it is highly relevant to check for influential points or extreme values, which can easily be identified and handled manually [68]. However, identifying influential points or outliers in simulation studies is challenging due to the large number of repetitions. Therefore, we followed a preventive strategy by truncating values at the 1st and 99th percentiles, as recommended in the literature [68]. Several robust extensions have been proposed, including the robust nonnegative garrote [69], robust adaptive lasso, and robust lasso [70]. These methods were not included in our evaluation. Evaluating the performance of both the standard and robust methods under such challenging conditions would offer valuable insights into their practical reliability and broader applicability.
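For completeness, the preventive truncation mentioned above amounts to per-covariate winsorization, sketched below (thresholds taken from the data themselves; the function name is illustrative):

```python
import numpy as np

def truncate_percentiles(X, lower=1, upper=99):
    """Clip each column of X to its 1st and 99th percentiles."""
    lo = np.percentile(X, lower, axis=0)
    hi = np.percentile(X, upper, axis=0)
    return np.clip(X, lo, hi)   # values outside [lo, hi] are set to the bounds
```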
Our simulations did not incorporate interaction terms or nonlinear functional forms for continuous variables. Future research could explore how these methods perform when such complexities are included in the data-generating process. A simulation study protocol addressing nonlinearity using methods also considered in our work has been published [71], and we are currently working on a paper investigating the linearity assumption and its influence on prediction accuracy in multivariable models.
Although the comparisons in this study were conducted within the framework of classical linear regression, the methods evaluated are also applicable in generalized linear models (e.g., for binary outcomes) and survival models. Therefore, further studies should investigate whether the findings hold across these alternative model types. Ongoing work by Ullmann et al. [71] is evaluating the performance of several variable selection methods, including some considered here, in both linear and logistic regression models. Such complementary research may provide additional evidence on the applicability and robustness of these methods in broader modeling contexts.
Finally, other variable selection methods, such as smoothly clipped absolute deviation (SCAD), elastic net, boosting, and minimax concave penalty (MCP), were not considered. Future research could compare their performance with the methods evaluated here.