The One Standard Error Rule for Model Selection: Does It Work?

Abstract: Previous research has provided extensive discussion of how to select the regularization parameter when regularization methods are applied to high-dimensional regression. The popular "One Standard Error Rule" (1se rule) used with cross validation (CV) selects the most parsimonious model whose prediction error is not much worse than the minimum CV error. This paper examines the validity of the 1se rule from a theoretical angle and also studies its estimation accuracy and its performance in regression estimation and variable selection, particularly for Lasso in a regression framework. Our theoretical result shows that when a regression procedure produces an estimator that converges relatively fast to the true regression function, the standard error estimation formula in the 1se rule is justified asymptotically. The numerical results show the following: 1. the 1se rule in general does not provide a good estimate of the intended standard deviation of the cross validation error; the estimation bias can be 50–100% upwards or downwards in various situations; 2. the results tend to support that the 1se rule usually outperforms the regular CV in sparse variable selection and alleviates the over-selection tendency of Lasso; 3. in regression estimation or prediction, the 1se rule often performs worse. In addition, comparisons are made over two real data sets: Boston Housing Prices (large sample size n, small/moderate number of variables p) and Bardet–Biedl data (large p, small n). Data guided simulations are conducted to provide insight into the relative performance of the 1se rule and the regular CV.


Background
Resampling and subsampling methods are widely utilized in many statistical and machine learning applications. In high-dimensional regression learning, for instance, bootstrap and subsampling play important roles in quantifying model selection uncertainty and in other model selection diagnostics (see, e.g., [1] for references). Cross-validation (CV) (as a subsampling tool) and closely related methods have been proposed to assess the quality of variable selection in terms of F- and G-measures ( [2]) and to determine variable importance ( [3]).
In this paper, we focus on examining cross-validation as used for model selection. A core issue is how to properly quantify the variability in subsampling and the associated evaluations, which is a challenging issue that, to the best of our knowledge, is yet to be solved. In variable and model selection applications, a popular practice is to use the "one standard error rule". However, its validity hinges crucially on the quality of the standard error formula used in the approach. Our aim in this work is to study the standard error estimation issue and its consequences for variable selection and regression estimation.
In this section, we provide a background on tuning parameter selection for high-dimensional regression.

Regularization Methods
Regularization methods are now widely used to tackle the curse of dimensionality in high-dimensional regression analysis. By imposing a penalty on the complexity of the solution, they can solve an ill-posed problem or prevent overfitting. Examples include Lasso ( [4]) and Ridge regression ( [5,6]), which add an ℓ1 and an ℓ2 penalty, respectively, on the coefficient estimates to the usual least-squares objective function.
Specifically, given a response vector y ∈ R^n, a matrix X ∈ R^{n×p} of predictor variables and an intercept α, the Lasso estimate is defined as

β̂_lasso = argmin_{α,β} ||y − α1 − Xβ||_2^2 + λ||β||_1,

and the ridge estimate is defined as

β̂_ridge = argmin_{α,β} ||y − α1 − Xβ||_2^2 + λ||β||_2^2.

The tuning parameter λ controls the strength of the penalty. As is well known, the nature of the ℓ1 penalty causes some coefficients to be shrunken exactly to zero, while the ℓ2 penalty can shrink all of the coefficients towards zero but typically does not set any of them exactly to zero. Thus, an advantage of Lasso is that it makes the results simpler and more interpretable. In this work, ridge regression will be considered when regression estimation (i.e., accurate estimation of the regression function) or prediction is the goal.
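The contrast between the two penalties can be made concrete with a minimal numerical sketch (Python with NumPy). It uses the simplifying assumption of an orthonormal design, under which both estimators have closed forms: the ℓ1 penalty soft-thresholds the least-squares coefficients, setting small ones exactly to zero, while the ℓ2 penalty only rescales them. The coefficient values below are hypothetical.

```python
import numpy as np

def lasso_orthonormal(b_ols, lam):
    """Lasso solution when X has orthonormal columns:
    soft-threshold each OLS coefficient at lam/2
    (for the objective ||y - X b||_2^2 + lam * ||b||_1)."""
    return np.sign(b_ols) * np.maximum(np.abs(b_ols) - lam / 2.0, 0.0)

def ridge_orthonormal(b_ols, lam):
    """Ridge solution when X has orthonormal columns:
    shrink every OLS coefficient by the same factor
    (for the objective ||y - X b||_2^2 + lam * ||b||_2^2)."""
    return b_ols / (1.0 + lam)

b_ols = np.array([3.0, -0.4, 0.1, -2.0])       # hypothetical OLS coefficients
b_lasso = lasso_orthonormal(b_ols, lam=1.0)    # zeros out the two small coefficients
b_ridge = ridge_orthonormal(b_ols, lam=1.0)    # shrinks all, none exactly zero
```

This is why Lasso yields sparse, interpretable fits while ridge keeps all predictors in the model.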

Tuning Parameter Selection
The regularization parameter, or tuning parameter, determines the extent of penalization. A traditional way to select a model is to use an information criterion such as the Akaike information criterion (AIC) ( [7,8]) or the Bayesian information criterion (BIC) and its extensions ( [9][10][11]). A more commonly used approach for tuning parameter selection is cross-validation ( [12][13][14]). For recent work on cross-validation for consistent model selection in high-dimensional regression, see [15].
The idea of K-fold cross validation is to split the data into K roughly equal-sized parts F_1, . . . , F_K. For the k-th part, k = 1, 2, . . . , K, we train on (x_i, y_i), i ∉ F_k, and evaluate on (x_i, y_i), i ∈ F_k. For each value of the tuning parameter λ ∈ {λ_1, . . . , λ_m}, a set of candidate values, we compute the estimate f̂_λ^{(−k)} on the training set and record the total error on the validation set:

CV_k(λ) = Σ_{i∈F_k} (y_i − f̂_λ^{(−k)}(x_i))^2.

For each tuning parameter value λ, we compute the average error over all folds:

CV(λ) = (1/K) Σ_{k=1}^K CV_k(λ).

We then choose the value of the tuning parameter that minimizes the above average error, λ_min = argmin_{λ∈{λ_1,...,λ_m}} CV(λ).
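A minimal sketch of this procedure in Python with NumPy follows. For illustration it uses the closed-form ridge estimator rather than Lasso, and the data, coefficient vector, and candidate grid are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, K = 100, 5, 5
X = rng.standard_normal((n, p))
y = X @ np.array([2.0, -1.0, 0.0, 0.0, 0.5]) + rng.standard_normal(n)

lambdas = np.array([0.01, 0.1, 1.0, 10.0, 100.0])   # candidate values
folds = np.array_split(rng.permutation(n), K)       # F_1, ..., F_K

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate (no intercept, for simplicity)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# cv_err[k, j]: total validation error on fold k for lambda_j
cv_err = np.empty((K, len(lambdas)))
for k, F_k in enumerate(folds):
    train = np.setdiff1d(np.arange(n), F_k)
    for j, lam in enumerate(lambdas):
        beta = ridge_fit(X[train], y[train], lam)       # fit without fold k
        cv_err[k, j] = np.sum((y[F_k] - X[F_k] @ beta) ** 2)

cv = cv_err.mean(axis=0)               # CV(lambda): average over folds
lam_min = lambdas[np.argmin(cv)]       # minimizer of the CV curve
```

The matrix of per-fold errors `cv_err` is also exactly what the one standard error rule, discussed next, operates on.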

Goal for the One Standard Error Rule
The "one standard error rule" (1se rule) for model selection was first proposed for selecting the right sized tree ( [16]). It is also suggested for picking the most parsimonious model within one standard error of the minimum cross validation error ( [17]). Such a rule acknowledges the fact that the bias-variance trade-off curve is estimated with error, and hence takes a conservative approach: a smaller model is preferred when the prediction errors are more or less indistinguishable. To estimate the standard deviation of CV(λ) at each λ ∈ {λ_1, . . . , λ_m}, one computes the sample standard deviation of the validation errors CV_1(λ), . . . , CV_K(λ):

sd(λ) = sqrt( (1/(K−1)) Σ_{k=1}^K (CV_k(λ) − CV(λ))^2 ).

Then, we estimate the standard deviation of CV(λ), which is declared the standard error of CV(λ), by

SE(λ) = sd(λ)/√K.

The organization of the rest of the paper is as follows. Section 2 examines the validity of the 1se rule from a theoretical perspective. Section 3 describes the objectives and approaches of our numerical investigations. Section 4 presents the experimental results on the 1se rule in three different aspects. Section 5 applies the 1se rule to two real data sets and examines its performance via data guided simulations. Section 6 concludes the paper. Appendix A gives additional numerical results and Appendix B provides the proof of the main theorem.
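The 1se selection itself can be sketched as follows (Python with NumPy). The fold-error numbers below are purely hypothetical; the grid is ordered so that a larger λ means a more parsimonious model.

```python
import numpy as np

# Hypothetical K x m matrix of fold errors CV_k(lambda_j), K = 5 folds,
# over an increasing penalty grid (larger lambda = sparser model).
lambdas = np.array([0.01, 0.1, 1.0, 10.0])
cv_err = np.array([[10.0, 9.0, 9.1, 14.0],
                   [11.0, 9.5, 9.0, 15.0],
                   [10.5, 8.5, 8.9, 13.5],
                   [ 9.5, 9.0, 9.2, 14.5],
                   [10.0, 8.0, 8.8, 13.0]])
K = cv_err.shape[0]

cv = cv_err.mean(axis=0)                        # CV(lambda_j)
se = cv_err.std(axis=0, ddof=1) / np.sqrt(K)    # SE(lambda_j) = sd / sqrt(K)

j_min = int(np.argmin(cv))
threshold = cv[j_min] + se[j_min]               # minimum CV error plus one SE
# most parsimonious (largest-lambda) model within one SE of the minimum
j_1se = max(j for j in range(len(lambdas)) if cv[j] <= threshold)
lam_min, lam_1se = lambdas[j_min], lambdas[j_1se]
```

In this toy example the rule moves the choice from λ_min = 0.1 to the sparser λ_1se = 1.0, since the latter's CV error lies within one standard error of the minimum.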

When Is the Standard Error Formula Valid?
The 1se rule makes intuitive sense, but its validity is not clear at all. The issue is that the CV errors computed on different folds are not independent, which invalidates the standard error formula for the sample mean of i.i.d. observations. Note that the validity of the standard error is equivalent to the validity of the sample standard deviation expressed in Equation (6). Thus, it suffices to study whether, and to what extent, the usual sample variance properly estimates the targeted theoretical variance.
In this section, we investigate the legitimacy of the standard error formula for a general learning procedure. We focus on the regression case, but similar results hold for classification as well.
Let δ be a learning procedure that produces the regression estimator f̂_δ(x; D_n) based on the data D_n = (x_i, y_i)_{i=1}^n. Throughout the section, we assume K is given and, for simplicity, that n is a multiple of K. Define CV_k(δ) = Σ_{i∈F_k} (y_i − f̂_δ(x_i; D_n^{(−k)}))^2, where D_n^{(−k)} denotes the observations in D_n that do not belong to F_k and k = 1, 2, . . . , K.
The SE of the CV errors of the learning procedure δ is defined as

SE(δ) = sd(CV_1(δ), . . . , CV_K(δ))/√K.

To study the SE formula, given that K is fixed, it is basically equivalent to study the sample variance formula, i.e.,

S_n(δ) = (1/(K−1)) Σ_{k=1}^K (CV_k(δ) − CV(δ))^2, where CV(δ) = (1/K) Σ_{k=1}^K CV_k(δ).

The validity of these formulas (in an asymptotic sense) hinges on the correlation between CV_{k_1}(δ) and CV_{k_2}(δ) for k_1 ≠ k_2 approaching zero as n → ∞.

Definition 1. The sample variance S_n(δ) is said to be asymptotically relatively unbiased (ARU) if

E[S_n(δ)]/Var(CV_1(δ)) → 1 as n → ∞.

Clearly, if the property above holds, the SE formula is properly justified in the sense that the relative estimation bias is negligible. Then, the 1se rule is sensible.
For p > 0, define the L_p-norm

||g||_p = (E_{P_X} |g(X)|^p)^{1/p},

where P_X denotes the probability distribution of X from which X_1, X_2, . . . , X_n are drawn. Let f̂_δ,n denote the estimator f̂_δ(x; D_n).
Remark 1. The condition of the theorem holds if the regression procedure is based on a parametric model under regularity conditions. For instance, if we consider a linear model with p_0 terms, then under sensible conditions (e.g., [15], p. 111), E(||f − f̂_δ,n||^4) = O(p_0^2/n^2), and consequently, as long as p_0 = o(√n), the variance estimator S_n(δ) is asymptotically relatively unbiased.

Remark 2.
The above result suggests that when relatively simple models are considered, the 1se rule may be reasonably applicable. In the same spirit, when sparsity oriented variable selection methods are considered as candidate learning methods, if the most relevant selected models are quite sparse, the 1se rule may be proper.

An Illustrative Example
Our theorem shows that the standard error formula is asymptotically valid when the correlation between the CV errors on different folds is relatively small, which is related to the rate of convergence of the regression procedure. In this section, we illustrate how this dependence affects the validity of the standard error formula in a concrete example. To highlight the dependence among the CV errors on different folds, we consider a very simple setting in which the true regression function is a constant. The simplicity allows us to calculate the variance and covariance of the CV errors explicitly and thereby see the essence. The insight gained, however, is more broadly applicable.
The true data generating model is Y = θ + ε, where θ ∈ R is unknown. Given data D_n, for an integer r that divides n, the regression procedure considers only the observations with indices r, 2r, . . . , n (n/r of them) and takes the corresponding sample mean of Y to estimate f(x) ≡ θ.
Suppose l = n/(Kr) is an integer. Then, when applying the K-fold CV, the training sample size is (K − 1)n/K, but only (K − 1)l observations are actually used to estimate f. Let W_i, W_j denote the CV errors on the i-th and j-th folds, 1 ≤ i, j ≤ K. Then, it can be shown that

Var(W_i) ≍ n + n^2/l^2 and Cov(W_i, W_j) ≍ n^2/l^2 (9)

for i ≠ j, where ≍ denotes the same order. Therefore, the standard error formula is ARU if and only if l/√n → ∞. In this illustrative example, when l is small (recall that only l observations per fold are used for training), the regression estimators obtained from excluding one fold at a time are highly dependent, which makes the sample variance formula problematic. In contrast, when l is large, e.g., l = n/K (i.e., all the observations are used), the dependence of the CV errors at different folds becomes negligible compared to the randomness in the response. In the extreme case of l = n/K, we have Cov(W_i, W_j)/Var(W_i) = O(1/n) for i ≠ j. Note also that the choice of l determines the convergence rate of the regression estimator, i.e., E(||f − f̂_δ,n||_2^2) is of order 1/l. Obviously, with large l, the estimator converges fast, which dilutes the dependence of the CV errors and helps the sample variance formula to be more reliable.
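The effect of l can be checked by direct simulation. The Python sketch below uses the same toy model with θ = 0 and σ = 1, approximating the fold-k training estimator by the mean of the first l retained observations from each of the other K − 1 folds (a simplifying stand-in for the subsampling scheme above), and compares the empirical correlation between fold errors for small and large l.

```python
import numpy as np

def fold_errors(rng, n=120, K=4, l=1):
    """One draw of the K CV errors W_1, ..., W_K for the constant model
    Y = theta + eps (theta = 0), where only l observations per fold
    are retained for training."""
    y = rng.standard_normal(n)
    folds = np.arange(n).reshape(K, n // K)
    W = np.empty(K)
    for k in range(K):
        # training set: the first l retained observations of every other fold
        train = np.concatenate([folds[j][:l] for j in range(K) if j != k])
        theta_hat = y[train].mean()
        W[k] = np.sum((y[folds[k]] - theta_hat) ** 2)
    return W

rng = np.random.default_rng(1)
reps = 4000
W_small = np.array([fold_errors(rng, l=1) for _ in range(reps)])
W_large = np.array([fold_errors(rng, l=30) for _ in range(reps)])

corr_small = np.corrcoef(W_small[:, 0], W_small[:, 1])[0, 1]  # l small: strong dependence
corr_large = np.corrcoef(W_large[:, 0], W_large[:, 1])[0, 1]  # l = n/K: near zero
```

With l = 1, the fold errors are noticeably correlated, so the i.i.d.-based sample variance formula is biased; with l = n/K, the correlation is close to zero and the formula becomes trustworthy, consistent with the orders in (9).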

Numerical Research Objectives and Setup
The one standard error rule is intended to improve the usual cross-validation method stated in Section 1.2, where λ is chosen to minimize the CV error. However, to the best of our knowledge, no empirical or theoretical studies have been conducted to date to justify the application of the one standard error rule in favor of parsimony. Several interesting open questions deserve investigation.

1. Does the 1se formula itself provide a good estimate of the standard deviation of the cross validation error CV(λ), as intended?
2. Does the model selected by the 1se rule (the model with λ_1se) typically outperform the model selected by minimizing the CV error (the model with λ_min) in variable selection?
3. What if estimating the regression function or prediction is the goal?
In this paper, we study the estimation accuracy of the one standard error rule and its application in regression estimation and variable selection under various regression scenarios, including both simulations and real data examples.

Simulation Settings
Our numerical investigations are conducted in a regression framework where the simulated response is generated by linear models. Both dependent and independent predictors are considered. The data generating process is as follows.
Denote the sample size by n and the total number of predictors by p. First, the predictors are generated from normal distributions. In the independent case, the vector of explanatory variables (X_1, X_2, . . . , X_p) has i.i.d. N(0, 1) components; in the dependent case, the vector follows the N(0, Σ) distribution, where Σ is either Σ_1 with (Σ_1)_{ij} = ρ^{|i−j|} (AR(1) correlation) or Σ_2 with (Σ_2)_{ii} = 1 and (Σ_2)_{ij} = ρ for i ≠ j (constant correlation), with 0 < ρ < 1. Then, the response vector y ∈ R^n is generated by

y = Xβ + ε,

where β is the chosen coefficient vector (to be specified later) and the random error vector ε has mean 0 and covariance matrix σ^2 I_{n×n} for some σ^2 > 0.
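A sketch of this data generating process in Python with NumPy follows, with Σ_1 taken as AR(1) correlation and Σ_2 as constant correlation; the particular n, p, β, and ρ values are illustrative only.

```python
import numpy as np

def make_sigma(p, rho, kind="ar1"):
    """Covariance matrix of the predictors: AR(1) (Sigma_1) or
    constant correlation (Sigma_2)."""
    if kind == "ar1":
        idx = np.arange(p)
        return rho ** np.abs(idx[:, None] - idx[None, :])
    return np.full((p, p), rho) + (1.0 - rho) * np.eye(p)

def simulate(n, p, beta, rho=0.9, sigma=1.0, kind="ar1", seed=0):
    """Generate X with N(0, Sigma) rows and y = X beta + eps."""
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(make_sigma(p, rho, kind))
    X = rng.standard_normal((n, p)) @ L.T
    y = X @ beta + sigma * rng.standard_normal(n)
    return X, y

beta = np.concatenate([np.ones(5), np.zeros(25)])   # q = 5 nonzero out of p = 30
X, y = simulate(n=100, p=30, beta=beta)
```

The Cholesky factor turns i.i.d. standard normals into rows with the desired correlation structure.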

Regression Estimation
For estimation of the regression function, we intend to minimize the estimation risk. Given the data, we consider a regression method and its loss. For Lasso, we define

β̂_min = argmin_β ||y − Xβ||_2^2 + λ||β||_1,

where λ is chosen by the regular CV, i.e., λ = λ_min. We let β̂_1se be the parameter estimate obtained from the simplest model whose CV error is within one standard error of the minimum CV error, i.e., λ = λ_1se. Given an estimate β̂, the loss for estimating the regression function is

L(β̂) = E_x[(x^T β̂ − x^T β)^2],

where β is the true parameter vector and the expectation is taken with respect to the randomness of x, which follows the same distribution as in the data generation. The loss can be approximated by

L̂(β̂) = (1/J) Σ_{j=1}^J (z_j^T β̂ − z_j^T β)^2,

where z_j, j = 1, . . . , J, are i.i.d. with the same distribution as x in the data generation and J is large (in the following numerical exercises, we set J = 500). In the case of using real data to compare λ_1se and λ_min, it is clearly not possible to compute the estimation loss as described above. We will instead use cross-validation and data guided simulations (DGS) to compare λ_1se and λ_min. See Section 4 for details.
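The Monte Carlo approximation of the estimation loss can be sketched as follows (Python with NumPy). Here z is drawn from N(0, I) purely for illustration, because under that choice the exact loss is ||β̂ − β||_2^2 and the approximation can be checked directly; the coefficient vectors are hypothetical.

```python
import numpy as np

def estimation_loss_mc(beta_hat, beta_true, rng, J=500):
    """Approximate E_x[(x' beta_hat - x' beta_true)^2] by averaging
    over J fresh draws z_1, ..., z_J of the predictor vector."""
    p = len(beta_true)
    Z = rng.standard_normal((J, p))          # z_j ~ N(0, I), as an illustration
    return np.mean((Z @ (beta_hat - beta_true)) ** 2)

rng = np.random.default_rng(0)
beta_true = np.array([1.0, -1.0, 0.0, 0.0])
beta_hat = np.array([0.9, -1.1, 0.1, 0.0])
loss_mc = estimation_loss_mc(beta_hat, beta_true, rng, J=20000)
exact = np.sum((beta_hat - beta_true) ** 2)   # exact loss when x ~ N(0, I)
```

For a general predictor covariance Σ, the exact loss is the quadratic form (β̂ − β)^T Σ (β̂ − β), and the same Monte Carlo average converges to it.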

Variable Selection
Parameter value specification can substantially affect the accuracy of variable selection by Lasso. In our simulation study, we investigate the performance of the model with λ_min and that with λ_1se by comparing their probabilities of accurate variable selection. When considering real data, since we do not know the true data generating process, we employ data guided simulations to provide more insight. Let S denote the set of variables in the true model and Ŝ the set of variables selected by a method. In the DGS over real data sets, we define the symmetric difference S △ Ŝ as the set of variables that are either in S but not in Ŝ or in Ŝ but not in S. We use the size of the symmetric difference to measure the performance of a variable selection method. The models with λ_min and λ_1se will be compared this way in the data examples, where some plausible models based on the data are used to generate the data.
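The symmetric-difference measure is straightforward to compute; a minimal Python sketch with hypothetical index sets:

```python
# Hypothetical example: true support S and selected sets under the two rules.
S = {1, 2, 3, 4, 5}
S_hat_min = {1, 2, 3, 4, 5, 7, 9}   # lambda_min over-selects two extras
S_hat_1se = {1, 2, 3, 4}            # lambda_1se misses one true variable

# |S symmetric-difference S_hat|: variables in exactly one of the two sets
err_min = len(S ^ S_hat_min)   # counts the 2 false positives
err_1se = len(S ^ S_hat_1se)   # counts the 1 false negative
```

The measure penalizes over-selection and under-selection symmetrically, with 0 indicating exact recovery of the true model.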
The organization of the numerical results is as follows. Section 4 presents the experimental results on the estimation accuracy of the 1se rule as well as its performance in regression estimation and variable selection. Section 5 applies the 1se rule to two real data sets and investigates its performance relative to the regular CV via DGS. Some additional figures and tables are in Appendix A. All R code used in this paper is available upon request.

Is 1se on Target of Estimating the Standard Deviation?
This subsection presents a simulation study of the estimation accuracy of the 1se formula under different parameter settings in three sampling settings, varying the value of the correlation ρ in the design matrix, the error variance σ^2, the sample size n, and the number of cross validation folds K.

Data Generating Processes (DGP)
We consider the following three sampling settings, in which we first simulate the dependent (Σ_1 with ρ = 0.9) or independent design matrix X ∈ R^{n×p} and then generate the response y = Xβ + ε with either a decaying or a constant coefficient vector β. Recall that p is the total number of predictors; let q denote the number of nonzero coefficients.

DGP 1: Y is generated by 5 predictors (out of 30; q = 5, p = 30).
DGP 2: Y is generated by 20 predictors (out of 30; q = 20, p = 30).
DGP 3: Y is generated by 20 predictors (out of 1000; q = 20, p = 1000).
In these three sampling settings, the errors are i.i.d. N(0, σ^2). The first and third configurations yield sparse models, and the third configuration also represents the case of high-dimensional data.

Procedure
We apply the Lasso and Ridge procedures to the simulated data sets, with the tuning parameter selected by the two cross validation versions. The simulation results for various parameter choices are summarized in Tables A1–A4 in Appendix A. The simulation choices are sample size n ∈ {100, 1000}, error variance σ^2 ∈ {0.01, 1, 4} and number of cross validation folds K ∈ {2, 5, 10, 20}.

1. Do the j-th simulation (j = 1, . . . , N): simulate a data set of sample size n given the parameter values.
2. Apply K-fold cross validation and record the K fold errors of the cross validation.
3. Calculate the total cross validation error, CV.err_j(λ), by taking the mean of these K numbers. Find λ_min that minimizes the total CV error; we take λ = λ_min throughout the remaining steps. Calculate the standard error of the cross validation error, SE_j(λ), by the one standard error rule: SE_j(λ) = sd(CV_1(λ), . . . , CV_K(λ))/√K.
4. Repeat steps 1 to 3 N times (N = 100). Calculate the standard deviation of the cross validation errors CV.err_j(λ) over the N replications, denoted SD(CV.error), and the mean standard error of the cross validation error (as used for the 1se rule), SE(CV.error) = (1/N) Σ_j SE_j(CV.error).
5. Performance assessment: calculate the ratio of the (claimed) standard error over the (simulated true) standard deviation, SE(CV.error)/SD(CV.error).

If the 1se formula works well, the ratio should be close to 1. However, the results in Tables A1–A4 show that, in some cases, the standard deviation is more than 100% overestimated or more than 50% underestimated. The histograms of the ratios (Figures A1 and A2) also show that the standard error estimates are not close to their targets. Not surprisingly, 2-fold CV is particularly untrustworthy. But even for the common choice of 10-fold CV with independent predictors and constant coefficients, the standard error estimate can be 40% above or below the target. In addition, there is no obvious pattern for when the 1se formula performs well. Overall, this simulation study provides evidence that the 1se formula fails to provide a good estimate as intended (see [18] for more discussion of the standard error issues).

Is 1se Rule Better for Regression Estimation?
Since the standard error estimation in the 1se rule is not the end goal, its poor performance in the previous subsection does not necessarily mean that the 1se rule is poor for regression estimation and variable selection. In this part, we study the performance of the model with λ_1se in regression estimation and compare it with that of the model with λ_min. A large number of samples are drawn so that we can tell in which cases the 1se rule consistently outperforms the regular CV in regression estimation.

By simulation, we compare λ_1se and λ_min in terms of accuracy in estimating the regression function. We consider several factors: the nonzero coefficients β^(1), the number of zero coefficients p − q, the sample size n, the standard deviation of the noise σ, and the dependence among the predictors.

1. Specify the parameter values of the data generating process.
2. Each time, randomly simulate a training set of n observations from the data generating process.
3. Apply the 1se rule (10-fold cross validation) to the training set and record the selected models: the model with λ_1se and the model with λ_min. Calculate the estimation losses of these two models: (1/J) Σ_{i=1}^J (z_i^T β̂ − z_i^T β)^2, where β̂ is based on λ_1se or λ_min, respectively, and the z_i are independently generated from the same distribution used to generate X.
4. Repeat this process M times (M = 500) and calculate the fraction of replications in which the model with λ_min has a smaller loss.
In our simulations, we make various comparisons: small or large nonzero coefficients; small or large error variance; small or large sample size; low- or high-dimensional data; independent or dependent design matrix. The parameter values used in our study are displayed in Table 1 below. The results are reported in Tables A5 and A6. Note that when λ_min and λ_1se give identical selections over all M = 500 runs, the entry in the table is NaN.

The simulation results show that, in general, for design matrices with either AR(1) or constant correlation, the model with λ_min tends to provide better regression estimation (smaller estimation errors), especially when the data are generated with relatively large coefficients. In addition, we consider two more settings below.

Is 1se Rule Better for Variable Selection?
There are two main issues with the application of the 1se rule to variable selection. One is that the accuracy of variable selection by Lasso itself is very sensitive to the parameter values in the data generating process. In various situations (especially when the covariates are correlated in complicated ways, as in gene expression data), Lasso may perform very poorly in accurate variable selection. The second issue is that the λ_min and λ_1se returned by the glmnet function are the same in many cases, in which case there is no difference in the variable selection results. The cases where λ_min = λ_1se are eliminated in the following general study. To examine the real difference between λ_min and λ_1se, we consider the difference in their probabilities of selecting the true model given that they have distinct selection outcomes.
1. Randomly simulate a data set of sample size n given the parameter values.
2. Perform variable selection with λ_min and λ_1se on the simulated data set, respectively. If the two returned models are the same, discard the result and go back to step 1; otherwise, record their variable selection results.
3. Repeat the above process M times (M = 500) and calculate the fractions of correct variable selection, denoted P_c(λ_1se) and P_c(λ_min), respectively. We report the difference P_c(λ_1se) − P_c(λ_min); a positive value means the 1se rule is better.
We adopt the same framework and parameter settings (Table 1) as used for regression estimation. The results show that λ_1se tends to outperform λ_min in most cases when the data are generated with an AR(1) correlation design matrix (see Tables A7 and A8). With a constant correlation design matrix, however, the two methods show practically no difference in variable selection.
Below, we consider several special cases in which the model with λ_min still deserves consideration, although it does only slightly better, namely when the error variance is large, the nonzero coefficients are small, and no extra variables exist to disturb the selection process.

Case 1: Constant Coefficients
In this case, let q = 5, p − q ∈ {0, 1, 2, 3, 4}, σ^2 = 4, n ∈ {100, 1000}, ρ = 0.01 and the common nonzero coefficient β ∈ [0.5, 1.3] with Σ_1. The results are shown in Figure 3. For a small sample size, say n = 100, the model with λ_min has a higher probability of correct variable selection when the constant coefficient lies in (0.5, 1). The largest difference appears when p − q = 0, not surprisingly. As the common coefficient value grows beyond 1, the difference P_c(λ_1se) − P_c(λ_min) increases from negative to positive, which indicates the reverse: the model with λ_1se now works better. However, for a large sample size, say n = 1000, the model with λ_1se significantly outperforms the model with λ_min except when p − q = 0.

Data Examples
In this section, the performance of the 1se rule and the regular CV in regression estimation and variable selection is examined on real data sets: Boston Housing Price (an example with n > p) and Bardet-Biedl data (an example with p > n).
Boston Housing Price. The data were collected by [19] to study whether clean air influenced the value of houses in Boston. The data set consists of 506 observations and 14 non-constant variables. Of these, medv is the response variable, while the other 13 variables serve as possible predictors.
Bardet-Biedl. The gene expression data are from the microarray experiments on mammalian eye tissue samples of [20]. The data consist of 120 rats with 200 gene probes and the expression level of the TRIM32 gene. It is of interest to find, by regression analysis, which gene probes best explain the expression level of TRIM32.

Regression Estimation
To test which method does better in regression estimation, we perform both cross validation and data guided simulation (DGS) on these two real data sets as follows.

1. Randomly select n_1 observations from the data set as the training set and use the rest as the validation set, with n_1 ∈ {100, 200, 400} for Boston Housing Price and n_1 ∈ {40, 80} for Bardet-Biedl.
2. Apply K-fold cross validation, K ∈ {5, 10, 20} for Boston Housing Price and K ∈ {5, 10} for Bardet-Biedl, to the training set and compute the mean squared prediction errors of the models with λ_1se and λ_min on the validation set.
3. Repeat the above process 500 times and compute the proportion of times each method gives the better prediction.
The results (Tables 2 and 3) show that, for all training sizes n_1 considered, the proportion of times the model with λ_min does better is about 52~57% for Boston Housing Price and 56~59% for Bardet-Biedl. The consistent slight advantage of λ_min agrees with our earlier simulation results in supporting λ_min as the better method for regression estimation.

We also compare the two methods in regression estimation via DGS, as follows.

1. Obtain the models with λ_min and λ_1se by K-fold cross validation, K ∈ {5, 10, 20}, on the data, along with the coefficient estimates β̂_min, β̂_1se, the standard deviation estimates σ̂_min, σ̂_1se, and the estimated responses (fitted values) ŷ_min,1, ŷ_1se,1.
2. Simulate a new response vector by adding random errors generated from N(0, σ̂^2) to the fitted values, under each of the two scenarios (based on the fit with λ_min and the fit with λ_1se, respectively).
3. Apply Lasso with λ_min and λ_1se (using the same K-fold CV) to the new data set (i.e., the new response and the original design matrix) and obtain the new estimated responses ŷ_min,2 and ŷ_1se,2 for each of the two scenarios.
4. Repeat the above resampling process 500 times and compute the proportion of better estimation for each method.
The estimation results (Table 4) show that, overall, the proportion of better estimation for the model with λ_min is between 46.8% and 52.8%, and for the model with λ_1se it is between 47.2% and 53.2%. Therefore, based on the DGS, in terms of regression estimation accuracy, there is not much difference between λ_min and λ_1se: the observed proportions are not significantly different from 50% at the 0.05 level.

Variable Selection: DGS
To test which model does better in variable selection, we perform the DGS on the above real data sets. Since exact variable selection might not be achieved, especially for high-dimensional data, we use the symmetric difference |S △ Ŝ|, the number of variables either in S but not in Ŝ or in Ŝ but not in S, to measure performance. Below is the algorithm for the DGS.

1. Apply 10-fold (default) cross validation to the real data set and select the "true" set of variables S: S_min or S_1se, by the model with λ_min or by the model with λ_1se, respectively.
2. Do least squares estimation by regressing the response on the selected set of variables, and obtain the estimated response ŷ and the residual standard error σ̂.
3. Simulate a new response y_new by adding an error term randomly generated from N(0, σ̂^2) to ŷ. Apply K-fold cross validation, K ∈ {5, 10, 20}, to the simulated data set (i.e., y_new and the original design matrix) and select the set of variables Ŝ: Ŝ_min or Ŝ_1se, by the model with λ_min or with λ_1se. Repeat this process 500 times.
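The core resampling step of the DGS can be sketched as follows (Python with NumPy; the design matrix, the selected set, and the coefficients are hypothetical stand-ins for those obtained from a real data set).

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 8
X = rng.standard_normal((n, p))
y = X[:, :3] @ np.array([1.0, -1.0, 0.5]) + rng.standard_normal(n)

S = [0, 1, 2]                                    # selected "true" set of variables
X_S = np.column_stack([np.ones(n), X[:, S]])     # least squares with intercept
coef, *_ = np.linalg.lstsq(X_S, y, rcond=None)
y_fit = X_S @ coef
# residual standard error with n - |S| - 1 degrees of freedom
sigma_hat = np.sqrt(np.sum((y - y_fit) ** 2) / (n - len(S) - 1))

# one DGS replication: new response = fitted values + fresh noise
y_new = y_fit + sigma_hat * rng.standard_normal(n)
```

Each replication then reruns the selection procedure on (X, y_new), so the selection variability can be assessed under a plausible, data-driven truth.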
The distributions of the symmetric difference are shown in the appendix (Figures A3 and A4 for Boston Housing Price; Figures A5 and A6 for Bardet-Biedl). The mean of |Ŝ| is reported in Tables 5 and 6. For Boston Housing Price, there are 13 candidate variables, while the model with λ_min selects 11 variables and the model with λ_1se selects 8 variables on average, i.e., the mean of |Ŝ_min| is 11 and the mean of |Ŝ_1se| is 8. For Bardet-Biedl, there are 200 candidate variables, while the model with λ_min selects 21 variables and the model with λ_1se selects 19 variables on average, i.e., the mean of |Ŝ_min| is 21 and the mean of |Ŝ_1se| is 19.

On average, the 1se rule does better in variable selection, since the mean of |Ŝ_1se| is closer to |S|. In the case of n > p, if we want a perfect or near perfect variable selection result (i.e., |S △ Ŝ| ≤ 1), then the 1se rule is a good choice, despite the fact that its selection results are less stable: the range of |S △ Ŝ_1se| is larger than that of |S △ Ŝ_min|. In the case of n < p, both models fail to provide perfect selection. But overall the 1se rule is more accurate and stable, given that |S △ Ŝ_1se| has a smaller mean and range.

Conclusions
The one standard error rule was proposed to pick the most parsimonious model within one standard error of the minimum CV error. Although it is widely used as a conservative alternative in cross validation, there has been no evidence confirming that the 1se rule can consistently outperform the regular CV.
Our theoretical result shows that the standard error formula is asymptotically valid when the regression procedure converges relatively fast to the true regression function. Our illustrative example also shows that when the regression estimator converges slowly, the CV errors on the different evaluation folds may be highly dependent, which unfortunately makes the usual sample variance (or sample standard deviation) formula problematic. In such a case, the use of the 1se rule may easily lead to a choice of inferior candidates.
Our numerical results offer finite-sample understanding. First, the simulation study casts doubt on the accuracy of the 1se formula in estimating the standard deviation of the cross validation error. In some cases, the 1se formula yields roughly 100% overestimation or severe underestimation, depending on the number of cross validation folds and the dependence among the predictors. For example, it tends to overestimate when used with 2-fold cross validation with dependent predictors, but tends to underestimate when used with 20-fold cross validation with independent predictors.
Second, the performance of the 1se rule and the regular CV in regression estimation and variable selection is compared via both simulated and real data sets. In general, the 1se rule often performs better in variable selection for sparse modeling; on the other hand, it often does worse in regression estimation.
While our theoretical and numerical results clearly challenge indiscriminate use of the 1se rule in general model selection, its parsimonious nature may still be appealing when a sparse model is highly desirable. For the real data sets, the regular CV does better than the 1se rule in regression estimation, but the difference is small, and there is almost no difference in estimation accuracy between the two methods when regression estimation is assessed by the DGS. Overall, considering model simplicity and interpretability, the 1se rule would be a better choice here.
Future studies are encouraged to provide more theoretical understanding of the 1se rule. It may also be studied numerically in more complex settings (e.g., when used to compare several nonlinear/nonparametric classification methods) than those in this work. In a larger context, since it is well known that penalty-based methods for model selection are often unstable (e.g., [1,21,22]), alternative methods to strengthen the reliability and reproducibility of variable selection may be applied (see the references in the above mentioned papers and [23]).

Table A7. Variable selection: P_c(λ_1se) − P_c(λ_min), AR(1) correlation.

Remark A1. Depending on whether ρ = 0, ρ > 0, or ρ < 0, the sample variance is unbiased for, underestimates, or overestimates the variance of W_i. For a general learning procedure, the correlation between the test errors from different folds can be positive or negative, which makes the estimated standard error based on the usual sample variance formula underestimate or overestimate the true standard error. This is seen in our numerical results.

Remark A2.
Note that for the sample mean W̄, we have Var(W̄) = Var(W_1)/K + ρ Var(W_1)(K − 1)/K. Thus, for Var(W_1)/K to be asymptotically valid (in approximation) for Var(W̄) in the sense that their ratio approaches 1, we must have ρ → 0, which is also the key requirement for the sample variance formula to be asymptotically valid (ARU).
For the first part, by the Cauchy-Schwarz inequality, |Cov(C, R_{2,1})| ≤ √(Var(C) Var(R_{2,1})), and the other terms in the earlier expression of Cov(W_1, W_2) can be bounded similarly. Next, we calculate Var(C); the calculation is basically the same as for Var(W_k).