Article

The One Standard Error Rule for Model Selection: Does It Work?

1 Carlson School of Management, University of Minnesota, Minneapolis, MN 55455, USA
2 School of Statistics, University of Minnesota, Minneapolis, MN 55455, USA
* Author to whom correspondence should be addressed.
Stats 2021, 4(4), 868-892; https://doi.org/10.3390/stats4040051
Submission received: 25 September 2021 / Revised: 28 October 2021 / Accepted: 1 November 2021 / Published: 5 November 2021
(This article belongs to the Special Issue Re-sampling Methods for Statistical Inference of the 2020s)

Abstract

Previous research has provided extensive discussion on the selection of regularization parameters in applications of regularization methods to high-dimensional regression. The popular “one standard error rule” (1se rule) used with cross validation (CV) selects the most parsimonious model whose prediction error is not much worse than the minimum CV error. This paper examines the validity of the 1se rule from a theoretical angle and also studies its estimation accuracy and its performance in applications to regression estimation and variable selection, particularly for Lasso in a regression framework. Our theoretical result shows that when a regression procedure produces a regression estimator converging relatively fast to the true regression function, the standard error estimation formula in the 1se rule is justified asymptotically. The numerical results show the following: 1. the 1se rule in general does not necessarily provide a good estimate of the intended standard deviation of the cross validation error; the estimation bias can be 50–100% upwards or downwards in various situations; 2. the results tend to support that the 1se rule usually outperforms the regular CV in sparse variable selection and alleviates the over-selection tendency of Lasso; 3. in regression estimation or prediction, the 1se rule often performs worse. In addition, comparisons are made over two real data sets: Boston Housing Prices (large sample size n, small/moderate number of variables p) and Bardet–Biedl data (large p, small n). Data guided simulations are done to provide insight on the relative performances of the 1se rule and the regular CV.

1. Background

Resampling and subsampling methods are widely used in many statistical and machine learning applications. In high-dimensional regression learning, for instance, bootstrap and subsampling play important roles in quantifying model selection uncertainty and in other model selection diagnostics (see, e.g., [1] for references). Cross-validation (CV) (as a subsampling tool) and closely related methods have been proposed to assess the quality of variable selection in terms of F- and G-measures ([2]) and to determine variable importance ([3]).
In this paper, we focus on the examination of cross-validation as used for model selection. A core issue is how to properly quantify the variability in subsampling and the associated evaluations, a challenging issue that, to the best of our knowledge, is yet to be solved. In applications of variable and model selection, a popular practice is to use the “one standard error rule”. However, its validity hinges crucially on the goodness of the standard error formula used in the approach. Our aim in this work is to study the standard error estimation issue and its consequences for variable selection and regression estimation.
In this section, we provide a background on tuning parameter selection for high-dimensional regression.

1.1. Regularization Methods

Regularization methods are now widely used to tackle the curse of dimensionality in high-dimensional regression analysis. By imposing a penalty on the complexity of the solution, regularization can solve an ill-posed problem or prevent overfitting. Examples include Lasso ([4]) and ridge regression ([5,6]), which add an $\ell_1$ and an $\ell_2$ penalty on the coefficient estimates, respectively, to the usual least-squares objective function.
Specifically, given a response vector $y \in \mathbb{R}^n$, a matrix $X \in \mathbb{R}^{n \times p}$ of predictor variables and a vector of intercepts $\alpha$, the Lasso estimate is defined as
$$\hat{\beta}^{lasso} = \operatorname*{argmin}_{\beta \in \mathbb{R}^p} \ \|y - \alpha - X\beta\|_2^2 + \lambda \sum_{j=1}^{p} |\beta_j|,$$
and the ridge estimate is defined as
$$\hat{\beta}^{ridge} = \operatorname*{argmin}_{\beta \in \mathbb{R}^p} \ \|y - \alpha - X\beta\|_2^2 + \lambda \sum_{j=1}^{p} \beta_j^2.$$
The tuning parameter $\lambda$ controls the strength of the penalty. As is well known, the nature of the $\ell_1$ penalty causes some coefficients to be shrunken exactly to zero, while the $\ell_2$ penalty can shrink all of the coefficients towards zero but typically does not set any of them exactly to zero. Thus, an advantage of Lasso is that it makes the results simpler and more interpretable. In this work, ridge regression will be considered when regression estimation (i.e., when the focus is on accurate estimation of the regression function) or prediction is the goal.
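For concreteness, here is a minimal sketch in R (our own illustration, assuming the glmnet package): the two estimators are fit by the same routine and differ only in the elastic-net mixing parameter alpha.

```r
# A minimal sketch (ours): Lasso and ridge via glmnet; alpha = 1 gives the l1
# penalty, alpha = 0 the l2 penalty.
library(glmnet)

set.seed(1)
n <- 100; p <- 30
X <- matrix(rnorm(n * p), n, p)
y <- drop(X[, 1:5] %*% rep(1, 5)) + rnorm(n)

fit_lasso <- glmnet(X, y, alpha = 1)  # some coefficients set exactly to zero
fit_ridge <- glmnet(X, y, alpha = 0)  # all coefficients shrunken, none exactly zero

coef(fit_lasso, s = 0.1)              # coefficient estimates at lambda = 0.1
coef(fit_ridge, s = 0.1)
```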

1.2. Tuning Parameter Selection

The regularization parameter, or tuning parameter, determines the extent of penalization. A traditional way to select a model is to use an information criterion such as the Akaike information criterion (AIC) ([7,8]) or the Bayesian information criterion (BIC) and its extensions ([9,10,11]). A more commonly used approach for tuning parameter selection is cross-validation ([12,13,14]). For recent work on cross-validation for consistent model selection for high-dimensional regression, see [15].
The idea of K-fold cross validation is to split the data into K roughly equal-sized parts $F_1, \ldots, F_K$. For the k-th part, $k = 1, 2, \ldots, K$, we train on $(x_i, y_i), i \notin F_k$, and evaluate on $(x_i, y_i), i \in F_k$. For each value of the tuning parameter $\lambda \in \{\lambda_1, \ldots, \lambda_m\}$, a set of candidate values, we compute the estimate $\hat{f}_\lambda^{(-k)}$ on the training set and record the total error on the validation set:
$$CV_k(\lambda) = \sum_{i \in F_k} \big(y_i - \hat{f}_\lambda^{(-k)}(x_i)\big)^2.$$
For each tuning parameter value $\lambda$, we compute the average error over all folds:
$$CV(\lambda) = \frac{1}{K} \sum_{k=1}^{K} CV_k(\lambda).$$
We then choose the value of the tuning parameter that minimizes the above average error,
$$\hat{\lambda}_{min} = \operatorname*{argmin}_{\lambda \in \{\lambda_1, \ldots, \lambda_m\}} CV(\lambda).$$
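The K-fold procedure can be written out directly. The following sketch is our own illustration (with glmnet as the fitting routine and a user-supplied decreasing grid of candidate $\lambda$ values), not the authors' code:

```r
# Our sketch of K-fold CV for lambda selection; `lambdas` is a decreasing grid
# of candidate values and glmnet is the fitting routine.
library(glmnet)

cv_lambda_min <- function(X, y, lambdas, K = 10) {
  folds <- sample(rep(1:K, length.out = nrow(X)))   # random split into F_1, ..., F_K
  cv_k <- matrix(NA, K, length(lambdas))            # CV_k(lambda), one row per fold
  for (k in 1:K) {
    fit <- glmnet(X[folds != k, ], y[folds != k], lambda = lambdas)
    pred <- predict(fit, newx = X[folds == k, ], s = lambdas)
    cv_k[k, ] <- colSums((y[folds == k] - pred)^2)  # total error on fold k
  }
  lambdas[which.min(colMeans(cv_k))]                # lambda minimizing CV(lambda)
}

# Reusing X and y from the previous sketch:
lambda_min <- cv_lambda_min(X, y, lambdas = exp(seq(0, -6, length.out = 50)))
```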

1.3. Goal for the One Standard Error Rule

The “one standard error rule” (1se rule) for model selection was first proposed for selecting the right-sized tree ([16]). It has also been suggested for picking the most parsimonious model within one standard error of the minimum cross validation error ([17]). Such a rule acknowledges the fact that the bias-variance trade-off curve is estimated with error and hence takes a conservative approach: a smaller model is preferred when the prediction errors are more or less indistinguishable. To estimate the standard deviation of $CV(\lambda)$ at each $\lambda \in \{\lambda_1, \ldots, \lambda_m\}$, one computes the sample standard deviation of the validation errors $CV_1(\lambda), \ldots, CV_K(\lambda)$:
$$SD(\lambda) = \sqrt{\widehat{Var}\big(CV_1(\lambda), \ldots, CV_K(\lambda)\big)},$$
where $\widehat{Var}$ denotes the sample variance. Then, we estimate the standard deviation of $CV(\lambda)$, which is declared the standard error of $CV(\lambda)$, by
$$SE(\lambda) = SD(\lambda)/\sqrt{K}.$$
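This rule is what cv.glmnet implements. The sketch below (ours) shows the packaged selections lambda.min and lambda.1se, together with the 1se selection recomputed explicitly from the CV means (cvm) and claimed standard errors (cvsd):

```r
# The 1se rule as packaged in cv.glmnet (our sketch, reusing X and y from above).
library(glmnet)

cvfit <- cv.glmnet(X, y, nfolds = 10)
cvfit$lambda.min   # minimizer of the CV curve
cvfit$lambda.1se   # most parsimonious lambda within one SE of the minimum

# The same 1se selection written out from cvm (CV means) and cvsd (claimed SEs):
i_min      <- which.min(cvfit$cvm)
threshold  <- cvfit$cvm[i_min] + cvfit$cvsd[i_min]
lambda_1se <- max(cvfit$lambda[cvfit$cvm <= threshold])
```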
The organization of the rest of the paper is as follows. Section 2 examines the validity of the 1se rule from a theoretical perspective. Section 3 describes the objectives and approaches of our numerical investigations. Section 4 presents the experimental results on the 1se rule in three different aspects. Section 5 applies the 1se rule to two real data sets and examines its performance via data guided simulations. Section 6 concludes the paper. Appendix A gives additional numerical results and Appendix B provides the proof of the main theorem in the paper.

2. Theoretical Result on the One-Standard-Error Rule

2.1. When Is the Standard Error Formula Valid?

The 1se rule makes intuitive sense, but its validity is not at all clear. The issue is that the CV errors computed at different folds are not independent, which invalidates the standard error formula for the sample mean of i.i.d. observations. Note that the validity of the standard error is equivalent to the validity of the sample standard deviation expressed in Equation (6). Thus, it suffices to study whether and to what extent the usual sample variance properly estimates the targeted theoretical variance.
In this section, we investigate the legitimacy of the standard error formula for a general learning procedure. We focus on the regression case, but similar results hold for classification as well.
Let $\delta$ be a learning procedure that produces the regression estimator $\hat{f}_\delta(x; D_n)$ based on the data $D_n = (x_i, y_i)_{i=1}^n$. Throughout the section, we assume K is given and, for simplicity, that n is a multiple of K. Define $CV_k(\delta) = \sum_{i \in F_k} \big(y_i - \hat{f}_\delta(x_i; D_n^{(-k)})\big)^2$, where $D_n^{(-k)}$ denotes the observations in $D_n$ that do not belong to $F_k$, $k = 1, 2, \ldots, K$.
The SE of the CV errors of the learning procedure $\delta$ is defined as
$$SE(\delta) = \sqrt{\frac{1}{K}\cdot\frac{1}{K-1}\sum_{k=1}^{K}\left(CV_k(\delta) - \frac{1}{K}\sum_{k'=1}^{K}CV_{k'}(\delta)\right)^2}.$$
Since K is fixed, studying the SE formula is basically equivalent to studying the sample variance formula, i.e.,
$$S_n(\delta) = \frac{1}{K-1}\sum_{k=1}^{K}\left(CV_k(\delta) - \frac{1}{K}\sum_{k'=1}^{K}CV_{k'}(\delta)\right)^2.$$
The validity of these formulas (in an asymptotic sense) hinges on the correlation between $CV_{k_1}(\delta)$ and $CV_{k_2}(\delta)$ for $k_1 \neq k_2$ approaching zero as $n \to \infty$.
Definition 1.
The sample variance $S_n(\delta)$ is said to be asymptotically relatively unbiased (ARU) if
$$\frac{E\,S_n(\delta) - Var(CV_k(\delta))}{Var(CV_k(\delta))} \to 0$$
as $n \to \infty$.
Clearly, if the property above holds, the SE formula is properly justified in terms of the relative estimation bias being negligible. Then, the 1se rule is sensible.
For $p > 0$, define the $L_p$-norm
$$\|f\|_p = \left(\int |f(x)|^p \, P_X(dx)\right)^{1/p},$$
where $P_X$ denotes the probability distribution of X from which $X_1, X_2, \ldots, X_n$ are drawn. Let $\hat{f}_{\delta,n}$ denote the estimator $\hat{f}_\delta(x; D_n)$.
Theorem 1.
If $E(\|f - \hat{f}_{\delta,n}\|_4^4) = o(\tfrac{1}{n})$, then $S_n(\delta)$ is asymptotically relatively unbiased.
Remark 1.
The condition $E(\|f - \hat{f}_{\delta,n}\|_4^4) = o(\tfrac{1}{n})$ holds if the regression procedure is based on a parametric model under regularity conditions. For instance, if we consider a linear model with $p_0$ terms, then under sensible conditions (e.g., [15], p. 111), $E(\|f - \hat{f}_{\delta,n}\|_4^4) = O(\tfrac{p_0^2}{n^2})$ and consequently, as long as $p_0 = o(\sqrt{n})$, the variance estimator $S_n(\delta)$ is asymptotically relatively unbiased.
Remark 2.
The above result suggests that when relatively simple models are considered, the 1se rule may be reasonably applicable. In the same spirit, when sparsity oriented variable selection methods are considered as candidate learning methods, if the most relevant selected models are quite sparse, the 1se rule may be proper.

2.2. An Illustrative Example

Our theorem shows that the standard error formula is asymptotically valid when the correlation between the CV errors at different folds is relatively small, which is related to the rate of convergence of the regression procedure. In this subsection, we illustrate how the dependence affects the validity of the standard error formula in a concrete example. For ease in highlighting the dependence among the CV errors at different folds, we consider a deliberately simple setting, where the true regression function is a constant. The simplicity allows us to calculate the variance and covariance of the CV errors explicitly and see the essence. The insight gained, however, is more broadly applicable.
The true data generating model is $Y = \theta + \epsilon$, where $\theta \in \mathbb{R}$ is unknown. Given data $D_n$, for an integer r that divides n, the regression procedure considers only every r-th observation ($n/r$ observations in total) and takes the corresponding sample mean of Y to estimate $f(x) = \theta$.
Suppose $l = \frac{n}{Kr}$ is an integer. Then, when applying the K-fold CV, the training sample size is $(K-1)\frac{n}{K}$, but only $(K-1)l$ observations are actually used to estimate f. Let $W_i, W_j$ denote the CV errors on the i-th and j-th folds, $1 \le i, j \le K$. Then, it can be shown that
$$Var(W_i) \asymp n\left(1 + \frac{n}{l^2}\right)$$
and
$$Cov(W_i, W_j) \asymp \frac{n^2}{l^2}$$
for $i \neq j$, where $\asymp$ denotes the same order.
Therefore, the standard error formula is ARU if and only if $l/\sqrt{n} \to \infty$.
In this illustrative example, when l is small (recall that only $(K-1)l$ of the training observations are actually used), the regression estimators obtained by excluding one fold at a time are highly dependent, which makes the sample variance formula problematic. In contrast, when l is large, e.g., $l = \frac{n}{K}$ (i.e., r = 1 and all the observations are used), the dependence of the CV errors at different folds becomes negligible compared to the randomness in the response. In the extreme case of $l = \frac{n}{K}$, we have
$$\rho(W_i, W_j) \asymp \frac{1}{n}$$
for $i \neq j$.
Note also that the choice of l determines the convergence rate of the regression estimator, i.e., $E(\|f - \hat{f}_{\delta,n}\|_2^2)$ is of order $1/l$. Obviously, with large l, the estimator converges fast, which dilutes the dependence of the CV errors and helps the sample variance formula to be more reliable.
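A small Monte Carlo check (our own sketch, with $\theta = 0$ and K = 2 for simplicity) makes the dependence visible: with the every-r-th-observation estimator, the correlation between the fold errors is far from negligible when l is small.

```r
# Sketch: estimate the correlation between the CV errors on two folds in the
# constant-mean example (our illustration; theta = 0, K = 2).
set.seed(1)
n <- 200; K <- 2; r <- 20                 # l = n/(K*r) = 5: few observations used
reps <- 5000
W <- matrix(NA, reps, 2)
for (s in 1:reps) {
  y <- rnorm(n)                           # Y = theta + eps with theta = 0
  folds <- rep(1:K, each = n / K)
  for (k in 1:K) {
    train <- y[folds != k]
    est <- mean(train[seq(r, length(train), by = r)])  # use every r-th training obs
    W[s, k] <- sum((y[folds == k] - est)^2)            # CV error on fold k
  }
}
cor(W[, 1], W[, 2])   # noticeably positive for small l; near 0 when r = 1
```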

3. Numerical Research Objectives and Setup

The one standard error rule is intended to improve the usual cross-validation method stated in Section 1.2, where $\lambda$ is chosen to minimize the CV error. However, to date and to the best of our knowledge, no empirical or theoretical studies have been conducted to justify the application of the one standard error rule in favor of parsimony. There are some interesting open questions that deserve investigation.
(1)
Does the 1se itself provide a good estimate of the standard deviation of the cross validation error $CV(\lambda)$, as intended?
(2)
Does the model selected by the 1se rule (the model with $\lambda_{1se}$) typically outperform the model selected by minimizing the CV error (the model with $\lambda_{min}$) in variable selection?
(3)
What if estimating the regression function or prediction is the goal?
In this paper, we study the estimation accuracy of the one standard error rule and its application in regression estimation and variable selection under various regression scenarios, including both simulations and real data examples.

3.1. Simulation Settings

Our numerical investigations are done in a regression framework where the simulated response is generated by linear models. Both dependent and independent predictors are considered. The data generating process is as follows.
Denote the sample size by n and the total number of predictors by p. First, the predictors are generated from normal distributions. In the independent case, the vector of explanatory variables $(X_1, X_2, \ldots, X_p)$ has i.i.d. N(0,1) components; in the dependent case, the vector follows a $N(0, \Sigma)$ distribution, where $\Sigma$ is either
$$\Sigma_1 = \begin{pmatrix} 1 & \rho & \cdots & \rho^{p-1} \\ \rho & 1 & \cdots & \rho^{p-2} \\ \vdots & \vdots & \ddots & \vdots \\ \rho^{p-1} & \rho^{p-2} & \cdots & 1 \end{pmatrix} \quad \text{or} \quad \Sigma_2 = \begin{pmatrix} 1 & \rho & \cdots & \rho \\ \rho & 1 & \cdots & \rho \\ \vdots & \vdots & \ddots & \vdots \\ \rho & \rho & \cdots & 1 \end{pmatrix},$$
where $0 < \rho < 1$. Then, the response vector $y \in \mathbb{R}^n$ is generated by
$$y = X\beta + \epsilon,$$
where $\beta$ is the chosen coefficient vector (to be specified later) and the random error vector $\epsilon$ has mean 0 and covariance matrix $\sigma^2 I_{n \times n}$ for some $\sigma^2 > 0$.
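A sketch of this data generating process in R (our code; MASS::mvrnorm draws the dependent design, and the decaying coefficient vector defined later in Section 4.1.1 is used for illustration):

```r
# Sketch of the data generating process (ours; MASS::mvrnorm for the dependent design).
library(MASS)

make_data <- function(n, p, q, rho, sigma2, design = c("ar1", "constant", "independent")) {
  design <- match.arg(design)
  Sigma <- switch(design,
    ar1         = rho^abs(outer(1:p, 1:p, "-")),          # Sigma_1: entry (i,j) is rho^|i-j|
    constant    = matrix(rho, p, p) + diag(1 - rho, p),   # Sigma_2: constant correlation
    independent = diag(p))
  X <- mvrnorm(n, mu = rep(0, p), Sigma = Sigma)
  beta <- c(1 / (1:q), rep(0, p - q))                     # decaying coefficients
  y <- drop(X %*% beta) + rnorm(n, sd = sqrt(sigma2))
  list(X = X, y = y, beta = beta)
}

dat <- make_data(n = 100, p = 30, q = 5, rho = 0.9, sigma2 = 1, design = "ar1")
```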

3.2. Regression Estimation

For estimation of the regression function, we intend to minimize the estimation risk. Given the data, we consider a regression method and its loss. For Lasso, we define
$$\hat{\beta}_{min} = \operatorname*{argmin}_{\beta \in \mathbb{R}^p} \ \sum_{i=1}^{n} \big(y_i - \alpha - x_i^\top \beta\big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|,$$
where $\lambda$ is chosen by the regular CV, i.e., $\lambda = \lambda_{min}$. We let $\hat{\beta}_{1se}$ be the parameter estimate obtained from the simplest model whose CV error is within one standard error of the minimum CV error, i.e., $\lambda = \lambda_{1se}$. Given an estimate $\hat{\beta}$, the loss for estimating the regression function is
$$L(\hat{\beta}, \beta) = E\big(x^\top\hat{\beta} - x^\top\beta\big)^2,$$
where $\beta$ is the true parameter vector and the expectation is taken with respect to the randomness of x, which follows the same distribution as in the data generation. The loss can be approximated by
$$L(\hat{\beta}, \beta) \approx \frac{1}{J} \sum_{j=1}^{J} \big(z_j^\top\hat{\beta} - z_j^\top\beta\big)^2,$$
where $z_j$, $j = 1, \ldots, J$, are i.i.d. with the same distribution as x in the data generation and J is large (in the following numerical exercises, we set J = 500). In the case of using real data to compare $\lambda_{1se}$ and $\lambda_{min}$, it is clearly not possible to compute the estimation loss as described above. We will use cross-validation and data guided simulations (DGS) instead for the comparison of $\lambda_{1se}$ and $\lambda_{min}$. See Section 5 for details.

3.3. Variable Selection

The specification of parameter values can substantially affect the accuracy of variable selection by Lasso. In our simulation study, we investigate the performances of the model with $\lambda_{min}$ and that with $\lambda_{1se}$ by comparing their probabilities of accurate variable selection. When considering real data, since we do not know the true data generating process, we employ data guided simulations to provide more insight. Let S denote the set of variables in the true model and $\hat{S}$ the set of variables selected by a method. In the DGS over the real data sets, we define the symmetric difference $S \,\triangle\, \hat{S}$ to be the set of variables that are either in S but not in $\hat{S}$ or in $\hat{S}$ but not in S. We use the size of the symmetric difference to measure the performance of a variable selection method; it can be computed as in the sketch below. The performances of the model with $\lambda_{min}$ and the model with $\lambda_{1se}$ in terms of variable selection will be compared this way for the data examples, where some plausible models based on the data are used to generate the data.
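The size of the symmetric difference is straightforward to compute; a minimal sketch (our notation) is:

```r
# Size of the symmetric difference between the true and selected variable sets (our sketch).
sym_diff_size <- function(S, S_hat) {
  length(setdiff(S, S_hat)) + length(setdiff(S_hat, S))
}
sym_diff_size(S = 1:5, S_hat = c(1, 2, 3, 7))  # 2 missed + 1 extra = 3
```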
The organization of the numerical results is as follows. Section 4 presents the experimental results on the estimation accuracy of the 1se rule as well as its performance in regression estimation and variable selection, respectively. Section 5 applies the 1se rule to two real data sets and investigates its performance relative to the regular CV via DGS. Some additional figures and tables are in Appendix A. All R code used in this paper is available upon request.

4. Simulation Results

4.1. Is 1se on Target of Estimating the Standard Deviation?

This subsection presents a simulation study of the estimation accuracy of the 1se with respect to different parameter settings in three sampling settings, varying the value of the correlation ($\rho$) in the design matrix, the error variance $\sigma^2$, the sample size n, and the number of cross validation folds K.

4.1.1. Data Generating Processes (DGP)

We consider the following three sampling settings, in which we first simulate the dependent ($\Sigma_1$ with $\rho = 0.9$) or independent design matrix $X \in \mathbb{R}^{n \times p}$ and then generate the response $y = X\beta + \epsilon$ with either a decaying or a constant coefficient vector $\beta$. Recall that p is the total number of predictors; let q denote the number of nonzero coefficients.
  • DGP 1: Y is generated by 5 predictors (out of 30, q = 5 , p = 30 ).
  • DGP 2: Y is generated by 20 predictors (out of 30, q = 20 , p = 30 ).
  • DGP 3: Y is generated by 20 predictors (out of 1000, q = 20 , p = 1000 ).
For the coefficient vector $\beta = (\beta_1, \ldots, \beta_q, \beta_{q+1}, \ldots, \beta_p)$, it equals $(1, \frac{1}{2}, \ldots, \frac{1}{q}, 0, \ldots, 0)$ in the decaying coefficient case, and it equals $(\underbrace{1, 1, \ldots, 1}_{q}, \underbrace{0, \ldots, 0}_{p-q})$ in the constant coefficient case. In these three sampling settings, the errors $\epsilon$ are i.i.d. $N(0, \sigma^2)$. Therefore, the first and third configurations yield sparse models, and the third configuration also represents the case of high-dimensional data.

4.1.2. Procedure

We apply the Lasso and ridge procedures to the simulated data sets with tuning parameter selection by the two cross validation versions. The simulation results obtained with various parameter choices are summarized in Table A1, Table A2, Table A3 and Table A4 in Appendix A. The simulation choices are sample size $n \in \{100, 1000\}$, error variance $\sigma^2 \in \{0.01, 1, 4\}$ and K-fold cross validation with $K \in \{2, 5, 10, 20\}$.
(1)
For the j-th replication ($j = 1, \ldots, N$), simulate a data set of sample size n given the parameter values.
(2)
K-fold cross validation. Randomly split the data set into K equal-sized parts $F_1, F_2, \ldots, F_K$ and record the cross validation error $CV_k(\lambda)$ for each part, $k = 1, 2, \ldots, K$.
(3)
Calculate the total cross validation error $CV.err_j(\lambda)$ by taking the mean of these K numbers. Find $\lambda_{min}$ that minimizes the total CV error. We take $\lambda = \lambda_{min}$ throughout the remaining steps. Calculate the standard error of the cross validation error, $SE_j(\lambda)$, by the one standard error rule:
$$SE_j(CV.error) = \sqrt{\widehat{Var}\big(CV_{1,j}, CV_{2,j}, \ldots, CV_{K,j}\big)/K}.$$
(4)
Repeat steps 1 to 3 N times (N = 100). Calculate the standard deviation of the cross validation errors $CV.err_j(\lambda)$, and call it $SD(CV.error)$, and the mean standard error of the cross validation error (as used for the 1se rule), $SE(CV.error) = \frac{1}{N}\sum_{j=1}^{N} SE_j(CV.error)$.
(5)
Performance assessment. Calculate the ratio of (the claimed) standard error over (the simulated true) standard deviation:
$$\frac{SE(CV.error)}{SD(CV.error)}.$$
If the 1se works well, the ratio should be close to 1. However, the results in Table A1, Table A2, Table A3 and Table A4 show that, in some cases, the standard deviation is more than 100% overestimated or more than 50% underestimated. The histograms of the ratios (Figure A1 and Figure A2) also show that the standard error estimates are not close to their targets. Not surprisingly, 2-fold is particularly untrustworthy. But even for the common choice of 10-fold, and even for independent predictors with constant coefficients, the standard error estimate by 1se can be 40% over or under the target. In addition, there is no obvious pattern regarding when the 1se performs well or not. Overall, this simulation study provides evidence that the 1se fails to provide good estimation as intended (see [18] for more discussion of the standard error issues).
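For reference, the steps above can be condensed as follows (our own sketch, with independent predictors and decaying coefficients; cv.glmnet's cvsd at $\lambda_{min}$ plays the role of the claimed standard error):

```r
# Sketch of the SE/SD ratio computation in steps 1-5 (our code).
library(glmnet)

set.seed(1)
N <- 100; n <- 100; p <- 30; q <- 5; K <- 10
beta <- c(1 / (1:q), rep(0, p - q))        # decaying coefficients, as in DGP 1
cv_err <- se_claim <- numeric(N)
for (j in 1:N) {
  X <- matrix(rnorm(n * p), n, p)          # independent predictors for simplicity
  y <- drop(X %*% beta) + rnorm(n)
  cvfit <- cv.glmnet(X, y, nfolds = K)
  i <- which.min(cvfit$cvm)
  cv_err[j]   <- cvfit$cvm[i]              # CV error at lambda_min (step 3)
  se_claim[j] <- cvfit$cvsd[i]             # claimed SE at lambda_min (step 3)
}
mean(se_claim) / sd(cv_err)                # ratio in step 5; 1 would be ideal
```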

4.2. Is 1se Rule Better for Regression Estimation?

Since the standard error estimation in 1se is not the end goal, the poor performance of 1se in the previous subsection does not necessarily mean that the 1se rule is poor for regression estimation and variable selection. In this part, we study the performance of the model with $\lambda_{1se}$ in regression estimation and compare it with that of the model with $\lambda_{min}$. A large number of samples are drawn so that we know in which cases the 1se rule can consistently outperform the regular CV in regression estimation.
By simulations, we compare $\lambda_{1se}$ and $\lambda_{min}$ in terms of accuracy in estimating the regression function. We consider several factors: the nonzero coefficients $\beta^{(1)}$, the number of zero coefficients $p - q$, the sample size n, the standard deviation of the noise $\sigma$, and the dependency of the predictors.

4.2.1. Model Specification

We consider the data generation model $y = X\beta + \epsilon$, where $\epsilon \sim N(0, \sigma^2)$ and $\beta = (\beta^{(1)}, \beta^{(2)})$, with $\beta^{(1)} = (\beta_1, \beta_2, \ldots, \beta_q)$ the nonzero coefficients and $\beta^{(2)} = (\beta_{q+1}, \beta_{q+2}, \ldots, \beta_p)$ the zero coefficients. The vector of explanatory variables is generated from the multivariate normal distribution
$$X \sim N(0, \Sigma_i), \quad i = 1, 2.$$

4.2.2. Procedure

  • Simulate a fixed validation set of 500 observations of the predictors for estimation of the loss.
  • Each time randomly simulate a training set of n observations from the same data generating process.
  • Apply the 1se rule (10-fold cross validation) to the training set and record the two selected models: the model with $\lambda_{1se}$ and the model with $\lambda_{min}$. Calculate the estimation losses of these two models, $\frac{1}{500}\sum_{i=1}^{500}(z_i^\top\hat{\beta} - z_i^\top\beta)^2$, where $\hat{\beta}$ is based on $\lambda_{1se}$ or $\lambda_{min}$, respectively, and the $z_i$ are independently generated from the same distribution used to generate X.
  • Repeat this process M times (M = 500) and calculate the fraction of runs in which the model with $\lambda_{min}$ has the smaller loss (a sketch follows the list).
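A condensed sketch of this comparison (our code, shown for independent predictors with constant coefficients; the paper's settings vary as in Table 1):

```r
# Sketch (ours): fraction of M runs in which the model with lambda.min has a
# smaller estimation loss than the model with lambda.1se.
library(glmnet)

set.seed(1)
n <- 100; p <- 30; q <- 5; M <- 500
beta <- c(rep(1, q), rep(0, p - q))
Z <- matrix(rnorm(500 * p), 500, p)   # fixed validation design (step 1)
mu_Z <- drop(Z %*% beta)              # true regression function at validation points

min_wins <- replicate(M, {
  X <- matrix(rnorm(n * p), n, p)     # fresh training set (step 2)
  y <- drop(X %*% beta) + rnorm(n)
  cvfit <- cv.glmnet(X, y, nfolds = 10)
  loss_min <- mean((drop(predict(cvfit, newx = Z, s = "lambda.min")) - mu_Z)^2)
  loss_1se <- mean((drop(predict(cvfit, newx = Z, s = "lambda.1se")) - mu_Z)^2)
  loss_min < loss_1se
})
mean(min_wins)                        # proportion of runs where lambda.min does better
```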
In our simulations, we make various comparisons: small or large nonzero coefficients; small or large error variance; small or large sample size; low- or high-dimensional data; independent or dependent design matrix. The parameter values used in our study are displayed in Table 1 below. The results are reported in Table A5 and Table A6. Note that when $\lambda_{min}$ and $\lambda_{1se}$ give identical selections over the M = 500 runs, the entry in the table is NaN.
The simulation results show that, in general, for the design matrix with either AR(1) correlation or constant correlation, the model with $\lambda_{min}$ tends to provide better regression estimation (smaller estimation errors), especially when the data are generated with relatively large coefficients. In addition, we consider two more settings below.

4.2.3. Case 1: Large p / q

In this case, we consider high-dimensional data with the AR(1) covariance design matrix $\Sigma_1$, where q = 10 and p = 1000. We set $\sigma^2 = 1$, $n \in \{100, 1000\}$, $\rho \in \{0.01, 0.9\}$, and constant nonzero coefficients with value $\beta \in \{1, 6, 11, 16, 21\}$. We report the probability that the model with $\lambda_{min}$ estimates better. Clearly, in this case, the model with $\lambda_{min}$ significantly outperforms the model with $\lambda_{1se}$ over a range of $\beta$ values at both sample sizes (see Figure 1).

4.2.4. Case 2: Small p / q

In this case, we consider low-dimensional data with the AR(1) covariance design matrix $\Sigma_1$, where q = 10 and $p - q \in \{2, 5, 10\}$. We set $\sigma^2 = 0.01$, $n \in \{100, 1000\}$, $\rho = 0.1$, and the common nonzero coefficient value $\beta \in \{0.2, 0.4, 0.6, 0.8, 1, 1.2\}$. We report the probability that the model with $\lambda_{min}$ estimates the regression function better. In this case, the model with $\lambda_{min}$ significantly outperforms the model with $\lambda_{1se}$ as the common coefficient value increases beyond 1, with the probability getting close to or even exceeding 0.8 (see Figure 2).

4.3. Is 1se Rule Better for Variable Selection?

There are two main issues in the application of the 1se rule to variable selection. One is that the accuracy of variable selection by Lasso itself is very sensitive to the parameter values in the data generating process. In various situations (especially when the covariates are correlated in complicated ways, as in gene expression types of data), Lasso may perform very poorly in accurate variable selection. The second issue is that $\lambda_{min}$ and $\lambda_{1se}$ returned by the glmnet function are the same in many cases, in which case there is no difference in the variable selection results. The cases where $\lambda_{min} = \lambda_{1se}$ are eliminated in the following general study. To examine the real difference between $\lambda_{min}$ and $\lambda_{1se}$, we consider the difference in their probabilities of selecting the true model given that they have distinct selection outcomes.

4.3.1. Procedure

  • Randomly simulate a data set of sample size n given the parameter values.
  • Perform variable selection with $\lambda_{min}$ and $\lambda_{1se}$ on the simulated data set, respectively. If the two returned models are the same, discard the result and go back to step 1. Otherwise, check their variable selection results.
  • Repeat the above process M times (M = 500) and calculate the fractions of correct variable selection, denoted by $P_c(\lambda_{1se})$ and $P_c(\lambda_{min})$, respectively. We report the proportion difference $P_c(\lambda_{1se}) - P_c(\lambda_{min})$; a positive value means the 1se rule is better. A sketch of this comparison follows the list.
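A sketch of the comparison (again our code, with a simple independent design for illustration; M = 500 as in the paper, which may be slow for a quick run):

```r
# Sketch (ours): P_c(lambda_1se) - P_c(lambda_min), counting only runs where the
# two selected models differ, as in the procedure above.
library(glmnet)

set.seed(1)
n <- 100; p <- 30; q <- 5; M <- 500
beta <- c(rep(1, q), rep(0, p - q)); truth <- 1:q

correct <- matrix(NA, 0, 2)
while (nrow(correct) < M) {
  X <- matrix(rnorm(n * p), n, p)
  y <- drop(X %*% beta) + 2 * rnorm(n)                 # sigma^2 = 4
  cvfit <- cv.glmnet(X, y, nfolds = 10)
  sel <- function(s) which(as.vector(coef(cvfit, s = s))[-1] != 0)  # drop intercept
  s_min <- sel("lambda.min"); s_1se <- sel("lambda.1se")
  if (identical(s_min, s_1se)) next                    # discard ties (step 2)
  correct <- rbind(correct, c(setequal(s_min, truth), setequal(s_1se, truth)))
}
mean(correct[, 2]) - mean(correct[, 1])                # positive favors the 1se rule
```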
We adopt the same framework and parameter settings (Table 1) as those used for regression estimation. The results show that $\lambda_{1se}$ tends to outperform $\lambda_{min}$ in most of the cases when the data are generated with an AR(1) correlation design matrix (see Table A7 and Table A8). With a constant correlation design matrix, however, the two methods show practically no difference in variable selection.
Below, we consider several special cases in which the model with $\lambda_{min}$ still needs to be considered, although it can only do slightly better, namely when the error variance is large, the nonzero coefficients are small, and no extra variables exist to disturb the selection process.

4.3.2. Case 1: Constant Coefficients

In this case, let q = 5, $p - q \in \{0, 1, 2, 3, 4\}$, $\sigma^2 = 4$, $n \in \{100, 1000\}$, $\rho = 0.01$ and the common nonzero coefficient $\beta \in [0.5, 1.3]$ with $\Sigma_1$. The result is in Figure 3. In the case of a small sample size, say n = 100, the model with $\lambda_{min}$ has a higher probability of correct variable selection when the constant coefficient lies in (0.5, 1). The largest difference appears when $p - q = 0$, not surprisingly. As the common coefficient value becomes larger than 1, the difference increases from negative to positive, which indicates that the reverse becomes true: the model with $\lambda_{1se}$ now works better. However, in the case of a large sample size, say n = 1000, the model with $\lambda_{1se}$ significantly outperforms the model with $\lambda_{min}$ except when $p - q = 0$.

4.3.3. Case 2: Decaying Coefficients

In this case, let q = 5, $\sigma^2 = 4$, n = 1000, $\rho = 0.5$, $p - q \in \{0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50\}$, and $\beta_i = \frac{1}{i}$, $i = 1, 2, \ldots, q$, with $\Sigma_1$. Figure 4 indicates that, with decaying coefficients lying in (0,1) and a large sample size n = 1000, the model with $\lambda_{min}$ has a higher probability of correct variable selection when the number of zero-coefficient variables $p - q$ is small. The largest difference appears when $p - q = 0$, i.e., when there are no extra variables to disturb the variable selection process.

4.3.4. Case 3: Hybrid Case

Let n = 100, $\sigma^2 = 4$ and $\rho \in \{0, 0.5\}$ with $\Sigma_1$. In this data generating process, we set the first ten coefficients to be positive, i.e., q = 10, among which the first nine nonzero coefficients are constant at 1, while the last one, $\beta_{10}$, increases from 0 to 10, i.e., $\beta_{10} \in \{0, 0.05, 0.1, 0.15, 0.2, 0.4, 0.6, 0.8, 1, 3, 5, 10\}$. The number of zero coefficients is $p - q \in \{0, 1, 2, 3, 5, 20\}$. From Figure 5, when $\beta_{10}$ is very small, i.e., $\beta_{10} \in [0.05, 0.2]$, and $p - q = 0$, $\lambda_{min}$ performs better. In the other situations, $\lambda_{1se}$ is significantly better.

5. Data Examples

In this section, the performances of the 1se rule and the regular CV in regression estimation and variable selection are examined over real data sets: Boston Housing Price (an example of n > p ) and Bardet–Biedl data (an example of p > n ).
Boston Housing Price. The data were collected by [19] for the purpose of discovering whether or not clean air influenced the value of houses in Boston. The data set consists of 506 observations on 14 non-constant variables. Of these, medv is the response variable, while the other 13 variables are possible predictors.
Bardet–Biedl. The gene expression data are from the microarray experiments on mammalian eye tissue samples of [20]. The data consist of 120 rats with 200 gene probes and the expression level of the TRIM32 gene. It is of interest to know which gene probes can best explain the expression level of the TRIM32 gene via regression analysis.

5.1. Regression Estimation

To test which method does better in regression estimation, we perform both cross validation and data guided simulation (DGS) over these two real data sets as follows.

5.1.1. Procedure for Cross Validation

  • We randomly select $n_1$ observations from the data set as the training set and use the rest as the validation set, with $n_1 \in \{100, 200, 400\}$ for Boston Housing Price and $n_1 \in \{40, 80\}$ for the Bardet–Biedl data.
  • Apply K-fold cross validation, $K \in \{5, 10, 20\}$ for Boston Housing Price and $K \in \{5, 10\}$ for the Bardet–Biedl data, to the training set and compute the mean square prediction errors of the model with $\lambda_{1se}$ and the model with $\lambda_{min}$ over the validation set.
  • Repeat the above process 500 times and compute the proportion of better estimation for each method.
The results (Table 2 and Table 3) show that, for all the training sizes $n_1$ considered, the proportion of the model with $\lambda_{min}$ doing better is about 52∼57% for Boston Housing Price and 56∼59% for the Bardet–Biedl data. The consistent slight advantage of $\lambda_{min}$ agrees with our earlier simulation results in supporting $\lambda_{min}$ as the better method for regression estimation.

5.1.2. Procedure for DGS

  • Obtain the model with $\lambda_{min}$ and the model with $\lambda_{1se}$ by K-fold cross validation, $K \in \{5, 10, 20\}$, on the data, along with the coefficient estimates $\hat{\beta}_{min}, \hat{\beta}_{1se}$, the standard deviation estimates $\hat{\sigma}_{min}, \hat{\sigma}_{1se}$ and the estimated responses $\hat{y}_{min,1}, \hat{y}_{1se,1}$ (fitted values).
  • Simulate the new responses in two scenarios:
    Scenario 1: $y = \hat{y}_{min,1} + \epsilon$ with $\epsilon \sim N(0, \hat{\sigma}_{min}^2 I)$,
    Scenario 2: $y = \hat{y}_{1se,1} + \epsilon$ with $\epsilon \sim N(0, \hat{\sigma}_{1se}^2 I)$.
  • Apply Lasso with $\lambda_{min}$ and $\lambda_{1se}$ (using the same K-fold CV) to the new data set (i.e., the new response and the original design matrix) and obtain the new estimated responses $\hat{y}_{min,2}$ and $\hat{y}_{1se,2}$ for each of the two scenarios.
  • Calculate the mean square estimation errors $\frac{1}{n}\sum_{i=1}^{n}(\hat{y}_{min,2,i} - \hat{y}_{min,1,i})^2$ and $\frac{1}{n}\sum_{i=1}^{n}(\hat{y}_{1se,2,i} - \hat{y}_{min,1,i})^2$ for Scenario 1, and $\frac{1}{n}\sum_{i=1}^{n}(\hat{y}_{min,2,i} - \hat{y}_{1se,1,i})^2$ and $\frac{1}{n}\sum_{i=1}^{n}(\hat{y}_{1se,2,i} - \hat{y}_{1se,1,i})^2$ for Scenario 2.
  • Repeat the above resampling process 500 times and compute the proportion of better estimation for each method (a sketch of one replication follows the list).
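A sketch of one DGS replication under Scenario 1 (our code; the stand-in X and y take the place of a real data set, and the degrees-of-freedom-corrected residual estimate of $\hat{\sigma}_{min}$ is our assumption, since the paper does not spell out the estimator):

```r
# One DGS replication, Scenario 1 (our sketch).
library(glmnet)

set.seed(1)
X <- matrix(rnorm(200 * 20), 200, 20)            # stand-in for a real design matrix
y <- drop(X[, 1:5] %*% rep(1, 5)) + rnorm(200)   # stand-in for a real response

cvfit0    <- cv.glmnet(X, y, nfolds = 10)
yhat_min  <- drop(predict(cvfit0, newx = X, s = "lambda.min"))
df_min    <- sum(coef(cvfit0, s = "lambda.min") != 0)      # nonzero terms incl. intercept
sigma_min <- sqrt(sum((y - yhat_min)^2) / (length(y) - df_min))

y_new  <- yhat_min + rnorm(length(y), sd = sigma_min)      # Scenario 1 response
cvfit1 <- cv.glmnet(X, y_new, nfolds = 10)
mse <- function(s) mean((drop(predict(cvfit1, newx = X, s = s)) - yhat_min)^2)
c(min = mse("lambda.min"), one_se = mse("lambda.1se"))     # repeat 500 times in the study
```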
The estimation results (Table 4) show that, overall, the proportion of better estimation for the model with $\lambda_{min}$ is between 46.8∼52.8% and for the model with $\lambda_{1se}$ it is between 47.2∼53.2%. Therefore, based on the DGS, in terms of regression estimation accuracy, there is not much difference between $\lambda_{min}$ and $\lambda_{1se}$: the observed proportions are not significantly different from 50% at the 0.05 level.

5.2. Variable Selection: DGS

To test which model does better in variable selection, we perform the DGS on the above real data sets. Since exact variable selection might not be achieved, especially for high-dimensional data, we use the symmetric difference $S \,\triangle\, \hat{S}$ to measure performance, i.e., the number of variables either in S but not in $\hat{S}$ or in $\hat{S}$ but not in S. Below is the algorithm for the DGS.

Procedure

  • Apply 10-fold (default) cross validation to the real data set and select the true set of variables S ($S_{min}$ or $S_{1se}$), either by the model with $\lambda_{min}$ or by the model with $\lambda_{1se}$.
  • Do least squares estimation by regressing the response on the true set of variables and obtain the estimated response $\hat{y}$ and the residual standard error $\hat{\sigma}$.
  • Simulate the new response $y_{new}$ by adding error terms randomly generated from $N(0, \hat{\sigma}^2)$ to $\hat{y}$. Apply K-fold cross validation, $K \in \{5, 10, 20\}$, to the simulated data set (i.e., $y_{new}$ and the original design matrix) and select the set of variables $\hat{S}$ ($\hat{S}_{min}$ or $\hat{S}_{1se}$), either by the model with $\lambda_{min}$ or by the model with $\lambda_{1se}$. Repeat this process 500 times.
  • Calculate the symmetric difference $S \,\triangle\, \hat{S}$ (see the sketch after this list).
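A sketch of one replication (our code, continuing with the stand-in X and y from the previous sketch and taking $S = S_{1se}$):

```r
# One DGS replication for variable selection (our sketch).
library(glmnet)

cvfit0 <- cv.glmnet(X, y, nfolds = 10)
S      <- which(as.vector(coef(cvfit0, s = "lambda.1se"))[-1] != 0)  # "true" set
ls_fit <- lm(y ~ X[, S])                                             # least squares on S
y_new  <- fitted(ls_fit) + rnorm(length(y), sd = summary(ls_fit)$sigma)

cvfit1 <- cv.glmnet(X, y_new, nfolds = 10)
S_hat  <- which(as.vector(coef(cvfit1, s = "lambda.1se"))[-1] != 0)
length(setdiff(S, S_hat)) + length(setdiff(S_hat, S))                # |S triangle S_hat|
```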
The distributions of the symmetric difference are shown in the appendix (Figure A3 and Figure A4 for Boston Housing Price; Figure A5 and Figure A6 for the Bardet–Biedl data). The mean size of $\hat{S}$ is reported in Table 5 and Table 6. For Boston Housing Price, there are 13 candidate variables; the model with $\lambda_{min}$ selects 11 variables and the model with $\lambda_{1se}$ selects 8 variables on average, i.e., the mean size of $S_{min}$ is 11 and the mean size of $S_{1se}$ is 8. For the Bardet–Biedl data, there are 200 candidate variables, while the model with $\lambda_{min}$ selects 21 variables and the model with $\lambda_{1se}$ selects 19 variables on average, i.e., the mean size of $S_{min}$ is 21 and the mean size of $S_{1se}$ is 19.
On average, the 1se rule does better in variable selection since the mean size of $\hat{S}_{1se}$ is closer to that of S. In the case of n > p, if we want a perfect or near-perfect variable selection result (i.e., $|S \,\triangle\, \hat{S}| \le 1$), then the 1se rule is a good choice, despite the fact that its selection results are more unstable: the range of $S \,\triangle\, \hat{S}_{1se}$ is larger than that of $S \,\triangle\, \hat{S}_{min}$. In the case of n < p, both models fail to provide perfect selection. But overall, the 1se rule is more accurate and stable given that $S \,\triangle\, \hat{S}_{1se}$ has a smaller mean and range.

6. Conclusions

The one standard error rule was proposed to pick the most parsimonious model within one standard error of the minimum CV error. Despite the fact that it has been widely used as a conservative alternative to the regular cross validation, there is no evidence confirming that the 1se rule can consistently outperform the regular CV.
Our theoretical result shows that the standard error formula is asymptotically valid when the regression procedure converges relatively fast to the true regression function. Our illustrative example also shows that when the regression estimator converges slowly, the CV errors on the different evaluation folds may be highly dependent, which unfortunately makes the usual sample variance (or sample standard deviation) formula problematic. In such a case, the use of the 1se rule may easily lead to a choice of inferior candidates.
Our numerical results offer finite-sample understanding. First of all, the simulation study casts doubt on the estimation accuracy of the 1se rule in terms of estimating the standard error of the minimum cross validation error. In some cases, the 1se is liable to overestimate or underestimate by 100%, depending on the number of cross validation folds and the dependency among the predictors. More specifically, for example, the 1se tends to overestimate when used with 2-fold cross validation in the case of dependent predictors, but it tends to underestimate when used with 20-fold cross validation in the case of independent predictors.
Secondly, the performances of the 1se rule and the regular CV in regression estimation and variable selection are compared via both simulated and real data sets. In general, on the one hand, the 1se rule often performs better in variable selection for sparse modeling; on the other hand, it often does worse in regression estimation.
While our theoretical and numerical results clearly challenge indiscriminate use of the 1se rule in general model selection, its parsimonious nature may still be appealing when a sparse model is highly desirable. For the real data sets, the regular CV does better than the 1se rule in regression estimation, but the difference is small. There is almost no difference in estimation accuracy between the two methods when regression estimation is assessed by the DGS. Overall, considering model simplicity and interpretability, the 1se rule would be a better choice here.
Future studies are encouraged to provide more theoretical understanding of the 1se rule. It may also be studied numerically in more complex settings (e.g., when used to compare several nonlinear/nonparametric classification methods) than those in this work. In a larger context, since it is well known that penalty methods for model selection are often unstable (e.g., [1,21,22]), alternative methods to strengthen the reliability and reproducibility of variable selection may be applied (see the references in the above-mentioned papers and [23]).

Author Contributions

Conceptualization, Y.Y.; methodology, Y.C. and Y.Y.; numerical results, Y.C.; writing—original draft preparation, Y.C.; writing—review and editing, Y.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Additional Numerical Results

Figure A1. Histogram of Table A1 and Table A2.
Figure A2. Histogram of Table A3 and Table A4.
Table A1. Mean ratio of claimed SE and true SD: dependent predictors ($\Sigma_1$) and decaying coefficients.

| n | $\sigma^2$ | Method | DGP1, 2-fold | 5-fold | 10-fold | 20-fold | DGP2, 2-fold | 5-fold | 10-fold | 20-fold | DGP3, 2-fold | 5-fold | 10-fold | 20-fold |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 100 | 0.01 | Lasso | 1.489 | 1.305 | 1.275 | 1.271 | 1.068 | 1.185 | 1.183 | 1.127 | 1.015 | 1.217 | 1.266 | 1.222 |
| 100 | 0.01 | Ridge | 1.187 | 1.104 | 0.975 | 0.924 | 1.370 | 1.171 | 1.082 | 1.031 | 0.918 | 1.190 | 1.252 | 1.194 |
| 100 | 1 | Lasso | 1.759 | 1.376 | 1.145 | 1.088 | 1.247 | 1.181 | 1.148 | 1.134 | 0.886 | 1.094 | 0.906 | 0.728 |
| 100 | 1 | Ridge | 1.499 | 1.350 | 1.231 | 1.129 | 1.512 | 1.310 | 1.202 | 1.102 | 1.003 | 1.092 | 1.093 | 1.049 |
| 100 | 4 | Lasso | 1.260 | 1.162 | 1.067 | 1.071 | 1.333 | 1.202 | 1.164 | 1.119 | 1.121 | 0.580 | 0.319 | 0.229 |
| 100 | 4 | Ridge | 1.540 | 1.332 | 1.217 | 1.108 | 1.553 | 1.360 | 1.252 | 1.142 | 1.080 | 0.969 | 0.968 | 0.949 |
| 1000 | 0.01 | Lasso | 1.221 | 1.226 | 1.123 | 1.134 | 1.876 | 1.100 | 1.028 | 0.997 | 1.519 | 1.166 | 1.101 | 1.111 |
| 1000 | 0.01 | Ridge | 1.725 | 1.011 | 0.880 | 0.853 | 1.490 | 0.882 | 0.780 | 0.777 | 1.385 | 1.299 | 1.254 | 1.251 |
| 1000 | 1 | Lasso | 1.960 | 1.233 | 1.068 | 1.031 | 1.725 | 1.219 | 1.073 | 0.959 | 1.853 | 0.889 | 0.711 | 0.637 |
| 1000 | 1 | Ridge | 2.173 | 1.243 | 1.060 | 1.029 | 2.069 | 1.188 | 1.010 | 0.978 | 1.346 | 1.224 | 1.183 | 1.177 |
| 1000 | 4 | Lasso | 1.650 | 1.031 | 0.955 | 0.873 | 2.122 | 1.184 | 1.003 | 0.880 | 1.868 | 0.831 | 0.668 | 0.598 |
| 1000 | 4 | Ridge | 2.139 | 1.218 | 1.032 | 0.996 | 2.154 | 1.244 | 1.057 | 1.020 | 1.180 | 1.104 | 1.085 | 1.079 |
Table A2. Mean ratio of claimed SE and true SD: independent predictors and decaying coefficients.

| n | $\sigma^2$ | Method | DGP1, 2-fold | 5-fold | 10-fold | 20-fold | DGP2, 2-fold | 5-fold | 10-fold | 20-fold | DGP3, 2-fold | 5-fold | 10-fold | 20-fold |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 100 | 0.01 | Lasso | 1.275 | 1.225 | 1.130 | 1.077 | 1.029 | 1.087 | 1.110 | 1.128 | 1.221 | 0.952 | 0.581 | 0.379 |
| 100 | 0.01 | Ridge | 0.973 | 0.987 | 0.918 | 0.909 | 1.035 | 1.012 | 0.964 | 0.905 | 1.152 | 1.130 | 1.143 | 1.108 |
| 100 | 1 | Lasso | 1.532 | 1.197 | 1.055 | 0.999 | 1.121 | 1.054 | 1.132 | 1.030 | 1.243 | 0.978 | 0.651 | 0.459 |
| 100 | 1 | Ridge | 1.039 | 0.989 | 0.936 | 0.924 | 1.083 | 1.043 | 0.977 | 0.946 | 1.049 | 0.986 | 1.004 | 0.967 |
| 100 | 4 | Lasso | 1.569 | 1.241 | 1.074 | 1.026 | 1.109 | 1.125 | 1.061 | 1.041 | 1.236 | 1.033 | 0.767 | 0.584 |
| 100 | 4 | Ridge | 1.027 | 0.990 | 0.938 | 0.933 | 1.050 | 1.030 | 0.963 | 0.946 | 1.006 | 0.951 | 0.963 | 0.930 |
| 1000 | 0.01 | Lasso | 1.126 | 0.914 | 0.748 | 0.668 | 1.499 | 1.046 | 0.864 | 0.820 | 1.855 | 0.961 | 0.789 | 0.741 |
| 1000 | 0.01 | Ridge | 1.509 | 1.108 | 0.964 | 0.909 | 1.677 | 1.007 | 0.900 | 0.854 | 1.203 | 1.127 | 1.100 | 1.091 |
| 1000 | 1 | Lasso | 1.100 | 0.749 | 0.641 | 0.597 | 1.624 | 1.181 | 0.984 | 0.888 | 2.065 | 1.026 | 0.823 | 0.747 |
| 1000 | 1 | Ridge | 1.264 | 1.037 | 0.975 | 0.959 | 1.327 | 0.994 | 0.944 | 0.926 | 1.003 | 0.962 | 0.962 | 0.964 |
| 1000 | 4 | Lasso | 1.864 | 1.193 | 0.964 | 0.879 | 1.866 | 0.965 | 0.797 | 0.710 | 2.102 | 1.042 | 0.836 | 0.756 |
| 1000 | 4 | Ridge | 1.145 | 1.004 | 0.963 | 0.960 | 1.131 | 0.975 | 0.944 | 0.938 | 0.955 | 0.935 | 0.937 | 0.941 |
Table A3. Mean ratio of claimed SE and true SD: dependent predictors ($\Sigma_1$) and constant coefficients.

| n | $\sigma^2$ | Method | DGP1, 2-fold | 5-fold | 10-fold | 20-fold | DGP2, 2-fold | 5-fold | 10-fold | 20-fold | DGP3, 2-fold | 5-fold | 10-fold | 20-fold |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 100 | 0.01 | Lasso | 1.452 | 1.346 | 1.322 | 1.307 | 0.958 | 1.132 | 1.140 | 1.103 | 0.953 | 1.155 | 1.214 | 1.161 |
| 100 | 0.01 | Ridge | 1.211 | 1.279 | 1.068 | 1.047 | 1.573 | 1.391 | 1.246 | 1.184 | 1.046 | 1.370 | 1.399 | 1.356 |
| 100 | 1 | Lasso | 1.595 | 1.284 | 1.170 | 1.134 | 0.955 | 1.186 | 1.195 | 1.136 | 0.947 | 1.178 | 1.231 | 1.203 |
| 100 | 1 | Ridge | 1.307 | 1.280 | 1.144 | 1.074 | 1.783 | 1.428 | 1.384 | 1.290 | 1.068 | 1.397 | 1.440 | 1.380 |
| 100 | 4 | Lasso | 1.529 | 1.415 | 1.176 | 1.070 | 0.998 | 1.280 | 1.260 | 1.254 | 1.039 | 1.312 | 1.290 | 1.270 |
| 100 | 4 | Ridge | 1.432 | 1.340 | 1.221 | 1.130 | 1.566 | 1.312 | 1.208 | 1.111 | 1.092 | 1.422 | 1.471 | 1.389 |
| 1000 | 0.01 | Lasso | 1.193 | 1.187 | 1.110 | 1.119 | 1.084 | 0.804 | 0.773 | 0.744 | 1.492 | 1.128 | 1.122 | 1.079 |
| 1000 | 0.01 | Ridge | 2.087 | 1.123 | 0.970 | 0.912 | 1.742 | 1.448 | 1.423 | 1.301 | 1.726 | 1.596 | 1.509 | 1.482 |
| 1000 | 1 | Lasso | 1.995 | 1.469 | 1.330 | 1.292 | 1.389 | 1.035 | 0.950 | 0.948 | 1.573 | 1.225 | 1.154 | 1.159 |
| 1000 | 1 | Ridge | 2.017 | 1.196 | 1.029 | 1.011 | 2.315 | 1.448 | 1.200 | 1.088 | 1.744 | 1.599 | 1.505 | 1.486 |
| 1000 | 4 | Lasso | 2.200 | 1.023 | 0.883 | 0.806 | 2.128 | 1.375 | 1.162 | 1.127 | 2.144 | 1.385 | 1.235 | 1.204 |
| 1000 | 4 | Ridge | 2.163 | 1.260 | 1.078 | 1.054 | 1.991 | 1.261 | 1.050 | 1.008 | 1.783 | 1.612 | 1.500 | 1.490 |
Table A4. Mean ratio of claimed SE and true SD: independent predictors and constant coefficients.

| n | $\sigma^2$ | Method | DGP1, 2-fold | 5-fold | 10-fold | 20-fold | DGP2, 2-fold | 5-fold | 10-fold | 20-fold | DGP3, 2-fold | 5-fold | 10-fold | 20-fold |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 100 | 0.01 | Lasso | 1.180 | 1.248 | 1.178 | 1.158 | 0.921 | 1.082 | 1.160 | 1.109 | 1.197 | 1.172 | 0.599 | 0.386 |
| 100 | 0.01 | Ridge | 0.918 | 1.149 | 1.099 | 1.104 | 1.121 | 1.089 | 1.009 | 0.851 | 1.103 | 1.112 | 1.096 | 1.050 |
| 100 | 1 | Lasso | 1.311 | 1.241 | 1.133 | 1.075 | 0.930 | 1.110 | 1.195 | 1.129 | 1.371 | 1.166 | 0.594 | 0.380 |
| 100 | 1 | Ridge | 0.953 | 0.980 | 0.898 | 0.886 | 1.116 | 0.997 | 0.858 | 0.740 | 1.122 | 1.116 | 1.096 | 1.054 |
| 100 | 4 | Lasso | 1.366 | 1.202 | 1.087 | 1.057 | 0.966 | 1.104 | 1.151 | 1.123 | 1.387 | 1.155 | 0.597 | 0.382 |
| 100 | 4 | Ridge | 1.018 | 0.985 | 0.917 | 0.906 | 1.135 | 1.016 | 0.896 | 0.799 | 1.159 | 1.107 | 1.092 | 1.052 |
| 1000 | 0.01 | Lasso | 0.939 | 0.978 | 0.842 | 0.763 | 1.149 | 1.030 | 1.093 | 1.003 | 1.068 | 1.050 | 1.057 | 1.067 |
| 1000 | 0.01 | Ridge | 1.475 | 0.969 | 0.768 | 0.719 | 1.101 | 1.045 | 0.942 | 0.896 | 1.543 | 1.446 | 1.412 | 1.396 |
| 1000 | 1 | Lasso | 1.311 | 1.003 | 0.833 | 0.788 | 0.991 | 1.001 | 1.065 | 0.954 | 1.181 | 1.027 | 1.063 | 1.017 |
| 1000 | 1 | Ridge | 1.511 | 1.086 | 0.924 | 0.874 | 1.362 | 0.572 | 0.460 | 0.433 | 1.370 | 1.185 | 1.154 | 1.142 |
| 1000 | 4 | Lasso | 0.996 | 0.824 | 0.650 | 0.631 | 1.317 | 0.971 | 0.909 | 0.798 | 1.293 | 1.000 | 1.014 | 0.957 |
| 1000 | 4 | Ridge | 1.405 | 1.099 | 0.974 | 0.934 | 1.552 | 0.793 | 0.679 | 0.652 | 1.303 | 1.157 | 1.136 | 1.131 |
Table A5. Regression estimation: probability of $\lambda_{min}$ doing better, AR(1) correlation.

| n | $\rho$ | $\sigma^2$ | Low-dim., $\beta=0.1$ | $\beta=1$ | $\beta=3$ | Decay $\beta$ | High-dim., $\beta=0.1$ | $\beta=1$ | $\beta=3$ | Decay $\beta$ |
|---|---|---|---|---|---|---|---|---|---|---|
| 100 | 0 | 0.01 | 0.52 | 0.62 | 1.00 | 0.52 | 0.46 | 0.64 | 0.84 | 0.54 |
| 100 | 0 | 4 | 0.42 | 0.52 | 0.52 | 0.54 | 0.58 | 0.50 | 0.50 | 0.52 |
| 100 | 0.5 | 0.01 | 0.54 | 0.83 | 0.92 | 0.52 | 0.58 | 0.70 | 0.90 | 0.58 |
| 100 | 0.5 | 4 | 0.44 | 0.56 | 0.54 | 0.58 | 0.38 | 0.60 | 0.60 | 0.60 |
| 100 | 0.9 | 0.01 | 0.56 | 0.88 | 1.00 | 0.60 | 0.54 | 0.84 | 0.96 | 0.54 |
| 100 | 0.9 | 4 | 0.54 | 0.56 | 0.56 | 0.54 | 0.57 | 0.52 | 0.52 | 0.46 |
| 1000 | 0 | 0.01 | 0.58 | 0.57 | NaN | 0.58 | 0.64 | 0.71 | NaN | 0.74 |
| 1000 | 0 | 4 | 0.45 | 0.58 | 0.56 | 0.52 | 0.57 | 0.66 | 0.66 | 0.72 |
| 1000 | 0.5 | 0.01 | 0.50 | NaN | NaN | 0.54 | 0.68 | NaN | NaN | 0.68 |
| 1000 | 0.5 | 4 | 0.52 | 0.50 | 0.52 | 0.54 | 0.64 | 0.68 | 0.68 | 0.70 |
| 1000 | 0.9 | 0.01 | 0.56 | NaN | NaN | 0.60 | 0.66 | NaN | NaN | 0.70 |
| 1000 | 0.9 | 4 | 0.56 | 0.54 | 0.58 | 0.54 | 0.68 | 0.66 | 0.64 | 0.66 |
Table A6. Regression estimation: probability of $\lambda_{min}$ doing better, constant correlation.

| n | $\rho$ | $\sigma^2$ | Low-dim., $\beta=0.1$ | $\beta=1$ | $\beta=3$ | Decay $\beta$ | High-dim., $\beta=0.1$ | $\beta=1$ | $\beta=3$ | Decay $\beta$ |
|---|---|---|---|---|---|---|---|---|---|---|
| 100 | 0.5 | 0.01 | 0.56 | 0.88 | 0.89 | 0.53 | 0.54 | 0.84 | 1.00 | 0.64 |
| 100 | 0.5 | 4 | 0.53 | 0.56 | 0.54 | 0.52 | 0.61 | 0.52 | 0.56 | 0.60 |
| 100 | 0.9 | 0.01 | 0.60 | 1.00 | 1.00 | 0.70 | 0.56 | 0.82 | 0.90 | 0.70 |
| 100 | 0.9 | 4 | 0.58 | 0.54 | 0.62 | 0.62 | 0.62 | 0.58 | 0.64 | 0.64 |
| 1000 | 0.5 | 0.01 | 0.50 | NaN | NaN | 0.50 | 0.74 | NaN | NaN | 0.78 |
| 1000 | 0.5 | 4 | 0.50 | 0.42 | 0.50 | 0.44 | 0.72 | 0.74 | 0.76 | 0.78 |
| 1000 | 0.9 | 0.01 | 0.48 | NaN | NaN | 0.54 | 0.74 | NaN | NaN | 0.78 |
| 1000 | 0.9 | 4 | 0.52 | 0.50 | 0.52 | 0.46 | 0.72 | 0.74 | 0.76 | 0.76 |
Table A7. Variable selection: $P_c(\lambda_{1se}) - P_c(\lambda_{min})$, AR(1) correlation.

| n | $\rho$ | $\sigma^2$ | Low-dim., $\beta=0.1$ | $\beta=1$ | $\beta=3$ | Decay $\beta$ | High-dim., $\beta=0.1$ | $\beta=1$ | $\beta=3$ | Decay $\beta$ |
|---|---|---|---|---|---|---|---|---|---|---|
| 100 | 0 | 0.01 | 0.13 | 0.13 | 0 | 0.06 | 0 | 0 | 0.02 | NaN |
| 100 | 0 | 4 | 0 | 0.12 | 0.11 | 0.01 | 0 | 0 | 0 | 0.90 |
| 100 | 0.5 | 0.01 | 0.45 | 0 | 0 | 0 | 0.25 | 0.01 | 0 | NaN |
| 100 | 0.5 | 4 | 0 | 0.40 | 0.43 | 0.05 | 0 | 0.16 | 0.27 | 0.87 |
| 100 | 0.9 | 0.01 | 0.40 | 0 | 0 | 0 | 0.61 | 0 | 0 | NaN |
| 100 | 0.9 | 4 | 0 | 0.14 | 0.39 | 0.01 | 0 | 0.25 | 0.62 | 0.7 |
| 1000 | 0 | 0.01 | 0.72 | 0.10 | NaN | 0.70 | 0.27 | 0.16 | NaN | NaN |
| 1000 | 0 | 4 | 0 | 0.71 | 0.70 | 0 | 0 | 0.25 | 0.25 | 0.52 |
| 1000 | 0.5 | 0.01 | 0.86 | NaN | NaN | 0.03 | 0.86 | NaN | NaN | NaN |
| 1000 | 0.5 | 4 | 0 | 0.85 | 0.85 | 0 | 0 | 0.85 | 0.85 | 0.95 |
| 1000 | 0.9 | 0.01 | 0.56 | NaN | NaN | 0 | 0.76 | NaN | NaN | NaN |
| 1000 | 0.9 | 4 | 0 | 0.67 | 0.36 | 0 | 0 | 0.77 | 0.17 | 0.65 |
Table A8. Variable selection: $P_c(\lambda_{1se}) - P_c(\lambda_{min})$, constant correlation.

| n | $\rho$ | $\sigma^2$ | Low-dim., $\beta=0.1$ | $\beta=1$ | $\beta=3$ | Decay $\beta$ | High-dim., $\beta=0.1$ | $\beta=1$ | $\beta=3$ | Decay $\beta$ |
|---|---|---|---|---|---|---|---|---|---|---|
| 100 | 0.5 | 0.01 | 0 | 0 | 0.1 | 0 | 0 | 0 | 0 | 0 |
| 100 | 0.5 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 100 | 0.9 | 0.01 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 100 | 0.9 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1000 | 0.5 | 0.01 | 0 | NaN | NaN | 0.01 | 0 | NaN | NaN | 0 |
| 1000 | 0.5 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1000 | 0.9 | 0.01 | 0 | NaN | NaN | 0.01 | 0 | NaN | NaN | 0 |
| 1000 | 0.9 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Figure A3. Boston Housing Price: $S \,\triangle\, \hat{S}$ with $S = var_{min}$; (a) $\hat{S} = var_{min}$; (b) $\hat{S} = var_{1se}$.
Figure A4. Boston Housing Price: $S \,\triangle\, \hat{S}$ with $S = var_{1se}$; (a) $\hat{S} = var_{min}$; (b) $\hat{S} = var_{1se}$.
Figure A5. Bardet–Biedl data: $S \,\triangle\, \hat{S}$ with $S = var_{min}$; (a) $\hat{S} = var_{min}$; (b) $\hat{S} = var_{1se}$.
Figure A6. Bardet–Biedl data: $S \,\triangle\, \hat{S}$ with $S = var_{1se}$; (a) $\hat{S} = var_{min}$; (b) $\hat{S} = var_{1se}$.

Appendix B. Proof of the Main Result

Appendix B.1. A Lemma

Lemma A1.
Let $W_1, W_2, \ldots, W_K$ have identical variance and common pairwise correlation $\rho$. Then
$$E\left[\frac{1}{K-1}\sum_{i=1}^{K}\big(W_i - \bar{W}\big)^2\right] = (1-\rho)\,Var(W_i).$$
Remark A1.
Depending on whether $\rho = 0$, $\rho > 0$ or $\rho < 0$, the sample variance is unbiased for, underestimates, or overestimates the variance of $W_i$. For a general learning procedure, the correlation between the test errors from different folds can be either positive or negative, which makes the estimated standard error based on the usual sample variance formula underestimate or overestimate the true standard error. This is seen in our numerical results.
Remark A2.
Note that for the sample mean $\bar{W}$, we have $Var(\bar{W}) = Var(W_1)/K + Var(W_1)\,\rho\,(K-1)/K$. Thus, for $Var(W_1)/K$ to be asymptotically valid (in approximation) for $Var(\bar{W})$ in the sense that their ratio approaches 1, we must have $\rho \to 0$, which is also the key requirement for the sample variance formula to be asymptotically valid (ARU).
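Lemma A1 is easy to check numerically; the following sketch (ours) draws equicorrelated normal vectors and shows that the sample variance is biased by exactly the factor $(1 - \rho)$:

```r
# Numerical check of Lemma A1 (our sketch): E[S^2] = (1 - rho) Var(W_i) for
# equicorrelated W_1, ..., W_K with unit variance.
library(MASS)

set.seed(1)
K <- 10; rho <- 0.3
Sigma <- matrix(rho, K, K) + diag(1 - rho, K)   # unit variance, common correlation rho
W <- mvrnorm(100000, mu = rep(0, K), Sigma = Sigma)
mean(apply(W, 1, var))                          # approx 1 - rho = 0.7, not Var(W_i) = 1
```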

Appendix B.2. Proof of Lemma A1

Since the sample variance is unchanged by a common shift of the observations, without loss of generality we assume $E W_i = 0$. Then
$$S^2 = \frac{1}{K-1}\sum_{i=1}^{K}\big(W_i - \bar{W}\big)^2 = \frac{1}{K-1}\left(\sum_{i=1}^{K} W_i^2 - K\bar{W}^2\right) = \frac{1}{K-1}\left(\sum_{i=1}^{K} W_i^2 - \frac{1}{K}\Big(\sum_{i=1}^{K} W_i\Big)^2\right) = \frac{1}{K}\sum_{i=1}^{K} W_i^2 - \frac{2}{K(K-1)}\sum_{1 \le i < j \le K} W_i W_j.$$
Thus, we conclude $E S^2 = (1-\rho)\,Var(W_i)$.

Appendix B.3. Proof of Theorem 1

From Lemma A1, we only need to show that the correlation between the test errors on any two folds approaches 0 as $n \to \infty$. In the following, we examine the covariance between two test errors and show that it is negligible compared to the variance of a test error under the condition in the theorem.
Let $W_k$ denote the CV test error on the k-th fold, $k = 1, 2, \ldots, K$. Note that $W_k = \sum_{j \in F_k}\big(Y_j - \hat{f}^{(-F_k)}(X_j)\big)^2$, where $\hat{f}^{(-F_k)}$ denotes the estimator of f by the learning procedure $\delta$ based on $D_n^{(-F_k)} = \{(x_i, y_i): 1 \le i \le n, i \notin F_k\}$. Let $n_0 = n/K$ denote the fold size. Below, we first calculate the variance of $W_k$ and then bound $\rho(W_{k_1}, W_{k_2})$ for $k_1 \neq k_2$. Note that $Var(W_k)$ is equal to
$$\begin{aligned}
Var\Big(\sum_{j\in F_k}\big(f(X_j)+\epsilon_j-\hat f^{(-F_k)}(X_j)\big)^2\Big)
&= Var\Big(\sum_{j\in F_k}\big(f(X_j)-\hat f^{(-F_k)}(X_j)\big)^2+\sum_{j\in F_k}\epsilon_j^2+2\sum_{j\in F_k}\big(f(X_j)-\hat f^{(-F_k)}(X_j)\big)\epsilon_j\Big)\\
&= Var\Big(\sum_{j\in F_k}\big(f(X_j)-\hat f^{(-F_k)}(X_j)\big)^2\Big)+Var\Big(\sum_{j\in F_k}\epsilon_j^2\Big)+4\,Var\Big(\sum_{j\in F_k}\big(f(X_j)-\hat f^{(-F_k)}(X_j)\big)\epsilon_j\Big)\\
&\quad+2\,Cov\Big(\sum_{j\in F_k}\big(f(X_j)-\hat f^{(-F_k)}(X_j)\big)^2,\ \sum_{j\in F_k}\epsilon_j^2\Big)+4\,Cov\Big(\sum_{j\in F_k}\big(f(X_j)-\hat f^{(-F_k)}(X_j)\big)^2,\ \sum_{j\in F_k}\big(f(X_j)-\hat f^{(-F_k)}(X_j)\big)\epsilon_j\Big)\\
&\quad+4\,Cov\Big(\sum_{j\in F_k}\epsilon_j^2,\ \sum_{j\in F_k}\big(f(X_j)-\hat f^{(-F_k)}(X_j)\big)\epsilon_j\Big)\\
&= Var\Big(\sum_{j\in F_k}\big(f(X_j)-\hat f^{(-F_k)}(X_j)\big)^2\Big)+\sum_{j\in F_k}Var(\epsilon_j^2)+4\sum_{j\in F_k}Var(\epsilon_j)\,E\|f-\hat f^{(-F_k)}\|_2^2+0+0+4\sum_{j\in F_k}E\Big[\epsilon_j^3\big(f(X_j)-\hat f^{(-F_k)}(X_j)\big)\Big]\\
&= Var\Big(\sum_{j\in F_k}\big(f(X_j)-\hat f^{(-F_k)}(X_j)\big)^2\Big)+n_0\,Var(\epsilon_j^2)+4\,n_0\,\sigma^2\,E\|f-\hat f^{(-F_k)}\|_2^2,
\end{aligned}$$
where we have used the assumption that the error $\epsilon_j$ is independent of $X_j$ and $E\epsilon_j^3 = 0$. Now,
$$\begin{aligned}
Var\Big(\sum_{j\in F_k}\big(f(X_j)-\hat f^{(-F_k)}(X_j)\big)^2\Big)
&= E\Big[Var\Big(\sum_{j\in F_k}\big(f(X_j)-\hat f^{(-F_k)}(X_j)\big)^2\ \Big|\ D_n^{(-F_k)}\Big)\Big]+Var\Big(E\Big[\sum_{j\in F_k}\big(f(X_j)-\hat f^{(-F_k)}(X_j)\big)^2\ \Big|\ D_n^{(-F_k)}\Big]\Big)\\
&= E\Big[n_0\,Var\Big(\big(f(X_j)-\hat f^{(-F_k)}(X_j)\big)^2\ \Big|\ D_n^{(-F_k)}\Big)\Big]+Var\Big(n_0\,\|f-\hat f^{(-F_k)}\|_2^2\Big)\\
&= n_0\,E\Big(\|f-\hat f^{(-F_k)}\|_4^4-\|f-\hat f^{(-F_k)}\|_2^4\Big)+n_0^2\,Var\Big(\|f-\hat f^{(-F_k)}\|_2^2\Big).
\end{aligned}$$
Therefore,
$$Var(W_k) = n_0\,E\Big(\|f-\hat f^{(-F_k)}\|_4^4-\|f-\hat f^{(-F_k)}\|_2^4\Big)+n_0^2\,Var\Big(\|f-\hat f^{(-F_k)}\|_2^2\Big)+n_0\,Var(\epsilon_j^2)+4\,n_0\,\sigma^2\,E\|f-\hat f^{(-F_k)}\|_2^2.$$
Now, for $W_{k_1}$ and $W_{k_2}$ with $k_1 \neq k_2$, we calculate $Cov(W_{k_1}, W_{k_2})$. Clearly, without loss of generality, we can take $k_1 = 1$, $k_2 = 2$. Let $\hat f^{(-F_1,-F_2)}$ denote the estimator based on the data excluding folds $F_1$ and $F_2$. Then
$$\begin{aligned}
W_1 &= \sum_{j\in F_1}\big(f(X_j)-\hat f^{(-F_1)}(X_j)+\epsilon_j\big)^2\\
&= \sum_{j\in F_1}\big(f(X_j)-\hat f^{(-F_1,-F_2)}(X_j)+\epsilon_j+\hat f^{(-F_1,-F_2)}(X_j)-\hat f^{(-F_1)}(X_j)\big)^2\\
&= \sum_{j\in F_1}\big(f(X_j)-\hat f^{(-F_1,-F_2)}(X_j)+\epsilon_j\big)^2+\sum_{j\in F_1}\big(\hat f^{(-F_1,-F_2)}(X_j)-\hat f^{(-F_1)}(X_j)\big)^2\\
&\quad+2\sum_{j\in F_1}\big(f(X_j)-\hat f^{(-F_1,-F_2)}(X_j)+\epsilon_j\big)\big(\hat f^{(-F_1,-F_2)}(X_j)-\hat f^{(-F_1)}(X_j)\big)\\
&\eqqcolon C+R_{1,1}+R_{1,2}.
\end{aligned}$$
Similarly,
$$\begin{aligned}
W_2 &= \sum_{j\in F_2}\big(f(X_j)-\hat f^{(-F_1,-F_2)}(X_j)+\epsilon_j\big)^2+\sum_{j\in F_2}\big(\hat f^{(-F_1,-F_2)}(X_j)-\hat f^{(-F_2)}(X_j)\big)^2\\
&\quad+2\sum_{j\in F_2}\big(f(X_j)-\hat f^{(-F_1,-F_2)}(X_j)+\epsilon_j\big)\big(\hat f^{(-F_1,-F_2)}(X_j)-\hat f^{(-F_2)}(X_j)\big)\\
&\eqqcolon D+R_{2,1}+R_{2,2}.
\end{aligned}$$
Then
$$Cov(W_1, W_2) = Cov(C,D)+Cov(C,R_{2,1})+Cov(C,R_{2,2})+Cov(R_{1,1},D)+Cov(R_{1,1},R_{2,1})+Cov(R_{1,1},R_{2,2})+Cov(R_{1,2},D)+Cov(R_{1,2},R_{2,1})+Cov(R_{1,2},R_{2,2}).$$
For C o v ( C , D ) , we have
$$Cov(C,D) = E(CD)-(EC)(ED) = E\Big[\sum_{j\in F_1}\big(f(X_j)-\hat f^{(-F_1,-F_2)}(X_j)+\epsilon_j\big)^2\ \sum_{j\in F_2}\big(f(X_j)-\hat f^{(-F_1,-F_2)}(X_j)+\epsilon_j\big)^2\Big]-\Big(E\Big[\sum_{j\in F_1}\big(f(X_j)-\hat f^{(-F_1,-F_2)}(X_j)\big)^2+\epsilon_j^2+2\epsilon_j\big(f(X_j)-\hat f^{(-F_1,-F_2)}(X_j)\big)\Big]\Big)^2.$$
For the first part, we have
$$\begin{aligned}
E(CD) &= E\big[E\big(CD\ \big|\ D_n^{(-F_1,-F_2)}\big)\big]\\
&= E\Big[E\Big(\sum_{j\in F_1}\Big[\big(f(X_j)-\hat f^{(-F_1,-F_2)}(X_j)\big)^2+\epsilon_j^2+2\epsilon_j\big(f(X_j)-\hat f^{(-F_1,-F_2)}(X_j)\big)\Big]\\
&\qquad\times\sum_{j\in F_2}\Big[\big(f(X_j)-\hat f^{(-F_1,-F_2)}(X_j)\big)^2+\epsilon_j^2+2\epsilon_j\big(f(X_j)-\hat f^{(-F_1,-F_2)}(X_j)\big)\Big]\ \Big|\ D_n^{(-F_1,-F_2)}\Big)\Big]\\
&= n_0^2\,E\|f-\hat f^{(-F_1,-F_2)}\|_2^4+n_0^2\sigma^4+2\,n_0^2\sigma^2\,E\|f-\hat f^{(-F_1,-F_2)}\|_2^2,
\end{aligned}$$
and
$$(EC)(ED) = (EC)^2 = \Big(n_0\big(E\|f-\hat f^{(-F_1,-F_2)}\|_2^2+\sigma^2\big)\Big)^2 = n_0^2\big(E\|f-\hat f^{(-F_1,-F_2)}\|_2^2\big)^2+2\,n_0^2\sigma^2\,E\|f-\hat f^{(-F_1,-F_2)}\|_2^2+n_0^2\sigma^4.$$
Thus, $Cov(C,D) = n_0^2\,Var\big(\|f-\hat f^{(-F_1,-F_2)}\|_2^2\big)$. The other terms in the earlier expression of $Cov(W_1, W_2)$ can be handled similarly. Indeed, $|Cov(C, R_{2,1})| \le \sqrt{Var(C)\,Var(R_{2,1})}$, and the other terms are bounded similarly.
Next, we calculate V a r ( C ) . Actually, the calculation is basically the same as for V a r ( W k ) and we get
$$Var(C) = n_0\,E\Big(\|f-\hat f^{(-F_1,-F_2)}\|_4^4-\|f-\hat f^{(-F_1,-F_2)}\|_2^4\Big)+n_0^2\,Var\Big(\|f-\hat f^{(-F_1,-F_2)}\|_2^2\Big)+n_0\,Var(\epsilon_j^2)+4\,n_0\,\sigma^2\,E\|f-\hat f^{(-F_1,-F_2)}\|_2^2.$$
Note that
$$\begin{aligned}
Var(R_{1,1}) &= Var\Big(\sum_{j\in F_1}\big(\hat f^{(-F_1,-F_2)}(X_j)-\hat f^{(-F_1)}(X_j)\big)^2\Big)\\
&= E\Big[Var\Big(\sum_{j\in F_1}\big(\hat f^{(-F_1,-F_2)}(X_j)-\hat f^{(-F_1)}(X_j)\big)^2\ \Big|\ D_n^{(-F_1)}\Big)\Big]+Var\Big(E\Big[\sum_{j\in F_1}\big(\hat f^{(-F_1,-F_2)}(X_j)-\hat f^{(-F_1)}(X_j)\big)^2\ \Big|\ D_n^{(-F_1)}\Big]\Big)\\
&= E\Big[n_0\Big(\|\hat f^{(-F_1,-F_2)}-\hat f^{(-F_1)}\|_4^4-\|\hat f^{(-F_1,-F_2)}-\hat f^{(-F_1)}\|_2^4\Big)\Big]+n_0^2\,Var\Big(\|\hat f^{(-F_1,-F_2)}-\hat f^{(-F_1)}\|_2^2\Big)\\
&\le n_0\,E\|\hat f^{(-F_1,-F_2)}-\hat f^{(-F_1)}\|_4^4+n_0^2\,E\|\hat f^{(-F_1,-F_2)}-\hat f^{(-F_1)}\|_2^4.
\end{aligned}$$
For the other term,
$$\begin{aligned}
Var(R_{1,2}) &= 4\,Var\Big(\sum_{j\in F_1}\big(f(X_j)-\hat f^{(-F_1,-F_2)}(X_j)+\epsilon_j\big)\big(\hat f^{(-F_1,-F_2)}(X_j)-\hat f^{(-F_1)}(X_j)\big)\Big)\\
&= 4\,E\Big[Var\Big(\sum_{j\in F_1}\big(f(X_j)-\hat f^{(-F_1,-F_2)}(X_j)+\epsilon_j\big)\big(\hat f^{(-F_1,-F_2)}(X_j)-\hat f^{(-F_1)}(X_j)\big)\ \Big|\ D_n^{(-F_1)}\Big)\Big]\\
&\quad+4\,Var\Big(E\Big[\sum_{j\in F_1}\big(f(X_j)-\hat f^{(-F_1,-F_2)}(X_j)+\epsilon_j\big)\big(\hat f^{(-F_1,-F_2)}(X_j)-\hat f^{(-F_1)}(X_j)\big)\ \Big|\ D_n^{(-F_1)}\Big]\Big)\\
&= 4\,n_0\,E\Big[Var\Big(\big(f(X_j)-\hat f^{(-F_1,-F_2)}(X_j)\big)\big(\hat f^{(-F_1,-F_2)}(X_j)-\hat f^{(-F_1)}(X_j)\big)\ \Big|\ D_n^{(-F_1)}\Big)\Big]+4\,n_0\,\sigma^2\,E\|\hat f^{(-F_1,-F_2)}-\hat f^{(-F_1)}\|_2^2\\
&\quad+4\,n_0^2\,Var\Big(E\Big[\big(f(X_j)-\hat f^{(-F_1,-F_2)}(X_j)\big)\big(\hat f^{(-F_1,-F_2)}(X_j)-\hat f^{(-F_1)}(X_j)\big)\ \Big|\ D_n^{(-F_1)}\Big]\Big)\\
&\le 4\,n_0\,E\Big[E\Big(\big(f(X_j)-\hat f^{(-F_1,-F_2)}(X_j)\big)^2\big(\hat f^{(-F_1,-F_2)}(X_j)-\hat f^{(-F_1)}(X_j)\big)^2\ \Big|\ D_n^{(-F_1)}\Big)\Big]+4\,n_0\,\sigma^2\,E\|\hat f^{(-F_1,-F_2)}-\hat f^{(-F_1)}\|_2^2\\
&\quad+4\,n_0^2\,E\Big[E\Big(\big(f(X_j)-\hat f^{(-F_1,-F_2)}(X_j)\big)^2\ \Big|\ D_n^{(-F_1)}\Big)\,E\Big(\big(\hat f^{(-F_1,-F_2)}(X_j)-\hat f^{(-F_1)}(X_j)\big)^2\ \Big|\ D_n^{(-F_1)}\Big)\Big]\\
&\le 4\,n_0\,E\Big[\|f-\hat f^{(-F_1,-F_2)}\|_4^2\,\|\hat f^{(-F_1,-F_2)}-\hat f^{(-F_1)}\|_4^2\Big]+4\,n_0\,\sigma^2\,E\|\hat f^{(-F_1,-F_2)}-\hat f^{(-F_1)}\|_2^2+4\,n_0^2\,E\Big[\|f-\hat f^{(-F_1,-F_2)}\|_2^2\,\|\hat f^{(-F_1,-F_2)}-\hat f^{(-F_1)}\|_2^2\Big]\\
&\le 4\,n_0\,\Big(E\|f-\hat f^{(-F_1,-F_2)}\|_4^4\Big)^{1/2}\Big(E\|\hat f^{(-F_1,-F_2)}-\hat f^{(-F_1)}\|_4^4\Big)^{1/2}+4\,n_0\,\sigma^2\,E\|\hat f^{(-F_1,-F_2)}-\hat f^{(-F_1)}\|_2^2\\
&\quad+4\,n_0^2\,\Big(E\|f-\hat f^{(-F_1,-F_2)}\|_2^4\Big)^{1/2}\Big(E\|\hat f^{(-F_1,-F_2)}-\hat f^{(-F_1)}\|_2^4\Big)^{1/2}.
\end{aligned}$$
Under the condition $E\|f - \hat{f}_{\delta,n}\|_4^4 = o(1/n)$, it is seen that $Var(W_1)$ and $Var(C)$ are both of order n, while $Var(R_{1,1})$, $Var(R_{1,2})$, and $Cov(C, D)$ are all of order $o(n)$, which together imply $|Cov(W_1, W_2)| = o(Var(W_1))$ and consequently $\rho(W_1, W_2) \to 0$ as $n \to \infty$. This completes the proof of the theorem.

References

1. Nan, Y.; Yang, Y. Variable selection diagnostics measures for high-dimensional regression. J. Comput. Graph. Stat. 2014, 23, 636–656.
2. Yu, Y.; Yang, Y.; Yang, Y. Performance assessment of high-dimensional variable identification. arXiv 2017, arXiv:1704.08810.
3. Ye, C.; Yang, Y.; Yang, Y. Sparsity oriented importance learning for high-dimensional linear regression. J. Am. Stat. Assoc. 2018, 113, 1797–1812.
4. Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B (Methodol.) 1996, 58, 267–288.
5. Tikhonov, A.N. On the stability of inverse problems. Dokl. Akad. Nauk SSSR 1943, 39, 195–198.
6. Hoerl, A.E. Applications of ridge analysis to regression problems. Chem. Eng. Prog. 1962, 58, 54–59.
7. Akaike, H. A new look at the statistical model identification. IEEE Trans. Autom. Control 1974, 19, 716–723.
8. Hurvich, C.M.; Tsai, C.L. Regression and time series model selection in small samples. Biometrika 1989, 76, 297–307.
9. Schwarz, G. Estimating the dimension of a model. Ann. Stat. 1978, 6, 461–464.
10. Chen, J.; Chen, Z. Extended Bayesian information criteria for model selection with large model spaces. Biometrika 2008, 95, 759–771.
11. Wang, H.; Li, R.; Tsai, C.L. Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika 2007, 94, 553–568.
12. Allen, D.M. The relationship between variable selection and data agumentation and a method for prediction. Technometrics 1974, 16, 125–127.
13. Stone, M. Cross-validatory choice and assessment of statistical predictions. J. R. Stat. Soc. Ser. B (Methodol.) 1974, 36, 111–133.
14. Geisser, S. The predictive sample reuse method with applications. J. Am. Stat. Assoc. 1975, 70, 320–328.
15. Zhang, Y.; Yang, Y. Cross-validation for selecting a model selection procedure. J. Econom. 2015, 187, 95–112.
16. Breiman, L.; Friedman, J.H.; Olshen, R.A.; Stone, C.J. Classification and Regression Trees; Routledge: Oxfordshire, UK, 2017.
17. Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed.; Springer Series in Statistics; Springer: Berlin/Heidelberg, Germany, 2009.
18. Yang, Y. Comparing learning methods for classification. Stat. Sin. 2006, 16, 635–657.
19. Harrison, D., Jr.; Rubinfeld, D.L. Hedonic housing prices and the demand for clean air. J. Environ. Econ. Manag. 1978, 5, 81–102.
20. Scheetz, T.E.; Kim, K.Y.; Swiderski, R.E.; Philp, A.R.; Braun, T.A.; Knudtson, K.L.; Dorrance, A.M.; DiBona, G.F.; Huang, J.; Casavant, T.L.; et al. Regulation of gene expression in the mammalian eye and its relevance to eye disease. Proc. Natl. Acad. Sci. USA 2006, 103, 14429–14434.
21. Meinshausen, N.; Bühlmann, P. Stability selection. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 2010, 72, 417–473.
22. Lim, C.; Yu, B. Estimation stability with cross-validation (ESCV). J. Comput. Graph. Stat. 2016, 25, 464–492.
23. Yang, W.; Yang, Y. Toward an objective and reproducible model choice via variable selection deviation. Biometrics 2017, 73, 20–30.
Figure 1. Probability that the model with $\lambda_{min}$ outperforms the model with $\lambda_{1se}$ for estimation.
Figure 2. Probability that the model with $\lambda_{min}$ outperforms the model with $\lambda_{1se}$ for estimation.
Figure 3. Difference in proportions of selecting the true model: $P_c(\lambda_{1se}) - P_c(\lambda_{min})$ (positive means the 1se rule is better).
Figure 4. Difference in proportions of selecting the true model: $P_c(\lambda_{1se}) - P_c(\lambda_{min})$ (positive means the 1se rule is better).
Figure 5. Difference in proportions of selecting the true model: $P_c(\lambda_{1se}) - P_c(\lambda_{min})$ (positive means the 1se rule is better).
Table 1. Simulation settings for regression estimation and variable selection.

Parameter    Values
$\beta$      $\{0.1, 1, 3\}$ for constant coefficients; $1/i$, $i = 1, \dots, q$, for decaying coefficients
$\sigma^2$   $\{0.01, 4\}$
$\rho$       $\{0, 0.5, 0.9\}$
$n$          $\{100, 1000\}$
$p - q$      $\{20, 990\}$, where $q = 10$
$\Sigma_i$   $i = 1$ for AR(1) covariance; $i = 2$ for compound symmetry covariance
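For concreteness, here is a small data-generation sketch matching our reading of Table 1 (the function name, parameter names, and AR(1) parameterization are our assumptions): $q = 10$ nonzero coefficients decaying as $1/i$, design correlation $\rho$, $p - q$ pure noise variables, and noise variance $\sigma^2$.

```python
# Sketch of one simulation setting from Table 1 (decaying coefficients,
# AR(1) design covariance); names and defaults are illustrative.
import numpy as np

def make_data(n=100, q=10, p_extra=20, rho=0.5, sigma2=4.0, seed=0):
    rng = np.random.default_rng(seed)
    p = q + p_extra
    # AR(1) covariance: Sigma[i, j] = rho ** |i - j|
    idx = np.arange(p)
    Sigma = rho ** np.abs(idx[:, None] - idx[None, :])
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    # decaying coefficients 1/i on the first q variables, zeros elsewhere
    beta = np.concatenate([1.0 / np.arange(1, q + 1), np.zeros(p_extra)])
    y = X @ beta + np.sqrt(sigma2) * rng.standard_normal(n)
    return X, y, beta

X, y, beta = make_data()
```

The compound symmetry setting would replace `Sigma` with a matrix having ones on the diagonal and a constant off-diagonal entry.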
Table 2. Regression estimation (cross validation): Boston Housing Price. Proportion of $\lambda_{min}$ being better in prediction.

                          5-fold    10-fold    20-fold
train set $n_1 = 100$      0.554     0.526      0.538
train set $n_1 = 200$      0.560     0.556      0.564
train set $n_1 = 400$      0.552     0.564      0.558
Table 3. Regression estimation (cross validation): Bardet–Biedl. Proportion of $\lambda_{min}$ being better in prediction.

                         5-fold    10-fold
train set $n_1 = 40$      0.582     0.566
train set $n_1 = 80$      0.580     0.594
Table 4. Regression estimation (DGS, data-guided simulation). Proportion of $\lambda_{min}$ being better in estimation.

              Boston Housing Price            Bardet–Biedl
              5-fold   10-fold   20-fold      5-fold   10-fold
Scenario 1     0.478    0.486     0.468        0.528    0.523
Scenario 2     0.515    0.489     0.487        0.494    0.480
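Since every comparison above turns on how the 1se rule reads the CV curve, it may help to spell the rule out in code. The sketch below (our own notation, not the paper's code) takes per-fold CV errors on a $\lambda$ grid and returns both $\lambda_{min}$ and $\lambda_{1se}$, using the standard error formula whose accuracy this paper questions.

```python
# The 1se rule: pick the largest (most parsimonious) lambda whose mean CV
# error is within one standard error of the minimum mean CV error.
import numpy as np

def one_se_lambda(lambdas, cv_errors):
    """lambdas: (L,) penalty grid; cv_errors: (L, K) per-fold CV errors."""
    K = cv_errors.shape[1]
    means = cv_errors.mean(axis=1)
    # the standard error estimate under scrutiny in this paper
    ses = cv_errors.std(axis=1, ddof=1) / np.sqrt(K)
    i_min = int(np.argmin(means))
    within = means <= means[i_min] + ses[i_min]
    return lambdas[i_min], np.max(lambdas[within])
```

Note that the threshold uses the standard error at the minimizing $\lambda$ only; if that estimate is badly biased, the selected $\lambda_{1se}$ shifts accordingly.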
Table 5. Mean of selected model size $\hat{S}$: Boston Housing Price.

                                    5-fold    10-fold    20-fold
$S_{min} = 11$   $\hat{S}_{min}$     12.744    12.672     12.636
                 $\hat{S}_{1se}$     10.564    10.608     10.560
$S_{1se} = 8$    $\hat{S}_{min}$     10.792    10.824     10.772
                 $\hat{S}_{1se}$      8.250     8.254      8.246
Table 6. Mean of selected model size $\hat{S}$: Bardet–Biedl.

                                    5-fold    10-fold    20-fold
$S_{min} = 21$   $\hat{S}_{min}$     32.272    33.604     35.500
                 $\hat{S}_{1se}$     21.064    21.864     22.404
$S_{1se} = 19$   $\hat{S}_{min}$     24.208    25.426     26.350
                 $\hat{S}_{1se}$     19.852    19.992     20.274
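Tables 5 and 6 can be reproduced in spirit with off-the-shelf tools. The sketch below (not the paper's code) fits the Lasso path by CV with scikit-learn's LassoCV, derives $\lambda_{1se}$ from the stored per-fold error path, and counts the nonzero coefficients at each choice; in practice one would repeat this over many train/test splits, as the paper does.

```python
# Compare selected model sizes at lambda_min vs. lambda_1se with scikit-learn.
import numpy as np
from sklearn.linear_model import Lasso, LassoCV

def model_sizes(X, y, n_folds=10):
    cv = LassoCV(cv=n_folds).fit(X, y)
    means = cv.mse_path_.mean(axis=1)                     # mean CV error per alpha
    ses = cv.mse_path_.std(axis=1, ddof=1) / np.sqrt(n_folds)
    i_min = int(np.argmin(means))
    within = means <= means[i_min] + ses[i_min]
    alpha_1se = cv.alphas_[within].max()                  # largest alpha within 1 se
    s_min = np.count_nonzero(cv.coef_)                    # size at lambda_min
    s_1se = np.count_nonzero(Lasso(alpha=alpha_1se).fit(X, y).coef_)
    return s_min, s_1se
```

Consistent with Tables 5 and 6, on sparse designs the count at $\lambda_{1se}$ is typically smaller, reflecting the over-selection tendency of $\lambda_{min}$.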