Article

Selection Consistency of Lasso-Based Procedures for Misspecified High-Dimensional Binary Model and Random Regressors

by 1,2,† and 1,2,*,†
1 Institute of Computer Science, Polish Academy of Sciences, Jana Kazimierza 5, 01-248 Warsaw, Poland
2 Faculty of Mathematics and Information Science, Warsaw University of Technology, Koszykowa 75, 00-662 Warsaw, Poland
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Entropy 2020, 22(2), 153; https://doi.org/10.3390/e22020153
Received: 13 November 2019 / Revised: 22 January 2020 / Accepted: 24 January 2020 / Published: 28 January 2020

Abstract

We consider selection of random predictors for a high-dimensional regression problem with a binary response for a general loss function. An important special case is when the binary model is semi-parametric and the response function is misspecified under a parametric model fit. When the true response coincides with a postulated parametric response for a certain value of the parameter, we obtain a common framework for parametric inference. Both cases of correct specification and misspecification are covered in this contribution. Variable selection for such a scenario aims at recovering the support of the minimizer of the associated risk with large probability. We propose a two-step Screening-Selection (SS) procedure which consists of screening and ordering predictors by the Lasso method and then selecting the subset of predictors which minimizes the Generalized Information Criterion over the corresponding nested family of models. We prove consistency of the proposed selection method under conditions that allow for a much larger number of predictors than the number of observations. For the semi-parametric case, when the distribution of random predictors satisfies the linear regressions condition, the true and the estimated parameters are collinear and their common support can be consistently identified. This partly explains the robustness of selection procedures to response function misspecification.
Keywords: high-dimensional regression; loss function; random predictors; misspecification; consistent selection; subgaussianity; generalized information criterion; robustness

1. Introduction

Consider a random vector (X, Y) ∈ ℝ^p × {0, 1} and the corresponding response function defined as the a posteriori probability q(x) = P(Y = 1 | X = x). Estimation of the a posteriori probability is of paramount importance in machine learning and statistics, since many frequently applied methods, e.g., logistic or tree-based classifiers, rely on it. One of the main estimation methods for q is the parametric approach, in which the response function is assumed to have the parametric form
$$q(x) = q_0(\beta^T x) \quad (1)$$
for some fixed β and known q_0(x). If Equation (1) holds, that is, the underlying structure is correctly specified, then it is known that
$$\beta = \operatorname*{argmin}_{b \in \mathbb{R}^p} E_{X,Y}\{-Y \log q_0(b^T X) - (1 - Y)\log(1 - q_0(b^T X))\}, \quad (2)$$
or, equivalently (cf., e.g., [1]),
$$\beta = \operatorname*{argmin}_{b} E_X\, KL(q(X), q_0(X^T b)), \quad (3)$$
where E_X f(X) is the expected value of the random variable f(X) and KL(q(X), q_0(X^T b)) is the Kullback–Leibler distance between the binary distributions with success probabilities q(X) and q_0(X^T b):
$$KL(q(X), q_0(X^T b)) = q(X)\log\frac{q(X)}{q_0(X^T b)} + (1 - q(X))\log\frac{1 - q(X)}{1 - q_0(X^T b)}.$$
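As a concrete illustration (our own addition, not part of the paper), the Kullback–Leibler distance between two Bernoulli distributions can be computed directly; the function name `bernoulli_kl` is ours:

```python
import math

def bernoulli_kl(q, q0):
    """Kullback-Leibler distance between Bernoulli(q) and Bernoulli(q0)."""
    if not (0 < q0 < 1):
        raise ValueError("q0 must lie strictly between 0 and 1")
    kl = 0.0
    if q > 0:
        kl += q * math.log(q / q0)       # q * log(q / q0)
    if q < 1:
        kl += (1 - q) * math.log((1 - q) / (1 - q0))
    return kl
```

The distance vanishes exactly when the success probabilities agree, which is what makes Equation (3) a projection criterion.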
The equalities in Equations (2) and (3) form the theoretical underpinning of the (conditional) maximum likelihood (ML) method, as the expression under the expected value in Equation (2) is the minus conditional log-likelihood of Y given X in the parametric model. Moreover, it is a crucial property needed to show that, under appropriate conditions, ML estimates approximate β.
However, more frequently than not, the model in Equation (1) does not hold, i.e., the response q is misspecified, and ML estimators do not approximate β but rather the quantity defined by the right-hand side of Equation (3), namely
$$\beta^* = \operatorname*{argmin}_{b} E_X\, KL(q(X), q_0(X^T b)). \quad (4)$$
Thus, a parametric fit using the conditional ML method, which is the most popular approach to modeling binary response, also has a very intuitive geometric and information-theoretic flavor. Indeed, by fitting a parametric model we try to approximate β*, which yields the averaged KL projection of the unknown q on the set of parametric models {q_0(b^T x)}_{b ∈ ℝ^p}. A typical situation is a semi-parametric framework in which the true response function satisfies
$$q(x) = \tilde q(\beta^T x) \quad (5)$$
for some unknown q̃(x), while the model in Equation (1) with q̃ ≠ q_0 is fitted. An important problem is then how β* in Equation (4) relates to β in Equation (5). In particular, a frequently asked question is what can be said about the support of β = (β_1, …, β_p)^T, i.e., the set {i : β_i ≠ 0}, which consists of indices of predictors that truly influence Y. More specifically, the interplay between the support of β and the analogously defined support of β* is of importance, as the latter is consistently estimated and the support of the ML estimator is frequently considered as an approximation of the set of true predictors. Variable selection, or equivalently support recovery of β in the high-dimensional setting, is one of the most intensively studied subjects in contemporary statistics and machine learning. This is related to many applications in bioinformatics, biology, image processing, spatiotemporal analysis, and other research areas (see [2,3,4]). It is usually studied under correct model specification, i.e., under the assumption that the data are generated following a given parametric model (e.g., logistic or, in the case of quantitative Y, linear model).
Consider the following example: let q̃(x) = q_L(x³), where q_L(x) = e^x/(1 + e^x) is the logistic function. Define the regression model by P(Y = 1 | X) = q̃(β^T X) = q_L((X_1 + X_2)³), where X = (X_1, …, X_p) is an N(0, I_{p×p})-distributed vector of predictors, p > 2 and β = (1, 1, 0, …, 0) ∈ ℝ^p. The model is then obviously misspecified when the family of logistic models is fitted. However, it turns out in this case that, as X is elliptically contoured, β* = ηβ = η(1, 1, 0, …, 0) with η ≠ 0 (see [5]), and thus the supports of β and β* coincide. Hence, despite misspecification, the variable selection problem, i.e., finding out that X_1 and X_2 are the only active predictors, can be solved using the methods described below.
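This collinearity can be checked numerically. The sketch below (entirely our own illustration, assuming numpy is available; the sample size, step size, and iteration count are arbitrary choices) fits the misspecified linear logistic model to data generated from the cubic model above by plain gradient descent on the empirical logistic risk:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20000, 5
X = rng.standard_normal((n, p))
# true response: P(Y=1|X) = q_L((X_1 + X_2)^3); clip the cube to avoid overflow
z = np.clip((X[:, 0] + X[:, 1]) ** 3, -30.0, 30.0)
y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-z))).astype(float)

# Fit the (misspecified) linear logistic model by gradient descent on the
# empirical logistic risk; its minimizer approximates beta*.
b = np.zeros(p)
for _ in range(3000):
    grad = X.T @ (1.0 / (1.0 + np.exp(-X @ b)) - y) / n
    b -= 0.5 * grad
# beta* is proportional to (1, 1, 0, ..., 0): b[0] and b[1] come out close to
# each other and clearly nonzero, while the remaining coordinates stay near 0.
```

The fitted vector is approximately of the form η(1, 1, 0, …, 0), in line with the elliptical-contours result of [5].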
For recent contributions to the study of Kullback–Leibler projections on the logistic model (which coincide with Equation (4) for the logistic loss, see below) and references, we refer to the works of Kubkowski and Mielniczuk [6], Kubkowski and Mielniczuk [7] and Kubkowski [8]. We also refer to the work of Lu et al. [9], where the asymptotic distribution of the adaptive Lasso is studied under misspecification in the case of a fixed number of deterministic predictors. Questions of robustness analysis revolve around the interplay between β and β*, in particular under what conditions the directions of β and β* coincide (cf. the important contributions by Brillinger [10] and Ruud [11]).
In the present paper, we discuss this problem in a more general non-parametric setting. Namely, the minus conditional log-likelihood −y log q_0(b^T x) − (1 − y) log(1 − q_0(b^T x)) is replaced by a general loss function of the form
$$l(b, x, y) = \rho(b^T x, y), \quad (6)$$
where ρ: ℝ × {0, 1} → ℝ is some function, b, x ∈ ℝ^p, y ∈ {0, 1}, and
$$R(b) = E_{X,Y}\, l(b, X, Y)$$
is the associated risk function for b ∈ ℝ^p. Our aim is to determine the support of β*, where
$$\beta^* = \operatorname*{argmin}_{b \in \mathbb{R}^{p_n}} R(b). \quad (7)$$
Coordinates of β * corresponding to non-zero coefficients are called active predictors and vector β * the pseudo-true vector.
The most popular loss functions are related to the minus log-likelihood of specific parametric models, such as the logistic loss
$$l_{logist}(b, x, y) = -y\, b^T x + \log(1 + \exp(b^T x))$$
related to q_0(b^T x) = exp(b^T x)/(1 + exp(b^T x)), the probit loss
$$l_{probit}(b, x, y) = -y \log \Phi(b^T x) - (1 - y)\log(1 - \Phi(b^T x))$$
related to q_0(b^T x) = Φ(b^T x), or the quadratic loss l_{lin}(b, x, y) = (y − b^T x)²/2 related to linear regression and a quantitative response. Other losses which do not correspond to any parametric model, such as the Huber loss (see [12]), are constructed with the specific aim to induce certain desired properties of the corresponding estimators, such as robustness to outliers. We show in the following that the variable selection problem can be studied for a general loss function under certain analytic assumptions such as convexity and the Lipschitz property.
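For concreteness, the three losses above can be written out directly as functions of s = b^T x (a sketch of our own; `Phi` expresses the standard normal CDF via the error function, and the numerically stable form of the logistic loss is our choice):

```python
import math

def Phi(t):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def logistic_loss(s, y):
    # -y*s + log(1 + exp(s)), rewritten stably for large |s|
    return max(s, 0.0) - y * s + math.log1p(math.exp(-abs(s)))

def probit_loss(s, y):
    return -y * math.log(Phi(s)) - (1 - y) * math.log(1.0 - Phi(s))

def quadratic_loss(s, y):
    return (y - s) ** 2 / 2.0
```

At s = 0 both the logistic and the probit losses with y = 1 equal log 2, reflecting the fact that both response functions pass through 1/2 at the origin.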
For a fixed number p of predictors smaller than the sample size n, the statistical consequences of misspecification of a semi-parametric regression model were intensively studied by H. White and his collaborators in the 1980s. The concept of a projection on the fitted parametric model is central to these investigations, which show how the distribution of the maximum likelihood estimator centered by β* changes under misspecification (cf., e.g., [13,14]). However, for the case when p > n, the maximum likelihood estimator, which is a natural tool in the fixed p ≤ n case, is ill-defined, and a natural question arises: What can be estimated and by what methods?
The aim of the present paper is to study the above problem in the high-dimensional setting. To this end, we introduce a two-stage approach in which the first stage is based on Lasso estimation (cf., e.g., [2])
$$\hat\beta_L = \operatorname*{argmin}_{b \in \mathbb{R}^{p_n}} \Big\{ R_n(b) + \lambda_L \sum_{i=1}^{p_n} |b_i| \Big\}, \quad (8)$$
where b = (b_1, …, b_{p_n})^T and the empirical risk R_n(b) corresponding to R(b) is
$$R_n(b) = n^{-1} \sum_{i=1}^{n} \rho(b^T X_i, Y_i).$$
The parameter λ_L > 0 is the Lasso penalty, which penalizes large l_1-norms of potential candidates for a solution. Note that the criterion function in Equation (8) for ρ(s, y) = log(1 + exp(−s(2y − 1))) can be viewed as the penalized empirical risk for the logistic loss. The Lasso estimator is thoroughly studied in the case of the linear model, when the considered loss is the square loss (see, e.g., [2,4] for references and an overview of the subject), and some papers treat the case when such a model is fitted to Y which does not necessarily depend linearly on the regressors (cf. [15]). In this case, the regression model is misspecified with respect to the linear fit. However, similar results are scarce for other scenarios, in particular for the logistic fit under misspecification. One of the notable exceptions is Negahban et al. [16], who studied the behavior of the Lasso estimator for a general loss function and possibly misspecified models.
The output of the first stage is the Lasso estimate β̂_L. The second stage consists in ordering the predictors according to the absolute values of the corresponding non-zero coordinates of the Lasso estimator and then minimizing the Generalized Information Criterion (GIC) over the resulting nested family. This is a variant of the SOS (Screening-Ordering-Selection) procedure introduced in [17]. Let ŝ* be the model chosen by the GIC procedure.
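A minimal sketch of the SS idea for the quadratic loss can be written as follows (entirely our own illustration, not the paper's implementation; the ISTA solver, the function names `lasso_ista` and `ss_select`, and the BIC-type penalty used in the test are our assumptions):

```python
import numpy as np

def lasso_ista(X, y, lam, n_iter=2000):
    """Lasso for the quadratic loss, solved by proximal gradient (ISTA)."""
    n, p = X.shape
    step = n / (np.linalg.norm(X, 2) ** 2)  # 1 / Lipschitz constant of the gradient
    b = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y) / n
        z = b - step * grad
        b = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft threshold
    return b

def ss_select(X, y, lam, a_n):
    """Screening-Selection: order predictors by |Lasso coefficient|,
    then minimize GIC over the resulting nested family."""
    n, _ = X.shape
    b_lasso = lasso_ista(X, y, lam)
    order = [j for j in np.argsort(-np.abs(b_lasso)) if b_lasso[j] != 0]
    best, best_gic = set(), n * 0.5 * np.mean(y ** 2)  # start from the empty model
    support = []
    for j in order:
        support.append(j)
        bs, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
        risk = 0.5 * np.mean((y - X[:, support] @ bs) ** 2)  # empirical quadratic risk
        gic = n * risk + a_n * len(support)
        if gic < best_gic:
            best_gic, best = gic, set(support)
    return best
```

The nested family makes the second stage cheap: only as many GIC evaluations as there are non-zero Lasso coordinates, instead of an exhaustive search over all subsets.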
Our main contributions are as follows:
  • We prove that, under misspecification, when the sample size grows the support ŝ* coincides with the support of β* with probability tending to 1. In the general framework allowing for misspecification this means that the selection rule ŝ* is consistent, i.e., P(ŝ* = s*) → 1 as n → ∞. In particular, when the model in Equation (1) is correctly specified, this means that we recover the support of the true vector β with probability tending to 1.
  • We also prove an approximation result for the Lasso estimator when predictors are random and ρ is a convex Lipschitz function (cf. Theorem 1).
  • A useful corollary of the last result derived in the paper is the determination of sufficient conditions under which active predictors can be separated from spurious ones based on the absolute values of the corresponding coordinates of the Lasso estimator. This makes possible the construction of a nested family containing s* with large probability.
  • Significant insight has been gained for the fitting of a parametric model when predictors are elliptically contoured (e.g., multivariate normal). Namely, it is known that in such a situation β* = ηβ, i.e., these two vectors are collinear [5]. Thus, in the case when η ≠ 0, the support s* of β* coincides with the support s of β, and the selection consistency of the two-step procedure proved in the paper entails direction and support recovery of β. This may be considered as a partial justification of the frequent observation that classification methods are robust to misspecification of the model for which they were derived (see, e.g., [5,18]).
We now discuss how our results relate to previous work. Most variable selection methods in the high-dimensional case are studied for deterministic regressors; here, our results concern random regressors with subgaussian distributions. Note that the random regressors scenario is much more realistic for experimental data than the deterministic one. The stated results, to the best of our knowledge, are not available for random predictors even when the model is correctly specified. As to the novelty of the SS procedure, for its second stage we assume that the number of active predictors is bounded by a deterministic sequence k_n tending to infinity, and we minimize GIC over a family M of models whose sizes also satisfy this condition. Such an exhaustive search was proposed in [19] for linear models and extended to GLMs in [20] (cf. [21]). In these papers, GIC is optimized over all possible subsets of regressors with cardinality not exceeding a certain constant k_n. Such a method is feasible for practical purposes only when p_n is small. Here, we consider a similar set-up but with important differences: M is a data-dependent small nested family of models, and optimization of GIC is considered in the case when the original model is misspecified. The regressors are supposed random, and the assumptions are carefully tailored to this case. We also stress the fact that the presented results cover the case when the regression model is correctly specified and Equation (5) is satisfied.
In numerical experiments, we study the performance of the grid versions of logistic and linear SOS and compare them to several Lasso-based competitors.
The paper is organized as follows. Section 2 contains auxiliary results, including new useful probability inequalities for the empirical risk in the case of subgaussian random variables (Lemma 2). In Section 3, we prove a bound on the approximation error for the Lasso when the loss function is convex and Lipschitz and the regressors are random (Theorem 1). This yields the separation property of the Lasso. In Theorems 2 and 3 of Section 4, we prove GIC consistency over a nested family, which in particular can be built according to the order in which the Lasso coordinates are included in the fitted model. In Section 5.1, we discuss the consequences of the proved results for the semi-parametric binary model when the distribution of predictors satisfies the linear regressions condition. In Section 6, we numerically compare the performance of the two-stage selection method for two closely related models, one of which is a logistic model and the second one is misspecified.

2. Definitions and Auxiliary Results

In the following, we allow the random vector (X, Y), q(x), and p to depend on the sample size n, i.e., (X, Y) = (X^(n), Y^(n)) ∈ ℝ^{p_n} × {0, 1} and q_n(x) = P(Y^(n) = 1 | X^(n) = x). We assume that n copies X_1^(n), …, X_n^(n) of a random vector X^(n) in ℝ^{p_n} are observed together with the corresponding binary responses Y_1^(n), …, Y_n^(n). Moreover, we assume that the observations (X_i^(n), Y_i^(n)), i = 1, …, n, are independent and identically distributed (iid). If this condition is satisfied for each n, but not necessarily across different n and m, i.e., the distribution of (X_i^(n), Y_i^(n)) may differ from that of (X_j^(m), Y_j^(m)) or they may be dependent for m ≠ n, then such a framework is called a triangular scenario. A frequently considered scenario is the sequential one: when the sample size n increases, we observe values of new predictors in addition to the ones observed earlier. This is a special case of the above scheme, as then X_i^(n+1) = (X_i^(n)T, X_{i,p_n+1}, …, X_{i,p_{n+1}})^T. In the following, we skip the upper index n if no ambiguity arises. Moreover, we write q(x) = q_n(x). We impose a condition on the distributions of random predictors: we assume that the coordinates X_ij of X_i are subgaussian Subg(σ_jn²) with subgaussianity parameter σ_jn², i.e., it holds that (see [22])
$$E \exp(t X_{ij}) \le \exp(t^2 \sigma_{jn}^2 / 2)$$
for all t ∈ ℝ. This condition basically says that the tails of X_ij do not decrease more slowly than the tails of the normal distribution N(0, σ_jn²). For future reference, let
$$s_n^2 = \max_{j=1,\dots,p_n} \sigma_{jn}^2 \quad (9)$$
and assume in the following that
$$\gamma^2 := \limsup_{n \to \infty} s_n^2 < \infty. \quad (10)$$
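As a sanity check of the definition (our own addition), a centered Gaussian variable attains the subgaussian bound with equality: for X ~ N(0, σ²), completing the square gives

```latex
E\,e^{tX}
  = \int_{-\infty}^{\infty} e^{tx}\,\frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-x^2/(2\sigma^2)}\,dx
  = e^{t^2\sigma^2/2}\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x-t\sigma^2)^2/(2\sigma^2)}\,dx
  = e^{t^2\sigma^2/2},
```

so N(0, σ²) is Subg(σ²) and the subgaussianity parameter cannot be improved in this case.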
We assume moreover that X_i1, …, X_ip_n are linearly independent in the sense that no non-trivial linear combination of them is constant almost everywhere. We consider a general form of the response function q(x) = P(Y = 1 | X = x) and assume that, for the given loss function, β* as defined in Equation (7) exists and is unique. For s ⊆ {1, …, p_n}, let β*(s) be defined as in Equation (7) when the minimum is taken over b with support in s. We let
$$s^* = \operatorname{supp}(\beta^*(\{1,\dots,p_n\})) = \{i \le p_n : \beta^*_i \ne 0\}$$
denote the support of β*({1, …, p_n}) with β*({1, …, p_n}) = (β*_1, …, β*_{p_n})^T.
Let v_π = (v_{j_1}, …, v_{j_k})^T ∈ ℝ^{|π|} for v ∈ ℝ^{p_n} and π = {j_1, …, j_k} ⊆ {1, …, p_n}. Let β*_{s*} ∈ ℝ^{|s*|} be β* = β*({1, …, p_n}) restricted to its support s*. Note that if s* ⊆ s, then, provided the projections are unique (see Section 2), we have
$$\beta^*_{s^*} = \beta^*(s^*) = \beta^*(s)_{s^*}.$$
Note that this implies that, for every superset s ⊇ s*, the projection β*(s) on the model pertaining to s is obtained by appending the projection β*(s*) with an appropriate number of zeros. Moreover, let
$$\beta^*_{min} = \min_{i \in s^*} |\beta^*_i|.$$
We remark that β*, s* and β*_min may depend on n. We stress that β*_min is an important quantity in the development here, as it turns out that it must not decrease too quickly in order to obtain the approximation results for β̂_L (see Theorem 1). Note that, when the parametric model is correctly specified, i.e., q(x) = q_0(β^T x) for some β, with l being the associated log-likelihood loss, if s is the support of β, then s = s*.
First, we discuss quantities and assumptions needed for the first step of SS procedure.
We consider cones of the form
$$C_\varepsilon = \{\Delta \in \mathbb{R}^{p_n} : ||\Delta_{s^{*c}}||_1 \le (3 + \varepsilon)||\Delta_{s^*}||_1\},$$
where ε > 0, s*^c = {1, …, p_n} ∖ s* and Δ_{s*} = (Δ_{s*_1}, …, Δ_{s*_{|s*|}}) for s* = {s*_1, …, s*_{|s*|}}. The cones C_ε are of special importance because we prove that β̂_L − β* ∈ C_ε (see Lemma 3). In addition, we note that, since the l_1-norm is decomposable in the sense that ||v_A||_1 + ||v_{A^c}||_1 = ||v||_1, the definition of the cone above can be restated as
$$C_\varepsilon = \{\Delta \in \mathbb{R}^{p_n} : ||\Delta||_1 \le (4 + \varepsilon)||\Delta_{s^*}||_1\}.$$
Thus, C_ε consists of vectors which do not put too much mass on the complement of s*. Let H ∈ ℝ^{p_n × p_n} be a fixed non-negative definite matrix. For the cone C_ε, we define a quantity κ_H(ε) which can be regarded as a restricted minimal eigenvalue of a matrix in the high-dimensional set-up:
$$\kappa_H(\varepsilon) = \inf_{\Delta \in C_\varepsilon \setminus \{0\}} \frac{\Delta^T H \Delta}{\Delta^T \Delta}. \quad (12)$$
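The restricted eigenvalue in Equation (12) is an infimum over the cone and is rarely computable in closed form. As a rough numerical illustration (entirely our own; the function name and sampling scheme are assumptions), one can sample directions in C_ε and evaluate the Rayleigh quotient, which only bounds κ_H(ε) from above:

```python
import numpy as np

def kappa_upper_bound(H, s_star, eps, n_samples=2000, seed=0):
    """Monte Carlo upper bound on kappa_H(eps): sample directions Delta in the
    cone C_eps and keep the smallest Rayleigh quotient seen. The true value is
    an infimum over the whole cone, so sampling only bounds it from above."""
    rng = np.random.default_rng(seed)
    p = H.shape[0]
    s = np.array(sorted(s_star))
    sc = np.setdiff1d(np.arange(p), s)
    best = np.inf
    for _ in range(n_samples):
        d = np.zeros(p)
        d[s] = rng.standard_normal(s.size)
        tail = rng.standard_normal(sc.size)
        # scale the off-support part so that ||d_{s^c}||_1 <= (3+eps)*||d_s||_1
        budget = (3.0 + eps) * np.abs(d[s]).sum() * rng.uniform()
        d[sc] = tail * budget / np.abs(tail).sum()
        best = min(best, d @ H @ d / (d @ d))
    return best
```

For H equal to the identity the Rayleigh quotient is constant and the bound is exact, which makes for a simple check of the construction.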
In the considered context, H is usually taken to be the Hessian D²R(β*); e.g., for the quadratic loss it equals E XX^T. When H is non-negative definite but not strictly positive definite, its smallest eigenvalue λ_1 = 0, and thus inf_{Δ ∈ ℝ^p ∖ {0}} Δ^T H Δ / Δ^T Δ = λ_1 = 0. That is why we have to restrict the minimization in Equation (12) in order to have κ_H(ε) > 0 in the high-dimensional case. As we prove that Δ_0 = β̂_L − β* ∈ C_ε and use the bound 0 < κ_H(ε) ≤ Δ_0^T H Δ_0 / Δ_0^T Δ_0, it is natural to restrict the minimization in Equation (12) to C_ε ∖ {0}. Let R and R_n be the risk and the empirical risk defined above. Moreover, we introduce the following notation:
$$W(b) = R(b) - R(\beta^*),$$
$$W_n(b) = R_n(b) - R_n(\beta^*),$$
$$B_p(r) = \{\Delta \in \mathbb{R}^{p_n} : ||\Delta||_p \le r\}, \quad p = 1, 2,$$
$$S(r) = \sup_{b \in \mathbb{R}^{p_n}:\, b - \beta^* \in B_1(r)} |W(b) - W_n(b)|. \quad (16)$$
Note that E R_n(b) = R(b). Thus, S(r) corresponds to the oscillation of the centred empirical risk over the ball B_1(r). We need the following Margin Condition (MC) in Lemma 3 and Theorem 1:
(MC)
There exist ϑ, ε, δ > 0 and a non-negative definite matrix H ∈ ℝ^{p_n × p_n} such that for all b with b − β* ∈ C_ε ∩ B_1(δ) we have
$$R(b) - R(\beta^*) \ge \frac{\vartheta}{2}(b - \beta^*)^T H (b - \beta^*).$$
The above condition can be viewed as a weaker version of strong convexity of the function R (in which the right-hand side is replaced by ϑ||b − β*||²) in a restricted neighbourhood of β* (namely, in the intersection of the ball B_1(δ) and the cone C_ε). We stress the fact that H is not required to be positive definite, as in Section 3 we use Condition (MC) together with conditions stronger than κ_H(ε) > 0 which imply that the right-hand side of the inequality in (MC) is positive. We also do not require twice differentiability of R. We note in particular that Condition (MC) is satisfied in the case of the logistic loss, X being a bounded random variable and H = D²R(β*) (see [23,24,25]). It is also easily seen that (MC) is satisfied for the quadratic loss, X such that E||X||_2² < ∞ and H = D²R(β*). A condition similar to (MC) (called Restricted Strong Convexity) was considered in [16] for the empirical risk R_n:
$$R_n(\beta^* + \Delta) - R_n(\beta^*) \ge D R_n(\beta^*)^T \Delta + \kappa_L ||\Delta||^2 - \tau^2(\beta^*)$$
for all Δ ∈ C(3, s*), some κ_L > 0, and a tolerance function τ. Note, however, that (MC) is a deterministic condition, whereas Restricted Strong Convexity has to be satisfied by the random empirical risk function.
Another important assumption, used in Theorem 1 and Lemma 2, is the Lipschitz property of ρ :
(LL)
$$\exists\, L > 0\ \ \forall\, b_1, b_2 \in \mathbb{R},\ y \in \{0,1\}: \quad |\rho(b_1, y) - \rho(b_2, y)| \le L|b_1 - b_2|.$$
Now, we discuss preliminaries needed for the development of the second step of the SS procedure. Let |w| stand for the cardinality of w. For the second step of the procedure, we consider an arbitrary family M ⊆ 2^{{1, …, p_n}} of models (which are identified with subsets of {1, …, p_n} and may be data-dependent) such that s* ∈ M and |w| ≤ k_n a.e. for all w ∈ M, where k_n ∈ ℕ_+ is some deterministic sequence. We define the Generalized Information Criterion (GIC) as
$$GIC(w) = n R_n(\hat\beta(w)) + a_n |w|,$$
where
$$\hat\beta(w) = \operatorname*{argmin}_{b \in \mathbb{R}^{p_n}:\, b_{w^c} = 0_{|w^c|}} R_n(b)$$
is the ML estimator for model w, as the minimization above is taken over all vectors b with support in w. The parameter a_n > 0 is a penalty factor depending on the sample size n, which weighs how important the complexity of the model, described by the number of its variables |w|, is. Typical examples of a_n include:
  • AIC (Akaike Information Criterion): a n = 2 ;
  • BIC (Bayesian Information Criterion): a n = log n ; and
  • EBIC(d) (Extended BIC): a n = log n + 2 d log p n , where d > 0 .
AIC, BIC, and EBIC were introduced by Akaike [26], Schwarz [27], and Chen and Chen [19], respectively. Note that for n ≥ 8 the BIC penalty is larger than the AIC penalty and, in turn, the EBIC penalty is larger than the BIC penalty.
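The three penalties can be compared directly (a small sketch of ours; the function name `a_n` simply mirrors the definitions above):

```python
import math

def a_n(criterion, n, p=None, d=1.0):
    """Penalty factor a_n for the GIC variants listed above."""
    if criterion == "AIC":
        return 2.0
    if criterion == "BIC":
        return math.log(n)
    if criterion == "EBIC":  # requires the dimension p and a parameter d > 0
        return math.log(n) + 2.0 * d * math.log(p)
    raise ValueError(criterion)
```

Since log 8 > 2, the ordering AIC < BIC < EBIC indeed holds already at n = 8.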
We study properties of S_k(r) for k = 1, 2, where
$$S_k(r) = \sup_{b \in D_k:\, b - \beta^* \in B_2(r)} |W_n(b) - W(b)| \quad (18)$$
is the maximal absolute value of the centred empirical risk W_n(·), and the sets D_k for k = 1, 2 are defined as follows:
$$D_1 = \{b \in \mathbb{R}^{p_n} : \exists\, w \in M:\ |w| \le k_n,\ s^* \subseteq w,\ \operatorname{supp} b \subseteq w\},$$
$$D_2 = \{b \in \mathbb{R}^{p_n} : \operatorname{supp} b \subseteq s^*\}.$$
The idea here is simply to consider sets D_i consisting of vectors having no more than k_n non-zero coordinates. However, for |s*| ≤ k_n, we need that for b ∈ D_i we have |supp(b − β*)| ≤ k_n, which we exploit in Lemma 2. This entails the additional condition in the definition of D_1. Moreover, in Section 4, we consider the following condition C_ϵ(w) for ϵ > 0, w ⊆ {1, …, p_n} and some θ > 0:
  • C_ϵ(w): R(b) − R(β*) ≥ θ||b − β*||_2² for all b ∈ ℝ^{p_n} such that supp b ⊆ w and b − β* ∈ B_2(ϵ).
We observe also that, although Conditions (MC) and C_ϵ(w) are similar, they are not equivalent, as they hold for v = b − β* belonging to different sets: B_1(r) ∩ C_ε and B_2(ϵ) ∩ {Δ ∈ ℝ^{p_n} : supp Δ ⊆ w}, respectively. If the minimal eigenvalue λ_min of the matrix H in Condition (MC) is positive and Condition (MC) holds for b − β* ∈ B_1(r) (instead of for b − β* ∈ C_ε ∩ B_1(r)), then we have for b − β* ∈ B_2(r/√p_n) ⊆ B_1(r):
$$R(b) - R(\beta^*) \ge \frac{\vartheta}{2}(b - \beta^*)^T H (b - \beta^*) \ge \frac{\vartheta \lambda_{min}}{2}\, ||b - \beta^*||_2^2.$$
Furthermore, if λ_max is the maximal eigenvalue of H and Condition C_ϵ(w) holds for all v = b − β* ∈ B_2(r) without the restriction on supp b, then we have for b − β* ∈ B_1(r) ∩ B_2(r):
$$R(b) - R(\beta^*) \ge \theta ||b - \beta^*||_2^2 \ge \frac{\theta}{\lambda_{max}}\, (b - \beta^*)^T H (b - \beta^*).$$
Thus, Condition (MC) holds in this case. A condition similar to C_ϵ(w) for the empirical risk R_n was considered by Kim and Jeon [28] (formula (2.1)) in the context of GIC minimization. It turns out that Condition C_ϵ(w), together with ρ(·, y) being convex for all y and satisfying Lipschitz Condition (LL), is sufficient to establish bounds which ensure GIC consistency for k_n ln p_n = o(n) and k_n ln p_n = o(a_n) (see Corollaries 2 and 3). First, we state the following basic inequality; W(v) and S(r) are defined above the definition of the Margin Condition.
Lemma 1.
(Basic inequality). Let ρ(·, y) be a convex function for all y. If for some r > 0 we set
$$u = \frac{r}{r + ||\hat\beta_L - \beta^*||_1}, \qquad v = u \hat\beta_L + (1 - u)\beta^*,$$
then
$$W(v) + \lambda ||v - \beta^*||_1 \le S(r) + 2\lambda ||v_{s^*} - \beta^*_{s^*}||_1.$$
The proof of the lemma is moved to Appendix A. It follows from the lemma that, since in view of the decomposability of the l_1-distance we have ||v − β*||_1 = ||(v − β*)_{s*}||_1 + ||(v − β*)_{s*^c}||_1, when S(r) is small, ||(v − β*)_{s*^c}||_1 is not large in comparison with ||(v − β*)_{s*}||_1.
Quantities S_k(r) are defined in Equation (18). Recall that S(r) is an oscillation taken over the ball B_1(r), whereas S_k(r), k = 1, 2, are oscillations taken over the ball B_2(r) with restrictions on the number of nonzero coordinates.
Lemma 2.
Let ρ(·, y) be a convex function for all y satisfying Lipschitz Condition (LL). Assume that X_ij for j ≥ 1 are subgaussian Subg(σ_jn²), where σ_jn ≤ s_n. Then, for r, t > 0:
1. 
$$P(S(r) > t) \le \frac{8 L r s_n \sqrt{\log(p_n \vee 2)}}{t\sqrt{n}},$$
2. 
$$P(S_1(r) \ge t) \le \frac{8 L r s_n \sqrt{k_n \ln(p_n \vee 2)}}{t\sqrt{n}},$$
3. 
$$P(S_2(r) \ge t) \le \frac{4 L r s_n \sqrt{|s^*|}}{t\sqrt{n}}.$$
The proof of the Lemma above, which relies on the Chebyshev inequality, the symmetrization inequality (see Lemma 2.3.1 of [29]), and the Talagrand–Ledoux inequality ([30], Theorem 4.12), is moved to Appendix A. In the case when β* does not depend on n, and thus its support does not change, Part 3 implies in particular that S_2(r) is of order n^{−1/2} in probability.

3. Properties of Lasso for a General Loss Function and Random Predictors

The main result of this section is Theorem 1. The idea of the proof is based on the fact that, if S(r) defined in Equation (16) is sufficiently small (condition S(r) ≤ C̄λr is satisfied), then β̂_L lies in the ball {Δ ∈ ℝ^{p_n} : ||Δ − β*||_1 ≤ r} (see Lemma 3). Using the tail inequality for S(r) proved in Lemma 2, we obtain Theorem 1. Note that κ_H(ε) has to be bounded away from 0 (condition 2|s*|λ ≤ κ_H(ε)ϑC̃r). Convexity of ρ(·, y) below is understood as convexity for both y = 0, 1.
Lemma 3.
Let ρ(·, y) be a convex function and assume that λ > 0. Moreover, assume margin Condition (MC) with constants ϑ, ε, δ > 0 and some non-negative definite matrix H ∈ ℝ^{p_n × p_n}. If for some r ∈ (0, δ] we have S(r) ≤ C̄λr and 2|s*|λ ≤ κ_H(ε)ϑC̃r, where C̄ = ε/(8 + 2ε) and C̃ = 2/(4 + ε), then
$$||\hat\beta_L - \beta^*||_1 \le r.$$
The proof of the lemma is moved to the Appendix A.
The first main result provides an exponential inequality for P(||β̂_L − β*||_1 ≤ β*_min/2). The threshold β*_min/2 is crucial there, as it ensures separation: max_{i ∈ s*^c} |β̂_{L,i}| ≤ min_{i ∈ s*} |β̂_{L,i}| (see the proof of Corollary 1).
Theorem 1.
Let ρ(·, y) be a convex function for all y satisfying Lipschitz Condition (LL). Assume that X_ij ~ Subg(σ_jn²), β* exists and is unique, margin Condition (MC) is satisfied for ε, δ, ϑ > 0 and a non-negative definite matrix H ∈ ℝ^{p_n × p_n}, and let
$$2|s^*|\lambda \le \vartheta \kappa_H(\varepsilon)\, \tilde C \min\Big\{\frac{\beta^*_{min}}{2}, \delta\Big\},$$
where C̃ = 2/(4 + ε). Then,
$$P\Big(||\hat\beta_L - \beta^*||_1 \le \frac{\beta^*_{min}}{2}\Big) \ge 1 - 2 p_n\, e^{-\frac{n \varepsilon^2 \lambda^2}{A}},$$
where A = 128L²(4 + ε)²s_n².
Proof. 
Let
$$m = \min\Big\{\frac{\beta^*_{min}}{2}, \delta\Big\}.$$
Lemmas 2 and 3 imply that
$$P\Big(||\hat\beta_L - \beta^*||_1 > \frac{\beta^*_{min}}{2}\Big) \le P(||\hat\beta_L - \beta^*||_1 > m) \le P(S(m) > \bar C \lambda m) \le 2 p_n\, e^{-\frac{n \varepsilon^2 \lambda^2}{128 L^2 (4 + \varepsilon)^2 s_n^2}}.$$
 □
Corollary 1.
(Separation property). If the assumptions of Theorem 1 are satisfied,
$$\lambda = \frac{8 L s_n (4 + \varepsilon)\phi}{\varepsilon}\sqrt{\frac{2\log(2 p_n)}{n}}$$
for some φ > 1, κ_H(ε) > d for some d, ε > 0 and large n, and |s*|λ = o(min{β*_min, 1}), then
$$P\Big(||\hat\beta_L - \beta^*||_1 \le \frac{\beta^*_{min}}{2}\Big) \to 1.$$
Moreover,
$$P\Big(\max_{i \in s^{*c}} |\hat\beta_{L,i}| \le \min_{i \in s^*} |\hat\beta_{L,i}|\Big) \to 1. \quad (21)$$
Proof. 
The first part of the corollary follows directly from Theorem 1 and the observation that
$$P\Big(||\hat\beta_L - \beta^*||_1 > \frac{\beta^*_{min}}{2}\Big) \le e^{\log(2 p_n) - \frac{n \varepsilon^2 \lambda^2}{128 L^2 (4 + \varepsilon)^2 s_n^2}} = e^{\log(2 p_n)(1 - \phi^2)} \to 0.$$
Now, we prove that the condition ||β̂_L − β*||_1 ≤ β*_min/2 implies the separation property
$$\max_{i \in s^{*c}} |\hat\beta_{L,i}| \le \min_{i \in s^*} |\hat\beta_{L,i}|.$$
Indeed, observe that for all j ∈ {1, …, p_n} we have
$$\frac{\beta^*_{min}}{2} \ge ||\hat\beta_L - \beta^*||_1 \ge |\hat\beta_{L,j} - \beta^*_j|. \quad (22)$$
If j ∈ s*, then the triangle inequality yields
$$|\hat\beta_{L,j} - \beta^*_j| \ge |\beta^*_j| - |\hat\beta_{L,j}| \ge \beta^*_{min} - |\hat\beta_{L,j}|.$$
Hence, from the above inequality and Equation (22), we obtain for j ∈ s*: |β̂_{L,j}| ≥ β*_min/2. If j ∈ s*^c, then β*_j = 0 and Equation (22) takes the form |β̂_{L,j}| ≤ β*_min/2. This ends the proof. □
We note that the separation property in Equation (21) means that, when λ is chosen in an appropriate manner, recovery of s* is feasible with large probability if all predictors whose Lasso coefficients exceed a certain threshold in absolute value are chosen. The threshold unfortunately depends on unknown parameters of the model. However, the separation property allows us to restrict attention to a nested family of models and thus to decrease significantly the computational complexity of the problem. This is dealt with in the next section. Note, moreover, that if γ in Equation (10) is finite, then λ defined in the Corollary is of order (log p_n / n)^{1/2}, which is the optimal order of the Lasso penalty in the case of deterministic regressors (see, e.g., [2]).

4. GIC Consistency for a General Loss Function and Random Predictors

Theorems 2 and 3 state probability inequalities related to the behavior of GIC on supersets and subsets of s*, respectively. In a nutshell, we show for supersets and subsets separately that the probability that the minimum of GIC is not attained at s* is exponentially small. Corollaries 2 and 3 present asymptotic conditions for GIC consistency in the aforementioned situations. Corollary 4 gathers the conclusions of Theorem 1 and Corollaries 1–3 to show consistency of the SS procedure (see [17] for consistency of the SOS procedure for a linear model with deterministic predictors) in the case of subgaussian variables. Note that in the Theorem below we consider minimization of GIC in Equation (23) over all supersets of s*, as in our applications M is data-dependent. As the number of such possible subsets is at least $\binom{p_n - |s^*|}{k_n - |s^*|}$, the proof has to be more involved than reasoning based on the Bonferroni inequality.
Theorem 2.
Assume that $\rho(\cdot, y)$ is a convex, Lipschitz function with constant $L > 0$, $X_{ij} \sim Subg(\sigma_{jn}^2)$, and condition $C_\epsilon(w)$ holds for some $\epsilon, \theta > 0$ and for every $w \subseteq \{1, \dots, p_n\}$ such that $|w| \le k_n$. Then, for any $r < \epsilon$, we have:
$P(\min_{w \in M : s^* \subsetneq w} GIC(w) \le GIC(s^*)) \le 2 p_n e^{-a_n^2 / B} + 2 p_n e^{-n D / k_n}$,
where $B = 32 n L^2 r^2 k_n s_n^2$ and $D = \theta^2 r^2 / (512 L^2 s_n^2)$.
Proof. 
If $s^* \subsetneq w \in M$ and $\hat{\beta}(w) - \beta^* \in B_2(r)$, then in view of the inequalities $R_n(\hat{\beta}(s^*)) \le R_n(\beta^*)$ and $R(\beta^*) \le R(b)$ we have:
$R_n(\hat{\beta}(s^*)) - R_n(\hat{\beta}(w)) \le \sup_{b \in D_1 : b - \beta^* \in B_2(r)} (R_n(\beta^*) - R_n(b)) \le \sup_{b \in D_1 : b - \beta^* \in B_2(r)} ((R_n(\beta^*) - R(\beta^*)) - (R_n(b) - R(b))) \le \sup_{b \in D_1 : b - \beta^* \in B_2(r)} |R_n(b) - R(b) - (R_n(\beta^*) - R(\beta^*))| = S_1(r)$.
Note that $a_n(|w| - |s^*|) \ge a_n$. Hence, if $GIC(w) \le GIC(s^*)$ for some $w \supsetneq s^*$, then we obtain $n R_n(\hat{\beta}(s^*)) - n R_n(\hat{\beta}(w)) \ge a_n(|w| - |s^*|)$, and from the above inequality we have $S_1(r) \ge a_n / n$. Furthermore, if $\hat{\beta}(w) - \beta^* \in B_2(r)^c$ and $r < \epsilon$, then consider:
$v = u \hat{\beta}(w) + (1 - u) \beta^*$,
where $u = r / (r + ||\hat{\beta}(w) - \beta^*||_2)$. Then
$||v - \beta^*||_2 = u ||\hat{\beta}(w) - \beta^*||_2 = \frac{r ||\hat{\beta}(w) - \beta^*||_2}{r + ||\hat{\beta}(w) - \beta^*||_2} \ge \frac{r}{2}$,
as the function $x / (x + r)$ is increasing with respect to $x$ for $x > 0$. Moreover, we have $||v - \beta^*||_2 \le r < \epsilon$. Hence, in view of the $C_\epsilon(w)$ condition, we get:
$R(v) - R(\beta^*) \ge \theta ||v - \beta^*||_2^2 \ge \frac{\theta r^2}{4}$.
From convexity of $R_n$, we have:
$R_n(v) \le u (R_n(\hat{\beta}(w)) - R_n(\beta^*)) + R_n(\beta^*) \le R_n(\beta^*)$.
Let $supp\, v$ denote the support of vector $v$. We observe that $supp\, v \subseteq supp\, \hat{\beta}(w) \cup supp\, \beta^* \subseteq w$, hence $v \in D_1$. Finally, we have:
$S_1(r) \ge R_n(\beta^*) - R(\beta^*) - (R_n(v) - R(v)) \ge R(v) - R(\beta^*) \ge \frac{\theta r^2}{4}$.
Hence, we obtain the following sequence of inequalities:
$P(\min_{w \in M : s^* \subsetneq w} GIC(w) \le GIC(s^*)) \le P(S_1(r) \ge \frac{a_n}{n},\ \forall_{w \in M : s^* \subsetneq w}\ \hat{\beta}(w) - \beta^* \in B_2(r)) + P(\exists_{w \in M : s^* \subsetneq w}\ \hat{\beta}(w) - \beta^* \in B_2(r)^c) \le P(S_1(r) \ge \frac{a_n}{n}) + P(S_1(r) \ge \frac{\theta r^2}{4}) \le 2 p_n e^{-\frac{a_n^2}{32 n L^2 r^2 k_n s_n^2}} + 2 p_n e^{-\frac{n \theta^2 r^2}{512 L^2 k_n s_n^2}}$.
 □
Corollary 2.
Assume that the conditions of Theorem 2 hold for some $\epsilon, \theta > 0$ and for every $w \subseteq \{1, \dots, p_n\}$ such that $|w| \le k_n$, that $k_n \ln(p_n \vee 2) = o(n)$ and $\liminf_n \frac{D_n a_n}{k_n \log(2 p_n)} > 1$, where $D_n^{-1} = 128 L^2 s_n^2 \phi / \theta$ for some $\phi > 1$. Then, we have
$P(\min_{w \in M : s^* \subsetneq w} GIC(w) \le GIC(s^*)) \to 0$.
Proof. 
We choose the radius $r$ of $B_2(r)$ in a special way. Namely, we take:
$r_n^2 = \frac{512 \phi^2 L^2 s_n^2 \log(2 p_n) k_n}{n \theta^2}$
for some $\phi > 1$. In view of the assumptions, $r_n \to 0$. Consider $n_0$ such that $r_n < \epsilon$ for all $n \ge n_0$. Hence, the second term of the upper bound in Equation (23) for $r = r_n$ is equal to:
$2 p_n e^{-\frac{n \theta^2 r_n^2}{512 L^2 k_n s_n^2}} = e^{\log(2 p_n)(1 - \phi^2)} \to 0$.
Similarly, the first term of the upper bound in Equation (23) is equal to:
$2 p_n e^{-\frac{a_n^2}{32 n L^2 r_n^2 k_n s_n^2}} = e^{\log(2 p_n)\left(1 - \frac{a_n^2 \theta^2}{128^2 L^4 k_n^2 s_n^4 \phi^2 \log^2(2 p_n)}\right)} = e^{\log(2 p_n)\left(1 - \frac{D_n^2 a_n^2}{k_n^2 \log^2(2 p_n)}\right)} \to 0$.
These two convergences end the proof. □
The most restrictive condition of Corollary 2 is $\liminf_n \frac{D_n a_n}{k_n \log(2 p_n)} > 1$, which is slightly weaker than $k_n \ln(p_n \vee 2) = o(a_n)$. The following remark, proved in Appendix A, gives sufficient conditions for consistency of BIC and EBIC penalties, which do not satisfy the condition $k_n \log(p_n \vee 2) = o(a_n)$.
Remark 1.
If in Corollary 2 we assume $D_n \ge A$ for some $A > 0$, then the condition $\liminf_n \frac{D_n a_n}{k_n \log(2 p_n)} > 1$ holds when:
(1) 
$a_n = \log n$ and $p_n < \frac{1}{2} n^{A / ((1 + u) k_n)}$ for some $u > 0$.
(2) 
$a_n = \log n + 2 \gamma \log p_n$, $k_n \le C$ and $2 A \gamma - (1 + u) C \ge 0$, where $C, u > 0$.
(3) 
$a_n = \log n + 2 \gamma \log p_n$, $k_n \le C$, $2 A \gamma - (1 + u) C < 0$ and $p_n < B n^\delta$, where $\delta = \frac{A}{(1 + u) C - 2 A \gamma}$ and $B = 2^{-\frac{(1 + u) C}{(1 + u) C - 2 A \gamma}}$.
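The interplay between $a_n$, $k_n$ and $p_n$ in these conditions can be checked numerically. The snippet below is our own illustration (the constants $A$, $u$ and $k_n$ are hypothetical, not taken from the paper); it verifies that, under condition (1) with the BIC penalty $a_n = \log n$, the ratio $A a_n / ((1 + u) k_n \log(2 p_n))$ stays above 1 whenever $p_n$ lies below the stated bound.

```python
import math

# Hypothetical constants for illustration only (not prescribed by the paper).
A, u, k_n = 2.0, 0.1, 3

def bic_ratio(n):
    """A*a_n / ((1+u)*k_n*log(2*p_n)) for BIC (a_n = log n) and an admissible p_n."""
    a_n = math.log(n)
    p_max = 0.5 * n ** (A / ((1 + u) * k_n))   # bound on p_n from condition (1)
    p_n = max(2, int(0.9 * p_max))             # any p_n below the bound
    return A * a_n / ((1 + u) * k_n * math.log(2 * p_n))

ratios = [bic_ratio(n) for n in (10**3, 10**4, 10**5)]
print([round(x, 3) for x in ratios])           # each ratio exceeds 1
```

Since $D_n \ge A$, a ratio above 1 for almost all $n$ is exactly the sufficient condition of the Remark.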
Theorem 3 is an analog of Theorem 2 for subsets of s * .
Theorem 3.
Assume that $\rho(\cdot, y)$ is a convex, Lipschitz function with constant $L > 0$, $X_{ij} \sim Subg(\sigma_{jn}^2)$, condition $C_\epsilon(s^*)$ holds for some $\epsilon, \theta > 0$, and $8 a_n |s^*| \le \theta n \min\{\epsilon^2, \beta_{min}^{*2}\}$. Then, we have:
$P(\min_{w \in M : w \not\supseteq s^*} GIC(w) \le GIC(s^*)) \le 2 e^{-n \min\{\epsilon, \beta_{min}^*\}^2 E}$,
where $E = \theta^2 / (2^{12} L^2 s_n^2 |s^*|)$.
Proof. 
Suppose that for some $w \not\supseteq s^*$ we have $GIC(w) \le GIC(s^*)$. This is equivalent to:
$n R_n(\hat{\beta}(s^*)) - n R_n(\hat{\beta}(w)) \ge a_n(|w| - |s^*|)$.
In view of the inequalities $R_n(\hat{\beta}(s^*)) \le R_n(\beta^*)$ and $a_n(|w| - |s^*|) \ge -a_n |s^*|$, we obtain:
$n R_n(\beta^*) - n R_n(\hat{\beta}(w)) \ge -a_n |s^*|$.
Let $v = u \hat{\beta}(w) + (1 - u) \beta^*$ for some $u \in [0, 1]$ to be specified later. From convexity of $\rho$, we obtain:
$n R_n(\beta^*) - n R_n(v) \ge n u (R_n(\beta^*) - R_n(\hat{\beta}(w))) \ge -u a_n |s^*| \ge -a_n |s^*|$.
We consider two cases separately:
(1) $\beta_{min}^* > \epsilon$.
First, observe that
$8 a_n |s^*| \le \theta \epsilon^2 n$,
which follows from our assumption. Let $u = \epsilon / (\epsilon + ||\hat{\beta}(w) - \beta^*||_2)$ and
$v = u \hat{\beta}(w) + (1 - u) \beta^*$.
Note that $||\hat{\beta}(w) - \beta^*||_2 \ge ||\beta_{s^* \setminus w}^*||_2 \ge \beta_{min}^*$. Then, as the function $d(x) = x / (x + c)$ is increasing and bounded from above by 1 for $x, c > 0$, we obtain:
$\epsilon \ge ||v - \beta^*||_2 = \frac{\epsilon ||\hat{\beta}(w) - \beta^*||_2}{\epsilon + ||\hat{\beta}(w) - \beta^*||_2} \ge \frac{\epsilon \beta_{min}^*}{\epsilon + \beta_{min}^*} > \frac{\epsilon^2}{2 \epsilon} = \frac{\epsilon}{2}$.
Hence, in view of the $C_\epsilon(s^*)$ condition, we have:
$R(v) - R(\beta^*) > \frac{\theta \epsilon^2}{4}$.
Using Equations (24)–(26) and the above inequality yields:
$S_2(\epsilon) \ge R_n(\beta^*) - R(\beta^*) - (R_n(v) - R(v)) > \frac{\theta \epsilon^2}{4} - \frac{a_n |s^*|}{n} \ge \frac{\theta \epsilon^2}{8}$.
Thus, in view of Lemma 2, we obtain:
$P(\min_{w \in M : w \not\supseteq s^*} GIC(w) \le GIC(s^*)) \le P\left(S_2(\epsilon) > \frac{\theta \epsilon^2}{8}\right) \le 2 e^{-\frac{n \theta^2 \epsilon^2}{4096 L^2 s_n^2 |s^*|}}$.
(2) $\beta_{min}^* \le \epsilon$.
In this case, we take $u = \beta_{min}^* / (\beta_{min}^* + ||\hat{\beta}(w) - \beta^*||_2)$ and define $v$ as in Equation (26). Analogously as in Equation (27), we have:
$\frac{\beta_{min}^*}{2} \le ||v - \beta^*||_2 \le \beta_{min}^*$.
Hence, in view of the $C_\epsilon(s^*)$ condition, we have:
$R(v) - R(\beta^*) \ge \frac{\theta \beta_{min}^{*2}}{4}$.
Using Equation (24) and the above inequality yields:
$S_2(\beta_{min}^*) \ge R_n(\beta^*) - R(\beta^*) - (R_n(v) - R(v)) \ge \frac{\theta \beta_{min}^{*2}}{4} - \frac{a_n |s^*|}{n} \ge \frac{\theta}{8} \beta_{min}^{*2}$.
Thus, in view of Lemma 2, we obtain:
$P(\min_{w \in M : w \not\supseteq s^*} GIC(w) \le GIC(s^*)) \le P\left(S_2(\beta_{min}^*) \ge \frac{\theta}{8} \beta_{min}^{*2}\right) \le 2 e^{-\frac{n \theta^2 \beta_{min}^{*2}}{2^{12} L^2 s_n^2 |s^*|}}$.
By combining Equations (28) and (29), the theorem follows. □
Corollary 3.
Assume that the loss $\rho(\cdot, y)$ is a convex, Lipschitz function with constant $L > 0$, $X_{ij} \sim Subg(\sigma_{jn}^2)$, condition $C_\epsilon(s^*)$ holds for some $\epsilon, \theta > 0$ and $a_n |s^*| = o(n \min\{1, \beta_{min}^*\}^2)$. Then
$P(\min_{w \in M : w \not\supseteq s^*} GIC(w) \le GIC(s^*)) \to 0$.
Proof. 
First, observe that, as $a_n \to \infty$, the condition
$a_n |s^*| = o(n \min\{1, \beta_{min}^*\}^2)$
implies
$|s^*| = o(n \min\{1, \beta_{min}^*\}^2)$,
and thus in view of Theorem 3 we have
$P(\min_{w \in M : w \not\supseteq s^*} GIC(w) \le GIC(s^*)) \to 0$. □

5. Selection Consistency of SS Procedure

In this section, we combine the results of the two previous sections to establish consistency of the two-step SS procedure. It consists of constructing a nested family of models $M$ using the magnitudes of the Lasso coefficients and then finding the minimizer of GIC over this family. As $M$ is data dependent, to establish consistency of the procedure we use Corollaries 2 and 3, in which the minimizer of GIC is considered over all subsets and supersets of $s^*$.
SS (Screening and Selection) procedure is defined as follows:
  • Choose some $\lambda > 0$.
  • Find $\hat{\beta}_L = \arg\min_{b \in \mathbb{R}^{p_n}} R_n(b) + \lambda ||b||_1$.
  • Find $\hat{s}_L = supp\, \hat{\beta}_L = \{j_1, \dots, j_k\}$, where $j_1, \dots, j_k \in \{1, \dots, p_n\}$ are such that $|\hat{\beta}_{L,j_1}| \ge \dots \ge |\hat{\beta}_{L,j_k}| > 0$.
  • Define $M_{SS} = \{\emptyset, \{j_1\}, \{j_1, j_2\}, \dots, \{j_1, j_2, \dots, j_k\}\}$.
  • Find $\hat{s}^* = \arg\min_{w \in M_{SS}} GIC(w)$.
The SS procedure is a modification of the SOS procedure in [17] designed for linear models. Since the separate ordering step considered in [17] is omitted in the proposed modification, we abbreviate the name to SS.
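The steps above can be sketched in code. The following is a minimal illustration with quadratic loss, so that both the Lasso step and the GIC refits reduce to least squares; the solver, data and constants are our own assumptions, not the implementation used in the paper.

```python
import numpy as np

def lasso(X, y, lam, n_iter=500):
    """Lasso via proximal gradient (ISTA) for quadratic loss:
    minimizes (1/(2n))||y - Xb||^2 + lam*||b||_1."""
    n, p = X.shape
    step = n / np.linalg.norm(X, 2) ** 2       # 1 / Lipschitz constant of the gradient
    b = np.zeros(p)
    for _ in range(n_iter):
        z = b - step * X.T @ (X @ b - y) / n
        b = np.sign(z) * np.maximum(np.abs(z) - lam * step, 0.0)
    return b

def gic(X, y, w, a_n):
    """n*R_n(beta_hat(w)) + a_n*|w| with quadratic loss; beta_hat(w) is a refit on w."""
    if len(w) == 0:
        rss = float(y @ y)
    else:
        bh, *_ = np.linalg.lstsq(X[:, w], y, rcond=None)
        r = y - X[:, w] @ bh
        rss = float(r @ r)
    return 0.5 * rss + a_n * len(w)

def ss_procedure(X, y, lam, a_n):
    """Screening-Selection: order predictors by |Lasso coefficient|,
    then minimize GIC over the induced nested family M_SS."""
    b = lasso(X, y, lam)
    order = [int(j) for j in np.argsort(-np.abs(b)) if b[j] != 0]
    family = [order[:k] for k in range(len(order) + 1)]    # nested family M_SS
    return min(family, key=lambda w: gic(X, y, w, a_n))

rng = np.random.default_rng(1)
n, p = 300, 60
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[[4, 17, 33]] = 1.5                                    # s* = {4, 17, 33}
y = X @ beta + 0.5 * rng.standard_normal(n)
s_hat = ss_procedure(X, y, lam=0.15, a_n=np.log(n))        # BIC-type penalty
print(sorted(s_hat))
```

Under the separation and sparsity conditions discussed above, the GIC minimum over the nested family is attained at the true support.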
Corollary 4 and Remark 2 describe situations in which the SS procedure is selection consistent. In them, we use the assumptions imposed in Section 2 and Section 3 together with the assumption that $s^*$ contains no more than $k_n$ elements, where $k_n$ is some deterministic sequence of integers. Let $M_{SS}$ be the nested family constructed in Step 4 of the SS procedure.
Corollary 4.
Assume that $\rho(\cdot, y)$ is a convex, Lipschitz function with constant $L > 0$, $X_{ij} \sim Subg(\sigma_{jn}^2)$ and $\beta^*$ exists and is unique. If $k_n \in \mathbb{N}_+$ is some sequence, margin Condition (MC) is satisfied for some $\vartheta, \delta, \varepsilon > 0$, condition $C_\epsilon(w)$ holds for some $\epsilon, \theta > 0$ and for every $w \subseteq \{1, \dots, p_n\}$ such that $|w| \le k_n$, and the following conditions are fulfilled:
  • $|s^*| \le k_n$,
  • $P(\forall w \in M_{SS} : |w| \le k_n) \to 1$,
  • $\liminf_n \kappa_H(\varepsilon) > 0$ for some $\varepsilon > 0$, where $H$ is a non-negative definite matrix and $\kappa_H(\varepsilon)$ is defined in Equation (12),
  • $\log(p_n) = o(n \lambda^2)$,
  • $k_n \lambda = o(\min\{\beta_{min}^*, 1\})$,
  • $k_n \log p_n = o(n)$,
  • $k_n \log p_n = o(a_n)$,
  • $a_n k_n = o(n \min\{\beta_{min}^*, 1\}^2)$,
then for the SS procedure we have
$P(\hat{s}^* = s^*) \to 1$.
Proof. 
In view of Corollary 1, following from the separation property in Equation (22), we obtain $P(s^* \in M_{SS}) \to 1$. Let:
$A_1 = \{\min_{w \in M_{SS} : w \supsetneq s^*, |w| \le k_n} GIC(w) \le GIC(s^*)\}$, $A_2 = \{\min_{w \in M_{SS} : w \supsetneq s^*, |w| > k_n} GIC(w) \le GIC(s^*)\}$, $B = \{\forall w \in M_{SS} : |w| \le k_n\}$.
Then, from the fact that $A_2 \cap B = \emptyset$, the union inequality and Corollary 2, we have:
$P(\min_{w \in M_{SS} : w \supsetneq s^*} GIC(w) \le GIC(s^*)) = P(A_1 \cup A_2) = P(A_1 \cup (A_2 \cap B^c)) \le P(A_1) + P(B^c) \to 0$.
In an analogous way, using $|s^*| \le k_n$ and Corollary 3 yields:
$P(\min_{w \in M_{SS} : w \not\supseteq s^*} GIC(w) \le GIC(s^*)) \to 0$.
Now, observe that in view of the definition of $\hat{s}^*$ and the union inequality:
$P(\hat{s}^* = s^*) = P(\min_{w \in M_{SS} : w \ne s^*} GIC(w) > GIC(s^*)) \ge 1 - P(\min_{w \in M_{SS} : w \supsetneq s^*} GIC(w) \le GIC(s^*)) - P(\min_{w \in M_{SS} : w \not\supseteq s^*} GIC(w) \le GIC(s^*))$.
Thus, P ( s ^ * = s * ) 1 in view of the above inequality and Equations (30) and (31). □

5.1. Case of Misspecified Semi-Parametric Model

Consider now the important case of the misspecified semi-parametric model defined in Equation (5), for which the function $\tilde{q}$ is unknown and may be arbitrary. An interesting question is whether information about $\beta$ can be recovered when misspecification occurs. The answer is positive under some additional assumptions on the distribution of random predictors. Assume additionally that $X$ satisfies
$E(X | \beta^T X) = u_0 + u \beta^T X$,
where $\beta$ is the true parameter. Thus, the regressions of $X$ given $\beta^T X$ have to be linear. We stress that the conditioning on $\beta^T X$ involves only the true $\beta$ in Equation (5). Then, it is known (cf. [5,10,11]) that $\beta^* = \eta \beta$ and $\eta \ne 0$ if $Cov(Y, X) \ne 0$. Note that, because $\beta$ and $\beta^*$ are collinear and $\eta \ne 0$, it follows that $s = s^*$. This is important in practical applications, as it shows that the position of the optimal separating direction given by $\beta$ can be consistently recovered. It is also worth mentioning that if Equation (32) is satisfied the direction of $\beta$ coincides with the direction of the first canonical vector. We refer to the work of Kubkowski and Mielniczuk [7] for the proof and to the work of Kubkowski and Mielniczuk [6] for discussion and up-to-date references on this problem. The linear regressions condition in Equation (32) is satisfied, e.g., by elliptically contoured distributions, in particular by the multivariate normal. We note that it is proved in [18] that Equation (32) approximately holds for the majority of $\beta$. When Equation (32) holds exactly, the proportionality constant $\eta$ can be calculated numerically for known $\tilde{q}$ and $\beta$. We can thus state the following result provided Equation (32) is satisfied.
Corollary 5.
Assume that Equation (32) and the assumptions of Corollary 4 are satisfied and that, moreover, $Cov(Y, X) \ne 0$. Then, $P(\hat{s}^* = s) \to 1$.
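The collinearity $\beta^* = \eta \beta$ under the linear regressions condition can be observed in a simple Monte Carlo experiment. The sketch below is our own illustration: it uses quadratic loss, for which the projection $\beta^*$ is just the least-squares coefficient vector; $X$ is Gaussian, so Equation (32) holds, while the response function $q_L(x^3)$ makes the fitted linear model misspecified.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 20000, 5
X = rng.standard_normal((n, p))              # Gaussian X satisfies Equation (32)
lin = X[:, 0] + X[:, 1]                      # true beta = (1, 1, 0, 0, 0)
prob = 1.0 / (1.0 + np.exp(-lin ** 3))       # misspecified response q_L(x^3)
y = (rng.random(n) < prob).astype(float)

# Projection beta* for quadratic loss: least-squares fit of (centered) y on X.
b, *_ = np.linalg.lstsq(X, y - y.mean(), rcond=None)
print(np.round(b, 2))                        # approximately eta*(1, 1, 0, 0, 0), eta > 0
```

The first two coefficients come out nearly equal and the remaining ones near zero, so the support and direction of $\beta$ are recovered even though the link function is wrong.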
Remark 2.
If $p_n = O(e^{c n^\gamma})$ for some $c > 0$, $\gamma \in (0, 1/2)$, $\xi \in (0, 1/2 - \gamma)$, $u \in (0, 1/2 - \gamma - \xi)$, $k_n = O(n^\xi)$, $\lambda = C_n \sqrt{\log(p_n)/n}$, $C_n = O(n^u)$, $C_n \to +\infty$, $n^{-\gamma/2} = O(\beta_{min}^*)$ and $a_n = d n^{1/2 - u}$ for some $d > 0$, then the assumptions imposed on the asymptotic behavior of the parameters in Corollary 4 are satisfied.
Note that $p_n$ is allowed to grow exponentially: $\log p_n = O(n^\gamma)$; however, $\beta_{min}^*$ may not decrease to 0 too quickly with regard to the growth of $p_n$: $n^{-\gamma/2} = O(\beta_{min}^*)$.
Remark 3.
We note that, to apply Corollary 4 to the two-step procedure based on Lasso, it is required that $|s^*| \le k_n$ and that the support of the Lasso estimator contains, with probability tending to 1, no more than $k_n$ elements. Some results bounding $|supp\, \hat{\beta}_L|$ are available for deterministic $X$ (see [31]) and for random $X$ (see [32]), but they are too weak to be useful for EBIC penalties. The other possibility to prove consistency of the two-step procedure is to modify its first step by using thresholded Lasso (see [33]) corresponding to the $\tilde{k}_n$ largest Lasso coefficients, where $\tilde{k}_n \in \mathbb{N}$ is such that $k_n = o(\tilde{k}_n)$. This is a subject of ongoing research.
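Thresholding the Lasso output to a fixed number of coefficients is straightforward; a minimal sketch follows (the names are ours, and the cut-off is a tuning parameter of the modification, not prescribed by the paper):

```python
import numpy as np

def thresholded_support(b_lasso, k_tilde):
    """Keep the k_tilde largest (in absolute value) nonzero Lasso coefficients,
    which bounds the size of the screened support deterministically."""
    nz = np.flatnonzero(b_lasso)
    order = nz[np.argsort(-np.abs(b_lasso[nz]))]
    return order[:k_tilde].tolist()

b = np.array([0.0, 1.2, -0.1, 0.0, 0.8, 0.05])
print(thresholded_support(b, 2))   # -> [1, 4]
```

This deterministic bound sidesteps the need for probabilistic control of $|supp\, \hat{\beta}_L|$.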

6. Numerical Experiments

6.1. Selection Procedures

We note that the original procedure is defined for a single $\lambda$ only. In the simulations discussed below, we implemented modifications of the SS procedure introduced in Section 5. In practice, it is generally more convenient to consider in the first step some sequence of penalty parameters $\lambda_1 > \dots > \lambda_m > 0$ instead of only one $\lambda$, in order to avoid choosing the "best" $\lambda$. For a fixed sequence $\lambda_1, \dots, \lambda_m$, we construct the corresponding families $M_1, \dots, M_m$ analogously to $M$ in Step 4 of the SS procedure. Thus, we arrive at the following SSnet procedure, which is a modification of the SOSnet procedure in [17]. Below, $\tilde{b}$ is the vector $b$ with the first coordinate, corresponding to the intercept, omitted: $b = (b_0, \tilde{b}^T)^T$.
  • Choose some $\lambda_1 > \dots > \lambda_m > 0$.
  • Find $\hat{\beta}_L^{(i)} = \arg\min_{b \in \mathbb{R}^{p_n + 1}} R_n(b) + \lambda_i ||\tilde{b}||_1$ for $i = 1, \dots, m$.
  • Find $\hat{s}_L^{(i)} = supp\, \tilde{\hat{\beta}}_L^{(i)} = \{j_1^{(i)}, \dots, j_{k_i}^{(i)}\}$, where $j_1^{(i)}, \dots, j_{k_i}^{(i)}$ are such that $|\hat{\beta}_{L, j_1^{(i)}}^{(i)}| \ge \dots \ge |\hat{\beta}_{L, j_{k_i}^{(i)}}^{(i)}| > 0$, for $i = 1, \dots, m$.
  • Define $M_i = \{\{j_1^{(i)}\}, \{j_1^{(i)}, j_2^{(i)}\}, \dots, \{j_1^{(i)}, j_2^{(i)}, \dots, j_{k_i}^{(i)}\}\}$ for $i = 1, \dots, m$.
  • Define $M = \{\emptyset\} \cup \bigcup_{i=1}^m M_i$.
  • Find $\hat{s}^* = \arg\min_{w \in M} GIC(w)$, where
    $GIC(w) = \min_{b \in \mathbb{R}^{p_n + 1} : supp\, \tilde{b} \subseteq w} n R_n(b) + a_n (|w| + 1)$.
Instead of constructing families $M_i$ for each $\lambda_i$ as in the SSnet procedure, $\lambda$ can be chosen by cross-validation using the 1SE rule (see [34]), and then the SS procedure is applied for this $\lambda$. We call this procedure SSCV. The last procedure considered was introduced by Fan and Tang [35] and is the Lasso procedure with the penalty parameter $\hat{\lambda}$ chosen in a data-dependent way analogously to SSCV. Namely, it is the minimizer of the GIC criterion with $a_n = \log(\log n) \cdot \log p_n$, in which the ML estimator has been replaced by the Lasso estimator with penalty $\lambda$. Once $\hat{\beta}_L(\hat{\lambda}_L)$ is calculated, $\hat{s}^*$ is defined as its support. The procedure is called LFT in the sequel.
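The construction of the family $M$ in SSnet can be sketched as follows (our own illustration; a real implementation would take the coefficient paths from a Lasso solver such as those listed below):

```python
import numpy as np

def nested_family(b):
    """Nested supports ordered by decreasing |coefficient|, as in Step 4 of SS."""
    nz = np.flatnonzero(b)
    order = nz[np.argsort(-np.abs(b[nz]))]
    return [tuple(sorted(order[:k])) for k in range(1, len(order) + 1)]

def ssnet_family(coef_paths):
    """M = {empty set} united with the nested families M_i over the lambda grid."""
    M = {()}
    for b in coef_paths:
        M.update(nested_family(b))
    return M

# Toy coefficient vectors for three decreasing lambdas.
paths = [np.array([0.0, 0.9, 0.0, 0.0]),
         np.array([0.0, 1.1, 0.4, 0.0]),
         np.array([0.2, 1.2, 0.5, 0.0])]
print(sorted(ssnet_family(paths)))   # -> [(), (0, 1, 2), (1,), (1, 2)]
```

GIC is then minimized over this union, exactly as in the last step of SSnet.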
We list below versions of the above procedures along with R packages that were used to choose sequence λ 1 , , λ m and computation of Lasso estimator. The following packages were chosen based on selection performance after initial tests for each loss and procedure:
  • SSnet with logistic or quadratic loss: ncvreg;
  • SSCV or LFT with logistic or quadratic loss: glmnet; and
  • SSnet, SSCV or LFT with Huber loss (cf. [12]): hqreg.
The following functions were used to optimize R n in GIC minimization step for each loss:
  • logistic loss: glm.fit (package stats);
  • quadratic loss: .lm.fit (package stats); and
  • Huber loss: rlm (package MASS).
Before applying the investigated procedures, each column of the matrix $X = (X_1, \dots, X_n)^T$ was standardized, as the Lasso estimator $\hat{\beta}_L$ depends on the scaling of predictors. We set the length of the $\lambda_i$ sequence to $m = 20$. Moreover, in all procedures we considered only those $\lambda_i$ for which $|\hat{s}_L^{(i)}| \le n$ because, when $|\hat{s}_L^{(i)}| > n$, the Lasso and ML solutions are not unique (see [32,36]). For Huber loss, we set the parameter $\delta = 1/10$ (see [12]). The number of folds in SSCV was set to $K = 10$.
Each simulation run consisted of $L$ repetitions, during which samples $\mathbb{X}_k = (X_1^{(k)}, \dots, X_n^{(k)})^T$ and $Y_k = (Y_1^{(k)}, \dots, Y_n^{(k)})^T$ were generated for $k = 1, \dots, L$. For the $k$th sample $(\mathbb{X}_k, Y_k)$, the estimator $\hat{s}_k^*$ of the set of active predictors was obtained by a given procedure as the support of $\tilde{\hat{\beta}}(\hat{s}_k^*)$, where
$\hat{\beta}(\hat{s}_k^*) = (\hat{\beta}_0(\hat{s}_k^*), \tilde{\hat{\beta}}(\hat{s}_k^*)^T)^T = \arg\min_{b \in \mathbb{R}^{p_n + 1} : supp\, \tilde{b} \subseteq \hat{s}_k^*} \frac{1}{n} \sum_{i=1}^n \rho(b^T X_i^{(k)}, Y_i^{(k)})$
is the ML estimator for the $k$th sample. We denote by $M^{(k)}$ the family $M$ obtained by a given procedure for the $k$th sample.
In our numerical experiments we have computed the following measures of selection performance which gauge co-direction of true parameter β and β ^ and the interplay between s * and s ^ * :
  • $ANGLE = \frac{1}{L} \sum_{k=1}^L \arccos |\cos(\tilde{\beta}, \tilde{\hat{\beta}}(\hat{s}_k^*))|$, where
    $\cos(\tilde{\beta}, \tilde{\hat{\beta}}(\hat{s}_k^*)) = \frac{\sum_{j=1}^{p_n} \beta_j \hat{\beta}_j(\hat{s}_k^*)}{||\tilde{\beta}||_2\, ||\tilde{\hat{\beta}}(\hat{s}_k^*)||_2}$
    and we let $\cos(\tilde{\beta}, \tilde{\hat{\beta}}(\hat{s}_k^*)) = 0$ if $||\tilde{\beta}||_2\, ||\tilde{\hat{\beta}}(\hat{s}_k^*)||_2 = 0$,
  • $P_{inc} = \frac{1}{L} \sum_{k=1}^L I(s^* \in M^{(k)})$,
  • $P_{equal} = \frac{1}{L} \sum_{k=1}^L I(\hat{s}_k^* = s^*)$,
  • $P_{supset} = \frac{1}{L} \sum_{k=1}^L I(\hat{s}_k^* \supseteq s^*)$.
Thus, $ANGLE$ is the angle between the true parameter (with intercept omitted) and its post-model-selection estimator, averaged over simulations; $P_{inc}$ is the fraction of simulations for which the family $M^{(k)}$ contains the true model $s^*$; and $P_{equal}$ and $P_{supset}$ are the fractions of simulations in which a given procedure chooses the true model or its superset, respectively.
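These measures are straightforward to compute; a small self-contained sketch (the names are ours):

```python
import numpy as np

def angle_measure(beta_true, beta_hats):
    """ANGLE: average arccos|cos| between the true coefficient vector and estimates."""
    angles = []
    for bh in beta_hats:
        denom = np.linalg.norm(beta_true) * np.linalg.norm(bh)
        c = 0.0 if denom == 0 else float(beta_true @ bh) / denom
        angles.append(np.arccos(abs(np.clip(c, -1.0, 1.0))))
    return float(np.mean(angles))

def selection_rates(s_true, s_hats):
    """P_equal and P_supset over L simulation runs."""
    s_true = set(s_true)
    p_equal = float(np.mean([set(s) == s_true for s in s_hats]))
    p_supset = float(np.mean([set(s) >= s_true for s in s_hats]))
    return p_equal, p_supset

beta = np.array([1.0, 1.0, 0.0])
print(angle_measure(beta, [beta, -2 * beta]))        # collinear estimates -> ANGLE near 0
print(selection_rates({0, 1}, [{0, 1}, {0, 1, 2}]))  # -> (0.5, 1.0)
```

Note that taking the absolute value of the cosine makes ANGLE insensitive to the sign of $\eta$ in the collinearity relation $\beta^* = \eta \beta$.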

6.2. Regression Models Considered

To investigate the behavior of the two-step procedure under misspecification, we considered two similar models with different sets of predictors. As the sets of predictors differ, this results in correct specification of the first model (Model M1) and misspecification of the second (Model M2).
Namely, in Model M1, we generated $n$ observations $(X_i, Y_i) \in \mathbb{R}^{p+1} \times \{0, 1\}$ for $i = 1, \dots, n$ such that:
$X_{i0} = 1$, $X_{i1} = Z_{i1}$, $X_{i2} = Z_{i2}$, $X_{ij} = Z_{i,j-7}$ for $j = 10, \dots, p$, $X_{i3} = X_{i1}^2$, $X_{i4} = X_{i2}^2$, $X_{i5} = X_{i1} X_{i2}$, $X_{i6} = X_{i1}^2 X_{i2}$, $X_{i7} = X_{i1} X_{i2}^2$, $X_{i8} = X_{i1}^3$, $X_{i9} = X_{i2}^3$,
where $Z_i = (Z_{i1}, \dots, Z_{ip})^T \sim N_p(0_p, \Sigma)$, $\Sigma = [\rho^{|i-j|}]_{i,j = 1, \dots, p}$ and $\rho \in (-1, 1)$. We consider the response function $q(x) = q_L(x^3)$ for $x \in \mathbb{R}$, $s = \{1, 2\}$ and $\beta_s = (1, 1)^T$. Thus,
$P(Y_i = 1 | X_i = x_i) = q(\beta_s^T x_{i,s}) = q(x_{i1} + x_{i2}) = q_L((x_{i1} + x_{i2})^3) = q_L(x_{i1}^3 + x_{i2}^3 + 3 x_{i1}^2 x_{i2} + 3 x_{i1} x_{i2}^2) = q_L(3 x_{i6} + 3 x_{i7} + x_{i8} + x_{i9})$.
We observe that the last equality implies that the above binary model is correctly specified with respect to the family of fitted logistic models and $X_6$, $X_7$, $X_8$ and $X_9$ are the four active predictors, whereas the remaining ones play no role in prediction of $Y$. Hence, $s^* = \{6, 7, 8, 9\}$ and $\beta_{s^*}^* = (3, 3, 1, 1)^T$ are, respectively, the set of indices of active predictors and the non-zero coefficients of the projection onto the family of logistic models.
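The design of Model M1 can be generated as follows. This is our own sketch (the paper's experiments used R packages; the helper names here are ours), with column index $j$ of the array matching the subscript of $X_{ij}$:

```python
import numpy as np

def gen_model_m1(n, p, rho, rng):
    """Model M1: Z ~ N_p(0, Sigma), Sigma_jk = rho^|j-k|; columns 3..9 of X
    are monomials in X_1 = Z_1 and X_2 = Z_2; response q(x) = q_L(x^3)."""
    idx = np.arange(p)
    Sigma = rho ** np.abs(idx[:, None] - idx[None, :])
    Z = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    X = np.empty((n, p + 1))
    X[:, 0] = 1.0                              # intercept X_0
    x1, x2 = Z[:, 0], Z[:, 1]
    X[:, 1], X[:, 2] = x1, x2
    X[:, 3], X[:, 4], X[:, 5] = x1 ** 2, x2 ** 2, x1 * x2
    X[:, 6], X[:, 7] = x1 ** 2 * x2, x1 * x2 ** 2
    X[:, 8], X[:, 9] = x1 ** 3, x2 ** 3
    X[:, 10:] = Z[:, 2:p - 7]                  # X_j = Z_{j-7} for j = 10, ..., p
    q = 1.0 / (1.0 + np.exp(-(x1 + x2) ** 3))  # q_L((x1 + x2)^3)
    Y = (rng.random(n) < q).astype(int)
    return X, Y

X, Y = gen_model_m1(n=500, p=150, rho=0.3, rng=np.random.default_rng(3))
print(X.shape, Y.shape)                        # (500, 151) (500,)
```

The strong built-in dependence between, e.g., $X_1$ and $X_8 = X_1^3$ (correlation $3/\sqrt{15} \approx 0.77$) is visible directly in the generated columns.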
We considered the following parameters in the numerical experiments: $n = 500$, $p = 150$, $\rho \in \{-0.9 + 0.15 k : k = 0, 1, \dots, 12\}$, and $L = 500$ (the number of generated datasets for each combination of parameters). We investigated the procedures SSnet, SSCV, and LFT using logistic, quadratic, and Huber (cf. [12]) loss functions. For the procedures SSnet and SSCV, we used GIC penalties with:
  • a n = log n (BIC); and
  • a n = log n + 2 log p n (EBIC1).
In Model M2, we generated $n$ observations $(X_i, Y_i) \in \mathbb{R}^{p+1} \times \{0, 1\}$ for $i = 1, \dots, n$ such that $X_i = (X_{i0}, X_{i1}, \dots, X_{ip})^T$ and $(X_{i1}, \dots, X_{ip})^T \sim N_p(0_p, \Sigma)$, $\Sigma = [\rho^{|i-j|}]_{i,j = 1, \dots, p}$ and $\rho \in (-1, 1)$. The response function is $q(x) = q_L(x^3)$ for $x \in \mathbb{R}$, $s = \{1, 2\}$ and $\beta_s = (1, 1)^T$. This means that:
$P(Y_i = 1 | X_i = x_i) = q(\beta_s^T x_{i,s}) = q(x_{i1} + x_{i2}) = q_L((x_{i1} + x_{i2})^3)$.
This model, in comparison to Model M1, does not contain monomials of $X_{i1}$ and $X_{i2}$ of degree higher than 1 in its set of predictors. We observe that this binary model is misspecified with respect to the fitted family of logistic models, because $q(x_{i1} + x_{i2}) \ne q_L(\beta^T x_i)$ for any $\beta \in \mathbb{R}^{p+1}$. However, in this case, the linear regressions condition in Equation (32) is satisfied for $X$, as it follows a normal distribution (see [5,7]). Hence, in view of Proposition 3.8 in [6], we have $s_{log}^* = \{1, 2\}$ and $\beta_{log, s_{log}^*}^* = \eta (1, 1)^T$ for some $\eta > 0$. The parameters $n$, $p$, $\rho$ as well as $L$ were chosen as for Model M1.

6.3. Results for Models M1 and M2

We first discuss the behavior of $P_{inc}$, $P_{equal}$ and $P_{supset}$ for the considered procedures. We observe that the values of $P_{inc}$ for SSCV and SSnet are close to 1 for low correlations in Model M2 for every tested loss (see Figure 1). In Model M1, $P_{inc}$ attains the largest values for the SSnet procedure and logistic loss for low correlations, which is because in most cases the corresponding family $M$ is the largest among the families created by the considered procedures. $P_{inc}$ is close to 0 in Model M1 for quadratic and Huber loss, which results in low values of the remaining indices. This may be due to strong dependences between predictors in Model M1; note that we have, e.g., $Cor(X_{i1}, X_{i8}) = 3/\sqrt{15} \approx 0.77$. It is seen that in Model M1 the inclusion probability $P_{inc}$ is much lower than in Model M2 (except for negative correlations). It is also seen that $P_{inc}$ for SSCV is larger than for LFT, and that LFT fails with respect to $P_{inc}$ in M1.
In Model M1, the largest values of $P_{equal}$ are attained for SSnet with BIC penalty; the second best is SSCV with EBIC1 penalty (see Figure 2). In Model M2, $P_{equal}$ is close to 1 for SSnet and SSCV with EBIC1 penalty and is much larger than $P_{equal}$ for the corresponding versions using BIC penalty. We also note that the choice of loss is relevant only for larger correlations. These results confirm the theoretical result of Theorem 2.1 in [5], which shows that collinearity holds for a broad class of loss functions. We also observe that, although in Model M2 the remaining procedures do not select $s^*$ with high probability, they select its superset, as indicated by the values of $P_{supset}$ (see Figure 3). This analysis is confirmed by an analysis of the $ANGLE$ measure (see Figure 4), which attains values close to 0 when $P_{supset}$ is close to 1. Low values of the $ANGLE$ measure mean that the estimated vector $\tilde{\hat{\beta}}(\hat{s}_k^*)$ is approximately proportional to $\tilde{\beta}$, which is the case for Model M2, where normal predictors satisfy the linear regressions condition. Note that the angles between $\tilde{\hat{\beta}}(\hat{s}_k^*)$ and $\tilde{\beta}^*$ in Model M1 differ significantly even though Model M1 is well specified. In addition, for the best performing procedures in both models and any loss considered, $P_{equal}$ is much larger in Model M2 than in Model M1, even though the latter is correctly specified. This shows that choosing a simple misspecified model which retains crucial characteristics of the well specified large model instead of the latter might be beneficial.
In Model M1, procedures with BIC penalty perform better than those with EBIC1 penalty; however, the gain for P e q u a l is much smaller than the gain when using EBIC1 in Model M2. LFT procedure performs poorly in Model M1 and reasonably well in Model M2. The overall winner in both models is SSnet. SSCV performs only slightly worse than SSnet in Model M2 but performs significantly worse in Model M1.
An analysis of the computing times of the first and second stages of each procedure shows that the SSnet procedure creates large families $M$, and GIC minimization becomes computationally intensive. We also observe that the first stage for SSCV is more time consuming than for SSnet, which is caused by the multiple fitting of Lasso in cross-validation. However, SSCV is much faster than SSnet in the second stage.
We conclude that in the considered experiments SSnet with EBIC1 penalty works the best in most cases; however, even for the winning procedure, strong dependence of predictors results in deterioration of its performance. It is also clear from our experiments that a choice of GIC penalty is crucial for its performance. Modification of SS procedure which would perform satisfactorily for large correlations is still an open problem.

7. Discussion

In the paper, we study the problem of selecting the set of active variables in a binary regression model when the number of all predictors $p$ is much larger than the number of observations $n$ and the active predictors are sparse among all predictors, i.e., their number is significantly smaller than $p$. We consider a general binary model and a fit based on minimization of the empirical risk corresponding to a general loss function. This scenario encompasses the case, common in practice, when the underlying semi-parametric model is misspecified, i.e., the assumed response function is different from the true one. For random predictors, we show that in such a case the two-step procedure based on Lasso consistently estimates the support of the pseudo-true vector $\beta^*$. Under the linear regressions condition and a semi-parametric model, this implies consistent recovery of a subset of active predictors. This partly explains why selection procedures perform satisfactorily even when the fitted model is wrong. We show that, by using the two-step procedure, we can successfully reduce the dimension of the model chosen by Lasso. Moreover, for the two-step procedure in the case of random predictors, we do not require the restrictive conditions on the experimental matrix needed for Lasso support consistency for deterministic predictors, such as the irrepresentable condition. Our experiments show satisfactory behavior of the proposed SSnet procedure with EBIC1 penalty.
Future research directions include studying the performance of the SS procedure without the subgaussianity assumption and, of practical importance, an automatic choice of the penalty in the GIC criterion. Moreover, finding a modification of the SS procedure that performs satisfactorily for large correlations remains an open challenge. It would also be of interest to find conditions weaker than Equation (32) which would lead to collinearity of $\beta$ and $\beta^*$ (see [18] for a different angle on this problem).

Author Contributions

Both authors contributed equally to this work. All authors have read and agreed to the published version of the manuscript.

Funding

The research of the second author was partially supported by Polish National Science Center grant 2015/17/B/ST6/01878.

Acknowledgments

The comments by the two referees, which helped to improve presentation of the original version of the manuscript, are gratefully acknowledged.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Proof of Lemma 1:
Proof. 
Observe first that the function $R_n$ is convex as $\rho$ is convex. Moreover, from the definition of $\hat{\beta}_L$, we get the inequality:
$W_n(\hat{\beta}_L) = R_n(\hat{\beta}_L) - R_n(\beta^*) \le \lambda (||\beta^*||_1 - ||\hat{\beta}_L||_1)$.
Note that $v - \beta^* \in B_1(r)$, as we have:
$||v - \beta^*||_1 = \frac{||\hat{\beta}_L - \beta^*||_1}{r + ||\hat{\beta}_L - \beta^*||_1} \cdot r \le r$.
By the definition of $W_n$, convexity of $R_n$, Equation (A2) and the definition of $S$, we have:
$W(v) = W(v) - W_n(v) + R_n(v) - R_n(\beta^*) \le W(v) - W_n(v) + u (R_n(\hat{\beta}_L) - R_n(\beta^*)) \le S(r) + u W_n(\hat{\beta}_L)$.
From the convexity of the $l_1$ norm, Equations (A1) and (A3), the equality $||\beta^*||_1 = ||\beta_{s^*}^*||_1$, and the triangle inequality, it follows that:
$W(v) + \lambda ||v||_1 \le W(v) + \lambda u ||\hat{\beta}_L||_1 + \lambda (1 - u) ||\beta^*||_1 \le S(r) + u W_n(\hat{\beta}_L) + u \lambda (||\hat{\beta}_L||_1 - ||\beta^*||_1) + \lambda ||\beta^*||_1 \le S(r) + \lambda ||\beta^*||_1 \le S(r) + \lambda ||\beta^* - v_{s^*}||_1 + \lambda ||v_{s^*}||_1$.
Hence,
$W(v) + \lambda ||v - \beta^*||_1 = (W(v) + \lambda ||v||_1) + \lambda (||v - \beta^*||_1 - ||v||_1) \le S(r) + \lambda ||\beta^* - v_{s^*}||_1 + \lambda ||v_{s^*}||_1 + \lambda (||v - \beta^*||_1 - ||v||_1) = S(r) + 2 \lambda ||\beta^* - v_{s^*}||_1$.
 □
We prove now Lemma A1 needed in the proof of Lemma 2 below.
Lemma A1.
Assume that $S \sim Subg(\sigma^2)$ and $T$ is a random variable such that $|T| \le M$, where $M$ is some positive constant, and that $S$ and $T$ are independent. Then, $ST \sim Subg(M^2 \sigma^2)$.
Proof. 
Observe that:
$E e^{t S T} = E(E(e^{t S T} | T)) \le E e^{\frac{t^2 T^2 \sigma^2}{2}} \le e^{\frac{t^2 M^2 \sigma^2}{2}}$.
 □
Proof of Lemma 2.
Proof. 
From the Chebyshev inequality (first inequality below), the symmetrization inequality (see Lemma 2.3.1 of [29]) and the Talagrand–Ledoux inequality ([30], Theorem 4.12), we have for $t > 0$ and $(\varepsilon_i)_{i = 1, \dots, n}$ being Rademacher variables independent of $(X_i)_{i = 1, \dots, n}$:
$P(S(r) > t) \le \frac{E S(r)}{t} \le \frac{2}{t n} E \sup_{b \in \mathbb{R}^{p_n} : b - \beta^* \in B_1(r)} \left| \sum_{i=1}^n \varepsilon_i (\rho(X_i^T b, Y_i) - \rho(X_i^T \beta^*, Y_i)) \right| \le \frac{4 L}{t n} E \sup_{b \in \mathbb{R}^{p_n} : b - \beta^* \in B_1(r)} \left| \sum_{i=1}^n \varepsilon_i X_i^T (b - \beta^*) \right|$.
We observe that $\varepsilon_i X_{ij} \sim Subg(\sigma_{jn}^2)$ in view of Lemma A1. Hence, using independence, we obtain $\sum_{i=1}^n \varepsilon_i X_{ij} \sim Subg(n \sigma_{jn}^2)$ and thus $\sum_{i=1}^n \varepsilon_i X_{ij} \sim Subg(n s_n^2)$. Applying the Hölder inequality and the following inequality (see Lemma 2.2 of [37]):
$E \max_{1 \le j \le p_n} \left| \sum_{i=1}^n \varepsilon_i X_{ij} \right| \le \sqrt{n s_n^2 \ln(2 p_n)} \le 2 s_n \sqrt{n \ln(p_n \vee 2)}$,
we have:
$\frac{4 L}{t n} E \sup_{b \in \mathbb{R}^{p_n} : b - \beta^* \in B_1(r)} \left| \sum_{i=1}^n \varepsilon_i X_i^T (b - \beta^*) \right| \le \frac{4 L r}{t} E \max_{1 \le j \le p_n} \left| \frac{1}{n} \sum_{i=1}^n \varepsilon_i X_{ij} \right| \le \frac{8 L r s_n \sqrt{\log(p_n \vee 2)}}{t \sqrt{n}}$.
From this, Part 1 follows. In the proofs of Parts 2 and 3, the first inequalities are the same as in Equation (A5), with the suprema taken over the corresponding sets. Using the Cauchy–Schwarz inequality, the inequality $||v||_2 \le \sqrt{|supp\, v|}\, ||v||_\infty$, the inequality $||v_\pi||_\infty \le ||v||_\infty$ for $\pi \subseteq \{1, \dots, p_n\}$, and Equation (A6) yields:
$P(S_1(r) \ge t) \le \frac{4 L}{n t} E \sup_{b \in D_1 : b - \beta^* \in B_2(r)} \left| \sum_{i=1}^n \varepsilon_i X_i^T (b - \beta^*) \right| \le \frac{4 L r}{n t} E \max_{\pi \subseteq \{1, \dots, p_n\}, |\pi| \le k_n} \left\| \sum_{i=1}^n \varepsilon_i X_{i,\pi} \right\|_2 \le \frac{4 L r}{n t} E \max_{\pi \subseteq \{1, \dots, p_n\}, |\pi| \le k_n} \sqrt{|\pi|} \left\| \sum_{i=1}^n \varepsilon_i X_{i,\pi} \right\|_\infty \le \frac{4 L r \sqrt{k_n}}{n t} E \left\| \sum_{i=1}^n \varepsilon_i X_i \right\|_\infty \le \frac{8 L r \sqrt{k_n} s_n \sqrt{\ln(p_n \vee 2)}}{t \sqrt{n}}$.
Similarly for $S_2(r)$, using the Cauchy–Schwarz inequality, the inequality $||v_\pi||_2 \le ||v_{s^*}||_2$, which is valid for $\pi \subseteq s^*$, the definition of the $l_2$ norm and the inequality $E|Z| \le \sqrt{E Z^2} \le \sigma$ for $Z \sim Subg(\sigma^2)$, we obtain:
$P(S_2(r) \ge t) \le \frac{4 L}{n t} E \sup_{b \in D_2 : b - \beta^* \in B_2(r)} \left| \sum_{i=1}^n \varepsilon_i X_i^T (b - \beta^*) \right| \le \frac{4 L r}{n t} E \max_{\pi \subseteq s^*} \left\| \sum_{i=1}^n \varepsilon_i X_{i,\pi} \right\|_2 \le \frac{4 L r}{n t} E \left\| \sum_{i=1}^n \varepsilon_i X_{i,s^*} \right\|_2 \le \frac{4 L r}{n t} \sqrt{E \left\| \sum_{i=1}^n \varepsilon_i X_{i,s^*} \right\|_2^2} = \frac{4 L r}{n t} \sqrt{\sum_{j \in s^*} E \left( \sum_{i=1}^n \varepsilon_i X_{ij} \right)^2} \le \frac{4 L r}{\sqrt{n}\, t} \sqrt{|s^*|}\, s_n$.
 □
Proof of Lemma 3.
Proof. 
Let u and v be defined as in Lemma 1. Observe that | | v β * | | 1 r / 2 is equivalent to | | β ^ L β * | | 1 r , as the function f ( x ) = r x / ( x + r ) is increasing, f ( r ) = r / 2 and f ( | | β ^ L β * | | 1 ) = | | v β * | | 1 . Let C = 1 / ( 4 + ε ) . We consider two cases:
(i) | | v s * β s * * | | 1 C r .
In this case, from the basic inequality (Lemma 1), we have:
| | v β * | | 1 λ 1 ( W ( v ) + λ | | v β * | | 1 ) λ 1 S ( r ) + 2 | | v s * β s * * | | 1 C ¯ r + 2 C r = r 2 .
(ii) | | v s * β s * * | | 1 > C r .
Note that | | v s * c | | 1 < ( 1 C ) r , otherwise we would have | | v β * | | 1 > r , which contradicts Equation (A2) in proof of Lemma 1. Now, we observe that v β * C ε , as we have from definition of C and assumption for this case:
| | v s * c | | 1 < ( 1 C ) r = ( 3 + ε ) C r < ( 3 + ε ) | | v s * β s * * | | 1 .
By inequality between l 1 and l 2 norms, the definition of κ H ( ε ) , inequality c a 2 / 4 + b 2 / c a b , and margin Condition (MC) (which holds because v β * B 1 ( r ) B 1 ( δ ) in view of Equation (A2)), we conclude that:
| | v s * β s * * | | 1 | s * | | | v s * β s * * | | 2 | s * | | | v β * | | 2 | s * | ( v β * ) T H ( v β * ) κ H ( ε )
ϑ ( v β * ) T H ( v β * ) 4 λ + | s * | λ ϑ κ H ( ε ) W ( v ) 2 λ + | s * | λ ϑ κ H ( ε ) .
Hence, from the basic inequality (Lemma 1) and the inequality above, it follows that:
\[
W(v) + \lambda \|v - \beta^*\|_1 \le S(r) + 2 \lambda \|v_{s^*} - \beta^*_{s^*}\|_1 \le S(r) + W(v) + \frac{2 |s^*| \lambda^2}{\vartheta \kappa_H(\varepsilon)} .
\]
Subtracting $W(v)$ from both sides of the above inequality and using the assumption on $S$, the bound on $|s^*|$ and the definition of $\tilde{C}$ yields:
\[
\|v - \beta^*\|_1 \le \frac{S(r)}{\lambda} + \frac{2 |s^*| \lambda}{\vartheta \kappa_H(\varepsilon)} \le \bar{C} r + \frac{2 |s^*| \lambda}{\vartheta \kappa_H(\varepsilon)} \le (\bar{C} + \tilde{C}) r = \frac{r}{2} .
\]
 □
Proof of Remark 1.
Proof. 
The condition $\liminf_{n} \frac{D_n a_n}{k_n \log(2 p_n)} > 1$ is equivalent to the existence of some $u > 0$ such that, for almost all $n$, we have:
\[
D_n a_n - (1+u) k_n \log(2 p_n) > 0 .
\]
(1) We observe that, if
\[
A a_n - (1+u) k_n \log(2 p_n) > 0 ,
\]
then the above condition is satisfied. For BIC, we have:
\[
A \log n - (1+u) k_n \log(2 p_n) > 0 ,
\]
which is equivalent to the condition (1) of the Remark.
(2) We observe that using the inequalities $k_n \le C$, $2 A \gamma - (1+u) C \ge 0$ and $p_n \ge 1$ yields, for $n > 2^{(1+u)C/A}$:
\[
A (\log n + 2 \gamma \log p_n) - (1+u) k_n \log(2 p_n) \ge A (\log n + 2 \gamma \log p_n) - (1+u) C \log(2 p_n) = \big( 2 A \gamma - (1+u) C \big) \log p_n + A \log n - (1+u) C \log 2 \ge A \log n - (1+u) C \log 2 > 0 .
\]
(3) In this case, we check, similarly as in (2), that
\[
A (\log n + 2 \gamma \log p_n) - (1+u) k_n \log(2 p_n) \ge A (\log n + 2 \gamma \log p_n) - (1+u) C \log(2 p_n) = \big( 2 A \gamma - (1+u) C \big) \log p_n + A \log n - (1+u) C \log 2 > 0 .
\]
 □
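Cases (2) and (3) of the proof reduce to elementary arithmetic with logarithms, which is easy to verify numerically. The following sketch checks the case (2) inequality chain for illustrative constants chosen by us (they satisfy the assumptions of the case but are not values from the paper):

```python
import math

# Illustrative constants: k_n <= C, 2*A*gamma - (1+u)*C >= 0, p_n >= 1,
# and n > 2**((1+u)*C/A) = 2**1.5 (assumed values, not from the paper).
A, gamma, u, C = 1.0, 1.0, 0.5, 1.0
n, p_n, k_n = 100, 1000, 1

# Left-hand side: A(log n + 2*gamma*log p_n) - (1+u) k_n log(2 p_n).
lhs = A * (math.log(n) + 2 * gamma * math.log(p_n)) \
      - (1 + u) * k_n * math.log(2 * p_n)
# After bounding k_n by C and splitting log(2 p_n) = log 2 + log p_n:
middle = (2 * A * gamma - (1 + u) * C) * math.log(p_n) \
         + A * math.log(n) - (1 + u) * C * math.log(2)
# Dropping the nonnegative log p_n term:
lower = A * math.log(n) - (1 + u) * C * math.log(2)

assert lhs >= middle - 1e-12   # first inequality (uses k_n <= C)
assert middle >= lower - 1e-12 # uses 2*A*gamma - (1+u)*C >= 0, log p_n >= 0
assert lower > 0               # uses n > 2**((1+u)*C/A)
print("chain positive:", lhs > 0)
```

Running it prints `chain positive: True`, confirming each link of the chain for these values.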

Figure 1. $P_{inc}$ for Models M1 and M2.
Figure 2. $P_{equal}$ for Models M1 and M2.
Figure 3. $P_{supset}$ for Models M1 and M2.
Figure 4. $ANGLE$ for Models M1 and M2.