We consider selection of random predictors for a high-dimensional regression problem with a binary response for a general loss function. An important special case occurs when the binary model is semi-parametric and the response function is misspecified under a parametric model fit. When the true response coincides with a postulated parametric response for a certain value of the parameter, we obtain a common framework for parametric inference. Both correct specification and misspecification are covered in this contribution. Variable selection for such a scenario aims at recovering the support of the minimizer of the associated risk with large probability. We propose a two-step Screening-Selection (SS) procedure which consists of screening and ordering predictors by the Lasso method and then selecting the subset of predictors which minimizes the Generalized Information Criterion over the corresponding nested family of models. We prove consistency of the proposed selection method under conditions that allow for a much larger number of predictors than the number of observations. For the semi-parametric case, when the distribution of random predictors satisfies the linear regressions condition, the true and the estimated parameters are collinear and their common support can be consistently identified. This partly explains the robustness of selection procedures to response function misspecification.
Consider a random variable and the corresponding response function defined as the a posteriori probability . Estimation of the a posteriori probability is of paramount importance in machine learning and statistics, since many frequently applied methods, e.g., logistic or tree-based classifiers, rely on it. One of the main approaches to estimating q is the parametric one, for which the response function is assumed to have the parametric form
for some fixed and known . If Equation (1) holds, that is, the underlying structure is correctly specified, then it is known that
where is the expected value of a random variable and is Kullback–Leibler distance between the binary distributions with success probabilities and :
The equalities in Equations (2) and (3) form the theoretical underpinning of (conditional) maximum likelihood (ML) method as the expression under the expected value in Equation (2) is the conditional log-likelihood of Y given X in the parametric model. Moreover, it is a crucial property needed to show that ML estimates of under appropriate conditions approximate .
However, more often than not, the model in Equation (1) does not hold, i.e., the response q is misspecified, and ML estimators do not approximate , but rather the quantity defined by the right-hand side of Equation (3), namely
Thus, the parametric fit using the conditional ML method, which is the most popular approach to modeling binary response, also has a very intuitive geometric and information-theoretic flavor. Indeed, by fitting a parametric model we try to approximate the parameter which yields the averaged KL projection of the unknown q onto the set of parametric models . A typical situation is a semi-parametric framework in which the true response function satisfies
for some unknown and the model in Equation (1) is fitted, where . An important problem is then how in Equation (4) relates to in Equation (5). In particular, a frequently asked question is what can be said about the support of , i.e., the set , which consists of indices of predictors that truly influence Y. More specifically, the interplay between the support of and the analogously defined support of is of importance, as the latter is consistently estimated and the support of the ML estimator is frequently considered as an approximation of the set of true predictors. Variable selection, or equivalently the support recovery of in the high-dimensional setting, is one of the most intensively studied subjects in contemporary statistics and machine learning. It has many applications in bioinformatics, biology, image processing, spatiotemporal analysis, and other research areas (see [2,3,4]). It is usually studied under correct model specification, i.e., under the assumption that the data are generated by a given parametric model (e.g., logistic or, in the case of quantitative Y, linear model).
Consider the following example: let , where is the logistic function. Define the regression model by , where is a -distributed vector of predictors, and . The considered model is then obviously misspecified when the family of logistic models is fitted. However, it turns out in this case that, as X is elliptically contoured, (see ) and thus the supports of and coincide. Hence, despite misspecification, variable selection, i.e., finding out that and are the only active predictors, can be solved using the methods described below.
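This collinearity phenomenon can be checked numerically. The sketch below is a hypothetical illustration (the cubic link, the sample size, and the plain gradient-descent fit are our assumptions, not the paper's setup): when X is standard normal and the true response applies a nonlinear transform to the linear index, fitting an ordinary logistic model still yields a coefficient vector approximately proportional to the true one, so the active set is recovered despite misspecification.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20_000, 5
b_star = np.array([1.0, 1.0, 0.0, 0.0, 0.0])   # predictors 1 and 2 are active

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-np.clip(t, -30, 30)))

X = rng.standard_normal((n, p))
q = sigmoid((X @ b_star) ** 3)                 # misspecified (cubic) response
y = rng.binomial(1, q)

# Fit the misspecified logistic model by gradient descent on the empirical risk
beta = np.zeros(p)
for _ in range(800):
    beta -= X.T @ (sigmoid(X @ beta) - y) / n

# beta approximates eta * b_star for some eta > 0: same support, same direction
print(np.round(beta, 2))
```

The first two fitted coefficients are nearly equal and clearly nonzero, while the three inactive ones are close to zero, in line with the collinearity result cited above.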
For recent contributions to the study of Kullback–Leibler projections on the logistic model (which coincide with Equation (4) for the logistic loss, see below) and further references, we refer to the works of Kubkowski and Mielniczuk , Kubkowski and Mielniczuk , and Kubkowski . We also refer to the work of Lu et al. , where the asymptotic distribution of the adaptive Lasso is studied under misspecification in the case of a fixed number of deterministic predictors. Questions of robustness analysis revolve around the interplay between and , in particular under what conditions the directions of and coincide (cf. the important contributions by Brillinger  and Ruud ).
In the present paper, we discuss this problem in a more general non-parametric setting. Namely, the minus conditional log-likelihood is replaced by a general loss function of the form
where is some function, , and
is the associated risk function for . Our aim is to determine a support of , where
Coordinates of corresponding to non-zero coefficients are called active predictors and vector the pseudo-true vector.
The most popular loss functions are related to minus log-likelihood of specific parametric models such as logistic loss
related to , probit loss
related to , or quadratic loss related to linear regression and a quantitative response. Other losses, which do not correspond to any parametric model, such as the Huber loss (see ), are constructed with the specific aim of inducing certain desired properties of the corresponding estimators, such as robustness to outliers. We show in the following that the variable selection problem can be studied for a general loss function under certain analytic assumptions such as convexity and the Lipschitz property.
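For concreteness, here are standard textbook forms of these losses, written as functions of the linear score t = bᵀx and the binary response y (the paper's exact parametrisation is not reproduced here, so treat these as illustrative):

```python
import numpy as np
from math import erf, log, sqrt

def logistic_loss(t, y):
    # minus log-likelihood of the logistic model, in an overflow-safe form
    return np.log1p(np.exp(-abs(t))) + max(t, 0.0) - y * t

def probit_loss(t, y):
    # minus log-likelihood of the probit model; Phi is the standard normal cdf
    Phi = 0.5 * (1.0 + erf(t / sqrt(2.0)))
    return -(y * log(Phi) + (1 - y) * log(1.0 - Phi))

def quadratic_loss(t, y):
    return 0.5 * (y - t) ** 2

def huber_loss(t, y, delta=1.345):
    # quadratic for small residuals, linear (hence Lipschitz) for large ones
    r = abs(y - t)
    return 0.5 * r ** 2 if r <= delta else delta * (r - 0.5 * delta)
```

The logistic form avoids overflow for large |t|; δ = 1.345 is the conventional Huber tuning constant giving 95% efficiency under normal errors.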
For a fixed number p of predictors smaller than the sample size n, the statistical consequences of misspecification of a semi-parametric regression model were intensively studied by H. White and his collaborators in the 1980s. The concept of a projection on the fitted parametric model is central to these investigations, which show how the distribution of the maximum likelihood estimator of centered by changes under misspecification (cf., e.g., [13,14]). However, for the case when , the maximum likelihood estimator, which is a natural tool in the fixed-dimension case, is ill-defined and a natural question arises: what can be estimated and by what methods?
The aim of the present paper is to study the above problem in high-dimensional setting. To this end, we introduce two-stage approach in which the first stage is based on Lasso estimation (cf., e.g., )
where and the empirical risk corresponding to is
Parameter is the Lasso penalty, which penalizes large -norms of potential candidates for a solution. Note that the criterion function in Equation (8) for can be viewed as a penalized empirical risk for the logistic loss. The Lasso estimator is thoroughly studied in the case of the linear model with the square loss (see, e.g., [2,4] for references and an overview of the subject), and some papers treat the case when such a model is fitted to Y which does not necessarily depend linearly on the regressors (cf. ). In this case, the regression model is misspecified with respect to the linear fit. However, similar results are scarce for other scenarios, in particular for the logistic fit under misspecification. One of the notable exceptions is Negahban et al. , who studied the behavior of Lasso estimates for a general loss function and possibly misspecified models.
The output of the first stage is the Lasso estimate . The second stage consists in ordering the predictors according to the absolute values of the corresponding non-zero coordinates of the Lasso estimator and then minimizing the Generalized Information Criterion (GIC) over the resulting nested family. This is a variant of the SOS (Screening-Ordering-Selection) procedure introduced in . Let be the model chosen by the GIC procedure.
Our main contributions are as follows:
We prove that, under misspecification, when the sample size grows, the support coincides with the support of with probability tending to 1. In the general framework allowing for misspecification, this means that the selection rule is consistent, i.e., when . In particular, when the model in Equation (1) is correctly specified, this means that we recover the support of the true vector with probability tending to 1.
We also prove an approximation result for the Lasso estimator when predictors are random and is a convex Lipschitz function (cf. Theorem 1).
A useful corollary of the last result, derived in the paper, is the determination of sufficient conditions under which active predictors can be separated from spurious ones based on the absolute values of the corresponding coordinates of the Lasso estimator. This makes possible the construction of a nested family containing with large probability.
Significant insight has been gained for the fitting of a parametric model when the predictors are elliptically contoured (e.g., multivariate normal). Namely, it is known that in such a situation , i.e., these two vectors are collinear . Thus, in the case when , the support of coincides with the support s of , and the selection consistency of the two-step procedure proved in the paper entails direction and support recovery of . This may be considered as a partial justification of the frequent observation that classification methods are robust to misspecification of the model for which they are derived (see, e.g., [5,18]).
We now discuss how our results relate to previous work. Most variable selection methods in the high-dimensional case are studied for deterministic regressors; here, our results concern random regressors with subgaussian distributions. Note that the random-regressors scenario is much more realistic for experimental data than the deterministic one. To the best of our knowledge, the stated results are not available for random predictors even when the model is correctly specified. As to the novelty of the SS procedure, for its second stage we assume that the number of active predictors is bounded by a deterministic sequence tending to infinity and we minimize GIC over a family of models with sizes also satisfying this condition. Such an exhaustive search was proposed in  for linear models and extended to GLMs in  (cf. ). In these papers, GIC has been optimized over all possible subsets of regressors with cardinality not exceeding a certain constant . Such a method is feasible for practical purposes only when is small. Here, we consider a similar set-up but with important differences: is a data-dependent small nested family of models and optimization of GIC is considered in the case when the original model is misspecified. The regressors are assumed to be random and the assumptions are carefully tailored to this case. We also stress that the presented results cover the case when the regression model is correctly specified and Equation (5) is satisfied.
In numerical experiments, we study the performance of the grid version of logistic and linear SOS and compare it to several Lasso-based competitors.
The paper is organized as follows. Section 2 contains auxiliary material, including new useful probability inequalities for the empirical risk in the case of subgaussian random variables (Lemma 2). In Section 3, we prove a bound on the approximation error of the Lasso when the loss function is convex and Lipschitz and the regressors are random (Theorem 1). This yields the separation property of the Lasso. In Theorems 2 and 3 of Section 4, we prove GIC consistency on a nested family, which in particular can be built according to the order in which the Lasso coordinates are included in the fitted model. In Section 5.1, we discuss consequences of the proved results for the semi-parametric binary model when the distribution of predictors satisfies the linear regressions condition. In Section 6, we numerically compare the performance of the two-stage selection method for two closely related models, one of which is a logistic model and the second of which is misspecified.
2. Definitions and Auxiliary Results
In the following, we allow the random vector , , and p to depend on the sample size n, i.e., and . We assume that n copies of a random vector in are observed together with the corresponding binary responses . Moreover, we assume that the observations are independent and identically distributed (iid). If this condition is satisfied for each n, but not necessarily across different sample sizes n and m, i.e., the distributions of may differ from those of or they may be dependent for , then such a framework is called a triangular scenario. A frequently considered scenario is the sequential one: when the sample size n increases, we observe values of new predictors in addition to the ones observed earlier. This is a special case of the above scheme as then . In the following, we skip the upper index n if no ambiguity arises. Moreover, we write . We impose a condition on the distributions of the random predictors and assume that the coordinates of are subgaussian with subgaussianity parameter , i.e., it holds that (see )
for all . This condition says that the tails of do not decrease more slowly than the tails of the normal distribution . For future reference, let
and assume in the following that
We assume moreover that are linearly independent in the sense that no nontrivial linear combination of them is constant almost everywhere. We consider a general form of the response function and assume that for the given loss function , as defined in Equation (7), exists and is unique. For , let be defined as in Equation (7) with the minimum taken over b with support in s. We let
denote the support of with .
Let for and . Let be restricted to its support . Note that if , then provided projections are unique (see Section 2) we have
Note that this implies that for every superset of s the projection on the model pertaining to s is obtained by appending projection with appropriate number of zeros. Moreover, let
We remark that , and may depend on n. We stress that is an important quantity in the development here, as it must not decrease too quickly in order to obtain approximation results for (see Theorem 1). Note that, when the parametric model is correctly specified, i.e., for some with l being the associated log-likelihood loss, if s is the support of , we have .
First, we discuss quantities and assumptions needed for the first step of SS procedure.
We consider cones of the form:
where , and for . Cones are of special importance because we prove that (see Lemma 3). In addition, we note that since -norm is decomposable in the sense that the definition of the cone above can be stated as
Thus, consists of vectors which do not put too much mass on the complement of . Let be a fixed non-negative definite matrix. For cone , we define a quantity which can be regarded as a restricted minimal eigenvalue of a matrix in high-dimensional set-up:
In the considered context, H is usually taken to be the hessian and, e.g., for the quadratic loss, it equals . When H is non-negative definite but not strictly positive definite, its smallest eigenvalue equals zero and thus . That is why we restrict minimization in Equation (12) in order to have in the high-dimensional case. As we prove that (and use this fact later), it is useful to restrict minimization in Equation (12) to . Let R and be the risk and the empirical risk defined above. Moreover, we introduce the following notation:
Note that . Thus, corresponds to oscillation of centred empirical risk over ball . We need the following Margin Condition (MC) in Lemma 3 and Theorem 1:
There exist and non-negative definite matrix such that for all b with we have
The above condition can be viewed as a weaker version of strong convexity of the function R (when the right-hand side is replaced by ) in a restricted neighbourhood of (namely, in the intersection of the ball and the cone ). We stress that H is not required to be positive definite, as in Section 3 we use Condition (MC) together with conditions stronger than which imply that the right-hand side of the inequality in (MC) is positive. We also do not require twice differentiability of R. We note in particular that Condition (MC) is satisfied in the case of the logistic loss with X a bounded random variable and (see [23,24,25]). It is also easily seen that (MC) is satisfied for the quadratic loss and X such that and . A condition similar to (MC) (called Restricted Strict Convexity) was considered in  for the empirical risk :
for all , some , and tolerance function . Note however that MC is a deterministic condition, whereas Restricted Strict Convexity has to be satisfied for random empirical risk function.
Another important assumption, used in Theorem 1 and Lemma 2, is the Lipschitz property of
Now, we discuss preliminaries needed for the development of the second step of SS procedure. Let stand for dimension of w. For the second step of the procedure we consider an arbitrary family of models (which are identified with subsets of and may be data-dependent) such that a.e. and is some deterministic sequence. We define Generalized Information Criterion (GIC) as:
is the ML estimator for model w, as the minimization above is taken over all vectors b with support in w. Parameter is a penalty factor, depending on the sample size n, which weighs the complexity of the model as measured by the number of its variables . Typical examples of include:
AIC (Akaike Information Criterion): ;
BIC (Bayesian Information Criterion): ; and
EBIC(d) (Extended BIC): , where .
AIC, BIC and EBIC were introduced by Akaike , Schwarz , and Chen and Chen , respectively. Note that for large n the BIC penalty is larger than the AIC penalty and, in turn, the EBIC penalty is larger than the BIC penalty.
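In code, these penalty factors take a simple form (standard textbook versions; d ∈ [0, 1] is the user-chosen EBIC constant):

```python
import numpy as np

def gic_penalty(kind, n, p, d=0.5):
    """Penalty factor multiplying the model dimension |w| in GIC."""
    if kind == "AIC":
        return 2.0
    if kind == "BIC":
        return float(np.log(n))
    if kind == "EBIC":                    # EBIC(d): log n + 2*d*log p
        return float(np.log(n) + 2 * d * np.log(p))
    raise ValueError(f"unknown criterion: {kind}")

n, p = 500, 10_000
print([round(gic_penalty(k, n, p), 2) for k in ("AIC", "BIC", "EBIC")])
# -> [2.0, 6.21, 15.42]
```

The ordering AIC < BIC < EBIC noted above is visible directly: for n = 500 and p = 10,000 the per-variable penalties are 2, log n ≈ 6.21, and log n + log p ≈ 15.42.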
We study properties of for , where:
and is the maximal absolute value of the centred empirical risk and sets for are defined as follows:
The idea here is simply to consider sets consisting of vectors having no more than non-zero coordinates. However, for , we need that for , we have , which we exploit in Lemma 2. This entails the additional condition in the definition of . Moreover, in Section 4, we consider the following condition for , and some :
: for all such that and
We observe also that, although Conditions (MC) and are similar, they are not equivalent, as they hold for belonging to different sets: and , respectively. If the minimal eigenvalue of matrix H in Condition (MC) is positive and Condition (MC) holds for (instead of for ), then we have for :
Furthermore, if is the maximal eigenvalue of H and Condition holds for all without restriction on , then we have for :
Thus, Condition (MC) holds in this case. A similar condition to Condition for empirical risk was considered by Kim and Jeon  (formula (2.1)) in the context of GIC minimization. It turns out that Condition together with being convex for all y and satisfying Lipschitz Condition (LL) are sufficient to establish bounds which ensure GIC consistency for and (see Corollaries 2 and 3). First, we state the following basic inequality. and are defined above the definition of Margin Condition.
(Basic inequality). Let be a convex function for all . If for some we have
The proof of the lemma is deferred to Appendix A. It follows from the lemma that, since in view of the decomposability of the -distance we have , when is small, is not large in comparison with .
Quantities are defined in Equation (18). Recall that is an oscillation taken over ball , whereas are oscillations taken over ball with restriction on the number of nonzero coordinates.
Let be a convex function for all y and satisfy Lipschitz Condition (LL). Assume that for are subgaussian , where . Then, for :
The proof of the Lemma above, which relies on the Chebyshev inequality, the symmetrization inequality (see Lemma 2.3.1 of ), and the Talagrand–Ledoux inequality (, Theorem 4.12), is deferred to Appendix A. In the case when does not depend on n, and thus its support does not change, Part 3 implies in particular that is of order in probability.
3. Properties of Lasso for a General Loss Function and Random Predictors
The main result in this section is Theorem 1. The idea of the proof is based on the fact that, if defined in Equation (16) is sufficiently small (condition is satisfied), then lies in a ball (see Lemma 3). Using a tail inequality for proved in Lemma 2, we obtain Theorem 1. Note that has to be bounded away from 0 (condition ). Convexity of below is understood as convexity for both .
Let be a convex function and assume that . Moreover, assume margin Condition (MC) with constants and some non-negative definite matrix . If for some we have and , where and , then
The proof of the lemma is deferred to Appendix A.
The first main result provides an exponential inequality for . The threshold is crucial there as it ensures separation: (see proof of Corollary 1).
Let be a convex function for all y and satisfy Lipschitz Condition (LL). Assume that , exists and is unique, margin Condition (MC) is satisfied for , a non-negative definite matrix , and let
Lemmas 2 and 3 imply that:
(Separation property) If assumptions of Theorem 1 are satisfied,
for some and for some for large n, then
The first part of the corollary follows directly from Theorem 1 and the observation that:
Now, we prove that condition implies separation property
Indeed, observe that for all we have:
If then using triangle inequality yields:
Hence, from the above inequality and Equation (22), we obtain for : If then and Equation (22) takes the form: This ends the proof. □
We note that the separation property in Equation (21) means that, when is chosen in an appropriate manner, recovery of is feasible with large probability if all predictors whose Lasso coefficients exceed a certain threshold in absolute value are chosen. The threshold unfortunately depends on unknown parameters of the model. However, the separation property allows us to restrict attention to a nested family of models and thus to significantly decrease the computational complexity of the problem. This is dealt with in the next section. Note moreover that if in Equation (10) is finite, then defined in the Corollary is of order , which is the optimal order of the Lasso penalty in the case of deterministic regressors (see, e.g., ).
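The way the separation property restricts the search can be made concrete: order the nonzero Lasso coordinates by absolute value and form the nested family of candidate supports (the coefficient vector below is a made-up number for illustration):

```python
import numpy as np

# A made-up Lasso estimate: coordinates 2 and 0 dominate, the rest is noise
b_lasso = np.array([1.8, 0.0, -2.4, 0.05, 0.0, -0.1])

# Order nonzero coordinates by decreasing |coefficient|
order = [int(j) for j in np.argsort(-np.abs(b_lasso)) if b_lasso[j] != 0]

# Nested family of candidate supports; GIC is then minimized over this
# family instead of over all 2^p subsets
family = [sorted(order[:k]) for k in range(len(order) + 1)]
print(family)   # -> [[], [2], [0, 2], [0, 2, 5], [0, 2, 3, 5]]
```

If the separation property holds, the true support appears as one member of this short nested list, so the combinatorial search collapses to at most p + 1 GIC evaluations.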
4. GIC Consistency for a General Loss Function and Random Predictors
Theorems 2 and 3 state probability inequalities related to the behavior of GIC on supersets and subsets of , respectively. In a nutshell, we show for supersets and subsets separately that the probability that the minimum of GIC is not attained at is exponentially small. Corollaries 2 and 3 present asymptotic conditions for GIC consistency in the aforementioned situations. Corollary 4 gathers the conclusions of Theorem 1 and Corollaries 1–3 to show consistency of the SS procedure (see  for consistency of the SOS procedure for a linear model with deterministic predictors) in the case of subgaussian variables. Note that in the Theorem below we consider minimization of GIC in Equation (23) over all supersets of , as in our applications is data dependent. As the number of such subsets is at least , the proof has to be more involved than a reasoning based on the Bonferroni inequality.
Assume that is a convex, Lipschitz function with constant , and that condition holds for some and for every such that . Then, for any , we have:
where and .
If and , then in view of inequalities and we have:
Note that . Hence, if we have for some : , then we obtain and from the above inequality we have . Furthermore, if and , then consider:
where . Then
as function is increasing with respect to x for . Moreover, we have . Hence, in view of condition, we get:
From convexity of , we have:
Let denote the support of vector v. We observe that , hence . Finally, we have:
Hence, we obtain the following sequence of inequalities:
Assume that the conditions of Theorem 2 hold and for some and for every such that , and , where for some . Then, we have
We now choose the radius r of in a special way. Namely, we take:
for some . In view of assumptions . Consider such that for all . Hence, the second term of the upper bound in Equation (23) for is equal to:
Similarly, the first term of the upper bound in Equation (23) is equal to:
These two convergences end the proof. □
The most restrictive condition of Corollary 2 is , which is slightly weaker than . The following remark, proved in Appendix A, gives sufficient conditions for consistency of BIC and EBIC penalties, which do not satisfy condition .
If in Corollary 2 we assume for some , then condition holds when:
and for some .
, and , where .
, , , , where and .
Theorem 3 is an analog of Theorem 2 for subsets of .
Assume that is a convex, Lipschitz function with constant , condition holds for some , and . Then, we have:
Suppose that for some we have . This is equivalent to:
In view of inequalities and , we obtain:
Let for some to be specified later. From convexity of , we consider:
We consider two cases separately:
First, observe that
which follows from our assumption. Let and
Note that . Then, as function is increasing and bounded from above by 1 for , we obtain:
Hence, in view of condition, we have:
Using Equations (24)–(26) and the above inequality yields:
Thus, in view of Lemma 2, we obtain:
In this case, we take and define v as in Equation (26). Analogously, as in Equation (27), we have:
Hence, in view of condition, we have:
Using Equation (24) and the above inequality yields:
Thus, in view of Lemma 2, we obtain:
By combining Equations (28) and (29), the theorem follows. □
Assume that the loss is a convex, Lipschitz function with constant and that condition holds for some and ; then
First, observe that as
and thus in view of Theorem 3 we have
5. Selection Consistency of SS Procedure
In this section, we combine the results of the two previous sections to establish consistency of the two-step SS procedure. It consists in constructing a nested family of models using the magnitudes of the Lasso coefficients and then finding the minimizer of GIC over this family. As is data dependent, to establish consistency of the procedure we use Corollaries 2 and 3, in which the minimizer of GIC is considered over all subsets and supersets of .
SS (Screening and Selection) procedure is defined as follows:
Choose some .
Find such that and .
The SS procedure is a modification of the SOS procedure in , designed for linear models. Since the ordering step considered in  is omitted in the proposed modification, we abbreviate the name to SS.
Corollary 4 and Remark 2 describe situations in which the SS procedure is selection consistent. In them, we use the assumptions imposed in Section 2 and Section 3 together with the assumption that the support of contains no more than elements, where is some deterministic sequence of integers. Let be the nested family constructed in Step 4 of the SS procedure.
Assume that is a convex, Lipschitz function with constant , and that exists and is unique. If is some sequence, margin Condition (MC) is satisfied for some , condition holds for some and for every such that , and the following conditions are fulfilled:
for some , where H is non-negative definite matrix and is defined in Equation (12),
then for SS procedure we have
In view of Corollary 1, which follows from the separation property in Equation (22), we obtain . Let:
Then, we have again from the fact that , union inequality and Corollary 2:
In an analogous way, using and Corollary 3 yields:
Now, observe that in view of definition of and union inequality:
Thus, in view of the above inequality and Equations (30) and (31). □
5.1. Case of Misspecified Semi-Parametric Model
Consider now the important case of the misspecified semi-parametric model defined in Equation (5) for which function is unknown and may be arbitrary. An interesting question is whether information about can be recovered when misspecification occurs. The answer is positive under some additional assumptions on distribution of random predictors. Assume additionally that X satisfies
where is the true parameter. Thus, the regressions of X given have to be linear. We stress that the conditioning involves only the true in Equation (5). Then, it is known (cf. [5,10,11]) that and if . Note that, because and are collinear, it follows that . This is important in practical applications as it shows that the position of the optimal separating direction given by can be consistently recovered. It is also worth mentioning that, if Equation (32) is satisfied, the direction of coincides with the direction of the first canonical vector. We refer to the work of Kubkowski and Mielniczuk  for the proof and to the work of Kubkowski and Mielniczuk  for a discussion and up-to-date references on this problem. The linear regressions condition in Equation (32) is satisfied, e.g., by elliptically contoured distributions, in particular by the multivariate normal. We note that it is proved in  that Equation (32) holds approximately for the majority of . When Equation (32) holds exactly, the proportionality constant can be calculated numerically for known and . We can thus state the following result provided Equation (32) is satisfied.
Assume that Equation (32) and the assumptions of Corollary 4 are satisfied. Moreover, . Then, .
If for some , , , , , , , , , , then assumptions imposed on asymptotic behavior of parameters in Corollary 4 are satisfied.
Note that is allowed to grow exponentially: ; however, must not decrease to 0 too quickly with regard to the growth of : .
We note that, to apply Corollary 4 to the two-step procedure based on the Lasso, it is required that and that the support of the Lasso estimator contains, with probability tending to 1, no more than elements. Some results bounding are available for deterministic X (see ) and for random X (see ), but they are too weak to be useful for EBIC penalties. Another possibility for proving consistency of the two-step procedure is to modify its first step by using the thresholded Lasso (see ) corresponding to the largest Lasso coefficients, where is such that . This is a subject of ongoing research.
6. Numerical Experiments
6.1. Selection Procedures
We note that the original procedure is defined for a single only. In the simulations discussed below, we implemented modifications of the SS procedure introduced in Section 5. In practice, it is generally more convenient to consider in the first step some sequence of penalty parameters instead of only one, in order to avoid choosing the “best” . For the fixed sequence , we construct the corresponding families analogously to in Step 4 of the SS procedure. Thus, we arrive at the following SSnet procedure, which is a modification of the SOSnet procedure in . Below, denotes a vector b with the first coordinate, corresponding to the intercept, omitted, :
Choose some .
Find for .
Find where are such that for .
Define for .
Find , where
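The two steps above can be sketched in code. This is a hedged pure-Python illustration, not the paper's implementation: the Lasso solutions on the penalty grid are assumed to be precomputed (`lasso_coefs`), the unpenalized refit is abstracted into a caller-supplied function `emp_risk`, and the toy risk and GIC constants below are hypothetical:

```python
import math

def nested_families(lasso_coefs):
    """Step 1 (screening): for each Lasso solution on the lambda grid, order its
    nonzero coordinates by decreasing absolute value and form the nested family
    of supports {top-1}, {top-1, top-2}, ...; the candidate family is the union
    over the whole grid."""
    family = set()
    for coef in lasso_coefs:
        nonzero = [j for j in range(len(coef)) if coef[j] != 0.0]
        nonzero.sort(key=lambda j: -abs(coef[j]))
        for k in range(1, len(nonzero) + 1):
            family.add(frozenset(nonzero[:k]))
    return family

def ss_select(lasso_coefs, emp_risk, n, penalty):
    """Step 2 (selection): minimize GIC(s) = 2*n*emp_risk(s) + penalty*|s|
    over the candidate family (empty model included)."""
    candidates = nested_families(lasso_coefs) | {frozenset()}
    return min(candidates, key=lambda s: 2 * n * emp_risk(s) + penalty * len(s))

# Hypothetical toy setup: predictors 0 and 2 are the true ones.
true_s = {0, 2}

def emp_risk(s):
    # stand-in empirical risk: drops for each true predictor included,
    # unaffected by spurious predictors
    return 1.0 - 0.4 * len(true_s & s)

grid = [[0.5, 0.0, 0.3, 0.0],   # large lambda: two predictors survive
        [0.5, 0.1, 0.3, 0.0]]   # smaller lambda: a spurious one enters
n, penalty = 100, math.log(100)  # BIC-type penalty
print(sorted(ss_select(grid, emp_risk, n, penalty)))  # -> [0, 2]
```

The spurious predictor enters a candidate support but is rejected by the GIC penalty, which is exactly the dimension-reduction effect of the second step.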
Instead of constructing families for each in the SSnet procedure, can be chosen by cross-validation using the 1SE rule (see ), and then the SS procedure is applied for this . We call this procedure SSCV. The last procedure considered was introduced by Fan and Tang  and is the Lasso procedure with penalty parameter chosen in a data-dependent way analogous to SSCV. Namely, it is the minimizer of the GIC criterion with , for which the ML estimator has been replaced by the Lasso estimator with penalty . Once is calculated, is defined as its support. This procedure is called LFT in the sequel.
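The 1SE rule used by SSCV can be sketched as follows; the per-fold error lists and the decreasing ordering of the penalty grid are assumptions of this illustration:

```python
import math
import statistics

def lambda_1se(lambdas, cv_errors):
    """1SE rule: among lambdas (assumed sorted in decreasing order), pick the
    largest one whose mean CV error is within one standard error of the
    smallest mean error; cv_errors[i] holds the per-fold errors for lambdas[i]."""
    means = [statistics.mean(e) for e in cv_errors]
    ses = [statistics.stdev(e) / math.sqrt(len(e)) for e in cv_errors]
    i_min = min(range(len(means)), key=lambda i: means[i])
    cutoff = means[i_min] + ses[i_min]
    # the first admissible lambda is the largest, i.e. the sparsest model
    # within one standard error of the best
    return next(lam for lam, m in zip(lambdas, means) if m <= cutoff)

lambdas = [1.0, 0.5, 0.1]          # decreasing grid
cv_errors = [[0.30, 0.34, 0.32],   # 3-fold errors per lambda
             [0.25, 0.27, 0.29],
             [0.24, 0.28, 0.26]]
print(lambda_1se(lambdas, cv_errors))  # -> 0.5
```

Choosing the largest admissible penalty rather than the exact minimizer trades a negligible loss in CV error for a sparser screening set.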
We list below the versions of the above procedures along with the R packages that were used to choose the sequence and to compute the Lasso estimator. The following packages were chosen based on selection performance in initial tests for each loss and procedure:
SSnet with logistic or quadratic loss: ncvreg;
SSCV or LFT with logistic or quadratic loss: glmnet; and
SSnet, SSCV or LFT with Huber loss (cf. ): hqreg.
The following functions were used for the optimization in the GIC minimization step for each loss:
logistic loss: glm.fit (package stats);
quadratic loss: .lm.fit (package stats); and
Huber loss: rlm (package MASS).
Before applying the investigated procedures, each column of the matrix was standardized, as the Lasso estimator depends on the scaling of predictors. We set the length of the sequence to . Moreover, in all procedures we considered only for which , because when , the Lasso and ML solutions are not unique (see [32,36]). For the Huber loss, we set parameter (see ). The number of folds in SSCV was set to .
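The column standardization mentioned above can be sketched as follows (a minimal illustration; the population-variance scaling used here is one common convention, and packages such as glmnet make their own internal choice):

```python
import math

def standardize_columns(X):
    """Center each column of X to mean 0 and scale it to unit (population)
    standard deviation, so that the Lasso penalty treats all predictors on
    the same scale."""
    n, p = len(X), len(X[0])
    Z = [row[:] for row in X]
    for j in range(p):
        col = [X[i][j] for i in range(n)]
        mu = sum(col) / n
        sd = math.sqrt(sum((x - mu) ** 2 for x in col) / n)
        for i in range(n):
            Z[i][j] = (X[i][j] - mu) / sd
    return Z

# Columns on very different scales end up comparable after standardization
print(standardize_columns([[1, 10], [3, 30]]))  # -> [[-1.0, -1.0], [1.0, 1.0]]
```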
Each simulation run consisted of L repetitions, during which samples and were generated for . For the kth sample, the estimator of the set of active predictors was obtained by a given procedure as the support of , where
is the ML estimator for the kth sample. We denote by the family obtained by a given procedure for the kth sample.
In our numerical experiments, we computed the following measures of selection performance, which gauge the co-direction of the true parameter and and the interplay between and :
and we let , if ,
Thus, is equal to the of the angle between the true parameter (with intercept omitted) and its post-model-selection estimator, averaged over simulations; is the fraction of simulations for which the family contains the true model ; and and are the fractions of simulations in which SSnet chooses the true model or its superset, respectively.
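The angle-based measure can be illustrated as follows; this is a sketch for a single replication (the averaging over simulations described above is omitted), with the intercept coordinate assumed already dropped:

```python
import math

def angle_deg(beta, beta_hat):
    """Angle (in degrees) between the true parameter and its
    post-model-selection estimate: 0 means perfect co-direction,
    90 means orthogonality."""
    dot = sum(a * b for a, b in zip(beta, beta_hat))
    norm = math.sqrt(sum(a * a for a in beta)) * math.sqrt(sum(b * b for b in beta_hat))
    # clamp to [-1, 1] to guard against floating-point overshoot
    c = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(c))

# Collinear vectors give angle 0 even when the scale differs; this is
# exactly the invariance exploited under misspecification.
print(round(angle_deg([1.0, 2.0, 0.0], [2.0, 4.0, 0.0]), 4))  # -> 0.0
```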
6.2. Regression Models Considered
To investigate the behavior of the two-step procedure under misspecification, we considered two similar models with different sets of predictors. As the sets of predictors differ, the first model (Model M1) is correctly specified and the second (Model M2) is misspecified.
Namely, in Model M1, we generated n observations for such that:
where , and . We consider response function for , and . Thus,
We observe that the last equality implies that the above binary model is correctly specified with respect to the family of fitted logistic models, and and are the four active predictors, whereas the remaining ones play no role in the prediction of Y. Hence, and are, respectively, the sets of indices of active predictors and of non-zero coefficients of the projection onto the family of logistic models.
We considered the following parameters in the numerical experiments: , and (the number of generated datasets for each combination of parameters). We investigated the procedures SSnet, SSCV, and LFT using logistic, quadratic, and Huber (cf. ) loss functions. For the procedures SSnet and SSCV, we used GIC penalties with:
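The penalty formulas themselves did not survive in the text above; as a hedged sketch, the classical BIC penalty and the extended BIC (EBIC) penalty of Chen and Chen have the following standard forms for a model with k parameters out of p candidates (the exact constants used in the experiments may differ):

```python
import math

def bic_penalty(k, n, p):
    """Classical BIC penalty: k * log(n)."""
    return k * math.log(n)

def ebic_penalty(k, n, p, gamma=1.0):
    """Extended BIC penalty (standard Chen-Chen form, shown here as an
    assumption; the paper's exact constants are not reproduced in the text):
    BIC plus a term growing with the number p of candidate predictors."""
    return k * math.log(n) + 2.0 * gamma * math.log(math.comb(p, k))

# With p >> n, the EBIC penalty is markedly heavier than BIC:
print(bic_penalty(3, 100, 1000) < ebic_penalty(3, 100, 1000))  # -> True
```

The extra term penalizes the sheer number of models of size k, which is what makes EBIC-type penalties suitable for the p >> n regime studied here.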
In Model M2, we generated n observations for such that and , and . Response function is for , and . This means that:
In comparison to Model M1, this model does not contain monomials of and of degree higher than 1 in its set of predictors. We observe that this binary model is misspecified with respect to the fitted family of logistic models, because for any . However, in this case, the linear regressions condition in Equation (32) is satisfied for X, as it follows a normal distribution (see [5,7]). Hence, in view of Proposition 3.8 in , we have and for some . The parameters as well as L were chosen as in Model M1.
6.3. Results for Models M1 and M2
We first discuss the behavior of , and for the considered procedures. We observe that the values of for SSCV and SSnet are close to 1 for low correlations in Model M2 for every tested loss (see Figure 1). In Model M1, attains the largest values for the SSnet procedure with logistic loss for low correlations, which is because in most cases the corresponding family is the largest among the families created by the considered procedures. is close to 0 in Model M1 for quadratic and Huber loss, which results in low values of the remaining indices. This may be due to strong dependencies between predictors in Model M1; note that we have, e.g., . It is seen that in Model M1 the inclusion probability is much lower than in Model M2 (except for negative correlations). It is also seen that for SSCV is larger than for LFT, and LFT fails with respect to in M1.
In Model M1, the largest values are attained for SSnet with the BIC penalty; the second best is SSCV with the EBIC1 penalty (see Figure 2). In Model M2, is close to 1 for SSnet and SSCV with the EBIC1 penalty and is much larger than for the corresponding versions using the BIC penalty. We also note that the choice of loss is relevant only for larger correlations. These results confirm the theoretical result of Theorem 2.1 in , which shows that collinearity holds for a broad class of loss functions. We also observe that, although in Model M2 the remaining procedures do not select with high probability, they select its superset, as indicated by the values of (see Figure 3). This analysis is confirmed by an analysis of the measure (see Figure 4), which attains values close to 0 when is close to 1. Low values of the measure mean that the estimated vector is approximately proportional to , which is the case for Model M2, where normal predictors satisfy the linear regressions condition. Note that the angles between and in Model M1 differ significantly even though Model M1 is well specified. In addition, for the best performing procedures in both models and any loss considered, is much larger in Model M2 than in Model M1, even though the latter is correctly specified. This shows that choosing a simple misspecified model which retains the crucial characteristics of the well-specified large model may be preferable to using the latter.
In Model M1, the procedures with the BIC penalty perform better than those with the EBIC1 penalty; however, the gain for is much smaller than the gain from using EBIC1 in Model M2. The LFT procedure performs poorly in Model M1 and reasonably well in Model M2. The overall winner in both models is SSnet. SSCV performs only slightly worse than SSnet in Model M2 but significantly worse in Model M1.
Analysis of the computing times of the first and second stages of each procedure shows that the SSnet procedure creates large families , and GIC minimization becomes computationally intensive. We also observe that the first stage of SSCV is more time consuming than that of SSnet, which is caused by the multiple fitting of Lasso in cross-validation. However, SSCV is much faster than SSnet in the second stage.
We conclude that in the considered experiments SSnet with the EBIC1 penalty works best in most cases; however, even for the winning procedure, strong dependence between predictors results in a deterioration of its performance. It is also clear from our experiments that the choice of the GIC penalty is crucial for performance. A modification of the SS procedure which would perform satisfactorily for large correlations is still an open problem.
In this paper, we study the problem of selecting the set of active variables in a binary regression model when the number of all predictors p is much larger than the number of observations n and the active predictors are sparse among all predictors, i.e., their number is significantly smaller than p. We consider a general binary model and a fit based on the minimization of the empirical risk corresponding to a general loss function. This scenario encompasses the case, common in practice, when the underlying semi-parametric model is misspecified, i.e., the assumed response function differs from the true one. For random predictors, we show that in such a case the two-step procedure based on Lasso consistently estimates the support of the pseudo-true vector . Under the linear regressions condition and a semi-parametric model, this implies consistent recovery of a subset of active predictors. This partly explains why selection procedures perform satisfactorily even when the fitted model is wrong. We show that, by using the two-step procedure, we can successfully reduce the dimension of the model chosen by Lasso. Moreover, for the two-step procedure with random predictors, we do not require the restrictive conditions on the experimental matrix, such as the irrepresentable condition, that are needed for Lasso support consistency with deterministic predictors. Our experiments show satisfactory behavior of the proposed SSnet procedure with the EBIC1 penalty.
Future research directions include studying the performance of the SS procedure without the subgaussianity assumption and, a matter of practical importance, an automatic choice of the penalty for the GIC criterion. Moreover, finding a modification of the SS procedure that performs satisfactorily for large correlations remains an open problem. It would also be of interest to find conditions weaker than Equation (32) which would still lead to collinearity of and (see  for a different angle on this problem).
Both authors contributed equally to this work. All authors have read and agreed to the published version of the manuscript.
The research of the second author was partially supported by Polish National Science Center grant .
The comments by the two referees, which helped to improve the presentation of the original version of the manuscript, are gratefully acknowledged.
Conflicts of Interest
The authors declare no conflict of interest.
Proof of Lemma 1.
Observe first that function is convex as is convex. Moreover, from the definition of , we get the inequality:
Note that as we have:
By definition of convexity of , Equation (A2) and definition of S, we have:
From the convexity of norm, Equations (A1) and (A3), equality , and triangle inequality, it follows that:
We now prove Lemma A1, which is needed in the proof of Lemma 2 below.
Assume that and T is a random variable such that where M is some positive constant and S and T are independent. Then,
Proof of Lemma 2.
From the Chebyshev inequality (first inequality below), symmetrization inequality (see Lemma 2.3.1 of ) and Talagrand–Ledoux inequality (, Theorem 4.12), we have for and being Rademacher variables independent of :
We observe that in view of Lemma A1. Hence, using independence, we obtain and thus . Applying the Hölder inequality and the following inequality (see Lemma 2.2 of ):
From this, Part 1 follows. In the proofs of Parts 2 and 3, the first inequalities are the same as in Equation (A5), with the suprema taken over the corresponding sets. Using the Cauchy–Schwarz inequality, the inequality , the inequality for , and Equation (A6) yields:
Similarly for , using Cauchy–Schwarz inequality, , which is valid for , definition of norm and inequality for , we obtain:
Proof of Lemma 3.
Let u and v be defined as in Lemma 1. Observe that is equivalent to , as the function is increasing, and . Let . We consider two cases:
In this case, from the basic inequality (Lemma 1), we have:
Note that , as otherwise we would have , which contradicts Equation (A2) in the proof of Lemma 1. Now, we observe that, as , we have from the definition of C and the assumption for this case:
By inequality between and norms, the definition of inequality , and margin Condition (MC) (which holds because in view of Equation (A2)), we conclude that:
Hence, from the basic inequality (Lemma 1) and the inequality above, it follows that:
Subtracting from both sides of the above inequality and using the assumption on S, the bound on , and the definition of yields:
Proof of Remark 1.
Condition is equivalent to the condition that there exists some such that for almost all n we have:
(1) We observe that, if
then the above condition is satisfied. For BIC, we have:
which is equivalent to the condition (1) of the Remark.
(2) We observe that using inequalities , and yields for :
(3) In this case, we check similarly as in (2) that
Cover, T.; Thomas, J. Elements of Information Theory; Wiley: Hoboken, NJ, USA, 2006.
Bühlmann, P.; van de Geer, S. Statistics for High-Dimensional Data; Springer: New York, NY, USA, 2011.
van de Geer, S. Estimation and Testing Under Sparsity; Lecture Notes in Mathematics; Springer: New York, NY, USA, 2009.
Hastie, T.; Tibshirani, R.; Wainwright, M. Statistical Learning with Sparsity; Springer: New York, NY, USA, 2015.
Li, K.; Duan, N. Regression analysis under link violation. Ann. Stat. 1989, 17, 1009–1052.
Kubkowski, M.; Mielniczuk, J. Active set of predictors for misspecified logistic regression. Statistics 2017, 51, 1023–1045.
Kubkowski, M.; Mielniczuk, J. Projections of a general binary model on logistic regression. Linear Algebra Appl. 2018, 536, 152–173.
Kubkowski, M. Misspecification of Binary Regression Model: Properties and Inferential Procedures. Ph.D. Thesis, Warsaw University of Technology, Warsaw, Poland, 2019.
Lu, W.; Goldberg, Y.; Fine, J. On the robustness of the adaptive lasso to model misspecification. Biometrika 2012, 99, 717–731.
Brillinger, D. A generalized linear model with ‘Gaussian’ regressor variables. In A Festschrift for Erich L. Lehmann; Wadsworth International Group: Belmont, CA, USA, 1982; pp. 97–113.
Ruud, P. Sufficient conditions for the consistency of maximum likelihood estimation despite misspecification of distribution in multinomial discrete choice models. Econometrica 1983, 51, 225–228.
Yi, C.; Huang, J. Semismooth Newton coordinate descent algorithm for elastic-net penalized Huber loss regression and quantile regression. J. Comput. Graph. Stat. 2017, 26, 547–557.
White, H. Maximum likelihood estimation of misspecified models. Econometrica 1982, 50, 1–25.
Vuong, Q. Likelihood ratio tests for model selection and non-nested hypotheses. Econometrica 1989, 57, 307–333.
Bickel, P.; Ritov, Y.; Tsybakov, A. Simultaneous analysis of Lasso and Dantzig selector. Ann. Stat. 2009, 37, 1705–1732.
Negahban, S.N.; Ravikumar, P.; Wainwright, M.J.; Yu, B. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Stat. Sci. 2012, 27, 538–557.
Pokarowski, P.; Mielniczuk, J. Combined ℓ1 and greedy ℓ0 penalized least squares for linear model selection. J. Mach. Learn. Res. 2015, 16, 961–992.
Hall, P.; Li, K.C. On almost linearity of low dimensional projections from high dimensional data. Ann. Stat. 1993, 21, 867–889.
Chen, J.; Chen, Z. Extended Bayesian information criterion for model selection with large model spaces. Biometrika 2008, 95, 759–771.
Chen, J.; Chen, Z. Extended BIC for small-n-large-p sparse GLM. Stat. Sin. 2012, 22, 555–574.
Mielniczuk, J.; Szymanowski, H. Selection consistency of Generalized Information Criterion for sparse logistic model. In Stochastic Models, Statistics and Their Applications; Steland, A., Rafajłowicz, E., Szajowski, K., Eds.; Springer: Cham, Switzerland, 2015; Volume 122, pp. 111–118.
Vershynin, R. Introduction to the non-asymptotic analysis of random matrices. In Compressed Sensing, Theory and Applications; Cambridge University Press: Cambridge, UK, 2012; pp. 210–268.
Fan, J.; Xue, L.; Zou, H. Supplement to “Strong Oracle Optimality of Folded Concave Penalized Estimation”. 2014. Available online: NIHMS649192-supplement-suppl.pdf (accessed on 25 January 2020).
Fan, J.; Xue, L.; Zou, H. Strong oracle optimality of folded concave penalized estimation. Ann. Stat. 2014, 43, 819–849.
Bach, F. Self-concordant analysis for logistic regression. Electron. J. Stat. 2010, 4, 384–414.
Akaike, H. Statistical predictor identification. Ann. Inst. Stat. Math. 1970, 22, 203–217.
The statements, opinions and data contained in the journal Entropy are solely
those of the individual authors and contributors and not of the publisher and the editor(s).
MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.