# Selection Consistency of Lasso-Based Procedures for Misspecified High-Dimensional Binary Model and Random Regressors


## Abstract


## 1. Introduction

- We prove that, under misspecification, as the sample size grows the selected set ${\widehat{s}}^{*}$ coincides with the support of ${\beta}^{*}$ with probability tending to 1. In the general framework allowing for misspecification this means that the selection rule ${\widehat{s}}^{*}$ is consistent, i.e., $P({\widehat{s}}^{*}={s}^{*})\to 1$ as $n\to \infty $. In particular, when the model in Equation (1) is correctly specified, we recover the support of the true vector $\beta $ with probability tending to 1.
- We also prove an approximation result for the Lasso estimator when the predictors are random and $\rho $ is a convex Lipschitz function (cf. Theorem 1).
- A useful corollary of the last result derived in the paper is a set of sufficient conditions under which active predictors can be separated from spurious ones based on the absolute values of the corresponding coordinates of the Lasso estimator. This makes it possible to construct a nested family of models containing ${s}^{*}$ with large probability.
- Significant insight is gained into fitting a parametric model when the predictors are elliptically contoured (e.g., multivariate normal). Namely, it is known that in this situation ${\beta}^{*}=\eta \beta $, i.e., these two vectors are collinear [5]. Thus, when $\eta \ne 0$, the support ${s}^{*}$ of ${\beta}^{*}$ coincides with the support $s$ of $\beta $, and the selection consistency of the two-step procedure proved in the paper entails direction and support recovery of $\beta $. This may be considered a partial justification of the frequent observation that classification methods are robust to misspecification of the model for which they were derived (see, e.g., [5,18]).

## 2. Definitions and Auxiliary Results

- (MC) There exist $\vartheta ,\epsilon ,\delta >0$ and a non-negative definite matrix $H\in {R}^{{p}_{n}\times {p}_{n}}$ such that for all $b$ with $b-{\beta}^{*}\in {\mathcal{C}}_{\epsilon}\cap {B}_{1}(\delta )$ we have$$R(b)-R({\beta}^{*})\ge \frac{\vartheta}{2}{(b-{\beta}^{*})}^{T}H(b-{\beta}^{*}).$$

- (LL) There exists $L>0$ such that for all ${b}_{1},{b}_{2}\in R$ and $y\in \{0,1\}$: $|\rho ({b}_{1},y)-\rho ({b}_{2},y)|\le L|{b}_{1}-{b}_{2}|$.

- AIC (Akaike Information Criterion): ${a}_{n}=2$;
- BIC (Bayesian Information Criterion): ${a}_{n}=\log n$; and
- EBIC(d) (Extended BIC): ${a}_{n}=\log n+2d\log {p}_{n}$, where $d>0$.
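The three penalty choices above differ only in the per-parameter constant ${a}_{n}$ entering $GIC(w)=n{R}_{n}(\widehat{b}(w))+{a}_{n}|w|$. A minimal sketch of this bookkeeping (function names are ours, not from the paper's code):

```python
import math

def penalty_constant(n, p, criterion="BIC", d=1.0):
    """Per-parameter penalty a_n for AIC, BIC and EBIC(d)."""
    if criterion == "AIC":
        return 2.0
    if criterion == "BIC":
        return math.log(n)
    if criterion == "EBIC":
        # EBIC(d): BIC plus a term growing with the number of candidates p
        return math.log(n) + 2.0 * d * math.log(p)
    raise ValueError(f"unknown criterion: {criterion}")

def gic(n, min_risk, support_size, a_n):
    """GIC(w) = n * R_n(b-hat(w)) + a_n * |w|, given the minimized risk."""
    return n * min_risk + a_n * support_size
```

For fixed $n$, EBIC(d) penalizes model size more heavily as the number of candidate predictors ${p}_{n}$ grows, which is what permits consistency when ${p}_{n}\gg n$.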

- ${C}_{\epsilon}(w)$: $R(b)-R({\beta}^{*})\ge \theta ||b-{\beta}^{*}{||}_{2}^{2}$ for all $b\in {R}^{{p}_{n}}$ such that $\mathrm{supp}\,b\subseteq w$ and $b-{\beta}^{*}\in {B}_{2}(\epsilon ).$

**Lemma 1.**

**Lemma 2.**

- (1) $P(S(r)>t)\le \frac{8Lr{s}_{n}\sqrt{\log ({p}_{n}\vee 2)}}{t\sqrt{n}}$,
- (2) $P({S}_{1}(r)\ge t)\le \frac{8Lr{s}_{n}\sqrt{{k}_{n}\log ({p}_{n}\vee 2)}}{t\sqrt{n}}$,
- (3) $P({S}_{2}(r)\ge t)\le \frac{4Lr{s}_{n}\sqrt{|{s}^{*}|}}{t\sqrt{n}}$.

## 3. Properties of Lasso for a General Loss Function and Random Predictors

**Lemma 3.**

**Theorem 1.**

**Proof.**

**Corollary 1.**

**Proof.**

## 4. GIC Consistency for a General Loss Function and Random Predictors

**Theorem 2.**

**Proof.**

**Corollary 2.**

**Proof.**

**Remark 1.**

- (1) ${a}_{n}=\log n$ and ${p}_{n}<\frac{{n}^{\frac{A}{{k}_{n}(1+u)}}}{2}$ for some $u>0$.
- (2) ${a}_{n}=\log n+2\gamma \log {p}_{n}$, ${k}_{n}\le C$ and $2A\gamma -(1+u)C\ge 0$, where $C,u>0$.
- (3) ${a}_{n}=\log n+2\gamma \log {p}_{n}$, ${k}_{n}\le C$, $2A\gamma -(1+u)C<0$, ${p}_{n}<B{n}^{\delta}$, where $\delta =\frac{A}{(1+u)C-2A\gamma}$ and $B={2}^{-(1+u)C}$.

**Theorem 3.**

**Proof.**

**Corollary 3.**

**Proof.**

## 5. Selection Consistency of SS Procedure

- Choose some $\lambda >0$.
- Find ${\widehat{\beta}}_{L}=\underset{b\in {R}^{{p}_{n}}}{arg\; min}{R}_{n}(b)+{\lambda ||b||}_{1}$.
- Find ${\widehat{s}}_{L}=supp{\widehat{\beta}}_{L}=\{{j}_{1},\dots ,{j}_{k}\}$ such that $|{\widehat{\beta}}_{L,{j}_{1}}|\ge \dots \ge |{\widehat{\beta}}_{L,{j}_{k}}|>0$ and ${j}_{1},\dots ,{j}_{k}\in \{1,\dots ,{p}_{n}\}$.
- Define ${\mathcal{M}}_{SS}=\{\varnothing ,\{{j}_{1}\},\{{j}_{1},{j}_{2}\},\dots ,\{{j}_{1},{j}_{2},\dots ,{j}_{k}\}\}$.
- Find ${\widehat{s}}^{*}=\underset{w\in {\mathcal{M}}_{SS}}{arg\; min}GIC(w)$.
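The SS procedure above can be sketched end-to-end for the quadratic loss. This is an illustrative stand-in only: the Lasso step uses plain iterative soft-thresholding instead of the R packages used in the paper, and all function names (`lasso_ista`, `ss_select`) are ours:

```python
import numpy as np

def lasso_ista(X, y, lam, n_iter=500):
    """Minimize ||y - Xb||^2/(2n) + lam*||b||_1 by iterative soft-thresholding."""
    n, p = X.shape
    step = 1.0 / (np.linalg.norm(X, 2) ** 2 / n)  # 1 / Lipschitz constant of the gradient
    b = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y) / n
        z = b - step * grad
        b = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft threshold
    return b

def ss_select(X, y, lam, a_n):
    """Two-step SS procedure: Lasso screening, then GIC over the nested family."""
    n, _ = X.shape
    b = lasso_ista(X, y, lam)
    # Order the Lasso support by decreasing |beta-hat|
    order = [int(j) for j in np.argsort(-np.abs(b)) if b[j] != 0.0]
    # Nested family M_SS: empty set plus all prefixes of the ordering
    family = [order[:k] for k in range(len(order) + 1)]
    best, best_gic = [], np.inf
    for w in family:
        resid = y if not w else y - X[:, w] @ np.linalg.lstsq(X[:, w], y, rcond=None)[0]
        score = n * np.mean(resid ** 2) / 2.0 + a_n * len(w)  # GIC(w) for quadratic loss
        if score < best_gic:
            best, best_gic = w, score
    return sorted(best)
```

With a BIC-type penalty ${a}_{n}=\log n$ and well-separated signal, the GIC step trims the spurious tail of the Lasso ordering while retaining the leading (active) coordinates.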

**Corollary 4.**

- $|{s}^{*}|\le {k}_{n}$,
- $P(\forall w\in {\mathcal{M}}_{SS}:|w|\le {k}_{n})\to 1$,
- $\underset{n}{\lim \inf}\,{\kappa}_{H}(\epsilon )>0$ for some $\epsilon >0$, where $H$ is a non-negative definite matrix and ${\kappa}_{H}(\epsilon )$ is defined in Equation (12),
- $\log ({p}_{n})=o(n{\lambda}^{2})$,
- ${k}_{n}\lambda =o(\min \{{\beta}_{min}^{*},1\})$,
- ${k}_{n}\log {p}_{n}=o(n)$,
- ${k}_{n}\log {p}_{n}=o({a}_{n})$,
- ${a}_{n}{k}_{n}=o(n\min {\{{\beta}_{min}^{*},1\}}^{2})$,

**Proof.**

#### 5.1. Case of Misspecified Semi-Parametric Model

**Corollary 5.**

**Remark 2.**

**Remark 3.**

## 6. Numerical Experiments

#### 6.1. Selection Procedures

- Choose some ${\lambda}_{1}>\dots >{\lambda}_{m}>0$.
- Find ${\widehat{\beta}}_{L}^{(i)}=\underset{b\in {R}^{{p}_{n}+1}}{arg\; min}{R}_{n}(b)+{\lambda}_{i}||\tilde{b}{||}_{1}$ for $i=1,\dots ,m$.
- Find ${\widehat{s}}_{L}^{(i)}=supp{\widehat{\tilde{\beta}}}_{L}^{(i)}=\{{j}_{1}^{(i)},\dots ,{j}_{{k}_{i}}^{(i)}\}$ where ${j}_{1}^{(i)},\dots ,{j}_{{k}_{i}}^{(i)}$ are such that $|{\widehat{\beta}}_{L,{j}_{1}^{(i)}}^{(i)}|\ge \dots \ge |{\widehat{\beta}}_{L,{j}_{{k}_{i}}^{(i)}}^{(i)}|>0$ for $i=1,\dots ,m$.
- Define ${\mathcal{M}}_{i}=\{\{{j}_{1}^{(i)}\},\{{j}_{1}^{(i)},{j}_{2}^{(i)}\},\dots ,\{{j}_{1}^{(i)},{j}_{2}^{(i)},\dots ,{j}_{{k}_{i}}^{(i)}\}\}$ for $i=1,\dots ,m$.
- Define $\mathcal{M}=\{\varnothing \}\cup {\displaystyle \bigcup _{i=1}^{m}}{\mathcal{M}}_{i}$.
- Find ${\widehat{s}}^{*}=\underset{w\in \mathcal{M}}{arg\; min}GIC(w)$, where$$GIC(w)=\underset{b\in {R}^{{p}_{n}+1}:supp\tilde{b}\subseteq w}{min}n{R}_{n}(b)+{a}_{n}(|w|+1).$$
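The construction of the combined family $\mathcal{M}$ from the whole grid ${\lambda}_{1}>\dots >{\lambda}_{m}$ can be sketched as follows (a hypothetical helper, not the paper's code; it assumes one coefficient vector per ${\lambda}_{i}$ is already available, e.g., from a Lasso path):

```python
import numpy as np

def ssnet_family(coef_paths):
    """Union of nested families: for each Lasso solution along the lambda grid,
    order its nonzero coordinates by decreasing |beta-hat| and collect all
    prefixes; the empty model is always included."""
    family = {frozenset()}
    for b in coef_paths:  # one coefficient vector per lambda_i
        order = [int(j) for j in np.argsort(-np.abs(b)) if b[j] != 0.0]
        for k in range(1, len(order) + 1):
            family.add(frozenset(order[:k]))
    return family
```

Storing models as `frozenset`s deduplicates supports that reappear at several values of $\lambda$, so GIC is evaluated once per distinct model.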

- SSnet with logistic or quadratic loss: `ncvreg`;
- SSCV or LFT with logistic or quadratic loss: `glmnet`; and
- SSnet, SSCV or LFT with Huber loss (cf. [12]): `hqreg`.

- logistic loss: `glm.fit` (package `stats`);
- quadratic loss: `.lm.fit` (package `stats`); and
- Huber loss: `rlm` (package `MASS`).

- $ANGLE=\frac{1}{L}{\displaystyle \sum _{k=1}^{L}}arccos|cos\angle ({\tilde{\beta}}_{0},\widehat{\tilde{\beta}}({\widehat{s}}_{k}^{*}))|$, where$$cos\angle (\tilde{\beta},\widehat{\tilde{\beta}}({\widehat{s}}_{k}^{*}))=\frac{{\displaystyle \sum _{j=1}^{{p}_{n}}}{\beta}_{j}{\widehat{\beta}}_{j}({\widehat{s}}_{k}^{*})}{||\tilde{\beta}{||}_{2}||\widehat{\tilde{\beta}}({\widehat{s}}_{k}^{*}){||}_{2}}$$
- ${P}_{inc}=\frac{1}{L}{\displaystyle \sum _{k=1}^{L}}I({s}^{*}\in {\mathcal{M}}^{(k)})$,
- ${P}_{equal}=\frac{1}{L}{\displaystyle \sum _{k=1}^{L}}I({\widehat{s}}_{k}^{*}={s}^{*})$.
- ${P}_{supset}=\frac{1}{L}{\displaystyle \sum _{k=1}^{L}}I({\widehat{s}}_{k}^{*}\supseteq {s}^{*})$.
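The Monte Carlo summaries above are straightforward to compute; a minimal sketch (names are ours, and `family_lists` stands for the realized families ${\mathcal{M}}^{(k)}$):

```python
import numpy as np

def selection_metrics(true_support, selected_supports, family_lists=None):
    """P_equal, P_supset and (optionally) P_inc over L repetitions."""
    s = set(true_support)
    L = len(selected_supports)
    p_equal = sum(set(w) == s for w in selected_supports) / L
    p_supset = sum(set(w) >= s for w in selected_supports) / L
    p_inc = None
    if family_lists is not None:
        # fraction of repetitions whose family contains the true model s*
        p_inc = sum(any(set(w) == s for w in fam) for fam in family_lists) / L
    return p_equal, p_supset, p_inc

def angle(beta, beta_hat):
    """arccos|cos| of the angle between true and fitted coefficient vectors."""
    c = beta @ beta_hat / (np.linalg.norm(beta) * np.linalg.norm(beta_hat))
    return float(np.arccos(abs(c)))
```

Taking the absolute value of the cosine makes ANGLE invariant to the sign of $\eta$ in ${\beta}^{*}=\eta \beta$, so collinear vectors give angle 0 regardless of orientation.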

#### 6.2. Regression Models Considered

- ${a}_{n}=\log n$ (BIC); and
- ${a}_{n}=\log n+2\log {p}_{n}$ (EBIC1).

#### 6.3. Results for Models M1 and M2

## 7. Discussion

## Author Contributions

## Funding

## Acknowledgments

## Conflicts of Interest

## Appendix A

**Proof.**

**Lemma A1.**

**Proof.**

**Proof.**

**Proof.**

**Proof.**

## References

- Cover, T.; Thomas, J. Elements of Information Theory; Wiley: Hoboken, NJ, USA, 2006. [Google Scholar]
- Bühlmann, P.; van de Geer, S. Statistics for High-dimensional Data; Springer: New York, NY, USA, 2011. [Google Scholar]
- van de Geer, S. Estimation and Testing Under Sparsity; Lecture Notes in Mathematics; Springer: New York, NY, USA, 2009. [Google Scholar]
- Hastie, T.; Tibshirani, R.; Wainwright, M. Statistical Learning with Sparsity; Springer: New York, NY, USA, 2015. [Google Scholar]
- Li, K.; Duan, N. Regression analysis under link violation. Ann. Stat.
**1989**, 17, 1009–1052. [Google Scholar] [CrossRef] - Kubkowski, M.; Mielniczuk, J. Active set of predictors for misspecified logistic regression. Statistics
**2017**, 51, 1023–1045. [Google Scholar] [CrossRef] - Kubkowski, M.; Mielniczuk, J. Projections of a general binary model on logistic regression. Linear Algebra Appl.
**2018**, 536, 152–173. [Google Scholar] [CrossRef] - Kubkowski, M. Misspecification of Binary Regression Model: Properties and Inferential Procedures. Ph.D. Thesis, Warsaw University of Technology, Warsaw, Poland, 2019. [Google Scholar]
- Lu, W.; Goldberg, Y.; Fine, J. On the robustness of the adaptive lasso to model misspecification. Biometrika
**2012**, 99, 717–731. [Google Scholar] [CrossRef] [PubMed] - Brillinger, D. A Generalized linear model with ‘Gaussian’ regressor variables. In A Festschrift for Erich Lehmann; Wadsworth International Group: Belmont, CA, USA, 1982; pp. 97–113. [Google Scholar]
- Ruud, P. Sufficient conditions for the consistency of maximum likelihood estimation despite misspecification of distribution in multinomial discrete choice models. Econometrica
**1983**, 51, 225–228. [Google Scholar] [CrossRef] - Yi, C.; Huang, J. Semismooth Newton coordinate descent algorithm for elastic-net penalized Huber loss regression and quantile regression. J. Comput. Graph. Stat.
**2017**, 26, 547–557. [Google Scholar] [CrossRef][Green Version] - White, H. Maximum likelihood estimation of misspecified models. Econometrica
**1982**, 50, 1–25. [Google Scholar] [CrossRef] - Vuong, Q. Likelihood ratio tests for model selection and non-nested hypotheses. Econometrica
**1989**, 57, 307–333. [Google Scholar] [CrossRef][Green Version] - Bickel, P.; Ritov, Y.; Tsybakov, A. Simultaneous analysis of Lasso and Dantzig selector. Ann. Stat.
**2009**, 37, 1705–1732. [Google Scholar] [CrossRef] - Negahban, S.N.; Ravikumar, P.; Wainwright, M.J.; Yu, B. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Stat. Sci.
**2012**, 27, 538–557. [Google Scholar] [CrossRef][Green Version] - Pokarowski, P.; Mielniczuk, J. Combined ℓ
_{1}and greedy ℓ_{0}penalized least squares for linear model selection. J. Mach. Learn. Res.**2015**, 16, 961–992. [Google Scholar] - Hall, P.; Li, K.C. On almost Linearity of Low Dimensional Projections from High Dimensional Data. Ann. Stat.
**1993**, 21, 867–889. [Google Scholar] [CrossRef] - Chen, J.; Chen, Z. Extended Bayesian information criterion for model selection with large model spaces. Biometrika
**2008**, 95, 759–771. [Google Scholar] [CrossRef][Green Version] - Chen, J.; Chen, Z. Extended BIC for small-n-large-p sparse GLM. Stat. Sin.
**2012**, 22, 555–574. [Google Scholar] [CrossRef][Green Version] - Mielniczuk, J.; Szymanowski, H. Selection consistency of Generalized Information Criterion for sparse logistic model. In Stochastic Models, Statistics and Their Applications; Steland, A., Rafajłowicz, E., Szajowski, K., Eds.; Springer: Cham, Switzerland, 2015; Volume 122, pp. 111–118. [Google Scholar]
- Vershynin, R. Introduction to the non-asymptotic analysis of random matrices. In Compressed Sensing, Theory and Applications; Cambridge University Press: Cambridge, UK, 2012; pp. 210–268. [Google Scholar]
- Fan, J.; Xue, L.; Zou, H. Supplement to “Strong Oracle Optimality of Folded Concave Penalized Estimation”. 2014. Available online: NIHMS649192-supplement-suppl.pdf (accessed on 25 January 2020).
- Fan, J.; Xue, L.; Zou, H. Strong Oracle Optimality of folded concave penalized estimation. Ann. Stat.
**2014**, 43, 819–849. [Google Scholar] [CrossRef] - Bach, F. Self-concordant analysis for logistic regression. Electron. J. Stat.
**2010**, 4, 384–414. [Google Scholar] [CrossRef] - Akaike, H. Statistical predictor identification. Ann. Inst. Stat. Math.
**1970**, 22, 203–217. [Google Scholar] [CrossRef] - Schwarz, G. Estimating the Dimension of a Model. Ann. Stat.
**1978**, 6, 461–464. [Google Scholar] [CrossRef] - Kim, Y.; Jeon, J. Consistent model selection criteria for quadratically supported risks. Ann. Stat.
**2016**, 44, 2467–2496. [Google Scholar] [CrossRef] - van der Vaart, A.W.; Wellner, J.A. Weak Convergence and Empirical Processes with Applications to Statistics; Springer: New York, NY, USA, 1996. [Google Scholar]
- Ledoux, M.; Talagrand, M. Probability in Banach Spaces: Isoperimetry and Processes; Springer: New York, NY, USA, 1991. [Google Scholar]
- Huang, J.; Ma, S.; Zhang, C. Adaptive Lasso for sparse high-dimensional regression models. Stat. Sin.
**2008**, 18, 1603–1618. [Google Scholar] - Tibshirani, R. The lasso problem and uniqueness. Electron. J. Stat.
**2013**, 7, 1456–1490. [Google Scholar] [CrossRef] - Zhou, S. Thresholded Lasso for high dimensional variable selection and statistical estimation. arXiv
**2010**, arXiv:1002.1583. [Google Scholar] - Friedman, J.; Hastie, T.; Tibshirani, R. Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw.
**2010**, 33, 1–22. [Google Scholar] [CrossRef] [PubMed][Green Version] - Fan, Y.; Tang, C. Tuning parameter selection in high dimensional penalized likelihood. J. R. Stat. Soc. Ser. B
**2013**, 75, 531–552. [Google Scholar] [CrossRef][Green Version] - Rosset, S.; Zhu, J.; Hastie, T. Boosting as a regularized path to a maximum margin classifier. J. Mach. Learn. Res.
**2004**, 5, 941–973. [Google Scholar] - Devroye, L.; Lugosi, G. Combinatorial Methods in Density Estimation; Springer Science & Business Media: New York, NY, USA, 2012. [Google Scholar]

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Kubkowski, M.; Mielniczuk, J. Selection Consistency of Lasso-Based Procedures for Misspecified High-Dimensional Binary Model and Random Regressors. *Entropy* **2020**, *22*, 153.
https://doi.org/10.3390/e22020153
