# Prediction and Variable Selection in High-Dimensional Misspecified Binary Classification


## Abstract


## 1. Introduction

## 2. Assumptions and Notation

- $X_i = (X_{i1}, X_{i2}, \dots, X_{ip})^{\top}$;
- $\mathbb{X} = (X_1, X_2, \dots, X_n)^{\top}$ is the $(n \times p)$-matrix of predictors;
- Let $A \subset \{1, \dots, p\}$. Then $A^c = \{1, \dots, p\} \setminus A$ is the complement of $A$;
- $\mathbb{X}_A$ is the submatrix of $\mathbb{X}$ whose columns have indices in $A$;
- $b_A$ is the restriction of a vector $b \in \mathbb{R}^p$ to the indices in $A$;
- $|A|$ is the number of elements of $A$;
- $\tilde{A} = A \cup \{0\}$, so the set $\tilde{A}$ contains the indices from $A$ and the intercept;
- The $l_q$-norm of a vector is defined as $|b|_q = \left(\sum_{j=1}^{p} |b_j|^q\right)^{1/q}$ for $q \in [1, \infty]$;
- For $x \in \mathbb{R}^p$ we denote $\tilde{x} = (1, x)^{\top}$;
- $\tilde{\mathbb{X}}$ is the matrix $\mathbb{X}$ with a column of ones prepended on the left;
- The Kullback–Leibler (KL) distance [14] between two binary distributions with success probabilities $\pi_1$ and $\pi_2$ is defined as$$KL(\pi_1, \pi_2) = \pi_1 \log\left(\frac{\pi_1}{\pi_2}\right) + (1 - \pi_1) \log\left(\frac{1 - \pi_1}{1 - \pi_2}\right);$$
- The set of nonzero coefficients of $b_*^{\,quad}$ is denoted by$$T = \{1 \le j \le p : (b_*^{\,quad})_j \ne 0\}.$$
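As a quick sanity check on the KL distance defined above, it can be computed directly for two Bernoulli distributions (a minimal Python sketch; the function name is ours, and we use the convention $0 \log 0 = 0$):

```python
import math

def kl_bernoulli(p1: float, p2: float) -> float:
    """KL distance between Bernoulli(p1) and Bernoulli(p2), p2 in (0, 1)."""
    # Convention: terms with pi_1 = 0 or 1 - pi_1 = 0 contribute zero.
    result = 0.0
    if p1 > 0:
        result += p1 * math.log(p1 / p2)
    if p1 < 1:
        result += (1 - p1) * math.log((1 - p1) / (1 - p2))
    return result

# KL(p, p) = 0, and the distance grows as the two probabilities separate.
print(kl_bernoulli(0.5, 0.5))  # 0.0
print(kl_bernoulli(0.9, 0.1))
```

Note that $KL$ is nonnegative and vanishes only when $\pi_1 = \pi_2$, which is what makes it a natural discrepancy measure between the true and postulated success probabilities.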

**Assumption 1.**

**Assumption 2.**

## 3. Predictive Properties of Classifiers

**Theorem 1.**

**Theorem 2.**

## 4. On the Event $\Omega $

- The restricted eigenvalue [8]:$$RE(\xi) = \inf_{0 \ne b \in \mathcal{C}(\xi)} \frac{b^{\top} \tilde{\mathbb{X}}^{\top} \tilde{\mathbb{X}} b / n}{|b|_2^2}\,,$$
- The compatibility factor [7]:$$K(\xi) = \inf_{0 \ne b \in \mathcal{C}(\xi)} \frac{|T| \, b^{\top} \tilde{\mathbb{X}}^{\top} \tilde{\mathbb{X}} b / n}{|b_T|_1^2}\,,$$
- The cone invertibility factor (CIF, [9]): for $q \ge 1$$$\overline{F}_q(\xi) = \inf_{0 \ne b \in \mathcal{C}(\xi)} \frac{|T|^{1/q} \, |\tilde{\mathbb{X}}^{\top} \tilde{\mathbb{X}} b / n|_{\infty}}{|b|_q}\,.$$
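The exact infimum over the cone $\mathcal{C}(\xi) = \{b \ne 0 : |b_{T^c}|_1 \le \xi |b_T|_1\}$ is a nonconvex problem, but sampling vectors from the cone yields a Monte Carlo *upper* bound on $RE(\xi)$, which can be useful for checking a design numerically. The sketch below is illustrative (the function name, design matrix, and sampling scheme are our choices, and we write the design generically as `X` rather than $\tilde{\mathbb{X}}$):

```python
import numpy as np

rng = np.random.default_rng(0)

def re_upper_bound(X, T, xi, n_draws=2000):
    """Monte Carlo upper bound on RE(xi): minimize the Rayleigh-type ratio
    over random draws from the cone C(xi) = {b != 0 : |b_{T^c}|_1 <= xi |b_T|_1}.
    The true RE(xi) is an infimum, hence <= the value returned here."""
    n, p = X.shape
    gram = X.T @ X / n                      # empirical Gram matrix
    Tc = np.setdiff1d(np.arange(p), T)
    best = np.inf
    for _ in range(n_draws):
        b = np.zeros(p)
        b[T] = rng.standard_normal(len(T))
        # Fill T^c with a direction rescaled to satisfy the cone constraint.
        u = rng.standard_normal(len(Tc))
        b[Tc] = u / np.abs(u).sum() * xi * np.abs(b[T]).sum() * rng.uniform()
        best = min(best, b @ gram @ b / (b @ b))
    return best

X = rng.standard_normal((100, 20))
print(re_upper_bound(X, T=np.array([0, 1, 2]), xi=3.0))
```

The same sampling loop adapts to $K(\xi)$ or $\overline{F}_q(\xi)$ by swapping the ratio being minimized.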

**Theorem 3.**

**Corollary 1.**

## 5. Variable Selection Properties of Estimators

**Assumption 3.**

**Corollary 2.**

**Corollary 3.**

## 6. Numerical Experiments

- Scenario 1: $g(x) = \exp(x) / (1 + \exp(x))$;
- Scenario 2: $g(x) = \arctan(x) / \pi + 0.5$.
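Data from the two scenarios can be simulated as draws from a single-index binary model $P(Y = 1 \mid X) = g(X^{\top}\beta)$. The sketch below is illustrative: the Gaussian design, the dimensions, and the coefficient vector `beta` are our assumptions, not the paper's exact simulation settings.

```python
import numpy as np

rng = np.random.default_rng(1)

def generate(n, p, beta, scenario):
    """Draw (X, y) with P(Y = 1 | X) = g(X @ beta) for the given scenario."""
    X = rng.standard_normal((n, p))
    eta = X @ beta
    if scenario == 1:                       # logistic response function
        prob = np.exp(eta) / (1 + np.exp(eta))
    else:                                   # arctan response function
        prob = np.arctan(eta) / np.pi + 0.5
    y = (rng.uniform(size=n) < prob).astype(int)
    return X, y

p = 50
beta = np.zeros(p)
beta[:10] = 1.0                             # 10 relevant predictors (illustrative)
X, y = generate(n=350, p=p, beta=beta, scenario=2)
print(X.shape, y.mean())
```

Under Scenario 2 a logistic-loss or quadratic-loss Lasso is misspecified, which is exactly the situation the theory above addresses.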

- **TD**: the number of correctly selected relevant predictors;
- **sep**: the number of relevant predictors whose Lasso coefficients are larger in absolute value than the largest absolute Lasso coefficient corresponding to an irrelevant predictor;
- **pred**: the fraction of correctly predicted classes of objects for each estimator.
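The three measures above can be computed from a fitted coefficient vector as in the following sketch (the function name, the plug-in classification threshold at zero, and the toy inputs are our assumptions):

```python
import numpy as np

def td_sep_pred(beta_hat, relevant, X, y, intercept=0.0):
    """Evaluate TD, sep and pred for a fitted coefficient vector beta_hat;
    `relevant` is the index set of truly relevant predictors."""
    relevant = np.asarray(relevant)
    irrelevant = np.setdiff1d(np.arange(len(beta_hat)), relevant)
    # TD: relevant predictors with nonzero estimated coefficients.
    td = int(np.count_nonzero(beta_hat[relevant]))
    # sep: relevant coefficients exceeding (in absolute value) every
    # coefficient of an irrelevant predictor.
    cutoff = np.abs(beta_hat[irrelevant]).max() if len(irrelevant) else 0.0
    sep = int((np.abs(beta_hat[relevant]) > cutoff).sum())
    # pred: accuracy of the linear classifier sign(intercept + X @ beta_hat).
    y_hat = (intercept + X @ beta_hat > 0).astype(int)
    pred = float((y_hat == y).mean())
    return td, sep, pred

beta_hat = np.array([1.2, 0.8, 0.0, 0.05, 0.0])
X = np.array([[1.0, 0, 0, 0, 0], [-1.0, 0, 0, 0, 0]])
y = np.array([1, 0])
print(td_sep_pred(beta_hat, relevant=[0, 1], X=X, y=y))  # (2, 2, 1.0)
```

Note that sep is the stricter of the two selection measures: a run with sep equal to the number of relevant predictors means thresholding the Lasso coefficients could recover the true support exactly.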

## Author Contributions

## Funding

## Acknowledgments

## Conflicts of Interest

## Appendix A. Proofs and Auxiliary Results

#### Appendix A.1. Results from Section 3

**Proof of Theorem 1.**

**Lemma A1.**

**Proof.**

**Proof of Theorem 2.**

#### Appendix A.2. Results from Section 4

**Lemma A2.**

**Proof.**

**Lemma A3.**

**Lemma A4.**

**Proof.**

**Lemma A5.**

**Proof.**

**Proof of Theorem 3.**

#### Appendix A.3. Results from Section 5

**Proof of Corollary 2.**

**Proof of Corollary 3.**

## References

1. Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning: Data Mining, Inference and Prediction; Springer: New York, NY, USA, 2001.
2. Bühlmann, P.; van de Geer, S. Statistics for High-Dimensional Data: Methods, Theory and Applications; Springer: New York, NY, USA, 2011.
3. Tibshirani, R. Regression Shrinkage and Selection via the Lasso. J. R. Stat. Soc. Ser. B **1996**, 58, 267–288.
4. Meinshausen, N.; Bühlmann, P. High-dimensional graphs and variable selection with the Lasso. Ann. Stat. **2006**, 34, 1436–1462.
5. Zhao, P.; Yu, B. On Model Selection Consistency of Lasso. J. Mach. Learn. Res. **2006**, 7, 2541–2563.
6. Zou, H. The adaptive Lasso and its oracle properties. J. Am. Stat. Assoc. **2006**, 101, 1418–1429.
7. van de Geer, S. High-dimensional generalized linear models and the Lasso. Ann. Stat. **2008**, 36, 614–645.
8. Bickel, P.J.; Ritov, Y.; Tsybakov, A.B. Simultaneous analysis of Lasso and Dantzig selector. Ann. Stat. **2009**, 37, 1705–1732.
9. Ye, F.; Zhang, C.H. Rate minimaxity of the Lasso and Dantzig selector for the $l_q$ loss in $l_r$ balls. J. Mach. Learn. Res. **2010**, 11, 3519–3540.
10. Huang, J.; Zhang, C.H. Estimation and Selection via Absolute Penalized Convex Minimization and Its Multistage Adaptive Applications. J. Mach. Learn. Res. **2012**, 13, 1839–1864.
11. Freund, Y.; Schapire, R.E. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. **1997**, 55, 119–139.
12. Vapnik, V.N. Statistical Learning Theory; Wiley: New York, NY, USA, 1998.
13. Kubkowski, M.; Mielniczuk, J. Selection Consistency of Lasso-Based Procedures for Misspecified High-Dimensional Binary Model and Random Regressors. Entropy **2020**, 22, 153.
14. Kullback, S.; Leibler, R.A. On Information and Sufficiency. Ann. Math. Stat. **1951**, 22, 79–86.
15. Schwarz, G. Estimating the dimension of a model. Ann. Stat. **1978**, 6, 461–464.
16. Quintero, F.; Contreras-Reyes, J.E.; Wiff, R.; Arellano-Valle, R.B. Flexible Bayesian analysis of the von Bertalanffy growth function with the use of a log-skew-t distribution. Fish. Bull. **2017**, 115, 12–26.
17. Zhang, T. Statistical behavior and consistency of classification methods based on convex risk minimization. Ann. Stat. **2004**, 32, 56–85.
18. Bartlett, P.L.; Jordan, M.I.; McAuliffe, J.D. Convexity, classification and risk bounds. J. Am. Stat. Assoc. **2006**, 101, 138–156.
19. Devroye, L.; Györfi, L.; Lugosi, G. A Probabilistic Theory of Pattern Recognition; Springer: New York, NY, USA, 1996.
20. Boucheron, S.; Bousquet, O.; Lugosi, G. Introduction to statistical learning theory. Adv. Lect. Mach. Learn. **2004**, 36, 169–207.
21. Boucheron, S.; Bousquet, O.; Lugosi, G. Theory of classification: A survey of some recent advances. ESAIM Probab. Stat. **2005**, 9, 323–375.
22. Bartlett, P.L.; Bousquet, O.; Mendelson, S. Local Rademacher complexities. Ann. Stat. **2005**, 33, 1497–1537.
23. Audibert, J.Y.; Tsybakov, A.B. Fast learning rates for plug-in classifiers. Ann. Stat. **2007**, 35, 608–633.
24. Blanchard, G.; Bousquet, O.; Massart, P. Statistical performance of support vector machines. Ann. Stat. **2008**, 36, 489–531.
25. Tarigan, B.; van de Geer, S. Classifiers of support vector machine type with $l_1$ complexity regularization. Bernoulli **2006**, 12, 1045–1076.
26. Abramovich, F.; Grinshtein, V. High-Dimensional Classification by Sparse Logistic Regression. IEEE Trans. Inf. Theory **2019**, 65, 3068–3079.
27. Efron, B.; Hastie, T.; Johnstone, I.; Tibshirani, R. Least angle regression. Ann. Stat. **2004**, 32, 407–499.
28. Friedman, J.; Hastie, T.; Tibshirani, R. Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. **2010**, 33, 1–22.
29. Buldygin, V.; Kozachenko, Y. Metric Characterization of Random Variables and Random Processes; American Mathematical Society: Providence, RI, USA, 2000.
30. Huang, J.; Sun, T.; Ying, Z.; Yu, Y.; Zhang, C.H. Oracle inequalities for the lasso in the Cox model. Ann. Stat. **2013**, 41, 1142–1165.
31. van de Geer, S.; Bühlmann, P. On the conditions used to prove oracle results for the Lasso. Electron. J. Stat. **2009**, 3, 1360–1392.
32. Li, K.C.; Duan, N. Regression analysis under link violation. Ann. Stat. **1989**, 17, 1009–1052.
33. Thorisson, H. Coupling methods in probability theory. Scand. J. Stat. **1995**, 22, 159–182.
34. Brillinger, D.R. A Generalized Linear Model with Gaussian Regressor Variables. In A Festschrift for Erich Lehmann; Bickel, P.J., Doksum, K., Hodges, J.L., Eds.; Wadsworth: Belmont, CA, USA, 1983; pp. 97–114.
35. Ruud, P.A. Sufficient Conditions for the Consistency of Maximum Likelihood Estimation Despite Misspecification of Distribution in Multinomial Discrete Choice Models. Econometrica **1983**, 51, 225–228.
36. Zhong, W.; Zhu, L.; Li, R.; Cui, H. Regularized quantile regression and robust feature screening for single index models. Stat. Sin. **2016**, 26, 69–95.
37. Fan, J.; Lv, J. Sure independence screening for ultrahigh dimensional feature space. J. R. Stat. Soc. Ser. B **2008**, 70, 849–911.
38. Hall, P.; Li, K.C. On almost Linearity of Low Dimensional Projections from High Dimensional Data. Ann. Stat. **1993**, 21, 867–889.
39. Pokarowski, P.; Mielniczuk, J. Combined $l_1$ and Greedy $l_0$ Penalized Least Squares for Linear Model Selection. J. Mach. Learn. Res. **2015**, 16, 961–992.
40. Pokarowski, P.; Rejchel, W.; Soltys, A.; Frej, M.; Mielniczuk, J. Improving Lasso for model selection and prediction. arXiv **2019**, arXiv:1907.03025.
41. R Development Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2017.
42. van de Geer, S. Estimation and Testing under Sparsity; Springer: Berlin, Germany, 2016.
43. Baraniuk, R.; Davenport, M.A.; Duarte, M.F.; Hegde, C. An Introduction to Compressive Sensing; Connexions, Rice University: Houston, TX, USA, 2011.

| | Quadratic | Logistic | Oracle |
|---|---|---|---|
| **$n=100$** | | | |
| TD | 6.3 | 6.1 | |
| sep | 2.2 | 2.3 | |
| pred | 0.734 | 0.736 | 0.810 |
| **$n=350$** | | | |
| TD | 9.3 | 9.5 | |
| sep | 6.0 | 6.3 | |
| pred | 0.774 | 0.779 | 0.831 |
| **$n=600$** | | | |
| TD | 9.8 | 9.9 | |
| sep | 8.6 | 8.9 | |
| pred | 0.791 | 0.795 | 0.832 |

| | Quadratic | Logistic | Oracle |
|---|---|---|---|
| **$n=100$** | | | |
| TD | 4.8 | 4.6 | |
| sep | 1.4 | 1.4 | |
| pred | 0.697 | 0.698 | 0.768 |
| **$n=350$** | | | |
| TD | 8.1 | 8.2 | |
| sep | 3.9 | 3.9 | |
| pred | 0.730 | 0.731 | 0.805 |
| **$n=600$** | | | |
| TD | 9.4 | 9.4 | |
| sep | 6.8 | 6.9 | |
| pred | 0.750 | 0.752 | 0.809 |

**Table 3.** Relative time difference (28) of algorithms.

| | Scenario 1 | Scenario 2 |
|---|---|---|
| $n=350$ | 0.02 | 0.06 |
| $n=600$ | 0.11 | 0.13 |

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Furmańczyk, K.; Rejchel, W.
Prediction and Variable Selection in High-Dimensional Misspecified Binary Classification. *Entropy* **2020**, *22*, 543.
https://doi.org/10.3390/e22050543
