Prediction and Variable Selection in High-Dimensional Misspecified Binary Classification

In this paper, we consider prediction and variable selection in misspecified binary classification models under the high-dimensional scenario. We focus on two approaches to classification which are computationally efficient but lead to model misspecification. The first is to apply penalized logistic regression to classification data which possibly do not follow the logistic model. The second method is even more radical: we treat class labels of objects as if they were numbers and apply penalized linear regression. We investigate these two approaches thoroughly and provide conditions which guarantee that they are successful in prediction and variable selection. Our results hold even if the number of predictors is much larger than the sample size. The paper concludes with experimental results.


Introduction
Large-scale data sets, where the number of predictors significantly exceeds the number of observations, have become common in many practical problems from, among other fields, biology and genetics. Currently, the analysis of such data sets is a fundamental challenge in statistics and machine learning. High-dimensional prediction and variable selection are arguably the most popular and intensively studied topics in this field. There are many methods trying to solve these problems, such as those based on penalized estimation [1,2]. Their main representative is the Lasso [3], which relies on l_1-norm penalization. Its properties in model selection, estimation and prediction are investigated in depth in, among others, [2,4-10]. The results obtained in the above papers can be applied only if some specific assumptions are satisfied; for instance, these conditions concern the relation between the response variable and the predictors. However, it is quite common that a complex data set does not satisfy these model assumptions, or that they are difficult to verify, so the considered model is specified incorrectly. The model misspecification problem is the core of the current paper. We investigate this topic in the context of high-dimensional binary classification (binary regression).
In the classification problem we are to predict, or guess, the class label of an object on the basis of its observed predictors. The object is described by the random vector (X, Y), where X ∈ R^p is a vector of predictors and Y ∈ {−1, 1} is the class label of the object. A classifier is defined as a measurable function f : R^p → R, which determines the label of an object in the following way: if f(x) ≥ 0, then we predict that y = 1; otherwise we predict that y = −1.
The most natural approach is to look for a classifier f which minimizes the misclassification risk (probability of incorrect classification)

R(f) = P(Y f(X) < 0). (1)

Let η(x) = P(Y = 1|X = x). It is clear that f_B(x) = sign(2η(x) − 1) minimizes the risk (1) in the family of all classifiers. It is called the Bayes classifier and we denote its risk by R_B = R(f_B). Obviously, in practice we do not know the function η, so we cannot find the Bayes classifier. However, if we possess a training sample (X_1, Y_1), ..., (X_n, Y_n) containing independent copies of (X, Y), then we can consider a sample analog of (1), namely the empirical misclassification risk

R̂(f) = n^{−1} ∑_{i=1}^n I(Y_i f(X_i) < 0), (2)

where I is the indicator function. Then a minimizer of (2) could be used as our estimator.
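The empirical risk (2) is straightforward to evaluate; a minimal numpy sketch (the toy data and the linear classifier are assumptions for illustration, not the paper's setting):

```python
import numpy as np

def empirical_risk(f, X, y):
    """Empirical misclassification risk (2): the fraction of i with y_i f(x_i) < 0."""
    scores = np.array([f(x) for x in X])
    return float(np.mean(y * scores < 0))

# Toy check with the linear classifier f(x) = x_1 on four labelled points.
X = np.array([[1.0], [2.0], [-1.0], [-3.0]])
y = np.array([1, -1, -1, 1])
print(empirical_risk(lambda x: x[0], X, y))   # 0.5: two of four points misclassified
```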
The main difficulty in this approach lies in the discontinuity of the function (2). It entails that finding its minimizer is computationally difficult and ineffective. To overcome this problem, one usually replaces the discontinuous loss function by a convex analog φ : R → [0, ∞], for instance the logistic loss, the hinge loss or the exponential loss. Then we obtain the convex empirical risk

Q̂(f) = n^{−1} ∑_{i=1}^n φ(Y_i f(X_i)). (3)

In the high-dimensional case one usually obtains an estimator by minimizing the penalized version of (3). These tricks have been used successfully in classification theory and have led to boosting algorithms [11], support vector machines [12] and Lasso estimators [3]. In this paper we are mainly interested in Lasso estimators, because they are able to solve both variable selection and prediction problems simultaneously, while the first two algorithms are developed mainly for prediction. Thus, we consider linear classifiers

f_b(x) = b_0 + ∑_{j=1}^p b_j x_j for b = (b_0, b_1, ..., b_p) ∈ R^{p+1}. (4)

For a fixed loss function φ we define the Lasso estimator as

b̂ = arg min_{b ∈ R^{p+1}} Q̂(f_b) + λ ∑_{j=1}^p |b_j|, (5)

where λ is a positive tuning parameter, which provides a balance between minimizing the empirical risk and the penalty. The form of the penalty is crucial, because its singularity at the origin implies that some coordinates of the minimizer b̂ are exactly equal to zero if λ is sufficiently large. Thus, calculating (5), we simultaneously select significant predictors in the model and estimate their coefficients, so we are also able to predict the class of new objects. The function Q̂(f_b) and the penalty are convex, so (5) is a convex minimization problem, which is an important fact from both practical and theoretical points of view. Notice that the intercept b_0 is not penalized in (5). The random vector (5) is an estimator of

b_* = arg min_{b ∈ R^{p+1}} Q(f_b), (6)

where Q(f_b) = Eφ(Y f_b(X)). In this paper we are mainly interested in minimizers (6) corresponding to the quadratic and logistic loss functions. The latter has a nice information-theoretic interpretation.
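Both penalized estimators can be sketched with scikit-learn on assumed toy data. Note that for labels y ∈ {−1, 1} the quadratic loss satisfies (1 − y f)^2 = (y − f)^2, so the quadratic-loss Lasso (5) is exactly the standard Lasso applied to the labels treated as numbers; sklearn's penalty parametrizations (alpha and the inverse penalty C) only roughly correspond to λ in (5):

```python
import numpy as np
from sklearn.linear_model import Lasso, LogisticRegression

rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.normal(size=(n, p))
# Labels from a sparse noisy linear rule -- a toy model assumed for the demo.
y = np.where(X[:, 0] + X[:, 1] + 0.5 * rng.normal(size=n) > 0, 1, -1)

# Quadratic loss: for y in {-1, 1} we have (1 - y*f)^2 = (y - f)^2, so (5)
# is the standard Lasso applied to the labels treated as numbers.
quad = Lasso(alpha=0.05).fit(X, y)   # alpha plays the role of lambda (up to scaling)

# Logistic loss: l1-penalized logistic regression; sklearn's C is an inverse
# penalty strength, roughly C ~ 1/(n * lambda) (conventions differ).
logit = LogisticRegression(penalty="l1", C=0.5, solver="liblinear").fit(X, y)

# Both fits are sparse: many coordinates are set exactly to zero.
print(int(np.sum(quad.coef_ != 0)), int(np.sum(logit.coef_ != 0)))
```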
Namely, it can be viewed as the Kullback-Leibler projection of the unknown η on logistic models [13]. The Kullback-Leibler divergence [14] plays an important role in information theory and statistics; for instance, it is involved in information criteria in model selection [15] and in detecting influential observations [16]. In general, the classifier corresponding to (6) need not coincide with the Bayes classifier. Obviously, we want to have a "good" estimator, which means that its misclassification risk should be as close to the risk of the Bayes classifier as possible. In other words, its excess risk

E_D R(b̂) − R_B (7)

should be small, where E_D is the expectation with respect to the data D = {(X_1, Y_1), ..., (X_n, Y_n)} and we write simply R(b̂) instead of R(f_b̂). Our goal is to study the excess risk (7) for the estimator (5) with different loss functions φ. We do it by looking for upper bounds on (7). In the excess risk (7) we compare two misclassification risks defined in (1). In the literature one can also find a different approach, which replaces the misclassification risks R(·) in (7) by the convex risks Q(·). In that case the excess risk depends on the loss function φ. To deal with this fact one uses the results from [17,18], which state the relation between the excess risk (7) and its analog based on the convex risk Q(·). In this paper we do not follow this route and work, right from the beginning, with the excess risk independent of φ. Only the estimator (5) depends on the loss φ.
In this paper we are also interested in variable selection. We investigate this problem in the following semiparametric model:

η(x) = g(β_0 + ∑_{j=1}^p β_j x_j), (8)

where η(x) = P(Y = 1|X = x), β ∈ R^{p+1} is the true parameter and g is an unknown function. Thus, we suppose that predictors influence the class probability through the function g of the linear combination β_0 + ∑_{j=1}^p β_j x_j. The goal of variable selection is the identification of the set of significant predictors

T = {j ∈ {1, ..., p} : β_j ≠ 0}. (9)

Obviously, in the model (8) we cannot estimate the intercept β_0 and we can identify the vector (β_1, ..., β_p) only up to a multiplicative constant, because any shift or scale change in β_0 + ∑_{j=1}^p β_j X_j can be absorbed by g. However, we show in Section 5 that in many situations the Lasso estimator (5) can properly identify the set (9). The literature on the classification problem is comprehensive; we mention just a few references: [12,19-21]. The predictive quality of classifiers is often investigated by obtaining upper bounds for their excess risks. It is an important problem and was studied thoroughly, among others, in [17,18,22-24]. The variable selection and predictive properties of estimators in the high-dimensional scenario were studied, for instance, in [2,10,13,25,26]. In the current paper we investigate the behaviour of classifiers in possibly misspecified high-dimensional classification, which appears frequently in practice. For instance, while working with binary regression one often assumes, incorrectly, that the data follow the logistic regression model. Then the problem is solved using the Lasso penalized maximum likelihood method. Another approach to binary regression, which is widely used due to its computational simplicity, is treating the labels Y_i as if they were numbers and applying the standard Lasso.
These two approaches to classification sometimes give unexpectedly good results in variable selection and prediction, but the reason for this phenomenon has not been studied deeply in the literature. Among the above-mentioned papers only [2,13,25] take up this issue. However, [25] focuses mainly on the predictive properties of Lasso classifiers with the hinge loss. Bühlmann and van de Geer [2] and Kubkowski and Mielniczuk [13] study general Lipschitz loss functions. The latter paper considers only the variable selection problem. In [2] prediction is also investigated, but classification with the quadratic loss is not studied. In this paper we are interested in both variable selection and predictive properties of classifiers with convex (but not necessarily Lipschitz) loss functions. The prominent example is classification with the quadratic loss function, which has not been investigated so far in the context of the high-dimensional misspecified model. In this case the estimator (5) can be calculated efficiently using existing algorithms, for instance [27] or [28], even if the number of predictors is much larger than the sample size. This makes the estimator very attractive when working with large data sets. In [28] an efficient algorithm for Lasso estimators with the logistic loss in the high-dimensional scenario is also provided. Therefore, misspecified classification with the logistic loss plays an important role in this paper as well. Our goal is to study such estimators thoroughly and provide conditions which guarantee that they are successful in prediction and variable selection.
The paper is organized as follows: in the next section we provide basic notations and assumptions, which are used in this paper. In Section 3 we study predictive properties of Lasso estimators with different loss functions. We will see that these properties depend strongly on the estimation quality of estimators, which is studied in Section 4. In Section 5 we consider variable selection. In Section 6 we show numerical experiments, which describe the quality of estimators in practice. The proofs and auxiliary results are relegated to Appendix A.

Assumptions and Notation
In this paper we work in the high-dimensional scenario p >> n. As usual, we assume that the number of predictors p can vary with the sample size n, which could be denoted by p(n) = p_n. However, to make the notation simpler, we omit the lower index and write p instead of p_n. The same applies to the other objects appearing in this paper.
In the further sections we will need the following notation:
-X_A is a submatrix of X with columns whose indices belong to A;
-b_A is a restriction of a vector b ∈ R^p to the indices from A;
-|A| is the number of elements in A;
-Ã = A ∪ {0}, so the set Ã contains the indices from A and the intercept;
-the l_q-norm of a vector b is defined as |b|_q = (∑_j |b_j|^q)^{1/q} for q ≥ 1, and |b|_∞ = max_j |b_j|;
-X̄ is the matrix X with a column of ones appended from the left side;
-b̂^quad, b^quad_* are the minimizers in (5), (6), respectively, with the quadratic loss function;
-b̂^log, b^log_* are the minimizers in (5), (6), respectively, with the logistic loss function;
-the Kullback-Leibler (KL) distance [14] between two binary distributions with success probabilities π_1 and π_2 is defined as

KL(π_1, π_2) = π_1 log(π_1/π_2) + (1 − π_1) log((1 − π_1)/(1 − π_2)). (10)

Obviously, we have KL(π_1, π_2) ≥ 0 and KL(π_1, π_2) = 0 if and only if π_1 = π_2. Moreover, the KL distance need not be symmetric;
-the set of nonzero coefficients of b^quad_* is denoted by

T = {j ∈ {1, ..., p} : (b^quad_*)_j ≠ 0}. (11)

Notice that the intercept is not contained in (11) even if it is nonzero. We also specify the assumptions which are used in this paper.
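The properties of the KL distance (10) listed above (nonnegativity, vanishing only when the probabilities agree, possible asymmetry) are easy to check numerically; a minimal sketch:

```python
import math

def kl_binary(p1, p2):
    """The KL distance (10) between Bernoulli(p1) and Bernoulli(p2), 0 < p1, p2 < 1."""
    return p1 * math.log(p1 / p2) + (1 - p1) * math.log((1 - p1) / (1 - p2))

print(kl_binary(0.3, 0.3))                        # 0.0: vanishes iff the probabilities agree
print(kl_binary(0.8, 0.4), kl_binary(0.4, 0.8))   # positive and unequal: KL is not symmetric
```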
In Sections 4 and 5 we need a stronger version of Assumption 1.

Assumption 2.
We suppose that the subvector of predictors X_T is subgaussian with coefficient σ_0. The remaining conditions are as in Assumption 1. We also denote σ = max(σ_0, σ_j, j ∉ T).

Predictive Properties of Classifiers
In this part of the paper we study prediction properties of classifiers with convex loss functions. To do it we look for upper bounds of the excess risk (7) of estimators.
As usual, the excess risk in (7) can be decomposed as

E_D R(b̂) − R_B = [E_D R(b̂) − R(b_*)] + [R(b_*) − R_B]. (12)

The second term in (12) is the approximation risk and compares the predictive ability of the "best" linear classifier (6) to the Bayes classifier. The first term in (12) is called the estimation risk and describes how the estimation process influences the predictive properties of classifiers.
In the next theorem we bound the estimation risk of classifiers from above. To make the result more transparent we use the notations P_D and P_X in (13), which indicate explicitly which probability we consider, i.e., P_D is the probability with respect to the data D and P_X is with respect to the new object X. In further results we omit these subscripts and trust that it does not lead to confusion.
In Theorem 1 we obtain an upper bound for the estimation risk. This risk becomes small if we establish that the probability of the event Ω^c is small and the sequence c, which is involved in Ω and in the second term on the right-hand side of (13), decreases sufficiently fast to zero. Therefore, Theorem 1 shows that to have a small estimation risk it is enough to prove that for each ε ∈ (0, 1) there exists c such that

P_D(Ω^c) ≤ ε. (14)

Moreover, the numbers ε and c should be sufficiently small. This property will be studied thoroughly in the next section. Notice that the first term on the right-hand side of (13) relates to how well (5) estimates (6). Moreover, the second expression on the right-hand side of (13) can be bounded from above if the predictors are sufficiently regular, for instance subgaussian. So far, we have been interested in the estimation risk of estimators. In the next result we establish an upper bound for the approximation risk as well. This bound, combined with (13), enables us to bound the excess risk of estimators from above. We prove this fact for the quadratic loss φ(t) = (1 − t)^2 and the logistic loss φ(t) = log(1 + e^{−t}), which play prominent roles in this paper.

Theorem 2.
Suppose that Assumption 1 is fulfilled. Moreover, the random variable (b_*)^⊤X̄ has a density h, which is continuous on the interval U = [−2σc log p, 2σc log p], and h̄ = sup_{u∈U} h(u).
(a) We have the bound (15) with the approximation term (16), where h̄_quad refers to the density h of (b^quad_*)^⊤X̄.
(b) We have the bound (17) with the approximation terms (18) and (19), where KL(·, ·) is the Kullback-Leibler distance defined in (10) and h̄_log refers to the density h of (b^log_*)^⊤X̄.
In Theorem 2 we establish upper bounds on the excess risks of the Lasso estimators (5). They describe the predictive properties of these classifiers. In this paper we consider linear classifiers, so the misclassification risk of an estimator is close to the Bayes risk if the "truth" can be approximated linearly in a satisfactory way. For the classifier with the logistic loss this fact is described by (18) and (19), which measure the distance between the true success probability and the one in logistic regression. In particular, when the true model is logistic, then (18) and (19) vanish. The expression (16) relates to the approximation error in the case of the quadratic loss. It measures how well the conditional expectation E[Y|X] can be described by the "best" (with respect to the loss φ) linear function (b^quad_*)^⊤X̄. The right-hand sides of (15) and (17) relate to the estimation risk. They have already been discussed after Theorem 1. Using the subgaussianity of predictors we have made them more explicit. The main ingredient of the bounds in Theorem 2, namely P(Ω^c), is studied in the next section.
The results in Theorem 2 refer to Lasso estimators with the quadratic and logistic loss functions. Similar results are given in ([2], Theorem 6.4). They refer to the case where the convex excess risk is considered, i.e., the misclassification risks R(·) are replaced by the convex risks Q(·) in (7). Moreover, these results do not consider Lasso estimators with the quadratic loss applied to classification, which is an approach playing a key role in the current paper. Furthermore, in ([2], Theorem 6.4) the estimation error b̂ − b_* is measured in the l_1-norm, which is enough for prediction. However, for variable selection the l_∞-norm gives better results. Such results will be established in Sections 4 and 5. Finally, the results of [2] need more restrictive assumptions than ours. For instance, the predictors should be bounded and the function f_{b_*} should be sufficiently close to f_B in the supremum norm.
Analogous bounds to those in Theorem 2 can be obtained for other loss functions if we combine Theorem 1 with the results of [17]. Finally, we should stress that the estimator b̂ need not rely on the Lasso method. All we require is that the bound (14) can be established for this estimator.

On the Event Ω
In this section we show that the probability of the event Ω can be close to one. Such results for classification models with Lipschitz loss functions were established in [2,13]. Therefore, we focus on the quadratic loss function, which is obviously non-Lipschitz. This loss function is important from the practical point of view, but was not considered in these papers. Moreover, in our results the estimation error in Ω can be measured in the l_q-norms, q ≥ 1, not only in the l_1-norm as in [2,13]. Bounds in the l_∞-norm lead to better results in variable selection, which are given in Section 5.
We start by introducing the cone invertibility factor (CIF), which plays a significant role in investigating properties of estimators based on the Lasso penalty [9]. In the case n > p one usually uses the minimal eigenvalue of the matrix X̄^⊤X̄/n to express the strength of correlations between predictors. Obviously, in the high-dimensional scenario this value is equal to zero and the minimal eigenvalue needs to be replaced by some other measure of predictor interdependency, which would describe the potential for consistent estimation of the model parameters.
For ξ > 1 we define the cone

C(ξ, T) = {θ ∈ R^{p+1} : |θ_{T̃^c}|_1 ≤ ξ |θ_{T̃}|_1},

where we recall that T̃ = T ∪ {0}. In the case p >> n three different characteristics measuring the potential for consistent estimation of the model parameters have been introduced:
-the restricted eigenvalue [8]:

RE(ξ) = inf_{0≠θ∈C(ξ,T)} |X̄θ|_2 / (n^{1/2} |θ_{T̃}|_2);

-the compatibility factor [7]:

κ(ξ) = inf_{0≠θ∈C(ξ,T)} |T̃|^{1/2} |X̄θ|_2 / (n^{1/2} |θ_{T̃}|_1);

-the cone invertibility factor (CIF, [9]): for q ≥ 1

F̄_q(ξ) = inf_{0≠θ∈C(ξ,T)} |T̃|^{1/q} |X̄^⊤X̄θ|_∞ / (n |θ|_q).

In this article we will use the CIF, because this factor allows for a sharp formulation of convergence results for all l_q-norms with q ≥ 1; see ([9], Section 3.2). The population (non-random) version of the CIF is given by

F_q(ξ) = inf_{0≠θ∈C(ξ,T)} |T̃|^{1/q} |Hθ|_∞ / |θ|_q,

where H = E X̄X̄^⊤. The key property of the random and the population versions of the CIF, F̄_q(ξ) and F_q(ξ), is that, in contrast to the smallest eigenvalues of the matrices X̄^⊤X̄/n and H, they can be close to each other in the high-dimensional setting; see ([30], Lemma 4.1) or ([31], Corollary 10.1). This fact is used in the proof of Theorem 3 (given below). Next, we state the main results of this section.
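Since the CIF is an infimum over a cone, random search over cone directions yields an upper bound on it. The sketch below assumes the CIF definition as reconstructed in the text and an intercept-augmented design matrix passed in as X1 (names are hypothetical):

```python
import numpy as np

def cif_upper_bound(X1, T, xi=2.0, q=np.inf, n_dirs=20000, seed=0):
    """Monte Carlo upper bound on the cone invertibility factor

        F_q(xi) = inf_{theta in C(xi,T)} |T~|^(1/q) |X1'X1 theta|_inf / (n |theta|_q),

    for the definition as reconstructed in the text (an assumption).  Random
    directions are pushed into the cone by shrinking their off-support mass;
    the minimum over sampled directions upper-bounds the infimum.
    """
    n, d = X1.shape
    G = X1.T @ X1 / n
    mask = np.zeros(d, dtype=bool)
    mask[np.asarray(T)] = True                # indices in T~ (support plus intercept)
    size = len(T) ** (1.0 / q) if np.isfinite(q) else 1.0
    rng = np.random.default_rng(seed)
    best = np.inf
    for _ in range(n_dirs):
        theta = rng.normal(size=d)
        on, off = np.abs(theta[mask]).sum(), np.abs(theta[~mask]).sum()
        if off > xi * on:                     # enforce |theta_{T~^c}|_1 <= xi |theta_{T~}|_1
            theta[~mask] *= xi * on / off
        num = np.max(np.abs(G @ theta))
        den = np.max(np.abs(theta)) if not np.isfinite(q) else np.linalg.norm(theta, q)
        best = min(best, size * num / den)
    return best

# Orthonormal design (X1'X1/n = I): every sampled ratio equals one for q = inf,
# so the bound is numerically 1, matching the exact value of the infimum.
print(cif_upper_bound(np.sqrt(5.0) * np.eye(5), T=[0, 1], n_dirs=200))
```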
Theorem 3. Let a ∈ (0, 1), q ≥ 1 and ξ > 1 be arbitrary. Suppose that Assumption 2 is satisfied,

n ≥ K_1 |T̃|^2 log(p^2/a) (20)

and

λ ≥ K_2 σ^2 √(log(p/a)/n), (21)

where K_1, K_2 are universal constants. Then there exists a universal constant K_3 > 0 such that with probability at least 1 − K_3 a we have

|b̂^quad − b^quad_*|_q ≤ 2ξ|T̃|^{1/q} λ / ((ξ+1) F_q(ξ)). (22)

In Theorem 3 we provide an upper bound for the estimation error of the Lasso estimator with the quadratic loss function. This result gives conditions for estimation consistency of b̂^quad in the high-dimensional scenario, i.e., the number of predictors can be significantly greater than the sample size. Indeed, consistency in the l_∞-norm holds, e.g., when p = exp(n^{a_1}), |T| = n^{a_2}, a = exp(−n^{a_1}), where a_1 + 2a_2 < 1. Moreover, λ is taken as the right-hand side of the inequality (21), and finally F_∞(ξ) is bounded from below (or slowly converging to 0) and σ is bounded from above (or slowly diverging to ∞).
The choice of the parameter λ is difficult in practice, which is a common drawback of Lasso estimators. However, Theorem 3 gives us a hint how to choose λ. The "safe" choice of λ is the right-hand side of the inequality (21), so, roughly speaking, λ should be proportional to √(log(p)/n). In the experimental part of the paper the parameter λ is chosen using cross-validation. As we will observe, it gives satisfactory results for the Lasso estimators in both prediction and variable selection.
Theorem 3 is a crucial fact, which gives the upper bound for (15) in Theorem 2. Namely, taking q = 1, a = 1/p and λ equal to the right-hand side of the inequality (21), we obtain the following consequence of Theorem 3.

Corollary 1.
Suppose that Assumption 2 is satisfied. Moreover, assume that there exist ξ_0 > 1 and constants C_1 > 0 and C_2 < ∞ such that F_1(ξ_0) ≥ C_1 and σ ≤ C_2. If n ≥ K_1 |T|^2 log p, then

P(|b̂^quad − b^quad_*|_1 ≥ K_2 |T̃| √(log p / n)) ≤ K_3/p,

where the constants K_1 and K_2 depend only on ξ_0, C_1, C_2, and K_3 is the universal constant provided in Theorem 3.
The above result works for Lasso estimators with the quadratic loss. In the case of the logistic loss an analogous result is obtained in ([13], Theorem 1). In fact, their results relate to the case of quite general Lipschitz loss functions, which can be useful in extending Theorem 2 to such cases.

Variable Selection Properties of Estimators
In Section 3 we were interested in predictive properties of estimators. In this part of the paper we focus on variable selection, which is another important problem in high-dimensional statistics. As we have already noticed, upper bounds for the probability of the event Ω are crucial in proving results concerning prediction. They also play a key role in establishing results relating to variable selection. In this section we again focus on the Lasso estimator with the quadratic loss function. Analogous results for Lipschitz loss functions were considered in ([13], Corollary 1).
In the variable selection problem we want to find significant predictors, which, roughly speaking, give us some information on the observed phenomenon. We consider this problem in the semiparametric model defined in (8). In this case the set of significant predictors is given by (9). As we have already mentioned, the vectors β and b^quad_* need not be the same. However, in [32] it is proved that for a real number γ the following relation holds under Assumption 3, which is now stated:

(b^quad_*)_j = γ β_j for j = 1, ..., p. (24)

Assumption 3. Let β̄ = (β_1, ..., β_p). We assume that for each θ ∈ R^p the conditional expectation E(θ^⊤X | β̄^⊤X) exists and

E(θ^⊤X | β̄^⊤X) = d_θ β̄^⊤X

for a real number d_θ ∈ R.
The coefficient γ in (24) can be calculated explicitly. Standard arguments [33] show that γ is nonzero if g is monotonic. In this case the set T defined in (9) equals the set T defined in (11). Assumption 3 is a well-known condition in the literature; see, e.g., [13,32,34-36]. It is always satisfied in the simple regression model (i.e., when X_1 ∈ R), which is often used for initial screening of explanatory variables; see, e.g., [37]. It is also satisfied when X comes from an elliptical distribution, such as the multivariate normal distribution or the multivariate t-distribution. In the interesting paper [38] it is argued that Assumption 3 is a nonrestrictive condition when the number of predictors is large, which is the case we focus on in this paper. Now we state the results of this part of the paper. We will use the notation b^quad_min = min_{j∈T} |(b^quad_*)_j|.
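The proportionality (24) can be illustrated by Monte Carlo for Gaussian predictors (an elliptical distribution, so Assumption 3 holds). The toy β, the logistic-type g and the closed form of the quadratic-loss slope E[XY] for standard Gaussian X are assumptions of this sketch, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200_000, 4
beta = np.array([1.0, 2.0, -1.0, 0.0])       # toy beta; the last predictor is irrelevant
X = rng.normal(size=(n, p))                  # standard Gaussian predictors
eta = 1.0 / (1.0 + np.exp(-(X @ beta)))      # P(Y = 1 | X) = g(beta'X) with logistic g
Y = np.where(rng.random(n) < eta, 1.0, -1.0)

# For standard Gaussian X the slope part of the quadratic-loss minimizer is
# Cov(X)^{-1} E[XY] = E[XY], so by (24) the ratios b_j / beta_j should all be
# close to one constant gamma, and the coefficient of the irrelevant
# predictor should be close to zero.
b = X.T @ Y / n
ratios = b[:3] / beta[:3]
print(np.round(ratios, 3), round(float(b[3]), 3))
```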

Corollary 2. Suppose that the conditions of Theorem 3 are satisfied for q = ∞ and that b^quad_min > 4ξλ/((ξ+1)F_∞(ξ)). Then, with probability at least 1 − K_3 a, we have

min_{j∈T} |(b̂^quad)_j| > max_{j∉T} |(b̂^quad)_j|,

where K_3 is the universal constant from Theorem 3.
In Corollary 2 we show that the Lasso estimator with the quadratic loss is able to separate predictors if the nonzero coefficients of b^quad_* are large enough in absolute value. In the case where T equals (9) (i.e., T is the set of significant predictors) we can prove that the thresholded Lasso estimator is able to find the true model with high probability. This fact is stated in the next result. The thresholded Lasso estimator is denoted by b̂^quad_th and defined as

(b̂^quad_th)_j = (b̂^quad)_j I(|(b̂^quad)_j| > δ) for j = 1, ..., p,

where δ > 0 is a threshold. We set (b̂^quad_th)_0 = (b̂^quad)_0.

Corollary 3. Let g in (8) be monotonic. We suppose that Assumption 3 and the conditions of Theorem 3 are satisfied for q = ∞. If b^quad_min ≥ 4ξλ/((ξ+1)F_∞(ξ)) and δ ∈ [2ξλ/((ξ+1)F_∞(ξ)), b^quad_min/2], then

P({j : (b̂^quad_th)_j ≠ 0} = T) ≥ 1 − K_3 a,

where K_3 is the universal constant from Theorem 3.
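The thresholding rule for the Lasso coefficients can be sketched in a few lines (the coefficient values are toy assumptions):

```python
import numpy as np

def threshold_lasso(b_hat, delta):
    """Thresholded Lasso: keep coordinate j (j >= 1) iff |b_hat[j]| > delta;
    the intercept b_hat[0] is never thresholded, as in the text."""
    b_th = np.where(np.abs(b_hat) > delta, b_hat, 0.0)
    b_th[0] = b_hat[0]
    return b_th

b_hat = np.array([0.70, 0.50, -0.31, 0.04, 0.00, -0.02])  # index 0 = intercept (toy values)
b_th = threshold_lasso(b_hat, delta=0.10)
print(np.nonzero(b_th[1:])[0] + 1)   # selected predictors: [1 2]
```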
Corollary 3 states that the Lasso estimator after thresholding is able to find the true model with high probability if the threshold is chosen appropriately. However, Corollary 3 does not give a constructive way of choosing the threshold, because both endpoints of the interval [2ξλ/((ξ+1)F_∞(ξ)), b^quad_min/2] are unknown. This is not surprising and has already been observed, for instance, in linear models ([9], Theorem 8). In the literature one can find methods which help to choose a threshold in practice, for instance the approach relying on information criteria developed in [39,40].
Finally, we discuss the condition of Corollary 3 that b^quad_min cannot be too small, i.e., b^quad_min ≥ 4ξλ/((ξ+1)F_∞(ξ)). We know that (b^quad_*)_j = γβ_j for j = 1, ..., p, so the considered condition requires that

min_{j∈T} |β_j| ≥ 4ξλ / (|γ|(ξ+1)F_∞(ξ)). (26)

Compared to the similar condition for Lasso estimators in well-specified models, we observe that the denominator in (26) contains an additional factor |γ|. This number is usually smaller than one, which means that in misspecified models the Lasso estimator needs a larger sample size to work well. This phenomenon is typical for misspecified models and similar restrictions hold for competitors [13].

Numerical Experiments
In this section we present a simulation study, where we compare the accuracy of the considered estimators in prediction and variable selection. We consider the model (8) with the parameter vector β given in (27), where the signs are chosen at random. The first coordinate in (27) corresponds to the intercept and the next ten coefficients relate to significant predictors in the model. We study two cases:
-Scenario 1: g(x) = exp(x)/(1 + exp(x));
-Scenario 2: g(x) = arctan(x)/π + 0.5.
In each scenario we generate the data (X_1, Y_1), ..., (X_n, Y_n) for n ∈ {100, 350, 600}. The corresponding numbers of predictors are p ∈ {100, 1225, 3600}, so the number of predictors significantly exceeds the sample size in the experiments. For every model we consider two Lasso estimators (5) with unpenalized intercepts: the first one with the logistic loss and the second one with the quadratic loss. They are denoted by "logistic" and "quadratic", respectively. To calculate them we use the "glmnet" package [28] in the "R" software [41]. The tuning parameters λ are chosen on the basis of 10-fold cross-validation.
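A Python analogue of this glmnet workflow, with λ chosen by 10-fold cross-validation as in the text, can be sketched with scikit-learn (the synthetic data are an assumption, not Scenario 1 or 2, and sklearn's alpha/C parametrizations only roughly match glmnet's lambda):

```python
import numpy as np
from sklearn.linear_model import LassoCV, LogisticRegressionCV

rng = np.random.default_rng(2)
n, p = 150, 60
X = rng.normal(size=(n, p))                  # toy data assumed for the sketch
y = np.where(X[:, :3].sum(axis=1) + 0.5 * rng.normal(size=n) > 0, 1, -1)

# "quadratic": labels treated as numbers, penalty level chosen by 10-fold CV.
quad = LassoCV(cv=10).fit(X, y)
# "logistic": l1-penalized logistic regression, also with 10-fold CV.
logit = LogisticRegressionCV(cv=10, penalty="l1", solver="liblinear").fit(X, y)

print(round(float(quad.alpha_), 4))          # the cross-validated penalty level
```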
Observe that applying the Lasso estimator with the logistic loss function to Scenario 1 leads to a well-specified model, while using the quadratic loss implies misspecification. In Scenario 2 both estimators work in misspecified models.
Simulations for each scenario are repeated 300 times.
To describe the quality of estimators in variable selection we calculate two values:
-TD: the number of correctly selected relevant predictors;
-sep: the number of relevant predictors whose Lasso coefficients are greater in absolute value than the largest absolute value of a Lasso coefficient corresponding to an irrelevant predictor.
So, we want to confirm that the considered estimators are able to separate predictors, which we established in Section 5. Using TD we also study the "screening" properties of estimators, which are easier than separability.
The classification accuracy of estimators is measured in the following way: we generate a test sample containing 1000 objects. On this set we calculate pred: the fraction of correctly predicted classes of objects for each estimator.
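The criteria TD, sep and pred can be computed as follows (a sketch with assumed helper names and toy coefficient values):

```python
import numpy as np

def selection_metrics(b_hat, relevant):
    """TD and sep from the text for a fitted coefficient vector (intercept
    excluded); `relevant` is the set of indices of relevant predictors."""
    abs_b = np.abs(np.asarray(b_hat, dtype=float))
    rel = np.zeros(abs_b.size, dtype=bool)
    rel[list(relevant)] = True
    td = int(np.sum(abs_b[rel] > 0))                  # relevant predictors selected
    max_irrel = abs_b[~rel].max() if (~rel).any() else 0.0
    sep = int(np.sum(abs_b[rel] > max_irrel))         # relevant coefs above every irrelevant one
    return td, sep

def pred_accuracy(b0, b_hat, X_test, y_test):
    """pred: fraction of correctly predicted classes, with f(x) >= 0 mapped to +1."""
    pred = np.where(b0 + X_test @ b_hat >= 0, 1, -1)
    return float(np.mean(pred == y_test))

print(selection_metrics([0.8, 0.3, 0.0, 0.1], {0, 1}))   # (2, 2) on these toy values
```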
The results of the experiments are collected in Tables 1 and 2. By the "oracle" we mean the classifier which works only with the significant predictors and uses the function g from the true model (8) in the estimation process. Finally, we also compare the execution times of both algorithms. In Table 3 we show the averaged relative time difference (t_log − t_quad)/t_quad, where t_quad and t_log are the times of calculating the Lasso with the quadratic and logistic loss functions, respectively. Looking at the results of the experiments, we observe that both estimators perform in a satisfactory way. Their predictive accuracy is relatively close to the oracle, especially when the sample size is larger. In variable selection we see that both estimators are able to find significant predictors and to separate predictors in both scenarios. Again, we can notice that the properties of the estimators improve when n increases.
In Scenario 2 the quality of both estimators in prediction and variable selection is comparable. In Scenario 1, which is well-specified for the Lasso with the logistic loss, we observe its dominance over the Lasso with the quadratic loss. However, this dominance is not large. Therefore, using the Lasso with the quadratic loss we obtain slightly worse accuracy, but the algorithm is computationally faster. Computational efficiency is especially important when we study large data sets. As we can see in Table 3, the execution times of the estimators are almost the same for n = 350, but for n = 600 the relative time difference becomes greater than 10%.
Author Contributions: Both authors contributed equally to this work. All authors have read and agreed to the published version of the manuscript.
Funding: The research of K.F. was partially supported by Warsaw University of Life Sciences (SGGW).

Acknowledgments:
We would like to thank J. Mielniczuk and the reviewers for their valuable comments, which have improved the paper.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Proofs and Auxiliary Results
This section contains proofs of results from the paper. Additional lemmas are also provided.
Appendix A.1. Results from Section 3

Proof of Theorem 1. For arbitrary b ∈ R^{p+1} the averaged misclassification risk of f_b can be expressed as in (A1). Moreover, we have (A2). Applying (A1) and (A2) for b̂ and b_*, we obtain the required decomposition, where P is the probability with respect to both the data D and the new object X. Observe that on the event Ω we have |b̂ − b_*|_1 ≤ c, hence |f_b̂(X) − f_{b_*}(X)| ≤ c|X̄|_∞. Analogously, we bound the remaining term, which finishes the proof.
Lemma A1. Suppose that Assumption 1 is fulfilled. Moreover, the random variable (b_*)^⊤X̄ has a density h, which is continuous on the interval U = [−2σc log p, 2σc log p], and h̄ = sup_{u∈U} h(u). Then

P_X(|(b_*)^⊤X̄| ≤ c|X̄|_∞) ≤ 4σh̄c log p + 2p exp(−2 log^2 p).

Proof. For simplicity, we omit the subscript X in the probability P_X in this proof. We take a > 1 and obtain the inequalities (A4). The second expression in (A4) equals P(|X|_∞ > a), because a > 1. It can be handled using the subgaussianity of X as follows: take z > 0 and notice that by the Markov inequality and the fact that exp(|u|) ≤ exp(u) + exp(−u) for each u ∈ R, we obtain a subgaussian tail bound. Taking z = a/σ^2, we obtain P(|X|_∞ > a) ≤ 2p exp(−a^2/(2σ^2)).
Then we choose a = 2σ log p, which is not smaller than one because σ ≥ 1 from Assumption 1. Finally, the first term in (A4) can be bounded from above by 2cah̄ = 4σh̄c log p by the mean value theorem.
Proof of Theorem 2. The right-hand sides of (15) and (17) are upper bounds on the estimation risk. They are obtained using Theorem 1 and Lemma A1. The expressions (16) and (18) are upper bounds for the approximation risk in the case of estimators with the quadratic and logistic loss functions, respectively. In particular, we use the bound (A5), where the Kullback-Leibler distance KL(·, ·) is defined in (10). Next, we define the function h(a) = a log a + (1 − a) log(1 − a) for a ∈ (0, 1). Clearly, h'(a) = log(a/(1 − a)) and h''(a) = (a(1 − a))^{−1}. Therefore, from the mean value theorem we obtain (A6) for some c between a and b. To finish the proof we apply (A6) to the right-hand side of (A5).

Appendix A.2. Results from Section 4

To simplify the notation we write b̂, b_* for b̂^quad, b^quad_*, respectively, in this section. Moreover, we also denote b̄_* = ((b_*)_1, ..., (b_*)_p).
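The derivatives of h used here can be verified numerically by central differences; a minimal check (the evaluation point a = 0.3 is an arbitrary assumption):

```python
import math

# Central-difference check of the derivatives used in the proof:
# h(a) = a log a + (1 - a) log(1 - a), h'(a) = log(a / (1 - a)),
# h''(a) = 1 / (a (1 - a)).
h = lambda a: a * math.log(a) + (1 - a) * math.log(1 - a)
a, eps = 0.3, 1e-5
h1 = (h(a + eps) - h(a - eps)) / (2 * eps)             # numerical h'(a)
h2 = (h(a + eps) - 2 * h(a) + h(a - eps)) / eps ** 2    # numerical h''(a)
print(round(h1 - math.log(a / (1 - a)), 6))   # close to 0
print(round(h2 - 1 / (a * (1 - a)), 3))       # close to 0
```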
We start with establishing results, which help us to prove Theorem 3.
Proof. The proof is elementary and based on the inequality (A7). The right-hand side of (A7) can be expressed as b̄_*^⊤ H b̄_* in (A8), and we can bound the right-hand side of (A8) from above, which finishes the proof.
Lemma A3. Suppose that Z_1, ..., Z_n are i.i.d. random variables and there exists L > 0 such that C^2 = E exp(|Z_1|/L) is finite. Then a deviation bound holds for arbitrary u > 0.

Lemma A4. For arbitrary j = 1, ..., p and u > 0 we have the bound (A10).

Proof. Fix j ∈ {1, ..., p} and u > 0. Recall that H b̄_* = E[YX̄] and E[X] = 0. Thus, we work with an average of i.i.d. centred random variables, so we can use Lemma A3. We only have to find L, C > 0 such that (A10) holds, where X_j is the j-th coordinate of X. For all positive numbers a, b we have the inequality ab ≤ a^2/2 + b^2/2. Applying this fact and the Schwarz inequality we obtain (A11). The variable X_j is subgaussian, so using ([43], Lemma 7.4) we can bound the first expectation in (A11) by (1 − 2σ^2/L)^{−1/2}, provided that L > 2σ^2. The second expectation in (A11) can be bounded using the subgaussianity of the vector X_T, ([43], Lemma 7.4) and Lemma A2, provided that 4σ^2 < L. Taking L = 4.1σ^2 we can bound exp(4/L) ≤ 2.7, because H_jj = 1 implies that σ ≥ 1. Thus, we obtain C ≤ 3, where C is the upper bound in (A10). This finishes the proof.
Proof. Fix a ∈ (0, 1), q ≥ 1 and ξ > 1. We start by considering the l_∞-norm of the matrix X̄^⊤X̄/n − E X̄X̄^⊤ in (A12). We focus only on the right-hand side of (A12), because (A13) can be handled similarly. Thus, fix j, k ∈ {1, ..., p}. Using the subgaussianity of predictors, Lemma A3 and argumentation similar to the proof of Lemma A4, we obtain an entrywise deviation bound, where K_2 is a universal constant. The values of the constants K_i that appear in this proof can change from line to line. Therefore, using union bounds we obtain

P(|X̄^⊤X̄/n − E X̄X̄^⊤|_∞ > K_2 σ^2 √(log(p^2/a)/n)) ≤ K_3 a.
To finish the proof we use (20) with K 1 being sufficiently large.
Then we apply Lemma A5 to obtain (22).
Thus, we focus on showing that (A14) holds with high probability. Denote A = {|∇Q̂(b_*)|_∞ ≤ (ξ−1)λ/(ξ+1)}. We start by bounding the probability of A from below. Recall that b_* is the minimizer of Q(b) = E(1 − Y b^⊤X̄)^2, which can be calculated easily, namely (A15). For every j = 1, ..., p the j-th partial derivative of Q̂(b) at b_* is given in (A16); the derivative with respect to b_0 has an analogous form. Taking λ which satisfies (21) and using union bounds, we obtain (A17). Consider a summand on the right-hand side of (A17) which corresponds to j ∈ {1, ..., p}. From (A15) we can handle it using Lemma A4; we just take u = log(p/a) and a sufficiently large K_2. The probability of the first term on the right-hand side of (A17), which corresponds to j = 0, can be bounded from above analogously to the proof of Lemma A4. The argument is even easier, so we omit it.
In the further argumentation we consider only the event A. Besides, we denote θ = b̂ − b_*, where b̂ is a minimizer of the convex function (5); this is equivalent to the corresponding Karush-Kuhn-Tucker conditions, where j = 1, ..., p.