Abstract
In this paper, we consider prediction and variable selection in misspecified binary classification models under the high-dimensional scenario. We focus on two approaches to classification, which are computationally efficient but lead to model misspecification. The first one is to apply penalized logistic regression to the classification data, which possibly do not follow the logistic model. The second method is even more radical: we simply treat class labels of objects as if they were numbers and apply penalized linear regression. In this paper, we investigate these two approaches thoroughly and provide conditions that guarantee that they are successful in prediction and variable selection. Our results hold even if the number of predictors is much larger than the sample size. The paper concludes with experimental results.
1. Introduction
Large-scale data sets, where the number of predictors significantly exceeds the number of observations, have become common in many practical problems from, among others, biology or genetics. Currently, the analysis of such data sets is a fundamental challenge in statistics and machine learning. High-dimensional prediction and variable selection are arguably the most popular and intensively studied topics in this field. Many methods try to solve these problems, such as those based on penalized estimation [,]. Their main representative is the Lasso [], which relates to ℓ1-norm penalization. Its properties in model selection, estimation and prediction have been deeply investigated, among others, in [,,,,,,,]. The results obtained in the above papers can be applied only if some specific assumptions are satisfied. For instance, these conditions concern the relation between the response variable and the predictors. However, it is quite common that a complex data set does not satisfy these model assumptions, or that they are difficult to verify, which means that the considered model is specified incorrectly. The model misspecification problem is the core of the current paper. We investigate this topic in the context of high-dimensional binary classification (binary regression).
In the classification problem we want to predict (or guess) the class label of an object on the basis of its observed predictors. The object is described by the random vector where is a vector of predictors and is the class label of the object. A classifier is defined as a measurable function which determines the label of an object in the following way:
Otherwise, we guess that
The most natural approach is to look for a classifier which minimizes the misclassification risk (probability of incorrect classification)
Let It is clear that minimizes the risk (1) in the family of all classifiers. It is called the Bayes classifier and we denote its risk as Obviously, in practice we do not know the function , so we cannot find the Bayes classifier. However, if we possess a training sample containing independent copies of then we can consider a sample analog of (1), namely the empirical misclassification risk
where is the indicator function. Then a minimizer of (2) could be used as our estimator.
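To make the empirical misclassification risk concrete, the following minimal R sketch (the data and the classifier below are placeholders, not taken from the paper) computes (2) for a linear classifier with intercept b0 and coefficient vector b on a sample with 0/1 labels.

```r
# Empirical misclassification risk (2) of a linear classifier:
# a minimal sketch; X, y, b0, b are hypothetical placeholders.
empirical_risk <- function(X, y, b0, b) {
  # predicted label: 1 if the linear score is positive, 0 otherwise
  y_hat <- as.integer(X %*% b + b0 > 0)
  mean(y_hat != y)          # average of the 0-1 loss over the sample
}

# usage on simulated placeholder data
set.seed(1)
X <- matrix(rnorm(100 * 5), 100, 5)
y <- rbinom(100, 1, 0.5)
empirical_risk(X, y, b0 = 0, b = rep(1, 5))
```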
The main difficulty in this approach lies in the discontinuity of the function (2), which makes finding its minimizer computationally difficult and ineffective. To overcome this problem, one usually replaces the discontinuous loss function by a convex analog , for instance the logistic loss, the hinge loss or the exponential loss. Then we obtain the convex empirical risk
In the high-dimensional case one usually obtains an estimator by minimizing a penalized version of (3). These tricks have been successfully used in classification theory and have led to boosting algorithms [], support vector machines [] and Lasso estimators []. In this paper we are mainly interested in Lasso estimators, because they are able to solve both variable selection and prediction problems simultaneously, while the first two algorithms were developed mainly for prediction.
Thus, we consider linear classifiers
where For a fixed loss function we define the Lasso estimator as
where is a positive tuning parameter, which provides a balance between minimizing the empirical risk and the penalty. The form of the penalty is crucial, because its singularity at the origin implies that some coordinates of the minimizer are exactly equal to zero if is sufficiently large. Thus, by calculating (5) we simultaneously select significant predictors in the model and estimate their coefficients, so we are also able to predict the class of new objects. The function and the penalty are convex, so (5) is a convex minimization problem, which is an important fact from both practical and theoretical points of view. Notice that the intercept is not penalized in (5).
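As an illustration, the following minimal R sketch computes the two Lasso estimators (5) that play the main role in this paper, using the glmnet package employed later in Section 6: the logistic-loss estimator (penalized logistic regression) and the quadratic-loss estimator, which treats the 0/1 labels as numbers (penalized linear regression). The data and the value of the tuning parameter are placeholders; glmnet leaves the intercept unpenalized, in line with (5).

```r
library(glmnet)

set.seed(1)
n <- 100; p <- 500
X <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, 0.5)                      # 0/1 class labels (placeholder data)
lambda <- 0.05                              # tuning parameter, chosen arbitrarily here

# Lasso with the logistic loss (penalized logistic regression)
fit_logistic  <- glmnet(X, y, family = "binomial", lambda = lambda)

# Lasso with the quadratic loss: labels treated as numbers (penalized linear regression)
fit_quadratic <- glmnet(X, y, family = "gaussian", lambda = lambda)

# estimated coefficient vectors (intercept in the first entry, unpenalized)
coef(fit_logistic)
coef(fit_quadratic)
```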
The random vector (5) is an estimator of
where In this paper we are mainly interested in the minimizers (6) corresponding to quadratic and logistic loss functions. The latter has a nice information-theoretic interpretation. Namely, it can be viewed as the Kullback–Leibler projection of the unknown on logistic models []. The Kullback–Leibler divergence [] plays an important role in information theory and statistics; for instance, it is involved in information criteria for model selection [] or in detecting influential observations [].
In general, the classifier corresponding to (6) need not coincide with the Bayes classifier. Obviously, we want to have a “good” estimator, which means that its misclassification risk should be as close to the risk of the Bayes classifier as possible. In other words, its excess risk
should be small, where is the expectation with respect to the data and we write simply instead of Our goal is to study the excess risk (7) for the estimator (5) with different loss functions We do this by looking for upper bounds on (7).
In the excess risk (7) we compare two misclassification risks defined in (1). In the literature one can also find a different approach, which replaces the misclassification risks in (7) by the convex risks In that case the excess risk depends on the loss function To deal with this fact one uses the results from [,], which state the relation between the excess risk (7) and its analog based on the convex risk In this paper we do not follow this route and work, right from the beginning, with the excess risk, which does not depend on the loss function. Only the estimator (5) depends on the loss
In this paper we are also interested in variable selection. We investigate this problem in the following semiparametric model
where , is the true parameter and g is an unknown function. Thus, we suppose that the predictors influence the class probability through the function g of the linear combination . The goal of variable selection is the identification of the set of significant predictors
Obviously, in the model (8) we cannot estimate an intercept and we can identify the vector only up to a multiplicative constant, because any shift or scale change in can be absorbed by g. However, we show in Section 5 that in many situations the Lasso estimator (5) can properly identify the set (9).
The literature on the classification problem is extensive. We just mention a few references: [,,,]. The predictive quality of classifiers is often investigated by obtaining upper bounds for their excess risks. It is an important problem and was studied thoroughly, among others, in [,,,,]. The variable selection and predictive properties of estimators in the high-dimensional scenario were studied, for instance, in [,,,,]. In the current paper we investigate the behaviour of classifiers in possibly misspecified high-dimensional classification, which appears frequently in practice. For instance, while working with binary regression one often assumes incorrectly that the data follow the logistic regression model. Then the problem is solved using the Lasso penalized maximum likelihood method. Another approach to binary regression, which is widely used due to its computational simplicity, is just treating labels as if they were numbers and applying the standard Lasso. For instance, such a method is used in ([], Subsections 4.2 and 4.3) or ([], Subsection 2.4.1). These two approaches to classification sometimes give unexpectedly good results in variable selection and prediction, but the reason for this phenomenon has not been deeply studied in the literature. Among the above-mentioned papers only [,,] take up this issue. However, [] focuses mainly on the predictive properties of Lasso classifiers with the hinge loss. Bühlmann and van de Geer [] and Kubkowski and Mielniczuk [] study general Lipschitz loss functions. The latter paper considers only the variable selection problem. Prediction is also investigated in [], but classification with the quadratic loss is not studied there.
In this paper we are interested in both variable selection and predictive properties of classifiers with convex (but not necessarily Lipschitz) loss functions. The prominent example is classification with the quadratic loss function, which has not been investigated so far in the context of the high-dimensional misspecified model. In this case the estimator (5) can be calculated efficiently using existing algorithms, for instance [] or [], even if the number of predictors is much larger than the sample size. This makes the estimator very attractive when working with large data sets. An efficient algorithm for Lasso estimators with the logistic loss in the high-dimensional scenario is also provided in []. Therefore, misspecified classification with the logistic loss plays an important role in this paper as well. Our goal is to study such estimators thoroughly and provide conditions that guarantee that they are successful in prediction and variable selection.
The paper is organized as follows: in the next section we provide basic notations and assumptions, which are used in this paper. In Section 3 we study predictive properties of Lasso estimators with different loss functions. We will see that these properties depend strongly on the estimation quality of estimators, which is studied in Section 4. In Section 5 we consider variable selection. In Section 6 we show numerical experiments, which describe the quality of estimators in practice. The proofs and auxiliary results are relegated to Appendix A.
2. Assumptions and Notation
In this paper we work in the high-dimensional scenario As usual, we assume that the number of predictors p can vary with the sample size which could be denoted as However, to make notation simpler we omit the lower index and write p instead of The same applies to the other objects appearing in this paper.
In the further sections we will need the following notation:
- is the -matrix of predictors;
- Let Then is a complement of A;
- is a submatrix of with columns whose indices belong to A;
- is a restriction of a vector to the indices from
- is the number of elements in
- so the set contains indices from A and the intercept;
- The -norm of a vector is defined as for
- For we denote
- is the matrix with a column of ones appended on the left side;
- The Kullback–Leibler (KL) distance [] between two binary distributions with success probabilities and is defined as (a short sketch of this formula is given below). Obviously, the distance is nonnegative and equals zero if and only if the two success probabilities coincide. Moreover, the distance need not be symmetric;
- the set of nonzero coefficients of is denoted as
Notice that the intercept is not contained in (11) even if it is nonzero.
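As referenced in the list above, the Kullback–Leibler distance between two binary (Bernoulli) distributions has a standard closed form; a minimal R sketch is given below, with p1 and p2 denoting the two success probabilities.

```r
# Kullback-Leibler distance between Bernoulli(p1) and Bernoulli(p2):
# the standard formula; it is nonnegative, zero iff p1 == p2, and not symmetric.
kl_binary <- function(p1, p2) {
  p1 * log(p1 / p2) + (1 - p1) * log((1 - p1) / (1 - p2))
}

kl_binary(0.3, 0.5)   # example value
kl_binary(0.5, 0.3)   # differs from the above: the distance is not symmetric
```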
We also specify the assumptions that are used in this paper.
Assumption 1.
We assume that are i.i.d. random vectors. Moreover, predictors are univariate subgaussian, i.e., for each and we have for positive numbers We also denote Finally, we suppose that the matrix is positive definite and for
Assumption 2.
We suppose that the subvector of predictors is subgaussian with the coefficient , i.e., for each we have where The remaining conditions are as in Assumption 1. We also denote
Subgaussianity is a standard assumption when working with random predictors in high-dimensional models, cf. []. In particular, Assumption 1 implies that and [].
3. Predictive Properties of Classifiers
In this part of the paper we study the prediction properties of classifiers with convex loss functions. To do this, we look for upper bounds on the excess risk (7) of the estimators.
As usual the excess risk in (7) can be decomposed as
The second term in (12) is the approximation risk and compares the predictive ability of the “best” linear classifier (6) with that of the Bayes classifier. The first term in (12) is called the estimation risk and describes how the estimation process influences the predictive properties of classifiers.
In the next theorem we bound from above the estimation risk of classifiers. To make the result more transparent we use the notations and in (13), which indicate explicitly which probability we consider, i.e., is the probability with respect to the data D and is with respect to the new object X. In further results we omit these subscripts, trusting that this does not lead to confusion.
Theorem 1.
For we consider an event We have
In Theorem 1 we obtain the upper bound for the estimation risk. This risk becomes small, if we establish that probability of the event is small and the sequence which is involved in and in the second term on the right-hand side of (13), decreases sufficiently fast to zero. Therefore, Theorem 1 shows that to have a small estimation risk it is enough to prove that for each there exists c such that
Moreover, the numbers and c should be sufficiently small. This property will be studied thoroughly in the next section. Notice that the first term on the right-hand side of (13) relates to how well (5) estimates (6). Moreover, the second expression on the right-hand side of (13) can be bounded from above if the predictors are sufficiently regular, for instance subgaussian.
So far, we have been interested in the estimation risk of estimators. In the next result we establish the upper bound for the approximation risk as well. This bound combined with (13) enables us to bound from above the excess risk of estimators. We prove this fact for the quadratic loss and the logistic loss , which play prominent roles in this paper.
Theorem 2.
Suppose that Assumption 1 is fulfilled. Moreover, a random variable has a density which is continuous on the interval and
(a) We have
where refers to the density h of
(b) Let Then we obtain
where is the Kullback–Leibler distance defined in (10) and refers to the density h of Additionally, assuming that there exists such that and we have
In Theorem 2 we establish upper bounds on the excess risks for Lasso estimators (5). They describe predictive properties of these classifiers. In this paper we consider linear classifiers, so the misclassification risk of an estimator is close to the Bayes risk, if the “truth” can be approximated linearly in a satisfactory way. For the classifier with the logistic loss this fact is described by (18) and (19), which measure the distance between true success probability and the one in logistic regression. In particular, when the true model is logistic, then (18) and (19) vanish. The expression (16) relates to the approximation error in the case of the quadratic loss. It measures how well the conditional expectation can be described by the “best” (with respect to the loss ) linear function
The right-hand sides of (15) and (17) relate to the estimation risk. They have already been discussed after Theorem 1. Using subgaussianity of predictors we have made them more explicit. The main ingredient of the bounds in Theorem 2, namely is studied in the next section.
Results in Theorem 2 refer to Lasso estimators with quadratic and logistic loss functions. Similar results are given in ([], Theorem 6.4). They refer to the case in which the convex excess risk is considered, i.e., the misclassification risks are replaced by the convex risks in (7). Moreover, these results do not consider Lasso estimators with the quadratic loss applied to classification, which is an approach playing a key role in the current paper. Furthermore, in ([], Theorem 6.4) the estimation error is measured in the -norm, which is enough for prediction. However, for variable selection the -norm gives better results. Such results will be established in Section 4 and Section 5. Finally, the results of [] need more restrictive assumptions than ours. For instance, the predictors should be bounded and a function should be sufficiently close to in the supremum norm.
Analogous bounds to those in Theorem 2 can be obtained for other loss functions, if we combine Theorem 1 with the results of []. Finally, we should stress that the estimator need not rely on the Lasso method. All we require is that the bound (14) can be established for this estimator.
4. On the Event
In this section we show that probability of the event can be close to one. Such results for classification models with Lipschitz loss functions were established in [,]. Therefore, we focus on the quadratic loss function, which is obviously non-Lipschitz. This loss function is important from the practical point of view, but was not considered in these papers. Moreover, in our results the estimation error in can be measured in the -norms, not only in the -norm as in [,]. Bounds in the -norm lead to better results in variable selection, which are given in Section 5.
We start with introducing the cone invertibility factor (CIF), which plays a significant role in investigating properties of estimators based on the Lasso penalty []. In the case one usually uses the minimal eigenvalue of the matrix to express the strength of correlations between predictors. Obviously, in the high-dimensional scenario this value is equal to zero and the minimal eigenvalue needs to be replaced by some other measure of predictor interdependency, which would describe the potential for consistent estimation of model parameters.
For we define a cone
where we recall that In the case when three different characteristics measuring the potential for consistent estimation of the model parameters have been introduced:
- The restricted eigenvalue []:
- The compatibility factor []:
- The cone invertibility factor (CIF, []): for
In this article we will use the CIF, because this factor allows for a sharp formulation of convergence results for all norms with see ([], Section 3.2). The population (non-random) version of the CIF is given by
where The key property of the random and the population versions of CIF, and , is that, in contrast to the smallest eigenvalues of matrices and they can be close to each other in the high-dimensional setting, see ([], Lemma 4.1) or ([], Corollary 10.1). This fact is used in the proof of Theorem 3 (given below).
Next, we state the main results of this section.
Theorem 3.
Let and be arbitrary. Suppose that Assumption 2 is satisfied and
and
where are universal constants. Then there exists a universal constant such that with probability at least we have
In Theorem 3 we provide the upper bound for the estimation error of the Lasso estimator with the quadratic loss function. This result gives the conditions for estimation consistency of in the high-dimensional scenario, i.e., the number of predictors can be significantly greater than the sample size. Indeed, consistency in the -norm holds e.g., when where Moreover, is taken as the right-hand side of the inequality (21) and finally is bounded from below (or slowly converging to 0) and is bounded from above (or slowly diverging to ∞).
The choice of the parameter is difficult in practice, which is a common drawback of Lasso estimators. However, Theorem 3 gives us a hint on how to choose . The “safe” choice of is the right-hand side of the inequality (21), so, roughly speaking, should be proportional to In the experimental part of the paper the parameter is chosen using the cross-validation method. As we will observe, it gives satisfactory results for the Lasso estimators in both prediction and variable selection.
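As an illustration of the two choices discussed above, the following R sketch compares the cross-validated tuning parameter used in the experiments with a value of the order sqrt(log(p)/n), which is the usual magnitude suggested by high-dimensional Lasso theory; the exact right-hand side of (21) is not reproduced here, and the proportionality constant and the data are placeholders.

```r
library(glmnet)

set.seed(1)
n <- 100; p <- 500
X <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, 0.5)                       # placeholder 0/1 labels

# data-driven choice: 10-fold cross-validation, as in the experiments of Section 6
cv_fit <- cv.glmnet(X, y, family = "gaussian", nfolds = 10)
cv_fit$lambda.min                            # lambda minimizing the CV error

# tuning parameter of the usual theoretical order (constant chosen arbitrarily here)
lambda_theory <- 0.5 * sqrt(log(p) / n)
fit <- glmnet(X, y, family = "gaussian", lambda = lambda_theory)
sum(coef(fit)[-1] != 0)                      # number of selected predictors
```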
Theorem 3 is a crucial fact, which gives the upper bound for (15) in Theorem 2. Namely, taking and equal to the right-hand side of the inequality (21), we obtain the following consequence of Theorem 3.
Corollary 1.
Suppose that Assumption 2 is satisfied. Moreover, assume that there exist and constants and such that and . If then
where the constants and depend only on and is a universal constant provided in Theorem 3.
The above result works for Lasso estimators with the quadratic loss. In the case of the logistic loss an analogous result is obtained in ([], Theorem 1). In fact, their results relate to the case of quite general Lipschitz loss functions, which can be useful in extending Theorem 2 to such cases.
5. Variable Selection Properties of Estimators
In Section 3 we were interested in predictive properties of estimators. In this part of the paper we focus on variable selection, which is another important problem in high-dimensional statistics. As we have already noticed, upper bounds for the probability of the event are crucial in proving results concerning prediction. They also play a key role in establishing results relating to variable selection. In this section we again focus on Lasso estimators with the quadratic loss function. Analogous results for Lipschitz loss functions were considered in ([], Corollary 1).
In the variable selection problem we want to find significant predictors, which, roughly speaking, give us some information on the observed phenomenon. We consider this problem in the semiparametric model defined in (8). In this case the set of significant predictors is given by (9). As we have already mentioned, the vectors and need not be the same. However, it was proved in [] that for a real number the following relation
holds under Assumption 3, which is now stated.
Assumption 3.
Let We assume that for each the conditional expectation exists and
for a real number
The coefficient in (24) can be easily calculated. Namely, we have
Standard arguments [] show that is nonzero if g is monotonic. In this case the set defined in (9) equals T defined in (11).
Assumption 3 is a well-known condition in the literature, see e.g., [,,,,]. It is always satisfied in the simple regression model (i.e., when ), which is often used for initial screening of explanatory variables, see, e.g., []. It is also satisfied when X comes from an elliptical distribution, like the multivariate normal distribution or the multivariate t-distribution. In the interesting paper [] it is argued that Assumption 3 is a nonrestrictive condition when the number of predictors is large, which is the case we focus on in this paper.
Now, we state the results of this part of the paper. We will use the notation
Corollary 2.
Suppose that conditions of Theorem 3 are satisfied for If then
where is the universal constant from Theorem 3.
In Corollary 2 we show that the Lasso estimator with the quadratic loss is able to separate predictors if the nonzero coefficients of are large enough in absolute value. In the case that T equals (9) (i.e., T is the set of significant predictors) we can prove that the thresholded Lasso estimator is able to find the true model with high probability. This fact is stated in the next result. The thresholded Lasso estimator is denoted by and defined as
where is a threshold. We set and denote
Corollary 3.
Let g in (8) be monotonic. We suppose that Assumption 3 and conditions of Theorem 3 are satisfied for If
then
where is the universal constant from Theorem 3.
Corollary 3 states that the Lasso estimator after thresholding is able to find the true model with high probability, if the threshold is appropriately chosen. However, Corollary 3 does not give a constructive way of choosing the threshold, because both endpoints of the interval are unknown. This is not surprising and has already been observed, for instance, in linear models ([], Theorem 8). In the literature we can find methods that help to choose a threshold in practice, for instance the approach relying on information criteria developed in [,].
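A minimal R sketch of the thresholding step described above: starting from a Lasso fit, only the predictors whose estimated coefficients exceed a chosen threshold in absolute value are retained. The threshold value below is an arbitrary placeholder; Corollary 3 only guarantees that a suitable (data-dependent, unknown) threshold exists.

```r
library(glmnet)

set.seed(1)
n <- 100; p <- 500
X <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, 0.5)                          # placeholder 0/1 labels

fit  <- cv.glmnet(X, y, family = "gaussian", nfolds = 10)
beta <- as.numeric(coef(fit, s = "lambda.min"))[-1]   # drop the (unpenalized) intercept

# thresholded Lasso: keep only coefficients larger than delta in absolute value
delta <- 0.05                                   # threshold, chosen arbitrarily here
selected <- which(abs(beta) > delta)            # estimated set of significant predictors
selected
```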
Finally, we discuss the condition of Corollary 3 that cannot be too small, i.e., . We know that for so the considered condition requires that
Compared to the similar condition for Lasso estimators in well-specified models, we observe that the denominator in (26) contains an additional factor . This number is usually smaller than one, which means that in misspecified models the Lasso estimator needs a larger sample size to work well. This phenomenon is typical for misspecified models and similar restrictions hold for competitors [].
6. Numerical Experiments
In this section we present a simulation study, where we compare the accuracy of the considered estimators in prediction and variable selection.
We consider the model (8) with predictors generated from the p-dimensional normal distribution , where and for . The true parameter is
where signs are chosen at random. The first coordinate in (27) corresponds to the intercept and the next ten coefficients relate to significant predictors in the model. We study two cases:
- Scenario 1:
- Scenario 2:
In each scenario we generate the data for The corresponding numbers of predictors are so the number of predictors significantly exceeds the sample size in the experiments. For every model we consider two Lasso estimators with unpenalized intercepts (5): the first one with the logistic loss and the second one with the quadratic loss. They are denoted by “logistic” and “quadratic”, respectively. To calculate them we use the “glmnet” package [] in the “R” software []. The tuning parameters are chosen on the basis of 10-fold cross-validation.
Observe that applying the Lasso estimator with the logistic loss function to Scenario 1 leads to a well-specified model, while using the quadratic loss implies misspecification. In Scenario 2 both estimators work in misspecified models.
Simulations for each scenario are repeated 300 times.
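The sketch below illustrates the general shape of a single simulation run under placeholder assumptions: the exact correlation structure of the predictors, the link function g, the intercept and the magnitudes of the ten nonzero coefficients are not recoverable from the text above, so an AR(1)-type covariance, a logistic link, a zero intercept and unit magnitudes are used purely for illustration.

```r
library(MASS)     # mvrnorm
library(glmnet)

set.seed(1)
n <- 200; p <- 1000; s <- 10
Sigma <- outer(1:p, 1:p, function(i, j) 0.5^abs(i - j))       # placeholder correlations
X <- mvrnorm(n, mu = rep(0, p), Sigma = Sigma)

beta <- c(sample(c(-1, 1), s, replace = TRUE), rep(0, p - s)) # random signs, placeholder magnitudes
g <- plogis                                   # placeholder link; Scenario 2 would use a different g
y <- rbinom(n, 1, g(X %*% beta))              # binary responses from model (8), zero intercept

# the two Lasso estimators with 10-fold cross-validated tuning parameters
fit_log  <- cv.glmnet(X, y, family = "binomial", nfolds = 10)
fit_quad <- cv.glmnet(X, y, family = "gaussian", nfolds = 10)

b_log  <- as.numeric(coef(fit_log,  s = "lambda.min"))[-1]
b_quad <- as.numeric(coef(fit_quad, s = "lambda.min"))[-1]
```

The estimated coefficient vectors b_log and b_quad are then used to compute the selection metrics and the prediction accuracy described below.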
To describe the quality of estimators in variable selection we calculate two values:
- TD—the number of correctly selected relevant predictors;
- sep—the number of relevant predictors whose Lasso coefficients are greater in absolute value than the largest (in absolute value) Lasso coefficient corresponding to an irrelevant predictor.
In this way we want to confirm that the considered estimators are able to separate predictors, as established in Section 5. Using TD we also study the “screening” properties of estimators, which are easier to achieve than separability. A minimal sketch of computing these metrics is given below.
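Continuing the sketch above, TD and sep can be computed from an estimated coefficient vector as follows; in line with (27), the truly relevant predictors are assumed to occupy the first ten coordinates.

```r
# TD and sep for an estimated coefficient vector b (intercept already removed);
# `relevant` is the set of indices of truly significant predictors.
selection_metrics <- function(b, relevant) {
  TD  <- sum(b[relevant] != 0)                      # correctly selected relevant predictors
  max_irrelevant <- max(abs(b[-relevant]))          # largest |coefficient| among irrelevant ones
  sep <- sum(abs(b[relevant]) > max_irrelevant)     # relevant predictors separated from irrelevant ones
  c(TD = TD, sep = sep)
}

relevant <- 1:10
selection_metrics(b_log,  relevant)   # Lasso with the logistic loss
selection_metrics(b_quad, relevant)   # Lasso with the quadratic loss
```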
The classification accuracy of the estimators is measured in the following way: we generate a test sample containing 1000 objects. On this set we calculate
- pred—the fraction of correctly predicted classes of objects for each estimator.
The results of the experiments are collected in Table 1 and Table 2. By the “oracle” we mean the classifier that works only with significant predictors and uses the function g from the true model (8) in the estimation process.
Table 1.
Results for Scenario 1.
Table 2.
Results for Scenario 2.
Finally, we also compare execution time of both algorithms. In Table 3 we show the averaged relative time difference
where and are the computation times of the Lasso with the quadratic and the logistic loss functions, respectively.
Table 3.
Relative time difference (28) of algorithms.
Looking at the results of the experiments, we observe that both estimators perform in a satisfactory way. Their predictive accuracy is relatively close to the oracle, especially when the sample size is larger. In variable selection we see that both estimators are able to find significant predictors and to separate predictors in both scenarios. Again we notice that the properties of the estimators improve as n increases.
In Scenario 2 the quality of both estimators in prediction and variable selection is comparable. In Scenario 1, which is well-specified for the Lasso with the logistic loss, we observe its dominance over the Lasso with the quadratic loss. However, this dominance is not large. Therefore, using the Lasso with the quadratic loss we obtain slightly worse accuracy, but the algorithm is computationally faster. The computational efficiency is especially important when we study large data sets. As we can see in Table 3, the execution times of the estimators are almost the same for , but for the relative time difference becomes greater than
Author Contributions
Both authors contributed equally to this work. All authors have read and agreed to the published version of the manuscript.
Funding
The research of K.F. was partially supported by Warsaw University of Life Sciences (SGGW).
Acknowledgments
We would like to thank J. Mielniczuk and the reviewers for their valuable comments, which have improved the paper.
Conflicts of Interest
The authors declare no conflict of interest.
Appendix A. Proofs and Auxiliary Results
This section contains proofs of results from the paper. Additional lemmas are also provided.
Appendix A.1. Results from Section 3
Proof of Theorem 1.
For arbitrary the averaged misclassification risk of can be expressed as
Moreover, we have
Applying (A1) and (A2) for and we obtain
where P is probability with respect to both the data D and the new object Observe that on the event we have that
so
Analogously, we obtain
from
which finishes the proof. □
Lemma A1.
Suppose that Assumption 1 is fulfilled. Moreover, a random variable has a density which is continuous on the interval and Then
Proof.
For simplicity, we omit the subscript X in the probability in this proof. We take and obtain the inequalities
The second expression in (A4) equals because It can be handled using subgaussianity of X as follows: take and notice that by the Markov inequality and the fact that for each we obtain
Taking we obtain
Then we choose which is not smaller than one, because from Assumption 1.
Finally, the first term in (A4) can be bounded from above by by the mean value theorem. □
Proof of Theorem 2.
The right-hand sides of (15) and (17) are upper bounds on the estimation risk. They are obtained using Theorem 1 and Lemma A1. The expressions (16) and (18) are upper bounds for the approximation risk in the case of estimators with the quadratic and logistic loss functions, respectively. In particular, (16) follows from ([], Theorem 2.1) applied for and Example 3.1. Establishing (18) is similar: we just use ([], Theorem 2.1) applied for and Example 3.5 to show that
where the Kullback–Leibler distance is defined in (10).
Appendix A.2. Results from Section 4
To simplify notation we write for respectively, in this section. Moreover, we also denote
We start with establishing results, which help us to prove Theorem 3.
Lemma A2.
For we have
Proof.
The next result is given in ([], Corollary 8.2).
Lemma A3.
Suppose that are i.i.d. random variables and there exists such that is finite. Then for arbitrary
Lemma A4.
For arbitrary and we have
Proof.
Fix and Recall that and Thus, we work with an average of i.i.d. centred random variables, so we can use Lemma A3. We only have to find such that
where is the j-th coordinate of For each positive number we have the inequality Therefore, we have
Applying this fact and the Schwarz inequality we obtain
The variable is subgaussian, so using ([], Lemma 7.4) we can bound the first expectation in (A11) by provided that The second expectation in (A11) can be bounded using subgaussianity of the vector ([], Lemma 7.4) and Lemma A2 in the following way
provided that Taking we can bound because implies that Thus, we obtain where C is the upper bound in (A10). It finishes the proof. □
Lemma A5.
Suppose that the assumptions of Theorem 3 are satisfied. Then for arbitrary with probability at least we have where K is a universal constant.
Proof.
Fix We start with considering the -norm of the matrix
We focus only on the right-hand side of (A12), because (A13) can be done similarly. Thus, fix Using subgaussianity of predictors, Lemma A3 and argumentation similar to the proof of Lemma A4 we have
where is a universal constant. The values of the constants that appear in this proof can change from line to line.
Therefore, using union bounds we obtain
Proceeding similarly to the proof of ([], Lemma 4.1) we have the following probabilistic inequality
To finish the proof we use (20) with being sufficiently large. □
Proof of Theorem 3.
Let be arbitrary. The main part of the proof is to show that with high probability
Then we apply Lemma A5 to obtain (22).
Thus, we focus on showing that (A14) holds with high probability. Denote We start with bounding from below probability of Recall that is the minimizer of which can be easily calculated, namely
For every the j-th partial derivative of at is
The derivative with respect to the is
Taking which satisfies (21), and using union bounds, we obtain that
Consider a summand on the right-hand side of (A17) which corresponds to From (A15) we can handle it using Lemma A4. We just take and sufficiently large The probability of the first term on the right-hand side of (A17), which corresponds to can be bounded from above analogously as in the proof of Lemma A4. The argument is even easier, so we omit it.
In further argumentation we consider only the event Besides, we denote where is a minimizer of a convex function (5), that is equivalent to
where
First, we prove that Here our argumentation is standard []. From (A18) and the fact that we can calculate
Thus, using the fact that we consider the event we get
Therefore, from the definition of we have
Appendix A.3. Results from Section 5
Proof of Corollary 2.
The proof is a simple consequence of the bound (22) with obtained in Theorem 3. Indeed, for arbitrary predictors and we obtain
□
Proof of Corollary 3.
The proof is almost the same as the proof of Corollary 2, so it is omitted. □
References
- Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning: Data Mining, Inference and Prediction; Springer: New York, NY, USA, 2001.
- Bühlmann, P.; van de Geer, S. Statistics for High-Dimensional Data: Methods, Theory and Applications; Springer: New York, NY, USA, 2011.
- Tibshirani, R. Regression Shrinkage and Selection via the Lasso. J. R. Stat. Soc. Ser. B 1996, 58, 267–288.
- Meinshausen, N.; Bühlmann, P. High-dimensional graphs and variable selection with the Lasso. Ann. Stat. 2006, 34, 1436–1462.
- Zhao, P.; Yu, B. On Model Selection Consistency of Lasso. J. Mach. Learn. Res. 2006, 7, 2541–2563.
- Zou, H. The adaptive Lasso and its oracle properties. J. Am. Stat. Assoc. 2006, 101, 1418–1429.
- van de Geer, S. High-dimensional generalized linear models and the Lasso. Ann. Stat. 2008, 36, 614–645.
- Bickel, P.J.; Ritov, Y.; Tsybakov, A.B. Simultaneous analysis of Lasso and Dantzig selector. Ann. Stat. 2009, 37, 1705–1732.
- Ye, F.; Zhang, C.H. Rate minimaxity of the Lasso and Dantzig selector for the lq loss in lr balls. J. Mach. Learn. Res. 2010, 11, 3519–3540.
- Huang, J.; Zhang, C.H. Estimation and Selection via Absolute Penalized Convex Minimization and Its Multistage Adaptive Applications. J. Mach. Learn. Res. 2012, 13, 1839–1864.
- Freund, Y.; Schapire, R.E. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comp. Syst. Sci. 1997, 55, 119–139.
- Vapnik, V.N. Statistical Learning Theory; Wiley: New York, NY, USA, 1998.
- Kubkowski, M.; Mielniczuk, J. Selection Consistency of Lasso-Based Procedures for Misspecified High-Dimensional Binary Model and Random Regressors. Entropy 2020, 22, 153.
- Kullback, S.; Leibler, R.A. On Information and Sufficiency. Ann. Math. Statist. 1951, 22, 79–86.
- Schwarz, G. Estimating the dimension of a model. Ann. Stat. 1978, 6, 461–464.
- Quintero, F.; Contreras-Reyes, J.E.; Wiff, R.; Arellano-Valle, R.B. Flexible Bayesian analysis of the von Bertalanffy growth function with the use of a log-skew-t distribution. Fish. Bull. 2017, 115, 12–26.
- Zhang, T. Statistical behavior and consistency of classification methods based on convex risk minimization. Ann. Stat. 2004, 32, 56–85.
- Bartlett, P.L.; Jordan, M.I.; McAuliffe, J.D. Convexity, classification and risk bounds. J. Am. Stat. Assoc. 2006, 101, 138–156.
- Devroye, L.; Györfi, L.; Lugosi, G. A Probabilistic Theory of Pattern Recognition; Springer-Verlag: New York, NY, USA, 1996.
- Boucheron, S.; Bousquet, O.; Lugosi, G. Introduction to statistical learning theory. Adv. Lect. Mach. Learn. 2004, 36, 169–207.
- Boucheron, S.; Bousquet, O.; Lugosi, G. Theory of classification: A survey of some recent advances. ESAIM P&S 2005, 9, 323–375.
- Bartlett, P.L.; Bousquet, O.; Mendelson, S. Local Rademacher complexities. Ann. Stat. 2005, 33, 1497–1537.
- Audibert, J.Y.; Tsybakov, A.B. Fast learning rates for plug-in classifiers. Ann. Stat. 2007, 35, 608–633.
- Blanchard, G.; Bousquet, O.; Massart, P. Statistical performance of support vector machines. Ann. Stat. 2008, 36, 489–531.
- Tarigan, B.; van de Geer, S. Classifiers of support vector machine type with l1 complexity regularization. Bernoulli 2006, 12, 1045–1076.
- Abramovich, F.; Grinshtein, V. High-Dimensional Classification by Sparse Logistic Regression. IEEE Trans. Inf. Theory 2019, 65, 3068–3079.
- Efron, B.; Hastie, T.; Johnstone, I.; Tibshirani, R. Least angle regression. Ann. Stat. 2004, 32, 407–499.
- Friedman, J.; Hastie, T.; Tibshirani, R. Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 2010, 33, 1–22.
- Buldygin, V.; Kozachenko, Y. Metric Characterization of Random Variables and Random Processes; American Mathematical Society: Providence, RI, USA, 2000.
- Huang, J.; Sun, T.; Ying, Z.; Yu, Y.; Zhang, C.H. Oracle inequalities for the lasso in the Cox model. Ann. Stat. 2013, 41, 1142–1165.
- van de Geer, S.; Bühlmann, P. On the conditions used to prove oracle results for the Lasso. Electron. J. Stat. 2009, 3, 1360–1392.
- Li, K.C.; Duan, N. Regression analysis under link violation. Ann. Stat. 1989, 17, 1009–1052.
- Thorisson, H. Coupling methods in probability theory. Scand. J. Stat. 1995, 22, 159–182.
- Brillinger, D.R. A Generalized Linear Model with Gaussian Regressor Variables. In A Festschrift for Erich Lehmann; Bickel, P.J., Doksum, K., Hodges, J.L., Eds.; Wadsworth: Belmont, CA, USA, 1983; pp. 97–114.
- Ruud, P.A. Sufficient Conditions for the Consistency of Maximum Likelihood Estimation Despite Misspecification of Distribution in Multinomial Discrete Choice Models. Econometrica 1983, 51, 225–228.
- Zhong, W.; Zhu, L.; Li, R.; Cui, H. Regularized quantile regression and robust feature screening for single index models. Stat. Sin. 2016, 26, 69–95.
- Fan, J.; Lv, J. Sure independence screening for ultrahigh dimensional feature space. J. R. Stat. Soc. Ser. B 2008, 70, 849–911.
- Hall, P.; Li, K.C. On almost Linearity of Low Dimensional Projections from High Dimensional Data. Ann. Stat. 1993, 21, 867–889.
- Pokarowski, P.; Mielniczuk, J. Combined l1 and Greedy l0 Penalized Least Squares for Linear Model Selection. J. Mach. Learn. Res. 2015, 16, 961–992.
- Pokarowski, P.; Rejchel, W.; Soltys, A.; Frej, M.; Mielniczuk, J. Improving Lasso for model selection and prediction. arXiv 2019, arXiv:1907.03025.
- R Development Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2017.
- van de Geer, S. Estimation and Testing under Sparsity; Springer: Berlin, Germany, 2016.
- Baraniuk, R.; Davenport, M.A.; Duarte, M.F.; Hegde, C. An Introduction to Compressive Sensing; Connexions, Rice University: Houston, TX, USA, 2011.
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).