Abstract
Estimation of the error variance in a regression model is a fundamental problem in statistical modeling and inference. In high-dimensional linear models, variance estimation is a difficult problem due to the issue of model selection. In this paper, we propose a novel approach for variance estimation that combines the reparameterization technique and the adaptive lasso, which we call the natural adaptive lasso. This method can simultaneously select and estimate the regression and variance parameters. Moreover, we show that the natural adaptive lasso for the regression parameters is equivalent to the adaptive lasso. We establish the asymptotic properties of the natural adaptive lasso for the regression parameters and derive the mean squared error bound for the variance estimator. Our theoretical results show that, under appropriate regularity conditions, the natural adaptive lasso for the error variance is closer to the so-called oracle estimator than some other existing methods. Finally, Monte Carlo simulations are presented to demonstrate the superiority of the proposed method.
Keywords:
high-dimensional linear model; variance estimation; natural adaptive lasso; mean squared error bound; regularized regression
MSC:
62F10; 62J05; 62J10
1. Introduction
Consider the linear regression model $y_i = \boldsymbol{x}_i^{\top}\boldsymbol{\beta} + \varepsilon_i$, where $y_i$ is the response variable, $\boldsymbol{x}_i \in \mathbb{R}^p$ is the predictor variable, $\boldsymbol{\beta}$ is the unknown regression parameter and $\varepsilon_i$ is the random error satisfying $E(\varepsilon_i) = 0$ and $\operatorname{Var}(\varepsilon_i) = \sigma^2$. Given a random sample $(y_i, \boldsymbol{x}_i)$, $i = 1, \ldots, n$, the model can be written in the matrix form $\boldsymbol{y} = X\boldsymbol{\beta} + \boldsymbol{\varepsilon}$, where $\boldsymbol{y} = (y_1, \ldots, y_n)^{\top}$, $X = (\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)^{\top}$ and $\boldsymbol{\varepsilon} = (\varepsilon_1, \ldots, \varepsilon_n)^{\top}$. In this paper, we are mainly interested in the high-dimensional sparse model, where $p \gg n$.
Regularized methods for simultaneous model selection and parameter estimation have been intensively studied in the literature, e.g., the lasso [1], the smoothly clipped absolute deviation (SCAD) [2], the adaptive lasso [3], the bridge [4], the adaptive elastic net [5] and the minimax concave penalty (MCP) [6], as well as the Dantzig selector [7]. In addition, screening rules for dimension reduction have been proposed, e.g., the sure independence screening and iterative sure independence screening methods [8], lasso-based screening rules [9,10,11], etc.
However, most of these works focus on selection and estimation of the regression parameters, and few studies deal with estimation of the error variance, although it is a fundamental and crucial problem in statistical inference and regression analysis. In conventional linear models, the common residual-based estimator plays an important role in statistical inference and model checking. In high-dimensional models, however, variance estimation becomes a difficult problem, mainly for two reasons. One is that the traditional residual-based methods may perform poorly or even fail; for example, the ordinary least squares method does not work when the number of covariates is greater than the sample size. The other is that it is difficult to select the true model accurately, since in practice the selected model often contains spurious variables that are correlated with the residuals, resulting in significant underestimation of the error variance (e.g., [12,13]).
Next, we provide some examples where the model error variance is involved and plays an important role.
Example 1
(Model selection). Penalization is a common approach to model selection and parameter estimation in high-dimensional linear models. The efficiency and accuracy of such methods depend on certain tuning parameters that are chosen using some criteria, such as Mallows's $C_p$, Akaike's information criterion (AIC) and the Bayesian information criterion (BIC). For example, the AIC and BIC for the lasso [14] are given by
$$\mathrm{AIC}(\lambda) = \frac{\|\boldsymbol{y} - X\hat{\boldsymbol{\beta}}(\lambda)\|_2^2}{n\sigma^2} + \frac{2}{n}\,\mathrm{df}(\lambda)$$
and
$$\mathrm{BIC}(\lambda) = \frac{\|\boldsymbol{y} - X\hat{\boldsymbol{\beta}}(\lambda)\|_2^2}{n\sigma^2} + \frac{\log n}{n}\,\mathrm{df}(\lambda),$$
respectively, where $\hat{\boldsymbol{\beta}}(\lambda)$ is the lasso estimator with tuning parameter $\lambda$ and the degrees of freedom $\mathrm{df}(\lambda)$ is equal to the number of non-zero elements in $\hat{\boldsymbol{\beta}}(\lambda)$. It is easy to see that these criteria rely on the error variance.
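To make the dependence on the error variance concrete, the following Matlab sketch evaluates AIC- and BIC-type criteria along a lasso solution path, assuming the criteria take the residual-sum-of-squares-over-$n\sigma^2$ form displayed above; the function name and input format (a p-by-K matrix of lasso solutions) are illustrative and not part of [14].

function [aic, bic] = lasso_ic(y, X, betas, sigma2)
% AIC/BIC-type criteria for a lasso solution path (illustrative sketch).
% y: n-by-1 response; X: n-by-p design; betas: p-by-K matrix whose columns
% are lasso estimates over a grid of tuning parameters; sigma2: error variance.
n = length(y);
K = size(betas, 2);
aic = zeros(1, K);
bic = zeros(1, K);
for k = 1:K
    rss = sum((y - X * betas(:, k)).^2);   % residual sum of squares
    df  = sum(betas(:, k) ~= 0);           % df = number of nonzero coefficients
    aic(k) = rss / (n * sigma2) + (2 / n) * df;
    bic(k) = rss / (n * sigma2) + (log(n) / n) * df;
end
end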
Example 2
(Confidence intervals). For a least-squares-based penalized estimator $\hat{\boldsymbol{\beta}}$, let $\hat{A} = \{ i : \hat{\beta}_i \neq 0 \}$ be its index set corresponding to the non-vanishing parameters. If $\hat{\boldsymbol{\beta}}$ has the oracle property, then for each $i \in \hat{A}$, the confidence interval for $\beta_i$ is given by
$$\hat{\beta}_i \pm z_{1-\alpha/2}\,\sigma\sqrt{\big[(X_{\hat{A}}^{\top} X_{\hat{A}})^{-1}\big]_{ii}},$$
where $z_{1-\alpha/2}$ is the $(1-\alpha/2)$-th quantile of the standard normal distribution and $[(X_{\hat{A}}^{\top} X_{\hat{A}})^{-1}]_{ii}$ is the i-th diagonal element of the matrix $(X_{\hat{A}}^{\top} X_{\hat{A}})^{-1}$. It is clear that the above intervals depend on the variance parameter.
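As an illustration, the Matlab sketch below computes such intervals from the selected support of a penalized estimator; treating the selected support as correct and plugging in a known $\sigma$ follow the oracle-property reasoning above, while the function name and interface are our own.

function ci = oracle_ci(X, beta_hat, sigma, z)
% Normal-theory confidence intervals for the nonzero coefficients of a
% penalized estimator, treating the selected support as the true one (sketch).
% z is the (1 - alpha/2) standard normal quantile, e.g., z = 1.96 for 95% intervals.
A  = find(beta_hat ~= 0);        % selected index set
XA = X(:, A);
V  = inv(XA' * XA);              % (X_A' X_A)^{-1}
se = sigma * sqrt(diag(V));      % standard errors of the selected coefficients
ci = [beta_hat(A) - z * se, beta_hat(A) + z * se];
end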
Example 3
(Penalized second-order least squares estimation). The second-order least squares method in [15] extends the ordinary least squares method by simultaneously minimizing the distances based on the first two conditional moments, and yields joint estimators of the regression and variance parameters. Under general conditions, the second-order least squares estimator has been shown to be asymptotically more efficient than the ordinary least squares estimator if the model error has a nonzero third moment, and the two are equivalent otherwise. The regularized version of this method can be used in high-dimensional models.
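For concreteness, here is a minimal Matlab sketch of a second-order least squares criterion in which both weighting matrices are taken to be the identity; the general method of [15] allows data-dependent weights, so this simplified objective and the function name are illustrative assumptions only.

function Q = sls_objective(theta, y, X)
% Second-order least squares criterion with identity weights (sketch).
% theta = [beta; sigma2], where beta is p-by-1 and sigma2 is the error variance.
p      = size(X, 2);
beta   = theta(1:p);
sigma2 = theta(p + 1);
mu = X * beta;                 % first conditional moment
r1 = y - mu;                   % first-order distance
r2 = y.^2 - mu.^2 - sigma2;    % second-order distance
Q  = sum(r1.^2 + r2.^2);       % criterion to be minimized jointly in (beta, sigma2)
end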
1.1. Literature Review
Variance estimation in high-dimensional models has attracted increasing attention in recent years. Here, we briefly review some important advances in this area. First, if the true parameter vector $\boldsymbol{\beta}^{*}$ were known, then the ideal variance estimator, called the oracle estimator, would be $\hat{\sigma}^2_{\mathrm{oracle}} = \|\boldsymbol{y} - X\boldsymbol{\beta}^{*}\|_2^2 / n$. Correspondingly, the estimator $\|\boldsymbol{y} - X\hat{\boldsymbol{\beta}}\|_2^2 / n$, based on some estimator $\hat{\boldsymbol{\beta}}$ of $\boldsymbol{\beta}^{*}$, is called a naive estimator. Since the naive estimator is downward biased, a modified unbiased estimator is given by $\|\boldsymbol{y} - X\hat{\boldsymbol{\beta}}\|_2^2 / (n - \hat{s})$, where $\hat{s}$ is the number of nonzero elements in $\hat{\boldsymbol{\beta}}$. Unfortunately, when p is much larger than n, even a small change in $\hat{s}$ will cause a huge fluctuation in this estimator if $\hat{s}$ is close to n.
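The following Matlab snippet contrasts the three estimators just described; it is a sketch that assumes the data y, X, the true coefficient vector beta_true (available only in simulations) and some estimator beta_hat are already in the workspace.

% Oracle, naive and degrees-of-freedom-corrected variance estimators (sketch).
n = length(y);
sigma2_oracle = sum((y - X * beta_true).^2) / n;            % ideal but infeasible
sigma2_naive  = sum((y - X * beta_hat).^2) / n;             % downward biased
s_hat         = sum(beta_hat ~= 0);                         % number of selected variables
sigma2_corr   = sum((y - X * beta_hat).^2) / (n - s_hat);   % df-corrected version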
To overcome this problem, Ref. [16] estimated the mean and variance parameters jointly by maximizing a reparameterized likelihood with an $\ell_1$ penalty, in which the reparameterization is in terms of $\boldsymbol{\phi} = \boldsymbol{\beta}/\sigma$ and $\rho = 1/\sigma$. Moreover, they proposed a generalized EM algorithm for the numerical optimization.
A refitted cross-validation (RCV) method for variance estimation was proposed in [12], and its asymptotic properties were studied. The main idea of RCV is to attenuate the influence of irrelevant variables with high spurious correlations via a data-splitting technique. Ref. [12] also discussed the asymptotic properties of the lasso-based and SCAD-based variance estimators, namely the naive estimators built on the penalized least squares estimators with the $\ell_1$ penalty [1] and the SCAD penalty [2], respectively.
Further, the scaled lasso was proposed in [17] for simultaneous estimation of the regression and variance parameters. Their estimator can be written as
$$(\hat{\boldsymbol{\beta}}, \hat{\sigma}) = \operatorname*{arg\,min}_{\boldsymbol{\beta},\, \sigma > 0} \left\{ \frac{\|\boldsymbol{y} - X\boldsymbol{\beta}\|_2^2}{2n\sigma} + \frac{\sigma}{2} + \lambda \|\boldsymbol{\beta}\|_1 \right\}.$$
Under some regularity conditions, Ref. [17] proved oracle inequalities for prediction and for their estimators.
A moment estimator of the error variance, based on the covariance matrix $\Sigma$ of the predictor variables, was studied in [18], where three cases were considered: known $\Sigma$, estimable $\Sigma$ and non-estimable $\Sigma$. A maximum likelihood method for normally distributed noise was developed in [19].
Moreover, Ref. [13] considered another reparameterized likelihood with a lasso-type penalty. In particular, they proposed two estimators: the natural lasso, which uses an $\ell_1$ penalty, and the organic lasso, which uses a squared $\ell_1$ penalty.
Finally, Ref. [20] proposed a ridge-based method for estimating the error variance under certain conditions, in which the estimator is constructed from ridge regression with a suitably chosen tuning parameter. This method performs well in low-dimensional cases with weak signals, and it is suitable for sparse as well as non-sparse models.
1.2. Notation and Outline
Throughout the paper, let $A^{*}$ be the index set of the nonzero elements of the true parameter vector and $s^{*}$ be its cardinality. Given a design matrix $X$ and a subset $B$ of $\{1, \ldots, p\}$, $X_i$ denotes the i-th column vector of $X$, and $X_B$ denotes the sub-matrix consisting of the columns with indices in $B$. For vectors $\boldsymbol{a}$ and $\boldsymbol{b}$ of the same dimension, $\boldsymbol{a} \circ \boldsymbol{b}$ denotes their Hadamard product. Moreover, $\|\boldsymbol{a}\|_0$, $\|\boldsymbol{a}\|_1$, $\|\boldsymbol{a}\|_2$ and $\|\boldsymbol{a}\|_{\infty}$ denote the $\ell_0$, $\ell_1$, $\ell_2$ and $\ell_{\infty}$ norms of $\boldsymbol{a}$, respectively.
The rest of this paper is organized as follows: Section 2 defines and describes the proposed natural adaptive lasso, and Section 3 gives its asymptotic properties. Section 4 deals with the numerical optimization of the proposed estimators. Monte Carlo simulation studies of finite sample properties are provided in Section 5. The conclusions and discussion are given in Section 6, while the mathematical proofs are given in Section 7.
2. Natural Adaptive Lasso (NAL)
Some researchers, e.g., Refs. [13,16], used a reparameterized likelihood to jointly estimate the mean and variance parameters in high-dimensional linear models. In particular, the method of [13] has good performance, and the associated numerical computation can be converted to some simple optimization procedures. However, the natural lasso in [13] always overestimates the error variance, due to the over-selection of covariates. This motivates us to consider the more general adaptive lasso penalty, to further improve the properties of the estimators. Consider the following adaptively weighted $\ell_1$-penalized likelihood
where the reparameterized likelihood is as in (1), $\lambda \geq 0$ is the tuning parameter and $\boldsymbol{w} = (w_1, \ldots, w_p)^{\top}$ is the adaptive weight vector. Given a solution of problem (2), the natural adaptive lasso estimators (NALE) for $\boldsymbol{\beta}$ and $\sigma^2$ are given by
It is easy to see that, when all the weights are equal to one, the NALE reduces to the natural lasso estimator (NLE) of [13].
Note that the quality of the NALE depends on the weight vector $\boldsymbol{w}$. It follows from Proposition 1 in Section 3 that the weight in problem (2) plays the same role as in the adaptive lasso estimation of the regression coefficients only, which solves the following convex optimization problem:
where the weight depends on the initial estimator $\tilde{\boldsymbol{\beta}}$. As indicated by [3], any root-n consistent estimator can be used as the initial estimator of $\boldsymbol{\beta}$. For example, the least squares estimator can be used, and the weight vector is calculated as $w_j = 1/|\tilde{\beta}_j|^{\gamma}$, $j = 1, \ldots, p$, for some $\gamma > 0$. Ref. [4] discusses the selection of the initial estimators in high-dimensional linear models and shows that their marginal regression estimator can be used in the adaptive lasso to yield the desirable selection and estimation properties. In addition, the weight in the adaptive elastic net [5], for moderately high-dimensional models, can be constructed from the elastic-net estimator in the same spirit. In this paper, we use the following two-step procedure to calculate the weight vector.
Step 1: Solve the lasso problem to obtain the NLE $\tilde{\boldsymbol{\beta}}$, which is used as the initial estimator.
Step 2: Set the weights $w_j$ proportional to $p'_{\lambda_n}(|\tilde{\beta}_j|)$, $j = 1, \ldots, p$, where $p_{\lambda_n}(\cdot)$ is a folded-concave penalty function (such as SCAD, MCP or bridge) and $p'_{\lambda_n}$ denotes its derivative.
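As an illustration of Step 2, the Matlab sketch below builds the weights from the SCAD derivative evaluated at an initial estimate; the scaling of the weights by $1/\lambda_n$ and the default constant a = 3.7 are conventional choices that we assume here, not necessarily the exact construction used in the theoretical analysis.

function w = scad_weights(beta_init, lambda, a)
% Adaptive weights from the SCAD derivative at an initial estimate (sketch).
% beta_init: p-by-1 initial estimator (e.g., the NLE); lambda: tuning parameter;
% a: SCAD constant (commonly a = 3.7).
if nargin < 3
    a = 3.7;
end
t = abs(beta_init);
% SCAD derivative: p'(t) = lambda for t <= lambda,
%                  (a*lambda - t)_+ / (a - 1) for t > lambda.
dp = lambda .* (t <= lambda) + max(a * lambda - t, 0) / (a - 1) .* (t > lambda);
w  = dp / lambda;   % weights lie in [0, 1]; large coefficients receive weights near zero
end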
Remark 1.
From [7,21,22], under some regularity conditions, the lasso is consistent with a near-oracle rate and has the sure-screening property, i.e.,
Further, based on the order of the bias of the lasso, under suitable conditions on the minimum signal strength (see the first part of Condition 4 in Section 7) and on the choice of the tuning parameter, the weights corresponding to the true nonzero coefficients will be close, or even equal, to zero when n is sufficiently large, if a folded-concave penalty, such as SCAD, is used. These properties play an important role in some of the conclusions that follow.
3. Asymptotic Properties
In this section, we first establish the relationship between the NALE and the adaptive lasso, and then analyze the asymptotic properties of the NALE for $\sigma^2$.
Proposition 1.
The results of Proposition 1 are instrumental in the derivation of the other theoretical results in this paper. Moreover, they also provide a method for calculating the NALE for $\boldsymbol{\beta}$ and $\sigma^2$. It is well known that the adaptive lasso (4) is a convex optimization problem, and many existing optimization tools can be used to solve it.
Note that, since
and the weighted penalty term will be close, or even equal, to zero for a suitably chosen $\boldsymbol{w}$, the NALE for $\sigma^2$ will be close to the naive estimator. As mentioned before, the naive estimator for $\sigma^2$, based on the adaptive lasso estimator, may work well when the non-zero variables are selected accurately. However, when more irrelevant variables are selected, the value of the penalty term will not be close to 0 in finite samples, so the naive estimator for $\sigma^2$ will always underestimate the true error variance. In this case, the penalty term mitigates the difference between the naive estimator and the true variance. Although the form of the natural lasso estimator of [13] is similar to (5), their method often tends to over-select predictors, due to the use of a lasso penalty. In addition, the value of the penalty term in [13] remains large because it is not controlled by a weight vector. These facts explain why the natural lasso estimator for $\sigma^2$ tends to be larger than the true error variance in the simulation studies in [13].
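To illustrate this decomposition, the short Matlab sketch below computes a variance estimate as the naive residual term plus the weighted penalty term, by analogy with the natural lasso of [13]; the factor 2 in front of the penalty and the exact correspondence with (5) are assumptions, and beta_hat, w and lambda denote the adaptive lasso solution, the weight vector and the tuning parameter.

% Variance estimate as naive term plus weighted l1 correction (sketch, by
% analogy with the natural lasso of [13]).
n = length(y);
naive   = sum((y - X * beta_hat).^2) / n;        % naive residual-based term
penalty = 2 * lambda * sum(w .* abs(beta_hat));  % weighted l1 correction term
sigma2_nale = naive + penalty;                   % corrected variance estimate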
Next, we establish a key inequality for the NALE for $\sigma^2$.
Lemma 1.
If , then
The above inequality is deterministic, in that it does not rely on any statistical assumptions on $X$ and $\boldsymbol{\varepsilon}$. Unlike Lemma 1 in [13], the proof of this result uses the fact that any candidate vector provides an upper bound on the minimum value of the objective function, together with the convexity of the loss function. In addition, when all the weights are equal to one, Lemma 1 reduces to Lemma 1 in [13]. If the weighted penalty evaluated at the true coefficient vector is close or equal to zero, then the bound on the right-hand side of the inequality in Lemma 1 is lower than the corresponding bounds for the natural lasso and organic lasso in [13].
3.1. Adaptive Lasso
It follows from Lemma 1 that the error bound of the NALE for $\sigma^2$ is controlled by the convergence rate of the adaptive-lasso estimator. Therefore, it is necessary to establish the asymptotic properties of the latter. The results in this subsection are similar to those in [23]. All regularity conditions and proofs are given in Section 7.
Theorem 1.
Suppose Conditions 1–3 hold. Assume that
and , where is some positive constant and . Then, with probability at least , there exists a unique minimizer of problem (4), such that and , where
with some constant , and are defined in the regularity conditions.
It follows from inequality (18) that the extra term in is due to the bias of the initial estimator . When tends to zero, the order of the extra term is . Thus, under some general conditions, the convergence rate of is . Usually, the order of L is .
We now present the asymptotic normality of the adaptive-lasso estimator.
Theorem 2.
Assume that the conditions of Theorem 1 hold. Let for any satisfying . Then, under Conditions 1–4, with probability at least , the minimizer in Theorem 1 satisfies
where .
The result of Theorem 2 is consistent with the asymptotic normality of the bridge estimator in [4]. The only difference is in the form of the penalty function.
Next, we consider the convergence performance of the specific adaptive-lasso estimator with a weight vector determined by the SCAD penalty [23], whose derivative is given by
$$p'_{\lambda}(t) = \lambda \left\{ I(t \leq \lambda) + \frac{(a\lambda - t)_{+}}{(a-1)\lambda} I(t > \lambda) \right\}, \quad t > 0,$$
where $a > 2$ is a given constant (commonly $a = 3.7$). Usually, the order of is . By definition, it holds , and Condition 4 is satisfied when . Thus, we have the following result.
Corollary 1.
Assume that the conditions of Theorem 1 hold. Then, under Conditions 1–4, with probability at least , there exists a unique minimizer of problem (4), such that
Furthermore, satisfies
where for any satisfying .
The rates of convergence of the estimators in Theorem 1 and Corollary 1 are controlled by the distributions of the random error and the predictor matrix. Moreover, these results can be generalized to other situations where the random error follows a sub-Gaussian or sub-exponential distribution.
3.2. Error Bounds of NALE
In this subsection, we establish the error bound for the NALE of . It follows from (14) that, under the conditions of Theorem 1, holds, with probability . Since , we have . Thus, in order to establish the asymptotic properties of NALE for , we still need to determine the order of . By Condition 2 and Theorem 1, we have
Thus, we have the following result on the error bound of the NALE for $\sigma^2$.
Theorem 3.
Under the conditions of Theorem 1, the NALE for $\sigma^2$ has the following error bound, with probability at least :
where .
The proof of the above result follows straightforwardly from Lemma 1 and Theorem 1, so it is omitted. Since the weighted penalty evaluated at the true coefficients is close or equal to zero, and the adaptive lasso converges at the rate given in Theorem 1, the error bound of the NALE for $\sigma^2$ is smaller than those of the NLE, OLE and SLE when n is sufficiently large. In the following, we analyze the mean squared error bound for the NALE of $\sigma^2$.
Theorem 4.
Under the conditions of Theorem 1, for any and , the NALE for $\sigma^2$ satisfies
Note that the above mean squared error bound of the NALE for $\sigma^2$ is lower than those of the NLE, OLE and SLE. Finally, we consider the case where the SCAD penalty is used. Then, by Theorem 3 and the fact that , under the condition on the minimum signal strength, we have the following result.
Corollary 2.
Under the conditions of Corollary 1, the NALE for $\sigma^2$ using the SCAD has the following error bound, with probability at least :
Further, by Theorem 4 and Corollary 2, we have the mean squared error bound of the NALE for $\sigma^2$ using the SCAD.
Corollary 3.
Under the conditions of Corollary 1, for any , the NALE for $\sigma^2$ using the SCAD with satisfies the following relative mean squared error bound:
4. Numerical Optimization
In this section, we study the optimization method for the NALE. Proposition 1 provides an easy way to calculate the NALE for $\sigma^2$ through existing optimization tools for the adaptive lasso (4). Given the tuning parameter $\lambda$, we consider the proximal gradient algorithm (PGA) to solve this problem, which has the following steps:
- Initialization: take initial value , .
- Iterative step: .
In the above framework, is taken to be the Lipschitz constant of , , such that for any , ,
Usually, . In addition, by the definition of proximal mapping,
By simple calculation,
Finally, the PGA is terminated when either the sequence meets the criterion
or the maximum number of iterations is reached.
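A compact Matlab sketch of the PGA for the weighted-lasso problem (4) is given below. It assumes the smooth part of the objective is the scaled squared loss $\|\boldsymbol{y} - X\boldsymbol{\beta}\|_2^2/(2n)$, takes the Lipschitz constant as the largest eigenvalue of $X^{\top}X/n$ and uses a relative-change stopping rule; these choices, and the function interface, are illustrative rather than the exact settings used in the paper.

function beta = pga_adaptive_lasso(y, X, w, lambda, tol, maxit)
% Proximal gradient algorithm for the adaptive lasso (sketch).
% Smooth part: f(beta) = ||y - X*beta||^2 / (2n); the proximal map of the
% weighted l1 penalty is coordinatewise soft-thresholding.
[n, p] = size(X);
L    = norm(X)^2 / n;            % Lipschitz constant: largest eigenvalue of X'X/n
beta = zeros(p, 1);              % initial value
for k = 1:maxit
    grad     = X' * (X * beta - y) / n;           % gradient of the smooth part
    z        = beta - grad / L;                   % gradient step
    thr      = lambda * w / L;                    % coordinatewise thresholds
    beta_new = sign(z) .* max(abs(z) - thr, 0);   % weighted soft-thresholding
    if norm(beta_new - beta) <= tol * max(1, norm(beta))
        beta = beta_new;                          % stopping criterion met
        break
    end
    beta = beta_new;
end
end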
5. Numerical Simulations
In this section, we carry out Monte Carlo simulations to study the finite-sample performance of the NALE, with the weights calculated using the SCAD penalty. Further, we compare the NALE with the square-root/scaled lasso (SLE) [17], the natural lasso (NLE) [13], the organic lasso (OLE) [13] and the ridge-based estimator (RBE) [20]. We also include the oracle estimator (OE) as a benchmark in the comparisons. All numerical computations were done using Matlab. The programs are available upon request from the first author of this paper, or in the Supplementary Materials.
5.1. Simulation Settings
Following [23], throughout the simulations we use the sample size and parameter dimension . Further, each row of the design matrix is generated from the multivariate normal distribution , with and . The sparsity of is set to be the largest integer less than or equal to , and the locations of the nonzero elements in are determined randomly. We consider various parameter values, , and , and use the following true regression parameter vectors
We have also considered other variance settings; however, the simulation results are similar to those of the above settings and are therefore not included. To assess the performance of the estimators, we calculate the average mean squared error (MSE) and the average relative error (RE), based on 100 Monte Carlo runs.
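The two performance measures can be computed as in the following Matlab sketch; here we take the RE of a single replication to be $|\hat{\sigma}^2 - \sigma^2|/\sigma^2$, which is an assumed definition on our part.

% Average MSE and RE over Monte Carlo replications (sketch).
% sigma2_hat: 1-by-R vector of variance estimates from R = 100 runs;
% sigma2_true: true error variance. The RE definition below is an assumption.
mse = mean((sigma2_hat - sigma2_true).^2);               % average squared error
re  = mean(abs(sigma2_hat - sigma2_true) / sigma2_true); % average relative error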
5.2. Selection of Tuning Parameters
Usually, five-fold cross-validation can be used to select the tuning parameter for each estimator, but this is fairly expensive computationally. In order to reduce the computational cost, we consider the following fixed choices of tuning parameters for all estimators, except for the NLE and NALE.
For the SLE, we consider three penalty levels , , which is similar to Example 1 in [17]. Then, the best estimator is selected as the final SLE estimator. Indeed, Ref. [17] found that works very well for SLE. By the simulation results of [13], the OLE with and performed very well, where . From [20,24], the tuning parameter used in RBE is calculated by setting with .
5.3. Simulation Results
In each simulation, 100 runs are carried out to calculate the averages of the performance measures. The results are presented in Table 1, Table 2, Table 3 and Table 4. These results show that, overall, both the MSE and RE of the NALE are very close to those of the OE and are remarkably better than those of the other estimators in most cases. However, in a few cases the NALE has a slightly larger MSE than the NLE, although it has a smaller RE than the latter. As expected, the NLE often overestimates the true value, due to the bias and over-selection of the lasso. Moreover, in the cases where the NLE has a relatively large MSE, the NALE tends to have a large MSE as well, indicating that poor performance of the NLE will impact the performance of the NALE, since it is used as the initial estimator. Finally, Ref. [20] reported that the RBE performs well in cases with relatively small p and weak signals; however, it performs poorly and is even ineffective in the settings of our simulations.
Table 1.
Average RE of various estimators, true .
Table 2.
Average RE of various estimators, true .
Table 3.
Average MSE of various estimators, true .
Table 4.
Average RE of various estimators, true .
We further summarize the performances of the various methods using boxplots, in Figure 1 and Figure 2. As one can easily see, the NALE is accurate and stable in all cases, while the OLE is less accurate, although it is still fairly stable. Further, the NALE performs well in extremely sparse scenarios. Another interesting point is that the NALE inherits the variable selection and parameter estimation properties of the adaptive lasso. Although we focus on variance estimation in this work, the method performs well in estimating the regression coefficients as well.
Figure 1.
Boxplots of 100 RE values for five estimators, true .
Figure 2.
Boxplots of 100 RE values for five estimators, true .
6. Conclusions and Discussion
We proposed a novel approach for variance estimation that combines the reparameterized log-likelihood function and adaptive-lasso penalization. We have established the asymptotic properties of the NALE. The theory in this paper shows that the NALE converges at a faster rate than some other existing estimators, including the NLE, OLE and SLE. In addition, the NAL is closely related to the adaptive lasso, which makes its numerical calculation straightforward. We have used the PGA to obtain the NALE in the numerical simulations. Our simulation results show that the NALE performs well and compares favorably with other existing methods in most finite-sample situations, especially in extremely sparse scenarios. However, the quality of the NALE depends on that of the initial estimator used in its numerical optimization, and poor performance of the initial estimator may result in poor performance of the NALE.
7. Regularity Conditions and Proofs
This section provides theoretical proofs. We first state the following regularity conditions.
Condition 1.
With probability approaching one, the initial estimator satisfies .
Condition 2.
is non-increasing in and is Lipschitz with constant , that is,
for any , . Moreover, for sufficiently large n, where is defined in Condition 1.
Condition 3.
There exist positive constants , such that
and
where , , is defined in Theorem 1.
Condition 4.
The true coefficients satisfy . Moreover, it holds for any and .
As we pointed out in Remark 1, the lasso estimator satisfies Condition 1. Condition 2 affects the bound between and and is used in the proof of Theorem 1. Further, it determines the bound between and . The first part of Condition 3 is a very common regularity condition (see [4,12,23]) in high-dimensional regression. The remaining part is similar to Condition 3 in [23], which is used in the proofs of Theorems 1 and 2. Condition 4 is needed in the analysis of Corollary 1.
Proof of Lemma 1.
From Proposition 1, we have
Since the loss function in the adaptive lasso is convex, we have
Proof of Theorem 1.
Since problem (4) is a convex optimization, by Theorem 1 of [25], it suffices to show that, with probability tending to 1, there exists a minimizer of problem (4) that satisfies
where .
Let and . Since and , it follows from Corollary 4.3 in [26] that, for any ,
Now we show that there exists a minimizer of problem (4) that satisfies conditions (11)–(13).
Equation (11): Consider the minimizer of problem (4) in the subspace . Let , where with , , and is some large enough constant. Note that
where , . For , by (14), we have
where the last inequality holds, due to . For , we have
By the two-step procedure for the weight vector and Condition 2, it holds that
Taking a large enough C, we obtain, with probability tending to one,
It follows immediately that, with probability approaching one, there exists a minimizer of problem (4), restricted to the subspace , such that , with some constant . Therefore, equality (11) holds by the optimality theory.
Inequality (12): It remains to be proven that with asymptotic probability 1, (12) holds. Then, by optimality theory, is the unique global minimizer of problem (4).
By triangle inequality, we have
Further, by Condition 1, we have with probability approaching one, where . Moreover, by the definition of the folded-concave penalty function,
Therefore, by Condition 2 and inequality (20), we conclude that
Thus, for the first term on the right-hand side of inequality (19), by (14) and the condition that , with probability approaching one,
Proof of Theorem 2.
By equality (11), since , we have
Therefore,
By the first part of Condition 4 and the bound of in Theorem 1, we have . Then, . In addition, by the second part of Condition 4,
where and lie on the line segment . It follows that . Further, since
we have, for the second term on the left-hand side of (24),
Finally, the result follows, by verifying the conditions of the Lindeberg–Feller central limit theorem, in the same way as in the proof of Theorem 2 in [4]. □
Supplementary Materials
The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/math10111937/s1. The programs used in the numerical simulations are available in the Supplementary Materials.
Author Contributions
Methodology, X.W., L.K. and L.W.; software, X.W.; writing—original draft, X.W., L.K. and L.W.; validation, X.W., L.K. and L.W. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded by the National Natural Science Foundation of China (12071022) and the 111 Project of China (B16002).
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
Not applicable.
Acknowledgments
The authors thank the editor and the three anonymous reviewers, for their helpful comments and suggestions.
Conflicts of Interest
The authors declare no conflict of interest.
References
- Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B 1996, 58, 267–288. [Google Scholar] [CrossRef]
- Fan, J.; Li, R. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 2001, 96, 1348–1360. [Google Scholar] [CrossRef]
- Zou, H. The adaptive lasso and its oracle properties. J. Am. Stat. Assoc. 2006, 101, 1418–1429. [Google Scholar] [CrossRef] [Green Version]
- Huang, J.; Horowitz, J.L.; Ma, S. Asymptotic properties of bridge estimators in sparse high-dimensional regression models. Ann. Stat. 2008, 36, 587–613. [Google Scholar] [CrossRef]
- Zou, H.; Zhang, H.H. On the adaptive elastic-net with a diverging number of parameters. Ann. Stat. 2009, 37, 1733–1751. [Google Scholar] [CrossRef] [Green Version]
- Zhang, C.H. Nearly unbiased variable selection under minimax concave penalty. Ann. Stat. 2010, 38, 894–942. [Google Scholar] [CrossRef] [Green Version]
- Candes, E.; Tao, T. The Dantzig selector: Statistical estimation when p is much larger than n. Ann. Stat. 2007, 35, 2313–2351. [Google Scholar]
- Fan, J.; Lv, J. Sure independence screening for ultrahigh dimensional feature space. J. R. Stat. Soc. Ser. B 2008, 70, 849–911. [Google Scholar] [CrossRef] [Green Version]
- Ghaoui, L.E.; Viallon, V.; Rabbani, T. Safe feature elimination in sparse supervised learning. Pac. J. Optim. 2012, 8, 667–698. [Google Scholar]
- Wang, J.; Wonka, P.; Ye, J. Lasso screening rules via dual polytope projection. J. Mach. Learn. Res. 2015, 16, 1063–1101. [Google Scholar]
- Xiang, Z.J.; Wang, Y.; Ramadge, P.J. Screening tests for lasso problems. IEEE Trans. Pattern Anal. 2017, 39, 1008–1027. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Fan, J.; Guo, S.; Hao, N. Variance estimation using refitted cross-validation in ultrahigh dimensional regression. J. R. Stat. Soc. Ser. B 2012, 74, 37–65. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Yu, G.; Bien, J. Estimating the error variance in a high-dimensional linear model. Biometrika 2019, 106, 533–546. [Google Scholar] [CrossRef]
- Zou, H.; Hastie, T.; Tibshirani, R. On the “Degrees of freedom” of the lasso. Ann. Stat. 2007, 35, 2173–2192. [Google Scholar] [CrossRef]
- Wang, L.; Leblanc, A. Second-order nonlinear least squares estimation. Ann. Inst. Stat. Math. 2008, 60, 883–900. [Google Scholar] [CrossRef]
- Städler, N.; Bühlmann, P.; van de Geer, S. ℓ1-penalization for mixture regression models. Test 2010, 19, 209–256. [Google Scholar] [CrossRef] [Green Version]
- Sun, T.; Zhang, C.H. Scaled sparse linear regression. Biometrika 2012, 99, 879–898. [Google Scholar] [CrossRef] [Green Version]
- Dicker, L.H. Variance estimation in high-dimensional linear models. Biometrika 2014, 101, 269–284. [Google Scholar] [CrossRef]
- Dicker, L.H.; Erdogdu, M.A. Maximum likelihood for variance estimation in high-dimensional linear models. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, Cadiz, Spain, 9–11 May 2016; pp. 159–167. [Google Scholar]
- Liu, X.; Zheng, S.; Feng, X. Estimation of error variance via ridge regression. Biometrika 2020, 107, 481–488. [Google Scholar] [CrossRef]
- Zhang, C.H.; Huang, J. The sparsity and bias of the lasso selection in high-dimensional linear regression. Ann. Stat. 2008, 36, 1567–1594. [Google Scholar] [CrossRef]
- Bickel, P.J.; Ritov, Y.; Tsybakov, A.B. Simultaneous analysis of lasso and Dantzig selector. Ann. Stat. 2009, 37, 1705–1732. [Google Scholar] [CrossRef]
- Fan, J.; Fan, Y.; Barut, E. Adaptive robust variable selection. Ann. Stat. 2014, 42, 324–351. [Google Scholar] [CrossRef] [PubMed]
- Friedman, J.; Hastie, T.; Tibshirani, R. Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 2010, 33, 1–22. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Fan, J.; Lv, J. Nonconcave penalized likelihood with NP-dimensionality. IEEE Trans. Inform. Theory 2011, 57, 5467–5484. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Giraud, C. Introduction to High-Dimensional Statistics, 1st ed.; Chapman and Hall/CRC: New York, NY, USA, 2014. [Google Scholar]
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).