On the Convergence Rate of the SCAD-Penalized Empirical Likelihood Estimator

Ando, Tomohiro; Sueishi, Naoya

doi:10.3390/econometrics7010015

by Tomohiro Ando ¹ and Naoya Sueishi ²,*

¹ Melbourne Business School, University of Melbourne, 200 Leicester Street, Carlton, Victoria 3053, Australia

² Graduate School of Economics, Kobe University, 2-1 Rokkodai-cho, Nada-ku, Kobe 657-8501, Japan

* Author to whom correspondence should be addressed.

Econometrics 2019, 7(1), 15; https://doi.org/10.3390/econometrics7010015

Submission received: 18 October 2018 / Revised: 18 March 2019 / Accepted: 18 March 2019 / Published: 20 March 2019

Abstract: This paper investigates the asymptotic properties of a penalized empirical likelihood estimator for moment restriction models when the number of parameters ($p_{n}$) and/or the number of moment restrictions increases with the sample size. Our main result is that the SCAD-penalized empirical likelihood estimator is $\sqrt{n / p_{n}}$-consistent under a reasonable condition on the regularization parameter. Our consistency rate is better than the existing ones. This paper also provides sufficient conditions under which $\sqrt{n / p_{n}}$-consistency and an oracle property are satisfied simultaneously. As far as we know, this paper is the first to specify sufficient conditions for both $\sqrt{n / p_{n}}$-consistency and the oracle property of the penalized empirical likelihood estimator.

Keywords: diverging number of parameters; penalized empirical likelihood; sparse models

JEL Classification: C14; C52

1. Introduction

Recently, sparse regression models have received considerable attention in business, economics, genetics, and various other fields. In these models, the number of possible regressors can be potentially large; however, only a relatively small number of these regressors are relevant.

Penalization is an alternative to classical subset selection. One drawback of subset selection is its lack of stability, which stems from its discrete nature: each variable is either retained in or dropped from the model. As a result, a small perturbation of the sample may cause a drastic change in the post-selection results (Breiman 1996). Penalization addresses this issue by performing variable selection and estimation simultaneously through a continuous process.

Several penalization methods have been advocated for linear regression models. Examples include the bridge penalty (Frank and Friedman 1993), LASSO (Tibshirani 1996), the smoothly clipped absolute deviation (SCAD) penalty (Fan and Li 2001), and the elastic net penalty (Zou and Hastie 2005). However, penalized least squares methods are not applicable when endogeneity exists (Fan and Liao 2014). In that case, the parameters of interest are often identified by moment restrictions that use instrumental variables.

This study investigates the asymptotic properties of a penalized empirical likelihood (PEL) estimator for moment restriction models, when the number of parameters and/or the number of moment restrictions increases with the sample size. We extend the EL estimator of Qin and Lawless (1994) by employing the SCAD penalty, so that we can achieve estimation and variable selection simultaneously.

Some penalized estimators for moment restriction models have been proposed in the econometric literature. Caner (2009) and Shi (2016b) considered the GMM estimator with a LASSO-type penalty. Caner and Zhang (2014) proposed the adaptive elastic net GMM estimator. Fan and Liao (2014) proposed the penalized focused GMM estimator. Leng and Tang (2012) and Chang et al. (2015) studied the asymptotic properties of the PEL estimator for independent and weakly dependent observations, respectively. Tang et al. (2018) considered a penalized exponential tilting estimator.

This paper shows that the SCAD-penalized EL estimator is $\sqrt{n / p_{n}}$-consistent, where $p_{n}$ is the number of parameters. Leng and Tang (2012) showed that the non-penalized EL estimator is $\sqrt{n / p_{n}}$-consistent under the assumption that $p_{n} / r_{n} \to c \in (0, 1)$, where $r_{n}$ is the number of moment restrictions. Thus, essentially, they only proved $\sqrt{n / r_{n}}$-consistency. Chang et al. (2015) proved $\sqrt{n / p_{n}}$-consistency of the non-penalized EL estimator without imposing $p_{n} / r_{n} \to c \in (0, 1)$, but they only obtained $\sqrt{n / r_{n}}$-consistency for the PEL estimator. We prove $\sqrt{n / p_{n}}$-consistency of the PEL estimator under a reasonable condition on the regularization parameter of the penalty function. Our result is important because it implies $\sqrt{n}$-consistency of the estimator when $p_{n}$ is fixed and only $r_{n}$ increases with the sample size. This is consistent with previous results in the EL literature such as Donald et al. (2003). In contrast, $\sqrt{n / r_{n}}$-consistency implies that only a slow rate of convergence can be achieved even when $p_{n}$ is finite and fixed.

This paper also shows that the PEL estimator satisfies the oracle property in the sense of Fan and Peng (2004) when the truth is sparse. That is, if the true parameter vector has some zero components, then they are estimated as zeros with probability approaching one, and the other nonzero components are estimated well, similar to the case when the zero components are known a priori. Although Leng and Tang (2012) and Chang et al. (2015) also discussed the oracle property of the PEL estimator, they obtained their results under high-level assumptions. As far as we know, this paper is the first to specify sufficient conditions for both $\sqrt{n / p_{n}}$-consistency and the oracle property of the PEL estimator.

Recently, Chang et al. (2018) proposed an alternative PEL estimator that regularizes both parameters and Lagrange multipliers. Their estimator allows the case where $r_{n}$ and $p_{n}$ increase at an exponential rate, while our PEL estimator allows a polynomial rate only. Their method is useful when the truth is actually sparse. In contrast, our estimator is valid even when the truth is not sparse because $\sqrt{n / p_{n}}$-consistency can be established without imposing sparsity.

There is also a large literature on instrument (moment) selection that addresses the problem of selecting/constructing optimal instruments when a large number of instruments are available (e.g., Donald and Newey 2001; Bai and Ng 2009; Kuersteiner and Okui 2010; Belloni et al. 2012; Caner and Fan 2015; Cheng and Liao 2015; Shi 2016a). In contrast to these papers, here we focus on variable selection in a structural model.

This paper is organized as follows. We first show $\sqrt{n / p_{n}}$-consistency of the SCAD-penalized EL estimator and compare our assumptions with those of Leng and Tang (2012) and Chang et al. (2015). Then, we derive the asymptotic distribution. Our proofs are new in the EL literature. All proofs are given in Appendix A.

2. PEL Estimator and Asymptotic Results

Let $\{y_{1}, \dots, y_{n}\}$ be a random sample from an unknown distribution on $R^{d_{n}}$. This study considers the moment restriction model

E [m (y_{i}, θ_{0})] = 0,

where $θ_{0} = {(θ_{10}, \dots, θ_{p_{n} 0})}^{'} \in Θ_{n}$ is a $p_{n}$-dimensional true parameter and $m (y, θ) = {(m_{1} (y, θ), \dots, m_{r_{n}} (y, θ))}^{'}$ is an $r_{n}$-dimensional moment function. For instance, the model includes the linear instrumental variable model

E [z_{i} (y_{i} - x_{i}^{'} θ_{0})] = 0,

where $z_{i}$ is an $r_{n} \times 1$ vector of instrumental variables and $x_{i}$ is a $p_{n} \times 1$ vector of explanatory variables. We consider the case where $r_{n} \geq p_{n}$. The subscript n indicates that $d_{n}$, $p_{n}$, and $r_{n}$ may increase with the sample size.

The PEL estimator for

θ_{0}

is

{\hat{θ}}_{n} = \arg\min_{θ \in Θ_{n}} \max_{λ \in {\hat{Λ}}_{n} (θ)} \left\{\frac{1}{n} \sum_{i = 1}^{n} \log (1 - λ^{'} m (y_{i}, θ)) + \sum_{j = 1}^{p_{n}} p_{κ_{n}} (θ_{j})\right\},

where

{\hat{Λ}}_{n} (θ) = \{λ \in R^{r_{n}} : λ^{'} m (y_{i}, θ) < 1, i = 1, \dots, n\}

and

p_{κ} (\cdot)

is a penalty function with a regularization parameter

κ

. Thus, the estimator is the same as that of Leng and Tang (2012).
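The inner maximization over λ can be illustrated in the simplest case of a single moment function (r_n = 1). The following is a minimal sketch, not the authors' implementation: the function name `el_inner_max`, the damped Newton scheme, and the tolerances are assumptions made for illustration. It presumes that the values m_i(θ) take both signs, since otherwise the maximum over the constraint set does not exist.

```python
import math


def el_inner_max(m, tol=1e-10, max_iter=100):
    """Maximize (1/n) * sum(log(1 - lam * m_i)) over a scalar lam.

    m holds the values m_i(theta) of one moment function (hypothetical
    scalar case). Returns (lam_hat, maximized objective value).
    """
    n = len(m)

    def objective(lam):
        return sum(math.log(1.0 - lam * mi) for mi in m) / n

    lam = 0.0  # lam = 0 always satisfies lam * m_i < 1
    for _ in range(max_iter):
        w = [1.0 - lam * mi for mi in m]
        grad = -sum(mi / wi for mi, wi in zip(m, w)) / n
        hess = -sum((mi / wi) ** 2 for mi, wi in zip(m, w)) / n
        if abs(grad) < tol or hess == 0.0:
            break
        step = -grad / hess  # Newton direction for a concave objective
        current = objective(lam)
        # Halve the step until the candidate stays in the constraint set
        # {lam : lam * m_i < 1 for all i} and does not lower the objective.
        while abs(step) > 1e-15:
            cand = lam + step
            feasible = all(1.0 - cand * mi > 0.0 for mi in m)
            if feasible and objective(cand) >= current:
                lam = cand
                break
            step /= 2.0
        else:
            break  # no improving feasible step was found
    return lam, objective(lam)
```

For example, with the moment m(y, θ) = y − θ, data (1, 2, 4), and θ = 2, the values are m = (−1, 0, 2) and the first-order condition 1/(1 + λ) = 2/(1 − 2λ) gives λ = −1/4. The outer minimization over θ would then add the penalty term to this profiled value.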

For concreteness, we employ the SCAD penalty of Fan and Li (2001):

p_{κ} (u) = \begin{cases} κ | u | & | u | \leq κ \\ - (u^{2} - 2 a κ | u | + κ^{2}) / [2 (a - 1)] & κ < | u | \leq a κ \\ (a + 1) κ^{2} / 2 & | u | > a κ \end{cases}

for some

a > 2

. Similar asymptotic results can also be obtained with a different penalty function, such as the minimax concave penalty of Zhang (2010).
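The three branches of the SCAD penalty can be written out directly. The sketch below is for illustration only: the function names are hypothetical, and the default a = 3.7 is the value commonly attributed to Fan and Li (2001) rather than something the text above prescribes.

```python
def scad_penalty(u, kappa, a=3.7):
    """SCAD penalty p_kappa(u) for kappa >= 0 and a > 2."""
    if a <= 2:
        raise ValueError("a must exceed 2")
    t = abs(u)
    if t <= kappa:
        # Linear (LASSO-like) part near zero.
        return kappa * t
    if t <= a * kappa:
        # Quadratic part that joins the linear and constant pieces.
        return -(t * t - 2 * a * kappa * t + kappa * kappa) / (2 * (a - 1))
    # Constant part: large coefficients are not penalized further.
    return (a + 1) * kappa * kappa / 2


def scad_derivative(u, kappa, a=3.7):
    """Derivative of the penalty with respect to |u|, for u != 0."""
    t = abs(u)
    if t <= kappa:
        return kappa
    if t <= a * kappa:
        return (a * kappa - t) / (a - 1)
    return 0.0
```

The derivative is zero for |u| > aκ, which is why sufficiently large coefficients are left unshrunk; the proofs rely on this flat region for the nonzero components.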

The true model may be sparse, that is, some elements of

θ_{0}

may be zero. Let

q_{n}

be the number of nonzero elements in

θ_{0}

. Without loss of generality, we can write

θ_{0} = {(θ_{10}^{'}, θ_{20}^{'})}^{'} = {({θ_{10}}^{'}, 0^{'})}^{'}

with

θ_{1} = {(θ_{1}, \dots, θ_{q_{n}})}^{'} \in R^{q_{n}}

and

θ_{2} = {(θ_{q_{n} + 1}, \dots, θ_{p_{n}})}^{'} \in R^{p_{n} - q_{n}}

. For now, the sparsity assumption is not crucial. It is possible that

q_{n} = p_{n}

.

Let

m_{i} (θ) = m (y_{i}, θ)

and

M_{i} (θ) = \partial m_{i} (θ) / \partial θ^{'}

. Also, let

m_{i} = m_{i} (θ_{0})

and

M_{i} = M_{i} (θ_{0})

. We define

Q_{n} (θ, λ) = E [log (1 - λ^{'} m_{i} (θ))]

and

{\hat{Q}}_{n} (θ, λ) = n^{- 1} \sum_{i = 1}^{n} log (1 - λ^{'} m_{i} (θ))

. Moreover, we use

λ (θ)

and

\hat{λ} (θ)

to denote

arg {max}_{λ \in Λ_{n} (θ)} Q_{n} (θ, λ)

and

arg {max}_{λ \in {\hat{Λ}}_{n} (θ)} {\hat{Q}}_{n} (θ, λ)

, respectively, where

Λ_{n} (θ)

is a subset in

R^{r_{n}}

, such that

0 \in int (Λ_{n} (θ))

. Let

λ_{min} (A)

and

λ_{max} (A)

denote the minimum and maximum eigenvalues of a matrix A. Also, let

∥ \cdot ∥

denote the Euclidean (Frobenius) norm.

We impose the following conditions for

\sqrt{n / p_{n}}

-consistency.

Assumption 1.

(i) The true parameter vector

θ_{0}

is the unique minimizer of

Q_{n} (θ, λ (θ))

and belongs to the interior of

Θ_{n}

; (ii) There are positive functions

Δ_{1} (r, p)

and

Δ_{2} (ϵ)

such that for any

ϵ > 0

inf_{{θ \in Θ_{n} : ∥ θ - θ_{0} ∥ > ϵ}} Q_{n} (θ, λ (θ)) \geq Δ_{1} (r_{n}, p_{n}) Δ_{2} (ϵ) > 0,

where

{lim inf}_{n \to \infty} Δ_{1} (r_{n}, p_{n}) > 0

; (iii)

{sup}_{θ \in Θ_{n}} |{\hat{Q}}_{n} (θ, λ (θ)) - Q_{n} (θ, λ (θ))| = o_{p} (Δ_{1} (r_{n}, p_{n}))

.

Assumption 2.

(i)

E [{sup}_{θ \in Θ_{n}} (∥ m_{i} (θ) ∥ r_{n}^{- 1 / 2})^{α}] < \infty

for some

α > 4

; (ii)

{lim}_{n \to \infty} r_{n}^{4} / n = 0

.

Assumption 3.

(i) There exists C such that

0 < 1 / C \leq λ_{min} (E [m_{i} (θ) m_{i} {(θ)}^{'}]) \leq

λ_{max} (E [m_{i} (θ) m_{i} {(θ)}^{'}]) < C < \infty

in a neighborhood of

θ_{0}

; (ii) There exists C such that

λ_{max} (E {[M_{i}]}^{'} E [M_{i}]) < C < \infty

; (iii) There exists C such that

λ_{max} (E [M_{i} (θ) M_{i} {(θ)}^{'}]) < C < \infty

in a neighborhood of

θ_{0}

.

Assumption 4.

(i) The moment function

m (y, θ)

is twice continuously differentiable in

θ

for all y in a neighborhood of

θ_{0}

; (ii) There exists C such that

λ_{min} (\frac{d^{2} {\hat{Q}}_{n} (θ, \hat{λ} (θ))}{d θ d θ^{'}}) \geq C > 0

in a neighborhood of

θ_{0}

with probability approaching one.

Assumption 5.

{lim}_{n \to \infty} \sqrt{q_{n}} κ_{n} / {min}_{1 \leq j \leq q_{n}} | θ_{j 0} | = 0

.

Assumption 1 is similar to condition 2.1 of Chang et al. (2015). Assumption 1 (iii) is an extension of the uniform convergence. If we restrict the parameter space such that

Θ_{n}

is compact and

E [{sup}_{θ \in Θ_{n}} log (1 - λ {(θ)}^{'} m_{i} (θ))] < \infty

, then Assumption 1 (iii) is satisfied with

Δ_{1} (r, p) = 1

. Assumption 1 is used to show that

∥ {\hat{θ}}_{n} - θ_{0} ∥ = o_{p} (1)

. Any condition that guarantees consistency of the estimator can replace Assumption 1.

Assumptions 2 (i) and (ii) are similar to Assumptions 2 and 4 in Leng and Tang (2012). However, we do not assume that

p_{n} / r_{n} \to c \in (0, 1)

. Thus,

r_{n}

can grow faster than

p_{n}

. We can allow the case where

p_{n}

is fixed and only

r_{n}

increases with the sample size.

Assumption 4 states that the objective function of the EL estimator is strictly convex in

θ

in a neighborhood of

θ_{0}

. When

r_{n}

and

p_{n}

are fixed, this condition is satisfied under fairly weak conditions. We can also relax the condition so that

λ_{min} (\frac{d^{2} {\hat{Q}}_{n} (θ, \hat{λ} (θ))}{d θ d θ^{'}}) \geq ρ_{n}

with a positive sequence

ρ_{n}

such that

ρ_{n} \to 0

. In that case, we obtain a different convergence rate of the estimator. Under certain conditions, we have

∥ {\hat{θ}}_{n} - θ_{0} ∥ = O_{p} (\sqrt{p_{n} / n} / ρ_{n})

.

Assumption 5 is similar to condition (B2) in Huang and Xie (2007), who obtained the convergence rate of the SCAD-penalized least squares estimator. Assumption 5 states that the minimum of nonzero elements in

θ_{0}

may converge to 0, but the convergence rate must be sufficiently slow. If nonzero elements are too small compared to

κ_{n}

, then the PEL estimator cannot distinguish between zero and nonzero elements. Following Huang and Xie (2007), we prove

\sqrt{n / p_{n}}

-consistency of the PEL estimator in two steps. We first prove

∥ {\hat{θ}}_{n} - θ_{0} ∥ = O_{p} (\sqrt{p_{n} / n} + \sqrt{q_{n}} κ_{n})

under Assumptions 1–4 and

q_{n} κ_{n}^{2} \to 0

(see Lemma A3 in Appendix A). Then, we improve the convergence rate by using Assumption 5. Notice that if we assume

\sqrt{q_{n}} κ_{n} = O (\sqrt{p_{n} / n})

, then

\sqrt{n / p_{n}}

-consistency of the PEL estimator is obtained immediately from Lemma A3. However, as we will see later, this condition contradicts Assumption 6 (i), which is a key condition for the oracle property. Assumption 5 is imposed so that

\sqrt{n / p_{n}}

-consistency and the oracle property are satisfied simultaneously.
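The conflict between the shortcut condition and Assumption 6 (i) can be checked in one line. This is a short worked step added for illustration, not part of the paper's proofs, and it assumes only that at least one coefficient is nonzero, so that q_n ≥ 1:

```latex
\sqrt{q_{n}}\,\kappa_{n} = O\bigl(\sqrt{p_{n}/n}\bigr)
\;\Longrightarrow\;
\sqrt{n/p_{n}}\,\kappa_{n} = O\bigl(q_{n}^{-1/2}\bigr) = O(1),
\qquad \text{whereas Assumption 6 (i) requires} \qquad
\sqrt{n/p_{n}}\,\kappa_{n} \to \infty .
```

Assumption 6 (i) therefore forces the regularization parameter to be of larger order than $\sqrt{p_{n}/n}$, and the improved rate has to come from Assumption 5 instead.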

Theorem 1.

Suppose that Assumptions 1–5 hold. Then, we have

∥ {\hat{θ}}_{n} - θ_{0} ∥ = O_{p} (\sqrt{p_{n} / n})

.

The sparsity assumption is not necessary for this theorem. The same result is obtained even if all elements in

θ_{0}

are nonzero. Moreover, because Assumption 5 does not exclude

κ_{n} = 0

, the theorem also applies to the non-penalized EL estimator, whose

\sqrt{n / p_{n}}

-consistency has been established by Chang et al. (2015). As we will see in the next theorem, if the truth is sparse, then we obtain

\sqrt{n / q_{n}}

-consistency of the PEL estimator under certain additional assumptions.

Our convergence rate of the PEL estimator is better than that of Chang et al. (2015). Roughly speaking, different convergence rates are based on different equalities. The asymptotic analyses of Leng and Tang (2012) and Chang et al. (2015) are based on the moment equality

E [m_{i}] = 0

, which implies

∥ n^{- 1} \sum_{i = 1}^{n} m_{i} ∥ = O_{p} (\sqrt{r_{n} / n})

. Leng and Tang (2012) obtained

\sqrt{n / p_{n}}

-consistency of the non-penalized EL estimator by assuming

r_{n} = O (p_{n})

and hence

∥ n^{- 1} \sum_{i = 1}^{n} m_{i} ∥ = O_{p} (\sqrt{p_{n} / n})

. On the other hand, our asymptotic analysis is based on the first-order condition

E [\frac{d log (1 - λ {(θ_{0})}^{'} m_{i} (θ_{0}))}{d θ}] = 0

, which implies

∥\frac{d {\hat{Q}}_{n} (θ_{0}, {\hat{λ}}_{n} (θ_{0}))}{d θ}∥ = O_{p} (\sqrt{p_{n} / n})

. Therefore, our proof is not a straightforward extension of that of Leng and Tang (2012) and Chang et al. (2015).

To obtain a convergence rate in line with the proof of Leng and Tang (2012) and Chang et al. (2015), we need a rather strong condition on the regularization parameter. For instance, Chang et al. (2015) assumed that

q_{n} κ_{n} r_{n}^{- 1} n M^{- 1} = O (1)

to prove

\sqrt{n / r_{n}}

-consistency, where M is the block length, which is equal to unity when the observations are independent. The condition of Chang et al. (2015) corresponds to the condition that

\sqrt{q_{n}} κ_{n} = o (\sqrt{p_{n} / n})

in our case. As stated before, although this condition simplifies the proof of

\sqrt{n / p_{n}}

-consistency, it causes a problem for the oracle property of the estimator.

Next, we show sparsity and asymptotic normality of the PEL estimator. Let

{\hat{θ}}_{1 n}

and

{\hat{θ}}_{2 n}

be the corresponding estimators of

θ_{10}

and

θ_{20}

, respectively. Furthermore, let

M_{1 i} = \partial m_{i} (θ_{10}, 0) / \partial θ_{1}^{'}

. We define

V_{n} = {(E {[M_{i}]}^{'} E {[m_{i} m_{i}^{'}]}^{- 1} E [M_{i}])}^{- 1}

and

V_{1 n} = {(E {[M_{1 i}]}^{'} E {[m_{i} m_{i}^{'}]}^{- 1} E [M_{1 i}])}^{- 1}

.

We impose additional conditions.

Assumption 6.

(i)

{lim}_{n \to \infty} \sqrt{n / p_{n}} κ_{n} = \infty

; (ii)

{lim}_{n \to \infty} r_{n} p_{n}^{3 / 2} / \sqrt{n} = 0

.

Assumption 7.

There exists

B_{j k l} (y)

such that

| \partial^{2} m_{l} (y, θ) / \partial θ_{j} \partial θ_{k} | \leq B_{j k l} (y)

and

E [B_{j k l}^{2} (y_{i})] < \infty

for all

j, k = 1, \dots, p_{n}

and

l = 1, \dots, r_{n}

in a neighborhood of

θ_{0}

.

Assumption 8.

There exists C such that

0 < 1 / C \leq λ_{min} (V_{n}) \leq λ_{max} (V_{n}) \leq C < \infty

.

Assumption 6 (i) is a key condition for sparsity of the PEL estimator. It requires that the regularization parameter is not too small so that zero elements in

θ_{0}

are estimated as zero. The same condition is also employed by Leng and Tang (2012).
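As an illustration only, consider a sequence that is our assumption and does not appear in the source: suppose the smallest nonzero coefficient is bounded away from zero, say by a constant c > 0, and take $\kappa_{n} = \sqrt{p_{n}/n}\,\log n$. Then Assumption 6 (i) and Assumption 5 can hold together:

```latex
\sqrt{n/p_{n}}\,\kappa_{n} = \log n \to \infty,
\qquad
\frac{\sqrt{q_{n}}\,\kappa_{n}}{\min_{1 \leq j \leq q_{n}} |\theta_{j0}|}
\leq \frac{\sqrt{q_{n} p_{n}}\,\log n}{c\,\sqrt{n}}
\leq \frac{p_{n} \log n}{c\,\sqrt{n}} \to 0 .
```

The last limit uses $q_{n} \leq p_{n} \leq r_{n}$ together with Assumption 6 (ii), which gives $p_{n}^{5/2} \leq r_{n} p_{n}^{3/2} = o(\sqrt{n})$ and hence $p_{n} = o(n^{1/5})$.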

Theorem 2.

Suppose that Assumptions 1–8 hold. Let

B_{n}

be an

l \times q_{n}

matrix such that

B_{n} B_{n}^{'} \to G

, where G is an

l \times l

matrix with fixed l. Then, the PEL estimator satisfies the following:

1. Sparsity: ${\hat{θ}}_{2 n} = 0$ with probability approaching one.
2. $\sqrt{n / q_{n}}$-consistency: $∥ {\hat{θ}}_{1 n} - θ_{10} ∥ = O_{p} (\sqrt{q_{n} / n})$.
3. Asymptotic normality: $\sqrt{n} B_{n} V_{1 n}^{- 1 / 2} ({\hat{θ}}_{1 n} - θ_{10}) \overset{d}{\to} N (0, G)$.

The selection of the matrix

B_{n}

depends on the parameter of interest. For instance, suppose that the parameter of interest is the first element of

θ_{10}

. Let

{\hat{θ}}_{1 n, 1}

and

θ_{10, 1}

denote the first elements of

{\hat{θ}}_{1 n}

and

θ_{10}

, respectively. Then, we choose

B_{n} = (1, 0, \dots, 0)

and obtain

\sqrt{n} ({\hat{θ}}_{1 n, 1} - θ_{10, 1}) \overset{d}{\to} N (0, v_{11})

, where

v_{11}

is the limit of the first diagonal element of

V_{1 n}

.

Although a detailed proof is given in Appendix A, we sketch the proof of asymptotic normality here. If

λ (θ)

were known, then

θ_{0}

could be estimated by

{\tilde{θ}}_{n} = \arg\min_{θ \in Θ_{n}} \left\{\frac{1}{n} \sum_{i = 1}^{n} \log (1 - λ {(θ)}^{'} m_{i} (θ)) + \sum_{j = 1}^{p_{n}} p_{κ_{n}} (θ_{j})\right\},

which is a penalized maximum likelihood estimator using a least favorable submodel of the moment restriction model (see Sueishi 2016, for instance). Because

{\tilde{θ}}_{n}

is the penalized maximum likelihood estimator, its distribution can be obtained in a manner similar to Fan and Peng (2004). We derive the asymptotic distribution of

{\hat{θ}}_{n}

by showing that

{\hat{θ}}_{n}

is asymptotically equivalent to

{\tilde{θ}}_{n}

.

By modifying the proof of Theorem 2, we can easily obtain the asymptotic distribution of the non-penalized EL estimator. Because this distribution has already been derived by Leng and Tang (2012), we omit the derivation. We see that the efficiency of the PEL estimator for

θ_{10}

is the same as that of the non-penalized EL estimator for which it is known a priori that

θ_{20} = 0

. Thus, our estimator satisfies the oracle property in the sense of Fan and Peng (2004).

Theorem 2 is similar to Theorem 3 of Leng and Tang (2012). However, they proved sparsity by assuming that the PEL estimator is

\sqrt{n / p_{n}}

-consistent. They did not state explicitly the conditions under which the non-penalized and penalized EL estimators have the same convergence rate.

Chang et al. (2015) showed a similar result to Theorem 2 for weakly dependent observations. They obtained

\sqrt{n / r_{n}}

-consistency and sparsity under two separate

κ_{n}

rate conditions. Specifically, they assume: (i)

q_{n} κ_{n} r_{n}^{- 1} n M^{- 1} = O (1)

for

\sqrt{n / r_{n}}

-consistency and (ii)

κ_{n} \sqrt{n / r_{n}} M^{- 1} \to \infty

for sparsity. If condition (ii) is satisfied, however, condition (i) requires that

q_{n} \sqrt{n / r_{n}} \to 0

, which is clearly impossible. This is problematic because their proof of sparsity requires

\sqrt{n / r_{n}}

-consistency of the estimator. We relaxed condition (i) and obtained sufficient conditions under which both

\sqrt{n / p_{n}}

-consistency and sparsity are satisfied.
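The incompatibility of conditions (i) and (ii) follows from taking their ratio. The display below is a worked restatement of the claim above, added for illustration:

```latex
q_{n} \sqrt{\frac{n}{r_{n}}}
= \frac{q_{n}\,\kappa_{n}\, r_{n}^{-1}\, n\, M^{-1}}{\kappa_{n} \sqrt{n / r_{n}}\; M^{-1}}
= \frac{O(1)}{\text{a sequence diverging to } \infty}
\to 0 .
```

Because $q_{n} \geq 1$ whenever some coefficient is nonzero, and $r_{n}$ grows more slowly than n, the left-hand side in fact diverges, so the two conditions cannot hold at once.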

3. Conclusions

We investigated the asymptotic properties of the PEL estimator when the number of parameters and/or the number of moment restrictions increases with the sample size. In particular, we showed that the PEL estimator is

\sqrt{n / p_{n}}

-consistent under a reasonable condition on the regularization parameter. Although we cannot compare our results directly to those of Chang et al. (2015) because they allow weakly dependent observations, our convergence rate improves on the existing ones. In terms of convergence rate, our result is even better than those of Tang et al. (2018) and Chang et al. (2018), because their convergence rates also depend on the number of moment restrictions.

A crucial issue with the PEL estimation concerns selecting the size of the regularization parameter. The asymptotic theory does not tell us how to select the regularization parameter in practice. Although some selection methods are considered by Leng and Tang (2012), Shi (2016b), and Ando and Sueishi (2019), this is still an underdeveloped area of research.

Author Contributions

Both authors contributed equally to this work.

Funding

This research was supported by JSPS KAKENHI Grant Number 15K03396.

Acknowledgments

The authors would like to thank anonymous reviewers for their comments.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Throughout the Appendix, C denotes a generic positive constant which may vary according to context. The qualifier “with probability approaching one” is abbreviated as w.p.a.1. We define

\begin{aligned}
H_{11} (θ, λ) &= E [\frac{\partial^{2} \log (1 - λ^{'} m_{i} (θ))}{\partial θ \partial θ^{'}}] = - E [\frac{\frac{\partial}{\partial θ^{'}} (M_{i} {(θ)}^{'} λ)}{1 - λ^{'} m_{i} (θ)}] - E [\frac{M_{i} {(θ)}^{'} λ λ^{'} M_{i} (θ)}{{(1 - λ^{'} m_{i} (θ))}^{2}}], \\
H_{12} (θ, λ) &= E [\frac{\partial^{2} \log (1 - λ^{'} m_{i} (θ))}{\partial θ \partial λ^{'}}] = - E [\frac{M_{i} {(θ)}^{'}}{1 - λ^{'} m_{i} (θ)}] - E [\frac{M_{i} {(θ)}^{'} λ m_{i} {(θ)}^{'}}{{(1 - λ^{'} m_{i} (θ))}^{2}}], \\
H_{22} (θ, λ) &= E [\frac{\partial^{2} \log (1 - λ^{'} m_{i} (θ))}{\partial λ \partial λ^{'}}] = - E [\frac{m_{i} (θ) m_{i} {(θ)}^{'}}{{(1 - λ^{'} m_{i} (θ))}^{2}}] .
\end{aligned}

We use

{\hat{H}}_{i j} (θ, λ)

to denote the sample analog of

H_{i j} (θ, λ)

. Moreover, we define

{\hat{Q}}_{n} (θ) = {\hat{Q}}_{n} (θ, \hat{λ} (θ))

and

Q_{n} (θ) = Q_{n} (θ, λ (θ))

.

We prepare some lemmas to prove Theorems 1 and 2.

Lemma A1.

Suppose that Assumptions 1, 2 and 3 (i) hold. Then, we have

∥ {\hat{θ}}_{n} - θ_{0} ∥ = o_{p} (1)

if

q_{n} κ_{n}^{2} \to 0

.

Proof of Lemma A1.

Let

ξ

satisfy

1 / α + 1 / 8 \leq ξ < 3 / 8

and let

{\bar{Λ}}_{n} = {λ \in R^{r_{n}} : ∥ λ ∥ \leq n^{- ξ}}

. Then, by Assumption 2, we have

max_{1 \leq i \leq n} sup_{θ \in Θ_{n}} | λ^{'} m_{i} (θ) | \leq n^{- ξ} max_{1 \leq i \leq n} sup_{θ \in Θ_{n}} ∥ m_{i} (θ) ∥ = o_{p} (n^{- ξ + 1 / α} r_{n}^{1 / 2}) = o_{p} (1)

for all

λ \in {\bar{Λ}}_{n}

. Let

\tilde{λ} = arg {max}_{λ \in {\bar{Λ}}_{n}} {\hat{Q}}_{n} (θ_{0}, λ)

. Because Assumptions 2 (ii) and 3 (i) imply

λ_{min} (n^{- 1} \sum_{i = 1}^{n} m_{i} m_{i}^{'}) > C

w.p.a.1, by expanding

log (1 - x)

around

x = 0

, we have

\begin{matrix} 0 \leq {\hat{Q}}_{n} (θ_{0}, \tilde{λ}) \leq - {\tilde{λ}}^{'} {\bar{m}}_{n} - \frac{1}{2} {\tilde{λ}}^{'} \{\frac{1}{n} \sum_{i = 1}^{n} \frac{m_{i} m_{i}^{'}}{{(1 - {\dot{λ}}^{'} m_{i})}^{2}}\} \tilde{λ} \leq ∥ \tilde{λ} ∥ ∥ {\bar{m}}_{n} ∥ - C ∥ \tilde{λ} ∥^{2}, \end{matrix}

(A1)

where

{\bar{m}}_{n} = n^{- 1} \sum_{i = 1}^{n} m_{i}

and

\dot{λ}

lies between

0

and

\tilde{λ}

. Therefore, we obtain

∥ \tilde{λ} ∥ = O_{p} (∥ {\bar{m}}_{n} ∥) = O_{p} (\sqrt{r_{n} / n}) = o_{p} (n^{- 3 / 8})

by Assumption 2 (ii), and hence

\tilde{λ} \in int ({\bar{Λ}}_{n})

. Because

{\bar{Λ}}_{n} \subset {\hat{Λ}}_{n} (θ_{0})

, the concavity of

{\hat{Q}}_{n} (θ_{0}, λ)

implies

\tilde{λ} = \hat{λ} (θ_{0})

. Moreover, we obtain

\begin{matrix} {\hat{Q}}_{n} ({\hat{θ}}_{n}, λ ({\hat{θ}}_{n})) \leq {\hat{Q}}_{n} ({\hat{θ}}_{n}) \leq {\hat{Q}}_{n} (θ_{0}) + \sum_{j = 1}^{p_{n}} p_{κ_{n}} (θ_{j 0}) = o_{p} (1) . \end{matrix}

(A2)

Now, suppose that

{\hat{θ}}_{n}

is not consistent. Then, there exists a subsequence

{n_{k}}

such that

∥ {\hat{θ}}_{n_{k}} - θ_{0} ∥ > ϵ

for some

ϵ > 0

almost surely. By Assumption 1 (iii) and Equation (A2), we have

∥ Q_{n_{k}} ({\hat{θ}}_{n_{k}}) ∥ = o_{p} (Δ_{1} (r_{n_{k}}, p_{n_{k}})) + o_{p} (1)

. In contrast, Assumption 1 (ii) implies

∥ Q_{n_{k}} ({\hat{θ}}_{n_{k}}) ∥ > Δ_{1} (r_{n_{k}}, p_{n_{k}}) Δ_{2} (ϵ)

. Because

{lim inf}_{n \to \infty} Δ_{1} (r_{n}, p_{n}) > 0

, this is a contradiction. Therefore, we have

∥ {\hat{θ}}_{n} - θ_{0} ∥ = o_{p} (1)

. □

Lemma A2.

Suppose that Assumptions 1–3 hold. Then, we have

∥\frac{d {\hat{Q}}_{n} (θ_{0})}{d θ} - \frac{d {\hat{Q}}_{n} (θ_{0}, λ (θ_{0}))}{d θ}∥ = o_{p} (\frac{1}{\sqrt{n}}) .

Proof of Lemma A2.

Let

H_{i j} (θ) = H_{i j} (θ, λ (θ))

and

{\hat{H}}_{i j} (θ) = {\hat{H}}_{i j} (θ, \hat{λ} (θ))

for

i, j = 1, 2

. Also, let

H_{i j} = H_{i j} (θ_{0})

and

{\hat{H}}_{i j} = {\hat{H}}_{i j} (θ_{0})

. Because

λ (θ_{0}) = 0

, we have

\begin{matrix} ∥ {\hat{H}}_{12} - H_{12} ∥ \leq ∥\frac{1}{n} \sum_{i = 1}^{n} \frac{M_{i}^{'} \hat{λ} (θ_{0}) m_{i}^{'}}{{(1 - \hat{λ} {(θ_{0})}^{'} m_{i})}^{2}}∥ + ∥\frac{1}{n} \sum_{i = 1}^{n} \frac{M_{i}}{1 - \hat{λ} {(θ_{0})}^{'} m_{i}} - E [M_{i}]∥ . \end{matrix}

From the proof of Lemma A1, we see that

∥ \hat{λ} (θ_{0}) ∥ = O_{p} (\sqrt{r_{n} / n})

. In addition, it follows from Assumptions 2 (ii) and 3 (iii) that

λ_{max} (n^{- 1} \sum_{i = 1}^{n} M_{i} M_{i}^{'}) < C

w.p.a.1. Because

n^{- 1} \sum_{i = 1}^{n} {∥ m_{i} ∥}^{2} = O_{p} (r_{n})

by Assumption 2 (i), we have

\begin{aligned}
∥\frac{1}{n} \sum_{i = 1}^{n} \frac{M_{i}^{'} \hat{λ} (θ_{0}) m_{i}^{'}}{{(1 - \hat{λ} {(θ_{0})}^{'} m_{i})}^{2}}∥ &\leq C \sqrt{\hat{λ} {(θ_{0})}^{'} (\frac{1}{n} \sum_{i = 1}^{n} M_{i} M_{i}^{'}) \hat{λ} (θ_{0})} \sqrt{\frac{1}{n} \sum_{i = 1}^{n} {∥ m_{i} ∥}^{2}} \\
&= O_{p} (\frac{r_{n}}{\sqrt{n}}) .
\end{aligned}

Furthermore, because

| \hat{λ} {(θ_{0})}^{'} m_{i} | = o_{p} (1)

for all i, we have

{(1 - \hat{λ} {(θ_{0})}^{'} m_{i})}^{- 1} = 1 + \hat{λ} {(θ_{0})}^{'} m_{i} + o_{p} (| \hat{λ} {(θ_{0})}^{'} m_{i} |)

. Hence, we have

\begin{aligned}
& ∥\frac{1}{n} \sum_{i = 1}^{n} \frac{M_{i}}{1 - \hat{λ} {(θ_{0})}^{'} m_{i}} - E [M_{i}]∥ \\
&\quad \leq ∥\frac{1}{n} \sum_{i = 1}^{n} M_{i} - E [M_{i}]∥ + C ∥\frac{1}{n} \sum_{i = 1}^{n} \hat{λ} {(θ_{0})}^{'} m_{i} M_{i}∥ = O_{p} (\frac{r_{n}}{\sqrt{n}}),
\end{aligned}

which implies

∥ {\hat{H}}_{12} - H_{12} ∥ = O_{p} (r_{n} / \sqrt{n})

. Similarly, we have

\begin{aligned}
∥{\hat{H}}_{22} - H_{22}∥ &\leq ∥\frac{1}{n} \sum_{i = 1}^{n} m_{i} m_{i}^{'} - E [m_{i} m_{i}^{'}]∥ + C ∥\frac{1}{n} \sum_{i = 1}^{n} (\hat{λ} {(θ_{0})}^{'} m_{i}) m_{i} m_{i}^{'}∥ \\
&\leq ∥\frac{1}{n} \sum_{i = 1}^{n} m_{i} m_{i}^{'} - E [m_{i} m_{i}^{'}]∥ + C \sqrt{\hat{λ} {(θ_{0})}^{'} (\frac{1}{n} \sum_{i = 1}^{n} m_{i} m_{i}^{'}) \hat{λ} (θ_{0})} \sqrt{\frac{1}{n} \sum_{i = 1}^{n} {∥ m_{i} ∥}^{4}} \\
&= O_{p} (\frac{r_{n}^{3 / 2}}{\sqrt{n}}) .
\end{aligned}

(A3)

By the Taylor expansion,

\begin{aligned}
& \frac{d {\hat{Q}}_{n} (θ_{0})}{d θ} - \frac{d {\hat{Q}}_{n} (θ_{0}, λ (θ_{0}))}{d θ} \\
&\quad = {\left.\frac{d}{d θ} \frac{\partial {\hat{Q}}_{n} (θ, \dot{λ} (θ))}{\partial λ^{'}}\right|}_{θ = θ_{0}} \hat{λ} (θ_{0}) + {(\frac{\partial \hat{λ} (θ_{0})}{\partial θ^{'}} - \frac{\partial λ (θ_{0})}{\partial θ^{'}})}^{'} \frac{\partial {\hat{Q}}_{n} (θ_{0}, \dot{λ} (θ_{0}))}{\partial λ},
\end{aligned}

where

\dot{λ} (θ)

lies between

\hat{λ} (θ)

and

λ (θ)

. By applying the implicit function theorem to the first-order conditions, we obtain

\frac{\partial \hat{λ} (θ_{0})}{\partial θ^{'}} = - {\hat{H}}_{22}^{- 1} {\hat{H}}_{21} \quad \text{and} \quad \frac{\partial λ (θ_{0})}{\partial θ^{'}} = - H_{22}^{- 1} H_{21} .

Here we have

1 / C \leq λ_{min} ({\hat{H}}_{22}) \leq λ_{max} ({\hat{H}}_{22}) < C

by Assumptions 2 (ii) and 3 (i) and Equation (A3) w.p.a.1. Thus, by Assumption 3 (ii), we have

\begin{matrix} ∥\frac{\partial \hat{λ} (θ_{0})}{\partial θ^{'}} - \frac{\partial λ (θ_{0})}{\partial θ^{'}}∥ \leq ∥{\hat{H}}_{22}^{- 1} ({\hat{H}}_{21} - H_{21})∥ + ∥({\hat{H}}_{22}^{- 1} - H_{22}^{- 1}) H_{21}∥ = O_{p} (\frac{r_{n}^{3 / 2}}{\sqrt{n}}) . \end{matrix}

Moreover, some calculation yields

\begin{aligned}
& ∥{\left.\frac{d}{d θ} \frac{\partial {\hat{Q}}_{n} (θ, \dot{λ} (θ))}{\partial λ^{'}}\right|}_{θ = θ_{0}}∥ \\
&\quad = ∥{\hat{H}}_{12} (θ_{0}, \dot{λ} (θ_{0})) + {(\frac{\partial \dot{λ} (θ_{0})}{\partial θ^{'}})}^{'} {\hat{H}}_{22} (θ_{0}, \dot{λ} (θ_{0}))∥ \\
&\quad \leq ∥{\hat{H}}_{12} (θ_{0}, \dot{λ} (θ_{0})) - H_{12}∥ + ∥{(\frac{\partial \dot{λ} (θ_{0})}{\partial θ^{'}} - \frac{\partial λ (θ_{0})}{\partial θ^{'}})}^{'} {\hat{H}}_{22} (θ_{0}, \dot{λ} (θ_{0}))∥ + ∥H_{12} H_{22}^{- 1} ({\hat{H}}_{22} (θ_{0}, \dot{λ} (θ_{0})) - H_{22})∥ \\
&\quad = O_{p} (\frac{r_{n}^{3 / 2}}{\sqrt{n}}) .
\end{aligned}

Combining these results, we obtain

∥\frac{d {\hat{Q}}_{n} (θ_{0})}{d θ} - \frac{d {\hat{Q}}_{n} (θ_{0}, λ (θ_{0}))}{d θ}∥ = O_{p} (\frac{r_{n}^{2}}{n}),

which implies the desired result by Assumption 2 (ii). □
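The rate arithmetic behind the last step of this proof can be spelled out. This is our reading of the argument, added for illustration: each term of the Taylor expansion is treated as the product of a factor of order $r_{n}^{3/2}/\sqrt{n}$ and a factor of order $\sqrt{r_{n}/n}$.

```latex
O_{p}\!\left(\frac{r_{n}^{3/2}}{\sqrt{n}}\right)
\times
O_{p}\!\left(\sqrt{\frac{r_{n}}{n}}\right)
= O_{p}\!\left(\frac{r_{n}^{2}}{n}\right),
\qquad
\sqrt{n} \cdot \frac{r_{n}^{2}}{n}
= \left(\frac{r_{n}^{4}}{n}\right)^{1/2} \to 0
\quad \text{by Assumption 2 (ii)} .
```

Hence the difference is $o_{p}(n^{-1/2})$, as the lemma states.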

Lemma A3.

Suppose that Assumptions 1–4 hold. Then, we have

∥ {\hat{θ}}_{n} - θ_{0} ∥ = O_{p} (\sqrt{p_{n} / n} + \sqrt{q_{n}} κ_{n})

.

Proof of Lemma A3.

We denote

\nabla^{2} {\hat{Q}}_{n} (θ) = d^{2} {\hat{Q}}_{n} (θ) / d θ d θ^{'}

. By Assumption 4,

\nabla^{2} {\hat{Q}}_{n} (θ)

is positive definite in a neighborhood of

θ_{0}

w.p.a.1. By the definition of the PEL estimator, we have

\begin{matrix} {\hat{Q}}_{n} (θ_{0}) + \sum_{j = 1}^{p_{n}} p_{κ_{n}} (θ_{j 0}) \geq {\hat{Q}}_{n} ({\hat{θ}}_{n}) . \end{matrix}

(A4)

Because

p_{κ_{n}} (θ_{j 0}) \leq (a + 1) κ_{n}^{2} / 2

for

j = 1, \dots, q_{n}

and

p_{κ_{n}} (θ_{j 0}) = 0

for

j = q_{n} + 1, \dots, p_{n}

, expanding Equation (A4) yields

\begin{aligned}
0 &\geq 2 \frac{d {\hat{Q}}_{n} (θ_{0})}{d θ^{'}} ({\hat{θ}}_{n} - θ_{0}) + {({\hat{θ}}_{n} - θ_{0})}^{'} \nabla^{2} {\hat{Q}}_{n} ({\dot{θ}}_{n}) ({\hat{θ}}_{n} - θ_{0}) - (a + 1) q_{n} κ_{n}^{2} \\
&= {∥\nabla^{2} {\hat{Q}}_{n}^{1 / 2} ({\dot{θ}}_{n}) ({\hat{θ}}_{n} - θ_{0}) + \nabla^{2} {\hat{Q}}_{n}^{- 1 / 2} ({\dot{θ}}_{n}) \frac{d {\hat{Q}}_{n} (θ_{0})}{d θ}∥}^{2} - \frac{d {\hat{Q}}_{n} (θ_{0})}{d θ^{'}} \nabla^{2} {\hat{Q}}_{n}^{- 1} ({\dot{θ}}_{n}) \frac{d {\hat{Q}}_{n} (θ_{0})}{d θ} - (a + 1) q_{n} κ_{n}^{2}
\end{aligned}

for some

{\dot{θ}}_{n}

located between

{\hat{θ}}_{n}

and

θ_{0}

. Therefore, by Loève’s

C_{2}

-inequality, we obtain

\begin{aligned}
& {∥\nabla^{2} {\hat{Q}}_{n}^{1 / 2} ({\dot{θ}}_{n}) ({\hat{θ}}_{n} - θ_{0})∥}^{2} \\
&\quad \leq 2 {∥\nabla^{2} {\hat{Q}}_{n}^{1 / 2} ({\dot{θ}}_{n}) ({\hat{θ}}_{n} - θ_{0}) + \nabla^{2} {\hat{Q}}_{n}^{- 1 / 2} ({\dot{θ}}_{n}) \frac{d {\hat{Q}}_{n} (θ_{0})}{d θ}∥}^{2} + 2 \frac{d {\hat{Q}}_{n} (θ_{0})}{d θ^{'}} \nabla^{2} {\hat{Q}}_{n}^{- 1} ({\dot{θ}}_{n}) \frac{d {\hat{Q}}_{n} (θ_{0})}{d θ} \\
&\quad \leq 4 \frac{d {\hat{Q}}_{n} (θ_{0})}{d θ^{'}} \nabla^{2} {\hat{Q}}_{n}^{- 1} ({\dot{θ}}_{n}) \frac{d {\hat{Q}}_{n} (θ_{0})}{d θ} + 2 (a + 1) q_{n} κ_{n}^{2} .
\end{aligned}

By Lemma A2, we obtain

∥\frac{d {\hat{Q}}_{n} (θ_{0})}{d θ}∥ = O_{p} (\sqrt{p_{n} / n})

, and hence

\begin{matrix} C ∥ {\hat{θ}}_{n} - θ_{0} ∥^{2} \leq {∥\nabla^{2} {\hat{Q}}_{n}^{1 / 2} ({\dot{θ}}_{n}) ({\hat{θ}}_{n} - θ_{0})∥}^{2} = O_{p} (\frac{p_{n}}{n} + q_{n} κ_{n}^{2}) \end{matrix}

by Assumption 4 (ii). □
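The bound on the penalty at the true parameter used in this proof can be checked numerically. The following is a minimal sketch, assuming the standard SCAD penalty of Fan and Li (2001) with tuning constant a > 2; the function name `scad_penalty`, the grid, and the default a = 3.7 (the value suggested by Fan and Li) are illustrative choices and are not taken from this paper.

```python
import numpy as np


def scad_penalty(theta, kappa, a=3.7):
    """Standard SCAD penalty (Fan and Li 2001), evaluated elementwise.

    Assumed form (hypothetical helper, not code from the paper):
      kappa * |t|                                        if |t| <= kappa
      -(t^2 - 2 a kappa |t| + kappa^2) / (2 (a - 1))     if kappa < |t| <= a kappa
      (a + 1) kappa^2 / 2                                if |t| > a kappa
    """
    t = np.abs(np.asarray(theta, dtype=float))
    linear = kappa * t
    quadratic = -(t**2 - 2.0 * a * kappa * t + kappa**2) / (2.0 * (a - 1.0))
    constant = (a + 1.0) * kappa**2 / 2.0
    return np.where(t <= kappa, linear,
                    np.where(t <= a * kappa, quadratic, constant))


kappa, a = 0.5, 3.7
bound = (a + 1.0) * kappa**2 / 2.0           # the bound (a + 1) kappa_n^2 / 2
grid = np.linspace(-5.0, 5.0, 2001)
values = scad_penalty(grid, kappa, a)

# The penalty never exceeds the bound and vanishes at zero.
print(values.max() <= bound + 1e-12)
print(float(scad_penalty(0.0, kappa, a)) == 0.0)
```

The penalty is flat beyond a·κ, which is why each of the q_n nonzero components contributes at most (a + 1) κ_n² / 2 and each zero component contributes nothing.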

Proof of Theorem 1.

If

\sqrt{q_{n}} κ_{n} = O (\sqrt{p_{n} / n})

, then we trivially have

∥ {\hat{θ}}_{n} - θ_{0} ∥ = O_{p} (\sqrt{p_{n} / n})

by Lemma A3. Thus, we only consider the case where

\sqrt{q_{n}} κ_{n} / \sqrt{p_{n} / n} \to \infty

.

By Lemma A3, we have

∥ {\hat{θ}}_{n} - θ_{0} ∥ = O_{p} (u_{n}) with u_{n} = \sqrt{\frac{p_{n}}{n}} + \sqrt{q_{n}} κ_{n} .

Furthermore, for any M and for any

θ

such that

∥ θ - θ_{0} ∥ \leq 2^{M} u_{n}

, we have

min_{1 \leq j \leq q_{n}} | θ_{j} | \geq min_{1 \leq j \leq q_{n}} | θ_{j 0} | - 2^{M} u_{n} .

By Assumption 5, we have

u_{n} / {min}_{1 \leq j \leq q_{n}} | θ_{j 0} | < 2^{- M - 1}

for sufficiently large n, and hence

min_{1 \leq j \leq q_{n}} | θ_{j} | \geq \frac{1}{2} min_{1 \leq j \leq q_{n}} | θ_{j 0} | .

This implies that

{min}_{1 \leq j \leq q_{n}} | θ_{j} | > a κ_{n}

for sufficiently large n.

Let

{h_{n}}

be a positive sequence that converges to 0 as

n \to \infty

. Following Huang and Xie (2007), we decompose

Θ_{n} \setminus \{θ_{0}\}

into shells

S_{n, k} = {θ : 2^{k - 1} h_{n} \leq ∥ θ - θ_{0} ∥ \leq 2^{k} h_{n}}

for

k = 1, 2, \dots

. For

θ \in S_{n, k}

such that

2^{k} h_{n} \leq 2^{M} u_{n}

, we obtain

\begin{matrix} {\hat{Q}}_{n} (θ) - {\hat{Q}}_{n} (θ_{0}) & = & \frac{d {\hat{Q}}_{n} (θ_{0})}{d θ^{'}} (θ - θ_{0}) + \frac{1}{2} {(θ - θ_{0})}^{'} \nabla^{2} {\hat{Q}}_{n} ({\dot{θ}}_{n}) (θ - θ_{0}) \end{matrix}

and

\begin{matrix} \frac{1}{2} {(θ - θ_{0})}^{'} \nabla^{2} {\hat{Q}}_{n} ({\dot{θ}}_{n}) (θ - θ_{0}) \geq 2^{2 k - 3} C h_{n}^{2} \end{matrix}

(A5)

w.p.a.1. Let

E_{n}

be the event such that Equation (A5) is satisfied. Because Lemma A2 implies that the difference between

\frac{d {\hat{Q}}_{n} (θ_{0})}{d θ}

and

\frac{d {\hat{Q}}_{n} (θ_{0}, λ (θ_{0}))}{d θ}

is asymptotically negligible, we have

\begin{matrix} P (∥ {\hat{θ}}_{n} - θ_{0} ∥ > 2^{L} h_{n}) \\ \leq P (∥ {\hat{θ}}_{n} - θ_{0} ∥ > 2^{M} u_{n}) + P (\{2^{L} h_{n} < ∥ {\hat{θ}}_{n} - θ_{0} ∥ \leq 2^{M} u_{n}\} \cap E_{n}) \\ = o (1) + \sum_{k} P (\{{\hat{θ}}_{n} \in S_{n, k}\} \cap E_{n}) \\ \leq o (1) + \sum_{k} P (\{inf_{θ \in S_{n, k}} {\hat{Q}}_{n} (θ) + \sum_{j = 1}^{p_{n}} p_{κ_{n}} (θ_{j}) \leq {\hat{Q}}_{n} (θ_{0}) + \sum_{j = 1}^{p_{n}} p_{κ_{n}} (θ_{j 0})\} \cap E_{n}) \\ \leq o (1) + \sum_{k} P (sup_{θ \in S_{n, k}} - \frac{d {\hat{Q}}_{n} (θ_{0}, λ (θ_{0}))}{d θ^{'}} (θ - θ_{0}) \geq 2^{2 k - 3} C h_{n}^{2}), \end{matrix}

where

\sum_{k}

stands for

\sum_{k : k > L, 2^{k} h_{n} \leq 2^{M} u_{n}}

. Moreover, a direct calculation yields

\frac{d {\hat{Q}}_{n} (θ_{0}, λ (θ_{0}))}{d θ} = \frac{1}{n} \sum_{i = 1}^{n} E {[M_{i}]}^{'} E {[m_{i} m_{i}^{'}]}^{- 1} m_{i} .

Thus, it follows from the Markov and Cauchy-Schwarz inequalities that

\begin{matrix} \sum_{k} P (\{sup_{θ \in S_{n, k}} - \frac{d {\hat{Q}}_{n} (θ_{0}, λ (θ_{0}))}{d θ^{'}} (θ - θ_{0}) \geq 2^{2 k - 3} C h_{n}^{2}\}) \\ \leq C \sum_{k} \frac{E [{sup}_{θ \in S_{n, k}} |\frac{d {\hat{Q}}_{n} (θ_{0}, λ (θ_{0}))}{d θ^{'}} (θ - θ_{0})|]}{2^{2 k - 3} h_{n}^{2}} \\ \leq C \sum_{k : k > L} \frac{2^{k} h_{n} {(tr {E {[M_{i}]}^{'} E {[m_{i} m_{i}^{'}]}^{- 1} E [M_{i}]} / n)}^{1 / 2}}{2^{2 k - 3} h_{n}^{2}} \\ \leq C \sum_{k : k > L} \frac{\sqrt{p_{n} / n}}{2^{k - 3} h_{n}} . \end{matrix}

Notice that

\sum_{k}

is changed to

\sum_{k : k > L}

in the second inequality. By choosing

h_{n} = \sqrt{p_{n} / n}

, we obtain the desired result. □
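The final step can be made explicit. As a sketch, taking L to be an integer and bounding the restricted sum by the full geometric series:

```latex
\begin{align*}
C \sum_{k : k > L} \frac{\sqrt{p_n / n}}{2^{k - 3} h_n}
  = C \sum_{k = L + 1}^{\infty} 2^{3 - k}
  = C \, 2^{3 - L}
  \qquad \text{when } h_n = \sqrt{p_n / n} .
\end{align*}
% Hence P( \|\hat\theta_n - \theta_0\| > 2^L \sqrt{p_n / n} ) \le o(1) + C 2^{3 - L},
% and the right-hand side can be made arbitrarily small by taking L large,
% which is the statement \|\hat\theta_n - \theta_0\| = O_p(\sqrt{p_n / n}).
```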

Lemma A4.

Suppose that Assumptions 2, 3, 4 (i) and 7 hold. Then, for any

θ

such that

∥ θ - θ_{0} ∥ = O_{p} (\sqrt{p_{n} / n})

, we have

∥\nabla^{2} {\hat{Q}}_{n} (θ) - \nabla^{2} Q_{n} (θ_{0})∥ = O_{p} (\frac{r_{n}^{3 / 2}}{\sqrt{n}}) + O_{p} (\frac{r_{n} p_{n}}{\sqrt{n}}) .

Proof of Lemma A4.

Let

θ

satisfy

∥ θ - θ_{0} ∥ = O_{p} (\sqrt{p_{n} / n})

. By a simple calculation, we obtain

\begin{matrix} \nabla^{2} {\hat{Q}}_{n} (θ) = {\hat{H}}_{11} (θ) - {\hat{H}}_{12} (θ) {\hat{H}}_{22}^{- 1} (θ) {\hat{H}}_{21} (θ) \end{matrix}

and

\begin{matrix} \nabla^{2} Q_{n} (θ_{0}) = H_{11} - H_{12} H_{22}^{- 1} H_{21} = E {[M_{i}]}^{'} E {[m_{i} m_{i}^{'}]}^{- 1} E [M_{i}] . \end{matrix}

Thus, it is sufficient to show that

∥{\hat{H}}_{11} (θ)∥ + ∥- {\hat{H}}_{12} (θ) {\hat{H}}_{22} {(θ)}^{- 1} {\hat{H}}_{21} (θ) - E {[M_{i}]}^{'} E {[m_{i} m_{i}^{'}]}^{- 1} E [M_{i}]∥ = O_{p} (\frac{r_{n}^{3 / 2}}{\sqrt{n}}) + O_{p} (\frac{r_{n} p_{n}}{\sqrt{n}}) .

By an argument similar to that used for Equation (A1), we have

∥ \hat{λ} (θ) ∥ = O_{p} (\sqrt{r_{n} / n})

. Also, the

(j, k)

element of

\frac{\partial}{\partial θ^{'}} (M_{i} {(θ)}^{'} \hat{λ} (θ))

is given by

\sum_{l = 1}^{r_{n}} \partial^{2} m_{l} (y_{i}, θ) / \partial θ_{j} \partial θ_{k} {\hat{λ}}_{l} (θ)

and

|\frac{1}{n} \sum_{i = 1}^{n} \sum_{l = 1}^{r_{n}} \frac{\partial^{2} m_{l} (y_{i}, θ)}{\partial θ_{j} \partial θ_{k}} {\hat{λ}}_{l} (θ)| \leq \sqrt{\frac{1}{n} \sum_{i = 1}^{n} \sum_{l = 1}^{r_{n}} B_{j k l}^{2} (y_{i})} ∥ \hat{λ} (θ) ∥ = O_{p} (\frac{r_{n}}{\sqrt{n}})

by Assumption 7. Therefore, we have

\begin{matrix} ∥{\hat{H}}_{11} (θ)∥ & \leq & C ∥\frac{1}{n} \sum_{i = 1}^{n} \frac{\partial}{\partial θ^{'}} (M_{i} {(θ)}^{'} \hat{λ} (θ))∥ + C ∥\frac{1}{n} \sum_{i = 1}^{n} M_{i} {(θ)}^{'} \hat{λ} (θ) \hat{λ} {(θ)}^{'} M_{i} (θ)∥ \\ & = & O_{p} (\frac{r_{n} p_{n}}{\sqrt{n}}) . \end{matrix}

Moreover, by calculations similar to those in the proof of Lemma A2, we obtain

\begin{matrix} ∥- {\hat{H}}_{12} (θ) - E [M_{i}]∥ & \leq & ∥\frac{1}{n} \sum_{i = 1}^{n} M_{i} (θ) - \frac{1}{n} \sum_{i = 1}^{n} M_{i}∥ + O_{p} (\frac{r_{n}}{\sqrt{n}}) \\ \leq & \sqrt{\frac{1}{n} \sum_{i = 1}^{n} \sum_{j = 1}^{p_{n}} {∥\frac{\partial M_{i} (\dot{θ})}{\partial θ_{j}}∥}^{2}} ∥ θ - θ_{0} ∥ + O_{p} (\frac{r_{n}}{\sqrt{n}}) \\ = & O_{p} (\frac{r_{n}^{1 / 2} p_{n}^{3 / 2}}{\sqrt{n}}) + O_{p} (\frac{r_{n}}{\sqrt{n}}) \end{matrix}

and

\begin{matrix} ∥- {\hat{H}}_{22} (θ) - E [m_{i} m_{i}^{'}]∥ \\ \leq ∥\frac{1}{n} \sum_{i = 1}^{n} m_{i} (θ) m_{i} {(θ)}^{'} - \frac{1}{n} \sum_{i = 1}^{n} m_{i} m_{i}^{'}∥ + O_{p} (\frac{r_{n}^{3 / 2}}{\sqrt{n}}) \\ \leq 2 ∥\frac{1}{n} \sum_{i = 1}^{n} m_{i}^{'} M_{i} (\dot{θ}) (θ - θ_{0})∥ + {(θ - θ_{0})}^{'} (\frac{1}{n} \sum_{i = 1}^{n} M_{i} (\dot{θ}) M_{i}^{'} (\dot{θ})) (θ - θ_{0}) + O_{p} (\frac{r_{n}^{3 / 2}}{\sqrt{n}}) \\ = O_{p} (\frac{r_{n}^{3 / 2}}{\sqrt{n}}) \end{matrix}

for some

\dot{θ}

that is located between

θ

and

θ_{0}

. Hence, we obtain the result. □
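The block formula for the Hessian of the profiled objective is a Schur complement, and its population form can be checked numerically. The sketch below assumes the limiting blocks H_11 = 0, H_12 = -E[M_i]' and H_22 = -E[m_i m_i'], which is what the bounds in the proof suggest; the matrices `G` and `Omega` are randomly generated stand-ins for E[M_i] and E[m_i m_i'], and all names are hypothetical.

```python
import numpy as np


def profiled_hessian(H11, H12, H22):
    """Schur complement H11 - H12 H22^{-1} H21 with H21 = H12'."""
    return H11 - H12 @ np.linalg.solve(H22, H12.T)


rng = np.random.default_rng(0)
r, p = 5, 3                          # illustrative numbers of moments and parameters

G = rng.normal(size=(r, p))          # stand-in for E[M_i] (r x p)
A = rng.normal(size=(r, r))
Omega = A @ A.T + r * np.eye(r)      # stand-in for E[m_i m_i'], positive definite

# Assumed limiting blocks (sign convention inferred from the proof, not quoted).
H11 = np.zeros((p, p))
H12 = -G.T
H22 = -Omega

lhs = profiled_hessian(H11, H12, H22)
rhs = G.T @ np.linalg.solve(Omega, G)     # E[M_i]' E[m_i m_i']^{-1} E[M_i]

print(np.allclose(lhs, rhs))
print(np.linalg.eigvalsh(rhs).min() > 0)  # positive definite when G has full column rank
```

Under this sign convention the two minus signs cancel, so the profiled Hessian equals E[M_i]' E[m_i m_i']^{-1} E[M_i], which is positive definite whenever E[M_i] has full column rank.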

Proof of Theorem 2.

We first prove sparsity. Theorem 1 and Assumption 6 (i) imply that

∥ {\hat{θ}}_{n} - θ_{0} ∥ \leq κ_{n}

w.p.a.1. Thus, it is sufficient to show that w.p.a.1,

\begin{matrix} \frac{d {\hat{Q}}_{n} (θ_{0} + v)}{d θ_{j}} + p_{κ_{n}}^{'} (v_{j}) > 0 (0 < v_{j} < κ_{n}) \\ \frac{d {\hat{Q}}_{n} (θ_{0} + v)}{d θ_{j}} + p_{κ_{n}}^{'} (v_{j}) < 0 (- κ_{n} < v_{j} < 0) \end{matrix}

for any

v = {(v_{1}, \dots, v_{p_{n}})}^{'}

such that

∥ v ∥ = O (\sqrt{p_{n} / n})

and for

j = q_{n} + 1, \dots, p_{n}

. Because

p_{κ_{n}}^{'} (u) = κ_{n} sgn (u)

for

| u | \leq κ_{n}

, we have

\begin{matrix} \frac{d {\hat{Q}}_{n} (θ_{0} + v)}{d θ_{j}} + p_{κ_{n}}^{'} (v_{j}) & = & \frac{d {\hat{Q}}_{n} (θ_{0})}{d θ_{j}} + \frac{d^{2} {\hat{Q}}_{n} (θ_{0} + \dot{v})}{d θ_{j} d θ^{'}} v + κ_{n} sgn (v_{j}) \\ \equiv & I_{1} + I_{2} + I_{3} \end{matrix}

for

j = q_{n} + 1, \dots, p_{n}

and for some

\dot{v}

such that

∥ \dot{v} ∥ = O_{p} (\sqrt{p_{n} / n})

. By Lemma A2, we have

| I_{1} | = O_{p} (\sqrt{p_{n} / n})

. Moreover, by Assumption 8 and Lemma A4, we have

∥\frac{d^{2} {\hat{Q}}_{n} (θ_{0} + \dot{v})}{d θ_{j} d θ^{'}}∥ = O_{p} (1),

and thus

| I_{2} | = O_{p} (\sqrt{p_{n} / n})

. Therefore,

I_{1}

and

I_{2}

are asymptotically dominated by

I_{3}

. The sign of

d {\hat{Q}}_{n} (θ_{0} + v) / d θ_{j} + p_{κ_{n}}^{'} (v_{j})

is determined by the sign of

v_{j}

.
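The domination step can be spelled out by dividing through by κ_n. This sketch assumes that \sqrt{p_{n} / n} / κ_{n} \to 0, which is what the use of Assumption 6 (i) at the start of the proof suggests.

```latex
\begin{align*}
\frac{1}{\kappa_n} \Bigl( \frac{d \hat{Q}_n(\theta_0 + v)}{d \theta_j} + p'_{\kappa_n}(v_j) \Bigr)
  &= \frac{I_1 + I_2}{\kappa_n} + \operatorname{sgn}(v_j) \\
  &= O_p\Bigl( \frac{\sqrt{p_n / n}}{\kappa_n} \Bigr) + \operatorname{sgn}(v_j)
   = o_p(1) + \operatorname{sgn}(v_j) .
\end{align*}
% So, w.p.a.1, the derivative is positive for 0 < v_j < \kappa_n and negative
% for -\kappa_n < v_j < 0, and the penalized objective is minimized at v_j = 0
% for j = q_n + 1, \dots, p_n.
```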

Next, we show asymptotic normality. Let

{\hat{Q}}_{1 n} (θ_{1}) = {\hat{Q}}_{n} (θ_{1}, 0)

. Lemma A3 and Assumption 5 imply that

{min}_{1 \leq j \leq q_{n}} | {\hat{θ}}_{j} | > a κ_{n}

w.p.a.1. Moreover, we have

P ({\hat{θ}}_{2 n} = 0) \to 1

. Thus, expanding the first-order condition for

{\hat{θ}}_{1 n}

yields

0 = \frac{d {\hat{Q}}_{1 n} (θ_{10})}{d θ_{1}} + \frac{d^{2} {\hat{Q}}_{1 n} ({\dot{θ}}_{1 n})}{d θ_{1} d θ_{1}^{'}} ({\hat{θ}}_{1 n} - θ_{10})

for some

{\dot{θ}}_{1 n}

that is located between

{\hat{θ}}_{1 n}

and

θ_{10}

. Combining this with Lemmas A2 and A4 and Assumptions 2 (ii) and 6 (ii), we have

\begin{matrix} V_{1 n}^{- 1} ({\hat{θ}}_{1 n} - θ_{10}) = - \frac{d {\hat{Q}}_{n} (θ_{0}, λ (θ_{0}))}{d θ_{1}} + o_{p} (\frac{1}{\sqrt{n}}), \end{matrix}

which immediately implies that

∥ {\hat{θ}}_{1 n} - θ_{10} ∥ = O_{p} (\sqrt{q_{n} / n})

. Moreover, because

tr (B_{n} V_{1 n} B_{n}^{'}) < C tr (B_{n} B_{n}^{'}) < C

by the assumption of Theorem 2 and Assumption 8, we have

\begin{matrix} \sqrt{n} B_{n} V_{1 n}^{- 1 / 2} ({\hat{θ}}_{1 n} - θ_{10}) & = & - \sqrt{n} B_{n} V_{1 n}^{1 / 2} \frac{d {\hat{Q}}_{n} (θ_{0}, λ (θ_{0}))}{d θ_{1}} + o_{p} (∥ B_{n} V_{1 n}^{1 / 2} ∥) \\ = & \sum_{i = 1}^{n} z_{n i} + o_{p} (1), \end{matrix}

where

z_{n i} = - \frac{1}{\sqrt{n}} B_{n} V_{1 n}^{1 / 2} E {[M_{1 i}]}^{'} E {[m_{i} m_{i}^{'}]}^{- 1} m_{i} .

Here, by Assumptions 2 (i) and 8, we have

\begin{matrix} E [∥ z_{n i} ∥^{4}] & = & \frac{1}{n^{2}} E [{\{m_{i}^{'} E {[m_{i} m_{i}^{'}]}^{- 1} E [M_{1 i}] V_{1 n}^{1 / 2} B_{n}^{'} B_{n} V_{1 n}^{1 / 2} E {[M_{1 i}]}^{'} E {[m_{i} m_{i}^{'}]}^{- 1} m_{i}\}}^{2}] \\ & \leq & \frac{C}{n^{2}} E [{(m_{i}^{'} m_{i})}^{2}] \\ & = & O (\frac{r_{n}^{2}}{n^{2}}) . \end{matrix}

Furthermore, because

B_{n} B_{n}^{'} \to G

, we have

\sum_{i = 1}^{n} E [z_{n i} z_{n i}^{'}] \to G

and

\begin{matrix} P (∥ z_{n i} ∥ > ϵ) \leq \frac{E [z_{n i}^{'} z_{n i}]}{ϵ^{2}} = O (\frac{1}{n}) . \end{matrix}

Therefore, we obtain

\begin{matrix} \sum_{i = 1}^{n} E [∥ z_{n i} ∥^{2} 1 \{∥ z_{n i} ∥ > ϵ\}] \leq n E {[∥ z_{n i} ∥^{4}]}^{1 / 2} {P (∥ z_{n i} ∥ > ϵ)}^{1 / 2} = o (1), \end{matrix}

and thus

\sum_{i = 1}^{n} z_{n i} \overset{d}{\to} N (0, G)

by the Lindeberg-Feller central limit theorem. □
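The order of the Lindeberg term follows by combining the two bounds derived above. As a sketch, assuming that the rate conditions of the paper imply r_{n} / \sqrt{n} \to 0 (those conditions are stated outside this appendix):

```latex
\begin{align*}
n \, E\bigl[ \|z_{ni}\|^4 \bigr]^{1/2} \, P\bigl( \|z_{ni}\| > \epsilon \bigr)^{1/2}
  = n \cdot O\Bigl( \frac{r_n}{n} \Bigr) \cdot O\Bigl( \frac{1}{\sqrt{n}} \Bigr)
  = O\Bigl( \frac{r_n}{\sqrt{n}} \Bigr)
  = o(1) .
\end{align*}
```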

References

Ando, Tomohiro, and Naoya Sueishi. 2019. Regularization parameter selection for penalized empirical likelihood estimator. Economics Letters 178: 1–4.
Bai, Jushan, and Serena Ng. 2009. Selecting instrumental variables in a data rich environment. Journal of Time Series Econometrics 1: 4.
Belloni, Alexandre, Daniel Chen, Victor Chernozhukov, and Christian Hansen. 2012. Sparse models and methods for optimal instruments with an application to eminent domain. Econometrica 80: 2369–429.
Breiman, Leo. 1996. Heuristics of instability and stabilization in model selection. Annals of Statistics 24: 2350–83.
Caner, Mehmet, and Qingliang Fan. 2015. Hybrid generalized empirical likelihood estimators: Instrument selection with adaptive lasso. Journal of Econometrics 187: 256–74.
Caner, Mehmet, and Hao Helen Zhang. 2014. Adaptive elastic net for generalized methods of moments. Journal of Business & Economic Statistics 32: 30–47.
Caner, Mehmet. 2009. Lasso-type GMM estimator. Econometric Theory 25: 270–90.
Chang, Jinyuan, Song Xi Chen, and Xiaohong Chen. 2015. High dimensional generalized empirical likelihood for moment restrictions with dependent data. Journal of Econometrics 185: 283–304.
Chang, Jinyuan, Cheng Yong Tang, and Tong Tong Wu. 2018. A new scope of penalized empirical likelihood with high-dimensional estimating equations. Annals of Statistics 46: 3185–216.
Cheng, Xu, and Zhipeng Liao. 2015. Select the valid and relevant moments: An information-based lasso for GMM with many moments. Journal of Econometrics 186: 443–64.
Donald, Stephen G., and Whitney K. Newey. 2001. Choosing the number of instruments. Econometrica 69: 1161–91.
Donald, Stephen G., Guido W. Imbens, and Whitney K. Newey. 2003. Empirical likelihood estimation and consistent tests with conditional moment restrictions. Journal of Econometrics 117: 55–93.
Fan, Jianqing, and Yuan Liao. 2014. Endogeneity in high dimensions. Annals of Statistics 42: 872–917.
Fan, Jianqing, and Runze Li. 2001. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 96: 1348–60.
Fan, Jianqing, and Heng Peng. 2004. Nonconcave penalized likelihood with a diverging number of parameters. Annals of Statistics 32: 928–61.
Frank, Ildiko E., and Jerome H. Friedman. 1993. A statistical view of some chemometrics regression tools. Technometrics 35: 109–35.
Huang, Jian, and Huiliang Xie. 2007. Asymptotic oracle properties of SCAD-penalized least squares estimators. IMS Lecture Notes–Monograph Series 55: 149–66.
Kuersteiner, Guido, and Ryo Okui. 2010. Constructing optimal instruments by first-stage prediction averaging. Econometrica 78: 697–718.
Leng, Chenlei, and Cheng Yong Tang. 2012. Penalized empirical likelihood and growing dimensional general estimating equations. Biometrika 99: 703–16.
Qin, Jin, and Jerry Lawless. 1994. Empirical likelihood and general estimating equations. Annals of Statistics 22: 300–25.
Shi, Zhentao. 2016a. Econometric estimation with high-dimensional moment equalities. Journal of Econometrics 195: 104–19.
Shi, Zhentao. 2016b. Estimation of sparse structural parameter with many endogenous variables. Econometric Reviews 35: 1582–608.
Sueishi, Naoya. 2016. A simple derivation of the efficiency bound for conditional moment restriction models. Economics Letters 138: 57–59.
Tang, Niansheng, Xiaodong Yan, and Puying Zhao. 2018. Exponentially tilted likelihood inference on growing dimensional unconditional moment models. Journal of Econometrics 202: 57–74.
Tibshirani, Robert. 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B 58: 267–88.
Zhang, Cun-Hui. 2010. Nearly unbiased variable selection under minimax concave penalty. Annals of Statistics 38: 894–942.
Zou, Hui, and Trevor Hastie. 2005. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B 67: 301–20.

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
