Robust-BD Estimation and Inference for General Partially Linear Models

Chunming Zhang; Zhengjun Zhang

doi:10.3390/e19110625

and

Department of Statistics, University of Wisconsin-Madison, Madison, WI 53706, USA

^*

Author to whom correspondence should be addressed.

Entropy2017, 19(11), 625;https://doi.org/10.3390/e19110625

This article belongs to the Special Issue New Developments in Statistical Information Theory Based on Entropy and Divergence Measures

Version Notes

Order Reprints

Abstract

The classical quadratic loss for the partially linear model (PLM) and the likelihood function for the generalized PLM are not resistant to outliers. This inspires us to propose a class of “robust-Bregman divergence (BD)” estimators of both the parametric and nonparametric components in the general partially linear model (GPLM), which allows the distribution of the response variable to be partially specified, without being fully known. Using the local-polynomial function estimation method, we propose a computationally-efficient procedure for obtaining “robust-BD” estimators and establish the consistency and asymptotic normality of the “robust-BD” estimator of the parametric component

β_{o}

. For inference procedures of

β_{o}

in the GPLM, we show that the Wald-type test statistic

W_{n}

constructed from the “robust-BD” estimators is asymptotically distribution free under the null, whereas the likelihood ratio-type test statistic

Λ_{n}

is not. This provides an insight into the distinction from the asymptotic equivalence (Fan and Huang 2005) between

W_{n}

and

Λ_{n}

in the PLM constructed from profile least-squares estimators using the non-robust quadratic loss. Numerical examples illustrate the computational effectiveness of the proposed “robust-BD” estimators and robust Wald-type test in the appearance of outlying observations.

Keywords:

Bregman divergence; generalized linear model; local-polynomial regression; model check; nonparametric test; quasi-likelihood; semiparametric model; Wald statistic

1. Introduction

Semiparametric models, such as the partially linear model (PLM) and generalized PLM, play an important role in statistics, biostatistics, economics and engineering studies [1,2,3,4,5]. For the response variable Y and covariates

(X, T)

, where

X = {(X_{1}, \dots, X_{d})}^{T} \in R^{d}

and

T \in T \subseteq R^{D}

, the PLM, which is widely used for continuous responses Y, describes the model structure according to:

Y = X^{T} β_{o} + η^{o} (T) + ϵ, E (ϵ ∣ X, T) = 0,

(1)

where

β_{o} = {(β_{1; o}, \dots, β_{d; o})}^{T}

is a vector of unknown parameters and

η^{o} (\cdot)

is an unknown smooth function; the generalized PLM, which is more suited to discrete responses Y and extends the generalized linear model [6], assumes:

\begin{matrix} m (x, t) = E (Y ∣ X = x, T = t) = F^{- 1} (x^{T} β_{o} + η^{o} (t)), \end{matrix}

(2)

\begin{matrix} Y ∣ (X, T) \sim exponential family of distributions, \end{matrix}

(3)

where F is a known link function. Typically, the parametric component

β_{o}

is of primary interest, while the nonparametric component

η^{o} (\cdot)

serves as a nuisance function. For illustration clarity, this paper focuses on

D = 1

. An important application of PLM to brain fMRI data was given in [7] for detecting activated brain voxels in response to external stimuli. There,

β_{o}

corresponds to the part of hemodynamic response values, which is the object of primary interest to neuroscientists;

η^{o} (\cdot)

is the slowly drifting baseline of time. Determining whether a voxel is activated or not can be formulated as testing for the linear form of hypotheses,

H_{0} : A β_{o} = g_{0} versus H_{1} : A β_{o} \neq g_{0},

(4)

where

A

is a given

k \times d

full row rank matrix and

g_{0}

is a known

k \times 1

vector.

Estimation of the parametric and nonparametric components of PLM and generalized PLM has received much attention in the literature. On the other hand, the existing work has some limitations: (i) The generalized PLM assumes that

Y ∣ (X, T)

follows the distribution in (3), so that the likelihood function is fully available. From the practical viewpoint, results from the generalized PLM are not applicable to situations where the distribution of

Y ∣ (X, T)

either departs from (3) or is incompletely known. (ii) Some commonly-used error measures, such as the quadratic loss in PLM for Gaussian-type responses (see for example [7,8]) and the (negative) likelihood function used in the generalized PLM, are not resistant to outliers. The work in [9] studied robust inference based on the kernel regression method for the generalized PLM with a canonical link, based on either the (negative) likelihood or (negative) quasi-likelihood as the error measure, and illustrated numerical examples with the dimension

d = 1

. However, the quasi-likelihood is not suitable for the exponential loss function (defined in Section 2.1), commonly used in machine learning and data mining. (iii) The work in [8] developed the inference of (4) for PLM, via the classical quadratic loss as the error measure, and demonstrated that the asymptotic distributions of the likelihood ratio-type statistic and Wald statistic under the null of (4) are both

χ_{k}^{2}

. It remains unknown whether this conclusion holds when the tests are constructed based on robust estimators.

Without completely specifying the distribution of

Y ∣ (X, T)

, we assume:

var (Y ∣ X = x, T = t) = V (m (x, t)),

(5)

with a known functional form of

V (\cdot)

. We refer to a model specified by (2) and (5) as the “general partially linear model” (GPLM). This paper aims to develop robust estimation of GPLM and robust inference of

β_{o}

, allowing the distribution of

Y ∣ (X, T)

to be partially specified. To introduce robust estimation, we adopt a broader class of robust error measures, called “robust-Bregman divergence (BD)” developed in [10], for a GLM, in which BD includes the quadratic loss, the (negative) quasi-likelihood, the exponential loss and many other commonly-used error measures as special cases. We propose the “robust-BD estimators” for both the parametric and nonparametric components of the GPLM. Distinct from the explicit-form estimators for PLM using the classical quadratic loss (see [8]), the “robust-BD estimators” for GPLM do not have closed-form expressions, which makes the theoretical derivation challenging. Moreover, the robust-BD estimators, as numerical solutions to non-linear optimization problems, pose key implementation challenges. Our major contributions are given below.

The robust fitting of the nonparametric component $η^{o} (\cdot)$ is formulated using the local-polynomial regression technique [11]. See Section 2.3.
We develop a coordinate descent algorithm for the robust-BD estimator of $β_{o}$ , which is computationally efficient particularly when the dimension d is large. See Section 3.
Theorems 1 and 2 demonstrate that under the GPLM, the consistency and asymptotic normality of the proposed robust-BD estimator for $β_{o}$ are achieved. See Section 4.
For robust inference of $β_{o}$ , we propose a robust version of the Wald-type test statistic $W_{n}$ , based on the robust-BD estimators, and justify its validity in Theorems 3–5. It is shown to be asymptotically $χ^{2}$ (central) under the null, thus distribution free, and $χ^{2}$ (noncentral) under the contiguous alternatives. Hence, this result, when applied to the exponential loss, as well as other loss functions in the wider class of BD, is practically feasible. See Section 5.1.
For robust inference of $β_{o}$ , we re-examine the likelihood ratio-type test statistic $Λ_{n}$ , constructed by replacing the negative log-likelihood with the robust-BD. Our Theorem 6 reveals that the asymptotic null distribution of $Λ_{n}$ is generally not $χ^{2}$ , but a linear combination of independent $χ^{2}$ variables, with weights relying on unknown quantities. Even in the particular case of using the classical-BD, the limit distribution is not invariant with re-scaling the generating function of the BD. Moreover, the limit null distribution of $Λ_{n}$ (in either the non-robust or robust version) using the exponential loss, which does not belong to the (negative) quasi-likelihood, but falls in BD, is always a weighted $χ^{2}$ , thus limiting its use in practical applications. See Section 5.2.

Simulation studies in Section 6 demonstrate that the proposed class of robust-BD estimators and robust Wald-type test either compare well with or perform better than the classical non-robust counterparts: the former is less sensitive to outliers than the latter, and both perform comparably well for non-contaminated cases. Section 7 illustrates some real data applications. Section 8 ends the paper with brief discussions. Details of technical derivations are relegated to Appendix A.

2. Robust-BD and Robust-BD Estimators

This section starts with a brief review of BD in Section 2.1 and “robust-BD” in Section 2.2, followed by the proposed “robust-BD” estimators of

η^{o} (\cdot)

and

β_{o}

in Section 2.3 and Section 2.4.

2.1. Classical-BD

To broaden the scope of robust estimation and inference, we consider a class of error measures motivated from the Bregman divergence (BD). For a given concave q-function, [12] defined a bivariate function,

Q_{q} (ν, μ) = - q (ν) + q (μ) + (ν - μ) q^{'} (μ) .

(6)

We call

Q_{q}

the BD and call q the generating q-function of the BD. For example, a function

q (μ) = a μ - μ^{2}

for some constant a yields the quadratic loss

Q_{q} (Y, μ) = {(Y - μ)}^{2}

. For a binary response variable Y,

q (μ) = \min {μ, (1 - μ)}

gives the misclassification loss

Q_{q} (Y, μ) = I {Y \neq I (μ > 1 / 2)}

, where

I (\cdot)

is an indicator function;

q (μ) = - 2 {μ \log (μ) + (1 - μ) \log (1 - μ)}

gives the Bernoulli deviance loss log-likelihood

Q_{q} (Y, μ) = - 2 {Y \log (μ) + (1 - Y) \log (1 - μ)}

;

q (μ) = 2 \min {μ, (1 - μ)}

results in the hinge loss

Q_{q} (Y, μ) = \max {1 - (2 Y - 1) sign (μ - 0.5), 0}

of the support vector machine;

q (μ) = 2 {μ (1 - μ)}^{1 / 2}

yields the exponential loss

Q_{q} (Y, μ) = \exp [- (Y - 0.5) \log {μ / (1 - μ)}]

used in AdaBoost [13]. Moreover, [14] showed that if:

q (μ) = \int_{a}^{μ} \frac{s - μ}{V (s)} d s,

(7)

with a finite constant a such that the integral is well defined, then

Q_{q} (y, μ)

matches the “classical (negative) quasi-likelihood” function.

2.2. Robust-BD $ρ_{q} (y, μ)$

Let

r (y, μ) = (y - μ) / \sqrt{V (μ)}

denote the Pearson residual, which reduces to the standardized residual for linear models. In contrast to the “classical-BD”, denoted by

Q_{q}

in (6), the “robust-BD” developed in [10] for a GLM [6], is formed by:

ρ_{q} (y, μ) = \int_{y}^{μ} ψ (r (y, s)) {q^{″} (s) \sqrt{V (s)}} d s - G (μ),

(8)

where

ψ (r)

is chosen to be a bounded, odd function, such as the Huber

ψ

-function [15],

ψ (r) = r \min (1, c / | r |)

, and the bias-correction term,

G (μ)

, entails the Fisher consistency of the parameter estimator and satisfies:

G^{'} (μ) = G_{1}^{'} (μ) {q^{″} (μ) \sqrt{V (μ)}},

with

G_{1}^{'} (m (x, t)) = E {ψ (r (Y, m (x, t))) ∣ X = x, T = t} .

(9)

We make the following discussions regarding features of the “robust-BD”. To facilitate the discussion, we first introduce some necessary notation. Assume that the quantities:

p_{j} (y; θ) = \frac{\partial^{j}}{\partial θ^{j}} ρ_{q} (y, F^{- 1} (θ)), j = 0, 1, \dots,

(10)

exist finitely up to any order required. Then, we have the following expressions,

\begin{matrix} p_{1} (y; θ) & = & {ψ (r (y, μ)) - G_{1}^{'} (μ)} {q^{″} (μ) \sqrt{V (μ)}} / F^{'} (μ), \\ p_{2} (y; θ) & = & A_{0} (y, μ) + {ψ (r (y, μ)) - G_{1}^{'} (μ)} A_{1} (μ), \\ p_{3} (y; θ) & = & A_{2} (y, μ) + {ψ (r (y, μ)) - G_{1}^{'} (μ)} A_{1}^{'} (μ) / F^{'} (μ), \end{matrix}

(11)

where

μ = F^{- 1} (θ)

,

A_{0} (y, μ) = - [ψ^{'} (r (y, μ)) \{1 + \frac{y - μ}{\sqrt{V (μ)}} \times \frac{V^{'} (μ)}{2 \sqrt{V (μ)}}\} + G_{1}^{″} (μ) \sqrt{V (μ)}] \frac{q^{″} (μ)}{{F^{'} (μ)}^{2}},

A_{1} (μ) = [{q^{(3)} (μ) \sqrt{V (μ)} + 2^{- 1} q^{″} (μ) V^{'} (μ) / \sqrt{V (μ)}} F^{'} (μ) - q^{″} (μ) \sqrt{V (μ)} F^{″} (μ)] / {F^{'} (μ)}^{3}

and

A_{2} (y, μ) = [\partial A_{0} (y, μ) / \partial μ + \partial {ψ (r (y, μ)) - G_{1}^{'} (μ)} / \partial μ A_{1} (μ)] / F^{'} (μ)

. Particularly,

p_{1} (y; θ)

contains

ψ (r)

;

p_{2} (y; θ)

contains

ψ (r)

,

ψ^{'} (r)

and

ψ^{'} (r) r

;

p_{3} (y; θ)

contains

ψ (r)

,

ψ^{'} (r)

,

ψ^{'} (r) r

,

ψ^{″} (r)

,

ψ^{″} (r) r

, and

ψ^{″} (r) r^{2}

, where

r = r (y, μ) = (y - μ) / \sqrt{V (μ)}

denotes the Pearson residual. Accordingly,

{p_{j} (y; θ) : j = 1, 2, 3}

depend on y through

ψ (r)

and its derivatives coupled with r. Then, we observe from (9) and (11) that:

E {p_{1} (Y; X^{T} β_{o} + η^{o} (T)) ∣ X, T} = 0 .

(12)

In the particular choice of

ψ (r) = r

, it is clearly noticed from (9) that

G_{1}^{'} (\cdot) = 0

, and thus,

G^{'} (\cdot) = 0

. In such a case, the proposed “robust-BD”

ρ_{q} (y, μ)

reduces to the “classical-BD”

Q_{q} (y, μ)

.

2.3. Local-Polynomial Robust-BD Estimator of $η^{o} (\cdot)$

Let

{(Y_{i}, X_{i}, T_{i})}_{i = 1}^{n}

be

i . i . d .

observations of

(Y, X, T)

captured by the GPLM in (2) and (5), where the dimension

d \geq 1

is a finite integer. From (2), it is directly observed that if the true value of

β_{o}

is known, then estimating

η^{o} (\cdot)

becomes estimating a nonparametric function; conversely, if the actual form of

η^{o} (\cdot)

is available, then estimating

β_{o}

amounts to estimating a vector parameter.

To motivate the estimation of

η^{o} (\cdot)

at a fitting point t, a proper way to characterize

η^{o} (t)

is desired. For any given value of

β

, define:

S (a; t, β) = E {ρ_{q} (Y, F^{- 1} (X^{T} β + a)) w_{1} (X) ∣ T = t},

(13)

where a is a scalar,

ρ_{q} (y, μ)

is the “robust-BD” defined in (8), which aims to guard against outlying observations in the response space of Y, and

w_{1} (\cdot) \geq 0

is a given bounded weight function that downweights high leverage points in the covariate space of

X

. See Section 6 and Section 7 for an example of

w_{1} (x)

. Set:

η_{β} (t) = \arg \min_{a \in R^{1}} S (a; t, β) .

(14)

Theoretically,

η^{o} (t) = η_{β_{o}} (t)

will be assumed (in Condition A3) for obtaining asymptotically unbiased estimators of

η^{o} (\cdot)

. Such property indeed holds, for example, when a classical quadratic loss combined with an identity link is used in (14). Thus, we call

η_{η} (\cdot)

the “surrogate function” for

η^{o} (\cdot)

.

The characterization of the surrogate function

η_{β} (t)

in (14) enables us to develop its robust-BD estimator

{\hat{η}}_{β} (t)

based on nonparametric function estimation. Assume that

η^{o} (\cdot)

is

(p + 1)

-times continuously differentiable at the fitting point t. Denote by

a_{o} (t) = {(η^{o} (t), {(η^{o})}^{(1)} (t), \dots, {(η^{o})}^{(p)} (t) / p!)}^{T} \in R^{p + 1}

the vector consisting of

η^{o} (t)

along with its (re-scaled) derivatives. For observed covariates

T_{i}

close to the point t, the Taylor expansion implies that:

\begin{matrix} η^{o} (T_{i}) & \approx & η^{o} (t) + (T_{i} - t) {(η^{o})}^{(1)} (t) + \dots + {(T_{i} - t)}^{p} {(η^{o})}^{(p)} (t) / p! \\ = & t_{i} {(t)}^{T} a_{o} (t), \end{matrix}

(15)

where

t_{i} (t) = {(1, (T_{i} - t), \dots, {(T_{i} - t)}^{p})}^{T}

. For any given value of

β

, let

\hat{a} (t; β) = {({\hat{a}}_{0} (t; β), {\hat{a}}_{1} (t; β), \dots, {\hat{a}}_{p} (t; β))}^{T}

be the minimizer of the criterion function,

S_{n} (a; t, β) = \frac{1}{n} \sum_{i = 1}^{n} ρ_{q} (Y_{i}, F^{- 1} (X_{i}^{T} β + t_{i} {(t)}^{T} a)) w_{1} (X_{i}) K_{h} (T_{i} - t),

(16)

with respect to

a \in R^{p + 1}

, where

K_{h} (\cdot) = K (\cdot / h) / h

is re-scaled from a kernel function K and

h > 0

is termed a bandwidth parameter. The first entry of

\hat{a} (t; β)

supplies the local-polynomial robust-BD estimator

{\hat{η}}_{β} (t)

of

η_{β} (t)

, i.e.,

{\hat{η}}_{β} (t) = e_{1, p + 1}^{T} \{\arg \min_{a \in R^{p + 1}} S_{n} (a; t, β)\},

(17)

where

e_{j, p + 1}

denotes the j-th column of a

(p + 1) \times (p + 1)

identity matrix.

It is noted that the reliance of

{\hat{η}}_{β} (t)

on

β

does not guarantee its consistency to

η^{o} (t)

. Nonetheless, it is anticipated from the uniform consistency of

{\hat{η}}_{\hat{β}}

in Lemma 1 that

{\hat{η}}_{\hat{β}} (t)

will offer a valid estimator of

η^{o} (t)

, provided that

\hat{β}

consistently estimates

β_{o}

. Section 2.4 will discuss our proposed robust-BD estimator

\hat{β}

. Furthermore, Lemma 1 will assume (in Condition A1) that

η_{β} (t)

is the unique minimizer of

S (a; t, β)

with respect to a.

Remark 1.

The case of using the “kernel estimation”, or locally-constant estimation, corresponds to the choice of degree

p = 0

in (15). In that case, the criterion function in (16) and the estimator in (17) reduce to:

\begin{matrix} S_{n} (a; t, β) & = & \frac{1}{n} \sum_{i = 1}^{n} ρ_{q} (Y_{i}, F^{- 1} (X_{i}^{T} β + a)) w_{1} (X_{i}) K_{h} (T_{i} - t), \end{matrix}

(18)

\begin{matrix} {\hat{η}}_{β} (t) & = & \arg \min_{a \in R^{1}} S_{n} (a; t, β), \end{matrix}

(19)

respectively.

2.4. Robust-BD Estimator of $β_{o}$

For any given value of

β

, define:

J (β, η_{β}) = E {ρ_{q} (Y, F^{- 1} (X^{T} β + η_{β} (T))) w_{2} (X)},

(20)

where

η_{β} (\cdot)

is as defined in (14) and

w_{2} (\cdot)

plays the same role as

w_{1} (\cdot)

in (13). Theoretically, it is anticipated that:

β_{o} = \arg \min_{β \in R^{d}} J (β, η_{β}),

(21)

which holds for example in the case where a classical quadratic loss combined with an identity link is used. To estimate

β_{o}

, it is natural to replace (20) by its sample-based criterion,

J_{n} (β, {\hat{η}}_{β}) = \frac{1}{n} \sum_{i = 1}^{n} ρ_{q} (Y_{i}, F^{- 1} (X_{i}^{T} β + {\hat{η}}_{β} (T_{i}))) w_{2} (X_{i}),

(22)

where

{\hat{η}}_{β} (\cdot)

is as defined in (17). Hence, a parametric estimator of

β_{o}

is provided by:

\hat{β} = \arg \min_{β \in R^{d}} J_{n} (β, {\hat{η}}_{β}) .

(23)

Finally, the estimator of

η^{o} (\cdot)

is given by:

\hat{η} (\cdot) = {\hat{η}}_{\hat{β}} (\cdot) .

To achieve asymptotic normality of

\hat{β}

, Theorem 2 assumes (in Condition

A 2

) that

β_{o}

is the unique minimizer in (21), a standard condition for consistent M-estimators [16].

As a comparison, it is seen that

w_{1} (\cdot)

in (16) is used to robustify covariates

X_{i}

in estimating

η^{o} (\cdot)

,

w_{2} (\cdot)

in (22) is used to robustify covariates

X_{i}

in estimating

β_{o}

and

ρ_{q} (\cdot, \cdot)

serves to robustify the responses

Y_{i}

in both estimating procedures.

3. Two-Step Iterative Algorithm for Robust-BD Estimation

In a special case of using the classical quadratic loss combined with an identity link function, the robust-BD estimators for parametric and nonparametric components have explicit expressions,

\hat{β} = {({\tilde{X}}^{T} w_{2} \tilde{X})}^{- 1} ({\tilde{X}}^{T} w_{2} \tilde{y}), {(\hat{η} (T_{1}), \dots, \hat{η} (T_{n}))}^{T} = S_{h} (y - X \hat{β}),

(24)

where

w_{2} = diag (w_{2} (X_{1}), \dots, w_{2} (X_{n}))

,

\tilde{y} = (I - S_{h}) y

,

\tilde{X} = (I - S_{h}) X

, with

I

being an identity matrix,

y = {(Y_{1}, \dots, Y_{n})}^{T}

,

X = {(X_{1}, \dots, X_{n})}^{T}

the design matrix,

S_{h} = (\begin{matrix} e_{1, p + 1}^{T} {[{T (T_{1})}^{T} W_{w_{1}; K} (T_{1}) T (T_{1})]}^{- 1} {T (T_{1})}^{T} W_{w_{1}; K} (T_{1}) \\ ⋮ \\ e_{1, p + 1}^{T} {[{T (T_{n})}^{T} W_{w_{1}; K} (T_{n}) T (T_{n})]}^{- 1} {T (T_{n})}^{T} W_{w_{1}; K} (T_{n}) \end{matrix}),

and:

T (t) = {(t_{1} (t), \dots, t_{n} (t))}^{T}, W_{w_{1}; K} (t) = diag {w_{1} (X_{i}) K_{h} (T_{i} - t) : i = 1, \dots, n} .

When

w_{1} (x) = w_{2} (x) \equiv 1

, (24) reduces to the “profile least-squares estimators” of [8].

In other cases, robust-BD estimators from (17) and (23) do not have closed-form expressions and need to be solved numerically, which are computationally challenging and intensive. We now discuss a two-step robust proposal for iteratively estimating

β_{o}

and

η^{o} (\cdot)

. Let

{\hat{β}}^{[k - 1]}

and

{{\hat{η}}^{[k - 1]} (T_{i})}_{i = 1}^{n}

denote the estimates in the

(k - 1)

-th iteration, where

{\hat{η}}^{[k - 1]} (\cdot) = {\hat{η}}_{{\hat{β}}^{[k - 1]}} (\cdot)

. The k-th iteration consists of two steps below.

Step 1: Instead of solving (23) directly, we propose to solve a surrogate optimization problem, ${\hat{β}}^{[k]} = \arg \min_{β \in R^{d}} J_{n} (β, {\hat{η}}^{[k - 1]})$ . This minimizer approximates $\hat{β}$ .
Step 2: Obtain ${\hat{η}}^{[k]} (T_{i}) = {\hat{η}}_{{\hat{β}}^{[k]}} (T_{i})$ , $i = 1, \dots, n$ , where ${\hat{η}}_{β} (t)$ is defined in (17).

The algorithm terminates provided that

∥ {\hat{β}}^{[k]} - {\hat{β}}^{[k - 1]} ∥

is below some pre-specified threshold value, and all

{{\hat{η}}^{[k]} (T_{i})}_{i = 1}^{n}

stabilize.

3.1. Step 1

For the above two-step algorithm, we first elaborate on the procedure of acquiring

{\hat{β}}^{[k]}

in Step 1, by extending the coordinate descent (CD) iterative algorithm [17] designed for penalized estimation to our current robust-BD estimation, which is computationally efficient. For any given value of

η

, by Taylor expansion, around some initial estimate

β^{*}

(for example,

{\hat{β}}^{[k - 1]}

), we obtain the weighted quadratic approximation,

\begin{matrix} ρ_{q} (Y_{i}, F^{- 1} (X_{i}^{T} β + η)) \approx \frac{1}{2} s_{i}^{I} {(Z_{i}^{I} - X_{i}^{T} β)}^{2} + C_{i}, \end{matrix}

where

C_{i}

is a constant not depending on

β

,

\begin{matrix} s_{i}^{I} & = & p_{2} (Y_{i}; X_{i}^{T} β^{*} + η), \\ Z_{i}^{I} & = & X_{i}^{T} β^{*} - p_{1} (Y_{i}; X_{i}^{T} β^{*} + η) / p_{2} (Y_{i}; X_{i}^{T} β^{*} + η), \end{matrix}

with

p_{j} (y; θ)

defined in (10). Hence,

\begin{matrix} J_{n} (β, η) & = & \frac{1}{n} \sum_{i = 1}^{n} ρ_{q} (Y_{i}, F^{- 1} (X_{i}^{T} β + η)) w_{2} (X_{i}) \\ \approx & \frac{1}{2} \sum_{i = 1}^{n} \{n^{- 1} s_{i}^{I} w_{2} (X_{i})\} {(Z_{i}^{I} - X_{i}^{T} β)}^{2} + constant . \end{matrix}

Thus it suffices to conduct minimization of

\sum_{i = 1}^{n} s_{i}^{I} w_{2} (X_{i}) {(Z_{i}^{I} - X_{i}^{T} β)}^{2}

with respect to

β

, using a coordinate descent (CD) updating procedure. Suppose that the current estimate is

{\hat{β}}^{old} = {({\hat{β}}_{1}^{old}, \dots, {\hat{β}}_{d}^{old})}^{T}

, with the current residual vector

{\hat{r}}^{old} = ({\hat{r}}_{1}^{old}, \dots, {\hat{r}}_{n}^{old}) = z^{I} - X {\hat{β}}^{old}

, where

z^{I} = {(Z_{1}^{I}, \dots, Z_{n}^{I})}^{T}

is the vector of pseudo responses. Adopting the Newton–Raphson algorithm, the estimate of the j-th coordinate based on the previous estimate

{\hat{β}}_{j}^{old}

is updated to:

{\hat{β}}_{j}^{new} = {\hat{β}}_{j}^{old} + \frac{\sum_{i = 1}^{n} {s_{i}^{I} w_{2} (X_{i})} {\hat{r}}_{i}^{old} X_{i, j}}{\sum_{i = 1}^{n} {s_{i}^{I} w_{2} (X_{i})} X_{i, j}^{2}} .

As a result, the residuals due to such an update are updated to:

{\hat{r}}_{i}^{new} = {\hat{r}}_{i}^{old} - X_{i, j} ({\hat{β}}_{j}^{new} - {\hat{β}}_{j}^{old}), i = 1, \dots, n .

Cycling through

j = 1, \dots, d

, we obtain the estimate

{\hat{β}}^{new} = {({\hat{β}}_{1}^{new}, \dots, {\hat{β}}_{d}^{new})}^{T}

. Now, we set

η = {\hat{η}}^{[k - 1]}

and

β^{*} = {\hat{β}}^{[k - 1]}

. Iterate the process of weighted quadratic approximation followed by the CD updating, for a number of times, until the estimate

{\hat{β}}^{new}

stabilizes to the solution

{\hat{β}}^{[k]}

.

The validity of

{\hat{β}}^{[k]}

in Step 1 converging to the true parameter

β_{o}

is justified as follows. (i) Standard results for M-estimation [16] indicate that the minimizer of

J_{n} (β, η_{β_{o}})

is consistent with

β_{o}

. (ii) According to our Theorem 1 (ii) in Section 4.1,

\sup_{t \in T} | {\hat{η}}_{\hat{β}} (t) - η_{β_{o}} (t) | \overset{P}{⟶} 0

for a compact set

T

, where

\overset{P}{⟶}

stands for convergence in probability. Using derivations similar to those of (A4) gives

\sup_{β \in K} | J_{n} (β, {\hat{η}}_{\hat{β}}) - J_{n} (β, η_{β_{o}}) | \overset{P}{⟶} 0

for any compact set

K

. Thus, minimizing

J_{n} (β, {\hat{η}}_{\hat{β}})

is asymptotically equivalent to minimizing

J_{n} (β, η_{β_{o}})

. (iii) Similarly, provided that

{\hat{β}}^{[k - 1]}

is close to

\hat{β}

, minimizing

J_{n} (β, {\hat{η}}_{{\hat{β}}^{[k - 1]}})

is asymptotically equivalent to minimizing

J_{n} (β, {\hat{η}}_{\hat{β}})

. Assembling these three results with the definition of

{\hat{β}}^{[k]}

yields:

\begin{matrix} {\hat{β}}^{[k]} & = & \arg \min_{β} J_{n} (β, {\hat{η}}_{{\hat{β}}^{[k - 1]}}) \\ = & \arg \min_{β} J_{n} (β, {\hat{η}}_{\hat{β}}) + o_{P} (1) \\ = & \arg \min_{β} J_{n} (β, η_{β_{o}}) + o_{P} (1) \\ = & β_{o} + o_{P} (1) . \end{matrix}

3.2. Step 2

In Step 2, obtaining

{\hat{η}}_{β} (t)

for any given values of

β

and t is equivalent to minimizing

S_{n} (a; t, β)

in (16). Notice that the dimension

(p + 1)

of

a

is typically low, with degrees

p = 0

or

p = 1

being the most commonly used in practice. Hence, the minimizer of

S_{n} (a; t, β)

can be obtained by directly applying the Newton–Raphson iteration: for

k = 0, 1, \dots

,

a^{[k + 1]} (t; β) = a^{[k]} (t; β) - {\{\frac{\partial^{2} S_{n} (a; t, β)}{\partial a \partial a^{T}} |_{a = a^{[k]} (t; β)}\}}^{- 1} \frac{\partial S_{n} (a; t, β)}{\partial a} |_{a = a^{[k]} (t; β)},

where

a^{[k]} (t; β)

denotes the estimate in the k-th iteration, and:

\begin{matrix} \frac{\partial S_{n} (a; t, β)}{\partial a} & = & \frac{1}{n} \sum_{i = 1}^{n} p_{1} (Y_{i}; X_{i}^{T} β + t_{i} {(t)}^{T} a) t_{i} (t) w_{1} (X_{i}) K_{h} (T_{i} - t), \\ \frac{\partial^{2} S_{n} (a; t, β)}{\partial a \partial a^{T}} & = & \frac{1}{n} \sum_{i = 1}^{n} p_{2} (Y_{i}; X_{i}^{T} β + t_{i} {(t)}^{T} a) t_{i} (t) t_{i} {(t)}^{T} w_{1} (X_{i}) K_{h} (T_{i} - t) . \end{matrix}

The iterations terminate until the estimate

{\hat{η}}^{[k + 1]} (t) = e_{1, p + 1}^{T} a^{[k + 1]} (t; β)

stabilizes.

Our numerical studies of the robust-BD estimation indicate that (i) the kernel regression method can be both faster and stabler than the local-linear method; (ii) to estimate the nonparametric component

η^{o} (\cdot)

, the local-linear method outperforms the kernel method, especially at the edges of points

{T_{i}}_{i = 1}^{n}

; (iii) for the performance of the robust estimation of

β_{o}

, which is of major interest, there is a relatively negligible difference between choices of using the kernel and local-linear methods in estimating nonparametric components.

4. Asymptotic Property of the Robust-BD Estimators

This section investigates the asymptotic behavior of robust-BD estimators

\hat{β}

and

{\hat{η}}_{\hat{β}}

, under regularity conditions. The consistency of

\hat{β}

to

β_{o}

and uniform consistency of

{\hat{η}}_{\hat{β}}

to

η^{o}

are given in Theorem 1; the asymptotic normality of

\hat{β}

is obtained in Theorem 2. For the sake of exposition, the asymptotic results will be derived using local-linear estimation with degree

p = 1

. Analogous results can be obtained for local-polynomial methods with lengthier technical details and are omitted.

We assume that

T \in T

, and let

T_{0} \subseteq T

be a compact set. For any continuous function

v : T \mapsto R

, define

{∥ v ∥}_{\infty} = \sup_{t \in T} | v (t) |

and

{∥ v ∥}_{T_{0}; \infty} = \sup_{t \in T_{0}} | v (t) |

. For a matrix M, the smallest and largest eigenvalues are denoted by

λ_{j} (M)

,

λ_{\min} (M)

and

λ_{\max} (M)

, respectively. Let

∥ M ∥ = \sup_{∥ x ∥ = 1} ∥ M x ∥ = {λ_{\max} (M^{T} M)}^{1 / 2}

be the matrix

L_{2}

norm. Denote by

\overset{P}{⟶}

convergence in probability and

\overset{D}{⟶}

convergence in distribution.

4.1. Consistency

We first present Lemma 1, which states the uniform consistency of

{\hat{η}}_{β} (\cdot)

to the surrogate function

η_{β} (\cdot)

. Theorem 1 gives the consistency of

\hat{β}

and

{\hat{η}}_{\hat{β}}

.

Lemma 1

(For the non-parametric surrogate

η_{β} (\cdot)

). Let

K \subseteq R^{d}

and

T_{0} \subseteq T

be compact sets. Assume Condition

A 1

and Condition

B

in the Appendix. If

n \to \infty

,

h \to 0

,

n h \to \infty

,

\log (1 / h) / (n h) \to 0

, then

\sup_{β \in K} {∥ {\hat{η}}_{β} - η_{β} ∥}_{T_{0}; \infty} \overset{P}{⟶} 0 .

Theorem 1

(For

β_{o}

and

η^{o} (\cdot)

). Assume conditions in Lemma 1.

(i): If there exists a compact set $K_{1}$ such that $\lim_{n \to \infty} P (\hat{β} \in K_{1}) = 1$ and Condition $A 2$ holds, then $\hat{β} \overset{P}{⟶} β_{o}$ .
(ii): Moreover, if Condition $A 3$ holds, then ${∥ {\hat{η}}_{\hat{β}} - η^{o} ∥}_{T_{0}; \infty} \overset{P}{⟶} 0$ .

4.2. Asymptotic Normality

The asymptotic normality of

\hat{β}

is provided in Theorem 2.

Theorem 2

(For the parametric part β_o). Assume Conditions A and Condition B in the Appendix. If

n \to \infty

,

n h^{4} \to 0

and

\log (1 / h) / (n h^{2}) \to 0

, then:

\sqrt{n} (\hat{β} - β_{o}) \overset{D}{⟶} N (0, H_{0}^{- 1} Ω_{0}^{*} H_{0}^{- 1}),

where:

\begin{matrix} H_{0} = E [p_{2} (Y; X^{T} β_{o} + η^{o} (T)) \{X + \frac{\partial η_{β} (T)}{\partial β} |_{β = β_{o}}\} {\{X + \frac{\partial η_{β} (T)}{\partial β} |_{β = β_{o}}\}}^{T} w_{2} (X)], \end{matrix}

(25)

and:

\begin{matrix} Ω_{0}^{*} & = & E (p_{1}^{2} (Y; X^{T} β_{o} + η^{o} (T)) [\{X + \frac{\partial η_{β} (T)}{\partial β} |_{β = β_{o}}\} w_{2} (X) - \frac{γ (T)}{g_{2} (T; T, β_{o})} w_{1} (X)] \\ \times {[\{X + \frac{\partial η_{β} (T)}{\partial β} |_{β = β_{o}}\} w_{2} (X) - \frac{γ (T)}{g_{2} (T; T, β_{o})} w_{1} (X)]}^{T}) \end{matrix}

(26)

with:

\begin{matrix} γ (t) & = & E [p_{2} (Y; X^{T} β_{o} + η^{o} (t)) \{X + \frac{\partial η_{β} (t)}{\partial β} |_{β = β_{o}}\} w_{2} (X) | T = t], \\ g_{2} (t; t, β) & = & E {p_{2} (Y; X^{T} β + η_{β} (t)) w_{1} (X) ∣ T = t} . \end{matrix}

From Condition

A 1

, (13) and (14), we can show that if

w_{1} (\cdot) \equiv C w_{2} (\cdot)

for some constant

C \in (0, \infty)

, then

γ (t) = 0

. In that case,

Ω_{0}^{*} = Ω_{0}

, where:

\begin{matrix} Ω_{0} = E [p_{1}^{2} (Y; X^{T} β_{o} + η^{o} (T)) \{X + \frac{\partial η_{β} (T)}{\partial β} |_{β = β_{o}}\} {\{X + \frac{\partial η_{β} (T)}{\partial β} |_{β = β_{o}}\}}^{T} w_{2}^{2} (X)] . \end{matrix}

(27)

Consider the conventional PLM in (1), estimated using the classical quadratic loss, identity link and

w_{1} (\cdot) = w_{2} (\cdot) \equiv 1

. If

var (ϵ ∣ X, T) \equiv σ^{2}

, then

H_{0}^{- 1} Ω_{0} H_{0}^{- 1} = σ^{2} {[E {var (X ∣ T)}]}^{- 1}

, and thus, the result of Theorem 2 agrees with that in [18].

Remark 2.

Theorem 2 implies the root-n convergence rate of

\hat{β}

. This differs from

{\hat{η}}_{\hat{β}} (t)

, which converges at some rate incorporating both the sample size n and the bandwidth h, as seen in the proofs of Lemma 1 and Theorem 2.

5. Robust Inference for $β_{o}$ Based on BD

In many statistical applications, we will check whether or not a subset of explanatory variables used is statistically significant. Specific examples include:

\begin{matrix} H_{0} : β_{j; o} = 0, & for j = j_{0}, \\ H_{0} : β_{j; o} = 0, & for j = j_{1}, \dots, j_{2} . \end{matrix}

These forms of linear hypotheses for

β_{o}

can be more generally formulated as: (4).

5.1. Wald-Type Test $W_{n}$

We propose a robust version of the Wald-type test statistic,

W_{n} = n {(A \hat{β} - g_{0})}^{T} {(A {\hat{H}}_{0}^{- 1} {\hat{Ω}}_{0}^{*} {\hat{H}}_{0}^{- 1} A^{T})}^{- 1} (A \hat{β} - g_{0}),

(28)

based on the robust-BD estimator

\hat{β}

proposed in Section 2.4, where

{\hat{Ω}}_{0}^{*}

and

{\hat{H}}_{0}

are estimates of

Ω_{0}^{*}

and

H_{0}

satisfying

{\hat{H}}_{0}^{- 1} {\hat{Ω}}_{0}^{*} {\hat{H}}_{0}^{- 1} \overset{P}{⟶} H_{0}^{- 1} Ω_{0}^{*} H_{0}^{- 1}

. For example,

\begin{matrix} {\hat{H}}_{0} = \frac{1}{n} \sum_{i = 1}^{n} p_{2} (Y_{i}; X_{i}^{T} \hat{β} + {\hat{η}}_{\hat{β}} (T_{i})) \{X_{i} + \frac{\partial {\hat{η}}_{β} (T_{i})}{\partial β} |_{β = \hat{β}}\} {\{X_{i} + \frac{\partial {\hat{η}}_{β} (T_{i})}{\partial β} |_{β = \hat{β}}\}}^{T} w_{2} (X_{i}), \end{matrix}

and:

\begin{matrix} {\hat{Ω}}_{0}^{*} & = & \frac{1}{n} \sum_{i = 1}^{n} p_{1}^{2} (Y_{i}; X_{i}^{T} \hat{β} + {\hat{η}}_{\hat{β}} (T_{i})) [\{X_{i} + \frac{\partial {\hat{η}}_{β} (T_{i})}{\partial β} |_{β = \hat{β}}\} w_{2} (X_{i}) - \frac{\hat{γ} (T_{i})}{{\hat{g}}_{2} (T_{i}; T_{i}, \hat{β})} w_{1} (X_{i})] \\ \times {[\{X_{i} + \frac{\partial {\hat{η}}_{β} (T_{i})}{\partial β} |_{β = \hat{β}}\} w_{2} (X_{i}) - \frac{\hat{γ} (T_{i})}{{\hat{g}}_{2} (T_{i}; T_{i}, \hat{β})} w_{1} (X_{i})]}^{T}, \end{matrix}

fulfill the requirement, where:

\begin{matrix} \frac{\partial {\hat{η}}_{β} (t)}{\partial β} & = & - \frac{\sum_{k = 1}^{n} p_{2} (Y_{k}; X_{k}^{T} β + {\hat{η}}_{β} (t)) X_{k} w_{1} (X_{k}) K_{h} (T_{k} - t)}{\sum_{k = 1}^{n} p_{2} (Y_{k}; X_{k}^{T} β + {\hat{η}}_{β} (t)) w_{1} (X_{k}) K_{h} (T_{k} - t)}, \\ \hat{γ} (t) & = & \frac{1}{n} \sum_{k = 1}^{n} p_{2} (Y_{k}; X_{k}^{T} \hat{β} + {\hat{η}}_{\hat{β}} (t)) \{X_{k} + \frac{\partial {\hat{η}}_{β} (t)}{\partial β} |_{β = \hat{β}}\} w_{2} (X_{k}) K_{h} (T_{k} - t), \\ {\hat{g}}_{2} (t; t, β) & = & \frac{1}{n} \sum_{k = 1}^{n} p_{2} (Y_{k}; X_{k}^{T} β + {\hat{η}}_{β} (t)) w_{1} (X_{k}) K_{h} (T_{k} - t) . \end{matrix}

Again, we can verify that if

w_{1} (\cdot) \equiv C w_{2} (\cdot)

for some constant

C \in (0, \infty)

and

{\hat{η}}_{β} (t)

is obtained from kernel estimation method, then

\hat{γ} (t) = 0

, and hence,

{\hat{Ω}}_{0}^{*} = {\hat{Ω}}_{0}

, where:

{\hat{Ω}}_{0} = \frac{1}{n} \sum_{i = 1}^{n} p_{1}^{2} (Y_{i}; X_{i}^{T} \hat{β} + {\hat{η}}_{\hat{β}} (T_{i})) \{X_{i} + \frac{\partial {\hat{η}}_{β} (T_{i})}{\partial β} |_{β = \hat{β}}\} {\{X_{i} + \frac{\partial {\hat{η}}_{β} (T_{i})}{\partial β} |_{β = \hat{β}}\}}^{T} w_{2}^{2} (X_{i}) .

Theorem 3 justifies that under the null,

W_{n}

would for large n be distributed as

χ_{k}^{2}

, thus asymptotically distribution-free.

Theorem 3

(Wald-type test based on robust-BD under H₀). Assume conditions in Theorem 2, and

{\hat{H}}_{0}^{- 1} {\hat{Ω}}_{0}^{*} {\hat{H}}_{0}^{- 1} \overset{P}{⟶} H_{0}^{- 1} Ω_{0}^{*} H_{0}^{- 1}

in (28). Then, under

H_{0}

in (4), we have that:

W_{n} \overset{D}{⟶} χ_{k}^{2} .

Theorem 4 indicates that

W_{n}

has a non-trivial local power detecting contiguous alternatives approaching the null at the rate

n^{- 1 / 2}

:

H_{1 n} : A β_{o} - g_{0} = c / \sqrt{n} {1 + o (1)},

(29)

where

c = {(c_{1}, \dots, c_{k})}^{T} \neq 0

.

Theorem 4

(Wald-type test based on robust-BD under H_1n). Assume conditions in Theorem 2, and

{\hat{H}}_{0}^{- 1} {\hat{Ω}}_{0}^{*} {\hat{H}}_{0}^{- 1} \overset{P}{⟶} H_{0}^{- 1} Ω_{0}^{*} H_{0}^{- 1}

in (28). Then, under

H_{1 n}

in (29),

W_{n} \overset{D}{⟶} χ_{k}^{2} (τ^{2})

, where

τ^{2} = c^{T} {(A H_{0}^{- 1} Ω_{0}^{*} H_{0}^{- 1} A^{T})}^{- 1} c > 0

.

To appreciate the discriminating power of

W_{n}

in assessing the significance, the asymptotic power is analyzed. Theorem 5 manifests that under the fixed alternative

H_{1}

,

W_{n} \overset{P}{⟶} + \infty

at the rate n. Thus,

W_{n}

has the power approaching to one against fixed alternatives.

Theorem 5

(Wald-type test based on robust-BD under H₁). Assume conditions in Theorem 2, and

{\hat{H}}_{0}^{- 1} {\hat{Ω}}_{0}^{*} {\hat{H}}_{0}^{- 1} \overset{P}{⟶} H_{0}^{- 1} Ω_{0}^{*} H_{0}^{- 1}

in (28). Then, under

H_{1}

in (4),

n^{- 1} W_{n} \geq λ_{\max}^{- 1} (A H_{0}^{- 1} Ω_{0}^{*} H_{0}^{- 1} A^{T}) {∥ A β_{o} - g_{0} ∥}^{2} + o_{P} (1)

.

For the conventional PLM in (1) estimated using the non-robust quadratic loss, [8] showed the asymptotic equivalence between the Wald-type test and likelihood ratio-type test. Our results in the next Section 5.2 reveal that such equivalence is violated when estimators are obtained using the robust loss functions.

5.2. Likelihood Ratio-Type Test $Λ_{n}$

This section explores the degree to which the likelihood ratio-type test is extended to the “robust-BD” for testing the null hypothesis in (4) for the GPLM. The robust-BD test statistic is:

Λ_{n} = 2 n \{\min_{β \in R^{d} : A β = g_{0}} J_{n} (β, {\hat{η}}_{β}) - J_{n} (\hat{β}, {\hat{η}}_{\hat{β}})\},

(30)

where

\hat{β}

is the robust-BD estimator for

β_{o}

developed in Section 2.4.

Theorem 6 indicates that the limit distribution of

Λ_{n}

under

H_{0}

is a linear combination of independent chi-squared variables, with weights relying on some unknown quantities, thus not distribution free.

Theorem 6

(Likelihood ratio-type test based on robust-BD under H₀). Assume conditions in Theorem 2.

(i): Under $H_{0}$ in (4), we obtain:

$Λ_{n} \overset{D}{⟶} \sum_{j = 1}^{k} λ_{j} {{(A H_{0}^{- 1} A^{T})}^{- 1} (A V_{0} A^{T})} Z_{j}^{2},$

where $V_{0} = H_{0}^{- 1} Ω_{0}^{*} H_{0}^{- 1}$ and ${Z_{j}}_{j = 1}^{k} \overset{i . i . d .}{\sim} N (0, 1)$ .
(ii): Moreover, if $ψ (r) = r$ , $w_{1} (x) = w_{2} (x) \equiv 1$ , and the generating q-function of BD satisfies:

$q^{″} (m (x, t)) = - \frac{C}{V (m (x, t))}, for a constant C > 0,$

(31)

then under $H_{0}$ in (4), we have that $Λ_{n} / C \overset{D}{⟶} χ_{k}^{2}$ .

Theorem 7 states that

Λ_{n}

has non-trivial local power for identifying contiguous alternatives approaching the null at rate

n^{- 1 / 2}

and that

Λ_{n} \overset{P}{⟶} + \infty

at the rate n under

H_{1}

, thus having the power approaching to one against fixed alternatives.

Theorem 7

(Likelihood ratio-type test based on robust-BD under H_1n and H₁). Assume conditions in Theorem 2. Let

V_{0} = H_{0}^{- 1} Ω_{0}^{*} H_{0}^{- 1}

and

λ_{j} = λ_{j} {{(A H_{0}^{- 1} A^{T})}^{- 1} (A V_{0} A^{T})}

,

j = 1, \dots, k

.

(i): Under $H_{1 n}$ in (29), $Λ_{n} \overset{D}{⟶} \sum_{j = 1}^{k} {(\sqrt{λ_{j}} Z_{j} + e_{j, k}^{T} S c)}^{2}$ , where ${Z_{j}}_{j = 1}^{k} \overset{i . i . d .}{\sim} N (0, 1)$ , and S is a matrix satisfying $S^{T} S = {(A H_{0}^{- 1} A^{T})}^{- 1}$ and $S (A V_{0} A^{T}) S^{T} = diag (λ_{1}, \dots, λ_{k})$ .
(ii): Under $H_{1}$ in (4), $n^{- 1} Λ_{n} \geq c {∥ A β_{o} - g_{0} ∥}^{2} + o_{P} (1)$ for a constant $c > 0$ .

5.3. Comparison between $W_{n}$ and $Λ_{n}$

In summary, the test

W_{n}

has some advantages over the test

Λ_{n}

. First, the asymptotic null distribution of

W_{n}

is distribution-free, whereas the asymptotic null distribution of

Λ_{n}

in general depends on unknown quantities. Second,

W_{n}

is invariant with re-scaling the generating q-function of the BD, but

Λ_{n}

is not. Third, the computational expense of

W_{n}

is much more reduced than that of

Λ_{n}

, partly because the integration operations for

ρ_{q}

are involved in

Λ_{n}

, but not in

W_{n}

, and partly because

Λ_{n}

requires both unrestricted and restricted parameter estimates, while

W_{n}

is useful in cases where restricted parameter estimates are difficult to compute. Thus,

W_{n}

will be focused on in numerical studies of Section 6.

6. Simulation Study

We conduct simulation evaluations of the performance of robust-BD estimation methods for general partially linear models. We use the Huber

ψ

-function

ψ (\cdot)

with

c = 1.345

. The weight functions are chosen to be

w_{1} (x) = w_{2} (x) = 1 / {1 + \sum_{j = 1}^{d} {(\frac{x_{j} - m_{j}}{s_{j}})}^{2}}^{1 / 2}

, where

x = {(x_{1}, \dots, x_{d})}^{T}

,

m_{j}

and

s_{j}

denote the sample median and sample median absolute deviation of

{X_{i, j} : i = 1, \dots, n}

respectively,

j = 1, \dots, d

. As a comparison, the classical non-robust estimation counterparts correspond to using

ψ (r) = r

and

w_{1} (x) = w_{2} (x) \equiv 1

. Throughout the numerical work, the Epanechnikov kernel function

K (t) = 0.75 \max (1 - t^{2}, 0)

is used. All these choices (among many others) are for feasibility; the issues on the trade-off between robustness and efficiency are not pursued further in the paper.

The following setup is used in the simulation studies. The sample size is

n = 200

, and the number of replications is 500. (Incorporating a nonparametric component in the GPLM desires a larger n when the number of covariates increases for better numerical performance.) Local-linear robust-BD estimation is illustrated with the bandwidth parameter h to be

20 %

of the interval length of the variable T. Results using other data-driven choices of h are similar and are omitted.

6.1. Bernoulli Responses

We generate observations

{(X_{i}, T_{i}, Y_{i})}_{i = 1}^{n}

randomly from the model,

Y ∣ (X, T) \sim Bernoulli (m (X, T)), X \sim N (0, Σ), T \sim Uniform (0, 1),

where

Σ = (σ_{j k})

with

σ_{j k} = {0.2}^{| j - k |}

, and

X

is independent of T. The link function is

logit {m (x, t)} = x^{T} β_{o} + η^{o} (t)

, where

β_{o} = {(2, 2, 0, 0)}^{T}

and

η^{o} (t) = 2 \sin {π (1 + 2 t)}

. Both the deviance and exponential loss functions are employed as the BD.

For each generated dataset from the true model, we create a contaminated dataset, where 10 data points

(X_{i, j}, Y_{i})

are contaminated as follows: they are replaced by

(X_{i, j}^{*}, Y_{i}^{*})

, where

Y_{i}^{*} = 1 - Y_{i}

,

i = 1, \dots, 5

,

\begin{matrix} X_{1, 2}^{*} = 5 sign (U_{1} - 0.5), X_{2, 2}^{*} = 5 sign (U_{2} - 0.5), X_{3, 2}^{*} = 5 sign (U_{3} - 0.5), \\ X_{4, 4}^{*} = 5 sign (U_{4} - 0.5), X_{5, 1}^{*} = 5 sign (U_{5} - 0.5), X_{6, 2}^{*} = 5 sign (U_{6} - 0.5), \\ X_{7, 3}^{*} = 5 sign (U_{7} - 0.5), X_{8, 4}^{*} = 5 sign (U_{8} - 0.5), X_{9, 2}^{*} = 5 sign (U_{9} - 0.5), \\ X_{10, 3}^{*} = 5 sign (U_{10} - 0.5), \end{matrix}

with

{U_{i}} \overset{i . i . d .}{\sim} Uniform (0, 1)

.

Figure 1 and Figure 2 compare the boxplots of

({\hat{β}}_{j} - β_{j; o})

,

j = 1, \dots, d

, based on the non-robust and robust-BD estimates, where the deviance loss and exponential loss are used as the BD in the top and bottom panels respectively. As seen from Figure 1 in the absence of contamination, both non-robust and robust methods perform comparably well. Besides, the bias in non-robust methods using the exponential loss (with

p_{2} (y; θ)

unbounded) is larger than that of the deviance loss (with

p_{2} (y; θ)

bounded). In the presence of contamination, Figure 2 reveals that the robust method is more effective in decreasing the estimation bias without excessively increasing the estimation variance.

Figure 1. Simulated Bernoulli response data without contamination. Boxplots of

({\hat{β}}_{j} - β_{j; o})

,

j = 1, \dots, d

(from left to right). (Left panels): non-robust method; (right panels): robust method.

Figure 2. Simulated Bernoulli response data with contamination. The captions are identical to those in Figure 1.

For each replication, we calculate

MSE (\hat{η}) = n^{- 1} \sum_{i = 1}^{n} {{\hat{η}}_{\hat{β}} (t_{i}) - η^{o} (t_{i})}^{2}

. Figure 3 and Figure 4 compare the plots of

{\hat{η}}_{\hat{β}} (t)

from typical samples, using non-robust and robust-BD estimates, where the deviance loss and exponential loss are used as the BD in the top and bottom panels, respectively. There, the typical sample in each panel is selected in a way such that its MSE value corresponds to the 50-th percentile among the MSE-ranked values from 500 replications. These fitted curves reveal little difference between using the robust and non-robust methods, in the absence of contamination. For contaminated cases, robust estimates perform slightly better than non-robust estimates. Moreover, the boundary bias issue arising from the curve estimates at the edges using the local constant method can be ameliorated by using the local-linear method.

Figure 3. Simulated Bernoulli response data without contamination. Plots of

η^{o} (t)

and

{\hat{η}}_{\hat{β}} (t)

. (Left panels): non-robust method; (right panels): robust method.

Figure 4. Simulated Bernoulli response data with contamination. Plots of

η^{o} (t)

and

{\hat{η}}_{\hat{β}} (t)

. (Left panels): non-robust method; (right panels): robust method.

6.2. Gaussian Responses

We generate independent observations

{(X_{i}, T_{i}, Y_{i})}_{i = 1}^{n}

from

(X, T, Y)

satisfying:

Y ∣ (X, T) \sim N (m (X, T), σ^{2}), (X, Φ^{- 1} (T)) \sim N (0, Σ),

where

σ = 1

,

Σ = (σ_{j k})

with

σ_{j k} = {0.2}^{| j - k |}

,

Φ

denotes the CDF of the standard normal distribution. The link function is

m (x, t) = x^{T} β_{o} + η^{o} (t)

, where

β_{o} = {(2, - 2, 1, - 1, 0, 0)}^{T}

and

η^{o} (t) = 2 \sin {π (1 + 2 t)}

. The quadratic loss is utilized as the BD.

For each dataset simulated from the true model, a contaminated data-set is created, where 10 data points

(X_{i, j}, Y_{i})

are subject to contamination. They are replaced by

(X_{i, j}^{*}, Y_{i}^{*})

, where

Y_{i}^{*} = Y_{i} I {| Y_{i} - m (X_{i}, T_{i}) | / σ > 2} + 15 I {| Y_{i} - m (X_{i}, T_{i}) | / σ \leq 2}

,

i = 1, \dots, 10

,

\begin{matrix} X_{1, 2}^{*} = 5 sign (U_{1} - 0.5), X_{2, 2}^{*} = 5 sign (U_{2} - 0.5), X_{3, 2}^{*} = 5 sign (U_{3} - 0.5), \\ X_{4, 4}^{*} = 5 sign (U_{4} - 0.5), X_{5, 6}^{*} = 5 sign (U_{5} - 0.5), X_{6, 1}^{*} = 5 sign (U_{6} - 0.5), \\ X_{7, 2}^{*} = 5 sign (U_{7} - 0.5), X_{8, 3}^{*} = 5 sign (U_{8} - 0.5), X_{9, 4}^{*} = 5 sign (U_{9} - 0.5), \\ X_{10, 5}^{*} = 5 sign (U_{10} - 0.5), \end{matrix}

with

{U_{i}} \overset{i . i . d .}{\sim} Uniform (0, 1)

.

Figure 5 and Figure 6 compare the boxplots of

({\hat{β}}_{j} - β_{j; o})

,

j = 1, \dots, d

, on the top panels, and plots of

{\hat{η}}_{\hat{β}} (t)

from typical samples, on the bottom panels, using the non-robust and robust-BD estimates. The typical samples are selected similar to those in Section 6.1. The simulation results in Figure 5 indicate that the robust method performs, as well as the non-robust method for estimating both the parameter vector and non-parametric curve in non-contaminated cases. Figure 6 reveals that the robust estimates are less sensitive to outliers than the non-robust counterparts. Indeed, the non-robust method yields a conceivable bias for parametric estimation, and non-parametric estimation is worse than that of the robust method.

Figure 5. Simulated Gaussian response data without contamination. Top panels: boxplots of

({\hat{β}}_{j} - β_{j; o})

,

j = 1, \dots, d

(from left to right). Bottom panels: plots of

η^{o} (t)

and

{\hat{η}}_{\hat{β}} (t)

. (Left panels): non-robust method; (right panels): robust method.

Figure 6. Simulated Gaussian response data with contamination. Top panels: boxplots of

({\hat{β}}_{j} - β_{j; o})

,

j = 1, \dots, d

(from left to right). Bottom panels: plots of

η^{o} (t)

and

{\hat{η}}_{\hat{β}} (t)

. (Left panels): non-robust method; (right panels): robust method.

Figure 7 gives the QQ plots of the (first to 95-th) percentiles of the Wald-type statistic

W_{n}

versus those of the

χ_{2}^{2}

distribution for testing the null hypothesis:

H_{0} : β_{2; o} = - 2 and β_{4; o} = - 1 .

(32)

Figure 7. Simulated Gaussian response data with contamination. Empirical quantiles (on the y-axis) of the Wald-type statistics

W_{n}

versus quantiles (on the x-axis) of the

χ^{2}

distribution. Solid line: the 45 degree reference line. (Left panels): non-robust method; (right panels): robust method.

The plots depict that in both clean and contaminated cases, the robust

W_{n}

(in right panels) closely follows the

χ_{2}^{2}

distribution, lending support to Theorem 3. On the other hand, the non-robust

W_{n}

agrees well with the

χ_{2}^{2}

distribution in clean data; the presence of a small number of outlying data points severely distorts the sampling distribution of the non-robust

W_{n}

(in the bottom left panel) from the

χ_{2}^{2}

distribution, yielding inaccurate levels of the test.

To assess the stability of the power of the Wald-type test for testing the hypothesis (32), we evaluate the power in a sequence of alternatives with parameters

β_{o} + Δ c

for each given

Δ

, where

c = β_{o} + {(1, \dots, 1)}^{T}

. Figure 8 plots the empirical rejection rates of the null model in the non-contaminated case and the contaminated case. The price to pay for the robust

W_{n}

is a little loss of power in the non-contaminated cases. However, under contamination, a very different behavior is observed. The observed power curve of the robust

W_{n}

is close to those attained in the non-contaminated case. On the contrary, the non-robust

W_{n}

is less informative, since its power curve is much lower than that of the robust

W_{n}

against the alternative hypotheses with

Δ \neq 0

, but higher than the nominal level at the null hypothesis with

Δ = 0

.

Figure 8. Observed power curves of tests for the Gaussian response data. The dashed line corresponds to the non-robust Wald-type test

W_{n}

; the solid line corresponds to the robust

W_{n}

; the dotted line indicates the 5% nominal level. (Left panels): non-contaminated case; (right panels): contaminated case.

7. Real Data Analysis

Two real datasets are analyzed. In both cases, the quadratic loss is set to be the BD, and the nonparametric function is fitted via local-linear regression method, where the bandwidth parameter is chosen to be 25% of the interval length of the variable T. Choices of the Huber

ψ

-function and weight functions are identical to those in Section 6.

7.1. Example 1

The dataset studied in [19] consists of 2447 observations on three variables,

\log (wage)

,

age

and

education

, for women. It is of interest to learn how wages change with years of age and years of education. It is anticipated to find an increasing regression function of

Y = \log (wage)

in

T = age

as well as in

X_{1} = education

. We fit a partially linear model

Y = η (T) + β_{1} X_{1} + ϵ

. Profiles of the fitted nonparametric functions

\hat{η} (\cdot)

in Figure 9 indeed exhibit the overall upward trend in

age

. The coefficient estimate is

{\hat{β}}_{1} = 0.0809

with standard error 0.0042 using the non-robust method, and is

{\hat{β}}_{1} = 0.1334

with standard error 0.0046 by means of the robust method. It is seen that robust estimates are similar to the non-robust counterparts. Our evaluation, based on both the non-robust and robust methods, supports the predicted result in theoretical and empirical literature in socio-economical studies.

Figure 9. The dataset in [19]. (Left panels): estimate of

η (T)

via the non-robust quadratic loss; (right panels): estimate of

η (T)

via the robust quadratic loss.

7.2. Example 2

We analyze an employee dataset (Example 11.3 of [20]) of the Fifth National Bank of Springfield, based on year 1995 data. The bank, whose name has been changed, was charged in court with that its female employees received substantially smaller salaries than its male employees. For each of its 208 employees, the dataset consists of seven variables,

EducLev

(education level),

JobGrade

(job grade),

YrHired

(year that an employee was hired),

YrBorn

(year that an employee was born),

Female

(indicator of being female),

YrsPrior

(years of work experience at another bank before working at the Fifth National bank), and

Salary

(current annual salary in thousands of dollars).

To explain variation in salary, we fit a partial linear model,

Y = η (T) + β_{1} X_{1} + β_{2} X_{2} + β_{3} X_{3} + β_{4} X_{4} + β_{5} X_{5} + ϵ

, for

Y = \log (Salary)

,

T = Age

,

X_{1} = Female

,

X_{2} = YrHired

,

X_{3} = EducLev

,

X_{4} = JobGrade

and

X_{5} = YrsPrior

, where

Age = 95 - YrBorn

is age. Table 1 presents parameter estimates and their standard errors (given within brackets), along with p-values calculated from the Wald-type test

W_{n}

. Figure 10 depicts the estimated nonparametric functions.

Table 1. Parameter estimates and p-values for partially linear model of the dataset in [20].

Figure 10. The dataset in [20]. (Left panel): estimate of

η (T)

via the non-robust quadratic loss; (right panel): estimate of

η (T)

via the robust quadratic loss.

It is interesting to note that for this dataset, results from using the robust and non-robust methods make a difference in drawing conclusions. For example, from Table 1, the non-robust method gives the estimate of parameter

β_{1}

for gender to be below zero, which may be interpreted as the evidence of discrimination against female employees in salary and lends support to the plaintiff. In contrast, the robust method yields

{\hat{β}}_{1} > 0

, which does not indicate that gender has an adverse effect. (A similar conclusion made from penalized-likelihood was obtained in Section 4.1 of [21]). Moreover, the estimated nonparametric functions

\hat{η} (\cdot)

obtained from non-robust and robust methods are qualitatively different: the former method does not deliver a monotone increasing pattern with Age, whereas the latter method does. Whether or not the difference was caused by outlying observations will be an interesting issue to be investigated.

8. Discussion

Over the past two decades, nonparametric inference procedures for testing hypotheses concerning nonparametric regression functions have been developed extensively. See [22,23,24,25,26] and the references therein. The work on the generalized likelihood ratio test [24] offers light into nonparametric inference, based on function estimation under nonparametric models, using the quadratic loss function as the error measure. These works do not directly deal with the robust procedure. Exploring the inference on nonparametric functions, such as

η^{o} (t)

in GPLM associated with a scalar variable T and the additive structure

\sum_{d = 1}^{D} η_{d}^{o} (t_{d})

as in [27] with a vector variable

T = (T_{1}, \dots, T_{D})

, estimated via the “robust-BD” as the error measure, when there are possible outlying data points, will be the future work.

This paper utilizes the class BD of loss functions, the optimal choice of which depends on specific settings and criteria. For e.g., regression and classification will utilize different loss functions, and thus further study on optimality is desirable.

Some recent work on partially linear models in econometrics includes [28,29,30]. There, the nonparametric function is approximated via linear expansions, with the number of coefficients diverging with n. Developing inference procedures to be resistant to outliers could be of interest.

Acknowledgments

The authors thank the two referees for insightful comments and suggestions. The research is supported by the U.S. NSF Grants DMS–1712418, DMS–1505367, CMMI–1536978, DMS–1308872, the Wisconsin Alumni Research Foundation and the National Natural Science Foundation of China, grants 11690014.

Author Contributions

C.Z. conceived and designed the experiments; C.Z. analyzed the data; Z.Z. contributed to discussions and analysis tools; C.Z. wrote the paper.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Proofs of Main Results

Throughout the proof, C represents a generic finite constant. We impose some regularity conditions, which may not be the weakest, but facilitate the technical derivations.

Notation:

For integers

j \geq 0

,

μ_{j} (K) = \int u^{j} K (u) d u

;

c_{p} = {(μ_{p + 1} (K), \dots, μ_{2 p + 1} (K))}^{T}

;

S = {(μ_{j + k - 2} (K))}_{1 \leq j, k \leq p + 1}

. Define:

η (x, t) = F (m (x, t)) = x^{T} β_{o} + η^{o} (t)

;

η_{i} = η (X_{i}, T_{i})

. Set

η_{i} (t; β) = X_{i}^{T} β + η_{β} (t) + \sum_{k = 1}^{p} {(T_{i} - t)}^{k} η_{β}^{(k)} (t) / k!

;

g_{1} (τ; t, β) = E {p_{1} (Y_{i}; η_{i} (t; β)) w_{1} (X_{i}) ∣ T_{i} = τ}

;

g_{2} (τ; t, β) = E {p_{2} (Y_{i}; η_{i} (t; β)) w_{1} (X_{i}) ∣ T_{i} = τ}

.

Condition A:

A1.: $η_{β} (t)$ is the unique minimizer of $S (a; t, β)$ with respect to $a \in R^{1}$ .
A2.: $β_{o} \in R^{d}$ is the unique minimizer of $J (β, η_{β})$ with respect to $β$ , where $d \geq 1$ .
A3.: $η^{o} (\cdot) = η_{β_{o}} (\cdot)$ .

Condition B:

B1.: The function $ρ_{q} (y, μ)$ is continuous and bounded. The functions $p_{1} (y; θ)$ , $p_{2} (y; θ)$ , $p_{3} (y; θ)$ , $w_{1} (\cdot)$ and $w_{2} (\cdot)$ are bounded; $p_{2} (y; θ)$ is continuous in $θ$ .
B2.: The kernel function K is Lipschitz continuous, a symmetric probability density function with bounded support. The matrix $S$ is positive definite.
B3.: The marginal density $f_{T} (t)$ of T is a continuous function, uniformly bounded away from zero and ∞ for $t \in T_{0}$ .
B4.: The function $S (a; t, β)$ is continuous and $η_{β} (t)$ is a continuous function of $(t, β)$ .
B5.: Assume $g_{2} (τ; t, β)$ is continuous in $τ$ ; $g_{2} (t; t, β)$ is continuous in $t \in T_{0}$ .
B6.: Functions $η_{β} (t)$ and $η^{o} (t)$ are $(p + 1)$ -times continuously differentiable at t.
B7.: The link function $F (\cdot)$ is monotone increasing and a bijection, $F^{(3)} (\cdot)$ is continuous, and $F^{(1)} (\cdot) > 0$ . The matrix $var (X ∣ T = t)$ is positive definite for a.e. t.
B8.: The matrix $H_{0}$ in (25) is invertible; $Ω_{0}^{*}$ in (26) is positive-definite.
B9.: ${\hat{η}}_{β} (t)$ and $η_{β} (t)$ are continuously differentiable with respect to $(t, β)$ , and twice continuously differentiable with respect to $β$ such that for any $1 \leq j, k \leq d$ , $\frac{\partial^{2}}{\partial β_{j} \partial β_{k}} η_{β} {(t) |}_{β = β_{o}}$ is bounded. Furthermore, for any $1 \leq j, k \leq d$ , $\frac{\partial^{2}}{\partial β_{j} \partial β_{k}} η_{β} (t)$ satisfies the equicontinuity condition:

$\forall ε > 0, \exists δ_{ε} > 0 : ∥ β_{1} - β_{o} ∥ < δ_{ε} ⟹ {∥ \frac{\partial^{2}}{\partial β_{j} \partial β_{k}} η_{β} |_{β = β_{1}} - \frac{\partial^{2}}{\partial β_{j} \partial β_{k}} η_{β} |_{β = β_{o}} ∥}_{\infty} < ε .$

Note that Conditions A, B2–B5 and B8–B9 were similarly used in [9]. Conditions B1 and B7 follow [10]. Condition B6 is due to the local p-th-degree polynomial regression estimation.

Proof of Lemma 1:

From Condition A1, we obtain

E {p_{1} (Y; X^{T} β + η_{β} (t)) w_{1} (X) ∣ T = t} = 0

and

E {p_{2} (Y; X^{T} β + η_{β} (t)) w_{1} (X) ∣ T = t} > 0

, i.e.,

\begin{matrix} g_{1} (t; t, β) = E {p_{1} (Y; X^{T} β + η_{β} (t)) w_{1} (X) ∣ T = t} & = & 0, \end{matrix}

(A1)

\begin{matrix} g_{2} (t; t, β) = E {p_{2} (Y; X^{T} β + η_{β} (t)) w_{1} (X) ∣ T = t} & > & 0 . \end{matrix}

(A2)

Define by

η_{β}^{(0, \dots, p)} (t) = {(η_{β} (t), η_{β}^{(1)} (t), \dots, η_{β}^{(p)} (t) / p!)}^{T}

the vector of

η_{β} (t)

along with re-scaled derivatives with respect to t up to the order p. Note that:

\begin{matrix} η_{i} (t; β) & = & X_{i}^{T} β + \sum_{k = 0}^{p} {(T_{i} - t)}^{k} \frac{η_{β}^{(k)} (t)}{k!} \\ = & X_{i}^{T} β + t_{i} {(t)}^{T} η_{β}^{(0, \dots, p)} (t) \\ = & X_{i}^{T} β + {H^{- 1} t_{i} (t)}^{T} H η_{β}^{(0, \dots, p)} (t) \\ = & X_{i}^{T} β + t_{i}^{*} {(t)}^{T} H η_{β}^{(0, \dots, p)} (t), \end{matrix}

where

H = diag {(1, h, \dots, h^{p})}

and

t_{i}^{*} (t) = H^{- 1} t_{i} (t) = {(1, (T_{i} - t) / h, \dots, {(T_{i} - t)}^{p} / h^{p})}^{T}

denotes the re-scaled

t_{i} (t)

. Then:

\begin{matrix} X_{i}^{T} β + t_{i} {(t)}^{T} a \\ = & X_{i}^{T} β + t_{i}^{*} {(t)}^{T} H a \\ = & X_{i}^{T} β + t_{i}^{*} {(t)}^{T} H η_{β}^{(0, \dots, p)} (t) + t_{i}^{*} {(t)}^{T} H {a - η_{β}^{(0, \dots, p)} (t)} \\ = & η_{i} (t; β) + t_{i}^{*} {(t)}^{T} H {a - η_{β}^{(0, \dots, p)} (t)} . \end{matrix}

Hence, we rewrite (16) as:

S_{n} (a; t, β) = \frac{1}{n} \sum_{i = 1}^{n} ρ_{q} (Y_{i}, F^{- 1} (η_{i} (t; β) + t_{i}^{*} {(t)}^{T} H {a - η_{β}^{(0, \dots, p)} (t)})) w_{1} (X_{i}) K_{h} (T_{i} - t) .

Therefore,

\hat{a} (t, β)

minimizing

S_{n} (a; t, β)

is equivalent to the one minimizing:

\begin{matrix} \frac{1}{n} \sum_{i = 1}^{n} {ρ_{q} (Y_{i}, F^{- 1} (η_{i} (t; β) + t_{i}^{*} {(t)}^{T} H {a - η_{β}^{(0, \dots, p)} (t)})) \\ - ρ_{q} (Y_{i}, F^{- 1} (η_{i} (t; β)))} w_{1} (X_{i}) K_{h} (T_{i} - t) \end{matrix}

with respect to

a

. It follows that

{\hat{a}}^{*} (t, β)

, defined by

{\hat{a}}^{*} (t, β) = \sqrt{n h} H {\hat{a} (t, β) - η_{β}^{(0, \dots, p)} (t)}

, minimizes:

\begin{matrix} G_{n} (a^{*}; t, β) = n h [\frac{1}{n} \sum_{i = 1}^{n} \{ρ_{q} (Y_{i}, F^{- 1} (η_{i} (t; β) + {a_{n} t_{i}^{*} {(t)}^{T} a^{*}})) - ρ_{q} (Y_{i}, F^{- 1} (η_{i} (t; β)))\} \\ w_{1} (X_{i}) K_{h} (T_{i} - t)] \end{matrix}

with respect to

a^{*} \in R^{p + 1}

, where

a_{n} = 1 / \sqrt{n h}

. Note that for any fixed

a^{*}

,

| t_{i}^{*} {(t)}^{T} a^{*} | \leq C

. By Taylor expansion,

\begin{matrix} G_{n} (a^{*}; t, β) & = & n h (a_{n} [\frac{1}{n} \sum_{i = 1}^{n} p_{1} (Y_{i}; η_{i} (t; β)) {t_{i}^{*} {(t)}^{T} a^{*}} w_{1} (X_{i}) K_{h} (T_{i} - t)] \\ + a_{n}^{2} \frac{1}{2} [\frac{1}{n} \sum_{i = 1}^{n} p_{2} (Y_{i}; η_{i} (t; β)) {t_{i}^{*} {(t)}^{T} a^{*}}^{2} w_{1} (X_{i}) K_{h} (T_{i} - t)] \\ + a_{n}^{3} \frac{1}{6} [\frac{1}{n} \sum_{i = 1}^{n} p_{3} (Y_{i}; η_{i}^{*} (t; β)) {t_{i}^{*} {(t)}^{T} a^{*}}^{3} w_{1} (X_{i}) K_{h} (T_{i} - t)]) \\ = & I_{n, 1} + I_{n, 2} + I_{n, 3}, \end{matrix}

where

η_{i}^{*} (t; β)

is located between

η_{i} (t; β)

and

η_{i} (t; β) + {a_{n} t_{i}^{*} {(t)}^{T} a^{*}}

. We notice that:

\begin{matrix} I_{n, 1} & \equiv & \sqrt{n h} W_{n} {(t, β)}^{T} a^{*}, \end{matrix}

where:

W_{n} (t, β) = \frac{1}{n} \sum_{i = 1}^{n} p_{1} (Y_{i}; η_{i} (t; β)) t_{i}^{*} (t) w_{1} (X_{i}) K_{h} (T_{i} - t);

also, Lemma A1 implies:

\begin{matrix} I_{n, 2} & = & n h a_{n}^{2} \frac{1}{2} {a^{*}}^{T} [\frac{1}{n} \sum_{i = 1}^{n} p_{2} (Y_{i}; η_{i} (t; β)) {t_{i}^{*} (t) t_{i}^{*} {(t)}^{T}} w_{1} (X_{i}) K_{h} (T_{i} - t)] a^{*} \\ = & \frac{1}{2} {a^{*}}^{T} S_{2} (t, β) a^{*} + o_{P} (1), \end{matrix}

where:

S_{2} (t, β) = g_{2} (t; t, β) f_{T} (t) S ≻ 0

by (A2), Condition B2 and B5; and (by using

X_{n} = O_{P} (E (| X_{n} |))

):

\begin{matrix} I_{n, 3} & \leq & C O_{P} (n h a_{n}^{3}) = O_{P} (1 / \sqrt{n h}) = o_{P} (1) . \end{matrix}

Then:

\begin{matrix} G_{n} (a^{*}; t, β) = \sqrt{n h} W_{n} {(t, β)}^{T} a^{*} + \frac{1}{2} {a^{*}}^{T} S_{2} (t, β) a^{*} + o_{P} (1), \end{matrix}

where

{a^{*}}^{T} S_{2} (t, β) a^{*} = ({a^{*}}^{T} S a^{*}) g_{2} (t; t, β) f_{T} (t)

is continuous in

t \in T_{0}

by B3 and B5.

We now examine

W_{n} (t, β)

. Note that:

\begin{matrix} var {W_{n} (t, β)} & = & \frac{1}{n} var {p_{1} (Y_{i}; η_{i} (t; β)) t_{i}^{*} (t) w_{1} (X_{i}) K_{h} (T_{i} - t)} \\ \leq & \frac{1}{n} E [p_{1}^{2} (Y_{i}; η_{i} (t; β)) {t_{i}^{*} (t) t_{i}^{*} {(t)}^{T}} w_{1}^{2} (X_{i}) {K_{h} (T_{i} - t)}^{2}] \\ \leq & \frac{C}{n} E [\frac{1}{h^{2}} {\{K (\frac{T_{i} - t}{h})\}}^{2}] \\ = & \frac{C}{n h} . \end{matrix}

To evaluate

E {W_{n} (t, β)}

, it is easy to see that for each

j \in {0, 1, \dots, p}

,

\begin{matrix} e_{j + 1, p + 1}^{T} E {W_{n} (t, β)} & = & E {p_{1} (Y_{i}; η_{i} (t; β)) e_{j + 1, p + 1}^{T} t_{i}^{*} (t) w_{1} (X_{i}) K_{h} (T_{i} - t)} \\ = & E \{p_{1} (Y_{i}; η_{i} (t; β)) {(\frac{T_{i} - t}{h})}^{j} w_{1} (X_{i}) K_{h} (T_{i} - t)\} \\ = & E [E {p_{1} (Y_{i}; η_{i} (t; β)) w_{1} (X_{i}) ∣ T_{i}} {(\frac{T_{i} - t}{h})}^{j} K_{h} (T_{i} - t)] \\ = & E \{g_{1} (T_{i}; t, β) {(\frac{T_{i} - t}{h})}^{j} K_{h} (T_{i} - t)\} \\ = & \int g_{1} (y; t, β) {(\frac{y - t}{h})}^{j} \frac{1}{h} K (\frac{y - t}{h}) f_{T} (y) d y \\ = & \int g_{1} (t + h x; t, β) x^{j} K (x) f_{T} (t + h x) d x . \end{matrix}

Note that by Taylor expansion,

\begin{matrix} η_{β} (t + h x) = \sum_{k = 0}^{p} {(h x)}^{k} \frac{η_{β}^{(k)} (t)}{k!} + {(h x)}^{p + 1} \frac{η_{β}^{(p + 1)} (t)}{(p + 1)!} + o (h^{p + 1}) . \end{matrix}

This combined with the facts (A1) and (A2) give that:

\begin{matrix} g_{1} (t + h x; t, β) \\ = & E \{p_{1} (Y; X^{T} β + \sum_{k = 0}^{p} {(h x)}^{k} \frac{η_{β}^{(k)} (t)}{k!}) w_{1} (X) | T = t + h x\} \\ = & E [p_{1} (Y; X^{T} β + η_{β} (t + h x)) w_{1} (X) \\ + p_{2} (Y; X^{T} β + η_{β} (t + h x)) \{\sum_{k = 0}^{p} {(h x)}^{k} \frac{η_{β}^{(k)} (t)}{k!} - η_{β} (t + h x)\} w_{1} (X) | T = t + h x] \\ + o (h^{p + 1}) \\ = & g_{1} (t + h x; t + h x, β) - {(h x)}^{p + 1} \frac{η_{β}^{(p + 1)} (t)}{(p + 1)!} g_{2} (t + h x; t + h x, β) + o (h^{p + 1}) \\ = & - {(h x)}^{p + 1} \frac{η_{β}^{(p + 1)} (t)}{(p + 1)!} g_{2} (t + h x; t + h x, β) + o (h^{p + 1}) . \end{matrix}

Thus, using the continuity of

g_{2} (t; t, β)

and

f_{T} (t)

in t, we obtain:

\begin{matrix} E {W_{n} (t, β)} = - c_{p} \frac{η_{β}^{(p + 1)} (t)}{(p + 1)!} g_{2} (t; t, β) f_{T} (t) h^{p + 1} + o (h^{p + 1}) \end{matrix}

uniformly in

(t, β)

. Thus, we conclude that

\sqrt{n h} W_{n} (t, β) = O_{P} (1)

when

n h^{2 p + 3} = O (1)

.

By Lemma A2,

\begin{matrix} \sup_{a^{*} \in Θ, t \in T_{0}, β \in K} | G_{n} (a^{*}; t, β) - \sqrt{n h} W_{n} {(t, β)}^{T} a^{*} - \frac{1}{2} {a^{*}}^{T} S_{2} (t, β) a^{*} | = o_{P} (1) . \end{matrix}

This along with Lemma A.1 of [18] yields:

\begin{matrix} \sup_{t \in T_{0}, β \in K} ∥ {\hat{a}}^{*} (t, β) + {S_{2} (t, β)}^{- 1} \sqrt{n h} W_{n} (t, β) ∥ = o_{P} (1), \end{matrix}

the first entry of which satisfies:

\begin{matrix} \sup_{t \in T_{0}, β \in K} | \sqrt{n h} {{\hat{η}}_{β} (t) - η_{β} (t)} + e_{1, p + 1}^{T} {S_{2} (t, β)}^{- 1} \sqrt{n h} W_{n} (t, β) | = o_{P} (1), \end{matrix}

namely,

\sup_{t \in T_{0}, β \in K} | {\hat{η}}_{β} (t) - η_{β} (t) + e_{1, p + 1}^{T} {S_{2} (t, β)}^{- 1} W_{n} (t, β) | = o_{P} (1 / \sqrt{n h})

. By [31],

\sup_{t \in T_{0}, β \in K} ∥ W_{n} (t, β) - E {W_{n} (t, β)} ∥ = O_{P} ({\frac{\log (1 / h)}{n h}}^{1 / 2})

. Furthermore,

{S_{2} (t, β)}^{- 1} E {W_{n} (t, β)} = - S^{- 1} c_{p} \frac{η_{β}^{(p + 1)} (t)}{(p + 1)!} h^{p + 1} + o (h^{p + 1})

uniformly in

(t, β)

. Therefore,

\begin{matrix} \sup_{t \in T_{0}, β \in K} | {\hat{η}}_{β} (t) - η_{β} (t) - e_{1, p + 1}^{T} S^{- 1} c_{p} \frac{η_{β}^{(p + 1)} (t)}{(p + 1)!} h^{p + 1} | = o_{P} (1) . \end{matrix}

This yields:

\begin{matrix} \sup_{β \in K} \sup_{t \in T_{0}} | {\hat{η}}_{β} (t) - η_{β} (t) - e_{1, p + 1}^{T} S^{- 1} c_{p} \frac{η_{β}^{(p + 1)} (t)}{(p + 1)!} h^{p + 1} | \\ \leq & \sup_{t \in T_{0}, β \in K} | {\hat{η}}_{β} (t) - η_{β} (t) - e_{1, p + 1}^{T} S^{- 1} c_{p} \frac{η_{β}^{(p + 1)} (t)}{(p + 1)!} h^{p + 1} | = o_{P} (1) . \end{matrix}

Note that for

p = 1

,

e_{1, p + 1}^{T} S^{- 1} c_{p} = μ_{2} (K)

. This completes the proof. ☐

Lemma A1.

Assume Condition

B

in the Appendix. If

n \to \infty

,

h \to 0

and

n h \to \infty

, then for given

t \in T_{0}

an

β \in K

,

\frac{1}{n} \sum_{i = 1}^{n} p_{2} (Y_{i}; η_{i} (t; β)) {t_{i}^{*} (t) t_{i}^{*} {(t)}^{T}} w_{1} (X_{i}) K_{h} (T_{i} - t) = S_{2} (t, β) + o_{P} (1),

where

S_{2} (t, β) = g_{2} (t; t, β) f_{T} (t) S,

with

S = {(μ_{j + k - 2} (K))}_{1 \leq j, k \leq p + 1}

and

μ_{j} (K) = \int u^{j} K (u) d u

,

j = 0, 1, \dots, 2 p

.

Proof.

Recall the

(p + 1) \times (p + 1)

matrix

t_{i}^{*} (t) t_{i}^{*} {(t)}^{T} = {({(\frac{T_{i} - t}{h})}^{j + k - 2})}_{1 \leq j, k \leq p + 1} .

Set

X_{j} = \frac{1}{n} \sum_{i = 1}^{n} p_{2} (Y_{i}; η_{i} (t; β)) {(\frac{T_{i} - t}{h})}^{j} w_{1} (X_{i}) K_{h} (T_{i} - t)

for

j = 0, 1, \dots, 2 p

. We observe that:

\begin{matrix} E (X_{j}) & = & \frac{1}{n} \sum_{i = 1}^{n} E [E {p_{2} (Y_{i}; η_{i} (t; β)) w_{1} (X_{i}) ∣ T_{i}} {(\frac{T_{i} - t}{h})}^{j} K_{h} (T_{i} - t)] \\ = & \frac{1}{n} \sum_{i = 1}^{n} E \{g_{2} (T_{i}; t, β) {(\frac{T_{i} - t}{h})}^{j} K_{h} (T_{i} - t)\} \\ = & E \{g_{2} (T; t, β) {(\frac{T - t}{h})}^{j} K_{h} (T - t)\} \\ = & \int g_{2} (y; t, β) {(\frac{y - t}{h})}^{j} \frac{1}{h} K (\frac{y - t}{h}) f_{T} (y) d y \\ = & \int g_{2} (t + h x; t, β) x^{j} K (x) f_{T} (t + h x) d x \\ = & g_{2} (t; t, β) f_{T} (t) μ_{j} (K) + o (1), \end{matrix}

using the continuity of

g_{2} (τ; t, β)

in

τ

and

f_{T} (t)

in t. Similarly,

\begin{matrix} var (X_{j}) & = & \frac{1}{n^{2}} \sum_{i = 1}^{n} var \{p_{2} (Y_{i}; η_{i} (t; β)) {(\frac{T_{i} - t}{h})}^{j} w_{1} (X_{i}) K_{h} (T_{i} - t)\} \\ \leq & \frac{1}{n^{2}} \sum_{i = 1}^{n} E [p_{2}^{2} (Y_{i}; η_{i} (t; β)) {(\frac{T_{i} - t}{h})}^{2 j} w_{1}^{2} (X_{i}) {K_{h} (T_{i} - t)}^{2}] \\ \leq & \frac{C}{n h} . \end{matrix}

This completes the proof. ☐

Lemma A2.

Assume Condition

B

. If

n \to \infty

,

h \to 0

,

n h \to \infty

,

\log (1 / h) / (n h) \to 0

, then

\sup_{a^{*} \in Θ, t \in T_{0}, β \in K} | G_{n} (a^{*}; t, β) - \sqrt{n h} W_{n} {(t, β)}^{T} a^{*} - 2^{- 1} {a^{*}}^{T} S_{2} (t, β) a^{*} | = o_{P} (1)

, with a compact set

Θ \subseteq R^{p + 1}

.

Proof.

Let

D_{n} (a^{*}; t, β) = G_{n} (a^{*}; t, β) - \sqrt{n h} W_{n} {(t, β)}^{T} a^{*}

. Note that:

\begin{matrix} D_{n} (a^{*}; t, β) \\ = & n h [\frac{1}{n} \sum_{i = 1}^{n} ρ_{q} (Y_{i}, F^{- 1} (η_{i} (t; β) + {a_{n} t_{i}^{*} {(t)}^{T} a^{*}})) w_{1} (X_{i}) K_{h} (T_{i} - t) \\ - \frac{1}{n} \sum_{i = 1}^{n} ρ_{q} (Y_{i}, F^{- 1} (η_{i} (t; β))) w_{1} (X_{i}) K_{h} (T_{i} - t) \\ - \frac{1}{n} \sum_{i = 1}^{n} p_{1} (Y_{i}; η_{i} (t; β)) {a_{n} t_{i}^{*} {(t)}^{T} a^{*}} w_{1} (X_{i}) K_{h} (T_{i} - t)] \\ = & \frac{1}{2} {a^{*}}^{T} [\frac{1}{n} \sum_{i = 1}^{n} p_{2} (Y_{i}; {\tilde{η}}_{i} (t; β)) {t_{i}^{*} (t) t_{i}^{*} {(t)}^{T}} w_{1} (X_{i}) K_{h} (T_{i} - t)] a^{*}, \end{matrix}

where

a_{n} = 1 / \sqrt{n h}

and

{\tilde{η}}_{i} (t; β)

is between

η_{i} (t; β)

and

η_{i} (t; β) + {a_{n} t_{i}^{*} {(t)}^{T} a^{*}}

. Then:

\begin{matrix} | D_{n} (a^{*}; t, β) - 2^{- 1} {a^{*}}^{T} S_{2} (t, β) a^{*} | \\ = & \frac{1}{2} | {a^{*}}^{T} [\frac{1}{n} \sum_{i = 1}^{n} p_{2} (Y_{i}; {\tilde{η}}_{i} (t; β)) {t_{i}^{*} (t) t_{i}^{*} {(t)}^{T}} w_{1} (X_{i}) K_{h} (T_{i} - t) - S_{2} (t, β)] a^{*} | \\ \leq & {∥ a^{*} ∥}^{2} | \frac{1}{n} \sum_{i = 1}^{n} p_{2} (Y_{i}; {\tilde{η}}_{i} (t; β)) {t_{i}^{*} (t) t_{i}^{*} {(t)}^{T}} w_{1} (X_{i}) K_{h} (T_{i} - t) - S_{2} (t, β) | . \end{matrix}

The proof completes by applying [31]. ☐

Proof of Theorem 1.

Before showing Theorem 1, we need Proposition A1 (whose proof is omitted), where the following notation will be used. Denote by

C^{1} (T)

the set of continuously differentiable functions in

T

. Let

V (β)

denote the neighborhood of

β \in K

. Let

H_{δ} (β)

denote the neighborhood of

η_{β}

such that

V (β) \subseteq K

and

H_{δ} (β) = {u \in C^{1} (T) : {∥ u - η_{β} ∥}_{\infty} \leq δ, {∥ \frac{\partial}{\partial t} u - \frac{\partial}{\partial t} η_{β} ∥}_{\infty} \leq δ} .

Proposition A1.

Let

{(Y_{i}, X_{i}, T_{i})}_{i = 1}^{n}

be independent observations of

(Y, X, T)

modeled by (2) and (5). Assume that a random variable T is distributed on

T

. Let

K

and

H_{1} (β)

be compact sets,

g (\cdot; \cdot) : R^{2} \to R

be a continuous and bounded function,

W (x, t) : R^{d + 1} \to R

be such that

E {| W (X, T) |} < \infty

and

η_{β} (t) = η (t, β) : R^{d + 1} \to R

be a continuous function of

(t, β)

. Then:

(i): $E {g (Y; X^{T} θ + v (T)) W (X, T)} \to E {g (Y; X^{T} β + η_{β} (T)) W (X, T)}$ as $∥ θ - β ∥ + {∥ v - η_{β} ∥}_{\infty} \to 0$ ;
(ii): $\sup_{θ \in K} | n^{- 1} \sum_{i = 1}^{n} g (Y_{i}; X_{i}^{T} θ + η_{θ} (T_{i})) W (X, T) - E {g (Y; X^{T} θ + η_{θ} (T)) W (X, T)} \overset{P}{⟶} 0$ as $n \to \infty$ ;
(iii): if, in addition, $T$ is compact and $η_{β} \in C^{1} (T)$ , then $\sup_{θ \in K, v \in H_{1} (β)}$
$| n^{- 1} \sum_{i = 1}^{n} g (Y_{i}; X_{i}^{T} θ + v (T_{i})) W (X_{i}, T_{i}) - E {g (Y; X^{T} θ + v (T)) W (X, T)} | \overset{P}{⟶} 0$ as $n \to \infty$ .

For part (i), we first show that for any compact set

K

in

R^{d}

,

\begin{matrix} \sup_{β \in K} | J_{n} (β, {\hat{η}}_{β}) - J (β, η_{β}) | \overset{P}{⟶} 0 . \end{matrix}

(A3)

It suffices to show

\sup_{β \in K} | J_{n} (β, η_{β}) - J (β, η_{β}) | \overset{P}{⟶} 0

, which follows from Proposition A1 (ii), and:

\begin{matrix} \sup_{β \in K} | J_{n} (β, {\hat{η}}_{β}) - J_{n} (β, η_{β}) | & \overset{P}{⟶} & 0 . \end{matrix}

(A4)

To show (A4), we note that for any

ε > 0

, let

T_{0}

be a compact set such that

P (T_{i} \notin T_{0}) < ε

. Then:

\begin{matrix} J_{n} (β, {\hat{η}}_{β}) - J_{n} (β, η_{β}) \\ = & \frac{1}{n} \sum_{i = 1}^{n} {ρ_{q} (Y_{i}, F^{- 1} (X_{i}^{T} β + {\hat{η}}_{β} (T_{i}))) - ρ_{q} (Y_{i}, F^{- 1} (X_{i}^{T} β + η_{β} (T_{i})))} w_{2} (X_{i}) I (T_{i} \in T_{0}) \\ + \frac{1}{n} \sum_{i = 1}^{n} {ρ_{q} (Y_{i}, F^{- 1} (X_{i}^{T} β + {\hat{η}}_{β} (T_{i}))) - ρ_{q} (Y_{i}, F^{- 1} (X_{i}^{T} β + η_{β} (T_{i})))} w_{2} (X_{i}) I (T_{i} \notin T_{0}) . \end{matrix}

For

T_{i} \in T_{0}

, by the mean-value theorem,

\begin{matrix} | ρ_{q} (Y_{i}, F^{- 1} (X_{i}^{T} β + {\hat{η}}_{β} (T_{i}))) - ρ_{q} (Y_{i}, F^{- 1} (X_{i}^{T} β + η_{β} (T_{i}))) | \\ = & | p_{1} (Y_{i}; X_{i}^{T} β + η_{i, β}^{*}) {{\hat{η}}_{β} (T_{i}) - η_{β} (T_{i})} | \\ \leq & {∥ p_{1} (\cdot; \cdot) ∥}_{\infty} \sup_{β \in K} {∥ {\hat{η}}_{β} - η_{β} ∥}_{T_{0}; \infty}, \end{matrix}

where

η_{i, β}^{*}

is located between

{\hat{η}}_{β} (T_{i})

and

η_{β} (T_{i})

. For

T_{i} \notin T_{0}

, it follows that:

\begin{matrix} | ρ_{q} (Y_{i}, F^{- 1} (X_{i}^{T} β + {\hat{η}}_{β} (T_{i}))) - ρ_{q} (Y_{i}, F^{- 1} (X_{i}^{T} β + η_{β} (T_{i}))) | \\ \leq & 2 {∥ ρ_{q} (\cdot, \cdot) ∥}_{\infty} . \end{matrix}

Hence,

\begin{matrix} | J_{n} (β, {\hat{η}}_{β}) - J_{n} (β, η_{β}) | & \leq & \{{∥ p_{1} (\cdot; \cdot) ∥}_{\infty} \sup_{β \in K} {∥ {\hat{η}}_{β} - η_{β} ∥}_{T_{0}; \infty} + 2 {∥ ρ_{q} (\cdot, \cdot) ∥}_{\infty} T_{n}^{*}\} {∥ w_{2} ∥}_{\infty} \\ \leq & 2 ε, \end{matrix}

where the last inequality is entailed by Lemma 1 and the law of large numbers for

T_{n}^{*} = n^{- 1} \sum_{i = 1}^{n} I (T_{i} \notin T_{0})

. This completes the proof of (A3). The proof of

\hat{β} \overset{P}{⟶} β_{o}

follows from combining Lemma A-1 of [1] with (A3) and Condition A2.

Part (ii) follows from Lemma 1, Part (i) and Condition B5 for

η_{β} (t)

. ☐

Proof of Theorem 2.

Similar to the proof of Lemma 1, it can be shown that

| {\hat{η}}_{β} (t) - η_{β} (t) + e_{1, p + 1}^{T} {S_{2} (t, β)}^{- 1} \frac{1}{n} \sum_{i = 1}^{n} p_{1} (Y_{i}; η_{i} (t; β)) t_{i}^{*} (t) w_{1} (X_{i}) K_{h} (T_{i} - t) | = O_{P} (h^{2} a_{n} + a_{n}^{2} \sqrt{\log (1 / h)}) .

Note that for

p = 1

,

\begin{matrix} e_{1, p + 1}^{T} {S_{2} (t, β)}^{- 1} t_{i}^{*} (t) & = & \frac{1}{g_{2} (t; t, β) f_{T} (t)} (1, 0) (\begin{matrix} 1 & 0 \\ 0 & 1 / μ_{2} (K) \end{matrix}) (\begin{matrix} 1 \\ (T_{i} - t) / h \end{matrix}) \\ = & \frac{1}{g_{2} (t; t, β) f_{T} (t)} . \end{matrix}

Thus:

| {\hat{η}}_{β} (t) - η_{β} (t) + \frac{1}{n f_{T} (t) g_{2} (t; t, β)} \sum_{i = 1}^{n} p_{1} (Y_{i}; η_{i} (t; β)) w_{1} (X_{i}) K_{h} (T_{i} - t) | = O_{P} (h^{2} a_{n} + a_{n}^{2} \sqrt{\log (1 / h)}) .

Consider

\hat{β}

defined in (23). Note that:

\begin{matrix} X_{i}^{T} β + {\hat{η}}_{β} (T_{i}) & = & X_{i}^{T} β_{o} + X_{i}^{T} (β - β_{o}) + {\hat{η}}_{(β - β_{o}) + β_{o}} (T_{i}) \\ = & X_{i}^{T} β_{o} + c_{n} X_{i}^{T} {\sqrt{n} (β - β_{o})} + {\hat{η}}_{c_{n} {\sqrt{n} (β - β_{o})} + β_{o}} (T_{i}), \end{matrix}

where

c_{n} = 1 / \sqrt{n}

. Then,

\hat{θ} = \sqrt{n} (\hat{β} - β_{o})

minimizes:

\begin{matrix} J_{n} (θ) & = & n [\frac{1}{n} \sum_{i = 1}^{n} {ρ_{q} (Y_{i}, F^{- 1} (X_{i}^{T} β_{o} + c_{n} X_{i}^{T} θ + {\hat{η}}_{c_{n} θ + β_{o}} (T_{i}))) w_{2} (X_{i}) \\ - ρ_{q} (Y_{i}, F^{- 1} (X_{i}^{T} β_{o} + {\hat{η}}_{β_{o}} (T_{i}))) w_{2} (X_{i})}] \end{matrix}

with respect to

θ

. By Taylor expansion,

\begin{matrix} J_{n} (θ) \\ = & n (\frac{1}{n} \sum_{i = 1}^{n} p_{1} (Y_{i}; X_{i}^{T} β_{o} + {\hat{η}}_{β_{o}} (T_{i})) [c_{n} X_{i}^{T} θ + {{\hat{η}}_{c_{n} θ + β_{o}} (T_{i}) - {\hat{η}}_{β_{o}} (T_{i})}] w_{2} (X_{i}) \\ + \frac{1}{2 n} \sum_{i = 1}^{n} p_{2} (Y_{i}; X_{i}^{T} β_{o} + {\hat{η}}_{β_{o}} (T_{i})) {[c_{n} X_{i}^{T} θ + {{\hat{η}}_{c_{n} θ + β_{o}} (T_{i}) - {\hat{η}}_{β_{o}} (T_{i})}]}^{2} w_{2} (X_{i}) \\ + \frac{1}{6 n} \sum_{i = 1}^{n} p_{3} (Y_{i}; η_{i}^{*}) {[c_{n} X_{i}^{T} θ + {{\hat{η}}_{c_{n} θ + β_{o}} (T_{i}) - {\hat{η}}_{β_{o}} (T_{i})}]}^{3} w_{2} (X_{i})) \\ = & I_{n, 1} + I_{n, 2} + I_{n, 3}, \end{matrix}

where

η_{i}^{*}

is located between

X_{i}^{T} β_{o} + {\hat{η}}_{β_{o}} (T_{i})

and

X_{i}^{T} β_{o} + c_{n} X_{i}^{T} θ + {\hat{η}}_{c_{n} θ + β_{o}} (T_{i})

,

\begin{matrix} I_{n, 1} & = & \sum_{i = 1}^{n} p_{1} (Y_{i}; X_{i}^{T} β_{o} + {\hat{η}}_{β_{o}} (T_{i})) \{c_{n} X_{i}^{T} θ + {\frac{\partial {\hat{η}}_{β} (T_{i})}{\partial β}}^{T} |_{β = β_{n}} c_{n} θ\} w_{2} (X_{i}) \\ = & \frac{1}{\sqrt{n}} \sum_{i = 1}^{n} p_{1} (Y_{i}; X_{i}^{T} β_{o} + {\hat{η}}_{β_{o}} (T_{i})) {\{X_{i} + \frac{\partial {\hat{η}}_{β} (T_{i})}{\partial β} |_{β = β_{n}}\}}^{T} θ w_{2} (X_{i}) \\ = & \frac{1}{\sqrt{n}} \sum_{i = 1}^{n} p_{1} (Y_{i}; X_{i}^{T} β_{o} + {\hat{η}}_{β_{o}} (T_{i})) {\{X_{i} + \frac{\partial η_{β} (T_{i})}{\partial β} |_{β = β_{o}}\}}^{T} θ w_{2} (X_{i}) + o_{P} (1), \\ I_{n, 2} & = & \frac{1}{2} θ^{T} [\frac{1}{n} \sum_{i = 1}^{n} p_{2} (Y_{i}; X_{i}^{T} β_{o} + {\hat{η}}_{β_{o}} (T_{i})) \\ \{X_{i} + \frac{\partial {\hat{η}}_{β} (T_{i})}{\partial β} |_{β = β_{n}}\} {\{X_{i} + \frac{\partial {\hat{η}}_{β} (T_{i})}{\partial β} |_{β = β_{n}}\}}^{T} w_{2} (X_{i})] θ \\ = & \frac{1}{2} θ^{T} B_{2} θ + o_{P} (1), \\ I_{n, 3} & = & o_{P} (1), \end{matrix}

with

β_{n}

located between

β_{o}

and

c_{n} θ + β_{o}

, and

B_{2} = H_{0}

following Lemma 1, Condition A3 and Proposition A1. Thus:

\begin{matrix} J_{n} (θ) = {I_{n, 1}^{*}}^{T} θ + \frac{1}{2} θ^{T} B_{2} θ + o_{P} (1), \end{matrix}

(A5)

where

I_{n, 1}^{*} = \frac{1}{\sqrt{n}} \sum_{i = 1}^{n} p_{1} (Y_{i}; X_{i}^{T} β_{o} + {\hat{η}}_{β_{o}} (T_{i})) {X_{i} + \frac{\partial η_{β} (T_{i})}{\partial β} |_{β = β_{o}}} w_{2} (X_{i})

. Note that:

\begin{matrix} I_{n, 1}^{*} & = & \frac{1}{\sqrt{n}} \sum_{i = 1}^{n} [p_{1} (Y_{i}; X_{i}^{T} β_{o} + η_{β_{o}} (T_{i})) \{X_{i} + \frac{\partial η_{β} (T_{i})}{\partial β} |_{β = β_{o}}\} w_{2} (X_{i}) \\ + p_{2} (Y_{i}; X_{i}^{T} β_{o} + η_{β_{o}} (T_{i})) \{X_{i} + \frac{\partial η_{β} (T_{i})}{\partial β} |_{β = β_{o}}\} w_{2} (X_{i}) {{\hat{η}}_{β_{o}} (T_{i}) - η_{β_{o}} (T_{i})} \\ + \frac{1}{2} p_{3} (Y_{i}; η_{i}^{* *}) \{X_{i} + \frac{\partial η_{β} (T_{i})}{\partial β} |_{β = β_{o}}\} w_{2} (X_{i}) {{\hat{η}}_{β_{o}} (T_{i}) - η_{β_{o}} (T_{i})}^{2}] \\ = & T_{n, 1} + T_{n, 2} + T_{n, 3}, \end{matrix}

where

η_{i}^{* *}

is between

X_{i}^{T} β_{o} + {\hat{η}}_{β_{o}} (T_{i})

and

X_{i}^{T} β_{o} + η_{β_{o}} (T_{i})

,

\begin{matrix} T_{n, 3} & = & o_{P} (1), \\ T_{n, 2} & = & \frac{1}{\sqrt{n}} \sum_{i = 1}^{n} p_{2} (Y_{i}; X_{i}^{T} β_{o} + η_{β_{o}} (T_{i})) \{X_{i} + \frac{\partial η_{β} (T_{i})}{\partial β} |_{β = β_{o}}\} w_{2} (X_{i}) \\ \times \frac{(- 1)}{n f_{T} (T_{i}) g_{2} (T_{i}; T_{i}, β_{o})} \sum_{j = 1}^{n} p_{1} (Y_{j}; η_{j} (T_{i}; β_{o})) w_{1} (X_{j}) K_{h} (T_{j} - T_{i}) \\ = & - \frac{1}{\sqrt{n}} \sum_{j = 1}^{n} \frac{p_{1} (Y_{j}; η_{j}) w_{1} (X_{j})}{g_{2} (T_{j}; T_{j}, β_{o})} E [p_{2} (Y_{j}; η_{j}) \{X_{j} + \frac{\partial η_{β} (T_{j})}{\partial β} |_{β = β_{o}}\} w_{2} (X_{j}) | T_{j}] \\ \equiv & - \frac{1}{\sqrt{n}} \sum_{j = 1}^{n} p_{1} (Y_{j}; η_{j}) \frac{γ (T_{j})}{g_{2} (T_{j}; T_{j}, β_{o})} w_{1} (X_{j}), \end{matrix}

with:

γ (t) = E [p_{2} (Y; η (X, T)) \{X + \frac{\partial η_{β} (T)}{\partial β} |_{β = β_{o}}\} w_{2} (X) | T = t] .

Therefore,

\begin{matrix} I_{n, 1}^{*} = \frac{1}{\sqrt{n}} \sum_{i = 1}^{n} p_{1} (Y_{i}; η_{i}) [\{X_{i} + \frac{\partial η_{β} (T_{i})}{\partial β} |_{β = β_{o}}\} w_{2} (X_{i}) - \frac{γ (T_{i})}{g_{2} (T_{i}; T_{i}, β_{o})} w_{1} (X_{i})] + o_{P} (1) . \end{matrix}

By the central limit theorem,

\begin{matrix} I_{n, 1}^{*} \overset{D}{⟶} N (0, Ω_{0}^{*}), \end{matrix}

(A6)

where:

\begin{matrix} Ω_{0}^{*} & = & E (p_{1}^{2} (Y; η (X, T)) [\{X + \frac{\partial η_{β} (T)}{\partial β} |_{β = β_{o}}\} w_{2} (X) - \frac{γ (T)}{g_{2} (T; T, β_{o})} w_{1} (X)] \\ {[\{X + \frac{\partial η_{β} (T)}{\partial β} |_{β = β_{o}}\} w_{2} (X) - \frac{γ (T)}{g_{2} (T; T, β_{o})} w_{1} (X)]}^{T}) . \end{matrix}

From (A5) and (A6),

\hat{θ} = - B_{2}^{- 1} I_{n, 1}^{*} + o_{P} (1)

. This implies that

\sqrt{n} (\hat{β} - β_{o}) \overset{D}{⟶} N (0, H_{0}^{- 1} Ω_{0}^{*} H_{0}^{- 1})

. ☐

Proof of Theorem 3.

Denote

V_{0} = H_{0}^{- 1} Ω_{0}^{*} H_{0}^{- 1}

and

{\hat{V}}_{n} = {\hat{H}}_{0}^{- 1} {\hat{Ω}}_{0}^{*} {\hat{H}}_{0}^{- 1}

. Note that

A \hat{β} - g_{0} = A (\hat{β} - β_{o}) + (A β_{o} - g_{0})

. Thus:

\begin{matrix} {(A {\hat{V}}_{n} A^{T})}^{- 1 / 2} \sqrt{n} (A \hat{β} - g_{0}) \\ = & {(A {\hat{V}}_{n} A^{T})}^{- 1 / 2} {A \sqrt{n} (\hat{β} - β_{o})} + {(A {\hat{V}}_{n} A^{T})}^{- 1 / 2} {\sqrt{n} (A β_{o} - g_{0})} \\ \equiv & I_{1} + I_{2}, \end{matrix}

which implies that

W_{n} = {∥ I_{1} + I_{2} ∥}^{2}

. Arguments for Theorem 2 give

I_{1} \overset{D}{⟶} N (0, I_{k})

. Under

H_{0}

in (4),

I_{2} \equiv 0

and thus

(I_{1} + I_{2}) \overset{D}{⟶} N (0, I_{k})

, which completes the proof. ☐

Proof of Theorem 4.

Follow the notation and proof in Theorem 3. Under

H_{1 n}

in (29),

I_{2} \overset{P}{⟶} {(A V_{0} A^{T})}^{- 1 / 2} c

and thus

(I_{1} + I_{2}) \overset{D}{⟶} N ({(A V_{0} A^{T})}^{- 1 / 2} c, I_{k})

. This completes the proof. ☐

Proof of Theorem 5.

Following the notation and proof in Theorem 3,

W_{n} = {∥ I_{1} ∥}^{2} + 2 I_{1}^{T} I_{2} + {∥ I_{2} ∥}^{2}

. We see that

{∥ I_{1} ∥}^{2} \overset{D}{⟶} χ_{k}^{2}

. Under

H_{1}

in (4),

I_{2} = {(A V_{0} A^{T})}^{- 1 / 2} \sqrt{n} (A β_{o} - g_{0}) {1 + o_{P} (1)}

, which means

{∥ I_{2} ∥}^{2} = n {(A β_{o} - g_{0})}^{T} {(A V_{0} A^{T})}^{- 1} (A β_{o} - g_{0}) {1 + o_{P} (1)}

and thus

I_{1}^{T} I_{2} = O_{P} (\sqrt{n})

. Hence,

n^{- 1} W_{n} \geq λ_{\min} {{(A V_{0} A^{T})}^{- 1}} {∥ A β_{o} - g_{0} ∥}^{2} + o_{P} (1)

. This completes the proof. ☐

Proof of Theorem 6.

Denote

J_{n} (β) = J_{n} (β, {\hat{η}}_{β})

. For the matrix

A

in (4), there exists a

(d - k) \times d

matrix B satisfying

B B^{T} = I_{d - k}

and

A B^{T} = 0

. Therefore,

A β = g_{0}

is equivalent to

β = B^{T} γ + b_{0}

for some vector

γ \in R^{d - k}

and

b_{0} = A^{T} {(A A^{T})}^{- 1} g_{0}

. Then, minimizing

J_{n} (β)

subject to

A β = g_{0}

is equivalent to minimizing

J_{n} (B^{T} γ + b_{0})

with respect to

γ

, and we denote by

\hat{γ}

the minimizer. Furthermore, under

H_{0}

in (4), we have

β_{o} = B^{T} γ_{0} + b_{0}

for

γ_{0} = B β_{o}

, and

\hat{γ} - γ_{0} \overset{P}{⟶} 0

.

For Part (i), using the Taylor expansion around

\hat{β}

, we get:

J_{n} (B^{T} \hat{γ} + b_{0}) - J_{n} (\hat{β}) = \frac{1}{2 n} {\sqrt{n} (B^{T} \hat{γ} + b_{0} - \hat{β})}^{T} J_{n}^{″} (\tilde{β}) {\sqrt{n} (B^{T} \hat{γ} + b_{0} - \hat{β})},

(A7)

where

\tilde{β}

is between

B^{T} \hat{γ} + b_{0}

and

\hat{β}

. We now discuss

B^{T} \hat{γ} + b_{0} - \hat{β}

. From the proof in Theorem 2,

(\hat{β} - β_{o}) = - H_{0}^{- 1} J_{n}^{'} (β_{o}) {1 + o_{P} (1)}

, where

J_{n}^{'} (β_{o}) = {I_{n, 1}^{*} + o_{P} (1)} / \sqrt{n}

. Similar arguments deduce

\hat{γ} - γ_{0} = - {(B H_{0} B^{T})}^{- 1} B J_{n}^{'} (β_{o}) {1 + o_{P} (1)} .

Thus, under

H_{0}

in (4),

B^{T} \hat{γ} + b_{0} - \hat{β} = B^{T} (\hat{γ} - γ_{0}) - (\hat{β} - β_{o}) = H_{0}^{- 1 / 2} P_{H_{0}^{- 1 / 2} A^{T}} H_{0}^{- 1 / 2} J_{n}^{'} (β_{o}) {1 + o_{P} (1)},

and thus by (A6),

\sqrt{n} (B^{T} \hat{γ} + b_{0} - \hat{β}) \overset{D}{⟶} H_{0}^{- 1 / 2} P_{H_{0}^{- 1 / 2} A^{T}} H_{0}^{- 1 / 2} {Ω_{0}^{*}}^{1 / 2} Z,

(A8)

where

Z = {(Z_{1}, \dots, Z_{d})}^{T} \sim N (0, I_{d})

. Combining the fact

J_{n}^{″} (\tilde{β}) \overset{P}{⟶} H_{0}

, (A7) and (A8) gives:

\begin{matrix} Λ_{n} & = & {\sqrt{n} (B^{T} \hat{γ} + b_{0} - \hat{β})}^{T} H_{0} {\sqrt{n} (B^{T} \hat{γ} + b_{0} - \hat{β})} {1 + o_{P} (1)} \\ \overset{D}{⟶} & Z^{T} {Ω_{0}^{*}}^{1 / 2} H_{0}^{- 1 / 2} P_{H_{0}^{- 1 / 2} A^{T}} H_{0}^{- 1 / 2} {Ω_{0}^{*}}^{1 / 2} Z \\ = & \sum_{j = 1}^{d} λ_{j} ({Ω_{0}^{*}}^{1 / 2} H_{0}^{- 1 / 2} P_{H_{0}^{- 1 / 2} A^{T}} H_{0}^{- 1 / 2} {Ω_{0}^{*}}^{1 / 2}) Z_{j}^{2} \\ = & \sum_{j = 1}^{k} λ_{j} {{(A H_{0}^{- 1} A^{T})}^{- 1} (A V_{0} A^{T})} Z_{j}^{2} . \end{matrix}

(A9)

This proves Part (i).

For Part (ii), using

ψ (r) = r

,

w_{1} (x) = w_{2} (x) \equiv 1

and (31), we obtain

Ω_{0}^{*} = Ω_{0} = C H_{0}

, and thus,

A V_{0} A^{T} = C (A H_{0}^{- 1} A^{T})

. Thus, (A9)

= C \sum_{j = 1}^{k} Z_{j}^{2} \sim C χ_{k}^{2}

, which completes the proof. ☐

Proof of Theorem 7.

The proofs are similar to those used in Theorem 4 and Theorems 5 and 6. The lengthy details are omitted. ☐

References

Andrews, D. Asymptotics for semiparametric econometric models via stochastic equicontinuity. Econometrica 1994, 62, 43–72. [Google Scholar] [CrossRef]
Robinson, P.M. Root-n consistent semiparametric regression. Econometrica 1988, 56, 931–954. [Google Scholar] [CrossRef]
Speckman, P. Kernel smoothing in partial linear models. J. R. Statist. Soc. B 1988, 50, 413–436. [Google Scholar]
Yatchew, A. An elementary estimator of the partial linear model. Econ. Lett. 1997, 57, 135–143. [Google Scholar] [CrossRef]
Fan, J.; Li, R. New estimation and model selection procedures for semiparametric modeling in longitudinal data analysis. J. Am. Stat. Assoc. 2004, 99, 710–723. [Google Scholar] [CrossRef]
McCullagh, P.; Nelder, J.A. Generalized Linear Models, 2nd ed.; Chapman & Hall: London, UK, 1989. [Google Scholar]
Zhang, C.M.; Yu, T. Semiparametric detection of significant activation for brain fMRI. Ann. Stat. 2008, 36, 1693–1725. [Google Scholar] [CrossRef]
Fan, J.; Huang, T. Profile likelihood inferences on semiparametric varying-coefficient partially linear models. Bernoulli 2005, 11, 1031–1057. [Google Scholar] [CrossRef]
Boente, G.; He, X.; Zhou, J. Robust estimates in generalized partially linear models. Ann. Stat. 2006, 34, 2856–2878. [Google Scholar] [CrossRef]
Zhang, C.M.; Guo, X.; Cheng, C.; Zhang, Z.J. Robust-BD estimation and inference for varying-dimensional general linear models. Stat. Sin. 2014, 24, 653–673. [Google Scholar] [CrossRef]
Fan, J.; Gijbels, I. Local Polynomial Modeling and Its Applications; Chapman and Hall: London, UK, 1996. [Google Scholar]
Brègman, L.M. A relaxation method of finding a common point of convex sets and its application to the solution of problems in convex programming. USSR Comput. Math. Math. Phys. 1967, 7, 620–631. [Google Scholar] [CrossRef]
Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning; Springer: New York, NY, USA, 2001. [Google Scholar]
Zhang, C.M.; Jiang, Y.; Shang, Z. New aspects of Bregman divergence in regression and classification with parametric and nonparametric estimation. Can. J. Stat. 2009, 37, 119–139. [Google Scholar] [CrossRef]
Huber, P. Robust estimation of a location parameter. Ann. Math. Statist. 1964, 35, 73–101. [Google Scholar] [CrossRef]
Van der Vaart, A.W. Asymptotic Statistics; Cambridge University Press: Cambridge, UK, 1998. [Google Scholar]
Friedman, J.; Hastie, T.; Tibshirani, R. Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 2010, 33, 1–22. [Google Scholar] [CrossRef] [PubMed]
Carroll, R.; Fan, J.; Gijbels, I.; Wand, M. Generalized partially linear single-index models. J. Am. Stat. Assoc. 1997, 92, 477–489. [Google Scholar] [CrossRef]
Mukarjee, H.; Stern, S. Feasible nonparametric estimation of multiargument monotone functions. J. Am. Stat. Assoc. 1994, 89, 77–80. [Google Scholar] [CrossRef]
Albright, S.C.; Winston, W.L.; Zappe, C.J. Data Analysis and Decision Making with Microsoft Excel; Duxbury Press: Pacific Grove, CA, USA, 1999. [Google Scholar]
Fan, J.; Peng, H. Nonconcave penalized likelihood with a diverging number of parameters. Ann. Stat. 2004, 32, 928–961. [Google Scholar]
Dette, H. A consistent test for the functional form of a regression based on a difference of variance estimators. Ann. Stat. 1999, 27, 1012–1050. [Google Scholar] [CrossRef]
Dette, H.; von Lieres und Wilkau, C. Testing additivity by kernel-based methods. Bernoulli 2001, 7, 669–697. [Google Scholar] [CrossRef]
Fan, J.; Zhang, C.M.; Zhang, J. Generalized likelihood ratio statistics and Wilks phenomenon. Ann. Stat. 2001, 29, 153–193. [Google Scholar] [CrossRef]
Hong, Y.M.; Lee, Y.J. A loss function approach to model specification testing and its relative efficiency. Ann. Stat. 2013, 41, 1166–1203. [Google Scholar] [CrossRef]
Zheng, J.X. A consistent test of functional form via nonparametric estimation techniques. J. Econ. 1996, 75, 263–289. [Google Scholar] [CrossRef]
Opsomer, J.D.; Ruppert, D. A root-n consistent backfitting estimator for semiparametric additive modeling. J. Comput. Graph. Stat. 1999, 8, 715–732. [Google Scholar] [CrossRef]
Belloni, A.; Chernozhukov, V.; Hansen, C. Inference on treatment effects after selection amongst high-dimensional controls. Rev. Econ. Stud. 2014, 81, 608–650. [Google Scholar] [CrossRef]
Cattaneo, M.D.; Jansson, M.; Newey, W.K. Alternative asymptotics and the partially linear model with many regressors. Econ. Theory 2016, 1–25. [Google Scholar] [CrossRef]
Cattaneo, M.D.; Jansson, M.; Newey, W.K. Treatment effects with many covariates and heteroskedasticity. arXiv, 2015; arXiv:1507.02493. [Google Scholar]
Mack, Y.P.; Silverman, B.W. Weak and strong uniform consistency of kernel regression estimates. Z. Wahrsch. Verw. Gebiete 1982, 61, 405–415. [Google Scholar] [CrossRef]

Figure 1. Simulated Bernoulli response data without contamination. Boxplots of

({\hat{β}}_{j} - β_{j; o})

,

j = 1, \dots, d

(from left to right). (Left panels): non-robust method; (right panels): robust method.

Figure 2. Simulated Bernoulli response data with contamination. The captions are identical to those in Figure 1.

Figure 3. Simulated Bernoulli response data without contamination. Plots of

η^{o} (t)

and

{\hat{η}}_{\hat{β}} (t)

. (Left panels): non-robust method; (right panels): robust method.

Figure 4. Simulated Bernoulli response data with contamination. Plots of

η^{o} (t)

and

{\hat{η}}_{\hat{β}} (t)

. (Left panels): non-robust method; (right panels): robust method.

Figure 5. Simulated Gaussian response data without contamination. Top panels: boxplots of

({\hat{β}}_{j} - β_{j; o})

,

j = 1, \dots, d

(from left to right). Bottom panels: plots of

η^{o} (t)

and

{\hat{η}}_{\hat{β}} (t)

. (Left panels): non-robust method; (right panels): robust method.

Figure 6. Simulated Gaussian response data with contamination. Top panels: boxplots of

({\hat{β}}_{j} - β_{j; o})

,

j = 1, \dots, d

(from left to right). Bottom panels: plots of

η^{o} (t)

and

{\hat{η}}_{\hat{β}} (t)

. (Left panels): non-robust method; (right panels): robust method.

Figure 7. Simulated Gaussian response data with contamination. Empirical quantiles (on the y-axis) of the Wald-type statistics

W_{n}

versus quantiles (on the x-axis) of the

χ^{2}

distribution. Solid line: the 45 degree reference line. (Left panels): non-robust method; (right panels): robust method.

Figure 8. Observed power curves of tests for the Gaussian response data. The dashed line corresponds to the non-robust Wald-type test

W_{n}

; the solid line corresponds to the robust

W_{n}

; the dotted line indicates the 5% nominal level. (Left panels): non-contaminated case; (right panels): contaminated case.

Figure 9. The dataset in [19]. (Left panels): estimate of

η (T)

via the non-robust quadratic loss; (right panels): estimate of

η (T)

via the robust quadratic loss.

Figure 10. The dataset in [20]. (Left panel): estimate of

η (T)

via the non-robust quadratic loss; (right panel): estimate of

η (T)

via the robust quadratic loss.

Table 1. Parameter estimates and p-values for partially linear model of the dataset in [20].

Variable	Classical-BD Estimation			Robust-BD Estimation
Variable	Estimate (s.e.)		p-Value of $W_{n}$	Estimate (s.e.)		p-Value of $W_{n}$
$Female$	−0.0491	(0.0232)	0.0339	0.0530	(0.0323)	0.1010
$YrHired$	−0.0093	(0.0026)	0.0005	0.0359	(0.0086)	0.0000
$EducLev$	0.0179	(0.0079)	0.0228	−0.0133	(0.0131)	0.3103
$JobGrade$	0.0899	(0.0075)	0.0000	0.1672	(0.0168)	0.0000
$YrsPrior$	0.0033	(0.0023)	0.1528	−0.0050	(0.0061)	0.4104

© 2017 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Robust-BD Estimation and Inference for General Partially Linear Models

Abstract

1. Introduction

2. Robust-BD and Robust-BD Estimators

2.1. Classical-BD

2.2. Robust-BD $ρ_{q} (y, μ)$

2.3. Local-Polynomial Robust-BD Estimator of $η^{o} (\cdot)$

2.4. Robust-BD Estimator of $β_{o}$

3. Two-Step Iterative Algorithm for Robust-BD Estimation

3.1. Step 1

3.2. Step 2

4. Asymptotic Property of the Robust-BD Estimators

4.1. Consistency

4.2. Asymptotic Normality

5. Robust Inference for $β_{o}$ Based on BD

5.1. Wald-Type Test $W_{n}$

5.2. Likelihood Ratio-Type Test $Λ_{n}$

5.3. Comparison between $W_{n}$ and $Λ_{n}$

6. Simulation Study

6.1. Bernoulli Responses

6.2. Gaussian Responses

7. Real Data Analysis

7.1. Example 1

7.2. Example 2

8. Discussion

Acknowledgments

Author Contributions

Conflicts of Interest

Appendix A. Proofs of Main Results

References

Article Metrics

Citations

Article Access Statistics

Robust-BD Estimation and Inference for General Partially Linear Models

Abstract

1. Introduction

2. Robust-BD and Robust-BD Estimators

2.1. Classical-BD

2.2. Robust-BD ρ q ( y , μ )

2.3. Local-Polynomial Robust-BD Estimator of η o ( · )

2.4. Robust-BD Estimator of β o

3. Two-Step Iterative Algorithm for Robust-BD Estimation

3.1. Step 1

3.2. Step 2

4. Asymptotic Property of the Robust-BD Estimators

4.1. Consistency

4.2. Asymptotic Normality

5. Robust Inference for β o Based on BD

5.1. Wald-Type Test W n

5.2. Likelihood Ratio-Type Test Λ n

5.3. Comparison between W n and Λ n

6. Simulation Study

6.1. Bernoulli Responses

6.2. Gaussian Responses

7. Real Data Analysis

7.1. Example 1

7.2. Example 2

8. Discussion

Acknowledgments

Author Contributions

Conflicts of Interest

Appendix A. Proofs of Main Results

References

Article Metrics

Citations

Article Access Statistics

2.2. Robust-BD $ρ_{q} (y, μ)$

2.3. Local-Polynomial Robust-BD Estimator of $η^{o} (\cdot)$

2.4. Robust-BD Estimator of $β_{o}$

5. Robust Inference for $β_{o}$ Based on BD

5.1. Wald-Type Test $W_{n}$

5.2. Likelihood Ratio-Type Test $Λ_{n}$

5.3. Comparison between $W_{n}$ and $Λ_{n}$