Robust-BD Estimation and Inference for General Partially Linear Models

Abstract: The classical quadratic loss for the partially linear model (PLM) and the likelihood function for the generalized PLM are not resistant to outliers. This motivates us to propose a class of "robust-Bregman divergence (BD)" estimators of both the parametric and nonparametric components in the general partially linear model (GPLM), which allows the distribution of the response variable to be partially specified, without being fully known. Using the local-polynomial function estimation method, we propose a computationally efficient procedure for obtaining "robust-BD" estimators and establish the consistency and asymptotic normality of the "robust-BD" estimator of the parametric component β_o. For inference procedures on β_o in the GPLM, we show that the Wald-type test statistic W_n constructed from the "robust-BD" estimators is asymptotically distribution free under the null, whereas the likelihood ratio-type test statistic Λ_n is not. This provides an insight into the distinction from the asymptotic equivalence (Fan and Huang 2005) between W_n and Λ_n in the PLM constructed from profile least-squares estimators using the non-robust quadratic loss. Numerical examples illustrate the computational effectiveness of the proposed "robust-BD" estimators and robust Wald-type test in the presence of outlying observations.


Introduction
Semiparametric models, such as the partially linear model (PLM) and the generalized PLM, play an important role in statistics, biostatistics, economics and engineering studies [1][2][3][4][5]. For the response variable Y and covariates (X, T), where X = (X_1, ..., X_d)^T ∈ R^d and T ∈ T ⊆ R^D, the PLM, which is widely used for continuous responses Y, describes the model structure according to:

Y = X^T β_o + η_o(T) + ε, (1)

where β_o = (β_{1;o}, ..., β_{d;o})^T is a vector of unknown parameters and η_o(•) is an unknown smooth function. The generalized PLM, which is more suited to discrete responses Y and extends the generalized linear model [6], assumes:

F(m(X, T)) = X^T β_o + η_o(T), (2)
Y | (X, T) ∼ exponential family of distributions, (3)

where m(x, t) = E(Y | X = x, T = t) and F is a known link function. Typically, the parametric component β_o is of primary interest, while the nonparametric component η_o(•) serves as a nuisance function. For clarity of illustration, this paper focuses on D = 1.

An important application of the PLM to brain fMRI data was given in [7] for detecting activated brain voxels in response to external stimuli. There, β_o corresponds to the part of hemodynamic response values, which is the object of primary interest to neuroscientists; η_o(•) is the slowly drifting baseline over time. Determining whether a voxel is activated or not can be formulated as testing a linear hypothesis of the form:

H_0 : A β_o = g_0, (4)

where A is a given k × d full row rank matrix and g_0 is a known k × 1 vector.

Estimation of the parametric and nonparametric components of the PLM and generalized PLM has received much attention in the literature. On the other hand, the existing work has some limitations: (i) The generalized PLM assumes that Y | (X, T) follows the distribution in (3), so that the likelihood function is fully available. From the practical viewpoint, results for the generalized PLM are not applicable to situations where the distribution of Y | (X, T) either departs from (3) or is incompletely known. (ii) Some commonly-used error measures, such as the quadratic loss in the PLM for Gaussian-type responses (see, for example, [7,8]) and the (negative) likelihood function used in the generalized PLM, are not resistant to outliers. The work in [9] studied robust inference based on the kernel regression method for the generalized PLM with a canonical link, using either the (negative) likelihood or (negative) quasi-likelihood as the error measure, and illustrated numerical examples with dimension d = 1. However, the quasi-likelihood is not suitable for the exponential loss function (defined in Section 2.1), commonly used in machine learning and data mining. (iii) The work in [8] developed the inference of (4) for the PLM, via the classical quadratic loss as the error measure, and demonstrated that the asymptotic distributions of the likelihood ratio-type statistic and Wald statistic under the null of (4) are both χ²_k. It remains unknown whether this conclusion holds when the tests are constructed from robust estimators.
Without completely specifying the distribution of Y | (X, T), we assume:

Var(Y | X, T) = V(m(X, T)), (5)

with a known functional form of V(•). We refer to a model specified by (2) and (5) as the "general partially linear model" (GPLM). This paper aims to develop robust estimation for the GPLM and robust inference on β_o, allowing the distribution of Y | (X, T) to be partially specified. To introduce robust estimation, we adopt a broader class of robust error measures, called the "robust-Bregman divergence (BD)" developed in [10] for the GLM, in which BD includes the quadratic loss, the (negative) quasi-likelihood, the exponential loss and many other commonly-used error measures as special cases. We propose "robust-BD estimators" for both the parametric and nonparametric components of the GPLM. Distinct from the explicit-form estimators for the PLM using the classical quadratic loss (see [8]), the "robust-BD estimators" for the GPLM do not have closed-form expressions, which makes the theoretical derivation challenging. Moreover, the robust-BD estimators, as numerical solutions to non-linear optimization problems, pose key implementation challenges. Our major contributions are given below.
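As an illustration of the GPLM specification in (2) and (5), the following minimal sketch (our own, not from the paper; all names are illustrative) simulates Bernoulli responses under a logit link, a case where only the mean relation and the variance function V(m) = m(1 − m) are used, without any further distributional assumptions entering the estimation stage:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_gplm(n, beta, eta, link_inv):
    """Simulate (Y, X, T) from a GPLM: F(m(X, T)) = X'beta + eta(T),
    with Bernoulli responses, so that Var(Y | X, T) = V(m) = m(1 - m)."""
    X = rng.normal(size=(n, len(beta)))
    T = rng.uniform(size=n)
    m = link_inv(X @ beta + eta(T))        # conditional mean m(X, T)
    Y = rng.binomial(1, m)                 # response with known variance function
    return Y, X, T

beta_o = np.array([1.0, -0.5])
eta_o = lambda t: np.sin(2 * np.pi * t)    # smooth nonparametric component
expit = lambda z: 1.0 / (1.0 + np.exp(-z)) # inverse of the logit link F

Y, X, T = simulate_gplm(500, beta_o, eta_o, expit)
```

The same generator covers Gaussian-type responses by swapping the Bernoulli draw and variance function.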

• The robust fitting of the nonparametric component η_o(•) is formulated using the local-polynomial regression technique [11]. See Section 2.3.

• We develop a coordinate descent algorithm for the robust-BD estimator of β_o, which is computationally efficient, particularly when the dimension d is large. See Section 3.

• Theorems 1 and 2 demonstrate that, under the GPLM, the consistency and asymptotic normality of the proposed robust-BD estimator of β_o are achieved. See Section 4.

• For robust inference on β_o, we propose a robust version of the Wald-type test statistic W_n, based on the robust-BD estimators, and justify its validity in Theorems 3-5. It is shown to be asymptotically χ² (central) under the null, thus distribution free, and χ² (noncentral) under contiguous alternatives. Hence, this result is practically feasible when applied to the exponential loss, as well as to other loss functions in the wider class of BD. See Section 5.1.

• For robust inference on β_o, we re-examine the likelihood ratio-type test statistic Λ_n, constructed by replacing the negative log-likelihood with the robust-BD. Our Theorem 6 reveals that the asymptotic null distribution of Λ_n is generally not χ², but a linear combination of independent χ² variables, with weights relying on unknown quantities. Even in the particular case of using the classical-BD, the limit distribution is not invariant under re-scaling of the generating function of the BD. Moreover, the limit null distribution of Λ_n (in either the non-robust or robust version) using the exponential loss, which does not belong to the (negative) quasi-likelihood but falls in BD, is always a weighted χ², thus limiting its use in practical applications. See Section 5.2.
Simulation studies in Section 6 demonstrate that the proposed class of robust-BD estimators and the robust Wald-type test either compare well with or perform better than the classical non-robust counterparts: the former are less sensitive to outliers than the latter, and both perform comparably well in non-contaminated cases. Section 7 illustrates some real data applications. Section 8 ends the paper with brief discussions. Details of technical derivations are relegated to Appendix A.

Robust-BD and Robust-BD Estimators
This section starts with a brief review of BD in Section 2.1 and "robust-BD" in Section 2.2, followed by the proposed "robust-BD" estimators of η o (•) and β o in Sections 2.3 and 2.4.

Classical-BD
To broaden the scope of robust estimation and inference, we consider a class of error measures motivated by the Bregman divergence (BD). For a given concave q-function, [12] defined the bivariate function:

Q_q(y, μ) = −q(y) + q(μ) + (y − μ) q′(μ).

We call Q_q the BD and call q the generating q-function of the BD. For example, a function q(μ) = aμ − μ² for some constant a yields the quadratic loss Q_q(y, μ) = (y − μ)². For a binary response variable Y, q(μ) = min{μ, 1 − μ} gives the misclassification loss [13]. Moreover, [14] showed that if:

q(μ) = ∫_a^μ (s − μ)/V(s) ds,

with a finite constant a such that the integral is well defined, then Q_q(y, μ) matches the "classical (negative) quasi-likelihood" function.
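The BD construction can be checked numerically. The short sketch below (our own illustration) encodes Q_q(y, μ) = −q(y) + q(μ) + (y − μ)q′(μ) and verifies that q(μ) = aμ − μ² generates the quadratic loss for any choice of the constant a:

```python
import numpy as np

def bregman(y, mu, q, dq):
    """Bregman divergence Q_q(y, mu) = -q(y) + q(mu) + (y - mu) q'(mu),
    generated by a concave q-function with derivative dq."""
    return -q(y) + q(mu) + (y - mu) * dq(mu)

# q(mu) = a*mu - mu^2 generates the quadratic loss (y - mu)^2, for any a.
a = 3.0
q  = lambda mu: a * mu - mu**2
dq = lambda mu: a - 2.0 * mu

print(bregman(0.7, 0.2, q, dq))   # equals (0.7 - 0.2)^2 = 0.25
```

Other generating q-functions (e.g., the quasi-likelihood q above, or the q-function of the exponential loss) plug into the same template.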

Local Robust-BD Estimation of η_o(•)
Suppose we have n independent observations of (Y, X, T) captured by the GPLM in (2) and (5), where the dimension d ≥ 1 is a finite integer. From (2), it is directly observed that if the true value of β_o were known, then estimating η_o(•) would reduce to estimating a nonparametric function; conversely, if the actual form of η_o(•) were available, then estimating β_o would amount to estimating a vector parameter.
To motivate the estimation of η_o(•) at a fitting point t, a proper way to characterize η_o(t) is desired. For any given value of β, define:

S(a; t, β) = E[ρ_q{Y, F^{-1}(X^T β + a)} w_1(X) | T = t], (13)

where a is a scalar, ρ_q(y, μ) is the "robust-BD" defined in (8), which aims to guard against outlying observations in the response space of Y, and w_1(•) ≥ 0 is a given bounded weight function that downweights high leverage points in the covariate space of X. See Sections 6 and 7 for an example of w_1(x). Set:

η_β(t) = arg min_{a ∈ R} S(a; t, β). (14)

Theoretically, η_o(t) = η_{β_o}(t) will be assumed (in Condition A3) for obtaining asymptotically unbiased estimators of η_o(•). Such a property indeed holds, for example, when the classical quadratic loss combined with an identity link is used in (14). Thus, we call η_β(•) the "surrogate function" for η_o(•).

To estimate η_β(t), we apply local-polynomial regression. Denote by a_o(t) = (η_o(t), (η_o)^(1)(t), ..., (η_o)^(p)(t)/p!)^T ∈ R^{p+1} the vector consisting of η_o(t) along with its (re-scaled) derivatives. For observed covariates T_i close to the point t, the Taylor expansion implies that:

η_o(T_i) ≈ a_o(t)^T t_i(t), (15)

where t_i(t) = (1, T_i − t, ..., (T_i − t)^p)^T. For any given value of β, let â(t; β) = (â_0(t; β), â_1(t; β), ..., â_p(t; β))^T be the minimizer of the criterion function:

S_n(a; t, β) = n^{-1} ∑_{i=1}^n ρ_q{Y_i, F^{-1}(X_i^T β + a^T t_i(t))} w_1(X_i) K_h(T_i − t), (16)

with respect to a ∈ R^{p+1}, where K_h(•) = K(•/h)/h for a kernel function K and bandwidth h. The resulting estimator of η_β(t) is:

η̂_β(t) = e_{1,p+1}^T â(t; β), (17)

where e_{j,p+1} denotes the j-th column of the (p + 1) × (p + 1) identity matrix.
It is noted that, owing to the reliance of η̂_β(t) on β, its consistency to η_o(t) is not automatic. Nonetheless, it is anticipated from the uniform consistency of η̂_β in Lemma 1 that η̂_β(t) will offer a valid estimator of η_o(t), provided that β̂ consistently estimates β_o. Section 2.4 will discuss our proposed robust-BD estimator β̂. Furthermore, Lemma 1 will assume (in Condition A1) that η_β(t) is the unique minimizer of S(a; t, β) with respect to a.
Remark 1. The case of "kernel estimation", or locally-constant estimation, corresponds to the choice of degree p = 0 in (15). In that case, the criterion function in (16) and the estimator in (17) reduce to:

S_n(a; t, β) = n^{-1} ∑_{i=1}^n ρ_q{Y_i, F^{-1}(X_i^T β + a)} w_1(X_i) K_h(T_i − t), (18)

η̂_β(t) = arg min_{a ∈ R} S_n(a; t, β), (19)

respectively.
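A minimal sketch of the kernel (p = 0) fit in Remark 1 is given below; for simplicity it assumes an identity link and substitutes a plain Huber loss on residuals for the robust-BD ρ_q, so it illustrates the structure of the locally-constant criterion rather than the paper's exact estimator:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def huber(r, c=1.345):
    a = np.abs(r)
    return np.where(a <= c, 0.5 * r**2, c * a - 0.5 * c**2)

def epanechnikov(u):
    return 0.75 * np.maximum(1.0 - u**2, 0.0)

def eta_hat_kernel(t, beta, Y, X, T, h):
    """Kernel (p = 0) robust fit of eta(t) for a given beta: minimize
    S_n(a) = n^{-1} sum_i rho(Y_i - X_i'beta - a) K_h(T_i - t) over scalar a
    (leverage weights w1 are taken as 1 here for brevity)."""
    K = epanechnikov((T - t) / h) / h
    obj = lambda a: np.mean(huber(Y - X @ beta - a) * K)
    res = minimize_scalar(obj, bounds=(Y.min() - 1, Y.max() + 1), method="bounded")
    return res.x

rng = np.random.default_rng(1)
n = 400
X = rng.normal(size=(n, 2))
T = rng.uniform(size=n)
beta_o = np.array([1.0, -1.0])
Y = X @ beta_o + np.sin(2 * np.pi * T) + rng.normal(scale=0.2, size=n)

est = eta_hat_kernel(0.25, beta_o, Y, X, T, h=0.1)
print(est)   # near sin(pi/2) = 1
```

The one-dimensional bounded minimization mirrors the scalar optimization in (19); the local-linear case (p = 1) replaces the scalar a with a two-vector.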

Robust-BD Estimator of β o
For any given value of β, define:

J(β) = E[ρ_q{Y, F^{-1}(X^T β + η_β(T))} w_2(X)], (20)

where η_β(•) is as defined in (14) and w_2(•) plays the same role as w_1(•) in (13). Theoretically, it is anticipated that:

β_o = arg min_β J(β), (21)

which holds, for example, in the case where the classical quadratic loss combined with an identity link is used. To estimate β_o, it is natural to replace (20) by its sample-based criterion:

J_n(β) = n^{-1} ∑_{i=1}^n ρ_q{Y_i, F^{-1}(X_i^T β + η̂_β(T_i))} w_2(X_i), (22)

where η̂_β(•) is as defined in (17). Hence, a parametric estimator of β_o is provided by:

β̂ = arg min_β J_n(β). (23)

Finally, the estimator of η_o(•) is given by η̂(•) = η̂_{β̂}(•). To achieve asymptotic normality of β̂, Theorem 2 assumes (in Condition A2) that β_o is the unique minimizer in (21), a standard condition for consistent M-estimators [16].
As a comparison, it is seen that w_1(•) in (16) is used to robustify the covariates X_i in estimating η_o(•), w_2(•) in (22) is used to robustify the covariates X_i in estimating β_o, and ρ_q(•, •) serves to robustify the responses Y_i in both estimation procedures.

Two-Step Iterative Algorithm for Robust-BD Estimation
In the special case of using the classical quadratic loss combined with an identity link function, the robust-BD estimators of the parametric and nonparametric components have explicit expressions:

β̂ = (X̃^T W_2 X̃)^{-1} X̃^T W_2 ỹ, (24)

where W_2 = diag{w_2(X_1), ..., w_2(X_n)}, ỹ = (I − S_h)y, X̃ = (I − S_h)X, with I being the identity matrix, y = (Y_1, ..., Y_n)^T the response vector, X = (X_1, ..., X_n)^T the design matrix, and S_h the weighted local-polynomial smoother matrix built from w_1 and K_h. When w_1(x) = w_2(x) ≡ 1, (24) reduces to the "profile least-squares estimator" of [8].
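In this quadratic-loss special case, the closed form can be sketched directly. The following is our own minimal illustration on simulated data, using a local-constant (Nadaraya-Watson) smoother matrix S_h and unit weights:

```python
import numpy as np

def nw_smoother(T, h):
    """Rows of the Nadaraya-Watson (local-constant) smoother matrix S_h."""
    U = (T[:, None] - T[None, :]) / h
    K = 0.75 * np.maximum(1.0 - U**2, 0.0)     # Epanechnikov kernel
    return K / K.sum(axis=1, keepdims=True)

def profile_wls(Y, X, T, h, w2=None):
    """beta_hat = (Xt' W2 Xt)^{-1} Xt' W2 yt, with yt = (I - S_h) Y and
    Xt = (I - S_h) X; w2 = 1 gives the profile least-squares estimator."""
    n = len(Y)
    S = nw_smoother(T, h)
    yt, Xt = Y - S @ Y, X - S @ X
    W2 = np.eye(n) if w2 is None else np.diag(w2)
    return np.linalg.solve(Xt.T @ W2 @ Xt, Xt.T @ W2 @ yt)

rng = np.random.default_rng(2)
n = 400
X = rng.normal(size=(n, 2))
T = rng.uniform(size=n)
beta_o = np.array([2.0, -1.0])
Y = X @ beta_o + np.cos(2 * np.pi * T) + rng.normal(scale=0.3, size=n)

beta_hat = profile_wls(Y, X, T, h=0.1)
print(beta_hat)   # close to (2, -1)
```

Partialling the smoother out of both Y and X is exactly the "profiling" step; the weighted version only inserts W_2.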
In other cases, the robust-BD estimators from (17) and (23) do not have closed-form expressions and need to be solved numerically, which is computationally challenging and intensive. We now describe a two-step proposal for iteratively computing the robust-BD estimates of β_o and η_o(•).

Step 1: Instead of solving (23) directly, given the current nonparametric estimate η̂^[k−1](•), we solve the surrogate optimization problem:

β̂^[k] = arg min_β n^{-1} ∑_{i=1}^n ρ_q{Y_i, F^{-1}(X_i^T β + η̂^[k−1](T_i))} w_2(X_i).

This minimizer approximates β̂.
Step 2: Obtain η̂^[k](•) = η̂_{β̂^[k]}(•) from (17). The algorithm terminates when the change ||β̂^[k] − β̂^[k−1]|| falls below some pre-specified threshold value and all {η̂^[k](T_i)}_{i=1}^n stabilize.

Step 1
For the above two-step algorithm, we first elaborate on the procedure for acquiring β̂^[k] in Step 1, by extending the coordinate descent (CD) iterative algorithm [17], designed for penalized estimation, to our current robust-BD estimation; this is computationally efficient. For any given value of η, a Taylor expansion around some initial estimate β* (for example, β̂^[k−1]) yields a weighted quadratic approximation of the criterion of the form:

ℓ_2(β) = (2n)^{-1} ∑_{i=1}^n ω_i (Z^I_i − X_i^T β)² + C_i,

where C_i is a constant not depending on β, and the working weights ω_i and pseudo-responses Z^I_i are built from p_j(y; θ) defined in (10). Thus, it suffices to minimize ℓ_2 with respect to β, using a coordinate descent (CD) updating procedure. Suppose that the current estimate is β̂_old, with the current residual vector r_old = (r_old,1, ..., r_old,n)^T = z^I − X β̂_old, where z^I = (Z^I_1, ..., Z^I_n)^T is the vector of pseudo-responses. Adopting the Newton-Raphson step, the estimate of the j-th coordinate based on the previous estimate β̂_old,j is updated to:

β̂_new,j = {∑_{i=1}^n ω_i X_{i,j} (r_old,i + X_{i,j} β̂_old,j)} / {∑_{i=1}^n ω_i X²_{i,j}}.
As a result, the residuals are updated to r_new,i = r_old,i − X_{i,j}(β̂_new,j − β̂_old,j). Cycling through j = 1, ..., d, we obtain the estimate β̂_new. Iterate the process of weighted quadratic approximation followed by CD updating until the estimate β̂_new stabilizes to the solution β̂^[k]. The validity of β̂^[k] in Step 1 converging to the true parameter β_o is justified as follows. (i) Standard results for M-estimation [16] indicate the convergence in probability of the minimizer of the surrogate criterion to its population counterpart, where P−→ stands for convergence in probability. (ii) Using derivations similar to those of (A4), minimizing the surrogate criterion is asymptotically equivalent to minimizing J_n(β, η̂_β). Assembling these results with the definition of β̂^[k] yields β̂^[k] P−→ β_o.
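The CD sweep described above can be sketched on a generic weighted quadratic criterion; in the sketch below (our own), the working weights and pseudo-responses are random placeholders for those built from p_2(y; θ) and w_2(X_i), and the result is compared against the direct weighted least-squares solution:

```python
import numpy as np

def cd_wls(Z, Xm, w, beta0, sweeps=60):
    """Coordinate-descent minimization of sum_i w_i (Z_i - x_i'beta)^2,
    the weighted quadratic approximation used in Step 1."""
    beta = beta0.copy()
    r = Z - Xm @ beta                      # current residual vector
    for _ in range(sweeps):
        for j in range(Xm.shape[1]):
            xj = Xm[:, j]
            new_j = np.sum(w * xj * (r + xj * beta[j])) / np.sum(w * xj**2)
            r -= xj * (new_j - beta[j])    # update residuals after each coordinate move
            beta[j] = new_j
    return beta

rng = np.random.default_rng(3)
n, d = 200, 4
Xm = rng.normal(size=(n, d))
w = rng.uniform(0.5, 1.5, size=n)          # placeholder working weights
Z = rng.normal(size=n)                     # placeholder pseudo-responses

beta_cd = cd_wls(Z, Xm, w, np.zeros(d))
beta_wls = np.linalg.solve(Xm.T @ (w[:, None] * Xm), Xm.T @ (w * Z))
print(np.max(np.abs(beta_cd - beta_wls)))  # essentially zero
```

Each coordinate update costs O(n), so a full sweep is O(nd); this is what makes the scheme attractive for large d.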

Step 2
In Step 2, obtaining η̂_β(t) for any given values of β and t is equivalent to minimizing S_n(a; t, β) in (16). Notice that the dimension (p + 1) of a is typically low, with degrees p = 0 or p = 1 being the most commonly used in practice. Hence, the minimizer of S_n(a; t, β) can be obtained by directly applying the Newton-Raphson iteration:

a^[k+1](t; β) = a^[k](t; β) − {∂²S_n(a; t, β)/∂a∂a^T}^{-1} ∂S_n(a; t, β)/∂a |_{a = a^[k](t; β)}, for k = 0, 1, ...,

where a^[k](t; β) denotes the estimate at the k-th iteration. The iterations terminate when the estimate η̂^[k+1](t) = e_{1,p+1}^T a^[k+1](t; β) stabilizes. Our numerical studies of the robust-BD estimation indicate that: (i) computationally, the kernel method can be both faster and more stable than the local-linear method; (ii) for estimating the nonparametric component η_o(•), the local-linear method outperforms the kernel method, especially at the edges of the points {T_i}_{i=1}^n; (iii) for the robust estimation of β_o, which is of major interest, there is a relatively negligible difference between using the kernel and local-linear methods for the nonparametric component.

Asymptotic Property of the Robust-BD Estimators
This section investigates the asymptotic behavior of the robust-BD estimators β̂ and η̂_β̂, under regularity conditions. The consistency of β̂ to β_o and the uniform consistency of η̂_β̂ to η_o are given in Theorem 1; the asymptotic normality of β̂ is obtained in Theorem 2. For ease of exposition, the asymptotic results will be derived using local-linear estimation with degree p = 1. Analogous results can be obtained for local-polynomial methods, with lengthier technical details, and are omitted.
We assume that T ∈ T, and let T_0 ⊆ T be a compact set. For any continuous function v : T_0 → R, let ||v||_{T_0} = sup_{t ∈ T_0} |v(t)| denote the supremum norm on T_0.

Consistency
We first present Lemma 1, which states the uniform consistency of η̂_β(•) to the surrogate function η_β(•). Theorem 1 gives the consistency of β̂ and η̂_β̂.
(i) If there exists a compact set K_1 such that lim_{n→∞} P(β̂ ∈ K_1) = 1 and Condition A2 holds, then β̂ P−→ β_o.

Asymptotic Normality
The asymptotic normality of β is provided in Theorem 2.
Theorem 2 (For the parametric part β_o). Assume Condition A and Condition B in the Appendix. If n → ∞, nh⁴ → 0 and log(1/h)/(nh²) → 0, then:

√n (β̂ − β_o) D−→ N(0, H_0^{-1} Ω*_0 H_0^{-1}),

where H_0 and Ω*_0 are matrices determined by the robust-BD, the link and the weight functions, and Ω*_0 contains a correction term γ(t) induced by the nonparametric estimation. From Condition A1, (13) and (14), we can show that if w_1(•) ≡ C w_2(•) for some constant C ∈ (0, ∞), then γ(t) = 0. In that case, Ω*_0 = Ω_0. Consider the conventional PLM in (1), estimated using the classical quadratic loss, identity link, ψ(r) = r and w_1(x) = w_2(x) ≡ 1; then the result of Theorem 2 agrees with that in [18].

Remark 2. Theorem 2 implies the root-n convergence rate of β̂. This differs from η̂_β̂(t), which converges at a rate incorporating both the sample size n and the bandwidth h, as seen in the proofs of Lemma 1 and Theorem 2.

Robust Inference for β o Based on BD
In many statistical applications, one wishes to check whether or not a subset of the explanatory variables is statistically significant. Specific examples include:

H_0 : β_{j;o} = 0, for j = j_0; H_0 : β_{j;o} = 0, for j = j_1, ..., j_2.
These forms of linear hypotheses about β_o can be formulated more generally as (4).

Wald-Type Test W n
We propose a robust version of the Wald-type test statistic, based on the robust-BD estimator β̂ proposed in Section 2.4:

W_n = n (A β̂ − g_0)^T (A Ĥ_0^{-1} Ω̂*_0 Ĥ_0^{-1} A^T)^{-1} (A β̂ − g_0),

where Ω̂*_0 and Ĥ_0 are estimates of Ω*_0 and H_0. For example, sample-average plug-in estimates, evaluated at β̂ and η̂_β̂, fulfill the consistency requirement (28). Again, we can verify that if w_1(•) ≡ C w_2(•) for some constant C ∈ (0, ∞) and η̂_β(t) is obtained from the kernel estimation method, then γ̂(t) = 0, and hence, Ω̂*_0 = Ω̂_0. Theorem 3 justifies that under the null, W_n is for large n distributed as χ²_k, thus asymptotically distribution-free.
Theorem 3 (Wald-type test based on robust-BD under H_0). Assume the conditions in Theorem 2 and (28). Then, under H_0 in (4), we have that W_n D−→ χ²_k.

Theorem 4 indicates that W_n has non-trivial local power for detecting contiguous alternatives approaching the null at the rate n^{-1/2}:

H_{1n} : A β_o = g_0 + n^{-1/2} c,

where c = (c_1, ..., c_k)^T ≠ 0.
Theorem 4 (Wald-type test based on robust-BD under H_1n). Assume the conditions in Theorem 2 and (28). Then, under H_1n, W_n converges in distribution to a noncentral χ²_k, with noncentrality parameter determined by c.

To appreciate the discriminating power of W_n in assessing significance, the asymptotic power under fixed alternatives is also analyzed. Theorem 5 shows that under the fixed alternative H_1, W_n P−→ +∞ at the rate n. Thus, W_n has power approaching one against fixed alternatives.
For the conventional PLM in (1) estimated using the non-robust quadratic loss, [8] showed the asymptotic equivalence of the Wald-type and likelihood ratio-type tests. Our results in Section 5.2 reveal that this equivalence breaks down when the estimators are obtained using robust loss functions.
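As a numerical sketch of the Wald-type construction above, the snippet below (our own; the plug-in values are hypothetical, not from the paper) assembles W_n from an estimate, a sandwich covariance estimate of the asymptotic variance of √n(β̂ − β_o), and the constraint matrix, then compares it with the χ²_k critical value:

```python
import numpy as np
from scipy.stats import chi2

def wald_stat(beta_hat, Sigma_hat, A, g0, n):
    """W_n = n (A b - g0)' (A Sigma A')^{-1} (A b - g0), where Sigma_hat
    estimates the asymptotic covariance of sqrt(n) (b - beta_o)."""
    diff = A @ beta_hat - g0
    return float(n * diff @ np.linalg.solve(A @ Sigma_hat @ A.T, diff))

# toy ingredients (hypothetical numbers)
n = 500
beta_hat = np.array([0.02, 1.01, -0.49])
Sigma_hat = np.diag([1.0, 1.2, 0.8])       # stand-in for H^{-1} Omega* H^{-1}
A = np.array([[1.0, 0.0, 0.0]])            # test H0: beta_1 = 0 (k = 1)
g0 = np.zeros(1)

Wn = wald_stat(beta_hat, Sigma_hat, A, g0, n)
crit = chi2.ppf(0.95, df=A.shape[0])       # chi^2_k critical value under H0
print(Wn, crit)
```

Here W_n = 500 × 0.02²/1.0 = 0.2, well below the 5% critical value, so this toy null would not be rejected.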

Likelihood Ratio-Type Test Λ n
This section explores the extent to which the likelihood ratio-type test extends to the "robust-BD" for testing the null hypothesis in (4) under the GPLM. The robust-BD test statistic Λ_n contrasts the robust-BD criterion J_n evaluated at the restricted estimator, computed under the constraint in (4), with J_n evaluated at the unrestricted robust-BD estimator β̂ for β_o developed in Section 2.4. Theorem 6 indicates that the limit distribution of Λ_n under H_0 is a linear combination of independent chi-squared variables, with weights relying on some unknown quantities, thus not distribution free.
Theorem 6 (Likelihood ratio-type test based on robust-BD under H_0). Assume the conditions in Theorem 2.
(ii) Moreover, if ψ(r) = r, w_1(x) = w_2(x) ≡ 1, and the generating q-function of the BD satisfies:

q′′(μ) = −C/V(μ), for a constant C > 0, (31)

then under H_0 in (4), the suitably rescaled Λ_n is asymptotically χ²_k.

Theorem 7 states that Λ_n has non-trivial local power for identifying contiguous alternatives approaching the null at the rate n^{-1/2}, and that Λ_n P−→ +∞ at the rate n under H_1, thus having power approaching one against fixed alternatives.
Theorem 7 (Likelihood ratio-type test based on robust-BD under H_1n and H_1). Assume the conditions in Theorem 2.
∼ N(0, 1), and S is a matrix satisfying S T S = (AH −1 0 A T ) −1 and S(AV for a constant c > 0.

Comparison between W n and Λ n
In summary, the test W_n has some advantages over the test Λ_n. First, the asymptotic null distribution of W_n is distribution-free, whereas that of Λ_n in general depends on unknown quantities. Second, W_n is invariant under re-scaling of the generating q-function of the BD, but Λ_n is not. Third, the computational expense of W_n is much lower than that of Λ_n, partly because the integration operations for ρ_q are involved in Λ_n but not in W_n, and partly because Λ_n requires both unrestricted and restricted parameter estimates, while W_n remains usable in cases where restricted estimates are difficult to compute. Thus, the numerical studies of Section 6 focus on W_n.
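The non-pivotality of Λ_n can be illustrated by Monte Carlo: the upper quantile of a weighted sum of independent χ²₁ variables (the generic limiting form in Theorem 6; the weights below are illustrative, not derived from any model) differs markedly from the corresponding χ²_k quantile:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(4)

def weighted_chi2_sample(weights, size):
    """Draws of sum_j lambda_j * Z_j^2 with independent Z_j ~ N(0, 1),
    the generic limiting null law of Lambda_n."""
    Z = rng.normal(size=(size, len(weights)))
    return (Z**2) @ np.asarray(weights)

k = 2
draws = weighted_chi2_sample([2.0, 0.5], 200_000)   # illustrative weights
q_weighted = np.quantile(draws, 0.95)
q_chi2 = chi2.ppf(0.95, df=k)                       # chi^2_2 quantile: about 5.99
print(q_weighted, q_chi2)   # the two 95% quantiles differ markedly
```

Only when all weights are equal does the weighted sum collapse to a (scaled) χ²_k, which is why calibrating Λ_n requires estimating the unknown weights.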

Simulation Study
We conduct simulation evaluations of the performance of the robust-BD estimation methods for general partially linear models. We use the Huber ψ-function with c = 1.345. The weight functions are chosen to be w_1(x) = w_2(x) = 1/{1 + ∑_{j=1}^d ((x_j − m_j)/s_j)²}^{1/2}, where x = (x_1, ..., x_d)^T, and m_j and s_j denote the sample median and the sample median absolute deviation of {X_{i,j} : i = 1, ..., n}, respectively, j = 1, ..., d. As a comparison, the classical non-robust estimation counterparts correspond to using ψ(r) = r and w_1(x) = w_2(x) ≡ 1. Throughout the numerical work, the Epanechnikov kernel function K(t) = 0.75 max(1 − t², 0) is used. All these choices (among many others) are made for feasibility; issues of the trade-off between robustness and efficiency are not pursued further in this paper.
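The above choices are easy to encode; a short sketch (our own, mirroring the stated formulas) of the Huber ψ-function, the leverage-based weight function and the Epanechnikov kernel:

```python
import numpy as np

def huber_psi(r, c=1.345):
    """Huber psi-function: identity on [-c, c], clipped outside."""
    return np.clip(r, -c, c)

def leverage_weight(X):
    """w(x) = 1 / sqrt(1 + sum_j ((x_j - m_j)/s_j)^2), with m_j, s_j the
    columnwise median and median absolute deviation; downweights leverage points."""
    m = np.median(X, axis=0)
    s = np.median(np.abs(X - m), axis=0)
    return 1.0 / np.sqrt(1.0 + np.sum(((X - m) / s)**2, axis=1))

def epanechnikov(t):
    return 0.75 * np.maximum(1.0 - t**2, 0.0)

r = np.array([-3.0, 0.5, 3.0])
print(huber_psi(r))                              # [-1.345, 0.5, 1.345]
print(epanechnikov(np.array([0.0, 1.0, 2.0])))   # [0.75, 0.0, 0.0]

rng = np.random.default_rng(5)
w = leverage_weight(rng.normal(size=(100, 2)))   # weights lie in (0, 1]
```

By construction, w(x) ≤ 1 always, and w(x) decreases as x moves away from the coordinatewise medians.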
The following setup is used in the simulation studies. The sample size is n = 200, and the number of replications is 500. (Incorporating a nonparametric component in the GPLM calls for a larger n as the number of covariates increases, for better numerical performance.) Local-linear robust-BD estimation is illustrated with the bandwidth parameter h set to 20% of the interval length of the variable T. Results using other data-driven choices of h are similar and are omitted.
For each generated dataset from the true model, we create a contaminated dataset, in which 10 data points (X_{i,j}, Y_i) are contaminated: they are replaced by (X*_{i,j}, Y*_i) constructed using Uniform(0, 1) draws. Figures 1 and 2 compare the boxplots of (β̂_j − β_{j;o}), j = 1, ..., d, based on the non-robust and robust-BD estimates, where the deviance loss and exponential loss are used as the BD in the top and bottom panels, respectively. As seen from Figure 1, in the absence of contamination, both the non-robust and robust methods perform comparably well. Besides, the bias of the non-robust method using the exponential loss (with p_2(y; θ) unbounded) is larger than that using the deviance loss (with p_2(y; θ) bounded). In the presence of contamination, Figure 2 reveals that the robust method is more effective in decreasing the estimation bias without excessively increasing the estimation variance. For each replication, we calculate the MSE of the nonparametric estimate. Figures 3 and 4 compare the plots of η̂_β̂(t) from typical samples, using the non-robust and robust-BD estimates, where the deviance loss and exponential loss are used as the BD in the top and bottom panels, respectively. There, the typical sample in each panel is selected such that its MSE value corresponds to the 50th percentile among the MSE-ranked values from the 500 replications. These fitted curves reveal little difference between the robust and non-robust methods in the absence of contamination. For contaminated cases, the robust estimates perform slightly better than the non-robust estimates. Moreover, the boundary bias arising in curve estimates at the edges under the local-constant method can be ameliorated by using the local-linear method.
For each dataset simulated from the true model, a contaminated dataset is created, in which 10 data points (X_{i,j}, Y_i) are subject to contamination: they are replaced by (X*_{i,j}, Y*_i) constructed using Uniform(0, 1) draws. Figures 5 and 6 compare the boxplots of (β̂_j − β_{j;o}), j = 1, ..., d, in the top panels, and the plots of η̂_β̂(t) from typical samples in the bottom panels, using the non-robust and robust-BD estimates. The typical samples are selected similarly to those in Section 6.1. The simulation results in Figure 5 indicate that the robust method performs as well as the non-robust method for estimating both the parameter vector and the nonparametric curve in non-contaminated cases. Figure 6 reveals that the robust estimates are less sensitive to outliers than the non-robust counterparts. Indeed, the non-robust method yields a considerable bias in the parametric estimation, and its nonparametric estimation is worse than that of the robust method. Figure 7 gives the QQ plots of the (first to 95th) percentiles of the Wald-type statistic W_n versus those of the χ²_2 distribution for testing the null hypothesis (32). The plots depict that in both clean and contaminated cases, the robust W_n (in the right panels) closely follows the χ²_2 distribution, lending support to Theorem 3. On the other hand, the non-robust W_n agrees well with the χ²_2 distribution for clean data, but the presence of a small number of outlying data points severely distorts the sampling distribution of the non-robust W_n (in the bottom left panel) away from the χ²_2 distribution, yielding inaccurate levels of the test. To assess the stability of the power of the Wald-type test for the hypothesis (32), we evaluate the power under a sequence of alternatives with parameters β_o + ∆c for each given ∆, where c = β_o + (1, ..., 1)^T. Figure 8 plots the empirical rejection rates of the null model in the non-contaminated case and the contaminated case. The price to pay for the robust W_n is a small loss of power in the non-contaminated case. However, under contamination, a very different behavior is observed. The observed power curve of the robust W_n is close to that attained in the non-contaminated case. On the contrary, the non-robust W_n is less informative, since its power curve is much lower than that of the robust W_n against the alternative hypotheses with ∆ ≠ 0, but higher than the nominal level at the null hypothesis with ∆ = 0.

Real Data Analysis
Two real datasets are analyzed. In both cases, the quadratic loss is set as the BD, and the nonparametric function is fitted via the local-linear regression method, where the bandwidth parameter is chosen to be 25% of the interval length of the variable T. The choices of the Huber ψ-function and weight functions are identical to those in Section 6.

Example 1
The dataset studied in [19] consists of 2447 observations on three variables, log(wage), age and education, for women. It is of interest to learn how wages change with years of age and years of education. One anticipates an increasing regression function of Y = log(wage) in T = age, as well as in X_1 = education. We fit a partially linear model Y = η(T) + β_1 X_1 + ε. Profiles of the fitted nonparametric functions η̂(•) in Figure 9 indeed exhibit an overall upward trend in age. The coefficient estimate is β̂_1 = 0.0809 with standard error 0.0042 using the non-robust method, and β̂_1 = 0.1334 with standard error 0.0046 by means of the robust method. It is seen that the robust estimates are qualitatively similar to the non-robust counterparts. Our evaluation, based on both the non-robust and robust methods, supports the result predicted in the theoretical and empirical socio-economic literature.

Example 2
We analyze an employee dataset (Example 11.3 of [20]) of the Fifth National Bank of Springfield, based on data from 1995. The bank, whose name has been changed, was charged in court with paying its female employees substantially smaller salaries than its male employees. For each of its 208 employees, the dataset consists of seven variables: EducLev (education level), JobGrade (job grade), YrHired (year the employee was hired), YrBorn (year the employee was born), Female (indicator of being female), YrsPrior (years of work experience at another bank before working at the Fifth National Bank), and Salary (current annual salary in thousands of dollars).
To explain the variation in salary, we fit a partially linear model with a nonparametric component in T = Age and linear covariates including X_4 = JobGrade and X_5 = YrsPrior, where Age = 95 − YrBorn. Table 1 presents the parameter estimates and their standard errors (given within brackets), along with p-values calculated from the Wald-type test W_n. Figure 10 depicts the estimated nonparametric functions.
It is interesting to note that for this dataset, the robust and non-robust methods lead to different conclusions. For example, from Table 1, the non-robust method gives an estimate of the gender parameter β_1 below zero, which may be interpreted as evidence of salary discrimination against female employees and lends support to the plaintiff. In contrast, the robust method yields β̂_1 > 0, which does not indicate that gender has an adverse effect. (A similar conclusion based on penalized likelihood was obtained in Section 4.1 of [21].) Moreover, the estimated nonparametric functions η̂(•) obtained from the non-robust and robust methods are qualitatively different: the former does not deliver a monotone increasing pattern in Age, whereas the latter does. Whether the difference was caused by outlying observations will be an interesting issue to investigate.

Discussion
Over the past two decades, nonparametric inference procedures for testing hypotheses concerning nonparametric regression functions have been developed extensively. See [22][23][24][25][26] and the references therein. The work on the generalized likelihood ratio test [24] sheds light on nonparametric inference based on function estimation under nonparametric models, using the quadratic loss function as the error measure. These works do not directly deal with robust procedures. Exploring inference on nonparametric functions, such as η_o(t) in the GPLM associated with a scalar variable T, and the additive structure ∑_{d=1}^D η_{o,d}(t_d) as in [27] with a vector variable T = (T_1, ..., T_D), estimated via the "robust-BD" as the error measure when there are possible outlying data points, will be future work.
This paper utilizes the class BD of loss functions, the optimal choice of which depends on the specific setting and criteria. For example, regression and classification utilize different loss functions, so a further study of optimality is desirable. Some recent work on partially linear models in econometrics includes [28][29][30]. There, the nonparametric function is approximated via linear expansions, with the number of coefficients diverging with n. Developing inference procedures in that setting that are resistant to outliers could be of interest.
The proof is completed by applying [31].
Proof of Theorem 1. Before showing Theorem 1, we need Proposition A1 (whose proof is omitted), where the following notation will be used. Denote by C¹(T) the set of continuously differentiable functions on T. Let V(β) denote a neighborhood of β ∈ K, and let H_δ(β) denote a neighborhood of η_β.

Proposition A1. Let {(Y_i, X_i, T_i)}_{i=1}^n be independent observations of (Y, X, T) modeled by (2) and (5). Assume that the random variable T is distributed on T. Let K and H_1(β) be compact sets, g(•; •) : R² → R be a continuous and bounded function, W(x, t) : R^{d+1} → R be such that E{|W(X, T)|} < ∞, and η_β(t) = η(t, β) : R^{d+1} → R be a continuous function of (t, β).

For part (i), we first show (A3), which follows from Proposition A1 (ii) and (A4). To show (A4), note that for any ε > 0, we can let T_0 be a compact set such that P(T_i ∉ T_0) < ε. For T_i ∈ T_0, the mean-value theorem applies, where η*_{i,β} is located between η̂_β(T_i) and η_β(T_i). For T_i ∉ T_0, the required bound is entailed by Lemma 1 and the law of large numbers applied to T*_n = n^{-1} ∑_{i=1}^n I(T_i ∉ T_0). This completes the proof of (A3). The proof of β̂ P−→ β_o follows by combining Lemma A-1 of [1] with the preceding results.

Consider β̂ defined in (23). With c_n = 1/√n, θ̂ = √n(β̂ − β_o) minimizes the re-parametrized criterion with respect to θ. By a Taylor expansion, W_n = I_1 + I_2², and arguments for Theorem 2 give the limits of I_1 and I_2.

For a matrix M, the j-th, smallest and largest eigenvalues are denoted by λ_j(M), λ_min(M) and λ_max(M), respectively. Let ||M|| = sup_{||x||=1} ||Mx|| = {λ_max(M^T M)}^{1/2} be the matrix L_2 norm. Denote by P−→ convergence in probability and by D−→ convergence in distribution.

Figure 2 .
Figure 2. Simulated Bernoulli response data with contamination. The captions are identical to those in Figure 1.

Figure 7 .
Figure 7. Simulated Gaussian response data with contamination. Empirical quantiles (on the y-axis) of the Wald-type statistic W_n versus quantiles (on the x-axis) of the χ²_2 distribution. Solid line: the 45-degree reference line. (Left panels): non-robust method; (right panels): robust method.

Figure 8 .
Figure 8. Observed power curves of the tests for the Gaussian response data. The dashed line corresponds to the non-robust Wald-type test W_n; the solid line corresponds to the robust W_n; the dotted line indicates the 5% nominal level. (Left panels): non-contaminated case; (right panels): contaminated case.