Abstract
As applied sciences grow by leaps and bounds, semiparametric regression analyses have broad applications in various fields, such as engineering, finance, medicine, and public health. The single-index varying-coefficient model is a common class of semiparametric models owing to its flexibility and ease of interpretation. Standard estimation procedures for single-index varying-coefficient regression models, whether parametric or semiparametric, assume that all covariates are fully observed; we relax this assumption by considering models with missing covariates. In this paper, we investigate robust variable selection for a single-index varying-coefficient model with missing covariates. To eliminate the potential bias caused by missing data, we propose an inverse probability weighted objective function and study estimators that use parametric and nonparametric estimates of the probability that an observation is fully observed. For variable selection, the weighted objective function is penalized by the non-convex SCAD penalty. The theoretical challenges include handling missing data in a single-index varying-coefficient model together with the robust exponential squared loss and the non-convex penalty function. We provide Monte Carlo simulations to evaluate the performance of our approach.
Keywords:
single-index varying-coefficient model; missing data; variable selection; inverse probability weighting; sparsity
MSC:
62F12; 62G08; 62G20; 62J07
1. Introduction
Traditional statistical techniques are based on completely observed data. However, in many scientific studies, such as questionnaire surveys, medical research, and psychological science, respondents are unwilling to provide some of the information that researchers need. In addition, many factors cannot be controlled during the research process, and it is often impossible to obtain all the desired data. When data are missing, traditional statistical techniques cannot be applied directly, and statisticians have considered how to draw valid conclusions from the observed data in this situation. To date, various methods have been employed to deal with missing data, such as complete-case analysis (CC) (Yates [1] and Healy and Westmacott [2]), imputation, inverse probability weighting (IPW), and likelihood-based methods. The IPW method, proposed by Horvitz and Thompson [3], deals with missing data by weighting each fully observed record by the inverse of its selection probability, so that the estimator is not distorted by random missingness. It has earned extensive attention in the field of missing data research; related literature includes Robins et al. [4], Wang et al. [5], Little and Rubin [6], Liang et al. [7], and Tsiatis [8]. However, when the error distribution is heavy-tailed or skewed, the results of the aforementioned methods are not stable because they are based on the least squares (LS) method.
In most regression models, it is critical to choose a proper loss function so that the resulting estimator is robust, and researchers therefore pay increasing attention to loss functions with higher robustness. The exponential squared loss is defined as $\phi_{\gamma}(t)=1-\exp(-t^{2}/\gamma)$, where $\gamma>0$ is a tuning parameter that determines the degree of robustness of the estimator. For large $\gamma$, $\phi_{\gamma}(t)\approx t^{2}/\gamma$, so in this extreme case the proposed estimator behaves like the LS estimator. When $\gamma$ is small, observations with large absolute residuals produce losses close to the upper bound of $\phi_{\gamma}$, so their influence on the estimate is insignificant. Thus, making $\gamma$ smaller limits the impact of outliers on the estimator but also reduces the sensitivity of the estimator. Like quantile regression (QR), which has become an increasingly popular robust method, regression based on the exponential squared loss is more resistant to the effects of outliers than LS. Such exponential loss functions have been used in classification problems in AdaBoost (Friedman et al. [9]) and for variable selection in regression models (Wang et al. [10]).
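A minimal numerical sketch (in Python, with hypothetical residual values) of how the exponential squared loss bounds the contribution of large residuals, in contrast to the squared loss; the function name and the choice of gamma values are ours, not the authors'.

```python
import numpy as np

def exp_squared_loss(r, gamma):
    """Exponential squared loss phi_gamma(r) = 1 - exp(-r^2 / gamma)."""
    return 1.0 - np.exp(-r**2 / gamma)

residuals = np.array([0.1, 0.5, 1.0, 3.0, 10.0])   # hypothetical residuals
for gamma in (0.5, 2.0, 10.0):
    print(f"gamma={gamma:4.1f}", np.round(exp_squared_loss(residuals, gamma), 3))
# The squared loss grows without bound, so a single large residual dominates:
print("squared   ", np.round(residuals**2, 3))
```

For small gamma the loss at r = 10 is essentially the same as at r = 3 (both close to the upper bound 1), which is why outliers have little influence; for large gamma the loss behaves like r^2/gamma and the estimator approaches LS.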
As applied sciences grow, research on semiparametric models has been extensively developed due to their high degree of flexibility and ease of interpretation. The single-index varying-coefficient model (SIVCM) is a common semiparametric model. One main advantage of the model is that it avoids the curse of dimensionality; another is that it retains the interpretability of parametric models. Generally, it takes the following form
where is the dependent variable, are the covariates and . and represent the vector of unknown coefficient functions and the vector of unknown parameters, respectively, whose dimensions are q and p. is the disturbance term with zero mean and finite variance, which is independent of . Furthermore, assume that the Euclidean norm of is equal to 1 and that its first component is positive. Moreover, in order to avoid the lack of identifiability due to the non-uniqueness of the index direction , cannot take the form of , where are constants, , , and are not parallel to each other (Feng and Xue [11]; Xue and Pang [12]).
Model (1) is flexible enough to cover a class of important statistical models. It becomes the standard single-index model (SIM) when and ; for related literature, see Hardle et al. [13] and Wu et al. [14]. When and , it simplifies to the varying-coefficient model (VCM) proposed by Hastie and Tibshirani [15] and Fan and Zhang [16]. Consequently, it is easily interpretable and has broad applications in practice. In particular, Xia and Li [17] first studied Model (1) using the kernel smoothing method together with the LS method. The empirical likelihood ratio method was proposed by Xue and Wang [18]. Based on estimating equations, an estimate of the parametric component was constructed by Xue and Pang [12]. Using basis function approximation, Feng and Xue [11] investigated Model (1).
Variable selection is of great importance in statistical modeling, because ignoring significant variables can cause seriously biased results, whereas including spurious variables leads to a substantial loss of estimation efficiency. Many penalty functions have therefore become popular choices, such as the least absolute shrinkage and selection operator (LASSO, Tibshirani [19]), the bridge penalty, the smoothly clipped absolute deviation (SCAD, Fan and Li [20]), and the adaptive lasso (Zou [21]). In particular, Peng and Huang [22] proposed a non-concave penalized least-squares method for the SIM based on SCAD penalization; Yang and Yang [23] adopted the SCAD penalty to achieve efficient estimation and variable selection simultaneously in partially linear single-index models (PLSIM); and Wang and Kulasekera [24] studied variable selection in partially linear varying-coefficient models (PLVCM) based on the adaptive lasso.
The SIVCM is a common semiparametric model, and variable selection in semiparametric models involves two parts: the selection of the nonparametric components and the selection of significant variables in the parametric part. Classical variable selection procedures involve stepwise regression and best subset selection. However, the nonparametric parts of each submodel need to be fitted separately, leading to high computational cost. Selecting variables in the SIVCM is a great challenge because the model has a complex multivariate nonlinear structure that includes both a nonparametric function vector and an unknown parameter vector. Based on basis function approximation and the SCAD penalty, Feng and Xue [11] developed a penalized method for the SIVCM that selects significant variables in both the parametric and nonparametric components. It should be noted that existing research adopts the LS or likelihood method and assumes that the error follows a normal distribution. Therefore, when the error is heavy-tailed, these methods are sensitive to outliers and become inefficient; in particular, the least squares criterion is not robust to outliers in the dependent variable. Yang and Yang [25] proposed an efficient iterative procedure for the SIVCM based on quantile regression, and their results indicate that the resulting estimator is robust in the presence of outliers and heavy-tailed errors. However, all existing work on the SIVCM assumes that all variables are fully observed; a robust variable selection approach for the SIVCM with missing covariates has not yet been studied.
The following are the innovations of this paper:
- For the case of missing covariates, we propose a robust variable selection approach based on exponential squared loss and adopt the IPW method to eliminate the latent bias due to the missing values in covariates.
- We consider parametric and nonparametric methods to estimate the selection probability model and propose a penalized, inverse probability weighted objective function for variable selection.
- We also examine how to select the tuning parameter of the exponential squared loss function to ensure that the corresponding estimator is robust.
The rest of this article is organized as follows. Section 2 proposes an efficient iterative procedure for the SIVCM using the exponential squared loss, with the SCAD penalty applied to select both the important parametric variables and the nonparametric components; in addition, we discuss implementation details, including the choice of the number of knots and the tuning parameters. Section 3 conducts several Monte Carlo experiments with different error distributions to show the finite-sample performance of the proposed method. Section 4 concludes the paper briefly.
2. Methodology
Using the exponential squared loss function, the basis function approximation, and the SCAD penalty function, we propose a robust variable selection procedure for the SIVCM with missing covariates. First, the unknown coefficient functions are approximated using B-spline basis functions. Next, under the unit-norm constraint on the index parameter, we use the 'delete-one-component' approach of Yu and Ruppert [26] to establish the objective function of the penalized exponential squared loss.
2.1. Basis Function Expansion
Consider a sample from model (1), i.e.,
where and are p-dimensional and q-dimensional independent variables, respectively. The disturbance term is an unobserved random variable with zero mean and finite variance, and it is assumed to be independent of the covariates.
In order to estimate the unknown coefficient functions, following He et al. [27], we replace them by their basis function approximations. More specifically, we construct B-spline basis functions of order M+1, , where , and K is the number of interior knots. We can then approximate as
where is the vector of spline coefficients. If all the data could be fully observed, the following robust estimation procedure would be used:
where is a tuning parameter. To prevent outliers from affecting the estimate, we introduce the exponential squared loss in (4). However, (4) cannot be optimized directly when the coefficient functions are unknown. After replacing the unknown functions by their basis function approximations in (4), we get
where
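Returning to the basis expansion in (3), the following is a small sketch of how the B-spline basis can be constructed and one coefficient function approximated; the degree M, the number of interior knots K, and the spline coefficients are illustrative choices, not values from the paper.

```python
import numpy as np
from scipy.interpolate import BSpline

M, K = 3, 5                                  # spline degree and number of interior knots (illustrative)
interior = np.linspace(0, 1, K + 2)[1:-1]    # equally spaced interior knots on [0, 1]
knots = np.r_[np.zeros(M + 1), interior, np.ones(M + 1)]
n_basis = K + M + 1                          # dimension of each spline coefficient vector

def basis_matrix(u):
    """Evaluate all B-spline basis functions at the index values u (len(u) x n_basis)."""
    return BSpline(knots, np.eye(n_basis), M)(u)

# One coefficient function g_k(u) is approximated by B(u)^T lam_k
u = np.random.uniform(0, 1, 200)             # e.g. values of the index, rescaled to [0, 1]
lam_k = np.random.normal(size=n_basis)       # hypothetical spline coefficients
g_approx = basis_matrix(u) @ lam_k
```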
We first handle the constraints (unit norm and positive first component) on the p-dimensional single-index parameter vector by reparametrization. Denote the subvector obtained by deleting its first component and define
The true parameter must satisfy a strict inequality constraint, so the index vector is infinitely differentiable with respect to the free parameter. The Jacobian matrix of the former with respect to the latter is
where the lower block is the (p−1)-order identity matrix. As we can see, the free parameter is one dimension lower than the original index vector, and the penalized robust regression with the exponential squared loss is converted to
where . By maximizing (7), we can get and . Then, through (3) and (6), the robust regression estimator of the index parameter based on the exponential squared loss is
and the estimator of the coefficient functions can be obtained by
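A sketch of the 'delete-one-component' transform in (6) and its Jacobian, under the convention that the first component of the index vector is the one removed; all variable names are ours.

```python
import numpy as np

def phi_to_beta(phi):
    """Map the free (p-1)-vector phi (with ||phi|| < 1) to the unit-norm index
    vector with positive first component: beta = (sqrt(1 - ||phi||^2), phi)."""
    return np.r_[np.sqrt(1.0 - phi @ phi), phi]

def jacobian(phi):
    """Jacobian d beta / d phi, a p x (p-1) matrix: first row -phi^T / sqrt(1 - ||phi||^2),
    remaining block the (p-1)-order identity matrix."""
    top = -phi / np.sqrt(1.0 - phi @ phi)
    return np.vstack([top, np.eye(phi.size)])

phi = np.array([0.3, -0.4, 0.2])     # hypothetical reduced parameter, ||phi|| < 1
beta = phi_to_beta(phi)
print(np.linalg.norm(beta))          # 1.0: the unit-norm constraint holds automatically
print(jacobian(phi).shape)           # (4, 3)
```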
2.2. Robust Estimation Based on Inverse Probability Weighting
We consider the case where a subset of covariates has missing values when estimating (5). Let be the vector of covariates that are always observed and be the vector of covariates, from or , that may be partly missing. We define the vector of variables that are always observed as , and . For each observation, an indicator variable R records whether the covariates are completely observed; it is given by the following formula
The missing mechanism we assume satisfies:
With this missing mechanism, conditional on , the event that is missing is unrelated to . Although the response data are fully observed, the selection probability in (10) is related only to the observed covariates and not to the observed response. Therefore, this missing mechanism differs from the usual missing at random (MAR) mechanism. We require this missing mechanism in order to carry out the theoretical development.
When faced with missing covariates, we estimate (5) with a naive approach; only observations with complete data are used to fit the model. The naive estimator is
while all observations with missing data are dropped when estimating the model. Unless the data are missing completely at random, this estimator will be asymptotically biased.
An objective function based on inverse probability weighting (IPW) is proposed in order to reduce the potential bias caused by missing data. In the IPW method, the ith data point is weighted by the inverse of its selection probability. The difference between IPW and the naive method is that IPW assigns different weights to the records with fully observed data. The idea behind the weighting is that a fully observed data point whose probability of being fully observed is π, say, represents on average 1/π data points with the same covariates that would be expected if there were no missing data.
The weight is usually unknown and needs to be estimated. We consider estimating the weights using a parametric model, which assumes a general relationship of the form
Taking the logistic relationship as an example,
In practice, the unknown parameter in the selection model is replaced by its estimate, and the fitted parametric model is used to estimate the selection probability.
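A sketch of the parametric (logistic) estimate of the selection probability and the resulting inverse probability weights; the data are simulated here and the coefficients of the missingness model are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500
V = rng.normal(size=(n, 2))                                    # always-observed covariates
true_pi = 1.0 / (1.0 + np.exp(-(0.5 + V @ np.array([1.0, -0.5]))))  # assumed true selection probabilities
R = rng.binomial(1, true_pi)                                   # R_i = 1 if observation i is fully observed

# Parametric (logistic) estimate of the selection probability; large C => essentially no regularization
fit = LogisticRegression(C=1e6).fit(V, R)
pi_hat = fit.predict_proba(V)[:, 1]

# IPW weights: fully observed records get weight 1 / pi_hat, the others contribute nothing
weights = R / pi_hat
```

Consistent with the missing mechanism in (10), the selection model here depends only on the always observed covariates, not on the response.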
Throughout the paper, we distinguish the parametric estimate of the selection probability, a general estimate that could be parametric or nonparametric, and the true probability that observation i is fully observed. Our parametric robust regression estimator is defined as
According to the above, through (3) and (6) and using the exponential squared loss, can be robustly estimated by
Then, the estimator of can be written as
2.3. The Penalized Robust Regression Estimator
Here we consider the variable selection problem when Model (2) has missing covariates. To improve the accuracy and interpretability of the fitted model and to ensure identifiability, the true regression coefficient vector is generally assumed to be sparse, with only a small fraction of non-zero components (Fan and Li [20]; Tibshirani [19]).
To recover the true model and to estimate the unknown index parameter and coefficient functions, we consider the following penalized robust regression based on the exponential squared loss:
where
The penalty function is defined on and the regularization parameter is non-negative. We emphasize that the tuning parameters and need not be the same for all and . The purpose of using the exponential squared loss in (5) is to prevent outliers from affecting the estimation process. It is impossible to optimize (15) directly when the coefficient functions are unknown. To solve this problem, the unknown functions in (15) are replaced by their basis function approximations, which gives
where
When the selection probability is replaced by its parametric estimate, the parametric penalized robust regression with the exponential squared loss becomes
where . By maximizing (17), we can get the result and . Then, through (3) and (6), the penalized robust regression estimator of based on the exponential squared loss is
and the estimator of can be obtained by
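To make the structure of (17) concrete, the following sketch assembles an IPW-weighted exponential squared loss and a SCAD penalty into a single objective for a generic linear-in-parameters working model; the design matrix `W`, the way the penalty is applied to every component of the stacked parameter vector, and all tuning values are simplifications of the paper's setup, not its exact estimator.

```python
import numpy as np

def scad_penalty(t, lam, a=3.7):
    """SCAD penalty p_lambda(|t|) of Fan and Li [20]."""
    t = np.abs(t)
    return np.where(t <= lam, lam * t,
           np.where(t <= a * lam, (2 * a * lam * t - t**2 - lam**2) / (2 * (a - 1)),
                    lam**2 * (a + 1) / 2))

def penalized_objective(theta, y, W, R, pi_hat, gamma, lam):
    """IPW-weighted exponential squared fit term minus the SCAD penalty.
    theta: stacked parameter vector; W: working design matrix (rows W_i);
    only fully observed rows (R_i = 1) contribute, each weighted by 1 / pi_hat."""
    resid = y - W @ theta
    fit_term = np.sum((R / pi_hat) * np.exp(-resid**2 / gamma))
    penalty = len(y) * np.sum(scad_penalty(theta, lam))
    return fit_term - penalty     # to be maximized, e.g. via the algorithm of Section 2.4
```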
2.4. Algorithm
To facilitate the computation, the loss function is replaced by a quadratic approximation. Let
Given an initial estimator , the loss function can be approximated as
The SCAD penalty function is singular at the origin, which makes direct implementation of the Newton–Raphson algorithm difficult. We therefore develop an iterative algorithm based on the local quadratic approximation of the penalty function, as in Fan and Li [20]. More specifically, in a neighborhood of a given nonzero value , an approximation of the penalty function at the value can be given by
Hence, for the given initial value with , , and with , , we have
Let
Then, in addition to the constant term, we maximize
with respect to and , which yields an approximate solution of (17). We can then obtain the estimates and of and through (3) and (6), respectively.
In order to implement the above method, we need to choose the number of interior knots K and the tuning parameters a, , and in the penalty function appropriately. Fan and Li [20] showed that the choice a = 3.7 performs well in a variety of situations; hence, we follow their setup in this article.
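A sketch of the local quadratic approximation used in the algorithm: the SCAD derivative and one LQA weight update, with a = 3.7 as recommended by Fan and Li [20]; the numerical threshold for treating a coefficient as zero is an illustrative choice.

```python
import numpy as np

def scad_derivative(t, lam, a=3.7):
    """p'_lambda(t) for t >= 0 (Fan and Li [20])."""
    t = np.abs(t)
    return np.where(t <= lam, lam, np.maximum(a * lam - t, 0.0) / (a - 1))

def lqa_step(theta_old, lam, eps=1e-6):
    """Local quadratic approximation: near a nonzero theta_j0,
    p_lambda(|theta_j|) ~ p_lambda(|theta_j0|) + 0.5 * w_j * (theta_j**2 - theta_j0**2),
    with weight w_j = p'_lambda(|theta_j0|) / |theta_j0|.  Components whose current
    value is numerically zero are removed from the model, as in Fan and Li [20]."""
    theta_old = np.asarray(theta_old, dtype=float)
    active = np.abs(theta_old) > eps
    w = np.zeros_like(theta_old)
    w[active] = scad_derivative(theta_old[active], lam) / np.abs(theta_old[active])
    return active, w    # w enters the penalized quadratic problem maximized at each iteration
```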
2.5. The Choice of the Regularization Parameter and
We can choose the tuning parameters using a method similar to cross-validation. However, our penalty function contains many tuning parameters, and minimizing the cross-validation score over such a high-dimensional space is difficult. To overcome this difficulty, similar to Zhao and Xue [28], we take the tuning parameters as
where and are the unpenalized estimators of and , respectively. Then, we can estimate and K by minimizing the following cross-validation score:
where and are the solutions based on (17) after deleting the ith subject.
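A schematic of the leave-one-out cross-validation search over the base tuning parameter and the number of interior knots; `fit_penalized` and `predict` are hypothetical placeholders for maximizing (17) on the reduced sample and evaluating the fitted model at the deleted subject.

```python
import numpy as np

def cross_validate(y, data, lam_grid, K_grid, fit_penalized, predict):
    """Leave-one-out CV: for each (lam, K), refit without subject i and
    accumulate the squared prediction error at subject i."""
    n = len(y)
    best, best_score = None, np.inf
    for lam in lam_grid:
        for K in K_grid:
            score = 0.0
            for i in range(n):
                keep = np.arange(n) != i
                fit_i = fit_penalized(y[keep], data[keep], lam=lam, K=K)  # maximize (17) without subject i
                score += (y[i] - predict(fit_i, data[i])) ** 2
            if score < best_score:
                best, best_score = (lam, K), score
    return best
```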
2.6. The Choice of the Regularization Parameter
The tuning parameter plays a decisive role in the robustness and efficiency of the proposed robust regression estimators. We propose a data-driven procedure to choose it so that the resulting method achieves both high efficiency and high robustness. We first determine the tuning parameters for which the proposed penalized robust estimators have an asymptotic breakdown point of 1/2 and then, among these, select the tuning parameter that maximizes efficiency.
The specific procedure steps are as follows:
Step 1. In this step, we find the pseudo outlier set of the sample as in Wang et al. [10] (a sketch of this step is given after Step 4). Let and . Calculate , and . Then, take the pseudo outlier set , set , and .
Step 2. In this step, we are going to update the tuning parameter . Suppose there are m bad points and good points in . Define the bad points by and the good points by .
The proportion of bad points in is . We first compute the initial estimators and . For a contaminated sample , let
where . Let be the minimizer of in the set , where denotes the determinant operator,
and
Step 3. The value of can be calculated from (22). Then, we can get the value of and by (21). Through fixed and , and selected in Step 2, and can be updated by maximizing (17).
Step 4. Following Xue and Pang [12], we set the estimators and as the initial estimates, which means and . We then repeat Steps 1–3 until , , and converge.
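As mentioned in Step 1, the following is a sketch of the pseudo outlier detection, assuming the MAD-based rule of Wang et al. [10] (robust scale 1.4826 × MAD with cutoff 2.5); the residuals are assumed to come from an initial unpenalized fit, and the exact constants used in the paper may differ.

```python
import numpy as np

def pseudo_outlier_set(residuals, cutoff=2.5):
    """Flag observations whose absolute residual exceeds cutoff times a robust
    scale estimate (1.4826 * MAD), following Wang et al. [10]."""
    r = np.asarray(residuals, dtype=float)
    scale = 1.4826 * np.median(np.abs(r - np.median(r)))
    outliers = np.abs(r) >= cutoff * scale
    return np.flatnonzero(outliers), outliers.mean()   # indices and proportion of "bad" points
```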
3. Simulation
Here we compare the finite-sample performance of the proposed estimation and variable selection method with that of Yang and Yang [25] (QR), Xue and Wang [18] (EL), and Xue and Pang [12] (EE) via Monte Carlo simulations. Since Xue and Wang [18] (EL) and Xue and Pang [12] (EE) do not consider the selection of significant variables, we introduce an adaptive penalty term into their objective functions so that significant variables can be selected.
According to Yang and Yang [25], we choose the Gaussian kernel function in the simulations of the quantile regression method with . Evaluation of the performance of the estimators noted above is based on the following three criteria: (1) the average absolute deviations (AAD) of the estimated coefficients and the standard deviations (SD) for each; (2) mean absolute deviations (MAD) of , which can be calculated by the expression , where represents the p-norm; and (3) the square root of the average square error (RASE) as a measure of the performance of estimator , calculated as follows:
for , where denote the grid points used to assess the function .
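A small sketch of the RASE criterion over a fixed grid of evaluation points; `g_hat` and `g_true` stand for the estimated and true coefficient functions of one simulated replication and are hypothetical here.

```python
import numpy as np

def rase(g_hat, g_true, grid):
    """Square root of the average squared error of an estimated function over grid points."""
    diff = g_hat(grid) - g_true(grid)
    return np.sqrt(np.mean(diff ** 2))

# Example with hypothetical functions on 200 equally spaced grid points
grid = np.linspace(0, 1, 200)
g_true = np.sin
g_hat = lambda u: np.sin(u) + 0.05 * np.cos(5 * u)   # stand-in for a fitted spline
print(rase(g_hat, g_true, grid))
```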
Additionally, in order to demonstrate the effectiveness of the variable selection procedure, the average number of true zero coefficients correctly identified as zero (NC), the average number of true non-zero coefficients incorrectly identified as zero (NIC), as well as the probability of correctly selecting the true model (PC) are presented in our simulation. The tuning parameter is chosen for each simulation sample.
Example 1. In this example, we focus attention on the estimation of the proposed estimation procedure, and the following SIVCM is considered:
where , , and are jointly normally distributed with mean 0, variance 1 and correlation , , and . The error and , , , , are independent; may have missing values. The selection probability functions are given by:
We consider with . The corresponding average missing rates are . In our simulation, three different distributions of the model error are considered (a data-generating sketch follows this list):
Case 1: the standard normal distribution.
Case 2: the centralized t-distribution with three degrees of freedom, which generates a heavy-tailed distribution.
Case 3: a mixture of normals, which is used to produce outliers.
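For concreteness, a sketch of generating errors from the three cases; the mixture proportion and the variance of the contaminating component in Case 3 are assumptions for illustration (the paper's exact values are not reproduced above).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

eps_case1 = rng.standard_normal(n)                  # Case 1: N(0, 1)
eps_case2 = rng.standard_t(df=3, size=n)            # Case 2: t(3), centred and heavy-tailed
# Case 3: mixture of normals producing outliers, e.g. 0.9 N(0, 1) + 0.1 N(0, 5^2) as an assumed example
outlier = rng.random(n) < 0.1
eps_case3 = np.where(outlier, rng.normal(0, 5, n), rng.standard_normal(n))
```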
Table 1 displays the average absolute deviations (AAD) and the standard deviations (SD), as well as the mean absolute deviations (MAD), for each case with sample sizes . It can be seen that when the errors are normally distributed, our proposed estimator based on the exponential squared loss (ESL) has smaller AAD, SD, and MAD than the quantile regression (QR), estimating equation (EE), and empirical likelihood ratio (EL) methods for all sample sizes, which means that the proposed estimator performs better than the other three estimators. The proposed estimator also gives good results for the other two error distributions. The significant improvement of our proposed estimator over the EE, EL, and QR estimators indicates that the proposed estimation method ESL is robust to datasets with outliers or heavy-tailed error distributions of the response variable. More importantly, as the sample size n increases, the performance of the estimator improves significantly.
Table 1.
Simulation results of AAD (), SD (), and MAD () for the estimators of .
The square root of the average squared error (RASE) of the estimator for the nonparametric functions with sample sizes of n = 50, 200, and 400 is reported in Table 2, which gives results similar to those in Table 1. We note that no matter which of the above three distributions the error follows, our proposed estimator has smaller RASE than the other three estimators and performs better. That is, for the non-normal distributions, our proposed estimation method ESL is consistently superior to QR, EE, and EL. When the selection probability is correctly specified and estimated using the parametric model, a clear pattern emerges: as the sample size n increases, the performance of both estimators improves steadily.
Table 2.
Simulation results of RASE for the estimators of .
Example 2. This example aims to study the variable selection performance for the index parameters in model (1). The model setup is similar to (24) except that the covariates are independently generated from and . As in Example 1, three different error distributions , , and are considered to show the robustness of the proposed estimation method based on the exponential squared loss (ESL). The error and , ⋯, , , are independent; may have missing values. The selection probability functions are given by:
We consider with . The corresponding average missing rates are .
For each mechanism mentioned above, we compare the performance of four methods: our proposed method (ESL-SCAD), the LSE-SCAD method of Feng and Xue [11], the LAD-SCAD method of Yang and Yang [25], and the EE-SCAD method based on Xue and Pang [12]. The results are reported in Table 3 and are similar to the conclusions of Example 1. Whether the error term follows the normal distribution, the centralized t-distribution, or the mixture of normals, our proposed method performs variable selection more efficiently, with larger NC and smaller NIC. When there are outliers in the response variable or heavy-tailed error distributions, ESL-SCAD performs clearly better than the LAD-SCAD, EE-SCAD, and LSE-SCAD estimators. For normal errors, ESL-SCAD hardly loses any efficiency.
Table 3.
Variable selection results and RASE of , in Example 2.
The proposed procedure is also competitive in terms of computational cost. The calculations were performed on a computer with an AMD Ryzen processor and 16 GB of RAM, running Windows 10, and only one CPU was used for fair comparison. Results on the computational efficiency of our proposed method are presented in Table 4 and Table 5, which show CPU times (in seconds) for different combinations of the full data size n and the number of covariates p. It is seen that the proposed algorithm is fast.
Table 4.
CPU times for different n in Example 1.
Table 5.
CPU times for different n in Example 2.
4. Discussion
In this paper, we use penalized regression with the exponential squared loss to propose a robust variable selection procedure for the single-index varying-coefficient model with missing covariates. B-splines are used to approximate the unknown coefficient functions, IPW is employed to deal with the bias resulting from missing covariates, and a non-convex penalty is used to estimate and select variables simultaneously. We examine the sampling properties and robustness of our estimator. The theoretical and simulation studies in this paper demonstrate the merits of our method, and we also illustrate that it performs well on real data. In particular, we show that the estimator attains the highest sample breakdown point and that its influence function is bounded for outliers in either the response domain or the covariate domain. When outliers are present (regardless of the mechanism), EE-SCAD and LSE-SCAD are inferior in terms of variable selection performance.
Moreover, further studies can be built on the proposed method. First, goodness-of-fit testing is worth considering; in this paper we only study sparse estimation and variable selection. Second, censored data could be examined under this model. An investigation of these issues is part of future work but is beyond the scope of this paper. Finally, in the proposed theory the interior knots are treated as fixed values; how to select the interior knots optimally when data are missing is an interesting problem worthy of future research.
Author Contributions
Formal analysis, H.S.; Methodology, Y.S.; Software, Y.L. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded by NNSF project (61503412) of China, NSF project (ZR2019MA016) of Shandong Province of China.
Conflicts of Interest
The authors declare that they have no competing interest.
References
- Yates, F. The analysis of replicated experiments when the field results are incomplete. Emp. J. Exp. Agric. 1933, 1, 129–142.
- Healy, M.; Westmacott, M. Missing values in experiments analysed on automatic computers. J. R. Stat. Soc. Ser. B Methodol. 1956, 5, 203–206.
- Horvitz, D.G.; Thompson, D.J. A generalization of sampling without replacement from a finite universe. J. Am. Stat. Assoc. 1952, 47, 663–685.
- Robins, J.M.; Rotnitzky, A.; Zhao, L.P. Estimation of regression coefficients when some regressors are not always observed. J. Am. Stat. Assoc. 1994, 89, 846–866.
- Wang, C.; Wang, S.; Zhao, L.P.; Ou, S.T. Weighted semiparametric estimation in regression analysis with missing covariate data. J. Am. Stat. Assoc. 1997, 92, 512–525.
- Little, R.J.; Rubin, D.B. Statistical Analysis with Missing Data; John Wiley & Sons: Hoboken, NJ, USA, 2019; Volume 793.
- Liang, H.; Wang, S.; Robins, J.M.; Carroll, R.J. Estimation in partially linear models with missing covariates. J. Am. Stat. Assoc. 2004, 99, 357–367.
- Tsiatis, A.A. Semiparametric Theory and Missing Data; Springer: Berlin/Heidelberg, Germany, 2006.
- Friedman, J.; Hastie, T.; Tibshirani, R. Additive logistic regression: A statistical view of boosting (with discussion and a rejoinder by the authors). Ann. Stat. 2000, 28, 337–407.
- Wang, X.; Jiang, Y.; Huang, M.; Zhang, H. Robust variable selection with exponential squared loss. J. Am. Stat. Assoc. 2013, 108, 632–643.
- Feng, S.; Xue, L. Variable selection for single-index varying-coefficient model. Front. Math. China 2013, 8, 541–565.
- Xue, L.; Pang, Z. Statistical inference for a single-index varying-coefficient model. Stat. Comput. 2013, 23, 589–599.
- Hardle, W.; Hall, P.; Ichimura, H. Optimal smoothing in single-index models. Ann. Stat. 1993, 21, 157–178.
- Wu, T.Z.; Lin, H.; Yu, Y. Single-index coefficient models for nonlinear time series. J. Nonparametr. Stat. 2011, 23, 37–58.
- Hastie, T.; Tibshirani, R. Varying-coefficient models. J. R. Stat. Soc. Ser. B Methodol. 1993, 55, 757–779.
- Fan, J.; Zhang, W. Statistical estimation in varying coefficient models. Ann. Stat. 1999, 27, 1491–1518.
- Xia, Y.; Li, W.K. On single-index coefficient regression models. J. Am. Stat. Assoc. 1999, 94, 1275–1285.
- Xue, L.; Wang, Q. Empirical likelihood for single-index varying-coefficient models. Bernoulli 2012, 18, 836–856.
- Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B Methodol. 1996, 58, 267–288.
- Fan, J.; Li, R. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 2001, 96, 1348–1360.
- Zou, H. The adaptive lasso and its oracle properties. J. Am. Stat. Assoc. 2006, 101, 1418–1429.
- Peng, H.; Huang, T. Penalized least squares for single index models. J. Stat. Plan. Inference 2011, 141, 1362–1379.
- Yang, H.; Yang, J. A robust and efficient estimation and variable selection method for partially linear single-index models. J. Multivar. Anal. 2014, 129, 227–242.
- Wang, D.; Kulasekera, K. Parametric component detection and variable selection in varying-coefficient partially linear models. J. Multivar. Anal. 2012, 112, 117–129.
- Yang, J.; Yang, H. Quantile regression and variable selection for single-index varying-coefficient models. Commun. Stat. Simul. Comput. 2017, 46, 4637–4653.
- Yu, Y.; Ruppert, D. Penalized spline estimation for partially linear single-index models. J. Am. Stat. Assoc. 2002, 97, 1042–1054.
- He, X.; Zhu, Z.Y.; Fung, W.K. Estimation in a semiparametric model for longitudinal data with unspecified dependence structure. Biometrika 2002, 89, 579–590.
- Zhao, P.; Xue, L. Variable selection for semiparametric varying coefficient partially linear models. Stat. Probab. Lett. 2009, 79, 2148–2157.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).