Robust Variable Selection with Exponential Squared Loss for the Spatial Durbin Model

With the growing use of spatially dependent data across many fields, spatial econometric models have attracted increasing attention. In this paper, a robust variable selection method based on the exponential squared loss and the adaptive lasso is proposed for the spatial Durbin model. Under mild conditions, we establish the asymptotic and "oracle" properties of the proposed estimator. However, the resulting optimization problem is nonconvex and nondifferentiable, which poses challenges for solution algorithms. To address this, we design a block coordinate descent (BCD) algorithm and give a DC (difference-of-convex) decomposition of the exponential squared loss. Numerical simulation results show that the method is more robust and accurate than existing variable selection methods in the presence of noise. In addition, we apply the model to the 1978 housing price dataset of the Baltimore area.


Introduction
In recent years, spatial cross-sectional data have been widely used in geography, politics, environmental science, and other fields. Consequently, spatial econometrics, originally developed for economic applications, has also attracted much attention. Anselin (1988) [1] divides spatial econometric models into spatial error models, spatial lag models, and spatial Durbin models (SDM). The spatial Durbin model is written as y = ρWy + Xβ + WXδ + ε. It accounts simultaneously for the influence of the spatially lagged independent and dependent variables on the dependent variable and can therefore yield unbiased coefficient estimates more easily. The spatial Durbin model can also be used to compute spatial spillover effects from panel data. In spatial regression analysis, the influence of regional location on the observations is expressed through the spatial weight matrix W, and an appropriate specification of W is an essential basis for spatial econometric analysis. There are two main ways to select a spatial weight matrix. The first is to choose one from a candidate set of spatial weight matrices: Kelejian (2008) [2] uses GMM estimation to select the true spatial weight matrix, proposing a non-nested J-test to compare a set of alternative SAR models with different spatial weight matrices against the null model. The second estimates the weight matrix by averaging different spatial weight matrices: Zhang and Yu (2018) [3] propose a model averaging procedure to reduce estimation error, which overcomes the difficulty that the true spatial weight matrix may not be among the candidates.
In classical linear regression, much work has been devoted to variable selection. The most popular approach is to add a penalty function to the model. These penalized regression methods share a unified theoretical framework, including the least absolute shrinkage and selection operator (lasso, Tibshirani, 1996) [4], the smoothly clipped absolute deviation (SCAD, Fan and Li, 2001) [5], and the adaptive lasso (Zou, 2006) [6]. Since the SDM exhibits spatial autocorrelation, these variable selection methods cannot be applied to it directly.
Due to noise and outliers, classical variable selection methods in regression models often suffer from instability, so many scholars have proposed robust variable selection algorithms. The Huber loss function was widely used in early studies, but it has limitations in both efficiency and computation. Wang et al. (2013) [7] proposed a robust parameter estimation method based on the exponential squared loss function, which is widely used in boosting algorithms (Friedman et al., 2000) [8]. When γ is small, the loss incurred by a large |t| is close to 1 and hence bounded, so the loss function is robust for parameter estimation. Wang et al. (2013) [7] also point out that this method is more robust than other robust variable selection methods, such as Huber estimation, quantile regression estimation (Koenker and Bassett, 1978) [9], and composite quantile regression estimation (Zou and Yuan, 2008) [10], and proposed a selection method for the parameter γ.
Our research focuses on variable selection for spatial Durbin models. The spatial Durbin model combines the spatial interaction of the dependent and explanatory variables, but only a few researchers have used and studied this model. Beer and Riedl (2011) [11] used maximum likelihood estimation to estimate the parameters of the spatial Durbin model and used the Monte Carlo method to analyze the properties of the estimator. Mustaqim (2018) [12] discusses instrumental variable efficiency in simultaneous spatial Durbin models using the 2SLS and GMM-S2SLS estimation methods; the analysis shows that GMM-S2SLS produces less bias than 2SLS. Zhu (2020) [13] proposed parameter estimation of the spatial Durbin model based on Markov Chain Monte Carlo (MCMC). Wei (2021) [14] proposed a within-group spatial two-stage least squares estimator. However, the existing variable selection methods are affected by outliers in finite samples and are not robust enough. Therefore, it is imperative to study a robust variable selection method.
Considering robustness, we combine a parameter penalty with the exponential squared loss and assume that the model errors are independent and identically distributed. For the penalty, we use the adaptive lasso. We previously applied the robust variable selection method based on the exponential squared loss to the spatial autoregressive model and achieved satisfactory results [15]. The spatial autoregressive model is a special case of the spatial Durbin model. In this paper, we study the application of this robust variable selection method to the spatial Durbin model.
A robust variable selection method for the spatial Durbin model based on the adaptive lasso penalty and the exponential squared loss function is proposed in this paper. This method can not only estimate the regression coefficients but also perform variable selection. Next, we outline the contributions of the paper.

1.
We build a robust variable selection method for the SDM, equipped with the exponential squared loss, that resists the influence of outliers in the observed values and of errors in estimating the spatial weight matrix.

2.
To solve the optimization problem of the SDM, we propose a block coordinate descent (BCD) algorithm. To solve the subproblems generated by the BCD algorithm, we design a DC decomposition of the exponential squared loss and construct a CCCP procedure. Finally, we establish the convergence of the BCD algorithm and analyze its rate of convergence to a stationary point under mild conditions.

3.
We prove the "oracle" property of the robust variable selection method and conduct numerical experiments to verify the robustness and effectiveness of the model. Numerical studies show that when there are outliers in the observed data, the proposed method outperforms the comparison methods in correctly identifying zero coefficients, in avoiding incorrectly zeroing nonzero coefficients, and in MedSE.
The structure of this paper is as follows. Section 2 introduces the spatial Durbin model and gives the exponential squared loss function with the adaptive lasso. In Section 3, we propose an effective algorithm for the variable selection process. To examine the performance of the model with finite samples, we carry out numerical simulations in Section 4. In Section 5, we apply our model to a real-world dataset. Section 6 concludes.

Spatial Durbin Model
The observed dependent variable Y_i ∈ R, with corresponding independent vector X_i = (X_i1, . . . , X_ip), where p is a fixed constant. Let the dependent variable vector Y = (Y_1, . . . , Y_n)^T and the independent variable matrix X = (X_1, . . . , X_n)^T ∈ R^{n×p}. The spatial Durbin model is as follows:

Y = ρWY + Xβ + WXδ + ε, (1)

where the regression coefficient vector β = (β_1, . . . , β_p)^T ∈ R^p, the spatial autocorrelation coefficient ρ ∈ R, the regression coefficient vector of the exogenous variables δ = (δ_1, . . . , δ_p)^T ∈ R^p, and the error vector ε = (ε_1, . . . , ε_n)^T ∈ R^n. WX is a spatial lag term that reflects the interaction of the independent variables between individuals, and WY captures the interaction between the dependent variable Y and its neighbors. We assume the noises ε_i are independent and identically distributed N(0, σ²). Then Y can be expressed as

Y = (I_n − ρW)^{−1}(Xβ + WXδ + ε). (2)

Since the maximum eigenvalue of W is 1 after normalization, to guarantee the invertibility of (I_n − ρW) we require |ρ| < 1. Additionally, in this article, we ignore the endogeneity of the model.
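The reduced form (2) suggests a direct way to simulate from the SDM: draw ε and solve the linear system (I_n − ρW)Y = Xβ + WXδ + ε. A minimal sketch (the function name and defaults are our own):

```python
import numpy as np

def simulate_sdm(X, W, beta, delta, rho, sigma=1.0, rng=None):
    """Draw Y from the SDM: Y = rho*W*Y + X*beta + W*X*delta + eps,
    i.e. Y = (I - rho*W)^(-1) (X*beta + W*X*delta + eps)."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    eps = rng.normal(0.0, sigma, size=n)   # i.i.d. N(0, sigma^2) noise
    A = np.eye(n) - rho * W                # invertible when |rho| < 1
    return np.linalg.solve(A, X @ beta + W @ X @ delta + eps)
```

Using `np.linalg.solve` avoids forming the explicit inverse of (I_n − ρW), which is both faster and numerically safer.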

Variable Selection Method for SDM
Rewrite model (1) as model (3) in the following form:

Y = ρWY + X̃β̃ + ε, (3)

where X̃ = (X, WX) ∈ R^{n×2p} and β̃ = (β^T, δ^T)^T ∈ R^{2p}. Consider variable selection for the SDM. In practical applications, the regression coefficient vector β* is usually sparse. At the same time, sparse solutions identify the useful dimensions, reduce redundancy, and improve the accuracy and robustness of regression prediction (Fan and Li, 2001 [5]; Tibshirani, 1996 [4]). It is natural to apply a penalized method to variable selection, which can select the essential variables and estimate the regression coefficients. In this article, we penalize the loss function with the adaptive lasso penalty, described as follows:

λ Σ_{j=1}^{2p} η_j |β̃_j|, (4)

where η_j = 1/|β̂_j| and β̂_j is generally given by least squares estimation. Considering that the exponential squared loss function has good robustness, we use it as the model's loss function in this paper. The exponential squared loss is

φ_γ(t) = 1 − exp(−t²/γ). (5)

Here, γ is a parameter that controls the robustness of the loss function. A small γ limits the effect of outliers on the model but also reduces the accuracy of the model. Therefore, it is essential to choose γ appropriately; the selection method is given in Section 2.4.
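The loss in (5) can be coded directly; a small sketch, assuming the form φ_γ(t) = 1 − exp(−t²/γ) used by Wang et al. (2013):

```python
import numpy as np

def exp_squared_loss(t, gamma):
    """Exponential squared loss phi_gamma(t) = 1 - exp(-t^2 / gamma).
    Bounded above by 1, so any single large residual contributes at most 1."""
    t = np.asarray(t, dtype=float)
    return 1.0 - np.exp(-t**2 / gamma)
```

A smaller γ flattens the loss for large |t| (more robust, less efficient), which is why the choice of γ matters.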
The model is constructed on the basis of model (3). The objective function to be solved is as follows:

min_{β̃, ρ} (1/n) Σ_{i=1}^n φ_γ(Y_i − ρ(WY)_i − X̃_i β̃) + λ Σ_{j=1}^{2p} η_j |β̃_j|. (6)

We may as well set Ỹ = WY; then we obtain the simplified expression of (6) as (7):

min_{β̃, ρ} (1/n) Σ_{i=1}^n φ_γ(Y_i − ρỸ_i − X̃_i β̃) + Σ_{j=1}^{2p} λ_j |β̃_j|, (7)

where λ > 0 is a regularization parameter, λ_j = λ η_j, and φ_γ(·) is the exponential squared loss.

Oracle Properties and Large Sample Properties
In this section, we discuss the large sample properties and oracle properties of the proposed spatial Durbin model parameter estimation method.
Let θ̂ be the resulting estimator of (4); suppose that the β̂ here has also undergone the above transformation.
We prove the asymptotic and oracle properties of the proposed penalized estimators. Before the proof, we need the following assumptions.

Assumption 1. Σ = E[ZZ^T] is positive definite and E‖Z‖³ < ∞.

Assumption 2. The matrix I_n − ρW is nonsingular with |ρ| < 1.

Assumption 8. There exist constants C₁ and C₂ such that, when θ₁, . . .

For the proposed estimator, we give the following large sample properties. The following theorem gives the consistency and "oracle" property of the proposed estimator.

Theorem 1. If Assumptions 1–8 hold, then there is a local maximizer θ̂ such that ‖θ̂ − θ₀‖ = O_p(n^{−1/2} + a_n).

The Selection of Parameter γ
Parameter γ controls the robustness and efficiency of the robust variable selection method. Wang et al. (2013) [7] proposed a parameter selection method for normal regression. In this paper, we extend the selection of γ to the spatial Durbin model. The specific procedure is as follows:

Step 1. Obtain an initial estimate β̂* of the regression parameters.

Step 2. Find the pseudo-outlier set of the sample: compute the residuals r_i(β̂*), i = 1, . . . , n, and S_n = 1.4826 × median_i |r_i(β̂*) − median_j r_j(β̂*)|. Then the pseudo-outlier set is D_m = {(X_i, Y_i) : |r_i(β̂*)| ≥ 2.5 S_n}, with m = #{1 ≤ i ≤ n : |r_i(β̂*)| ≥ 2.5 S_n} and D_{n−m} = D_n \ D_m.
In the above procedure, the initial step requires an initial value β̂^(0). In practice, the LAD loss estimate is usually used as β̂^(0).
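Step 2 above can be sketched as follows; we read the flagging rule as |r_i| ≥ 2.5 S_n (the absolute value is our reading of the extracted formula):

```python
import numpy as np

def pseudo_outlier_set(residuals, c=2.5):
    """MAD-type scale S_n = 1.4826 * median_i |r_i - median_j r_j|,
    then flag observation i as a pseudo-outlier when |r_i| >= c * S_n.
    Returns the flagged indices and S_n."""
    r = np.asarray(residuals, dtype=float)
    s_n = 1.4826 * np.median(np.abs(r - np.median(r)))
    outliers = np.where(np.abs(r) >= c * s_n)[0]
    return outliers, s_n
```

The factor 1.4826 makes S_n a consistent estimate of the standard deviation under Gaussian residuals, so the 2.5 S_n cutoff behaves like a 2.5σ rule.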

The Selection of Parameter λ and η j
We set λ_i = λ · η_i, where λ and η_i are from model (7). Researchers usually use cross-validation, AIC, or BIC to select λ_i. In this paper, considering computational cost and the consistency of variable selection, we adopt the method of Wang, Li, and Tsai (2007) [16] and choose the regularization parameters by minimizing a BIC-type objective function. The selection method for the parameter γ was given above. We set λ_i = log(n)/(n|θ̃_i|), where θ̃_i is the exponential squared loss estimate without the penalty term. Note that this choice satisfies √n λ_i → 0 for i ≤ d₀ and √n λ_i → ∞ for i > d₀, where d₀ is the number of nonzero components of θ₀. Therefore, the final estimator ensures the consistency of variable selection.

Algorithm for Model Solving
In this section, we design algorithms to solve model (7). The optimization problem has two blocks of variables, β̃ ∈ R^{2p} and ρ ∈ [0, 1], so the block coordinate descent algorithm is a natural choice. However, the subproblem in β̃ is nonconvex and nondifferentiable, and the convergence of the block coordinate descent algorithm is difficult to guarantee. In this case, we use a DC decomposition and the CCCP algorithm to handle it. Finally, for the penalty term in the optimization model, we use the ISTA algorithm. Details are given below.

Block Coordinate Descent Algorithm Frame
We present the framework of the block coordinate descent algorithm in Algorithm 1. Next, we need to solve subproblems (8) and (9).

Solving the Subproblem (8)
Subproblem (8) minimizes a univariate function over the interval [0, 1], so it can be solved with a golden-section algorithm combined with parabolic interpolation; see Forsythe et al. (1977) [17] for details. We do not repeat it here.
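A bare golden-section search (without the parabolic-interpolation acceleration of Forsythe et al.) is enough to illustrate how the ρ-subproblem can be handled; the function name and tolerance are our own:

```python
def golden_section(f, a=0.0, b=1.0, tol=1e-8):
    """Minimize a univariate function f on [a, b] by golden-section search.
    In the BCD algorithm, f would be objective (7) with beta held fixed."""
    invphi = (5 ** 0.5 - 1) / 2          # 1 / golden ratio
    c = b - invphi * (b - a)
    d = a + invphi * (b - a)
    while b - a > tol:
        if f(c) < f(d):                  # minimum lies in [a, d]
            b, d = d, c
            c = b - invphi * (b - a)
        else:                            # minimum lies in [c, b]
            a, c = c, d
            d = a + invphi * (b - a)
    return 0.5 * (a + b)
```

Each iteration shrinks the bracket by the constant golden-ratio factor, so the cost is logarithmic in 1/tol; adding parabolic interpolation, as in Forsythe et al., speeds this up near smooth minima.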

Solving the Subproblem (9)
For subproblem (9), observe that the penalty term of the optimization model is a convex function, and the loss function φ_γ can be decomposed into the difference of two convex functions, i.e., it is a DC function. Hence subproblem (9) is a DC program, and we can construct a corresponding algorithm to solve it.

Algorithm 1: Block coordinate descent algorithm
1. Set initial values β̃^0 ∈ R^{2p} and ρ^0 ∈ (0, 1);
2. repeat { for k = 0, 1, 2, · · · }
3. Solve the subproblem (8) in ρ with initial point ρ^k;
4. Solve the subproblem (9) in β̃ with initial value β̃^k to obtain β̃^{k+1}, ensuring that the objective does not increase;
5. until convergence.
We first perform a DC decomposition of the loss function φ_γ. It can be verified that both F(t) and G(t) are convex functions, and the DC decomposition of φ_γ(t) is φ_γ(t) = F(t) − G(t). After the DC decomposition, the problem can be solved by the CCCP algorithm. Next, define the two functions J_vex(·) and J_cav(·), where w_i is the i-th row of the weight matrix W and Σ_{j=1}^p P(β_j) is a convex penalty with respect to β̃. Then J_vex(·) and J_cav(·) are a convex and a concave function, respectively, so the suboptimization problem (9) can be rewritten as (15). Problem (15) can then be solved by the CCCP (Concave-Convex Procedure) algorithm; the CCCP framework is shown in Algorithm 2, whose step 4 is iterated until convergence of β̃^k.
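The paper's exact F and G are not reproduced in this extracted text, but one valid DC decomposition of φ_γ takes F(t) = t²/γ and G(t) = t²/γ − φ_γ(t); since φ″_γ(t) ≤ 2/γ everywhere, G is convex. A sketch under that assumption:

```python
import numpy as np

GAMMA = 1.0

def phi(t):
    """Exponential squared loss phi_gamma(t) = 1 - exp(-t^2 / gamma)."""
    return 1.0 - np.exp(-t**2 / GAMMA)

def F(t):
    """Convex part of the decomposition: t^2 / gamma."""
    return t**2 / GAMMA

def G(t):
    """Convex part subtracted off, so that phi = F - G identically."""
    return F(t) - phi(t)
```

CCCP then linearizes the concave part −G at the current iterate and minimizes the resulting convex surrogate at each step.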
It is easy to see that the optimization problem (16) is a convex optimization problem. The CCCP algorithm minimizes problem (15) by iteratively solving a sequence of convex problems (16). Therefore, the method used to solve subproblem (16) directly affects the efficiency of the CCCP iterations.
Observe subproblem (16): it consists of a smooth convex part and the penalty Σ_{j=1}^p P(β_j). Writing the smooth convex part as ψ(β̃), subproblem (16) can be represented as (17). Problem (17) is composed of a convex function plus the adaptive lasso penalty, and we can use the ISTA algorithm to solve such problems.
For any L > 0, ISTA approximates the smooth part of the objective by a quadratic model with curvature L around the current point; this approximation has a closed-form minimizer given by the soft-thresholding operator Θ_L. Thus, each iteration for problem (17) can be expressed simply as β̃^k = Θ_L(β̃^{k−1}).
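The operator Θ_L is the familiar soft-thresholding (proximal) operator; a sketch with a per-coordinate weight vector for the adaptive lasso (function names are ours):

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (tau may be a vector of
    per-coordinate adaptive-lasso weights): shrink toward zero and clip."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def ista_step(beta, grad, L, lam):
    """One ISTA update: gradient step on the smooth part with step 1/L,
    then soft-threshold with lam / L."""
    return soft_threshold(beta - grad / L, lam / L)
```

Coordinates whose gradient step lands within lam/L of zero are set exactly to zero, which is what produces sparse estimates.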
In this article, we use the FISTA algorithm, which converges faster than ISTA (Beck and Teboulle, 2009) [18]. The FISTA framework with backtracking steps is given in Algorithm 3.

Numerical Examples
We design five numerical experiments to verify the performance and accuracy of the variable selection methods under different conditions, for example, when there are outliers in the dependent variable Y or too many insignificant covariates.
Data are generated from model (1). The design matrix X is an n × (q + 3) matrix whose rows follow a (q + 3)-dimensional normal distribution with mean zero and covariance matrix (σ_ij), where σ_ij = 0.5^{|i−j|}. Thus the number of samples is n, the number of significant covariates is 3, and the number of insignificant covariates is q. In the following experiments, we set n ∈ {200, 360, 500} and q ∈ {5, 20, 40, 60}. For the spatial regression coefficient ρ, we set ρ ∈ {0.2, 0.5, 0.8}.
We define the spatial weight matrix as a k-diagonal matrix, i.e., a matrix in which only the main diagonal and the k − 1 skew diagonals around it have element 1 and all other elements are 0. In the numerical experiments, we set k = 7.
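The k-diagonal weight matrix can be constructed as below. Zeroing the main diagonal and row-normalizing are our additions, following the usual spatial weight matrix conventions and the normalization mentioned in Section 2:

```python
import numpy as np

def banded_weight_matrix(n, k=7, zero_diagonal=True, row_normalize=True):
    """Ones on the band of half-width (k-1)//2 around the main diagonal,
    zeros elsewhere; optionally zero the diagonal and row-normalize
    (both are our assumptions, not stated in the text)."""
    half = (k - 1) // 2
    idx = np.arange(n)
    W = (np.abs(idx[:, None] - idx[None, :]) <= half).astype(float)
    if zero_diagonal:
        np.fill_diagonal(W, 0.0)
    if row_normalize:
        W = W / W.sum(axis=1, keepdims=True)
    return W
```

After row normalization the maximum eigenvalue of W is 1, which is the condition used in Section 2 to guarantee invertibility of I_n − ρW for |ρ| < 1.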
For the error term, let ε ∼ N(0, σ²I_n), where σ² is drawn from a uniform distribution. Of course, in practice, the observation noise does not fully conform to a Gaussian distribution, and there may be outliers in the response; these are discussed in Section 4.3.
To demonstrate the merits of this model, we also compare the square loss and LAD loss with our exponential squared loss. To ensure the accuracy of the experiments, we repeat each experiment 100 times. The results reported below are medians over the 100 replications; we denote the median of the MSE by MedSE.

Nonregular Estimation of Normal Data
In this section, we run experiments with q = 5, Gaussian noise, and no penalty term in the parameter estimation model. The results are shown in Table 1, where Square, Exp, and LAD denote the square loss, exponential squared loss, and LAD loss, respectively. (1) Exp, Square, and LAD all estimate β and δ close to the true values (the true values of β₁, β₂, and β₃ are 3.0, 2.0, and 1.6, and the true values of δ₁, δ₂, and δ₃ are 2.0, 1.5, and 1.0). By comparison, the estimates obtained with the square loss model are the best. (2) In terms of MedSE, the square loss model also performs best. (3) All three loss functions give accurate estimates of the spatial autoregressive coefficient ρ.

Nonregular Estimation for High-Dimensional Data
In this subsection, we set q ∈ {20, 40, 60} and examine the parameter estimates on normal data with a large number of covariates. The results are shown in Table 2. The estimation of β, δ, and ρ for every model is far less accurate than when q = 5, and the MedSE results are also unsatisfactory. Given the insufficient sample size, such results are to be expected.

Nonregular Estimation of Data with Outliers in Dependent Variable y
In this subsection, we make the error term follow the mixed Gaussian distribution (1 − ξ₁)·N(0, 1) + ξ₁·N(10, 6²), where ξ₁ ∈ {0.01, 0.05}. In this case, the observed y contains many outliers. The results are shown in Table 3. (1) In contrast with Table 1, where y has no outliers, the exponential squared loss performs best in almost all tests in Table 3. (2) By comparison, the estimates of β and δ obtained with the exponential squared loss model are the best. (3) For the estimation of ρ, the exponential squared loss is also best. Therefore, we can conclude that when y has outliers, the SDM based on the exponential squared loss has good robustness.

Nonregular Estimation of Data with Noise in Spatial Weight Matrix
In this section, we simulate the presence of noise in the spatial weight matrix. We add to each nonzero element of W a small disturbance drawn from (1 − ξ₂)·N(0, 0.001) + ξ₂·N(0, 1), where ξ₂ ∈ {0.01, 0.03, 0.05}; all simulated data are generated with ρ = 0.5 and σ = 1. The test results are shown in Table 4. Compared with the normal data (Table 1), the MedSE values increase, and for each loss function the estimation of β, δ, and ρ worsens. When the weight matrix is noisy, the exponential squared loss and the LAD loss perform well: compared with the square loss, they give more accurate parameter estimates and smaller MedSE values. It must be admitted, however, that the LAD loss performs better than the exponential squared loss here.

Estimation with Adaptive-l1 Regularizer
In this section, we add adaptive L1 regularization to the loss functions and repeat the experiments. We record the average number of zero coefficients correctly identified by a model as "Correct" and the average number of nonzero coefficients incorrectly set to zero as "Incorrect". Table 5 shows the results of adaptive lasso regularization on normal data with q = 5. Under almost all settings, the SDM with exponential squared loss and adaptive lasso not only identifies more true zero coefficients (its "Correct" value is almost twice that of the square loss and LAD loss models) with nearly zero "Incorrect" counts, but also attains the best MedSE and an accurate estimate of ρ. Table 6 shows the results of adaptive lasso regularization on normal data with q ∈ {20, 40, 60}. When there are many insignificant covariates, the accuracy of the square loss and LAD loss models with adaptive lasso decreases markedly, whereas the model with exponential squared loss and adaptive lasso remains accurate: it identifies more true zeros ("Correct"), keeps "Incorrect" near zero, and has the best MedSE and a precise estimate of ρ. Table 7 shows the results of estimation with adaptive L1 regularization when the observations of y contain outliers. In almost all tests, the exponential squared loss model with adaptive L1 identifies more true zero coefficients ("Correct") and, in most cases, has lower MedSE. Compared with the models without a regularization term (Table 3), the models with adaptive L1 perform better. In the tests, the exponential squared loss model with adaptive L1 identified at least 8 zero coefficients and, in most cases, all 10.
For MedSE, the exponential squared loss model with adaptive L1 has the smallest MedSE in all cases, except that when n = 500 and 2q = 10 it is in some cases slightly larger than that of the LAD loss model with adaptive L1. This shows that the SDM using exponential squared loss and adaptive lasso has excellent variable selection ability and strong robustness when the observations of Y contain outliers. Table 8 shows the results of adaptive lasso regularization for data with q = 5, ρ = 0.5, and a noisy spatial weight matrix. In all tests, the exponential squared loss with adaptive L1 identifies more zero coefficients ("Correct") than the other models. Compared with the results of the models without a regularization term (Table 4), the models with adaptive L1 perform better. For MedSE, when n = 200 and 2q = 10, the exponential squared loss with adaptive L1 is best; when n = 500 and 2q = 10, the LAD loss with adaptive L1 is best; and when n = 360 and 2q = 10, the two differ little. However, since the exponential squared loss with adaptive L1 identifies more zero coefficients, we consider it better than the LAD loss with adaptive L1. The results show that when the spatial weight matrix has estimation error, the SDM with exponential squared loss and adaptive lasso retains excellent variable selection ability and robustness.

Application of Practical Examples
In this part, we apply the model to actual data to verify the accuracy and efficiency of variable selection and parameter estimation.
We selected a dataset with 211 observations describing house sales in the Baltimore area in 1978, containing house prices and other relevant features. The original data were made available by Robin Dubin [19], Weatherhead School of Management, Case Western Reserve University, Cleveland, OH. The features of these data are described in Table 9. We mainly study the relationship between price and the other variables: the dependent variable is log(PRICE), and the independent variables are NROOM, DWELL, NBATH, PATIO, FIREPL, AC, BMENT, NSTOR, GAR, AGE, CITCOU, LOTSZ, and SQFT. We set the spatial weight matrix W based on the geographic distance between observations, determined from the features X and Y, and we normalize the spatial weight matrix. Table 10 shows the variable selection results of the SDM for the square loss, exponential squared loss, and LAD loss, each with adaptive lasso and with no penalty. To make the variable selection results more intuitive, we designed Table 11: if a model finds a positive effect of an independent variable on the dependent variable, we mark it "+"; if the model finds a negative correlation, we mark it "−"; and if the model considers the independent variable to have no effect on the dependent variable (the absolute value of the parameter estimate is less than 0.001), we leave it unmarked. We let count "+" be the total number of "+" features, count "−" the total number of "−" features, and count the total number of independent variables related to the dependent variable. We find that, with or without regularization, the exponential squared loss attains the lowest BIC.
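The extracted text does not reproduce the exact w_ij expression, so the following is only a hypothetical illustration of one common choice: inverse Euclidean distance between the (X, Y) coordinates, row-normalized as the paper describes:

```python
import numpy as np

def inverse_distance_weights(coords):
    """Hypothetical sketch: coords is an (n, 2) array of planar (X, Y)
    locations; off-diagonal weights w_ij = 1 / d(i, j), zero diagonal,
    rows normalized to sum to 1. The paper's actual w_ij may differ."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    W = np.zeros_like(d)
    off = d > 0
    W[off] = 1.0 / d[off]               # closer houses get larger weights
    return W / W.sum(axis=1, keepdims=True)
```

Any monotone-decreasing function of distance (or a k-nearest-neighbor rule) would fit the same pipeline; only the row normalization step is explicitly stated in the text.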
As seen from Tables 10 and 11, our variable selection method attains a smaller BIC than the other variable selection methods and selects fewer independent variables, making the model more accurate and simpler. This illustrates the merit of the variable selection method proposed in this paper.
Next, we analyze our regression results. For the variable NROOM, all six models find a positive correlation with house price: the more rooms, the higher the price. For the variable DWELL, EXP+adaptive-l1, Square+adaptive-l1, and LAD+adaptive-l1 find no impact on house prices, while EXP+null, Square+null, and LAD+null find a certain positive correlation with house prices; the latter three models suggest that a detached unit commands a higher price. For the variable NBATH, all models find a positive correlation with house price: the more bathrooms, the higher the price. For the variables PATIO and FIREPL, the models with a regularization term treat them as unrelated to house price; the models without a regularization term find a relationship, but the regression coefficients are very small and their signs differ, so we conclude that these two features have little impact on house price. For the variable AC, the models without a regularization term find a positive relation: houses with air conditioning are priced higher than those without. For the variable BMENT, all models except EXP+adaptive-l1 find a positive relation with house price: houses with basements tend to be more expensive. For the variable CITCOU, the unregularized models find a positive correlation: in Baltimore, houses in the city are more expensive. For the spatial autocorrelation coefficient, all six models' estimates are close to 0.5, indicating that a rise in a house's price drives up the surrounding house prices. Additionally, NROOM_W, BMENT_W, and CITCOU_W have negative regression coefficients under all six models.
Therefore, the spatial regression coefficients of NROOM, BMENT, and CITCOU are negative: houses with many rooms, houses with basements, and houses in urban areas tend to depress the prices of the houses around them. This is intuitive: if a house itself is well equipped, buyers naturally expect more from the houses nearby.

Conclusions
This paper constructs a robust variable selection method for the SDM based on the adaptive lasso and the exponential squared loss. We established the "oracle" property of the proposed estimators. For the nondifferentiable and nonconvex problems arising in model solving, we designed a BCD algorithm together with a DC decomposition and the CCCP algorithm. Numerical simulations show that our method has good robustness and accuracy when there is noise in the observed data, and it retains some robustness when the spatial weight matrix is estimated inaccurately. In variable selection, our method is significantly better than the square loss and LAD loss, identifying almost all zero coefficients in the numerical simulations. Using the 1978 housing price dataset of the Baltimore region as an example, we verified the merit and accuracy of the proposed SDM variable selection method. Our analysis demonstrates the difference between our robust variable selection approach and other penalized regression methods, underscoring the importance of developing robust variable selection methods.

Appendix A. Proof of Theorem 1
Let ξ_n = n^{−1/2} + a_n and set ‖u‖ = C, where u is a d-dimensional vector and C is a sufficiently large constant. Similar to Fan and Li (2001), we first show that ‖β̂ − β₀‖ = O_p(ξ_n). It suffices to show that, for any given ε > 0, there is a large constant C such that the stated probability bound holds for large n. Since p_{λ_j}(0) = 0 for j = 1, 2, . . . , p, we have . . . = S_n(u) + K_n(u). (A4) Note that the derivative is positive for 0 < β̂_j < ε_n and negative for −ε_n < β̂_j < 0.