Robust Variable Selection with Exponential Squared Loss for the Spatial Single-Index Varying-Coefficient Model

As spatial correlation and heterogeneity often coincide in data, we propose a spatial single-index varying-coefficient model. For this model, we offer a robust variable selection method based on spline estimation and the exponential squared loss to estimate parameters and identify significant variables. We establish the theoretical properties of the method under regularity conditions. A block coordinate descent (BCD) algorithm combined with the concave–convex procedure (CCCP) is designed specifically for solving the resulting optimization problem. Simulations show that our method performs well even when the observations are noisy or the estimated spatial weight matrix is inaccurate.


Introduction
Spatial econometrics is one of the essential branches of econometrics; its basic task is to account for the spatial effects of variables in regional scientific models. The most widely used spatial econometric model is the spatial autoregressive (SAR) model, first proposed by [1], which has been extensively studied and applied in economics, finance, and environmental science.
The SAR model is essentially a parametric model. In practical applications, however, a parametric model alone cannot fully capture complex economic problems and phenomena. Therefore, to improve the flexibility and applicability of spatial econometric models, non-parametric spatial econometric models have received increasing attention. Ref. [2] studied the SAR model in a non-parametric framework, obtained parameter estimators via the generalized method of moments, and proved the consistency and asymptotic properties of the estimators. Ref. [3] used the instrumental variable method to study semi-parametric varying-coefficient spatial panel data models with endogenous explanatory variables.
In practice, however, data may exhibit spatial correlation and spatial heterogeneity simultaneously, and the spatial heterogeneity cannot be fully captured by either the parametric SAR model or its non-parametric counterpart.
The single-index varying-coefficient model generalizes both the single-index and the varying-coefficient models and can effectively avoid the "curse of dimensionality" in multidimensional non-parametric regression; it has attracted considerable attention at home and abroad. Refs. [4,5] studied estimation of the single-index varying-coefficient model. Ref. [6] constructed empirical likelihood confidence regions for the single-index varying-coefficient model using the empirical likelihood method; Ref. [7] proposed a new estimated empirical likelihood ratio statistic, obtained maximum likelihood estimators of the model parameters, and introduced a new profile empirical likelihood ratio, which was shown to be asymptotically chi-square distributed.
In addition, selecting significant explanatory variables is one of the most important problems in statistical learning. Several robust regression methods have been proposed, such as quantile regression, composite quantile regression, and modal regression. Ref. [8] presented a new class of robust regression estimators for linear models based on the exponential squared loss. Specifically, for the linear regression model $y_i = X_i^T \beta + \varepsilon_i$, the regression parameter $\beta$ is estimated by minimizing the objective function
$$\sum_{i=1}^{n}\left[1-\exp\!\left(-\left(y_i-X_i^T\beta\right)^2/h\right)\right],$$
in which $h > 0$ controls the robustness of the estimation. For large $h$, $1-\exp(-r^2/h) \approx r^2/h$, so in this extreme case the estimator behaves like least squares. For small $h$, residuals with large $|r|$ have little impact on the objective; hence a small $h$ limits the influence of outliers and improves the robustness of the estimator. Ref. [8] also pointed out that their method is more robust than other common robust estimators. Ref. [9] developed robust estimation based on the exponential squared loss for linear regression models and proposed a data-driven procedure to select the tuning parameter; the exponential squared loss produced positive results in their simulations. Ref. [10] proposed robust variable selection for the high-dimensional single-index varying-coefficient model based on the exponential squared loss, established the theoretical properties of the estimators, and demonstrated the robustness of the method through numerical simulation. Ref. [11] applied the exponential squared loss to robust structure identification and variable selection for varying-coefficient models and obtained good results.
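To make the role of $h$ concrete, here is a minimal numerical check of the two limiting behaviors described above; the function name `exp_squared_loss` is ours, not the paper's.

```python
import numpy as np

def exp_squared_loss(r, h):
    """Exponential squared loss phi_h(r) = 1 - exp(-r^2 / h)."""
    return 1.0 - np.exp(-np.asarray(r, dtype=float)**2 / h)

# Large h: the loss is nearly quadratic, since 1 - exp(-r^2/h) ~ r^2/h.
r, large_h = 0.5, 1e4
print(abs(exp_squared_loss(r, large_h) - r**2 / large_h) < 1e-6)

# Small h: the loss saturates at 1, so a gross outlier contributes barely
# more than a moderate residual -- this bounds the influence of outliers.
small_h = 1.0
print(exp_squared_loss(100.0, small_h))                                  # ~1.0, bounded
print(exp_squared_loss(100.0, small_h) - exp_squared_loss(3.0, small_h)) # tiny gap
```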
Inspired by the work above, we introduce the spatial locations of the observed objects into the single-index varying-coefficient model, yielding a spatial single-index varying-coefficient model. We also present a variable selection method for this model based on spline estimation and the exponential squared loss, which selects significant predictors while simultaneously estimating the regression coefficients. The main contributions of this work are as follows.
1. We propose a novel model, the spatial single-index varying-coefficient model, which handles the spatial correlation and spatial heterogeneity of the data at the same time.
2. We construct a robust variable selection method for the spatial single-index varying-coefficient model, which uses the exponential squared loss function to resist the influence of strong noise and an inaccurate spatial weight matrix. Furthermore, we present a block coordinate descent (BCD) algorithm to solve the optimization problem for the objective function.
3. Under reasonable assumptions, we establish the theoretical properties of the method. In addition, we verify the robustness and effectiveness of the variable selection method through numerical simulation studies, which show that the method is more robust than competing methods in variable selection and parameter estimation when outliers or noise are present in the observations.
The rest of this paper is organized as follows. In Section 2, we develop the methodology for variable selection with the exponential squared loss, and we give the theoretical properties of the proposed method in Section 3. In Section 4, we present the related algorithms. Experimental results are reported in Section 5, and we conclude the paper in Section 6. The details of the proofs of the main theorems are collected in Appendix A.

Model Setup
Consider the following spatial single-index varying-coefficient model:
$$y_i = \rho \sum_{j=1}^{n} w_{ij} y_j + \sum_{s=1}^{q} z_{is}\, g_s\!\left(U_i^T \alpha\right) + \varepsilon_i, \quad i = 1, \ldots, n, \qquad (1)$$
where $y_i$ is the response variable, $z_i = (z_{i1}, z_{i2}, \cdots, z_{iq})^T$ is the $q$-dimensional vector of observed covariates, and $U_i = (U_{i1}, U_{i2}, \cdots, U_{im})^T$ is the $m$-dimensional spatial location variable. $W = (w_{ij})$ is the $n \times n$ spatial weight matrix.
$\rho$ and $\alpha = (\alpha_1, \alpha_2, \cdots, \alpha_m)^T$ are the parameters to be estimated. It is natural to assume that the $\varepsilon_i$ are independent with mean zero and variance $\sigma^2$, and that $g(\cdot)$ is a vector of unknown functions. For identifiability of the model, it is assumed that $\|\alpha\| = 1$ and the first nonzero element of $\alpha$ is positive. Model (1) shows that the spatial single-index varying-coefficient model is a semi-parametric varying-coefficient model in which the unknown functions $g_s$ change with geographical location. When $\rho = 0$, the model reduces to the partial linear single-index varying-coefficient model. When $z_{i1} = 1$ and $g_1(U_i^T \alpha) = U_i^T \alpha$ while the other $g_s(\cdot) = 0$, the model reduces to the SAR model.

Basis Function Expansion
Since $g(\cdot)$ is unknown, we replace it with a basis function approximation. The estimation steps are as follows. Step 1. Choose an initial value $\alpha^0$. This paper uses the method proposed by [12]: we roughly estimate $\alpha$ by a linear regression model and take the estimate as $\alpha^0$, normalized so that $\|\alpha^0\| = 1$ and the first nonzero element of $\alpha^0$ is positive.
Step 2. Set $a \le k_1 < k_2 < \cdots < k_l \le b$ as $l$ knots on the interval $[a, b]$. Given the initial value $\alpha^0$, let $t_i = U_i^T \alpha^0$, and let $\delta(t_i)$ denote the radial basis of degree $p$ evaluated at $t_i$. With $\gamma_{1-s}$ the corresponding coefficient vector, the $s$th unknown function is approximated by $g_s(t_i) \approx \delta(t_i)^T \gamma_{1-s}$, where $i = 1, 2, \cdots, n$ and $s = 1, 2, \cdots, q$. Substituting the radial basis approximation into model (1) and letting $Y = (y_1, y_2, \cdots, y_n)^T$, we obtain model (3). As can be seen from model (3), the spatial single-index varying-coefficient model (1) is transformed into the classical SAR model under the radial basis fitting. Since the theory of the SAR model is relatively well developed, the exponential squared loss-based variable selection method for the SAR model can be used to estimate the unknown parameters.
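As an illustration of Step 2, the following sketch builds a truncated power (radial) basis of degree $p$ at the knots — one common choice; the paper's exact basis may differ — and fits an unknown function by least squares:

```python
import numpy as np

def radial_basis(t, knots, p=3):
    """Truncated power basis of degree p: 1, t, ..., t^p, |t - k_1|^p, ...
    (a common radial-type choice; the paper's exact basis may differ)."""
    t = np.asarray(t, dtype=float)
    poly = np.vstack([t**j for j in range(p + 1)])       # polynomial part
    rad = np.vstack([np.abs(t - k)**p for k in knots])   # knot terms
    return np.vstack([poly, rad]).T                      # n x (p + 1 + l) design

# Approximate an unknown g(t) by delta(t)^T gamma via least squares.
rng = np.random.default_rng(0)
t = np.sort(rng.uniform(-1, 1, 200))
g_true = np.sin(np.pi * t)
D = radial_basis(t, knots=np.linspace(-0.9, 0.9, 7), p=3)
gamma, *_ = np.linalg.lstsq(D, g_true, rcond=None)
print(np.max(np.abs(D @ gamma - g_true)))  # small approximation error
```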

The Penalized Robust Regression Estimator
We now consider variable selection for model (3). To guarantee model identifiability and to improve the fitting accuracy and interpretability, we assume as usual that the true regression coefficient vector $\alpha^*$ is sparse, with only a small proportion of nonzero entries [13,14]. It is natural to employ a penalized method that simultaneously selects important variables and estimates the parameter values. The constructed model is recast as a penalized minimization problem, where $\lambda > 0$, $\tilde{Y} = WY$, $\sum_{j=1}^{q} P(\gamma_{1-j})$ is a penalty term, and $\varphi_{\gamma_2}(\cdot)$ is the exponential squared loss function $\varphi_{\gamma_2}(t) = 1 - \exp(-t^2/\gamma_2)$, in which $\gamma_2$ is the tuning parameter controlling the degree of robustness.
Concerning the choice of the penalty term, the lasso or adaptive lasso penalty can be used when there is no extra structural information. Assume that $\tilde{\gamma}_1$ is a root-$n$-consistent estimator of $\gamma_1$, for instance, the naive least squares estimator $\tilde{\gamma}_{1(\mathrm{ols})}$. Define the weight vector $\eta$ with $\eta_j = 1/|\tilde{\gamma}_{1-j}|^r$ $(j = 1, \ldots, q)$, $r > 0$; we set $r = 1$ in this paper, as suggested by [15]. The adaptive lasso penalty is then $\sum_{j=1}^{q} \eta_j |\gamma_{1-j}|$. The objective function of the penalized robust regression, consisting of the exponential squared loss and the adaptive lasso penalty, is formulated in (6). The selection of the tuning parameter $\gamma_2$ and the regularization parameter $\lambda$ is discussed in Section 4.
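A minimal sketch of such a penalized objective, assuming the working model $y = \rho W y + D\gamma_1 + \varepsilon$ from (3) and adaptive lasso weights $\eta_j = 1/|\tilde{\gamma}_{1-j}|$; all function and variable names here are ours, not the paper's:

```python
import numpy as np

def penalized_objective(rho, gamma1, y, W, D, lam, gamma2, eta):
    """Exponential squared loss plus adaptive lasso penalty (assumed form)."""
    resid = y - rho * (W @ y) - D @ gamma1
    loss = np.sum(1.0 - np.exp(-resid**2 / gamma2))
    penalty = lam * np.sum(eta * np.abs(gamma1))   # eta_j = 1/|gamma1_ols_j|, r = 1
    return loss + penalty

rng = np.random.default_rng(1)
n, p = 50, 4
W = np.zeros((n, n))                 # no spatial dependence in this toy check
D = rng.normal(size=(n, p))
gamma_true = np.array([1.0, -0.5, 0.0, 0.0])
y = D @ gamma_true + 0.1 * rng.normal(size=n)
gamma_ols, *_ = np.linalg.lstsq(D, y, rcond=None)
eta = 1.0 / np.abs(gamma_ols)        # adaptive weights from a pilot estimate
obj_true = penalized_objective(0.0, gamma_true, y, W, D, 0.1, 5.0, eta)
obj_zero = penalized_objective(0.0, np.zeros(p), y, W, D, 0.1, 5.0, eta)
print(obj_true < obj_zero)           # the true coefficients score better than all-zeros
```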

Estimation of the Variance of the Noise
Set $H = (I_n - \rho W)^{-1}$; the variance of the noise is then estimated by (7), where $\hat{\rho}$ and $\hat{\gamma}_1$ are the solutions of (6). Since $H$ is nonsingular, $(HH^T)^{-1} = (H^T)^{-1}H^{-1} = (I_n - \rho W)^T(I_n - \rho W)$. Letting $u = HD\gamma_1 = (I_n - \rho W)^{-1}(D\gamma_1)$, the estimator $\hat{\sigma}^2$ defined by (7) can be computed directly in terms of $(I_n - \rho W)$, without forming $H$ explicitly.
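The computation can be sketched as follows, using the residual $(I_n - \rho W)y - D\gamma_1$ and avoiding the explicit inverse $H$. This is an assumed form consistent with the identity above; the paper's exact expression (7) may differ in details.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
W = rng.uniform(size=(n, n))
np.fill_diagonal(W, 0.0)
W /= W.sum(axis=1, keepdims=True)        # row-normalized weight matrix
rho, sigma = 0.4, 0.5
D = rng.normal(size=(n, 3))
gamma1 = np.array([1.0, -1.0, 0.5])
A = np.eye(n) - rho * W                  # A = I - rho*W, so H = A^{-1}
y = np.linalg.solve(A, D @ gamma1 + sigma * rng.normal(size=n))
resid = A @ y - D @ gamma1               # equals (I - rho*W)(y - u) with u = H D gamma1
sigma2_hat = resid @ resid / n           # mean squared residual
print(sigma2_hat)                        # close to sigma^2 = 0.25
```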
In addition, we prove that when suitable conditions hold, the consistent estimator must be sparse, as described below.
Theorem 2. Suppose that conditions C1–C9 hold and that the number of knots satisfies the required growth rate. Then, with probability approaching 1, $\hat{\alpha}$ and $\hat{g}(\cdot)$ satisfy the stated sparsity and convergence properties. We next show that the estimators of the nonzero parameter components have the same asymptotic distribution as the estimators based on the correct submodel. Let $\alpha_0^*$ and $g_0^*(t)$ be the true values of $\alpha^*$ and $g^*$, respectively, with corresponding covariates denoted by $U_i^*$ and $Z_i^*$, $i = 1, \ldots, n$. The following result presents the asymptotic properties of $\hat{\alpha}^*$.

Theorem 3.
If the assumptions of Theorem 2 hold, we have the asymptotic normality of $\hat{\alpha}^*$. Theorems 1 and 2 show that the proposed variable selection procedure is consistent, and Theorems 1 and 3 show that the penalized estimators have the oracle property: if the subset of true zero coefficients were known in advance, the penalized estimators would perform as well as the estimators fitted on the correct submodel.

Algorithm
In this section, we describe a feasible algorithm for solving (6). A data-driven procedure for $\gamma_2$ and a simple selection method for $\lambda$ are considered. Moreover, efficient optimization algorithms are designed to handle the non-convex and non-differentiable objective function.

Choice of the Tuning Parameter γ 2
The tuning parameter $\gamma_2$ controls the level of robustness and efficiency of the proposed robust regression estimators. Ref. [16] proposed a data-driven procedure to choose $\gamma_2$ for ordinary regression; we follow its steps and adapt it to the spatial single-index varying-coefficient model. First, a set of tuning parameters is determined to ensure that the proposed penalized robust estimators have an asymptotic breakdown point of 1/2. Then, the tuning parameter with maximum efficiency is selected.
The whole procedure is presented in Steps 1–4. In Step 4, $\hat{\rho}$ and $\hat{\gamma}_1$ are updated as the optimal solution of the corresponding minimization problem. Note that an initial robust estimator $\gamma_1^{(0)}$ is needed in the initial step; in practice, we use the LAD estimator as the initial estimator. In this sense, the selection of $\gamma_2$ essentially does not depend on $\lambda$. Alternatively, one could select the two parameters $\gamma_2$ and $\lambda$ jointly by cross-validation, as discussed in [8], but this approach requires heavy computation. Moreover, the candidate interval of $\gamma_2$ is $\{\gamma_2 : \zeta(\gamma_2) \in (0, 1]\}$. In practice, we find the threshold $\gamma_{2-1}$ such that $\zeta(\gamma_{2-1}) = 1$; the chosen $\gamma_2$ is usually located in the interval $[5\gamma_{2-1}, 30\gamma_{2-1}]$.

Choice of the Regularization Parameter λ and η j
Regarding the choice of the regularization parameter $\lambda$ and the weights $\eta_j$ in (6): since $\lambda$ can be absorbed into $\eta_j$, we set $\lambda_j = \lambda \cdot \eta_j$. Many methods can be applied to select $\lambda_j$, such as AIC, BIC, and cross-validation. To ensure consistent variable selection while reducing the computational burden, we choose the regularization parameter by minimizing a BIC-type objective function as in [16], with $d$ the number of nonzeros in the true value of $\gamma_1$. The final estimator thus achieves consistent variable selection.
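As a sketch, a generic BIC-type score of this kind trades fit against the number of nonzero coefficients (the exact criterion of [16] may differ; the function name is ours):

```python
import numpy as np

def bic_type(rss, n, d):
    """Generic BIC-type score: log of mean fit error plus log(n)/n per nonzero.
    (Sketch only; the paper's criterion follows [16].)"""
    return np.log(rss / n) + d * np.log(n) / n

# With equal fit, the sparser model wins; a much worse fit outweighs sparsity.
n = 100
print(bic_type(10.0, n, d=2) < bic_type(10.0, n, d=5))   # fewer nonzeros preferred
print(bic_type(10.0, n, d=5) < bic_type(30.0, n, d=2))   # poor fit penalized more
```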

Block Coordinate Descent (BCD) Algorithm
We seek an effective algorithm to solve the objective function (6). This is difficult because the optimization problem is non-convex and non-differentiable. We adopt the BCD algorithm proposed by [17] to overcome these challenges; the BCD framework is shown in Algorithm 1.

3. Solve the subproblem in $\rho$ with initial point $\rho^k$ to obtain $\rho^{k+1}$.
4. Solve the subproblem in $\gamma_1$ with initial value $\gamma_1^k$ to obtain a solution $\gamma_1^{k+1}$.
5. Repeat until convergence.
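The alternating scheme can be sketched as follows on synthetic data with a pairwise weight matrix. For simplicity, the $\rho$-step uses a grid search in place of golden-section search, and the $\gamma_1$-step uses plain least squares as a stand-in for the CCCP/FISTA update; all names here are ours.

```python
import numpy as np

rng = np.random.default_rng(3)
R, m, p = 50, 2, 3
n = R * m
B = np.array([[0.0, 1.0], [1.0, 0.0]])        # pairwise weights (m = 2)
W = np.kron(np.eye(R), B)
D = rng.normal(size=(n, p))
rho_true = 0.3
gamma_true = np.array([1.0, -0.5, 2.0])
y = np.linalg.solve(np.eye(n) - rho_true * W,
                    D @ gamma_true + 0.1 * rng.normal(size=n))
Wy = W @ y

def exp_loss(rho, gamma1, gamma2=5.0):
    r = y - rho * Wy - D @ gamma1
    return np.sum(1.0 - np.exp(-r**2 / gamma2))

rho, gamma1 = 0.0, np.zeros(p)
for _ in range(20):                            # BCD outer loop
    # rho-step: 1-D minimization over a grid (the paper uses golden-section search)
    grid = np.linspace(-0.95, 0.95, 381)
    rho = grid[np.argmin([exp_loss(g, gamma1) for g in grid])]
    # gamma1-step: least squares on the working response (stand-in for CCCP/FISTA)
    gamma1, *_ = np.linalg.lstsq(D, y - rho * Wy, rcond=None)
print(rho, gamma1)                             # near rho_true and gamma_true
```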

DC Decomposition and CCCP Algorithm
A key observation for problem (12) is that the exponential squared loss function is a DC (difference of convex) function, while the lasso and adaptive lasso penalty functions are convex. As a result, problem (12) is a DC program and can be solved by the following algorithms.
We first show that the exponential squared loss function $\varphi_{\gamma_2}(t)$ can be written as the difference of two convex functions, where $u(\cdot)$ and $v(\cdot)$ are defined in (13), $w_i$ is the $i$th row of the weight matrix $W$, and $\sum_{j=1}^{q} P(\gamma_{1-j})$ is a convex penalty with respect to $\gamma_1$. Then $J_{\mathrm{vex}}(\cdot)$ and $J_{\mathrm{cav}}(\cdot)$ are convex and concave functions, respectively, and subproblem (12) is recast accordingly. It can then be solved by the concave–convex procedure (CCCP) proposed by [18], as shown in Algorithm 2.
We focus on the lasso and adaptive lasso penalties. Since $J_{\mathrm{cav}}(\gamma_1^k) \cdot \gamma_1$ is linear in $\gamma_1$, by the definition in (15) the objective function of (16) can be expressed in the form (17), where $\psi(\gamma_1)$ is a convex and continuously differentiable function. Therefore, we can apply the efficient ISTA and FISTA algorithms proposed by [19] to solve the model in the framework of (17) for the lasso penalty; each iterative step takes a gradient step of size $1/L$, where $L$ is the (possibly unknown) Lipschitz constant of $\nabla\psi$. FISTA is an accelerated version of ISTA with a better convergence rate in both theory and practice, as proven by [19]. Ref. [17] extended it to the adaptive lasso penalty while maintaining numerical efficiency. Now consider solving subproblem (11) to update $\rho^k$. Since problem (11) minimizes a univariate function, we employ the classical golden section search algorithm with parabolic interpolation (see [20] for details).
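A self-contained FISTA sketch for a lasso-penalized subproblem, with a quadratic smooth part $\frac{1}{2}\|Ax - b\|^2$ standing in for $\psi(\gamma_1)$ (function names and data are ours, for illustration only):

```python
import numpy as np

def soft_threshold(x, tau):
    """Proximal operator of tau * ||.||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def fista_lasso(A, b, lam, iters=500):
    """FISTA (Beck & Teboulle) for min 0.5*||A x - b||^2 + lam*||x||_1.
    L = ||A||_2^2 is the Lipschitz constant of the smooth gradient."""
    L = np.linalg.norm(A, 2)**2
    x = np.zeros(A.shape[1])
    z, t = x.copy(), 1.0
    for _ in range(iters):
        x_new = soft_threshold(z - (A.T @ (A @ z - b)) / L, lam / L)
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        z = x_new + ((t - 1.0) / t_new) * (x_new - x)   # momentum step
        x, t = x_new, t_new
    return x

rng = np.random.default_rng(4)
A = rng.normal(size=(100, 10))
x_true = np.zeros(10)
x_true[:2] = [3.0, -2.0]
b = A @ x_true + 0.01 * rng.normal(size=100)
x_hat = fista_lasso(A, b, lam=0.5)
print(np.round(x_hat, 2))   # first two entries near 3 and -2, the rest near 0
```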
According to Beck and Teboulle, the function values generated by FISTA for solving subproblem (16) of CCCP converge to the optimal value at the rate $O(1/k^2)$, where $k$ is the iteration counter. The usual termination criterion of ISTA and FISTA is $\|\gamma_1^k - \gamma_1^{k-1}\| \le \mathrm{tol}_{\gamma_1}$, where $\mathrm{tol}_{\gamma_1}$ is a small positive tolerance. Algorithm 1 terminates under the criterion $\|\gamma_1^k - \gamma_1^{k-1}\| \le \epsilon_1$ or $L(\gamma_1^k) - L(\gamma_1^{k+1}) \le \epsilon_2$. Therefore, to obtain an $\epsilon$-optimal solution, the FISTA algorithm requires $O(1/\sqrt{\epsilon})$ iterations, and the gradient $\nabla\psi(\gamma_1)$ of (17) is computed at each iteration. Suppose that the BCD algorithm converges within a fixed number of iterations and the CCCP algorithm terminates after at most $m$ inner iterations. Since $O(np)$ operations are needed to compute the gradient $\nabla\psi(\gamma_1)$, the total computational complexity is $O(mnp/\sqrt{\epsilon})$.

Simulation Studies
In this section, we conduct numerical studies to illustrate the performance of the proposed method, including the cases of normal data and noisy data.

Simulation Sampling
The data are generated from model (1). We set $\alpha = (\alpha_1, \alpha_2, 0_q^T)^T$, where $(\alpha_1, \alpha_2)$ is generated from a 2-dimensional normal distribution with mean vector $(0.6, 0.8)$ and covariance matrix $0.01 \cdot I_2$, $I_2$ being the $2 \times 2$ identity matrix, and $0_q$ is the zero vector of dimension $q$. We set the sample size $n \in \{25, 144, 324\}$, and the spatial coefficient $\rho$ is generated from a uniform distribution on the interval $[\rho_1 - 0.1, \rho_1 + 0.1]$, where $\rho_1 \in \{0.8, 0.5, 0.2\}$. For comparison, we also consider $\rho = 0$, in which case there is no spatial dependency and model (1) reduces to the ordinary single-index varying-coefficient model. The error term follows $\varepsilon \sim N(0, \sigma^2 I_n)$, where $\sigma^2$ is generated from a uniform distribution. We also consider the case where the response contains outliers: the error term then follows a mixed normal distribution $(1 - \delta_1) \cdot N(0, 1) + \delta_1 \cdot N(10, 6^2)$, where $\delta_1 \in \{0.01, 0.05\}$. Each $z_{ij}$ is drawn independently from the normal distribution $N(0, 1)$, and the spatial weight matrix is $W_n = I_R \otimes B_m$, where $B_m = (1/(m-1))(1_m 1_m^T - I_m)$, $\otimes$ is the Kronecker product, and $1_m$ is the $m$-dimensional column vector of ones. We take $m = 2$ and $R \in \{20, 100\}$.
Moreover, we construct the spatial location information using two-dimensional plane coordinates. Take a square to represent the geographical region, set the lower-left corner of the square as the origin, and establish a rectangular coordinate system along the horizontal and vertical directions. Each side is divided by $h - 1$ equally spaced points, and corresponding points are connected along the horizontal and vertical axes to form $h \times h$ crossing points (including the points on each side of the square). Each crossing point is a geographical location, and with sampling capacity $n = h^2$, the location coordinate of the $i$th point is $U_i = (u_{i1}, u_{i2})^T$. Another important issue for the spatial single-index varying-coefficient model is the estimation of the weight matrix $W$. Since $W \in \mathbb{R}^{n \times n}$ encodes the correlation between every pair of observations, an accurate estimate of $W$ is usually difficult to obtain in practice. To assess the effect of an inaccurately estimated $W$, we randomly remove 30%, 50%, and 80% of the non-zero weights from each row of the true weight matrix $W$, respectively.
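The weight-matrix construction and its corruption can be sketched as follows; we use $m = 5$ here, rather than the paper's $m = 2$, so that removing half the weights still leaves nonzeros in each row, and the function names are ours.

```python
import numpy as np

def sim_weight_matrix(R, m):
    """W = I_R kron B_m with B_m = (1/(m-1))(1_m 1_m^T - I_m):
    equal weights within groups of size m, zero diagonal."""
    B = (np.ones((m, m)) - np.eye(m)) / (m - 1)
    return np.kron(np.eye(R), B)

def corrupt_weights(W, frac, rng):
    """Zero out a fraction of the nonzero weights in each row
    (the inaccurate-weight-matrix scenario)."""
    Wc = W.copy()
    for i in range(Wc.shape[0]):
        nz = np.flatnonzero(Wc[i])
        k = int(np.floor(frac * len(nz)))
        if k > 0:
            Wc[i, rng.choice(nz, size=k, replace=False)] = 0.0
    return Wc

W = sim_weight_matrix(R=20, m=5)                  # 100 x 100, 4 equal weights per row
Wc = corrupt_weights(W, 0.5, np.random.default_rng(0))
print(W.shape, np.allclose(W.sum(axis=1), 1.0))   # rows of W sum to 1
print((Wc != 0).sum(axis=1)[:5])                  # 2 nonzeros per row after 50% removal
```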
For each simulation setting, all results shown below are averaged over 100 replications to reduce random variation. We adopt the node selection method proposed by [12], with step = 10 and the number of radial basis functions p = 3.

Simulation Results
The simulation results are evaluated as follows. We use the median of squared errors (MedSE) proposed by [21], defined here as the median of $\|\alpha - \hat{\alpha}\|^2$ over replications, where $\hat{\alpha}$ is the estimator of $\alpha$. The square root of the mean integrated deviation (MAISE), averaged over the simulation runs, is used as the evaluation index for the $t$th unknown function $g_t(\cdot)$. The smaller the value of each index, the more accurate the parameter estimation and the better the fit of the unknown functions.
Table 1 reports the estimated coefficients of the spatial single-index varying-coefficient model with $q = 5$, no penalty term, and Gaussian noise in $y$, where "E", "S", and "L" indicate the exponential squared loss, the square loss, and the LAD loss, respectively. All three loss functions produce nonzero estimates of $\alpha_1$ and $\alpha_2$ close to the true values (the means of the true values of $\alpha_1$ and $\alpha_2$ are 0.6 and 0.8, respectively). Comparatively, the model with the square loss produces the most accurate estimates. As the sample size increases, all three loss functions yield accurate estimates of $\alpha$ and $\sigma^2$.
Table 2 presents the estimated coefficients when the dimension is comparatively close to the sample size. Results similar to those of Table 1 are observed, though less accurate than in the $q = 5$ case; since the sample size is small relative to the dimension, these results are as expected.
Table 3 reports the results when the observations of $y$ contain outliers. Compared with the square loss and LAD loss models, the model with the exponential squared loss shows clear advantages in parameter estimation in terms of MedSE, especially when the sample size is large.
We list the estimated coefficients with an inaccurate weight matrix $W$ in Table 4. Compared with the results on normal data (Table 1), the MedSE values increase, and the estimates of $\hat{\rho}$ and $\hat{\sigma}^2$ become worse overall for each loss function. In particular, when a certain proportion (30%, 50%, or 80%) of the nonzero weights of $W$ is removed, MedSE increases with the proportion removed and decreases as the sample size $n$ increases, for each of the three loss functions. The exponential squared loss attains the lowest MedSE among the three.
Correspondingly, Tables 5–8 show the variable selection results compared with the other loss functions. The average number of zero coefficients correctly identified is labeled "Correct"; the label "Incorrect" gives the average number of nonzero coefficients incorrectly identified as zero. "$\tilde{\ell}_1$", "$\ell_1$", and "null" denote the adaptive lasso penalty, the lasso penalty, and no penalty term, respectively.
Table 5 shows the variable selection results of the lasso and adaptive lasso regularizers on normal data with $q = 5$. In almost all tested cases, the model with the exponential squared loss and the lasso or adaptive lasso penalty (i.e., E + $\ell_1$, E + $\tilde{\ell}_1$) identifies more of the true zero coefficients ("Correct") and attains much lower MedSE.
Table 5. Variable selection with regularizer on normal data ($q = 5$). E: the exponential loss; S: the square loss; L: the LAD loss; $\ell_1$: the lasso penalty; $\tilde{\ell}_1$: the adaptive lasso penalty.
Similar results are observed when the dimension is close to the sample size, as presented in Table 6. In the tested cases with $n = 324$ and $q = 200$, the models with the $\ell_1$ and $\tilde{\ell}_1$ penalties correctly identify almost all of the zero coefficients.
The above performance of the proposed exponential squared loss with the lasso or adaptive lasso penalty even exceeds our expectations.
Table 6. Variable selection with regularizer on normal data when the dimension is close to the sample size. E: the exponential loss; S: the square loss; L: the LAD loss; $\ell_1$: the lasso penalty; $\tilde{\ell}_1$: the adaptive lasso penalty.
Tables 7 and 8 list the variable selection results with noise in the observations and with an inaccurate weight matrix, respectively. The model with the exponential squared loss and the lasso or adaptive lasso penalty (i.e., E + $\ell_1$, E + $\tilde{\ell}_1$) identifies many more of the true coefficients ("Correct") and attains much lower MedSE. Compared with the results in the normal cases (Table 5), the superiority of E + $\ell_1$ and E + $\tilde{\ell}_1$ is even more evident.
Table 7. Variable selection with regularizer when the observations $y$ contain outliers. E: the exponential loss; S: the square loss; L: the LAD loss; $\ell_1$: the lasso penalty; $\tilde{\ell}_1$: the adaptive lasso penalty.
Table 8. Variable selection with regularizer and noisy weight matrix $W$. E: the exponential loss; S: the square loss; L: the LAD loss; $\ell_1$: the lasso penalty; $\tilde{\ell}_1$: the adaptive lasso penalty.
For the fitted surfaces of the coefficient functions, we select the replication at the median position among the 100 repeated experiments as the representative case. We present the situation $h = 16$ on the normal data; the fitted surfaces of $\hat{g}_1$, $\hat{g}_2$, and $\hat{g}_3$ are shown in Figure 2.
From the fit of each coefficient function, it can be seen that the model fits the unknown coefficient functions very well, which shows that, even with limited samples, the spatial single-index varying-coefficient model based on radial basis functions and the exponential squared loss achieves an excellent fit. The fits of the coefficient functions in the other cases also perform well.
We also present the fitting evaluation index MAISE for $h = 16$ and $h = 18$ on normal data in Table 9. As the total number of spatial objects increases, the MAISE of the unknown functions shows a downward trend; that is, the fit keeps improving. Similarly, the MAISE of $y$ also decreases, indicating that, for the model as a whole, the fitted values move closer to the real data.
Figure 2. Estimated surfaces of the coefficient functions $g_1$, $g_2$, and $g_3$ with the exponential squared loss.
When the observations of $y$ contain outliers, we compare the fitted coefficient function surfaces, again selecting the replication at the median of the 100 repetitions and taking the fitted surface of $g_3$ as an example. When $\rho_1 = 0.5$, $\sigma_1 = 1$, and $\delta_1 = 0.05$, the fits of the loss functions with the adaptive lasso are shown in Figure 3, which shows that our method performs better. The same conclusion can be drawn in the case of a noisy weight matrix $W$; Figure 4 illustrates the results when 50% of the nonzero weights are removed.

Summary
In this paper, we propose a novel model, the spatial single-index varying-coefficient model, and introduce a robust variable selection method for it based on spline estimation and the exponential squared loss. The theoretical properties of the proposed estimators are established under reasonable assumptions. We design a BCD algorithm equipped with a CCCP procedure to efficiently solve the non-convex and non-differentiable optimization problem arising in variable selection. Numerical studies show that the proposed method is particularly robust and applicable when the observations and the weight matrix are noisy.