Variable Selection for the Spatial Autoregressive Model with Autoregressive Disturbances

Abstract: Along with the rapid development of geographic information systems, high-dimensional spatially heterogeneous data have emerged, bringing theoretical and computational challenges to statistical modeling and analysis. As a result, effective dimensionality reduction and spatial effect recognition have become very important. This paper focuses on variable selection in the spatial autoregressive model with autoregressive disturbances (SARAR), which captures a more comprehensive spatial effect. The variable selection procedure is built on the so-called penalized quasi-likelihood approach. Under suitable regularity conditions, we obtain the rate of convergence and the asymptotic normality of the estimators. The theoretical results ensure that the proposed method can effectively identify spatial effects of the dependent variable, detect spatial heterogeneity in the error terms, reduce the dimension, and estimate the unknown parameters simultaneously. Based on a step-by-step transformation, a feasible iterative algorithm is developed to realize spatial effect identification, variable selection, and parameter estimation. In the finite-sample setting, Monte Carlo studies and a real data analysis demonstrate that the proposed penalized method performs well and is consistent with the theoretical results.


Introduction
Spatial econometric models are mainly used to deal with spatially dependent data in applications. Spatial dependence across cross-sectional units may involve spatial autocorrelation in the dependent variable or in the disturbance term. The first form of dependence is usually captured by a spatial autoregressive (SAR) model and the second by a spatial error model (SEM). In fact, both spatial dependencies can be reflected in a spatial autoregressive model with autoregressive disturbances (SARAR). These models were first introduced by Cliff and Ord [1] and have aroused wide concern; see, e.g., the research by Kelejian and Prucha [2], Lee [3], Arraiz et al. [4], and the books by Anselin [5] and Cressie [6].
In practice, explanatory variables need to be chosen from a large number of candidates during the initial data analysis. How to select the significant variables to keep in the final model is very important for further analysis. Therefore, variable selection has received increasing attention in statistical modeling and inference. However, the study of variable selection in spatial econometric models is not as well developed as that in classical linear models, due to the complexity caused by spatial dependence. The main goal of our analysis is to fill some gaps in this area. We focus on a variable selection method for the SARAR model based on a penalized quasi-likelihood method and investigate its oracle property. Furthermore, a feasible algorithm is given for realizing these procedures.
Methods of variable selection for classical linear models have developed rapidly since the Akaike information criterion (AIC) was proposed by Akaike [7]. Similar methods based on information criteria have since progressed remarkably, such as the Bayesian information criterion (BIC) [8], the risk inflation criterion (RIC) [9], etc. Using these criteria, best subset selection was the standard method for selecting covariates for a long time. Although practically useful, its common drawbacks are a lack of stability and the failure to account for the stochastic errors introduced at each stage of variable selection, as noted by Liang and Li [10]. Moreover, it may require a comparison of all possible submodels, a combinatorial problem with NP-complexity [11]. In order to overcome these drawbacks, penalized methods of variable selection have been proposed in recent years, including the least absolute shrinkage and selection operator (LASSO) [12], the smoothly clipped absolute deviation (SCAD) penalty [13], the elastic net (ENet) [14], the adaptive LASSO [15], the minimax concave penalty (MCP) [16], and so on. These methods can select significant variables and estimate unknown parameters simultaneously. Fan and Li [13] established the oracle property in the sense that the penalized estimator behaves the same as the ordinary least squares estimator obtained as if the true linear model were known, which can be used to assess the efficiency of the penalized estimator. In the Bayesian framework, developments include Mitchell and Beauchamp [17], Raftery et al. [18], Jiang [19], etc. Other related methods can be found in Chen et al. [20] and Steel [21].
Along with the rapid development of geographic information systems (GIS), variable selection for spatial econometric models has become a new concern over the last 10 years or so. Based on Bayesian ideas, LeSage and Parent [22] developed the Bayesian model averaging (BMA) technique for the SAR model and the SEM. Extensions include LeSage and Fischer [23] and Cuaresma et al. [24,25]. In order to avoid the complex calculation of marginal likelihoods in BMA, Piribauer [26] used stochastic search variable selection (SSVS) priors to deal with the identification of the SAR model. Generally, it is challenging to extend penalized methods to data that are dependent over time or across space, as variable selection then involves not only regression coefficients but also autocorrelation coefficients [27]. In recent years, Liu et al. [28] gave an efficient variable selection procedure for the SAR model and obtained its large sample properties by a penalized quasi-likelihood method. Using the SCAD penalty and instrumental variables, Xie et al. [29] considered variable selection in the SAR model with a diverging number of parameters. They showed that the SCAD penalty in the SAR model also enjoys a nice oracle property, as in the classical linear model.
High-dimensional spatial data may lead to complex and multiple spatial dependencies. However, existing methods are constrained by dimension and spatial heterogeneity, which poses great challenges to the application of traditional spatial econometric models. Although dimension reduction by eliminating redundant information through variable selection in classical linear models has gradually matured, and variable selection in spatial lag models has been studied, it remains difficult to effectively solve the problem of variable selection with spatial heterogeneity in the error terms. Because the SARAR model contains both a spatial effect in the dependent variable and a spatial error term, it can reflect spatial effect information and describe spatial heterogeneity relatively comprehensively. Moreover, if the spatial effect in the error is ignored, it will lead to model recognition errors, reduce estimation and prediction accuracy, and raise concerns for applied research. In light of the above considerations and the excellent performance of penalized methods, we study variable selection for spatial cross-sectional data based on the SARAR model. The main contributions are as follows: (1) For high-dimensional spatially heterogeneous data, a penalized quasi-likelihood method is proposed to solve the problem of dimensionality reduction of explanatory variables and the identification of two kinds of spatial effects. (2) Using the idea of a step-by-step transformation, a new iterative numerical algorithm is proposed to avoid the influence of spatial heterogeneity. (3) The simulation and case analyses provide guidance for practitioners in related fields. (4) The proposed method can provide a useful reference for the study of variable selection in semi-parametric and nonparametric spatial regression models.
The remainder of this paper is organized as follows. Section 2 presents a penalized quasi-likelihood method for the SARAR model. Section 3 introduces a feasible algorithm to carry out the variable selection procedure. Section 4 provides a Monte Carlo study to investigate the finite sample performance. Section 5 illustrates the proposed method through an application to the Boston housing data. A summary and discussion are given in Section 6. Appendices A and B contain the assumptions and proofs of the theorems.

The SARAR Model
The SARAR model can be specified as:

Y_n = ρ_1 W_{1n} Y_n + X_n β + U_n,  U_n = ρ_2 W_{2n} U_n + E_n,  (1)

where Y_n denotes an n × 1 vector of observations on the dependent variable, X_n is an n × k matrix of observations on k exogenous explanatory variables, W_{1n} and W_{2n} are known n × n spatial weight matrices, β is a k-dimensional parameter vector of regression coefficients, ρ_1 and ρ_2 are scalar spatial autoregressive coefficients with |ρ_1| < 1 and |ρ_2| < 1, U_n is an n × 1 vector of regression disturbances, W_{1n} Y_n and W_{2n} U_n are the spatial lag term and the spatial error lag term, respectively, and E_n = (e_1, · · · , e_n)^T is an n-dimensional vector of i.i.d. innovations with zero mean and finite variance σ². Note that this model is also known as the Cliff-Ord model or the SARAR(1,1) model. The SAR model and the SEM correspond to ρ_2 = 0 and ρ_1 = 0, respectively. Let θ_0 = (σ²_0, ρ_{10}, ρ_{20}, β_0^T)^T = (θ_{1,0}, θ_{2,0}, · · · , θ_{k+3,0})^T be the true value of θ = (σ², ρ_1, ρ_2, β^T)^T, and write S_{1n}(ρ_1) = I_n − ρ_1 W_{1n}, S_{2n}(ρ_2) = I_n − ρ_2 W_{2n}, and E_n(θ) = S_{2n}(ρ_2)[S_{1n}(ρ_1) Y_n − X_n β]. Following the idea of quasi-maximum likelihood estimation [3], we can write the log-quasi-likelihood function of model (1) as

ln L_n(θ) = −(n/2) ln(2πσ²) + ln|S_{1n}(ρ_1)| + ln|S_{2n}(ρ_2)| − (1/(2σ²)) E_n(θ)^T E_n(θ),  (2)

where L_n(θ) is the quasi-likelihood function of model (1).
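To make the data-generating process of the SARAR(1,1) model concrete, the reduced form U_n = S_{2n}^{-1}(ρ_2) E_n, Y_n = S_{1n}^{-1}(ρ_1)(X_n β + U_n) can be sketched as follows. This is only an illustrative sketch, not code from the paper: the function name, the use of numpy, and the circular toy weight matrix are all our own choices.

```python
import numpy as np

def simulate_sarar(X, beta, W1, W2, rho1, rho2, sigma2=1.0, rng=None):
    """Draw one sample Y_n from the SARAR(1,1) model
    Y = rho1*W1@Y + X@beta + U,  U = rho2*W2@U + E,
    via the reduced forms U = (I - rho2*W2)^{-1} E and
    Y = (I - rho1*W1)^{-1} (X@beta + U)."""
    if rng is None:
        rng = np.random.default_rng(0)
    n = X.shape[0]
    E = rng.normal(0.0, np.sqrt(sigma2), size=n)      # i.i.d. innovations
    U = np.linalg.solve(np.eye(n) - rho2 * W2, E)     # spatial error process
    Y = np.linalg.solve(np.eye(n) - rho1 * W1, X @ beta + U)
    return Y

# toy example: a row-normalized circular "nearest neighbours" weight matrix
n, k = 30, 3
rng = np.random.default_rng(42)
W = np.zeros((n, n))
for i in range(n):
    W[i, (i - 1) % n] = W[i, (i + 1) % n] = 0.5
X = rng.normal(size=(n, k))
beta = np.array([3.0, 2.0, 1.0])
Y = simulate_sarar(X, beta, W, W, rho1=0.7, rho2=0.3, rng=rng)
```

Solving the two linear systems rather than forming explicit inverses mirrors how the reduced form is usually computed in practice.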

Penalized Method
Spatial econometric research has shown that it is inappropriate to apply the ordinary least squares (OLS) method directly to SAR models. In the case of the SARAR model, the OLS estimators of the spatial autoregressive coefficients are biased and inconsistent. Therefore, the penalized least squares method cannot be used directly for variable selection in this model. Given the good performance of quasi-maximum likelihood estimation in the SARAR model, the penalized quasi-likelihood method deserves priority. We start with the penalized quasi-likelihood function for model (1), defined as:

J(θ) = −ln L_n(θ) + n Σ_{j=2}^{k+3} p_{λ_jn}(|θ_j|),  (3)

where p_λ(·) is the SCAD penalty function defined by Fan and Li [13] as:

p_λ(t) = λ|t| if |t| ≤ λ;
p_λ(t) = −(|t|² − 2aλ|t| + λ²) / (2(a − 1)) if λ < |t| ≤ aλ;
p_λ(t) = (a + 1)λ²/2 if |t| > aλ,

for some a > 2. For comparison, we also introduce the following two popular penalty functions: the HARD thresholding penalty p_λ(t) = λ² − (|t| − λ)² I(|t| < λ) and the LASSO penalty p_λ(t) = λ|t|.
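The three penalty functions above are easy to compute. The following is a minimal sketch (function names and the use of numpy are our own; the formulas are the standard SCAD, HARD, and LASSO penalties cited in the text):

```python
import numpy as np

def scad(t, lam, a=3.7):
    """SCAD penalty of Fan and Li (2001); a > 2, with a = 3.7 the usual default."""
    t = np.abs(t)
    small = t <= lam
    mid = (t > lam) & (t <= a * lam)
    return np.where(small, lam * t,
           np.where(mid, -(t**2 - 2*a*lam*t + lam**2) / (2*(a - 1)),
                    (a + 1) * lam**2 / 2))

def hard(t, lam):
    """HARD thresholding penalty: lam^2 - (|t| - lam)^2 for |t| < lam, lam^2 otherwise."""
    t = np.abs(t)
    return np.where(t < lam, lam**2 - (t - lam)**2, lam**2)

def lasso(t, lam):
    """LASSO (L1) penalty."""
    return lam * np.abs(t)
```

Note that SCAD is linear near zero (like LASSO, giving sparsity) but constant for |t| > aλ, which is what yields nearly unbiased estimates of large coefficients.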
In the classical linear models, Fan and Li [13] proposed that a perfect variable selection method should possess the following three properties: (1) Unbiasedness: The resulting estimator is nearly unbiased when the true unknown parameter is large to avoid unnecessary modeling bias; (2) Sparsity: The resulting estimator automatically sets small estimated coefficients to zero to reduce model complexity; (3) Continuity: The resulting estimator is continuous in data to avoid instability in the model prediction.
Under some regularity conditions, they showed that variable selection via the SCAD penalty function possesses the above properties, whereas the other penalty functions introduced above may not satisfy all three properties simultaneously. See Fan and Li [13] and Wang and Zhu [30] for more information.

Main Results
Note that the non-zero elements of θ_0 may be scattered in the original arrangement. Re-labeling θ_0 puts the non-zero elements together in front and separates them from the zero elements, which allows a concise statement of the theorems and proofs. Therefore, denote θ_0 = (θ_{10}^T, θ_{20}^T)^T, where θ_{10} is a vector containing the s nonzero elements and θ_{20} = 0 collects the zero elements. Let θ̂ be the penalized quasi-likelihood estimator of θ. The theorems stated below give some satisfactory large sample properties.

Theorem 1. Suppose that the data are generated by model (1) and the assumptions in Appendix A hold. Then there is a local minimizer θ̂ of J(θ) such that ||θ̂ − θ_0|| = O_p(n^{−1/2} + a_n), where a_n = max_{2≤j≤s} p′_{λ_jn}(|θ_{j,0}|) and b_n = max_{2≤j≤s} p″_{λ_jn}(|θ_{j,0}|).
For the SCAD penalty function, the derivative p′_{λ_jn}(·) exists at any non-zero point for a proper choice of λ_jn. Theorem 1 shows that there is a local minimizer of J(θ) that is a √n-consistent penalized quasi-likelihood estimator for an appropriate choice of the regularization parameters λ_jn.
Theorem 2 shows that the proposed method can identify the SAR model (ρ_1 ≠ 0, ρ_2 = 0), the SEM (ρ_1 = 0, ρ_2 ≠ 0), and the SARAR model (ρ_1 ≠ 0, ρ_2 ≠ 0), select explanatory variables, and estimate unknown parameters simultaneously. Similar to the analysis of Fan and Li [13], if λ_jn → 0 as n → ∞, then a_n → 0 for both the SCAD and HARD thresholding penalty functions. Moreover, we obtain that Λ → 0 and d → 0 as n → ∞. Thus, under the regularity conditions, the corresponding oracle property of the penalized quasi-likelihood estimators can be obtained. That is, the penalized quasi-likelihood estimators perform asymptotically as well as the ordinary quasi-likelihood estimators of the nonzero parameters computed with the correct submodel known. However, for the LASSO penalty function, some conditions in Theorem 2 cannot be satisfied.

Algorithm Design and Implementation
In this section, we consider the implementation of the proposed procedures. Since the penalized quasi-likelihood function J(θ) is nonconcave, it is challenging to obtain the global optimal solution. Liu et al. [28] pointed out that existing algorithms, such as the local quadratic approximation (LQA) algorithm [13] and the local linear approximation (LLA) algorithm [31], cannot be applied directly to the SAR model. Similarly, those algorithms do not give the correct minimizer of J(θ) for the SARAR model either. Hence, we design the following iterative algorithm.

Initialization: take θ^(0) as the quasi-maximum likelihood estimate of model (1). (4)

Iteration:
Find β^(p+1) by arg min over β of J(θ) with (σ²)^(p), ρ_1^(p), ρ_2^(p) held fixed. (5)
Find (ρ_1^(p+1), ρ_2^(p+1)) by arg min over (ρ_1, ρ_2) of J(θ) with (σ²)^(p), β^(p+1) held fixed. (6)
Find (σ²)^(p+1) by arg min over σ² of J(θ) with ρ_1^(p+1), ρ_2^(p+1), β^(p+1) held fixed. (7)

Iterate (5) to (7) until the successive values satisfy ||θ̂^(q+1) − θ̂^(q)|| < ε, where θ̂^(q) = ((σ̂²)^(q), ρ̂_1^(q), ρ̂_2^(q), (β̂^(q))^T)^T and ε is a given tolerance value. In the following simulation, we let ε be 10^{−4}. Denote the final estimates of σ², ρ_1, ρ_2, β as σ̂², ρ̂_1, ρ̂_2, β̂; then θ̂ = (σ̂², ρ̂_1, ρ̂_2, β̂^T)^T. In (4), the initial value of θ is the quasi-maximum likelihood estimate based on the log-quasi-likelihood function of model (1). In (5), we note that if both autoregressive coefficients ρ_1 and ρ_2 are known in the SARAR model (1), then it can be transformed into the following linear model:

S_{2n}(ρ_2) S_{1n}(ρ_1) Y_n = S_{2n}(ρ_2) X_n β + E_n.

Therefore, the LQA algorithm can be used to complete this step as in the classical linear models. In (6), the optimization problem of a bivariate function can be solved by the Nelder-Mead method [32]. In (7), taking the partial derivative with respect to σ², the unique minimum point is:

(σ²)^(p+1) = n^{−1} E_n(θ)^T E_n(θ), evaluated at ρ_1^(p+1), ρ_2^(p+1), and β^(p+1).

Figure 1 presents a flowchart of the proposed algorithm.
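The whitening transformation used in step (5) and the closed-form variance update of step (7) can be sketched in a few lines. This is an illustrative sketch only: the function names are our own, the β-update below is plain least squares on the transformed model (the paper applies the penalized LQA step at this point), and numpy is assumed.

```python
import numpy as np

def transform_step(Y, X, W1, W2, rho1, rho2):
    """With (rho1, rho2) held fixed, whiten the SARAR model into the
    classical linear model  Y* = X* beta + E  via
    Y* = S2 S1 Y,  X* = S2 X,  S1 = I - rho1*W1,  S2 = I - rho2*W2."""
    n = Y.shape[0]
    S1 = np.eye(n) - rho1 * W1
    S2 = np.eye(n) - rho2 * W2
    return S2 @ S1 @ Y, S2 @ X

def beta_sigma_step(Y, X, W1, W2, rho1, rho2):
    """One (beta, sigma^2) update: least squares on the transformed model
    (the penalized LQA step would replace this), then the closed-form
    variance update sigma^2 = ||Y* - X* beta||^2 / n from step (7)."""
    Ys, Xs = transform_step(Y, X, W1, W2, rho1, rho2)
    beta, *_ = np.linalg.lstsq(Xs, Ys, rcond=None)
    resid = Ys - Xs @ beta
    return beta, resid @ resid / len(Ys)
```

On noiseless data generated from the model with the true ρ values, this step recovers β exactly and drives the variance update to zero, which is a convenient sanity check for an implementation.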
To implement the above algorithm, the tuning parameters need to be chosen. For the SCAD penalty function, we set a = 3.7, as recommended by Fan and Li [13]. Moreover, it is desirable to select a proper data-driven method to estimate all the tuning parameters λ_2, · · · , λ_{k+3}. Wang et al. [33] proved that the optimal tuning parameter in the SCAD penalty can be determined by BIC for linear regression models. Thus, we can select λ = (λ_2, · · · , λ_{k+3})^T by minimizing the following Bayesian information criterion:

BIC_λ = −2 ln L_n(θ̂_λ) + df_λ · ln n,

where θ̂_λ is the penalized estimator under λ and df_λ is the number of its nonzero coefficients. In fact, minimizing the BIC over a (k + 2)-dimensional space is an unduly onerous task for large k. To save computation time, one may use the same tuning parameter for all the penalty functions. However, our experiments (not reported, to save space) show that the spatial regression coefficient ρ_2 is then easily compressed to 0 even at a medium sample size. Intuitively, we should use different tuning parameters for the spatial regression coefficients ρ_j (j = 1, 2) and the regression coefficients β_j (j = 1, · · · , k), because the range of ρ_j (j = 1, 2) is known before estimation, but the ranges of β_j (j = 1, · · · , k) are not. Thus, we set λ_2 = λ_3 and λ_4 = · · · = λ_{k+3} to optimize the results. It should be pointed out that we can prove the consistency of the BIC criterion under more stringent conditions, such as the bounded derivative of the quasi-likelihood function and α(λ). However, it is very difficult to prove consistency under mild conditions, and this is left for further study.
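The two-dimensional grid search implied by λ_2 = λ_3 and λ_4 = · · · = λ_{k+3} can be sketched generically. Here `fit` is a hypothetical callable standing in for the full penalized estimation routine (which the paper's algorithm provides); it is assumed to return the log-quasi-likelihood and the number of nonzero coefficients of the fitted model.

```python
import numpy as np

def bic(loglik, n_nonzero, n):
    """Generic BIC score: -2 * log-likelihood + df * log(n); smaller is better."""
    return -2.0 * loglik + n_nonzero * np.log(n)

def select_lambda(fit, grid_sp, grid_beta, n):
    """Scan a 2-D grid: one lambda shared by the spatial coefficients
    (rho1, rho2) and one shared by the regression coefficients, and keep
    the BIC minimizer.  `fit(lam_sp, lam_beta)` is assumed to return
    (loglik, n_nonzero) for the corresponding penalized fit."""
    best = None
    for ls in grid_sp:
        for lb in grid_beta:
            loglik, nnz = fit(ls, lb)
            score = bic(loglik, nnz, n)
            if best is None or score < best[0]:
                best = (score, ls, lb)
    return best  # (bic, lambda_spatial, lambda_beta)
```

Collapsing the (k + 2)-dimensional search to this 2-D grid is what makes BIC tuning computationally feasible for moderate k.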

Numerical Simulation
In this section, we conduct some Monte Carlo experiments to evaluate the finite sample performance of the proposed variable selection method in the SARAR model using R codes.

Simulation Sampling
The sample data are generated by model (1). We consider eight explanatory variables following an 8-dimensional normal distribution with zero mean and covariance matrix (σ_ij), where σ_ij = 0.5^{|i−j|}. The spatial autoregressive coefficients are set to (ρ_1, ρ_2) = (0.7, 0.7), (0.7, 0.3), (0.7, 0), (0, 0.7), and (0, 0). For simplicity, let W_{1n} = W_{2n} = I_R ⊗ B_m with B_m = (l_m l_m^T − I_m)/(m − 1), where ⊗ is the Kronecker product and l_m is an m-dimensional column vector of ones [3,34]; this is called the Case spatial weight matrix. To observe the influence of different spatial weight matrices, the Rook spatial weight matrix is also introduced, in which w_ij is set to 1 when the regions share a common boundary and 0 otherwise. For the Case spatial weight matrix, we take m = 3 and different values of R, where R = 10, 20, 60, so the corresponding sample sizes are n = 30, 60, 180. For the Rook spatial weight matrix, we use a grid of square areas and generate the matrix according to whether the edges are adjacent. To ensure that the region is square, the value of n is the square of an integer, and n = 36, 64, 196. The regression coefficients are assumed to be β = (3, 2, 0, 0, 1, 0, 0, 0)^T. The innovation e_i follows a normal distribution with mean 0 and variance σ² = 1, 1.5.
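Both weight matrices are straightforward to construct. The sketch below (function names and numpy are our own choices) builds the Case matrix I_R ⊗ B_m and a row-normalized Rook matrix on a square grid:

```python
import numpy as np

def case_weight_matrix(R, m):
    """Case-type block-diagonal weight matrix W = I_R kron B_m with
    B_m = (l_m l_m' - I_m) / (m - 1): R districts of m members each,
    every member equally influenced by the other m - 1 members."""
    B = (np.ones((m, m)) - np.eye(m)) / (m - 1)
    return np.kron(np.eye(R), B)

def rook_weight_matrix(side):
    """Rook contiguity on a side x side grid: w_ij = 1 when cells share
    an edge, 0 otherwise, then row-normalized."""
    n = side * side
    W = np.zeros((n, n))
    for r in range(side):
        for c in range(side):
            i = r * side + c
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < side and 0 <= cc < side:
                    W[i, rr * side + cc] = 1.0
    return W / W.sum(axis=1, keepdims=True)

W_case = case_weight_matrix(R=10, m=3)   # n = 30, as in the simulation
W_rook = rook_weight_matrix(side=6)      # n = 36
```

Both constructions yield row-stochastic matrices with zero diagonals, the standard normalization in spatial econometrics.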

Simulation Results
For each case, we run 100 repetitions. The average number of zero coefficients correctly identified is denoted "C". The label "I" indicates the average number of non-zero coefficients incorrectly shrunk to zero. To measure the estimation accuracy of θ, we compare the medians of the squared error (SE), as in Liang and Li [10], defined as:

SE(θ̂_n) = ||θ̂_n − θ_0||²,

where θ̂_n = (θ̂_1, θ̂_2, · · · , θ̂_{k+3})^T is the estimate of θ_0. In Tables 1-3, "Oracle" denotes the results of variable selection with the zero parameters known. Moreover, other penalty functions, such as HARD and LASSO, are used in the penalized quasi-likelihood function for comparison. Tables 1-3 clearly show similar performances for variable selection under the two different spatial weight matrices; in other words, the proposed method is not sensitive to the change of the spatial weight matrix. As expected, all penalty functions reduce their mSE (the median of SE) and give results close to Oracle as the sample size increases. In most cases, the SCAD penalty produces the lowest mSE, the HARD penalty's is a little larger than the SCAD penalty's, and the LASSO penalty produces the largest mSE. Moreover, if there are spatial effects for both the spatial lag term and the spatial error lag term (ρ_1 ≠ 0 and ρ_2 ≠ 0), the mSE is often relatively large; if only one of the spatial lag term and the spatial error lag term carries a spatial effect (ρ_1 ≠ 0, ρ_2 = 0, or ρ_1 = 0, ρ_2 ≠ 0), the mSE is usually smaller, especially when there is no spatial effect (ρ_1 = 0 and ρ_2 = 0). However, as with most existing variable selection results, the mSE becomes less accurate in all cases if the variance σ² of the innovation becomes large.
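The three reported quantities are simple to compute from an estimate and the truth. A minimal sketch (names and the zero tolerance are our own choices):

```python
import numpy as np

def selection_metrics(theta_hat, theta0, tol=1e-8):
    """SE = ||theta_hat - theta0||^2, plus the two counts used in the
    tables: C = zero coefficients correctly identified as zero, and
    I = nonzero coefficients incorrectly shrunk to zero."""
    theta_hat, theta0 = np.asarray(theta_hat), np.asarray(theta0)
    se = float(np.sum((theta_hat - theta0) ** 2))
    zero_true = np.abs(theta0) < tol
    zero_hat = np.abs(theta_hat) < tol
    C = int(np.sum(zero_true & zero_hat))
    I = int(np.sum(~zero_true & zero_hat))
    return se, C, I

# example: the estimate wrongly shrinks the last nonzero coefficient to zero
metrics = selection_metrics([3.0, 2.1, 0.0, 0.0, 0.0],
                            [3.0, 2.0, 0.0, 0.0, 1.0])
```

Over the 100 repetitions, C and I are averaged and the median of SE (mSE) is reported.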
In terms of C and I, we can see that the average number of correctly identified zero-valued coefficients approaches the true value and the average number of incorrectly identified zero-valued coefficients approaches 0 as the sample size n increases. These simulation results accord with the theoretical analysis. The SCAD and HARD penalties perform well with respect to C, with little difference between them in most cases. They converge rapidly to the true number of zeros, whereas the LASSO penalty has a low convergence rate, which may imply that both SCAD and HARD tend to give smaller models than LASSO. For small samples, the LASSO penalty has the lowest value of I in most cases. However, these differences quickly disappear in large samples for all penalties. These results are similar to those obtained by Fan and Li [13]. It is worth noting that when ρ_2 is small, it is easily compressed to 0, which then produces a larger error rate I in the small-sample setting. Table 4 shows the results of ignoring the spatial effects via the LQA algorithm [13] under the same settings as in Table 1. In terms of I, when there are two spatial effects (ρ_1 ≠ 0 and ρ_2 ≠ 0), the number of incorrect zeros in Table 1 is much lower than in Table 4. When only one spatial effect exists (ρ_1 ≠ 0, ρ_2 = 0, or ρ_1 = 0, ρ_2 ≠ 0), the number of incorrect zeros in Table 4 decreases slightly compared to the first case and is still larger than in Table 1. When there is no spatial effect (ρ_1 = 0 and ρ_2 = 0), the results of our algorithm are close to those of the LQA algorithm. Meanwhile, turning to C, our algorithm identifies more true zeros than the LQA algorithm as long as a spatial effect exists (ρ_1 ≠ 0).
Although we find, somewhat surprisingly, that the values of C and I under the LQA algorithm slowly approach the correct values as the sample size increases for the SCAD and HARD penalties, the mSE reflecting their parameter estimation errors is extremely large. This is in line with our intuition: ignoring both spatial effects, the LQA algorithm is implemented on the wrong model and easily leads to a large estimation bias. Moreover, the LQA algorithm is affected by the initial estimate. In the simulation, the initial estimate is the quasi-maximum likelihood estimate, which equals the least squares estimate (involving the observed values of the dependent variable Y_n). If a strong spatial effect in ρ_1 is ignored, the observed values of the dependent variable will deviate seriously from the requirement of unbiased estimation, which leads to a large deviation of the initial estimate; through the iterations, the accumulated error of the final estimate becomes extraordinary. However, when the spatial effects disappear, both algorithms have similarly good performances, which indicates that whether or not spatial effects are present, the proposed algorithm still performs satisfactorily in finite samples. Considering the complexity of the asymptotic covariance matrix of θ̂, we use the traditional bootstrap method, in which the number of resampled observations is 100, to obtain the standard deviations of the parameter estimates. The parameter vector θ is estimated by our algorithm. SD denotes the median absolute deviation of the 100 estimated coefficients over the 100 simulations, which can be regarded as an estimate of the true standard deviation of θ̂. Using the bootstrap, we calculate the median of the estimated standard deviations, denoted SDm, and estimate its standard deviation by the median absolute deviation, denoted SDmad.
Table 5 provides the numerical simulation results for the nonzero coefficients under ρ_1 = 0.7, ρ_2 = 0.3, σ² = 1, n = 30, and n = 60 with the Case spatial weight matrix. The simulation results show that the bootstrap estimated standard deviation becomes increasingly accurate as the sample size n increases. In most cases, the SD, SDm, and SDmad obtained by the SCAD and HARD penalties are smaller than those obtained by the LASSO penalty, which shows that the LASSO penalty does not appear to be as stable as the SCAD and HARD penalties. Furthermore, when σ² increases and moves away from 1, the estimation of the standard deviation becomes less accurate, although those results are not presented. In a word, the LASSO penalty generally lags behind the SCAD and HARD penalties concerning the accuracy of the estimates. To save space, the other cases, such as ρ_1 = 0.7, ρ_2 = 0.7, or σ² = 1.5, yield similar results and are omitted.

Table 4. Simulation results of variable selection when we ignore spatial effects.

Data Example
Now, we consider a real example for the application and performance of the proposed variable selection method in the SARAR model.

The Sample Data
We consider the Boston housing data set, which was originally given by Harrison and Rubinfeld [35] and has been used by many authors, for example, Pace and Gilley [36,37]. The data set contains 506 census tracts with 14 nonconstant independent variables. It can be found in the spdep library of R. Similar to the analysis of Harrison and Rubinfeld [35], the dependent variable is set to log(MEDV) and the explanatory variables are RM², AGE, log(DIS), log(RAD), TAX, PTRATIO, (B − 0.63)², log(LSTAT), CRIM, ZN, INDUS, CHAS, and NOX². Table 6 gives the interpretation of all abbreviated variables; for instance, LSTAT is the proportion of the population that is lower status = ½(proportion of adults without some high school education and proportion of male workers classified as laborers), sourced from the 1970 U.S. Census. For the subsequent analysis, the data are centralized and standardized. The spatial weight matrix is constructed with rook contiguity: the weight is 1 if two different areas share a common boundary and 0 otherwise. The matrix is then row-normalized, as is usually carried out in practice.

Spatial Dependence Test
In spatial data analysis, Moran's I statistic is usually used to test spatial dependence. Table 7 shows the value of Moran's I for the Boston housing data. It is 0.7644 with a p-value of 2.2 × 10⁻¹⁶, which implies that the MEDV has a strong spatial correlation. It is well known that Moran's I reflects the degree of spatial autocorrelation but cannot effectively distinguish between specific spatial autoregressive models, since different spatial correlation structures may be present. Fortunately, the popular Lagrange multiplier diagnostics can complete this specification among several different spatial autoregressive models. This test method avoids the optimization of a nonlinear function and is easy to implement. Using the spdep package in R, we can obtain the desired identification results. From Table 7, it is obvious that the p-value in each case is very small, which implies that the Boston housing data can be modeled by spatial models. Moreover, the values of the test statistics and p-values suggest that the SARAR model is the best choice among these spatial models for fitting the Boston housing data. Previous studies have used multiple hypothesis tests to judge spatial effects and select explanatory variables before determining the model; it is difficult to prove the theoretical properties of such a procedure. Based on the proposed variable selection method, the SARAR model can be used to identify the different spatial effects and select explanatory variables simultaneously, with a good theoretical guarantee. Therefore, we will use the SARAR model for variable selection on these data.
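Moran's I itself is a short computation: with z the mean-centered variable and S0 the sum of all weights, I = (n/S0) · (zᵀWz)/(zᵀz). The sketch below (function name and numpy are our own; the spdep package computes the same statistic in R) illustrates it on a toy line graph where the values are spatially clustered:

```python
import numpy as np

def morans_i(x, W):
    """Moran's I statistic: (n / S0) * (z' W z) / (z' z), with
    z = x - mean(x) and S0 the sum of all weights.  Values near +1
    indicate strong positive spatial autocorrelation."""
    x = np.asarray(x, dtype=float)
    z = x - x.mean()
    S0 = W.sum()
    return (len(x) / S0) * (z @ W @ z) / (z @ z)

# toy line graph of 6 areas, binary contiguity between neighbours;
# the low values cluster on one end and the high values on the other
W = np.zeros((6, 6))
for i in range(5):
    W[i, i + 1] = W[i + 1, i] = 1.0
x = np.array([0, 0, 0, 1, 1, 1], dtype=float)
I = morans_i(x, W)   # positive: neighbouring values are similar
```

A clustered pattern like this yields a clearly positive I, whereas an alternating pattern would yield a negative one.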

Model Selection and Estimation
Under a SARAR model, the results are reported in Table 8, where the quasi-maximum likelihood estimate (QMLE) and penalized quasi-likelihood estimate (PQLE) via the SCAD, HARD, and LASSO penalties are listed to assess the performance of variable selection.
The QMLE demonstrates that four variables show a relatively small impact on the MEDV: ZN, INDUS, CHAS, and AGE. These variables also show a small effect on the MEDV in other studies, such as Harrison and Rubinfeld [35] and Pace and Gilley [36]. Moreover, the variables with positive effects include ZN, INDUS, RM², log(RAD), and (B − 0.63)², while the others have negative effects. As expected, the parameter estimates obtained by the penalized method are close to the QMLE, and the nonzero estimates keep the same signs. Moreover, the four insignificant variables (ZN, INDUS, CHAS, and AGE) are penalized to zero under the different penalty functions; therefore, these penalties produce the same selection results in this setting. However, the BIC in Table 8 shows that the SCAD and HARD penalties are preferable to the LASSO penalty. Interestingly, although the spatial correlation coefficients ρ_1 and ρ_2 are also penalized by the different penalty functions, they do not shrink to zero and are similar to the QMLE results. From the perspective of model specification, we can say that the penalized method recognizes the spatial autoregressive relationship.
For comparison, the Boston housing data is also fitted by a classical linear regression model and the related results are presented in Table 9. The QMLE shows that there are three unimportant variables, including ZN, INDUS, and AGE. Moreover, variables with positive effects include ZN, INDUS, CHAS, RM 2 , AGE, log(RAD), (B − 0.63) 2 , while others have negative effects. In addition, all penalties also produce the same selection results in this model. According to the QMLE, these penalties can also select important variables and shrink unimportant variables to zero. Based on the BIC, the SCAD and HARD penalties also outperform the LASSO penalty in this setting.
Although both models have similar selection results, the differences between them are quite obvious. For the QMLE, the estimated coefficient of AGE is negative in the SARAR model, a plausible result, but it is positive in the linear model, which seems implausible. For the PQLE, it is easy to see that the CHAS disappears in the SARAR model while it is relatively important in the linear model. Furthermore, the meaning of the parameter estimation in these two models is also distinctly different. The interpretation of parameter estimates in the SARAR model will become richer and more complicated than that in the linear model because of the spatial autocorrelation [38]. As we expected, the BIC for the SARAR model is far less than that for the classical linear model, which indicates that the SARAR model has a better fitting effect than the classical linear model in such data.

Summary and Discussion
In theory, the proposed penalized quasi-likelihood method can identify two kinds of spatial effects, select significant explanatory variables, and estimate unknown parameters simultaneously. The penalized estimators possess consistency, sparsity, and asymptotic normality, which shows that the penalized estimates of the significant coefficients, obtained without knowing which coefficients are zero, are as good as those obtained when the zero coefficients are known. In application, the proposed method is consistent with the theoretical results: it can effectively penalize the coefficients of insignificant variables to zero, identify the appropriate spatial regression model, and improve the interpretability of the results due to the reduced variable dimension.
From the analysis results of theory and application, it can be seen that the proposed method can effectively achieve a variable selection and identify spatial effects. At the same time, due to the complexity and time consumption of high-dimensional matrix inverse operation, we also find that the optimization efficiency of the penalty quasi-likelihood function still has room for further improvement. Therefore, this method is suitable for the case of a medium sample size and variable dimension not exceeding the sample size. When the sample size is large enough, the penalty GMM method can be considered to improve the operation speed. Once the dimension of the variable exceeds the sample size, our proposed method will not be applicable. Even so, the proposed method can also be used as a basis for future research, such as a new feature selection in spatial data.
In conclusion, it would be meaningful to extend this approach to other high-dimensional parametric regression models, such as spatial Durbin models and dynamic panel data models, as well as to ultra-high-dimensional nonparametric or semi-parametric spatial regression models, such as varying-coefficient, single-index, and additive spatial regression models. These topics are left for further research.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Assumptions
The following regular conditions are needed for the large sample properties of the penalized quasi-likelihood estimator.
Assumption A5. The lim n→∞ n −1 X T n X n exists and is nonsingular. The elements of X n are uniformly bounded constants for all n.
Assumption A6. The row and column sums of S_{in}^{-1}(ρ_i) are uniformly bounded, uniformly in ρ_i in a closed subset Λ of (−1, 1), and the true ρ_{i0} is an interior point of Λ, i = 1, 2.
Assumption A1 provides an essential condition for the use of the central limit theorem in Kelejian and Prucha [40]. Assumption A2 describes the dynamic relation between the spatial weight matrix and the sample size n. If {h_n} is a bounded sequence, Assumption A2 is easily satisfied. The Case model [34], where h_n may diverge to infinity, also satisfies Assumption A2. Assumption A3 guarantees the existence of the mean and variance of the independent variables. Assumption A4 implies that the variance of Y_n is bounded as n goes to infinity. Similar conditions have been adopted in Kelejian and Prucha [40] and Lee [3]. Assumption A5 excludes multicollinearity of the regressors X_n. Assuming that the regressors are uniformly bounded is convenient for the analysis; if they are not, the assumption can be replaced by stochastic regressors with certain finite moment conditions [3]. Assumption A6 deals with the nonlinearity of ln|S_{1n}(ρ_1)| and ln|S_{2n}(ρ_2)| in the log-quasi-likelihood function. Assumption A7 means that G_{kn} X_n β_0 and X_n are not asymptotically multicollinear for k = 1, 2; it is an identification condition for θ_0. Assumptions A8 and A9 are used for the Taylor expansion of the log-quasi-likelihood function and the asymptotic normality of the estimator.

Appendix B. Proofs of Theorems 1 and 2
The following Lemmas are used for proofs of Theorems 1 and 2.
Proof of Lemma A1. It follows from a straightforward calculation that: By (A1) and some operational properties of the related matrices in [3], we have: (S_{2n} G_{1n} X_n β_0)^T (S_{2n} G_{1n} X_n β_0) By the Chebyshev inequality, we obtain: Proof of Lemma A2. Note that: (1/n) ∂² ln L_n(θ_0) Then, similar to the proof of Theorem 3.2 in [3], we can obtain Lemma A2.
Proof of Theorem 1. Let z_n = n^{−1/2} + a_n. As demonstrated by Fan and Li [13], it suffices to prove that for any given η > 0, there is a positive constant C such that: From Lemma A1, b_n = o(1), and O_p(n^{−1/2} z_n) = O_p(z_n²), it follows that J_1 = u · O_p(z_n²), J_2 = u² · O_p(z_n²), and J_3 is bounded by u · O_p(z_n²) + u² · o_p(z_n²). Thus, J_1 and J_3 can be dominated by J_2 uniformly for a sufficiently large u = C when n → ∞. Hence, (A2) holds. Note that z_n = O(n^{−1/2}) provided that a_n = O(n^{−1/2}). This completes the proof of Theorem 1.
Note that lim inf_{n→∞} lim inf_{δ→0+} λ_{jn}^{−1} p′_{λ_jn}(δ) > 0 and lim_{n→∞} n^{−1/2} λ_{jn}^{−1} = 0. The sign of the derivative is the same as that of θ_j for sufficiently large n. This shows that the minimum is attained at θ_2 = 0. Lemma A3 is proven.