A Semiparametric Bayesian Approach to Heterogeneous Spatial Autoregressive Models

Many semiparametric spatial autoregressive (SSAR) models have been used to analyze spatial data in a variety of applications; however, it is a common phenomenon that heteroscedasticity often occurs in spatial data analysis. Therefore, when considering SSAR models in this paper, it is allowed that the variance parameters of the models can depend on the explanatory variable, and these are called heterogeneous semiparametric spatial autoregressive models. In order to estimate the model parameters, a Bayesian estimation method is proposed for heterogeneous SSAR models based on B-spline approximations of the nonparametric function. Then, we develop an efficient Markov chain Monte Carlo sampling algorithm on the basis of the Gibbs sampler and Metropolis–Hastings algorithm that can be used to generate posterior samples from posterior distributions and perform posterior inference. Finally, some simulation studies and real data analysis of Boston housing data have demonstrated the excellent performance of the proposed Bayesian method.


Introduction
In recent decades, spatial data analysis based on spatial autoregressive (SAR) models has become a very important and popular research direction in the academic research of econometricians and statisticians.Among them, in-depth research has been conducted on inference theories and methods based on linear SAR models and their extensions, including estimation, variable selection, hypothesis testing, etc., and a large amount of literature has also been produced, such as [1][2][3].Specifically, there are many in-depth research achievements on linear SAR models, such as [4][5][6][7].However, in spatial data analysis, there is often a nonlinear relationship between response variables and covariates.Therefore, in order to explore this complex phenomenon, some semiparametric SAR models have been proposed in recent years and have been thoroughly studied.For example, based on partially linear spatial autoregressive models, Su and Jin [8] proposed a profile quasi-maximum likelihood estimation method and established asymptotic theoretical properties of the obtained estimators.By using the spline approximations and instrumental variables estimation method, Du et al. [9] developed an estimation method for partially linear additive spatial autoregressive models, and derived asymptotic theoretical properties of the obtained estimators.For partially linear single-index spatial autoregressive models, Cheng and Chen [10] developed an estimation method and established consistency and asymptotic normality of the estimators under some mild assumptions.Other related research results on semiparametric SAR models can also be found in [11,12].Previous research on various spatial autoregressive models was mainly based on the assumption of homoscedasticity, which assumes that the variance of model errors is constant.As is well known, heteroscedasticity is a common phenomenon in spatial data analysis.Therefore, using statistical inference methods under the assumption of homogeneity may lead to erroneous inference, as seen in Lin and Lee [13].Therefore, it is necessary to study heterogeneous spatial autoregressive models.Especially in recent years, many researchers have conducted in-depth research on spatial autoregressive models where the error variance is heteroscedasticity.For example, SAR models with heteroscedasticity was studied by Dai et al. [14] for Bayesian local influence analysis.However, existing literature on heterogeneous spatial autoregressive models assume that the variance term is fixed and does not perform regression modeling analysis like the mean.In addition, in many application fields, such as econometrics, the assumption of equal variance may not be suitable for modeling data that exhibit heteroscedasticity.Therefore, we propose a new model, called a heterogeneous semiparametric spatial autoregressive model, in which the variance parameters are allowed to be modeled by using some covariates.
In addition, many estimation methods have recently been developed for SAR models from both frequentist and Bayesian perspectives.Specifically, due to the rapid development of advanced computing technology in the era of big data, Bayesian statistical analysis of SAR models and various other statistical models has received increasing attention, and large quantities of related research achievements have emerged in recent years.For example, within the framework of longitudinal data and for the generalized partial linear mixed models, Tang and Duan [15] studied an effective semiparametric Bayesian method.By using spline approximation, Xu and Zhang [16] introduced a Bayesian method for the partially linear model with heteroscedasticity based on the variance modelling technique.Based on the assumption that the response variables and random effects follow multivariate skew-normal distributions, a new spatial dynamic panel data model was proposed by Ju et al. [17] and a Bayesian local influence analysis method was developed to simultaneously evaluate the impact of small perturbations on the data, priors, and sampling distributions.Pfarrhofer and Piribauer [18] studied Bayesian variable selection for highdimensional spatial autoregressive models based on two shrinkage priors.Wang and Tang [19] made Bayesian statistical inference based on a quantile regression model with nonignorable missing covariates.To capture the linear and nonlinear relationships between explanatory variables and their responses to spatially relevant data, Chen and Chen [20] developed a Bayesian sampling-based method based on the partially linear single-index spatial autoregressive models, in which it includes an efficient MCMC approach and explores the joint posterior distributions by using a Gibbs sampler.Within the framework of longitudinal data, Zhang et al. [21] proposed semiparametric mixed-effects double regression models for analysis based on spline approximation technology, in which they jointly modeled the mean and variance of the mixed-effects as a function of covariates.To our knowledge, there is not much work on semiparametric Bayesian methods for heterogeneous spatial autoregressive models due to their complex spatial correlation structures.Therefore, based on a hybrid effective algorithm that combines a Gibbs sampler and the Metropolis-Hastings algorithm and has the advantages of both algorithms, this paper develops a Bayesian method for heterogeneous semiparametric spatial autoregressive models based on variance modeling.
The outline of the paper is as follows.A new heterogeneous SSAR model is introduced in Section 2. In Section 3, we derive the full conditional distributions for implementing the sampling-based method, and develop a Bayesian method to obtain estimates by using a Gibbs sampler and the Metropolis-Hastings algorithm.Section 4 presents some simulation studies to illustrate the proposed methodology.As an application example, Section 5 analyzes the Boston house price data by using the proposed method.A brief conclusion and discussion is given in Section 6.

Heterogeneous Semiparametric Spatial Autoregressive Models
As is well known, the form of classical semiparametric spatial autoregressive models is as follows: where Y = (y 1 , y 2 , • • • , y n ) T is an n-dimensional response variable, |ρ| < 1 is an unknown spatial lag parameter that reflects spatial autocorrelation between neighbors, and W is a known spatial weight matrix with zero diagonal elements.In the mean model, T is an n × p explanatory variable matrix where the ith row is T is a p-dimensional unknown regression coefficient to be estimated; moreover, g(•) is an arbitrary unknown smooth function in the mean model, which needs to be estimated; U = (u 1 , u 2 , • • • , u n ) T is an n-dimensional vector whose ith row u i is an univariate observed covariate; ε is an n-dimensional vector that represents the regression errors of an independent and identically distributed regression disturbances with zero mean and finite variance σ 2 .
In addition, according to Xu and Zhang [16], this paper considers the heterogeneity of the variance in the model and assumes that the variance parameters are related to other explanatory variables; thus, we establish a regression model for the variance parameters, namely ) is an explanatory variable vector related to the variance of ε i and γ = (γ 1 , • • • , γ q ) T is a q-dimensional unknown regression coefficient to be estimated in the variance model.Some elements in z i may coincide with some elements in x i .In addition, for the identifiability of the models and considering that the variance is positive, h(•) is a known monotonic positive function.For example, exponential functions are often used to model the variance.So, heterogeneous SSAR models are considered in this paper as follows: (3)

B-Splines for the Nonparametric Function
From model (3), we obtain the log-likelihood function where e = AY − Xβ − g(U), A = I n − ρW, and I n is an (n × n) identity matrix.
There are now a large number of nonparametric techniques for handling nonparametric functions g(•) in (3), such as the smoothing splines and the kernel methods.This paper considers B-spline approximation to transform g(•) into a linear function composed of a set of basis functions.It can be summarized as follows: {s i }, i = 1, • • • , n performs a partition on the interval [0, 1], which is called as the internal knots and satisfies 0 This results in K = K n + l normalized B-spline basis functions of order l, which form the basis of a linear spline space.The main reason for using B-splines here is because they have advantages such as bounded support and numerical stability.As well as we know, selection of knots is usually an important aspect that cannot be ignored in the implementation process of B-splines.In this paper, our main focus is inference on the parameters in the mean model and the variance model.Therefore, by following the idea of Zhang et al. [21], the number of internal knots is selected as the integer part of n 1/5 .Thus, π T (u)α is used to approximate g(u), in which π(u) = (B 1 (u), ..., B K (u)) T is a basis function vector and α ∈ R K .In this way, we can linearize the nonparametric function g(•) in (3) as follows: Thus, based on (5), we can rewritten the likelihood function (4) as follows: where

Prior Selection of Parameters
This paper will use a Bayesian approach to estimate unknown parameters ρ, β, α and γ.Thus, to obtain Bayesian estimation, the prior distributions of unknown parameters to be estimated in the model should be given first.For the convenience of algorithm implementation, normal prior distributions are often chosen as , in which β 0 , α 0 , γ 0 and b β , B γ are known hyperparameter vectors or matrices.Moreover, the prior distribution of τ 2 is the IG(a τ , b τ ), and its density function is in which a τ and b τ are assumed to be positive constants and known.In this paper, we mainly focus on the case where the prior distribution of model parameters is a normal distribution.However, the proposed computational algorithm is also applicable to other specific prior distributions.

Posterior Inference
Let θ = (β, α, γ, ρ) and then we aim to estimate the unknown parameters of θ.Based on the proposed model ( 3) and Gibbs sampling, the specific sampling process is carried out according to the following steps by sampling from joint posterior distribution p(θ|Y, X, Z, U).
In addition, according to [18], the Bayesian estimate of ρ is obtained based on a Metropoliswithin-Gibbs step and by using sampling observations from (11).

Simulation Study
This section conducts a simulation study to evaluate the performance of the proposed Bayesian method under different sample sizes, spatial parameter values, and prior information selections.According to Lee [24] and Xie et al. [7], let n = R × m and W = I R ⊗ H m be the weight matrix, in which ⊗ represents the Kronecker product and , l m is an m-dimensional vector with all component elements being 1.In order to investigate whether the proposed Bayesian estimation method is sensitive to the selection of prior distributions, we consider three hyperparameter values in the prior distributions of unknown parameters β, α, γ, ρ: Type I: This situation can be considered as having good prior information.
Type II: This hyperparameter situation is considered to have no prior information.
Type III: This can be seen as a situation where the previous information was inaccurate.
In this section, R is selected as 25, 50, 75 and m is set to 4, and thus, n is to be 100, 200, 300.Furthermore, we generate X and Z, respectively, from the multivariate normal distribution with zero mean vector and covariance matrix . Moreover, to reflect the different spatial dependencies between response variables, spatial parameters ρ = −0.5, 0, 0.5 are selected to represent different spatial dependencies; β = (1, −0.5, 0.5) T , g(u i ) = 0.5 sin(2πu i ) where u i follows a uniform distribution U(0, 1),and the structure of the variance model is log(σ 2 i ) = z T i γ with γ = (1, −0.5, 0.5) T .
Based on the various parameter setting environments and generated datasets mentioned above, we use the hybrid MCMC algorithm based on 100 replications to evaluate Bayesian estimations of unknown parameters under different sample sizes.Checking whether the MCMC sampler converges in algorithm implementation is an important thing.
Therefore, here the estimated potential scale reduction (EPSR) value is used to diagnose whether the MCMC algorithm converges for each dataset [25].It can be easily observed that in all the runs we are considering, the EPSR value is very close to 1 and less than 1.1 after 3000 iterations.Therefore, after discarding the first 3000 burn-in iterations, collect the observation results of the following J = 2000 for statistical inference.In addition, to evaluate the performance of the nonparametric function estimation, the square root of average square errors(RASE) are used here as the criterion for evaluation, .
The simulation results are listed in Tables 1-4.Moreover, in order to directly examine the accuracy of the estimation of function g(u), we plot the true value of function g(u) and its estimated curve under different cases.To save space, we only list some nonparametric estimation curve results with different spatial parameters in Figures 1-3.Figures 1-3 depict the real sine curve and its estimated curve based on B-spline approximation.It is easy to observe that all estimated curves are close to the true curve, which indicates that the estimation of g(•) using B-splines in the mean model performs well.In Tables 1-3, "Bias" represents the absolute difference between the true and mean values estimated by Bayesian estimation of parameters based on 100 replicates, and "SD" denotes standard deviation of the Bayesian estimates, while "RMS" is the root mean square between the estimated and true values based on 100 replicates.From Tables 1-3, we can see that (i) From the Bias, RMS and SD values of Bayesian estimation, it can be seen that regardless of the prior information input, Bayesian estimation is quite accurate; and for different prior distributions, the proposed estimation method performs well, indicating that Bayesian estimation is not sensitive to prior information input.(ii) Bayesian estimation improves as the sample size increases; (iii)The Bayesian estimation results obtained based on different spatial parameters are similar.(iv) In the same situation, the RMS and SD values of the mean parameters are smaller than that of the variance parameters, which is consistent with the fact that lower order moments are easier to estimate than higher order moments.Furthermore, from Figures 1-3, it can be seen that under the considered parameter settings, the estimated function curve is very close to its corresponding true curve, which is consistent with the phenomenon found in Table 4.In summary, the above simulation research results indicate that applying the Bayesian estimation method proposed in this paper to heterogeneous SSAR models is effective.

Real Data Analysis
Boston housing price data are a commonly used example, and many authors have conducted in-depth analysis based on different statistical models, such as [26,27], and so on.This section will also use the Bayesian estimation method proposed in this paper to analyze these data.This dataset can be easily obtained from R's spdep library, which includes 14 variables and 506 observations.A detailed explanation of the variables involved in the dataset is presented in Table 5.In addition, following the variable selection results of Xie et al. [7], we take MEDV as the response variable of the model (represented by Y) and the important explanatory variables selected by Xie et al. [7] as the X-variables: CRIM (denoted by X 1 ), ZN (denoted by X 2 ),NOX (denoted by X 3 ), RM(denoted by X 4 ), DIS (denoted by X 5 ), RAD (denoted by X 6 ),TAX (denoted by X 7 ), PTRATIO(denoted by X 8 ), and B (denoted by X 9 ).In order to facilitate data modeling and analysis, all variables were centralized and the index variable was set to u = √ LSTAT.In addition, the Euclidean distances calculated using longitude and latitude as [27,28] are used to generate the space weight matrix W = (w ij ), where d ij is the Euclidean distance, and d 0 takes a value of 0.05 as the threshold distance.Thus, the spatial weight matrix contains 19.1% non-zero elements.Then, here we consider the heterogeneous SSAR model as follows: where three explanatory variables Z 1 = X 4 , Z 2 = X 5 , Z 3 = X 6 are selected in the variance model.Thus, the hybrid algorithm proposed earlier is applied to obtain Bayesian estimates of β's, γ's, and ρ's, in which a B-spline with K = 3 and noninformative prior information are used.In order to check the convergence of the algorithm, Figure 4 shows the relationship between the EPSR values of all unknown parameters iterations, indicating that the algorithm converges after approximately 3000 iterations due to the EPSR values of all unknown parameters being less than 1.1 in approximately 3000 iterations.The Bayesian estimates (EST) of β's, γ's and ρ's and their standard deviation estimates (SD), 95% credible intervals (CI) are calculated.Results are given in Table 6. Figure 5 displays the Bayesian estimate of the nonparametric function g(u), which also confirms a significant nonlinear relationship between housing prices and the variable u.Some useful conclusions can be obtained from the results of the table, which are basically consistent with the research results of other authors.For example, the regression parameter corresponding to X 1 in the mean model is estimated to be negative, indicating that housing prices will decrease as the per capita crime rate in urban areas increases.The estimated coefficient of X 4 in the mean model is 0.3899, indicating that as the average number of rooms per dwelling increases, housing prices will also increase.β5 is negative, indicating that the greater the weighted distances to five Boston employment centres, the lower the housing price will be.The regression parameter corresponding to X 8 in the mean model is estimated to be negative, indicating that housing prices will decrease as the pupil-teacher ratio by town increases.The regression parameter corresponding to Z 2 in the variance model is estimated to be negative, indicating that as the weighted distances to five Boston employment centres increases, the fluctuation of housing prices will also decrease.In addition, based on the estimation of γ, we can obtain the estimated value of σ 2 i and present the scatter plot of σ2 i in Figure 6, indicating that heteroscedasticity modeling for this dataset is reasonable.

Conclusions and Discussion
Heteroscedasticity is a common phenomenon in spatial data modeling and analysis.Therefore, this paper proposes heterogeneous SSAR models, where the variance parameter is modeled as a function of the explanatory variable.Like mean regression modeling, the variance component may also depend on various explanatory variables of interest, so estimating the joint models of the mean and variance becomes the basis for avoiding modeling bias and reducing model complexity.Then, based on the nonparametric components approximated by B-splines, we propose a complete Bayesian analysis of a heterogeneous semiparametric spatial autoregressive models.Based on the Gibbs sampler and Metropolis-Hastings algorithm, an effective MCMC sampling algorithm was developed for posterior inference by generating posterior samples from the posterior distributions.Simulation studies and real data analysis based on Boston housing data are used to illustrate the proposed method.The results show that the proposed Bayesian method has high efficiency and fast computational speed.
In addition, there are several interesting questions worth further research in the future.For example, (i) it is interesting to consider variable selection for the parametric component in the context of heterogeneous spatial autoregressive model; (ii) the model proposed in the paper is also a worthwhile issue to study when there are missing response variables.

Figure 1 .
Figure 1.When m = 4, R = 25, tpye = I, the curve plot of the estimated function and its true values of g(u) based on different ρ's (the corresponding spatial parameters from left to right are ρ = 0.5, 0, −0.5).

Figure 2 .Figure 3 .
Figure 2. When m = 4, R = 50, tpye = I, the curve plot of the estimated function and its true values of g(u) based on different ρ's (the corresponding spatial parameters from left to right are ρ = 0.5, 0, −0.5).

Figure 4 .
Figure 4. EPSR values for all parameters in Boston data analysis.

Figure 5 .
Figure 5.The solid curve is the estimated function of ĝ(u).

Figure 6 .
Figure 6.The scatter plot of σ2 i .

Table 1 .
Bayesian estimation results of unknown parameters under different sample sizes and prior information when ρ = 0.5.

Table 2 .
Bayesian estimation results of unknown parameters under different sample sizes and prior information when ρ = 0.

Table 3 .
Bayesian estimation results of unknown parameters under different sample sizes and prior information when ρ = −0.5.

Table 4 .
The estimate for the nonparametric components based on RASE.

Table 5 .
Detailed description of relevant variables involved in Boston housing data.

Table 6 .
Bayesian estimation results in Boston data analysis.