Bayesian Analysis of Partially Linear Additive Spatial Autoregressive Models with Free-Knot Splines

Abstract: This article deals with symmetrical data that can be modelled by a Gaussian distribution. We consider a class of partially linear additive spatial autoregressive (PLASAR) models for spatial data and develop a Bayesian free-knot splines approach to approximate the nonparametric functions. The approach lends itself to efficient Markov chain Monte Carlo (MCMC) tools: a Gibbs sampler is designed to explore the full conditional posterior distributions and analyze the PLASAR models. To obtain a rapidly convergent algorithm, a modified Bayesian free-knot splines approach incorporating powerful MCMC techniques is employed. The Bayesian estimator (BE) is more computationally efficient than the generalized method of moments estimator (GMME) and is thus capable of handling large-scale spatial data. The performance of the PLASAR model and methodology is illustrated by a simulation study, and the model is used to analyze a Sydney real estate dataset; for example, SSE1 = 0.1257 and SSE2 = 0.1154 for (r, m) = (80, 5). The results show that the Bias and SSE values of the nonparametric functions decrease as the sample size increases, indicating that the nonparametric estimation is convergent, and that both the Case and the Rook weight matrices yield a reasonable estimation effect.


Introduction
Spatial econometrics models are frequently proposed to analyze spatial data that arise in many disciplines such as urban, real estate, public, agricultural, environmental economics and industrial organizations. These models address relationships across geographic observations caused by spatial autocorrelation in cross-sectional or longitudinal data. Spatial econometrics models have a long history in both econometrics and statistics. Early developments and relevant surveys can be found in Cliff and Ord [1], Anselin [2], Case [3], Cressie [4], LeSage [5,6], Anselin and Bera [7].
Among spatial econometrics models, spatial autoregressive (SAR) models [1] have gained much attention from theoretical econometricians and applied researchers. Many approaches have been used to estimate SAR models, including the maximum likelihood estimator (MLE) [8], the generalized method of moments estimator (GMME) [9], and the quasi-maximum likelihood estimator (QMLE) [10]. However, these methods have mainly focused on parametric SAR models, which are frequently assumed to be linear; few researchers have explicitly examined non- or semi-parametric SAR models. Indeed, it has been confirmed that many economic variables exhibit highly nonlinear relationships with the dependent variables [11][12][13]. Neglecting the latent nonlinear functional forms often results in inconsistent estimation of the parameters and misleading conclusions [14].
Although many empirical studies and econometric analyses applying parametric SAR models ignore latent nonlinear relationships, several nonlinear forms [15][16][17][18] have been considered. Nevertheless, nonlinear parametric SAR models can at most supply a certain protection against some specific nonlinear functional forms. Since the nonlinear function is unknown, the risk of misspecifying it is unavoidable. As nonparametric techniques have advanced, nonparametric SAR models have increasingly been used to model nonlinear economic relationships.

Model
We begin with the PLASAR model, defined as

y_i = ρ ∑_{l=1}^{n} w_{il} y_l + x_i^T α + ∑_{j=1}^{p} g_j(z_{ij}) + ε_i, i = 1, . . . , n, (1)

where x_i = (x_{i1}, . . . , x_{iq})^T and z_i = (z_{i1}, . . . , z_{ip})^T are covariate vectors, y_i is the response variable, w_{il} is a specified constant spatial weight, g_j(·) is an unknown univariate nonparametric function for j = 1, . . . , p, α = (α_1, . . . , α_q)^T is a q × 1 vector of unknown parameters, the unknown spatial parameter ρ reflects the spatial autocorrelation between neighbors under the stability condition |ρ| < 1, and the ε_i are mutually independent and identically distributed normal errors with zero mean and variance σ². To ensure identifiability of the nonparametric functions, it is often assumed that E[g_j(z_j)] = 0 for j = 1, . . . , p.
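As a concrete illustration of model (1), the following sketch simulates data from a PLASAR design via the reduced form y = A(ρ)^{-1}(xα + Σ_j g_j + ε). The function name, covariate distributions, and default values are illustrative assumptions, not part of the paper.

```python
import numpy as np

def simulate_plasar(n, rho, alpha, g_funcs, W, sigma2=0.25, seed=0):
    """Simulate one sample from a PLASAR model (hypothetical helper):
    y = rho*W*y + x*alpha + sum_j g_j(z_j) + eps."""
    rng = np.random.default_rng(seed)
    q, p = len(alpha), len(g_funcs)
    x = rng.normal(size=(n, q))
    z = rng.uniform(-1.0, 1.0, size=(n, p))
    # centre each g_j over the sample so that E[g_j(z_j)] = 0 (identifiability)
    g = np.column_stack([f(z[:, j]) for j, f in enumerate(g_funcs)])
    g -= g.mean(axis=0)
    eps = rng.normal(scale=np.sqrt(sigma2), size=n)
    # reduced form: y = (I_n - rho*W)^{-1} (x alpha + sum_j g_j + eps)
    A = np.eye(n) - rho * W
    y = np.linalg.solve(A, x @ alpha + g.sum(axis=1) + eps)
    return y, x, z
```

The stability condition |ρ| < 1 together with a row-normalized W keeps A(ρ) invertible here.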
To achieve identification, we set ∑_{i=1}^{n} ∑_{l=1}^{K_j} B_{jl}(z_{ij}) β_{jl} = 0, which is written in matrix form as 1^T B_j β_j = 0. Denoting Q_j = 1^T B_j, the constraint becomes Q_j β_j = 0.
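The constraint Q_j β_j = 0 can be imposed on an unconstrained normal draw by a linear projection (conditioning the Gaussian on the constraint). A minimal sketch with hypothetical names, assuming Q has full row rank and Sigma is the covariance of the unconstrained draw:

```python
import numpy as np

def constrain_coefficients(beta_tilde, Sigma, Q):
    """Project an unconstrained draw beta_tilde ~ N(mu, Sigma) onto the
    hyperplane Q beta = 0 by conditioning the Gaussian on the constraint;
    a sketch, not the paper's exact code."""
    QS = Q @ Sigma
    # correction = Sigma Q^T (Q Sigma Q^T)^{-1} Q beta_tilde
    correction = QS.T @ np.linalg.solve(QS @ Q.T, Q @ beta_tilde)
    return beta_tilde - correction
```

By construction Q applied to the returned vector is exactly zero, so the identification constraint holds for every draw.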
It follows that, after approximating each g_j by a B-spline expansion, model (1) is equivalent to

y_i = ρ ∑_{l=1}^{n} w_{il} y_l + x_i^T α + ∑_{j=1}^{p} B_j^T(z_{ij}) β_j + ε_i, (2)

and the matrix form of model (1) can be represented as

A(ρ)y = xα + B(u)β + ε, (3)

where x = (x_1, . . . , x_n)^T, y = (y_1, . . . , y_n)^T, ε = (ε_1, . . . , ε_n)^T, W = (w_{il}) is an n × n specified constant spatial weight matrix, K = ∑_{j=1}^{p} K_j, and B(u) is an n × K matrix with B_i^T(u) as its ith row. The likelihood function for the PLASAR model is proportional to

|A(ρ)| (σ²)^{−n/2} exp{−‖A(ρ)y − (x, B(u))θ‖² / (2σ²)}, (4)

where (x, B(u)) is an n × (q + K) matrix, θ = (α^T, β^T)^T, A(ρ) = I_n − ρW, and I_n is an identity matrix of order n.
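Under the definitions above, the log-likelihood (up to an additive constant) is log|A(ρ)| − (n/2) log σ² − ‖A(ρ)y − Dθ‖²/(2σ²), with D = (x, B(u)) the stacked design matrix. A minimal sketch (function and argument names are illustrative):

```python
import numpy as np

def plasar_loglik(rho, theta, sigma2, y, D, W):
    """Log-likelihood of the PLASAR model in matrix form, up to a constant:
    log|A(rho)| - (n/2) log sigma^2 - ||A(rho) y - D theta||^2 / (2 sigma^2),
    with A(rho) = I_n - rho*W and D = (x, B(u)) an n x (q+K) design matrix."""
    n = len(y)
    A = np.eye(n) - rho * W
    _, logdet = np.linalg.slogdet(A)   # stable log-determinant of A(rho)
    resid = A @ y - D @ theta
    return logdet - 0.5 * n * np.log(sigma2) - resid @ resid / (2.0 * sigma2)
```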

Bayesian Estimation
In this section, we consider a Bayesian free-knot splines approach with MCMC techniques to analyze the PLASAR model. We begin with the specification of the prior distributions of all unknown parameters, followed by the derivation of the posterior distributions and a detailed sampling scheme for all of the unknown parameters. Meanwhile, we modify the movement step of the Bayesian free-knot splines approach so that all the knots can be repositioned in each iteration.

Priors
As we consider a Bayesian approach with free-knot splines to analyze the PLASAR models, all unknown parameters are assigned prior distributions. Note that besides the regression coefficient vector θ, the spatial autocorrelation coefficient ρ, and the variance σ², the numbers of knots k = (k_1^T, . . . , k_p^T)^T and the knot locations ξ = (ξ_1^T, . . . , ξ_p^T)^T also need prior distributions, since they are random variables in the Bayesian approach with free-knot splines. We avoid the use of improper prior distributions to prevent improper joint posterior distributions.
For j = 1, . . . , p, we follow Poon and Wang [49] by putting a Poisson prior with mean λ_j on the number k_j of knots and a conditional flat prior on the knot locations ξ_j. We set a conjugate normal inverse-gamma prior for the unknown parameters (θ, σ²), which is composed of an inverse-gamma prior for σ², where r_0 and s_0² are hyperparameters; a conditional normal prior with mean vector 0 and covariance matrix τ_0 σ² I_q for α; and a conditional normal prior with mean vector 0 and covariance matrix τ_j σ² I_{K_j} for β_j under the constraint Q_j β_j = 0, for j = 1, . . . , p. To improve the robustness of our method, we choose an inverse-gamma prior for each τ_j, where r_{τ0} and s²_{τ0} are pre-specified hyperparameters. Throughout this article we set r_0 = s_0² = 1 to obtain a Cauchy-type prior for σ², and assign r_{τα0} = r_{τβ_j0} = 1 and s²_{τα0} = s²_{τβ_j0} = 0.005 to acquire highly dispersed inverse-gamma priors on τ_j for j = 0, 1, . . . , p.

The Full Conditional Posterior Distributions of Unknown Quantities
Since the joint posterior distribution of the quantities is very complicated, it is difficult to generate samples directly. To solve this problem, we derive the full conditional posterior distributions of unknown quantities, modify the movement step of Bayesian free-knots splines to speed up the convergence, and describe the detailed sampling method in our algorithm.
It follows from the likelihood function (4) and the joint priors (5) that the conditional posterior distribution of ρ, given the remaining unknown parameters, is proportional to p(ρ | y, x, z, α, β, k, ξ, σ², τ) in (6). It is not easy to simulate directly from (6), which does not have the form of any standard density function. We therefore prefer the Metropolis-Hastings algorithm [53,54] to overcome this difficulty: draw a candidate ρ* from a Cauchy distribution with location ρ and scale σ_ρ truncated on (−1, 1), where σ_ρ is treated as a tuning parameter, and accept ρ* with probability

min{1, p(ρ* | x, y, z, α, β, k, ξ, σ², τ) / p(ρ | x, y, z, α, β, k, ξ, σ², τ)}.

From the likelihood function (4) and the priors (5), we see that, given (ρ, τ), the joint posterior of (α, β, k, ξ, σ²) in (7) depends on θ̂ and S² = y^T A(ρ)^T A(ρ)y − θ̂^T Ξ θ̂, which gives rise to the marginal posterior distribution of (k, ξ) in (9). It follows from (7) that the method of composition [55] can be used to generate σ² from its conditional inverse-gamma posterior (10) and then to sample θ from its conditional normal posterior (12). It follows from (11) that each β_j, j = 1, . . . , p, has the conditional normal posterior (13). To achieve identification, the constraint Q_j β_j = 0 must be imposed on β_j. According to Panagiotelis and Smith [56], drawing β_j from (13) is equivalent to drawing β̃_j from a normal distribution with mean vector β̂_j and covariance matrix Ξ_j, and then transforming β̃_j to β_j by (14).

As it is convenient to sample (σ², α, β) from the conditional posteriors (10), (12) and (13), we concentrate on sampling from (9). Here the original Bayesian free-knot splines approach [43][44][45][46][47][48][49][50][51] is used as a reversible-jump sampler [57], which includes three types of move: the deletion, the addition, and the movement of a single knot [48].
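The Metropolis-Hastings update for ρ can be sketched as follows, assuming the truncated-Cauchy proposal is sampled by CDF inversion. The sketch also includes the Hastings correction for the state-dependent truncation mass, which matters when the proposal is truncated around the current location; all names are illustrative.

```python
import numpy as np

def sample_truncated_cauchy(loc, scale, lo, hi, rng):
    """Draw from Cauchy(loc, scale) truncated to (lo, hi) via CDF inversion."""
    F = lambda x: 0.5 + np.arctan((x - loc) / scale) / np.pi
    u = rng.uniform(F(lo), F(hi))
    return loc + scale * np.tan(np.pi * (u - 0.5))

def mh_update_rho(rho, log_post, sigma_rho, rng):
    """One Metropolis-Hastings update of rho with a truncated-Cauchy
    random-walk proposal on (-1, 1); log_post evaluates the log of the
    full conditional in (6) (supplied by the caller)."""
    prop = sample_truncated_cauchy(rho, sigma_rho, -1.0, 1.0, rng)
    # truncation mass of the proposal (up to the common 1/pi factor)
    Z = lambda loc: (np.arctan((1.0 - loc) / sigma_rho)
                     - np.arctan((-1.0 - loc) / sigma_rho))
    log_accept = log_post(prop) - log_post(rho) + np.log(Z(rho)) - np.log(Z(prop))
    if np.log(rng.uniform()) < log_accept:
        return prop
    return rho
```

The scale σ_ρ plays the role of the tuning parameter: it is adjusted until the acceptance rate is near the target (about 25% in the simulations).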
We keep the first two move types unchanged but improve the movement step through the hit-and-run algorithm [58], so that all the knots can be repositioned in each iteration rather than only one. For j = 1, . . . , p: select a k_j-dimensional direction vector c_j = (c_{j1}, . . . , c_{jk_j})^T at random; generate ω_j from a Cauchy distribution with location 0 and scale σ_{ξ_j} truncated on (ω_{j1}, ω_{j2}), where σ_{ξ_j} acts as a tuning parameter; set ξ*_j = ξ_j + ω_j c_j and reorder all of the knots. The proposed number and positions of the knots are finally accepted with a probability whose expression involves Ξ*_j and S*²_j, the counterparts of Ξ_j and S²_j under the candidate posterior, together with a factor accounting for the proposal. It is evident that the posteriors of the hyperparameters τ_j are conditional inverse-gamma distributions (15) and (16), from which they can be simulated directly.
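The modified movement step can be sketched as a hit-and-run proposal that shifts all knots of the jth function at once; the accept/reject step using the posterior ratio is omitted, and the function name and bounds are illustrative assumptions.

```python
import numpy as np

def hit_and_run_knot_move(xi, z_lo, z_hi, sigma_xi, rng):
    """One hit-and-run move repositioning ALL k_j knots at once: pick a
    random unit direction c, find the step interval (w1, w2) that keeps
    every knot inside (z_lo, z_hi), draw the step size from a truncated
    Cauchy, and shift xi along c."""
    k = len(xi)
    c = rng.normal(size=k)
    c /= np.linalg.norm(c)
    # componentwise bounds: z_lo < xi_j + w*c_j < z_hi for every j
    lo = np.where(c > 0, (z_lo - xi) / c, (z_hi - xi) / c)
    hi = np.where(c > 0, (z_hi - xi) / c, (z_lo - xi) / c)
    w1, w2 = lo.max(), hi.min()
    # truncated-Cauchy step size via inverse-CDF sampling
    F = lambda x: 0.5 + np.arctan(x / sigma_xi) / np.pi
    u = rng.uniform(F(w1), F(w2))
    w = sigma_xi * np.tan(np.pi * (u - 0.5))
    return np.sort(xi + w * c)  # reorder the knots
```

Because 0 always lies in (w1, w2), the current knot configuration is in the support of the proposal, and every proposed configuration stays strictly inside the data range.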

Sampling Scheme
The Bayesian estimate of Θ = {ρ, α, k, ξ, β, σ², τ} is obtained from observations generated from the posterior of all unknown quantities by running the Gibbs sampler. Moreover, simulating β_j directly from (13) is challenging and nonstandard because the parameter space is restricted by the constraint Q_j β_j = 0 for j = 1, . . . , p. According to Panagiotelis and Smith [56], it is equivalent to draw an unconstrained vector and transform it to β_j by (14). The MCMC sampling algorithm (Algorithm 1) is described in the following manner.

Algorithm 1
The MCMC sampling algorithm.
Initialization: generate the initial state Θ^(0) by drawing each unknown parameter from its respective prior. MCMC iterations: given the current state Θ^(t−1), successively draw Θ^(t) from p(Θ | x, y, z) for t = 1, 2, 3, . . . The detailed MCMC sampling cycles are outlined in the following manner.
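Within each cycle, the composition step draws σ² from its inverse-gamma marginal and then θ from its conditional normal, as in (10) and (12). The following is a simplified stand-in for that step under a conjugate normal inverse-gamma prior with diagonal prior scale built from τ; names and the exact parameterisation are assumptions, not the paper's code.

```python
import numpy as np

def draw_theta_sigma2(y_star, D, tau, r0=1.0, s20=1.0, rng=None):
    """Composition step: with y_star = A(rho) y and prior
    theta | sigma^2 ~ N(0, sigma^2 * diag(tau)), draw sigma^2 from its
    inverse-gamma marginal and then theta from its conditional normal."""
    rng = rng or np.random.default_rng()
    n, _ = D.shape
    T_inv = np.diag(1.0 / tau)                 # prior precision scale
    Xi_inv = D.T @ D + T_inv
    Xi = np.linalg.inv(Xi_inv)
    theta_hat = Xi @ D.T @ y_star
    S2 = y_star @ y_star - theta_hat @ Xi_inv @ theta_hat
    # sigma^2 ~ IG((n + r0)/2, (S2 + s20)/2), via the reciprocal of a gamma draw
    sigma2 = 1.0 / rng.gamma(shape=(n + r0) / 2.0, scale=2.0 / (S2 + s20))
    # theta | sigma^2 ~ N(theta_hat, sigma^2 * Xi)
    theta = rng.multivariate_normal(theta_hat, sigma2 * Xi)
    return theta, sigma2
```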

Empirical Illustrations
We demonstrate the performance of the PLASAR model and methodology by a simulation study and use them to analyze a real dataset. We set the Rook weight matrix [2] and the Case weight matrix [3] to examine the influence of the spatial weight matrix W. The Rook weight matrix is sampled from Rook contiguity as in [59]: the n spatial units are allocated randomly on a lattice of m × m (m² ≥ n) squares, the neighbors of each unit are found, and the rows are then normalized. Meanwhile, we generate the Case weight matrix from the spatial scenario W = I_r ⊗ B_m in [3], with m members in each of r districts and each neighbor of a member in a district given equal weight [10], where ⊗ is the Kronecker product.
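The Case weight matrix can be built directly from its Kronecker-product form. The sketch below assumes the standard equal-weight block B_m = (1_m 1_m^T − I_m)/(m − 1), i.e., each of the other m − 1 members of a district receives the same weight; the function name is illustrative.

```python
import numpy as np

def case_weight_matrix(r, m):
    """Case-type weight matrix W = I_r (Kronecker) B_m with
    B_m = (1_m 1_m^T - I_m) / (m - 1): r districts of m members, each
    neighbor within a district weighted equally; rows sum to one."""
    B = (np.ones((m, m)) - np.eye(m)) / (m - 1)
    return np.kron(np.eye(r), B)
```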
In our computation, we run each simulation with 1000 replications, adopt a quadratic B-spline, and set the hyperparameters (r_0, s_0², r_{τα0}, s²_{τα0}) = (1, 1, 1, 0.005) and (r_{τβ_j0}, s²_{τβ_j0}) = (1, 0.005) for j = 1, . . . , p. The initial state of the Markov chain is selected by sampling all unknown parameters from their respective priors, and the tuning parameters σ_ρ and σ_{ξ_j} for j = 1, . . . , p are gradually decreased or increased so that the acceptance rates are about 25%. For each replication, we generate 6000 sampled values and delete the first 2000 as a burn-in period. Based on the last 4000 sampled values, we compute the average over the 1000 replications of the posterior mean (Mean), the 95% posterior credible interval (95% CI), and the standard error (SE). In addition, the standard deviations (SD) of the estimated posterior means are calculated to compare with the mean of the estimated posterior SEs. We evaluate the performance of the nonparametric estimators by the integrated squared bias (Bias), the root integrated mean squared error (SSE), and the mean absolute deviation error (MADE). Following the suggestions of LeSage and Pace [52], the mean of either the row sums or the column sums of the non-diagonal elements is used as the indirect effect, the mean of the diagonal elements is used as the direct effect, and the sum of the indirect and direct effects is taken as the total effect.
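The direct, indirect, and total effect summaries described above can be computed from the reduced-form multiplier matrix (I_n − ρW)^{-1}. A sketch for a single covariate with coefficient α_k (function name is hypothetical):

```python
import numpy as np

def effects_summary(rho, alpha_k, W):
    """LeSage-Pace effect summaries for one covariate in a SAR model:
    S = (I - rho*W)^{-1} * alpha_k; direct = mean diagonal of S,
    total = mean row sum of S, indirect = total - direct."""
    n = W.shape[0]
    S = np.linalg.solve(np.eye(n) - rho * W, np.eye(n)) * alpha_k
    direct = np.trace(S) / n
    total = S.sum() / n
    return direct, total - direct, total
```

For ρ = 0 the multiplier matrix is the identity, so the direct effect equals α_k and the indirect effect is zero, which gives a quick sanity check.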
To check the convergence of our algorithm, we run five Markov chains with different starting values through the MCMC sampling algorithm for each replication. The sampled traces of some parameters and of the nonparametric functions at grid points are displayed in Figure 1. It is obvious that the five parallel sequences mix quite well. We compute the potential scale reduction factor √R [60] for all unknown parameters and for the nonparametric functions at 20 selected grid points. The estimation results are reported in Table 1. We observe that the mean values of all estimators are very close to the corresponding true values, and the mean value of SE is close to the respective SD, showing that the parameter estimates and SEs are precise. Meanwhile, the larger the sample size under the same weight matrix, the more precise the estimates. The above experiments have been repeated with different starting values, and the results are similar, implying that the MCMC sampling works well. Moreover, we find that the estimation of ρ with the Case weight matrix is slightly better than that with the Rook weight matrix under the same sample sizes; the likely main reason is that the performance of the Case weight matrix is superior to that of the Rook weight matrix under different variances σ². In addition, the general pattern of the estimates reported in Table 1 is that all estimators show a relatively larger bias on the total effect when, for the same sample size, there is strong positive spatial dependence. Figure 4 depicts the fitted functions together with their 95% CIs from a typical sample with ρ = 0.5 and σ² = 0.25; the typical sample is selected such that its SSE values equal the median over the 1000 replications. It is obvious that the fitted nonparametric functions improve as the sample size increases.
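The potential scale reduction factor √R used above can be computed from parallel chains as follows; this is a standard Gelman-Rubin sketch under illustrative names, not the paper's exact implementation.

```python
import numpy as np

def psrf(chains):
    """Potential scale reduction factor sqrt(R) of Gelman and Rubin for an
    (n_chains, n_draws) array: compares the between-chain variance B with
    the within-chain variance W_var; values near 1 indicate convergence."""
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)               # between-chain variance
    W_var = chains.var(axis=1, ddof=1).mean()     # within-chain variance
    var_plus = (n - 1) / n * W_var + B / n        # pooled variance estimate
    return np.sqrt(var_plus / W_var)
```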
For comparison purposes, we use the Bayesian P-splines approach [61] to approximate the nonparametric functions, assigning a second-order random-walk prior to the spline coefficients. The boxplots of the MADE values with the Case weight matrix are shown in Figure 5. For our method, the medians are MADE1 = 0.0997, MADE2 = 0.0804 and MADE = 0.0916 for (r, m) = (20, 5), and MADE1 = 0.0504, MADE2 = 0.0433 and MADE = 0.0476 for (r, m) = (80, 5), which are slightly smaller than those of the Bayesian P-splines approach. The results show that the Bayesian free-knot splines approach is superior to the Bayesian P-splines approach in terms of both fit of the unknown nonparametric functions and computing time. Furthermore, we also compare the generalized method of moments estimator (GMME) of Du et al. [29] with the Bayesian MCMC estimator (BE) of our method. To evaluate the estimation of the nonparametric functions, we calculate the integrated squared bias (Bias) and the root integrated mean squared error (SSE). Table 2 reports the nonparametric estimation results for GMME and BE (only one replication with (ρ, σ²) = (0.5, 0.25) is displayed). The estimates improve as the sample size increases; the Bias of the BE estimates is slightly smaller than that of the GMME, and the SSE of the BE estimates is considerably smaller than that of the GMME under the same sample size, showing that BE outperforms GMME, although the latter also obtains reasonable estimates.

Application
We use the proposed model and estimation methods to analyze the well-known Sydney real estate data. A detailed description of the data set can be found in Harezlak et al. [62]. The data set contains a total of 37,676 properties sold in the Sydney Statistical Division in the calendar year of 2001, which is available from the HRW package in R.
We only focus on the last week of February to avoid the temporal issue, and there are 538 properties.
In this application, the house price (Price) is explained by four variables: average weekly income (Income), the level of particulate matter with a diameter of less than 10 micrometers recorded at the air pollution monitoring station closest to the house (PM10), lot size (LS), and the distance from the house to the nearest coastline location in kilometers (DC). Income and PM10 are taken to have a linear effect on the response Price, whereas LS and DC have nonlinear effects. Meanwhile, a logarithmic transformation is performed on all variables to alleviate the trouble caused by large gaps in the domain, and all variables are then transformed so that their marginal distributions are approximately standard normal. This motivates us to consider the PLASAR model

Price_i = ρ ∑_{l=1}^{n} w_{il} Price_l + α_1 Income_i + α_2 PM10_i + g_1(LS_i) + g_2(DC_i) + ε_i, (17)

where all variables denote their transformed versions. Regarding the choice of the weight matrix, we use the Euclidean distance between any two houses to calculate the spatial weight matrix [63]: the spatial weight w_{il} is a function of the distance between s_i and s_l, where s_i = (Lon_i, Lat_i) denotes the longitude and latitude of the ith location. We apply quadratic B-splines and assign the hyperparameters (λ, r_0, s_0², r_{τα0}, s²_{τα0}, r_{τβ_j0}, s²_{τβ_j0}) = (2, 1, 1, 1, 0.005, 1, 0.005) for j = 1, . . . , p in our computation. We gradually decrease or increase the tuning parameters σ_ρ and σ_{ξ_j} such that the acceptance rates for updating ρ and (k_j, ξ_j) are around 25% for j = 1, . . . , p.
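A distance-based, row-normalised weight matrix of the kind described can be sketched as follows. The inverse-distance kernel is an illustrative assumption, since the exact weight function is not reproduced above; only the use of Euclidean distance between locations s_i = (Lon_i, Lat_i) comes from the text.

```python
import numpy as np

def distance_weight_matrix(coords):
    """Row-normalised spatial weight matrix from pairwise Euclidean
    distances between locations (lon, lat); an inverse-distance kernel
    stands in for the application's (unspecified) weight function."""
    diff = coords[:, None, :] - coords[None, :, :]
    d = np.sqrt((diff ** 2).sum(-1))
    w = 1.0 / np.maximum(d, 1e-12)   # guard against division by zero
    np.fill_diagonal(w, 0.0)         # no self-neighbour
    return w / w.sum(axis=1, keepdims=True)
```

Each row sums to one, so ρ retains its interpretation as the strength of spatial autocorrelation under the stability condition |ρ| < 1.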
We generate 10,000 sampled values following a burn-in of 10,000 iterations and run the proposed Gibbs sampler five times with different initial states. Figure 6 plots the traces of some unknown parameters and of the nonparametric functions at grid points; it is obvious that the five parallel Markov chains mix well. We further calculate the potential scale reduction factor √R for each unknown parameter and for the nonparametric functions at 20 selected grid points, which are plotted in Figure 7. The result indicates that the Markov chains have converged within the first 10,000 burn-in iterations.

Table 3 lists the estimated parameters together with their SEs and 95% CIs. The estimate of ρ is 0.5548 with SE = 0.0307, implying a significant and positive spatial relationship in housing prices. We find that both covariates have significant effects on the housing price: the regression coefficient of Income is α̂_1 = 0.3269 > 0, indicating that Income has a positive effect on the housing price, while the regression coefficient of PM10 is α̂_2 = −0.0810 < 0, revealing that the housing price decreases as PM10 increases.

Figure 8 depicts the fitted functions of model (17) for the Sydney real estate data, together with their 95% CIs, which clearly display two nonlinear shapes. The curves show that g_1(z_1) has a local maximum of 0.6184 at around z_1 = 3.9198 and a local minimum of 0.1557 at around z_1 = −0.8237, and g_2(z_2) has a local minimum of −0.8224 at around z_2 = 2.1605. The results provide evidence that the significant effects of LS and DC on the housing price are nonlinear, S-shaped and U-shaped, respectively.

Summary
Spatial data are frequently encountered in practical applications and can be analyzed through the SAR model. To avoid some serious shortcomings of fully nonparametric models and to reduce the high risk of misspecification of traditional SAR models, we have considered PLASAR models for spatial data, which combine the partially linear additive model and the SAR model. In addition to spatial dependence between neighbors, the model captures the linear and nonlinear effects of the related variables. We specify priors for all unknown parameters, which lead to a proper posterior distribution. The posterior summaries are obtained via MCMC techniques: we have considered a fully Bayesian approach with free-knot splines to analyze the PLASAR model and designed a Gibbs sampler to explore the full conditional posterior distributions. To obtain a rapidly convergent algorithm, a modified Bayesian free-knot splines approach incorporating powerful MCMC techniques is employed. We have illustrated through a simulation study that the proposed model and estimation method perform satisfactorily in finite samples. The results show that the Bayesian estimator is efficient relative to the GMME, although the latter also obtains reasonable estimates. Finally, the proposed model and methodology are applied to analyze real data.
This article focuses only on symmetrical data and on the homoscedasticity of independent errors. Since real spatial data do not always meet these conditions, it would be fairly straightforward to extend the proposed model and methodology to deal with spatial errors and heteroscedasticity. While we use PLASAR models to assess the linear and nonlinear effects of the covariates on the spatial response, other models, such as partially linear single-index SAR models and partially linear varying-coefficient SAR models, can also be considered. Moreover, it would be interesting to develop a model selection method for deciding which covariates enter linearly or nonlinearly. We leave these topics for future research.

Data Availability Statement:
The data presented in this study are openly available in Reference [62].