Selection of the Bandwidth Matrix in Spatial Varying Coefﬁcient Models to Detect Anisotropic Regression Relationships

: The commonly used Geographically Weighted Regression (GWR) ﬁtting method for a spatial varying coefﬁcient model is to select a bandwidth h for the geographic location ( u , v ), and as-sign the same weight to the two dimensions. However, spatial data usually present anisotropy. The introduction of a two-dimensional bandwidth matrix not only gives weight from two dimensions separately, but also increases the direction of kernel smoothness. The adaptive bandwidth matrix is more ﬂexible. Therefore, in this paper, a two dimensional bandwidth matrix is introduced into the spatial varying coefﬁcient model for parameter estimation. Through simulation experiments, the results obtained under the adaptive bandwidth matrix are compared with those obtained under the global bandwidth matrix, indicating the effectiveness of introducing the adaptive bandwidth matrix.


Introduction
The spatial varying coefficient model proposed by Brunsdon et al. (1996) [1] is a widely used and significant model in non-parametric statistics. It is a non-parametric smoothing method with the distance function between observation points as the weight.
The work of Brunsdon et al. (1998) [2] and Fotheringham et al. (1998) [3] shows the role of the Geographically Weighted Regression (GWR) method in studying the nonstationarity of spatial data regression relationships, and the effectiveness of the method was demonstrated through actual data. Since the two-dimensional region is larger than the one-dimensional region, it is easy to produce boundary effects. Wang et al. (2008) [4] extended the local linear fitting method from the varying coefficient model to the spatial varying coefficient model, and proposed the local linear GWR method. Recently, the backfitting technique has been successfully applied to the GWR models for dealing with the multiscale issue by manipulating the bandwidth in the kernel function (Fotheringham et al. (2017) [5], Leong and Yue (2017) [6], Yu et al. (2020) [7]). The kernel smoothing method is an effective method for solving nonparametric problems, in which the selection of the bandwidth h is more important than the selection of the kernel function itself. For the selection of bandwidth, the cross-validation method (CV) and its variants are widely used in practice, such as the generalized crossvalidation method (GCV), or the AICc method to select the optimal bandwidth. If the estimated curve is smooth, the global bandwidth can be selected; while if the estimated curve has a complex structure, the adaptive bandwidth can be considered. There are two important studies on adaptive bandwidth: one is the K-nearest neighbor method proposed by Loftsgaarden and Quesenberry (1965) [8]; the second is the adaptive kernel density estimation proposed by Breiman et al. (1977) [9]. Subsequently, ref. [10] proposed using the square root method to select the adaptive bandwidth, that is, for any dimension, the bandwidth in the form h i ∝ f (x i ) −1/2 can be selected. Fan and Gijbels (1995) [11] used a data-driven method to select the bandwidth; based on the estimated curvature, they proposed a new method for estimating the approximate deviation and variance of local polynomial fitting. Tian (2019) [12] used the cutting-edge knowledge of statistics to research the four major contemporary non-parametric statistics in the selection of bandwidth.
Many studies are limited to one-dimensional cases, while non-parametric problems will have more important applications in higher dimensions. The bandwidth matrix obtained under high-dimensional data not only affects the smoothness, but also affects the smoothing direction. For high-dimensional data, there are mainly the Plug-in insertion method, SCV and other methods for selecting the bandwidth matrix (Gramacki (2018) [13], Chacón and Duong (2018) [14]). Among them, the Plug-in insertion method is more stable for selecting bandwidth matrix H. Wand and Jones (1994) [15] used the Plug-in method for multi-dimensional data to select a constrained bandwidth matrix. Duong and Hazelton (2003) [16] used the Plugin method to select a binary unconstrained bandwidth matrix. Subsequently, Chacón and Duong (2010) [17] improved it by selecting unconstrained bandwidth matrix for high-dimensional data. In addition, the Bayesian method is also applied to the field of bandwidth selection. There are also many ways to select high-dimensional bandwidths by using the prior information from the data. Brewer (2000) [18] proposed a new method for adaptive bandwidth in univariate kernel density estimation based on likelihood cross-validation and Bayesian graphical model analysis. Zhang et al. (2006) [19] took the bandwidth matrix elements as the parameters, and the posterior density was obtained by the Likelihood Cross-Validation criterion. The optimal window width was estimated by MCMC algorithm. Zougab et al. (2014) [20] used Bayesian estimation method to select adaptive bandwidth matrix, and so forth. For multi-dimensional data, besides bandwidth matrix, the product kernel is usually used to solve high-dimensional problems. Belaid et al. (2018) [21] applied the adaptive bandwidth selected by the Bayesian method to the high-dimensional discrete kernel. Gramacki and Gramacki (2017) [22] used the FFT method to solve some problems of the bandwidth matrix. Sain et al. (1994) [23] used the CV method to select multiple bandwidths by the form of the product kernel. Wu et al. (2007) [24] regard the local bandwidth as the product of the local bandwidth factor and the global bandwidth, calculate the local window width factor with the clustering analysis algorithm, and obtain the multiple adaptive bandwidth. However, the kernel form of the multivariate product is essentially similar to the diagonal matrix. Xu (2009) [25] extended the bandwidth h of the coefficient function in the varying coefficient model to multidimensional, and theoretically tried to give a diagonal bandwidth matrix, where the values of diagonal elements were consistent. Wand and Jones (1993) [26] believe that the unconstrained bandwidth matrix is more general than the diagonal bandwidth matrix, and is applicable to a wider range. Unconstrained matrices can include many more smooth directions.
In many practical fields, such as geography, ecology, and meteorology, the data obtained usually contain geographic location information, and this kind of data is called spatial data. Spatial effects mainly come from spatial dependence (that is, spatial correlation) and spatial heterogeneity (that is, spatial non-stationarity). The spatial data are usually not in the same quantity distribution in all directions of geographical position, and the data distribution is not uniform, showing anisotropy. At present, there is no good way to deal with this phenomenon. The local linear geographically weight regression method is usually adopted to solve the fitting problem of the spatial variable coefficient model. In this method, the distance function between observation points is usually used as the weight value, and the same window width h is selected for the geographical position (u, v), that is, the data of two dimensions are given the same weight as isotropy. Considering the spatial anisotropy and dependence of the two variables (u, v), this paper introduces the two-dimensional Gaussian kernel function into the coefficient function estimation of the spatial variable coefficient model, and uses the binary unconstrained bandwidth matrix H to select different bandwidths from the two dimensions of geographical location, u and v. This can increase the direction of the smooth kernel, which can also explain the anisotropy. Furthermore, the adaptive bandwidth matrix is studied, and the coefficient estimates of the spatial variable coefficient models of the binary bandwidth matrix and the adaptive window width matrix are obtained respectively. Through simulation experiments, it is shown that the introduction of adaptive bandwidth matrix is more effective and flexible than the global bandwidth matrix.

Spatial Varying Coefficient Model
The spatial varying coefficient model is to embed the data with the geographical location information into the varying coefficient model, which can not only avoid the dimensional disaster problem in fitting, but can also well reflect the spatial non-stationary and heterogeneous characteristics of the regression relationship. The specific form of the spatial varying coefficient model is as follows: where (Y i ; X i1 , X i2 , · · · , X ip ) represents the observed value of the dependent variable and independent variable at the geographic location(u i , v i ), (i = 1, 2, · · · , n), β j (u, v)(j = 1, 2, · · · , p), is a function of geographic location, and ε i (i = 1, 2, · · · , n) is the i.i.d. error term, which satisfies E(ε i ) = 0, Var(ε i ) = σ 2 . The model allows a spatially varying interception by setting x i1 = 1, i = 1, 2, · · · , n. Brunsdo et al. [1] explicitly gave the specific form of the spatial varying coefficient model for the first time and proposed a method using the distance function between observation points as the weight, namely the geographically weighted regression method, which essentially belongs to the Nadaraya-Watson kernel estimation method. The boundary effects exists in this method, while the boundary of the two-dimensional geographic region is larger than that of the one-dimensional region, and the data of the boundary region is more prone to distortion. The local polynomial fitting method can automatically correct the boundary effect (Mei and Wang (2012) [27]). Wang et al. (2008) [4] extended the local linear fitting method from the varying coefficient model to the spatial varying coefficient model, and presented the locally linear GWR method. In order to avoid the boundary effects, the local linear GWR method is used for model fitting in this paper.
For a given location (u 0 , v 0 ) in the studied area, the estimatesβ j (u 0 , v 0 ) of β j (u, v) (j = 1, 2, · · · , p) at (u 0 , v 0 ), and the matrix notation as follows: where I p is the p-order identity matrix and 0 p is the p-order zero matrix, and Y = Y 1 , Y 2 , · · · , Y n T is the observations vector of the response variable. The design matrix at is a kernel function. The kernel functions used in this paper are all Gaussian kernel functions. For the traditional GWR method, the specific form of the one-dimensional Gauss kernel function is as follows: where, d ij represents the Euclidean distance between (u i , v i ) and (u j , v j )(i = 1.2, · · · , n). Specifically, the set of weights at The specific form of the two-dimensional Gauss kernel function is as follows: The set of weights at ( The spatial varying coefficient model involves two variables, u and v, so d = 2, t i is a binary vector about u and v, H is a binary bandwidth matrix about (u, v), the distance used here is formally known as the Mahalanobis distance.
The estimate of the dependent variable Y is: where L is the p × p hat matrix. The hat matrix used for parameter estimation using the one-dimensional kernel function has the following form: The hat matrix used in parameter estimation using two-dimensional kernel function has the following form:

Selection of Bandwidth Matrix
(1) The bandwidth matrix of the unary variable is H = h 2 , Here, the square of the window width is used to consider the problem. In this paper, the generalized cross validation method (GCV) is used to select the one-dimensional bandwidth in traditional GWR method: where L is the hat matrix, tr(·) represents the trace of the matrix, Y i represents the value of the ith dependent variable,Ŷ(h) = L(h)Y. To select the optimal window width h 0 , in practical application, a range is first given in accordance with a certain step size, the corresponding GCV(h) value is calculated, and the h 0 that makes GCV(H) reach the minimum is chosen, that is, GCV(h 0 ) = min h>0 GCV(h), then h 0 is the selected optimal bandwidth.
(2) The form of the binary unconstrained window width matrix is as follows: For the selection of the binary unconstrained bandwidth matrix H, the Plug-in method proposed by Duong and Hazelton (2003) [16] is used for reference in this paper. The Plug-in method mainly relies on replacing the unknown function by the estimator in approximation, so that the asymptotic expression AMISE (Asymptotic Mean Integrated Squared Error) of the estimated MISE (Mean Integrated Squared Error) can reach the minimum value. Here, t can be regarded as a set of data about u and v, and f(t) is the target density function. The specific expressions of MISE and AMISE are as follows: where Then Ψ 4 is a 3 × 3 matrix, the specific form is as follows: To obtain the optimal bandwidth matrix H, the key is to estimate Ψ 4 . The elements in Ψ 4 are ψ r = R 2 f (r) (x) f (x)dx, r = (r 1 , r 2 ), |r| = r 1 + r 2 , r 1 , r 2 are non-negative integers, f ((r)) (t) is the r-order partial derivative with respect to (t 1 , t 2 ). Firstly, the original data are standardized, and the test bandwidth matrix G is used to replace H to estimate each element Ψ r in Ψ 4 , separately: Calculate the variance and deviation of related G. Let G = g 2 I. Since g is also unknown, calculate the variance and deviation of g to obtain the AMSE related to g: e i represents the unit vector whose ith value is 1, and sum over all AMSEψ r (g) to get SAMESΨ 4 .
To take the g which minimizes SAMES Ψ 4 . Here, the final bandwidth matrix H is generally obtained by the two-stage method. The selection of the optimal bandwidth matrix can be implemented in the "ks" package of the R language.
(3) As for the selection of the adaptive bandwidth matrix, the Bayesian method proposed by Zougab et al. (2014) [20] was adopted in this paper. Under the Bayesian framework, the prior distribution is selected as the conjugate prior distribution, and the explicit expression of the adaptive window width matrix is obtained by combining with the square loss function. Here, t can be viewed as a set of data about u and v. The specific calculation steps are as follows: the kernel estimation is carried out by leave-one out method estimation.f The kernel function generally uses Gaussian kernel: At this time, its prior distribution can take its conjugate prior distribution, that is, the inverse Wishard prior distribution, is the classic Gamma function. Under the square loss function, the explicit expression of a posteriori estimate can be given, which can avoid the consumption of a lot of operation time during sampling calculation. For each bandwidth matrix H i , the form of a posteriori probability is as follows: The posterior distribution π(H i |x i ) can be expressed as follows: For each j term in the posterior distribution, where j = i is multiplied by something with unit 1, where the quantity with unit 1 can be expressed as: Then, we can get: where r is the degree of freedom, given r = n 2/(d+4) + d, Q is the covariance matrix of the sample. For the adaptive bandwidth matrix, the specific form of the two-dimensional Gauss kernel function is as follows: The set of weights at (u binary vector about (u, v), H i is a binary bandwidth matrix about (u, v). The corresponding hat matrix L has the following forms:

Simulation Experiment
In order to reflect the spatial scale difference and interdependence between (u, v), The model used to generate spatial data is as follows: where the independent variable X 1 is a column vector of all 1, that is, the observation values X 11 , X 21 , · · · , X n1 are all 1, and the observation values X 12 , X 22 , · · · , X n2 are independent values generated from the uniform distribution U(0,2), the error 1 , 2 , · · · , n are n independent and identically distributed random numbers that obey the normal distribution N(0, 2 2 ), where the sample size is taken as n = 100, let β 1 (u, v) = 1 6 (u + v), β 2 (u, v) = 1 3 u, and the real surfaces are shown in Figure 1. The coefficient functions in Equation (19) are calibrated by the Formula (2), and the optimal bandwidth in the kernel function is selected in three ways: the GCV-based method in Equation (8), the Plug-in-based matrix H method in Equation (11) and the adaptive bandwidth matrix H method in Equation (16). We get the GCV-based op-timal bandwidth h 0 = 35.09, and the Plug-in-based optimal bandwidth matrix H is the smaller the value of the MSE, the better the fitting effect. In order to compare the performance of the above three estimation methods, keep other quantities fixed, re-sample data with a sample size of n from N(0, 2 2 ), do N = 200 repeated experiments, calculate the MSE of Y, β 1 , β 2 , and take the average of the N replications. The numerical results are shown in the Table 1. Additionally, Figures 2 and 3 display the estimated coefficients and the estimation residuals ofβ 1 (u, v) andβ 2 (u, v) with the GCV-based bandwidth, the bandwidth matrix and the adaptive bandwidth matrix method, respectively.  Table 1 that the MSE of regression coefficients estimation obtained by using the bandwidth matrix H and the adaptive bandwidth matrix are smaller than that obtained by using GCV-based bandwidth h for 200 replications. The parameter estimation MSE of β 1 (u, v) and β 2 (u, v) obtained by the adaptive bandwidth matrix is less than that of the parameter estimation of β 1 (u, v) and β 2 (u, v) obtained by the direct use of the bandwidth matrix. In other words, the estimation result obtained by the adaptive bandwidth matrix is better than that obtained by the use of the bandwidth matrix. That means the performance of two dimensions' unconstrained bandwidth matrix for geographical location (u, v) is better than that of selecting one bandwidth for two variables (u, v) in the spatially varying coefficient model. In addition, when the bandwidth matrix H obtained by the Plug-in method is used for parameter estimation, the running time of the program is shorter than that obtained by the GCV-based method.
From the residual graph of the coefficient function in Figures 2 and 3, the estimated residuals are dispersed by selecting the same bandwidth h for u and v, and the estimated residuals are relatively concentrated around zero by selecting the bandwidth matrix H, although there are a few points with large differences. The residuals boxplots of β 1 (u, v) in Figure 2 and β 2 (u, v) in Figure 3 show that the estimated residuals of β(u, v) are more concentrated by using the adaptive bandwidth matrix than the global bandwidth matrix. The reason for the occurrence of large bias points may be related to the non-diagonal elements of the bandwidth matrix H, or an adaptive bandwidth matrix can be introduced. The introduction of the adaptive bandwidth matrix can reduce the outliers. It further proves the effectiveness of introducing a two-dimensional adaptive bandwidth matrix into the spatial varying coefficient model. The above results show that, in the spatial varying coefficient model, considering the spatial anisotropy and correlation of spatial coordinates (u, v), the estimated coefficient surfaces obtained by selecting bandwidth matrix H is more accurate than that obtained by selecting the same bandwidth h for the two spatial coordinates (u, v). In addition, compared with GCV-based method, using the Plug-in-based method to select the bandwidth matrix can reduce the number of iterations in model fitting and improve computational efficiency.

Empirical Research
The Fractional Vegetation Cover (FVC) quantifies the denseness of vegetation and reflects the growth trend of vegetation. FVC plays an important role in reflecting the process of climate change and is an important indicator of ecological environmental change and is a useful parameter for many environmental and climate-related applications. In this research, the datasets of variables FVC, temperature and precipitation in Yili, Xinjiang were collected to study the spatial varying relationships between FVC and climate factors.

Data Sources
Yili Kazakh Autonomous Prefecture was the selected research area, located in northwest China's Xinjiang Uygur Autonomous Region, surrounded by mountains, and vulnerable to influence of airflow around the Tianshan mountains, the Wusun mountain, Kequqin and the Nalati mountain. Inside the Yili river near the Yili river valley, there is a humid climate and abundant rainfall; it is arguably the wettest region in Xinjiang. In this paper, the data on air temperature, precipitation and FVC in Yili, Xinjiang in 2011 adopted by Feng et al. (2016) [28] were used, with a total of 835 corresponding data points.

Modelling
The specific form of the spatial varying coefficient model in this section is as follows: where Y i represents the observation value of the ith vegetation coverage index, X i1 is always equal to 1, X i2 represents the observation value of the ith annual average precipitation, and X i3 represents the observation value of the ith annual average temperature, β 1 (u i , v i ) represents the benchmark vegetation coverage index, β 2 (u i , v i ) represents the influence intensity of the annual average precipitation at the ith point on the vegetation coverage index, β 3 (u i , v i ) represents the intensity of influence of the annual average temperature at the ith point on the vegetation coverage index, i represents the residual. For the estimation of the coefficient function, the local linear GWR method is used for model fitting and parameter estimation.  (3) and (4), a one-dimensional kernel function and a two-dimensional kernel function, are constructed to fit the model respectively and to obtain the MSE of the estimated value of the dependent variable and the running time of the program. The results in Table 2 show that the MSE(Y) obtained by the two-dimensional kernel function with the bandwidth matrix is smaller, the estimation is more accurate, the program running time is greatly shortened, and the running efficiency is greatly improved. The estimated values obtained from the bandwidth matrix H in Table 3 show that: the coefficient function β 2 (u, v) and the coefficient function β 3 (u, v) of annual mean temperature and precipitation both show obvious spatial and regional differences. There was a positive correlation between precipitation and FVC, and the regions with more precipitation had higher vegetation coverage. There was also a positive correlation between temperature and FVC. The results in Figure 4 show that, in the spatial varying coefficient model, the estimated value obtained by selecting the binary bandwidth matrix H for the kernel function is roughly consistent with the real vegetation coverage index, and the residual is small. The vegetation coverage index in most areas of Yili is relatively high, while the vegetation coverage index in the border area and the northwest is slightly lower, which is consistent with the actual situation. The reasons for this phenomenon may be related to temperature, precipitation, terrain, and so forth. It should be noted that, in addition to intuitively exploring spatial varying regression relationships by summary tables or images of the estimated coefficient functions, the statistical inference methods can be used to test the significance of spatial non-stationary relationships, including the residual sum of square based statistics with an F-distribution (Leung et al., 2000 [29]; Mei et al., 2006 [30]) and the bootstrap methods (Wei and Qi 2012 [31]; Mei et al., 2016 [32]).

Results of Bandwidth Matrix and Adaptive Bandwidth Matrix
According to the adaptive bandwidth matrix H method discussed in Section 2, Equation (16), the estimated result of the coefficient function can be found in Table 4, and Figure 5 shows the prediction of the FVC and the residuals with the adaptive bandwidth matrix. By observing the estimated values of the coefficient functions in Table 4, it is easy to see that the values of β(u, v) are mostly positive numbers. For β 2 (u, v) and β 3 (u, v), most of the values are positive, that is, precipitation and temperature have a positive effect on the FVC. By comparing Tables 3 and 4 we can find that, compared to the range of parameter estimates obtained with a fixed bandwidth matrix, the range of parameter estimates obtained by the adaptive bandwidth matrix is smaller.

Summary and Prospect
(1) When there is a scale problem and a correlation between two variables, that is, the smoothness between the two variables is different and there is anisotropy in space, it is more reasonable to consider the two variables in selecting the binary bandwidth matrix than to select the same bandwidth fitting value between the two variables. At the same time, as an intermediary between a one-dimensional univariate and a high-dimensional multivariate, a bivariate has a smoother direction than a univariate kernel, and is easy to visualize. This also makes it possible to generalize the spatio-temporal variable coefficient model to the unconstrained bandwidth matrix of the spatio-temporal variable coefficient model or an even higher dimensional bandwidth matrix.
(2) Compared with the GCV method, using the Plug-in method to select the bandwidth matrix can reduce the number of iterations in the process of selecting the optimal bandwidth h, and can shorten the iteration time of program operation and improve the efficiency of calculation.
(3) By observing the residuals of the parameter estimates, we find that the residual range of the parameter estimates obtained under the bandwidth is small, while most of the residual values of the parameter estimates obtained under the bandwidth matrix are very concentrated. It will have points with larger residual values, which may be related to non-diagonal elements. We can also try to add an adaptive bandwidth matrix, which is worth continuing to study.  Data Availability Statement: The data in the paper can be requested by email to the corresponding author.

Conflicts of Interest:
The authors declare no conflict of interest.