Variable Selection of Spatial Logistic Autoregressive Model with Linear Constraints

In recent years, spatial data widely exist in various fields such as finance, geology, environment, and natural science. These data collected by many scholars often have geographical characteristics. The spatial autoregressive model is a general method to describe the spatial correlations among observation units in spatial econometrics. The spatial logistic autoregressive model augments the conventional logistic regression model with an extra network structure when the spatial response variables are discrete, which enhances classification precision. In many application fields, prior knowledge can be formulated as constraints on the parameters to improve the effectiveness of variable selection and estimation. This paper proposes a variable selection method with linear constraints for the high-dimensional spatial logistic autoregressive model in order to integrate the prior information into the model selection. Monte Carlo experiments are provided to analyze the performance of our proposed method under finite samples. The results show that the method can effectively screen out insignificant variables and give the corresponding coefficient estimates of significant variables simultaneously. As an empirical illustration, we apply our method to land area data.


Introduction
Spatial econometrics, developed to deal with spatial correlation and spatial heterogeneity of data, has become a standard analytical tool for spatial data and has begun to enter the mainstream of econometrics. Spatial models have a long history in econometrics. Much progress has been made in the estimation of spatial models, please refer to the special works of Anselin (1988) [1]; LeSage and Pace (2009) [2]. Nowadays, the spatial autoregressive (SAR) model developed by Cliff and Ord in 1973 [3] is the most studied and widely applied modeling method for dealing with spatial correlation. Additionally, the model can be widely applied in many fields including social networks (Ma et al. (2020) [4]), real estate (Dubin (1999) [5]; Osland (2010) [6]), crime incidents (Kakamu et al. (2008) [7]), sampled network data (Zhou et al. (2017) [8]), artificial neural networks (Wang et al. (2019) [9]), and geospatial data (Khalfi et al. (2021)). We can consider the spatial autoregressive model as an ordinary regression model that additively takes the spatial spillover effect of the dependent variable into account. Thus, this model can model both traditional covariates and network dependence simultaneously in a convenient manner.
However, most of the conventional spatial analyses were designed to address the problem of estimation or prediction based on continuous observations. In the case of discrete or binary variables, for instance, in pathological diagnosis, there are only two possible outcomes, positive (denoted as 1) or negative (denoted as 0). The logistic regressive model is a popular method to deal with discrete binary responses. As one of the most popular classification methods, the logistic regressive model has been studied extensively. Essentially, the model assumes that an individual's class label is influenced by a set of predictors. In practical use, observational data can be taken from different places. In other words, the data generated can be cross-section data. The cross-section data involve several locations. Therefore, it is possible that spatial effect influences the model. In the presence of spatial effects, the usual logistic regression does not sufficiently model the data. Thus, the spatial logistic regression model will be better to model data that contains spatial effects [10]. Theoretical economists and practical researchers are interested in the spatial logistic autoregressive model, which investigates how covariates affect the correlation response of spatial discrete values. The study of spatial logistic regression models, which use categorization technologies to model spatial data, is a relatively new area of spatial econometrics, and research in this area is still quite restricted. Calabrese and Elkink (2014) [11] introduced the binary spatial autoregressive model for the first time. Hilwin Nisa et al. (2019) [12] proposed the spatial logistic regression model which is obtained by the logistic regression model and spatial binary regression model.
High-dimensional spatial data appear frequently in many fields of social life and scientific research, such as biomedical imaging, X-ray tomography, finance, and geoscience. In recent years, a variety of regression methods have been proposed to model high-dimensional data in spatial statistics. For example, Piribauer et al. (2016) [13] proposed a Bayesian variable selection procedure in a spatial autoregressive model. A penalized quasi-maximum likelihood method was put forth by Liu et al. (2018) [14] for variable selection in the spatial autoregressive model. Model selection in spatial autoregressive models with varying coefficients was studied by Wei et al. (2019) [15]. Variable selection for the spatial autoregressive models with a diverging number of parameters was considered by Xie et al. (2020) [16]. Cai et al. (2020) [17] considered variable selection and estimation for a high-dimensional spatial autoregressive model. Li et al. (2020) [18] proposed a variable selection method for the partially linear varying coefficient spatial autoregressive model. More recently, Li et al. (2021) [19] proposed a new variable selection method for a higher-order partially linear spatial autoregressive model with a diverging number of parameters. Liu et al. (2021) [20] studied variable selection for the spatial autoregressive model with autoregressive disturbances. Song et al. (2021) [21] proposed a new robust variable selection method with an exponential squared loss for the spatial autoregressive model.
The above methods mainly focus on the variable selection of continuous response variables based on the penalty regression technique. Penalized regression techniques shrink the insignificant coefficients to 0, which has attracted increasing attention to high-dimensional data analysis, such as least absolute shrinkage and selection operator (LASSO) (Tibshirani (1996) [22]), smoothly clipped absolute deviation (SCAD) (Fan and Li (2001) [23]), and minimax concave penalty (MCP) (Zhang (2010) [24]) for mean regression. LASSO minimizes the residual sum of squares subject to the sum of the absolute value of the coefficients being less than a constant, for which it tends to produce some coefficients that are exactly 0 and hence give interpretable models. However, LASSO has some bias in the estimation of the coefficients. SCAD attempts to mitigate this bias and produce nearly unbiased estimates for large coefficients while still retaining the continuous penalty for sparsity. MCP provides the convexity of the penalized loss in sparse regions to the greatest extent given certain thresholds for variable selection and unbiasedness. However, penalized spatial logistic autoregression has been rarely studied. For high-dimensional spatial data, there are several problems using spatial logistic autoregressive modeling such as endogeneity and including too many variables in the model. First, the spatial lag term in the model will make it endogenous. In the presence of endogeneity, the ordinary least squares (OLS) method can produce biased and inconsistent parameter estimates. Second, as the dimensionality of the variables increases, redundant variables will bring challenges to the estimation in the modeling process.
Furthermore, as the penalized spatial logistic regression does not account for any prior information, and then we can consider how to incorporate prior information into the modeling procedure. Statistical models with linear constraints on variables have gained widespread applications recently. These constraints on regression coefficients reflect the prior information and structure, which can help us to find the optimal parameters with the given information. To incorporate the prior information in the modeling process, we add linear constraints to the penalized spatial logistic autoregressive model. As far as we are aware, no previous research has investigated penalized spatial logistics autoregressive models with linear constraints. Thus, in this paper, we will study a penalized spatial logistic autoregressive model with linear constraints. For the spatial logistic regression model (4), we estimate β by solving the following optimization where L(β, ρ) is the likelihood function, ρ ∈ R, β ∈ R p , n is the sample size, p λ (·) is the penalty function, and C ∈ R q×p , d ∈ R q , E ∈ R s×p , and f ∈ R s are determined concretely according to the experience and knowledge of practical problems. In this paper, our contribution is summarized as follows: 1.
Propose a penalized spatial logistics autoregression with linear constraints. These constraints contain the prior information and structure, which can improve the robustness of the model and help us to find the optimal parameters. Thus, the model performs better in the variable selection and estimation under a high-dimensional data space.

2.
Provide the formula for degrees of freedom, and then construct the model selection criteria to select the optimal tuning parameter.

3.
Simulation results show that the performance of the proposed method is more explanatory and reasonable than penalized quasi-maximum likelihood without linear constraints, and an empirical application illustrates the usefulness of the methods in practical work. The effectiveness of the penalized quasi-maximum likelihood with linear constraints algorithm is demonstrated.
The following is how the paper is arranged. In Section 2, we introduce the general form of the problem we study and present our penalized quasi-maximum likelihood without linear constraints algorithm. The formula for degrees of freedom using Stein's unbiased risk estimation (SURE) lemma is derived in detail in Section 3. Some Monte Carlos results on the performance of the proposed method are discussed in Section 4. Section 5 shows our method for analyzing real data sets. Section 6 presents the conclusions.

Spatial Autoregressive Model (SAR)
Think of a network that has n nodes. The matrix A ∈ R n×n can be used to characterize how the network is structured. Define a ij = 1 when node i follows node j, and a ij = 0 otherwise. With a n × 1 vector of observations on the dependent variable Y and a n × p matrix of regressors X, we can establish the following SAR [3] model: where ρ ∈ R is network autocorrelation coefficient and β = (β 1 , ..., β p ) T ∈ R p is the regression coefficient vector. W is the row-normalized version of A such that w ij = a ij / ∑ n j=1 a ij . Let θ = (ρ, β T ) T ∈ R p+1 and denote = ( 1 , ..., n ) T as the error vector of independent disturbances with mean zero and finite variances σ 2 . Denote Then the SAR model's log-likelihood function is shown as follows:

Spatial Logistic Regression Model
The spatial logistic regression model is a combination of the spatial autoregressive model and the logistic regression model. Binary classification or multi-classification are both acceptable response variables for a logistic regression model. However, we solely take into account the case of the binary of the response variable.
The model (2) can be written as: where MV N denotes the multivariate normal distribution. For simplicity, we use y * instead of Y where H = (I − ρW ) −1 is an (n × n) matrix, and define e = (I − ρW ) −1 ε as an (n × 1) vector. Latent variable y * has a binary category which is defined as variable y: Therefore, the probability of P(y i = 1) and P(y i = 0) is: When we assume the mean value of e is 0 and the variance is Ω, then we get where Ω ii is the diagonal element of Ω, which is formed as The same idea applies to P(y i = 0). The parameter estimation of spatial logistic regression can be obtained by maximum likelihood estimation (MLE). The parameter is estimated by maximizing the likelihood function of random variable y i , which follows a Bernoulli distribution: Then, the natural log(ln) is used to transform the likelihood function as follows: To estimate β and ρ, we use the maximization formula (10), then define

Variable Selection with Linear Constraints
In many application fields, prior knowledge can be formulated as constraints on parameters to improve the effectiveness of variable selection and estimation. In this section, we consider the variable selection of the spatial logistic regression model with linear constraints.
We will concentrate on the variable selection for the spatial logistic regression model with linear constraints, that is where ρ ∈ R, β ∈ R p , n is the sample size, and C ∈ R q×p , d ∈ R q , E ∈ R s×p , and f ∈ R s are determined concretely according to practical knowledge and experience. p λ (•) is the penalty function, and the shrinkage degree of the penalty is determined by the tuning parameter λ in the penalty term. There are many popular choices for the penalty function p λ (•) in the statistics literature: Penalty functions can provide estimators with three properties which include unbiasedness, sparsity, and continuity according to Fan and Li (2001) [23].
LASSO is not unbiased, and MCP calculation is relatively complex. Fan and Li (2001) [23] demonstrated the oracle properties for the SCAD in the variable selection aspect, and pointed out that the LASSO penalty does not possess the oracle properties. Compared with ridge regression, the SCAD penalty method reduces the prediction variance of the model. Moreover, the SCAD penalty method outperforms the LASSO penalty ones, which reduces the deviation of parameter estimation. Thus, we choose to use the SCAD penalty here.

Selection of the Tuning Parameter
We decided to choose the SCAD [23] penalty, relying on the analyses mentioned above. The penalty function is defined as: where λ ≥ 0 and a > 2 are tuning parameters. Here, a is usually taken to be 3.7, a fact that is elucidated in the work of Fan and Li (2001) [23]. At the same time, they have also shown that λ determines the shrinkage strength of parameter estimation. In this paper, the selection of tuning parameter λ by Bayesian information criterion (BIC) is also related to degrees of freedom. The number of degrees of freedom measures the number of effective parameters in the regression model and the complexity of the model. It plays an important role in model assessment and selection. There are different ways to measure degrees of freedom. Assume that y follows a distribution y ∼ (µ, σ 2 ), where µ is the true mean and σ 2 is the variance. Ye(1998) [25] and Efron(2004) [26] defined the number of degrees of freedom as whereμ(y) =ŷ = X * β is the fitted response for y ∈ R n . Under the framework of Stein's unbiased risk estimation (SURE) theory (Stein (1981) [27]), cov(y i ,μ i ) can be estimated by σ 2 E ∂μ i ∂y i , ifμ(y) is continuous and almost differentiable. Then the expression for degrees of freedom of fittedμ can be calculated as to apply it. We need to assume that the response is normally distributed, that is, y ∼ N µ, σ 2 I n .
Degrees of freedom are used effectively while selecting the tuning parameter λ. In this study, the model selection criteria will be based on the Bayesian information criterion (BIC) (Schwarz (1978) [28]): Since we have calculated the correspondingρ(λ) andβ(λ) at each tuning parameter λ, we can use the corresponding fitted valuet(λ) and degrees of freedom d f (t(λ)) to select the optimal λ that minimizes BIC(λ).

Simulation Experiment Design
In the simulation experiment, we test the performance of the model through Monte Carlo simulation. The random sample is generated by model (2.1) combined with model (2.7), in which the covariate is considered when the (q + 3)-dimensional normal distribution with zero mean and covariance matrix σ ij , where σ ij = 0.5 |i−j| . Therefore, X is an n × (q + 3) matrix. In the following simulation, we set the number of samples n ∈ {60, 90, 120}, and the number of insignificant covariates q ∈ {5, 10, 25}. In this paper, we show the cases of q = 5 and q = 10 in the simulation results.
To simplify the calculation, the regression coefficients are set to 3, 2, 1.6, 0 q T , where 0 q is a q-dimensional zero vector. The response variable is given by the following formula: Then use the following formula to convert the response variable into a category variable: Thus, we can obtain the binary classification of the response variable Y. For purpose of confirming the robustness of the model, the error terms ε i s are independently generated from the following two distributions: (a) normal distribution ε i ∼ N 0, σ 2 I n , denoted as ε 0 and (b) Gaussian mixture distribution ε i ∼ 0.5N −1, 2.5 2 + 0.5N 1, 0.5 2 , denoted as ε 1 . σ 2 is generated by the uniform distribution on the interval [σ 1 − 0.1, σ 1 + 0.1], where σ 1 ∈ {1, 2}. In the simulation experiment, we set σ = 1.5.
In order to verify that the effect of the model with linear constraints is better, we will compare it with the model without constraints. In one case, we can set the constraints as: Obviously, we can find C , d , E, and f .

Evaluation Indicators
According to the above simulation experiments, we set the number of Monte Carlo repetitions at 1000. We define the following three indicators to evaluate the performance of variable selection in different settings.
Correct: the average number of coefficients of the true zeros correctly set to zero; Incorrect: the average number of coefficients of the true nonzeros incorrectly set to zero; ME: the mean error between the true value and the estimate, which is defined by:

Simulation Results
Tables 1 and 2 show the results of models with linear constraints without linear constraints, respectively. The constrained model is recorded as "Const", and the unconstrained model is recorded as "Unconst". The results in Table 1 clearly show that the spatial logistic model performs better with linear constraints, which also confirms the effectiveness of our model. Most significantly, when ρ 1 = 0.8, the error of the models with constraints and without constraints are very different, which shows that when the spatial effect is strong the constraints can greatly improve the accuracy of model parameter estimations. Moreover, we find that the effect of the model tends to become bad by increasing the network autocorrelation coefficient ρ 1 , which indicates the importance of the spatial effect. Moreover, by setting two types of errors, we observe that the model has good robustness. Additionally, in most cases, the model with an error term of ε 1 performs better than the one with an error term of ε 0 . Similarly, with the increase in sample size n, the incorrect rate of variable selection and the estimation error both decrease. This situation is in line with our prediction of the effect of the model. Moreover, it can be seen in Table 1 that the mean error is minimum when n = 120, ρ 1 = 0.2, and ε = ε 1 , which also confirms our analysis above.
In Table 2, by increasing the dimension to q = 10, we find that with the increase in the network autocorrelation coefficient ρ 1 , the effect of the model becomes bad. Compared to the case of q = 5, we find an increase in the mean error of estimation. After analysis, the possible reason is that the proportion of data size to dimension is bad, that is, the dimension of the sample is higher than the sample size. suppose the network complexity increases, it will have a certain negative impact on the effect of variable selection. For high-dimensional samples, the model results can be optimized by increasing the sample size. Simultaneously, according to the simulation results, compared with the unconstrained case, the spatial logistic model has stronger robustness, higher accuracy, and lower estimation error rate.
We compare the variable selection using the SCAD penalty and LASSO penalty under constraints. The simulation results are shown in Table 3. We can clearly observe that in the case where ρ 1 = 0.2 and ρ 1 = 0.5, the performance of the SCAD penalty is significantly better than the LASSO penalty, which is shown in higher correct selection rate and lower estimation error. As the network autocorrelation coefficient increases to ρ 1 = 0.8, the correct rate of variable selection of the SCAD penalty is higher than the LASSO penalty, while the incorrect rate of the LASSO penalty is lower than the SCAD penalty. Additionally, the estimation error between them is not much different. The reason may be that the tuning parameter λ in the LASSO penalty is too large and the shrinkage strength is stronger.

Real Data Example
In this section, we provide a real-world example to demonstrate the performance of the variable selection procedure proposed in this paper for spatial logistic regression models with linear constraints.

The Land Area Utilization Data
Land area utilization is analyzed by the spatial logistic model. The data set is different types of land area data from 48 states in the United States from 1954 to 2012 (recorded every five years). The dependent variables are binary, with "1" denoting a low land utilization rate, which means that most of the land has not been properly developed, and "0", denoting a high land utilization rate, which means that most of the land has been efficiently developed and exploited. As for the independent variables, there are eight properties, which are Cropland used for crops, Cropland used for pasture, Cropland idled, Grassland pasture and range, Forest-use land grazed, Land in rural transportation facilities, Land in urban areas, and Other idle land (shown in Table 4).

Variable Selection and Estimation
For the above land area utilization data sets, we constructed a spatial logistic autoregressive model. We use the land utilization rate as a response variable, and take eight variables, Cropland used for crops, Cropland used for pasture, Cropland idled, Grassland pasture and range, Forest-use land grazed, Land in rural transportation facilities, Land in urban areas, and Other idle land, as the independent variables.
According to theoretical knowledge, the idle area of cropland and other idle land areas have a significant impact on the probability of classification results. Through the fitting of the model, the following two cases are considered: the first is the parameter estimation without constraints, and the other is the parameter estimation with linear constraints, as shown in Tables 5 and 6.
According to the results of parameter estimation based on the spatial logistic model, it is found that the performance of variable selection is not obvious under unconstrained conditions. Among them, "Forest-use land grazed" and "Land in rural transportation facilities" have little impact on land use efficiency, which can be almost ignored. However, "Other idle land" has a great influence on the classification effect. Considering the greater relationship between land use area and idle land area, the performance of model selection is greatly improved with constraints. According to the table below, it is found that "Cropland idle" and "Other idle land" have a great impact on the classification results, and the fitting parameters of other attributes are very small and can be ignored.

Conclusions
In the paper, we obtain a spatial logistic model from the spatial autoregressive model (SAR) and logistic regression model. In order to improve the accuracy of the model, we take the prior information into account, and finally, propose a variable selection method with linear constraints for the spatial logistic model. According to the simulation results, by comparing the constrained and unconstrained variable selection models, we find that the performance of variable selection is more stable with the increase in sample size in the case of limited samples. When we fix the sample size at a constant value, the performance of the model tends to improve with the increase in network complexity. At the same time, the model has strong robustness to noise. In order to verify the superiority of the SCAD penalty, we compare the performance of the SCAD and LASSO penalties in the case of the linear constraint model and find that the SCAD penalty has a better effect on variable selection.
In practical application, most data show the characteristics of small sample size and high dimensions. For purpose of verifying the wide adaptability of the model, we verify that the proposed model can be effectively applied to the data set of high-dimensional and small samples through simulation experiments. However, we find that in this case, when the network complexity is very high, the performance effect of the model is not very good. The sample size being too small might be the cause of this.