Scalable model selection for spatial additive mixed modeling: application to crime analysis

A rapid growth in spatial open datasets has led to a huge demand for regression approaches accommodating spatial and non-spatial effects in big data. Regression model selection is particularly important for stably estimating flexible regression models. However, conventional methods can be slow for large samples. Hence, we develop a fast and practical model-selection approach for spatial regression models, focusing on the selection of coefficient types among constant, spatially varying, and non-spatially varying coefficients. A pre-processing approach, which replaces data matrices with small inner products through dimension reduction, dramatically accelerates the computation of model selection. Numerical experiments show that our approach selects the true model accurately and computationally efficiently, highlighting the importance of model selection in the spatial regression context. The present approach is then applied to open data to investigate local factors affecting crime in Japan. The results suggest that our approach is useful not only for extracting effective crime factors but also for predicting crime events. This scalable model selection will be key to appropriately specifying flexible and large-scale spatial regression models in the era of big data. The developed model selection approach is implemented in the R package spmoran.


Introduction
Regression modeling is widely used to investigate the factors of geographical phenomena, such as plague spread, species distribution, economic agglomeration, and crime rates. In crime analysis, regression modeling is used to study the influence of neighborhood affluence, race, unemployment rate, facilities such as liquor stores and stations, and other covariates on crime risk [1][2][3].
Nowadays, an increasing number of open datasets of crime statistics are available [4]. For example, the Tokyo Metropolitan Government (https://www.bouhan.metro.tokyo.lg.jp/opendata/index.html) has made crime statistics (2014-present), classified by crime type and minor municipal district, publicly available. Moreover, many of these crime statistics have fine geographical resolution, at the district level or other fine spatial scales, and they record thousands to tens of thousands of entries in each time period. For such large spatiotemporal data, a computationally efficient regression approach is very important.
In applied spatial analysis, the estimation and identification of linear and spatial effects on response variables are actively studied. For example, spatial econometric models are used to estimate linear effects in the presence of spatial dependence [5]. Gaussian process models have been extended to accommodate a wide variety of spatial effects in geostatistics [6]. Geographically weighted regression (GWR) [7] is used to estimate spatially varying coefficients (SVCs) on covariates [8]. Among spatial effects, we especially focus on SVC modeling, which allows for estimating the local determinants of crimes [2,9,10]. For instance, [2] found that affluence reduces crime in a suburban area of Portland, Oregon, whereas it increases crime in the city center. Understanding such local differences in crime occurrence is important when considering security measures against crimes.
Apart from spatial effects, the influence of covariates can also vary non-spatially depending on time, quantile, or other variables or events [11]. Unfortunately, when all possible effects are included in the model, over-parametrization occurs and the model becomes unstable. To balance model accuracy and complexity, model selection is crucially important. For example, in crime analysis, it is necessary to appropriately specify the key factors behind crimes. There are many model selection methods for SVC models [12][13][14] and other additive models that accommodate spatial and/or non-spatial effects [15,16]. Model selection is typically performed through iterations of model updating, including or excluding effects (e.g., SVCs) until convergence. For large samples, however, SVC model selection by iterative fitting is computationally demanding. For example, mixed/semiparametric GWR [12,17], which selects between constant coefficients and SVCs, requires a computational complexity of O(N²) per iteration [18], where N is the sample size and O(•) denotes the order. Although there are fast approaches for selecting spatial and/or non-spatial effects, they still iterate model-fitting steps with a computational complexity of O(N).
Given this background, this study develops a computationally efficient approach for selecting spatial and/or non-spatial effects under the framework of spatial additive mixed modeling [19,20]. This framework employs a pre-conditioning treatment to reduce the computational cost of parameter estimation, but the treatment has not yet been applied to model/effects selection. By extending the idea of [19,20], we develop a scalable approach for selecting both spatial and non-spatial effects. This method reduces the computational complexity of the iterative fitting steps such that the cost is independent of the sample size N.
The remainder of this paper is organized as follows. Section 2 introduces our model. Section 3 develops our model selection procedures, and Section 4 examines its performance through Monte Carlo simulation experiments. Section 5 applies the developed approach to crime modeling and forecasting and Section 6 concludes.

Model
We consider the following spatial additive mixed model:

y = Σ_{p=1}^{P} x_p ∘ β_p + ε,  ε ~ N(0, σ² I),  (1)

where ∘ is the element-wise product operator, y is an (N × 1) vector of response variables, N is the sample size, x_p is an (N × 1) vector of the p-th covariate, ε is an (N × 1) vector of disturbances with variance σ², 0 is a vector of zeros, and I is an identity matrix. β_p is an (N × 1) vector of coefficients describing the influence of the p-th covariate.
There are many specifications for β_p. The most basic specification is the constant β_p = b_p 1, where b_p is a coefficient and 1 is an (N × 1) vector of ones, as assumed in common linear regression models.
One key idea used in this study is to specify β_p by SVCs. For instance, [21] adopted the following specification:

β_p = b_p 1 + E γ_p^(s),  γ_p^(s) ~ N(0, σ_p^(s)² Λ(α_p)),  (2)

where E is an (N × L_p) matrix of the L_p eigenvectors corresponding to positive eigenvalues, which are called Moran eigenvectors¹; they are extracted from a doubly-centered spatial proximity matrix [23]. γ_p^(s) is a vector of random variables that acts as a Gaussian prior stabilizing the SVC estimation, and Λ(α_p) is a diagonal matrix of the positive eigenvalues scaled by the parameter α_p. The α_p and σ_p^(s) parameters determine the scale and standard error of the spatial process.

¹ One characteristic advantage of the Moran eigenvector approach is that the resulting SVC (β_p) is interpretable through the Moran coefficient, a diagnostic statistic of spatial dependence whose value is positive (negative) in the presence of positive (negative) spatial dependence. Specifically, when considering all the Moran eigenvectors with positive eigenvalues, E γ_p^(s) describes a positively dependent map pattern. In other words, Eq. (2) efficiently furnishes a positively dependent spatial process, which is dominant in many real-world cases [22]. The Moran coefficient value increases as α_p grows [20].
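To make the construction concrete, the sketch below computes Moran eigenvectors from point coordinates: build a spatial proximity matrix, double-center it, and keep the eigenvectors with positive eigenvalues. The exponential kernel and the range parameter `band` are common choices for the proximity matrix, not necessarily those used in the paper.

```python
import numpy as np

def moran_eigenvectors(coords, band=None):
    """Moran eigenvectors: eigenvectors with positive eigenvalues of a
    doubly centered spatial proximity matrix (a minimal sketch)."""
    n = coords.shape[0]
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    r = d.max() if band is None else band            # kernel range (assumption)
    C = np.exp(-d / r)                               # proximity matrix
    np.fill_diagonal(C, 0.0)
    M = np.eye(n) - np.ones((n, n)) / n              # centering operator
    lam, E = np.linalg.eigh(M @ C @ M)               # doubly centered C
    keep = lam > 1e-8 * lam.max()                    # positive eigenvalues only
    order = np.argsort(-lam[keep])                   # descending eigenvalues
    return E[:, keep][:, order], lam[keep][order]
```

Because the kept eigenvectors are orthogonal to the vector of ones, each column describes a mean-zero map pattern, and patterns with larger eigenvalues are more strongly positively dependent.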
Alternatively, β_p can be specified in terms of a non-spatially varying coefficient (NVC). This specification captures influence that varies with respect to the covariate x_p as follows:

β_p = b_p 1 + E_p^(n) γ_p^(n),  γ_p^(n) ~ N(0, σ_p^(n)² I),  (3)

where E_p^(n) is an (N × L_p) matrix of the L_p basis functions generated from x_p, and σ_p^(n)² denotes the variance of the non-spatial effects.
The following specification, which assumes both spatially and non-spatially varying coefficients (S&NVC) on all the covariates, is also possible:

β_p = b_p 1 + E γ_p^(s) + E_p^(n) γ_p^(n).  (4)

[20] showed that the S&NVC model is robust against spurious correlations, whereas naive SVC models tend to suffer from them [24].
In summary, the constant coefficient is given by a fixed coefficient (b_p), while the SVC, NVC, and S&NVC are specified by sums (linear combinations) of fixed and random effects. By substituting these specifications, the model Eq. (1) is formulated as

y = X b + Σ_{p=1}^{P} V_p u_p + ε,  (5)

where X = [x_1, …, x_P], b = [b_1, …, b_P]′, and "∘" is an operator multiplying a column vector element-wise with each column of a matrix. The matrices V_p and the variance parameters θ_p are defined in Table 1. Eq. (5) shows that our model is a linear mixed-effects model [25] with fixed effects b and random effects u_p.
Although this study focuses on these four specifications, other β_p functions have been proposed to represent linear effects, non-linear effects, group effects, and other effects, as outlined in [11]. Owing to its flexibility, the additive mixed model is now used in many applied studies [26,27].

Estimation
Among the estimation algorithms for spatial additive mixed models, we focus on the fast restricted maximum likelihood (REML) estimation of [19], which is scalable in both the sample size N and the number of effects P. The computational bottleneck is the iterative evaluation of the restricted log-likelihood ℓ_R(θ) of Eq. (1) (or Eq. (5)) required to numerically estimate the variance parameters θ ∈ {θ_1, ⋯, θ_P} (see Appendix A). To reduce this cost, [19] developed a sequential estimation of the {θ_1, ⋯, θ_P} parameters that applies the update

θ̂_p = arg max_{θ_p} ℓ_R(θ_p | θ̂_{−p}),  (6)

until convergence, where θ̂_{−p} denotes the current estimates of the variance parameters other than θ_p.
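The sequential update can be sketched as a coordinate-wise optimization. The sketch below assumes a generic black-box restricted log-likelihood `loglik` and a bounded scalar optimizer; it is not the closed-form update of [19], only an illustration of the iteration structure.

```python
from scipy.optimize import minimize_scalar

def sequential_reml(loglik, theta0, tol=1e-6, max_iter=50):
    """Coordinate-wise maximization of a restricted log-likelihood:
    update one variance parameter at a time, holding the others at
    their current estimates, and repeat until convergence."""
    theta = list(theta0)
    prev = loglik(theta)
    for _ in range(max_iter):
        for p in range(len(theta)):
            res = minimize_scalar(
                lambda v: -loglik(theta[:p] + [v] + theta[p + 1:]),
                bounds=(1e-8, 1e4), method="bounded")
            theta[p] = res.x
        cur = loglik(theta)
        if abs(cur - prev) < tol:
            break
        prev = cur
    return theta
```

Each pass solves P one-dimensional problems instead of one P-dimensional problem, which is what makes the cost grow only linearly in the number of effects.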
While the direct maximization of Eq. (6) still has the same computational burden, [19] developed the following procedure for the fast REML:
(I) Replace the data matrices {y, X, V_1, …, V_P}, whose dimensions depend on N, with their inner products, whose dimensions are independent of N.
(II) Estimate the variance parameters by iterating the following step until convergence: (II-1) evaluate the update Eq. (6) for each p using the inner products.
(III) Output the final model.
In this algorithm, the data matrices are replaced with their inner products before the iterative likelihood evaluation step. As a result, the computational complexity of the iterative likelihood evaluation to find θ̂_p in step (II-1) reduces to a cost independent of N [19]. In other words, after the pre-conditioning step (I), the fast REML is highly scalable in both the sample size N and the number of effects P. Owing to this property, the spatial additive mixed model Eq. (1) can be estimated efficiently even when N and P are very large.
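To illustrate why inner products suffice, the sketch below evaluates the quantities entering a Gaussian (restricted) likelihood for a single random-effect term via the Woodbury identity, using only matrices whose sizes are independent of N. It assumes unit error variance for brevity and is not the paper's exact algebra (given in its Appendix).

```python
import numpy as np

def likelihood_parts(theta, XtX, XtE, Xty, EtE, Ety, yty, L):
    """For y = Xb + Eu + e with u ~ N(0, theta*I_L), e ~ N(0, I_N),
    the covariance is V = I_N + theta*E E'.  By the Woodbury identity,
    V^{-1} = I_N - E (I_L/theta + E'E)^{-1} E', so every likelihood
    term can be built from inner products of size at most L or P."""
    A = np.linalg.inv(np.eye(L) / theta + EtE)                 # L x L
    XtVinvX = XtX - XtE @ A @ XtE.T                            # X' V^-1 X
    XtVinvy = Xty - XtE @ A @ Ety                              # X' V^-1 y
    ytVinvy = yty - Ety @ A @ Ety                              # y' V^-1 y
    # matrix determinant lemma: |V| = theta^L * |I_L/theta + E'E|
    logdetV = L * np.log(theta) + np.linalg.slogdet(np.eye(L) / theta + EtE)[1]
    return XtVinvX, XtVinvy, ytVinvy, logdetV
```

Once {X′X, X′E, X′y, E′E, E′y, y′y} are computed in a single O(N) pass, every subsequent likelihood evaluation touches only these small arrays.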

Model selection
As explained in Section 2, the coefficients in Eq. (1) can be constant, SVC, NVC, or S&NVC.
The complexity of the model depends considerably on the selection of the coefficient types. For instance, Eq. (1) has only P coefficient parameters if all the coefficients are assumed constant. In the SVC-based model, on the other hand, the number of coefficient parameters is Σ_{p=1}^{P} L_p, since the model utilizes L_p-dimensional Moran eigenvectors for each of its P coefficient representations. Too many parameters can lead to overfitting and overestimation of statistical significance, whereas too few parameters can cause underfitting. This issue can be remedied by choosing an optimal model that appropriately balances the number of parameters against model accuracy for the given dataset.
One difficulty is the large number of candidate models: there are 4^P model specifications for Eq. (1). For example, if P = 9, which we will assume later, there are 4^9 = 262,144 models. In practice, however, it is desirable to find or approximate the best model within seconds or minutes. Hence, this study develops a computationally efficient model selection approach.
We search for the best model by minimizing a cost function, which can be defined by the Akaike information criterion (AIC) or the Bayesian information criterion (BIC).
It is important to note that the constant, SVC, NVC, and S&NVC specifications share the same fixed effect (b_p), whereas their random effects differ from each other. In other words, we select among random effects.
In such a case, REML-based AIC and BIC are available for the model selection of linear additive mixed models [28]. While there are marginal and conditional AIC/BIC specifications, we focus on the marginal BIC for the following reasons:
- It is the most common specification for linear mixed-effects models [29], including spatial additive mixed models.
- Poor performance of conditional AIC/BIC-based model selection has been reported when two or more random effects are considered (see [30,31]), whereas [32] showed that conditional AIC/BIC performs well when comparing models with/without a single random effect.
- Although the marginal specification suffers from a theoretical bias, [31] showed that the influence of this bias on the model selection result is quite small.
Based on a preliminary analysis, we decided to use the REML-based marginal BIC, defined as

BIC = −2 ℓ_R(θ̂) + Q log N,

where Q is the number of fixed coefficients (P) plus the number of variance parameters in θ, in the crime analysis in Section 5.
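As a concrete reading of this criterion (under the convention that Q counts the P fixed coefficients plus all variance parameters; conventions for REML-based information criteria vary across the literature):

```python
import numpy as np

def marginal_bic(loglik_r, P, n_variance_params, N):
    """Marginal BIC of a REML-fitted model: -2*loglik + Q*log(N), with
    Q = (number of fixed coefficients) + (number of variance
    parameters).  A sketch of one common convention."""
    Q = P + n_variance_params
    return -2.0 * loglik_r + Q * np.log(N)
```

Because ℓ_R is evaluated from the pre-computed inner products, the criterion itself is also cheap to recompute for every candidate specification.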

Model selection procedures
This section proposes two practical model selection methods. The first incorporates model selection into the sequential REML estimation (see Section 2.2). To reduce the chance of being trapped in local optima, the second relies on a Monte Carlo (MC) simulation that iterates the sequential REML estimation. We call the former the simple selection method, which emphasizes simplicity and practicality, and the latter the MC selection method. Sections 3.2.1 and 3.2.2 explain these methods.

Simple selection method
The simple selection method embeds model selection steps in the sequential REML estimation. The procedure is as follows:
(a) Replace the data matrices {y, X, V_1, …, V_P} with their inner products, as in step (I) in Section 2.2.
(b) Perform the following calculation sequentially for each p ∈ {1, …, P}: estimate the p-th SVC by maximizing the restricted log-likelihood with respect to θ_p^(s), which is the subset of θ_p characterizing the SVC, given θ_{−p(s)}, the set of variance parameters excluding θ_p^(s) from θ; select the SVC if it improves the cost function value (e.g., BIC), and otherwise replace it with a constant; then estimate and select the p-th NVC in the same manner.
(c) Iterate step (b) until convergence.
(d) Output the final model.
While there are similar selection approaches [33], ours is distinctive because its computational complexity for model selection is independent of the sample size, owing to step (a), which reduces all the data matrices to their inner products. Owing to this drastic dimension reduction, the simple method is suitable for very large samples.
One problem is the large number of possible orderings of the pre-determined sequence over {1, …, P}. For example, if P = 9, there are 9! = 362,880 sequences, and some of them might yield a poor model selection result (i.e., a local optimum). Fortunately, unlike maximum likelihood estimation or cross-validation, REML tends not to suffer from local optima [34,35]. Section 4 examines whether this simple approach accurately selects or approximates the true model through Monte Carlo experiments.
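The steps above can be sketched as a single greedy pass, where `fit_bic` stands for a hypothetical routine that fits the model for a given specification (using the pre-computed inner products) and returns its BIC:

```python
def simple_selection(P, fit_bic):
    """Steps (a)-(d) in outline: visit covariates in a fixed order and,
    for each, keep an SVC and then an NVC only if the addition lowers
    the cost returned by `fit_bic` (a stand-in for the inner-product
    based REML fit; hypothetical interface)."""
    spec = [set() for _ in range(P)]       # empty set = constant coefficient
    best = fit_bic(spec)                   # all-constant baseline
    for p in range(P):                     # pre-determined sequence
        for effect in ("SVC", "NVC"):
            spec[p].add(effect)
            bic = fit_bic(spec)
            if bic < best:
                best = bic                 # keep the effect
            else:
                spec[p].discard(effect)    # revert to constant
    return spec, best
```

A coefficient selected as both SVC and NVC corresponds to the S&NVC type; one selected as neither remains constant.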

Monte Carlo (MC) selection method
To reduce the risk of falling into local optima, we randomly sample sequences over {1, …, P} and iterate the REML-based estimation given each sequence. Specifically, we propose the following model selection approach:
(A) Replace the data matrices {y, X, V_1, …, V_P} with their inner products.
(B) Iterate the following calculation G times using the inner products: (B-1) randomly sample a sequence {1′, …, P′} of the covariate indices; (B-2) perform the following calculation sequentially for each p′ ∈ {1′, …, P′}: (B-2a) estimate the p′-th SVC by maximizing the restricted log-likelihood; (B-2b) select the SVC if it improves the cost function value (e.g., BIC), and otherwise replace it with a constant; (B-2c) estimate the p′-th NVC by maximizing the restricted log-likelihood; (B-2d) select the NVC if it improves the cost function value (e.g., BIC), and otherwise replace it with a constant.
(C) Output the best of the G selected models in terms of the lowest cost function value.
As with the simple selection approach, the computational cost of the iterative step (B) is independent of the sample size. In addition, the iterations are easily parallelized. Thus, this is a computationally efficient model selection procedure.
Step (B) performs an MC simulation over the random sequences to obtain the distribution of the cost value. We call this approach the MC selection approach.
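Steps (A)-(C) can be sketched as follows, again with `fit_bic` standing in for a hypothetical routine that fits a given specification from the pre-computed inner products and returns its cost:

```python
import random

def mc_selection(P, fit_bic, G=20, seed=0):
    """MC selection in outline: repeat a greedy SVC/NVC selection pass
    over G randomly shuffled covariate orders and return the
    lowest-cost specification."""
    rng = random.Random(seed)
    best_spec, best_cost = None, float("inf")
    for _ in range(G):                     # step (B); easily parallelized
        order = list(range(P))
        rng.shuffle(order)                 # (B-1) random sequence
        spec = [set() for _ in range(P)]   # all-constant start
        cost = fit_bic(spec)
        for p in order:                    # (B-2a)-(B-2d)
            for effect in ("SVC", "NVC"):
                spec[p].add(effect)
                cand = fit_bic(spec)
                if cand < cost:
                    cost = cand            # keep the effect
                else:
                    spec[p].discard(effect)
        if cost < best_cost:               # step (C)
            best_spec, best_cost = spec, cost
    return best_spec, best_cost
```

Because each of the G passes is independent given the inner products, they can be distributed across workers with no shared state.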

Simulation experiments
Here, we examine whether the simple selection method accurately approximates the true model, or whether the MC selection method is needed to achieve accurate model selection, through comparative Monte Carlo experiments. We fit models with and without effects selection to synthetic data generated from the model Eq. (1) with Gaussian disturbances. The coefficient estimation accuracy is evaluated using the root mean squared error (RMSE), defined as

RMSE(β_p) = sqrt( (1 / (N · Iter)) Σ_{iter=1}^{Iter} Σ_{i=1}^{N} (β̂_{i,p}^(iter) − β_{i,p})² ),

where iter indexes the Monte Carlo iterations and β_{i,p} is the i-th element of β_p. Under these settings, Section 4.2 examines whether our approaches accurately select the model, while Section 4.3 compares them with alternative approaches.
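The RMSE above reads directly as code (with the estimates stacked over iterations):

```python
import numpy as np

def coef_rmse(beta_hat, beta_true):
    """RMSE of an estimated coefficient vector over Monte Carlo
    iterations: beta_hat has shape (Iter, N) and beta_true (N,),
    matching the definition in the text."""
    beta_hat = np.asarray(beta_hat)
    return float(np.sqrt(np.mean((beta_hat - np.asarray(beta_true)) ** 2)))
```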

Performance of model selection
In this experiment, we compare the following six models. As baseline models, we use the linear regression model (LM) and the model assuming SVCs on all coefficients (SVC model). As another baseline, we use the model assuming the true coefficient types as known (i.e., constant, SVC, and NVC coefficients on the respective covariates, as in the data-generating process; the true model). We also construct a full model assuming S&NVCs on all coefficients (S&NVC model). In addition, we prepare simple and MC-based S&NVC model selection approaches that select each coefficient among constant, SVC, NVC, and S&NVC using the simple approach and the Monte Carlo approach, respectively.

Here, we also compare our model selection approach with another commonly used model selection approach. Among the existing approaches, we selected the one implemented in the mgcv package in R (https://cran.r-project.org/web/packages/mgcv/index.html) for the following reasons: (i) mgcv is one of the most popular packages for additive mixed modeling; (ii) the effects selection procedure in mgcv is computationally highly scalable [36], making it a sensible benchmark for testing both the accuracy and computational efficiency of our approach.
The following models are compared: LM, the S&NVC model, the simple S&NVC model selection, another S&NVC model estimated with mgcv (Mgcv), and the double-penalty-based Mgcv model selection [36]. The double-penalty approach is the default model selection method in the mgcv package. Roughly speaking, it imposes additional penalty parameters to select effects, on top of the usual penalty parameters determining the smoothness of each effect in additive models. [36] showed the superior accuracy of this approach over alternatives. For faster computation, we estimate Mgcv and Mgcv with the double-penalty model selection using the bam function in the mgcv package, which implements a fast REML [37]. As in the previous section, we assume the same data-generating model.

In summary, the analysis results demonstrate that our approach selects effects accurately and computationally efficiently, even in comparison with state-of-the-art approaches.

Application to crime analysis

We model the bicycle theft and shoplifting cases per area (km²) per quarter using the following model, where y_{c,i,t} is the number of cases of the c-th crime per area in the i-th district in the t-th quarter. Crime events tend to be repeated in neighborhoods where previous events have occurred, which is known as near-repeat victimization [38]. Therefore, we include the crime density in the previous quarter, y_{c,i,t−1}, as a covariate.

Coefficient estimation results
The (conditional) adjusted R-squared value is 0.914 for the bicycle theft model and 0.928 for the shoplifting model, indicating that both models fit the data accurately. Tables 2 and 3 summarize the estimated coefficients and their statistical significance. For bicycle theft, the variables Repeat, RepOther, and Popden are positively statistically significant at the 1% level across districts, and the same is true for Dpopden except in two districts. Based on these results, we can infer that bicycle theft tends to be repeated in densely populated areas. The selected coefficient type for Repeat is S&NVC, while those for RepOther, Popden, and Dpopden are NVCs.
Although, to the best of our knowledge, NVCs have rarely been considered in spatial modeling, this result indicates the importance of considering them. Fpopden, UnEmp, and Univ, whose coefficients are estimated to be constant, are statistically insignificant (Table 3). Among the group effects, only the seasonal effect is selected. In summary, nighttime population density, daytime population density, repeat tendency, and season are significant determinants of bicycle theft.
For shoplifting, Repeat and RepOther are again positively significant, suggesting that shoplifting is repeated in the same area. Dpopden is also positively significant, whereas Popden is insignificant. This suggests that shoplifting increases in central areas where people concentrate during the daytime, mainly for business, whereas it does not necessarily increase as the number of residents grows. Fpopden, UnEmp, Univ, and the two group effects are insignificant. For the significant variables, the selected coefficient types are as follows: S&NVC for Repeat; NVC for RepOther and Dpopden; and constant for the others. In sum, repeat tendency and daytime population are significant determinants of shoplifting.

For bicycle theft, the estimated coefficients on Repeat suggest that theft tends to be less repeated in some areas; in the middle area, by contrast, the coefficients increase along the Chuo line, a major railway line (Figure 6). The repetitive tendency is especially strong near major stations on the line.
As for shoplifting, the coefficients on Repeat locally increase around stations. The value is especially high near the Shinjuku and Shibuya stations (Figure 6), which are centers of major commercial areas. Shoplifting tends to be repeated in these areas.

Application to crime prediction
Crime prediction for the near future is important for preventing crimes. Here, we apply our model (SNMSel) to predict the number of bicycle theft and shoplifting cases in the first quarter of 2019, using the model trained with the data from 2017 and 2018. The accuracy is compared with kernel density estimation (KDE), a popular crime prediction method in this field [39]. The ks package (https://cran.r-project.org/web/packages/ks/index.html) in R was used for the KDE model estimation and prediction. The plug-in selector of [40] is used to optimize the kernel bandwidth.
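As a reference point for what the KDE baseline computes, a plain 2-D Gaussian KDE with a scalar bandwidth can be written as follows (the paper instead optimizes the bandwidth with the plug-in selector of the R ks package; the scalar bandwidth h here is a simplification):

```python
import numpy as np

def kde2d(events, points, h):
    """Evaluate a 2-D Gaussian kernel density estimate fitted to past
    event coordinates `events` (n x 2) at locations `points` (m x 2),
    with scalar bandwidth h.  Predicted crime intensity is
    proportional to this density."""
    d2 = ((points[:, None, :] - events[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * h ** 2)).sum(axis=1) / (len(events) * 2.0 * np.pi * h ** 2)
```

Evaluating this density at district centroids yields the KDE-based prediction that the SNMSel forecasts are compared against.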

Concluding remarks
This study develops a model selection approach, i.e., a coefficient-type selection approach, for spatial additive mixed models. The simulation experiments suggest that the present simple approach accurately selects the true model. Moreover, even though model selection usually increases computational time, our model selection dramatically reduces it. This property will be valuable for estimating complex models that involve many candidate effects from large samples. The crime analysis demonstrates that our approach provides intuitively reasonable results. Our approach highlights and quantifies numerous hidden effects behind geographic phenomena. These prominent features will be useful for a wide spectrum of analyses, such as crime analysis, ecological studies, and environmental studies.
However, there are many issues to be addressed. First, we need to incorporate spatiotemporal variations in regression coefficients or residuals. Spatially and temporally varying coefficients (STVC) have been studied by [41][42][43], among others. However, the discussion on selecting SVC, STVC, or other coefficient types is still limited in spatial statistics. The spatio-temporal process, which comprises pure-spatial, pure-temporal, and spatio-temporal interaction processes, is much more complex than purely spatial variation. A well-designed selection of SVC, STVC, or other coefficient types will be very important for improving the accuracy and stability of spatio-temporal modeling. The consideration of dynamic temporal behavior is also important therein [44,45]. Second, it is important to consider a larger number of coefficients, e.g., more than 100. Sparse modeling approaches, such as the LASSO [46] and the smoothly clipped absolute deviation (SCAD) penalty [47], will be helpful for estimating SVCs, NVCs, or other varying coefficients while avoiding overfitting.

Appendix.1 Restricted log likelihood function of the spatial additive mixed model
The fast REML maximizes the restricted likelihood ℓ_R(θ), defined by integrating the coefficients {b, u} out of the full likelihood. It is formulated as

ℓ_R(θ) = −(1/2) [ log|V(θ)| + log|X′V(θ)^{−1}X| + y′P(θ)y + c ],  (A1)

where V(θ) is the covariance matrix of y implied by Eq. (5), P(θ) = V(θ)^{−1} − V(θ)^{−1}X(X′V(θ)^{−1}X)^{−1}X′V(θ)^{−1}, c is a constant, and "′" represents the matrix transpose.
The additive mixed model is readily estimated by first (i) estimating θ̂ by maximizing ℓ_R(θ) and then (ii) estimating the fixed and random coefficients {b̂, û} by substituting θ̂ into Eq. (8). One major difficulty is the computational cost of estimating θ̂ in step (i), because the cost increases exponentially with the number of parameters in θ. The computational time can be disappointingly long even for 10 variance parameters. Unfortunately, 10 or more variance parameters are typically required to flexibly capture the influence of many covariates.

Appendix.2 Details of the model selection approaches
This appendix explains the computational details of the simple model selection method.
Then, the details of the MC method are explained.
In step (a) of the simple method (see Section 3.2.1), the data matrices {y, X, V_1, …, V_P} are replaced with their inner products, i.e., {y′y, X′y, X′X, V_p′y, V_p′X, V_p′V_q} for p, q ∈ {1, …, P}. The restricted log-likelihood maximized in step (b) (see Section 3.2.1) is rewritten by substituting these inner products into Eq. (A1) [19]. The resulting expressions, Eqs. (A4)-(A7), do not include any matrix whose size depends on N. Thus, in step (b), the likelihood is evaluated with a computational cost that is independent of the sample size.