Article

Scalable Model Selection for Spatial Additive Mixed Modeling: Application to Crime Analysis

1 Singular Perturbations Co. Ltd., 1-5-6 Risona Kudan Building, Kudanshita, Chiyoda, Tokyo 102-0074, Japan
2 Department of Statistical Data Science, Institute of Statistical Mathematics, 10-3 Midori-cho, Tachikawa, Tokyo 190-8562, Japan
* Author to whom correspondence should be addressed.
ISPRS Int. J. Geo-Inf. 2020, 9(10), 577; https://doi.org/10.3390/ijgi9100577
Submission received: 7 August 2020 / Revised: 18 September 2020 / Accepted: 28 September 2020 / Published: 30 September 2020

Abstract

A rapid growth in spatial open datasets has led to a huge demand for regression approaches accommodating spatial and non-spatial effects in big data. Regression model selection is particularly important to stably estimate flexible regression models. However, conventional methods can be slow for large samples. Hence, we develop a fast and practical model-selection approach for spatial regression models, focusing on the selection of coefficient types that include constant, spatially varying, and non-spatially varying coefficients. A pre-processing approach, which replaces data matrices with small inner products through dimension reduction, dramatically accelerates the computation speed of model selection. Numerical experiments show that our approach selects a model accurately and computationally efficiently, highlighting the importance of model selection in the spatial regression context. Then, the present approach is applied to open data to investigate local factors affecting crime in Japan. The results suggest that our approach is useful not only for selecting factors influencing crime risk but also for predicting crime events. This scalable model selection will be key to appropriately specifying flexible and large-scale spatial regression models in the era of big data. The developed model selection approach was implemented in the R package spmoran.

1. Introduction

Regression modeling is widely used to investigate the factors behind geographical phenomena, such as plague spread, species distribution, economic agglomeration, and crime rates. For instance, [1,2,3] apply regression models to study the influence of neighborhood affluence, race, unemployment rate, facilities such as liquor stores and stations, and other covariates on crime risk. Nowadays, an increasing number of open datasets of crime statistics are available [4]. For example, the Tokyo Metropolitan Government (https://www.bouhan.metro.tokyo.lg.jp/opendata/index.html) has made crime statistics (2014–present), classified by crime type and minor municipal district, publicly available. Moreover, many of these crime statistics have a fine geographical resolution, at the district level or finer spatial scales, and they record thousands to tens of thousands of data entries in each time period. For such large spatiotemporal data, a computationally efficient regression approach is very important.
In applied spatial analysis, the estimation and identification of linear and spatial effects on response variables are actively studied. For example, spatial econometric models are used to estimate linear effects in the presence of spatial dependence [5]. Gaussian process models have been extended to accommodate a wide variety of spatial effects in geostatistics [6]. Geographically weighted regression (GWR) [7] is used to estimate spatially varying coefficients (SVCs) on covariates [8]. Among spatial effects, we especially focus on SVC modeling, which allows the local determinants of crimes to be estimated [2,9,10]. For instance, [2] found that affluence reduces crime in a suburban area of Portland, Oregon, whereas it increases crime in the city center. Understanding such local differences in crime occurrence is important when considering security measures against crimes.
Apart from the spatial effects, the influence of covariates can also vary non-spatially, depending on time, quantile, or other variables or events [11]. Unfortunately, when all possible effects are included, the model becomes over-parameterized and unstable. To balance model accuracy and complexity, model selection is crucially important. For example, in crime analysis, the key factors behind crimes need to be specified appropriately.
There are many model selection methods for SVC models [12,13,14] and for other additive models that accommodate spatial and/or non-spatial effects [15,16]. Model selection is typically performed by iteratively updating the model through inclusion/exclusion of effects (e.g., an SVC) until convergence. For large samples, however, SVC model selection by iterative fitting is computationally demanding. For example, mixed/semiparametric GWR [12,17], which selects between constant coefficients and SVCs, requires a computational complexity of O(N²) per iteration [18], where N is the sample size and O(·) denotes the order. Although there are fast approaches for selecting spatial and/or non-spatial effects, they still iterate model-fitting steps with a computational complexity of O(N).
Given this background, this study develops a computationally efficient approach for selecting spatial and/or non-spatial effects under the framework of spatial additive mixed modeling [19,20]. This framework employs a pre-conditioning step that reduces the computational cost of parameter estimation, but the step has not been applied to model/effect selection. By extending the idea of [19,20], we develop a scalable approach for selecting both spatial and non-spatial effects. This method reduces the computational complexity of the iterative fitting steps so that the cost is independent of the sample size N.
The remainder of this paper is organized as follows. Section 2 introduces our model. Section 3 develops our model selection procedures, and Section 4 examines its performance through Monte Carlo simulation experiments. Section 5 applies the developed approach to crime modeling and forecasting and Section 6 concludes our discussion.

2. Spatial Additive Mixed Model

2.1. Model

We consider the following spatial additive mixed model:
$$ y = \sum_{p=1}^{P} x_p \circ f_p + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2 I) \qquad (1) $$
where ∘ is the element-wise product operator, y is an (N × 1) vector of response variables, N is the sample size, x_p is a vector of the p-th covariate, ε is a vector of disturbances with variance σ², 0 is a vector of zeros, and I is an identity matrix. f_p is a vector of coefficients describing the influence of the p-th covariate.
There are many specifications for f p . The most basic specification is the constant f p = b p 1 , where b p is a coefficient and 1 is a vector of ones, which is assumed in common linear regression models.
One key idea used in this study is to specify f_p as an SVC. For instance, [21] adopted the following specification:
$$ f_p = b_p \mathbf{1} + \frac{\tau_p^{(s)}}{\sigma} E^{(s)} \Lambda^{\alpha_p} u_p^{(s)}, \qquad u_p^{(s)} \sim N(\mathbf{0}, \sigma^2 I) \qquad (2) $$
E^(s) is an (N × L_p) matrix of L_p eigenvectors corresponding to positive eigenvalues, called Moran eigenvectors (MEs); they are extracted from a doubly-centered spatial proximity matrix [22]. Λ is an L_p × L_p diagonal matrix whose elements are the positive eigenvalues. u_p^(s) is an (L_p × 1) vector of random variables that acts as a Gaussian prior stabilizing the SVC estimates. The α_p and τ_p^(s) parameters determine the scale and standard error of the spatial process. One characteristic advantage of the ME approach is that the resulting SVC (f_p) is interpretable through the Moran coefficient, a diagnostic statistic of spatial dependence whose value is positive (negative) in the presence of positive (negative) spatial dependence. Specifically, when all the MEs with positive eigenvalues are considered, (τ_p^(s)/σ) E^(s) Λ^{α_p} u_p^(s) describes a positively dependent map pattern. In other words, Equation (2) efficiently furnishes a positively dependent spatial process, which is dominant in many real-world cases [23]. The Moran coefficient value increases as α_p grows.
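To make the construction of E^(s) and Λ concrete, the following R sketch extracts Moran eigenvectors from a doubly-centered proximity matrix; the coordinates and the exponential-decay proximity used here are illustrative assumptions, not data from this study.

```r
# Minimal R sketch: Moran eigenvectors from a doubly-centered proximity matrix.
# Coordinates and the exponential-decay proximity are illustrative assumptions.
set.seed(1)
N      <- 200
coords <- cbind(runif(N), runif(N))        # sample sites
D      <- as.matrix(dist(coords))          # Euclidean distances
C      <- exp(-D); diag(C) <- 0            # spatial proximity matrix
M      <- diag(N) - matrix(1 / N, N, N)    # centering operator
MCM    <- M %*% C %*% M                    # doubly-centered proximity
eig    <- eigen(MCM, symmetric = TRUE)
keep   <- eig$values > 1e-8                # keep positive eigenvalues only
E_s    <- eig$vectors[, keep]              # Moran eigenvectors E^(s)
Lambda <- diag(eig$values[keep])           # eigenvalue matrix Lambda
```

In practice, the meigen() and meigen_f() functions of the spmoran package provide this extraction (the latter with an approximation for large N).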
The f p function can also be determined in terms of a non-spatially varying coefficient (NVC). This specification captures the influence varying with respect to the covariate x p as follows:
$$ f_p = b_p \mathbf{1} + \frac{\tau_p^{(n)}}{\sigma} E_p^{(n)} u_p^{(n)}, \qquad u_p^{(n)} \sim N(\mathbf{0}, \sigma^2 I) \qquad (3) $$
where E_p^(n) is an N × L_p matrix of L_p basis functions generated from x_p, and τ_p^(n) determines the variance of the non-spatial effects.
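As a minimal illustration of E_p^(n), the sketch below builds a polynomial basis from a covariate; polynomials are one possible choice (and the basis used in the experiments of Section 4), and the covariate here is simulated.

```r
# Minimal R sketch: a non-spatial basis E_p^(n) generated from the covariate x_p.
# Orthogonal polynomials are used for illustration; other bases are possible.
set.seed(1)
x_p   <- rnorm(200)                 # illustrative covariate
E_nvc <- poly(x_p, degree = 10)     # N x 10 polynomial basis matrix
```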
The following specification, which assumes both spatially and non-spatially varying coefficients (S&NVC) on all the covariates is also possible:
$$ f_p = b_p \mathbf{1} + \frac{\tau_p^{(s)}}{\sigma} E^{(s)} \Lambda^{\alpha_p} u_p^{(s)} + \frac{\tau_p^{(n)}}{\sigma} E_p^{(n)} u_p^{(n)}, \qquad u_p^{(s)} \sim N(\mathbf{0}, \sigma^2 I), \quad u_p^{(n)} \sim N(\mathbf{0}, \sigma^2 I) \qquad (4) $$
Ref. [24] showed that the S&NVC model is robust against spurious correlations, whereas naive SVC models tend to have spurious correlations [25]. Finally, Figure 1 illustrates the coefficients we will consider in this study.
In summary, the constant f_p is given by a fixed coefficient (b_p), while the SVC, NVC, and S&NVC are specified by sums (linear combinations) of fixed and random effects. By substituting these specifications, model Equation (1) is formulated as follows:
$$ y = X b + \tilde{E}(\Theta) U + \varepsilon, \qquad U \sim N(\mathbf{0}, \sigma^2 I), \quad \varepsilon \sim N(\mathbf{0}, \sigma^2 I) \qquad (5) $$
X = [x_1, …, x_P], b = [b_1, …, b_P]′, U = [u_1′, …, u_P′]′, and Θ ≡ {θ_1, …, θ_P}, where ′ denotes the matrix transpose. Ẽ(Θ) = [x_1 ∘ E_1V_1(θ_1), …, x_P ∘ E_PV_P(θ_P)], where a ∘ B is an operator multiplying the column vector a element-wise with each column of B. The matrices E_p and V_p(θ_p) and the parameter θ_p are defined in Table 1. Equation (5) shows that our model is formulated as a linear mixed-effects model [26] with fixed effects Xb, random effects Ẽ(Θ)U, and disturbances ε.
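The operator a ∘ B can be read as multiplying every column of B element-wise by a. A minimal R illustration, with simulated placeholders for x_p and E_pV_p(θ_p), is:

```r
# Minimal R sketch of the operator a o B: each column of B is multiplied
# element-wise by the vector a. a and B are simulated placeholders.
set.seed(1)
N <- 100; L <- 5
a  <- rnorm(N)                      # e.g., the covariate x_p
B  <- matrix(rnorm(N * L), N, L)    # e.g., E_p V_p(theta_p)
aB <- a * B                         # x_p o E_p V_p(theta_p): column-wise product
```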
Although this study focuses on four specifications, other f p functions have been proposed to represent linear effects, non-linear effects, group effects, and other effects, as outlined in [11]. Due to its flexibility, the additive mixed model is now used in many applied studies [27,28].

2.2. Estimation

Among the estimation algorithms for spatial additive mixed models, we focus on the fast restricted maximum likelihood (REML) estimation of [19], which is scalable for both the sample size N and the number of effects P. The computational bottleneck here is the iterative evaluation of the restricted log-likelihood loglik_R(Θ) of Equation (1) (or Equation (5)) to numerically estimate the variance parameters Θ ≡ {θ_1, …, θ_P} (see Appendix A). To reduce this cost, [19] developed a sequential estimation of the {θ_1, …, θ_P} parameters by applying Equation (6) until convergence.
$$ \hat{\theta}_p = \arg\max_{\theta_p} \; loglik_R(\theta_p \mid \Theta_{-p}) \qquad (6) $$
While the direct maximization of Equation (6) still has the same computational burden, [19] developed the following procedure for the fast REML:
(I)
Replace the data matrices {y, X, E_1, …, E_P}, whose dimensions depend on N, with their inner products, whose dimensions are independent of N.
(II)
Using the inner products, iterate the following calculations sequentially for p ∈ {1, …, P}:
(II–1)
Estimate θ̂_p by maximizing loglik(θ_p | Θ̂_{−p}) with Θ_{−p} ≡ {θ_1, …, θ_{p−1}, θ_{p+1}, …, θ_P}.
(II–2)
Go to (III) if the likelihood value converges. Else, return to (II-1).
(III)
Output the final model.
In this algorithm, the data matrices are replaced with their inner products before the iterative likelihood evaluation step. As a result, the computational complexity of the iterative likelihood evaluations needed to find θ̂_p in step (II–1) reduces to O(L_p³) [19]. In other words, after the pre-conditioning step (I), the fast REML is highly scalable for both the sample size N and the number of effects P. Owing to this property, the spatial additive mixed model of Equation (1) can be estimated efficiently even when N and P are very large.
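A minimal R sketch of the pre-conditioning step (I) is given below; the data are simulated placeholders, and the random-effect design matrices stand in for the terms x_p ∘ E_p. The point is that none of the stored objects has a dimension that grows with N.

```r
# Minimal R sketch of step (I): replace N-dimensional data matrices with small
# inner products. All data here are simulated placeholders.
set.seed(1)
N <- 5000; L <- 20; P <- 3
y <- rnorm(N)
X <- cbind(1, matrix(rnorm(N * P), N, P))                     # fixed-effect design
Elist <- lapply(1:P, function(p) matrix(rnorm(N * L), N, L))  # stand-ins for x_p o E_p
Z <- do.call(cbind, Elist)                                    # N x (P*L) random-effect design

M00 <- crossprod(X)       # X'X
M0p <- crossprod(X, Z)    # X'(x_p o E_p) blocks
Mpp <- crossprod(Z)       # (x_p o E_p)'(x_q o E_q) blocks
m0  <- crossprod(X, y)    # X'y
mp  <- crossprod(Z, y)    # (x_p o E_p)'y
myy <- sum(y^2)           # y'y
# None of these objects has a dimension growing with N, so the iterative REML
# updates of step (II) can be evaluated at a cost independent of N.
```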

3. Model Selection

3.1. Introduction

As explained in Section 2, the coefficients in Equation (1) can be constant, SVC, NVC, or S&NVC. The complexity of the model depends considerably on the selected coefficient types. For instance, Equation (1) has only P coefficient parameters if all the coefficients are assumed constant. In the case of the SVC-based model, on the other hand, the number of coefficient parameters is ∑_{p=1}^{P} L_p, since the model uses L_p-dimensional Moran eigenvector terms for each of its P coefficients. Too many parameters can lead to overfitting and overestimation of statistical significance, whereas too few parameters can cause underfitting. This issue can be remedied by choosing a model that appropriately balances the number of parameters against the data size in terms of model accuracy.
One difficulty is the large number of candidate models: there are 4^P model specifications for Equation (1). For example, if P = 9, which we will assume later, there are 4^9 = 262,144 candidate models. In practice, however, it is desirable to find or approximate the best model within seconds or minutes. Hence, this study develops a computationally efficient model selection approach. We search for the model minimizing a cost function, which can be defined by the Akaike information criterion (AIC) or the Bayesian information criterion (BIC).
It is important to note that constant, SVC, NVC, and S&NVC share the same fixed effect ( b p 1 ), whereas their random effects differ from each other. In other words, we select random effects. In such a case, REML-based AIC and BIC are available for the model selection of linear additive mixed models [29]. While there are marginal and conditional AIC/BIC specifications, we focus on marginal BIC for the following reasons:
  • It is the most common specification for linear mixed effects models [30], including spatial additive mixed models.
  • Poor performance of conditional AIC/BIC-based model selection has been reported when two or more random effects are considered (see [31,32]), whereas [33] showed that conditional AIC/BIC outperforms the marginal criteria when comparing models with and without a single random effect.
  • Although the marginal specification suffers from a theoretical bias, [32] showed that the influence of the bias on model selection result is quite small.
Based on a preliminary analysis, we decided to use the REML-based marginal BIC, defined by −2 loglik_R(Θ̂) + 2Q log(N), where Q is the number of fixed coefficients (P) plus the number of variance parameters in Θ, in the crime analysis in Section 5.

3.2. Model Selection Procedures

This section proposes practical methods for model selection. The first incorporates model selection into the sequential REML estimation (see Section 2.2). To reduce the chance of being trapped in local optima, the second approach relies on a Monte Carlo (MC) simulation that iterates the sequential REML estimation. We call the former the simple selection method, which emphasizes simplicity and practicality, and the latter the MC selection method (see Appendix B). Section 3.2.1 and Section 3.2.2 explain these methods.

3.2.1. Simple Selection Method

The simple selection method consists of model selection steps in sequential REML estimation. The procedure of this method is as follows:
(a)
Replace the data matrices {y, X, E_1, …, E_P} with their inner products, as in step (I) in Section 2.2.
(b)
Perform the following calculations sequentially for each p ∈ {1, …, P}:
(b–1)
Estimate the p-th SVC by maximizing loglik(θ_p^(s) | Θ̂_{−p}^(s)) with respect to θ_p^(s), which is the subset of θ_p characterizing the SVC; Θ̂_{−p}^(s) represents the set of variance parameters excluding θ̂_p^(s) from Θ̂.
(b–2)
Select the SVC if it improves the cost function value (e.g., BIC). Otherwise, replace it with a constant.
(b–3)
Estimate the p-th NVC by maximizing loglik(θ_p^(n) | Θ̂_{−p}^(n)) with respect to θ_p^(n), which is the subset of θ_p characterizing the NVC; Θ̂_{−p}^(n) represents the set of variance parameters excluding θ̂_p^(n) from Θ̂.
(b–4)
The NVC is selected if it improves the cost function value (e.g., BIC). Otherwise, it is replaced with a constant.
(c)
Go to (d) if the cost function converges. Otherwise, go back to (b).
(d)
Output the final model.
While there are similar selection approaches [34], ours is distinctive in that its computational complexity for model selection is independent of the sample size, owing to step (a), which replaces all the data matrices with their inner products. Due to this drastic dimension reduction, the simple method is suitable for very large samples. Specifically, the computational complexity of the iterative part of the model selection procedure equals O(L_p³), which is required to evaluate the log-likelihood in steps (b–1) and (b–3). Because this complexity is small and independent of the sample size N, our model selection procedure is extremely fast even for large samples.
One problem is the large number of possible pre-determined sequences over p ∈ {1, …, P}. For example, if P = 9, there are P! = 362,880 sequences, and some of them might result in a poor model selection outcome (i.e., a local optimum). Fortunately, unlike maximum likelihood estimation or cross-validation, REML tends not to have local optima [35,36]. Section 4 examines whether this simple approach accurately approximates/selects the true model through Monte Carlo experiments.
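The following R sketch is a toy stand-in for this procedure: it mimics the sequential sweep of steps (a)–(d), but, for brevity, it treats the SVC/NVC terms as ordinary fixed-effect columns and uses lm() with BIC() as the cost function instead of the fast REML fit and marginal BIC described above, and it omits the inner-product pre-conditioning. All data are simulated.

```r
# Toy R stand-in for the simple selection method (steps (a)-(d)).
# Varying coefficients are approximated by fixed-effect columns, and lm()/BIC()
# replaces the REML-based fit and marginal BIC of Section 3.1.
set.seed(1)
N <- 500; P <- 3
coords <- cbind(runif(N), runif(N))
X <- matrix(rnorm(N * P), N, P)
y <- rnorm(N) + X[, 1] * sin(4 * coords[, 1])             # x_1 has a spatially varying effect

# Spatial basis: leading Moran eigenvectors (see the sketch in Section 2.1)
D <- as.matrix(dist(coords)); C <- exp(-D); diag(C) <- 0
MCM <- (diag(N) - 1 / N) %*% C %*% (diag(N) - 1 / N)
E_s <- eigen(MCM, symmetric = TRUE)$vectors[, 1:10]

svc_term <- function(p) X[, p] * E_s                      # x_p o E^(s)
nvc_term <- function(p) X[, p] * poly(X[, p], degree = 5) # x_p o E_p^(n)

use_svc <- rep(FALSE, P); use_nvc <- rep(FALSE, P)
cost <- function() {                                      # cost function (here: BIC of lm)
  Z <- X
  for (p in 1:P) {
    if (use_svc[p]) Z <- cbind(Z, svc_term(p))
    if (use_nvc[p]) Z <- cbind(Z, nvc_term(p))
  }
  BIC(lm(y ~ Z))
}

best <- Inf
repeat {                                                  # step (b): sweep over coefficients
  for (p in 1:P) {
    use_svc[p] <- FALSE; c0 <- cost()                     # (b-1)/(b-2): keep the p-th SVC
    use_svc[p] <- TRUE;  c1 <- cost()                     #   only if it lowers the cost
    use_svc[p] <- c1 < c0
    use_nvc[p] <- FALSE; c0 <- cost()                     # (b-3)/(b-4): same for the p-th NVC
    use_nvc[p] <- TRUE;  c1 <- cost()
    use_nvc[p] <- c1 < c0
  }
  new_best <- cost()
  if (new_best >= best - 1e-6) break                      # step (c): stop when the cost converges
  best <- new_best
}
use_svc; use_nvc; best                                    # step (d): selected coefficient types
```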

3.2.2. Monte Carlo (MC) Selection Method

To reduce the risk of falling into local optima, we randomly sample sequences over p ∈ {1, …, P} and iterate the REML-based estimation given each sequence. Specifically, we propose the following model selection approach:
(A)
Replace the data matrices {y, X, E_1, …, E_P} with their inner products.
(B)
Iterate the following calculation G times using the inner products:
(B–1)
Randomly sample the g-th sequence {1_g, …, P_g} without replacement.
(B–2)
Perform the following calculations sequentially for each p_g ∈ {1_g, …, P_g}:
(B–2a)
Estimate the p_g-th SVC by maximizing loglik(θ_{p_g}^(s) | Θ̂_{−p_g}^(s)), where {θ_{p_g}^(s), Θ̂_{−p_g}^(s)} are defined analogously to {θ_p^(s), Θ̂_{−p}^(s)}.
(B–2b)
Select the SVC if it improves the cost function value (e.g., BIC). Otherwise, replace it with a constant.
(B–2c)
Estimate the p_g-th NVC by maximizing loglik(θ_{p_g}^(n) | Θ̂_{−p_g}^(n)), where {θ_{p_g}^(n), Θ̂_{−p_g}^(n)} are defined analogously to {θ_p^(n), Θ̂_{−p}^(n)}.
(B–2d)
The NVC is selected if it improves the cost function value (e.g., BIC). Otherwise, it is replaced with a constant.
(B–3)
Go to (B–4) if the cost function converges. Otherwise, go back to (B–2).
(B–4)
Calculate the cost function value of the selected model.
(C)
Output the best model in the selected G models in terms of the lowest cost function.
As with the simple selection approach, the computational cost of the iterative step (B) is independent of the sample size. In addition, the iterative step is easily parallelized. Thus, this is a computationally efficient model selection procedure. Step (B) performs an MC simulation over the G random sequences to obtain the distribution of the cost value. We call this approach the MC selection approach.
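A minimal sketch of the wrapper is shown below; select_with_order() is a hypothetical helper standing for one pass of the sequential selection (e.g., the sweep in the sketch of Section 3.2.1) and is therefore left as a comment.

```r
# Minimal R sketch of the MC selection wrapper (steps (A)-(C)).
# select_with_order() is a hypothetical helper running the sequential selection
# of Section 3.2.1 for one update order and returning the selected model and cost.
set.seed(1)
P <- 9; G <- 30
orders <- replicate(G, sample(P), simplify = FALSE)            # (B-1): G random sequences
# fits <- lapply(orders, select_with_order)                    # (B-2)-(B-4): parallelizable
# best <- fits[[which.min(sapply(fits, function(f) f$cost))]]  # (C): lowest-cost model
```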

4. Numerical Experiments

4.1. Computational Details

Here, we examine, through comparative Monte Carlo experiments, whether the simple selection method accurately approximates the true model, defined as the model in which all the coefficient types (i.e., constant, SVC, NVC, or S&NVC) are specified correctly, or whether the MC selection method is needed to achieve accurate model selection. We compare models with and without effects selection by fitting them to synthetic data generated from
$$ y = \beta_0 + \sum_{p=1}^{P} x_p b_p + \sum_{p=1}^{P} x_{1,p} \circ \beta_{1,p} + \sum_{p=1}^{P} x_{2,p} \circ \beta_{2,p} + \varepsilon, \qquad \varepsilon \sim N(0, I) \qquad (7) $$
$$ \beta_0 = \mathbf{1} + \tilde{C} u_0, \qquad u_0 \sim N(0, I), \qquad (8) $$
$$ \beta_{1,p} = \mathbf{1} + \tilde{C} u_{1,p}, \qquad u_{1,p} \sim N(0, \tau_{1,p}^2 I), \qquad (9) $$
$$ \beta_{2,p} = \mathbf{1} + E_{2,p} u_{2,p}, \qquad u_{2,p} \sim N(0, \tau_{2,p}^2 I), \qquad (10) $$
where the covariates {x_1, …, x_P}, {x_{1,1}, …, x_{1,P}}, and {x_{2,1}, …, x_{2,P}} are generated from independent standard normal distributions. The matrix C̃ is constructed by row-standardizing the spatial connectivity matrix C, whose (i, j)-th element equals exp(−d_{i,j}), where d_{i,j} is the Euclidean distance between sample sites i and j. The sample sites were generated from two independent standard normal distributions. The SVCs β_0 and β_{1,p} are defined by spatial moving-average processes. β_{2,p} is an NVC that varies with respect to x_{2,p}, where E_{2,p} is a matrix of 10 polynomial basis functions generated from x_{2,p}. Figure 2 illustrates the coefficients obtained from these generating processes.
The main objective is to compare the coefficient estimation accuracy and computational efficiency of the simple and MC-based S&NVC model selections with alternatives. For MC model selection, we assumed 30 replicates. These models were fitted 200 times while varying P ∈ {1, 2, 3} and the sample size N ∈ {50, 200, 1000}.
The coefficient estimation accuracy is evaluated using the root mean squared error (RMSE), which is defined as
$$ RMSE(\beta_{i,p}) = \sqrt{ \frac{1}{200N} \sum_{iter=1}^{200} \sum_{i=1}^{N} \left( \hat{\beta}_{i,p}^{(iter)} - \beta_{i,p} \right)^2 } $$
where iter represents the iteration number, β i , p is the i-th element of β p , and β ^ i , p ( i t e r ) is the estimate given in the iter-th iteration. The bias of the estimates is evaluated using
$$ Bias(\beta_{i,p}) = \frac{1}{200N} \sum_{iter=1}^{200} \sum_{i=1}^{N} \left( \hat{\beta}_{i,p}^{(iter)} - \beta_{i,p} \right) $$
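As a check on these definitions, a minimal R sketch computing the two measures is given below; the true coefficient surface and the 200 iterations of estimates are simulated placeholders.

```r
# Minimal R sketch of the accuracy measures: RMSE and bias over 200 Monte Carlo
# iterations and N sites. beta and beta_hat are simulated placeholders.
set.seed(1)
N <- 1000
beta     <- rnorm(N)                                    # true coefficient values
beta_hat <- matrix(rnorm(200 * N, mean = beta, sd = 0.3),
                   nrow = 200, ncol = N, byrow = TRUE)  # 200 iterations of estimates
err  <- sweep(beta_hat, 2, beta)                        # beta_hat - beta per iteration
rmse <- sqrt(mean(err^2))                               # RMSE over iterations and sites
bias <- mean(err)                                       # bias over iterations and sites
```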
Under these settings, Section 4.2 examines if our approaches accurately select the best model, while Section 4.3 compares our approaches with other approaches.

4.2. Performance of Model Selection

In this experiment, we compare the following six models. As baseline models, we use the linear regression model (LM) and the model assuming SVCs on all coefficients (SVC model). As another baseline, we use the model assuming the true coefficient types as known (i.e., a constant coefficient on x_p, SVCs on x_{1,p}, and NVCs on x_{2,p}; the true model). Note that the true model treats the coefficient values as unknown (i.e., only the coefficient types are known), and the values are estimated from the samples as in the other models, so the estimates still contain estimation error. We also construct a full model that assumes S&NVC on all coefficients (S&NVC model). In addition, we prepare the simple and MC-based S&NVC model selection approaches, which select each coefficient among constant, SVC, NVC, and S&NVC using the simple approach and the Monte Carlo approach, respectively.
Figure 3 summarizes the RMSEs of the estimated coefficients. As expected, LM has higher estimation errors because it ignores the spatial and non-spatial variations in the regression coefficients. Although the SVC model is popular in spatial statistics, its RMSEs for the NVCs are considerably high. In addition, probably due to estimation error, the accuracy of the constant coefficients estimated by the SVC model is worse than that of the LM. These results suggest that SVC-based models become unstable in the presence of constant or non-spatially varying coefficients.
Regarding the S&NVC model without model selection, the estimated SVCs and NVCs are as accurate as the true model. However, its RMSE values for the constants are the highest across the models, probably because of over-parameterization. On the other hand, the RMSEs of the simple and MC-based S&NVC model selections are close to the true model across all coefficients. These results demonstrate the importance of effect selection in spatial regression modeling.
Figure 4 plots the biases of the coefficient standard errors, which are used to evaluate statistical significance; a downward bias of the standard errors yields an overestimation of statistical significance, whereas an upward bias yields the opposite. Here, the standard errors estimated from the true model are regarded as the true values. The standard errors of the constant coefficients estimated from the LM, SVC, and S&NVC models are downwardly biased. For the SVCs, those estimated from LM are upwardly biased, while those from the SVC and S&NVC models are downwardly biased. For the NVCs, the standard errors estimated from the LM and SVC models are upwardly biased. The RMSEs of the standard errors obtained by the LM, SVC, and S&NVC models are also high, as shown in Figure 5. Based on these results, the models without effect selection (LM, SVC, and S&NVC models) suffer from an overestimation or underestimation of statistical significance.
Conversely, the standard errors estimated from the simple and MC-based model selection approaches are almost the same as those of the true model, except for the case with P = 3 and N = 100, which is the most severe case, estimating 10 effects from 100 samples. Our effect selection approaches are useful for improving the estimation accuracy of both the coefficients and their standard errors. Surprisingly, the results from the simple model selection approach are almost the same as those from the MC-based approach, despite the fact that the simple approach relies on a pre-determined sequence over p ∈ {1, …, P} for model selection, whereas the MC-based method implicitly optimizes it. This is attributable to the stability of REML (see Section 3.2.1).

4.3. Benchmark Comparison of Model Selection Methods

Here, we compare our model selection approach with another commonly used model selection approach. Among the existing approaches, we selected the one implemented in the mgcv package in R (https://cran.r-project.org/web/packages/mgcv/index.html) because of the following reasons: (i) mgcv is one of the most popular packages for additive mixed modeling; (ii) the effects selection procedure in mgcv is computationally highly scalable [37], and it is a sensible benchmark for testing both the accuracy and computational efficiency of our approach.
The following models are compared: LM, the S&NVC model, the simple S&NVC model selection, another S&NVC model estimated with mgcv (Mgcv), and the double-penalty-based Mgcv model selection [37]. The double-penalty approach is the standard effect-selection option in the mgcv package. Roughly speaking, it imposes additional penalty parameters to select effects on top of the usual penalty parameters of additive models that determine the smoothness of each effect. Ref. [37] showed the superior accuracy of this approach over alternatives. For faster computation, we estimated Mgcv and Mgcv with the double-penalty model selection using the bam function of the mgcv package, which implements fast REML [38]. As in the previous section, we assumed Equations (7)–(10) as the true model, and each model was iteratively fitted 200 times.
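For reference, the mgcv benchmark can be reproduced along the following lines. The model formula and data below are illustrative assumptions (a spatially varying coefficient expressed as a by-variable smooth of the coordinates plus a smooth in the covariate itself), not the exact experimental setup; the calls to bam() with method = "fREML" and select = TRUE are the fast-REML fit and double-penalty selection discussed above.

```r
# Hedged R sketch of the mgcv benchmark: bam() with fast REML and the
# double-penalty effect selection (select = TRUE). Formula and data are
# illustrative assumptions.
library(mgcv)
set.seed(1)
N   <- 2000
dat <- data.frame(lon = runif(N), lat = runif(N), x1 = rnorm(N))
dat$y <- dat$x1 * sin(4 * dat$lon) + rnorm(N)
fit <- bam(y ~ x1                     # constant part of the coefficient
             + s(lon, lat, by = x1)   # spatially varying part (SVC-like term)
             + s(x1),                 # smooth in x1 (captures NVC-like variation)
           data = dat, method = "fREML", select = TRUE)
summary(fit)
```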
Figure 6 summarizes the RMSEs of the estimated coefficients when P = 3 and N ∈ {400, 1000}. For the SVCs and NVCs, the RMSE values obtained from all the models except LM are quite similar. Regarding the constant effect, our simple S&NVC model selection yields the lowest estimation error. Therefore, our approach is a promising alternative to Mgcv with or without the double-penalty model selection, which is now widely used.
Finally, Figure 7 compares the computational times for P = 10 and N ∈ {1000, 10,000, 30,000, 50,000, 100,000}. Here, the estimations were performed five times in each case, and the resulting computation times were averaged. Mgcv with double-penalty-based model selection is slightly slower than Mgcv because the former additionally estimates the penalty parameters for effect selection; as explained in Section 1, such additional computational time for model/effect selection is usually considered unavoidable. On the other hand, the simple S&NVC model selection approach has a considerably shorter computational time than the naïve S&NVC model, because the former includes only the selected effects when evaluating the likelihood in each iteration, while the latter includes all effects. In addition, owing to the pre-conditioning procedure, the increase in computational time with N is suppressed relative to the Mgcv-based alternatives. In contrast, the MC-based S&NVC model selection is slower than the alternatives, as shown in Figure 7. Based on estimation accuracy and computational efficiency, we recommend the simple S&NVC model selection as the default choice.
Theoretically, the simple S&NVC model selection is the fastest for the following reasons: it does not require a Monte Carlo simulation, unlike the MC-based approach; and it requires fewer penalty parameters than Mgcv, which uses K penalty parameters for smoothness selection and another K parameters for effect selection. The results here confirm the computational efficiency of our simple selection approach experimentally.
In summary, the analysis results demonstrate that our approach selects effects accurately and computationally efficiently, even in comparisons with state-of-the-art approaches.

5. Application to Crime Modeling

5.1. Outline

This section applies the simple selection approach to the Dai-Tokyo Bouhan network database (https://www.bouhan.metro.tokyo.lg.jp/) provided by the Office for Promotion of Citizen Safety, Tokyo Metropolitan Government (https://www.tomin-anzen.metro.tokyo.lg.jp/english/). This database records crime statistics categorized by crime type for 1529 minor municipal districts in Tokyo, Japan. The district zones are delineated to be internally homogeneous. Districts in urban areas are quite small, while some districts in the western mountain areas are large: the minimum area is 0.003 km², the first quartile 0.301 km², the median 0.657 km², the third quartile 1.102 km², and the maximum 68.714 km².
This study focuses on bicycle theft and shoplifting, the two most frequent non-burglary crimes reported between 2017 and 2018. The sample size is 12,232. Figure 8 plots the crime densities by minor municipal district (number of incidents/km²) in the first quarter of 2017. The eastern area inside the looping railway (Yamanote line) is the central area, and other railways extend in all directions from this loop, as shown in the figure; the Chuo line is one such line.
We model the logged number of bicycle theft and shoplifting cases per unit area by district and quarter. Based on Figure 9, which displays histograms of the crime densities, the data distributions are nearly Gaussian, so a (Gaussian) linear model is acceptable in our case. This study applies the following model:
$$ y_{i,t}^{c} = \beta_{i,0} + \beta_{i,1} y_{i,t-1}^{c} + \beta_{i,2} Y_{i,t-1}^{c} + \sum_{k=3}^{6} \beta_{i,k} x_{i,k} + g_{d(i)}^{(s)} + g_{q(i)}^{(t)} + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2) $$
where y i , t c is the number of c-th crime per area in the i-th district in the t-th quarter (log-scale).
We then consider the explanatory variables. Based on the routine activity theory [39], the following three elements are the triggers of crimes: (i) a potential offender, (ii) a suitable target, and (iii) the absence of a guardian. While the level of crime prevention, which relates to (iii), is roughly equal across Tokyo, (i) and (ii) vary considerably across districts. For bicycle theft, we consider nighttime population density (Popden) as an explanatory variable approximating (i), the number of potential offenders. Since bicycles are widely used for shopping in the study area, we also consider Popden and the number of retailers (Retail) as explanatory variables describing (ii), the number of targets (bicycles). For shoplifting, we consider daytime population density (Dpopden) as a variable explaining (i), the number of potential offenders, and Retail as a variable explaining (ii), the number of targets.
In crime geography, near repeat victimization [40], the tendency for crimes to recur in the same or nearby areas, is a well-known phenomenon. Such repetition arises from regional heterogeneity, such as resident characteristics (risk heterogeneity hypothesis), or from repeated crimes by the same group that has knowledge of the area (event dependency hypothesis) (see [41]). To account for such local repetitiveness, we include the crime density in the previous quarter, y_{i,t−1}^c, as an explanatory variable and call it Repeat. As such repetitive tendencies can also occur across crime types, we additionally include the logged density of non-burglary crimes other than the c-th crime in the previous quarter, Y_{i,t−1}^c (RepOther). For the bicycle theft model, Y_{i,t−1}^c is given by log(the number of non-burglary crimes except bicycle theft/area (km²)); Y_{i,t−1}^c for the shoplifting model is defined analogously.
In addition, we include the following explanatory variables { x i , 3 , x i , 4 , x i , 5 , x i , 6 } describing local environmental factors: ratio of foreigners among residents (Fpopden); unemployment ratio (UnEmp); ratio of residents who graduated from university (Univ). These three are proxy variables of race, economic deprivation, and education. Strong influence from such local environmental factors has been demonstrated in studies relating to risk terrain modeling (RTM; [42]). These data were collected from the National Census Statistics by minor municipal districts in 2015.
As in the previous section, we assume a spatially varying intercept β_{i,0} to absorb residual spatial dependence, and we select the coefficient type of each of {β_{i,1}, …, β_{i,6}} among {constant, SVC, NVC, S&NVC} using the simple approach. Furthermore, to capture heterogeneity across individual districts and time periods, we consider g_{d(i)}^(s) ~ N(0, τ_(s)²) and g_{q(i)}^(t) ~ N(0, τ_(t)²), respectively, where τ_(s)² and τ_(t)² are variance parameters. These terms represent the group effects of the district d(i) and the quarter q(i) in which the i-th sample is observed; they are constant within each district and quarter, respectively (e.g., g_{d(i)}^(s) takes larger positive values in districts with more crimes). Whether {g_{d(i)}^(s), g_{q(i)}^(t)} are included is also selected automatically by the simple approach.
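For orientation, a hedged R sketch of fitting an SVC-type model of this kind with the spmoran package, which implements the methods developed in this paper, is given below. The data objects are placeholders, and only basic arguments are shown; the options that switch on NVC terms and BIC-based coefficient-type selection are version-dependent and are not reproduced here, so consult the resf_vc/besf_vc help pages of the installed release.

```r
# Hedged R sketch: a basic spmoran fit. Data are placeholders; arguments for
# NVCs and coefficient-type selection are omitted (see ?resf_vc / ?besf_vc).
library(spmoran)
set.seed(1)
n      <- 1000
coords <- cbind(runif(n), runif(n))           # district centroids (placeholder)
x      <- cbind(rnorm(n), rnorm(n))           # covariates, e.g., Repeat, Popden
y      <- rnorm(n)                            # logged crime density (placeholder)
meig   <- meigen(coords = coords)             # Moran eigenvectors and eigenvalues
fit    <- resf_vc(y = y, x = x, meig = meig)  # spatially varying coefficient fit
```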

5.2. Coefficient Estimation Results

The (conditional) adjusted R-squared values are 0.914 for the bicycle theft model and 0.928 for the shoplifting model, which verifies the accuracy of these models. The estimation took 67.3 s and 52.0 s, respectively, confirming the computational efficiency of our approach.
Table 2 and Table 3 summarize the estimated coefficients and their statistical significance. For bicycle theft, the variables Repeat, RepOther, and Popden are positively statistically significant at the 1% level across districts. The results suggest a strong tendency toward near repeat victimization, not only within the same crime type but also across crime types. Given that a higher population density implies larger numbers of potential offenders and targets (see Section 5.1), the positive sign of Popden is intuitively reasonable. The selected coefficient type for Repeat was S&NVC, while those for RepOther and Popden were NVCs. Although NVCs have rarely been considered in spatial modeling, these results indicate the importance of considering them. Fpopden, UnEmp, and Univ, whose coefficients are estimated to be constant, are statistically insignificant (Table 3). Among the group effects, g_{q(i)}^(t) was selected. In short, near repeat victimization, population, and season are the dominant determinants, whereas the number of foreign residents, economic status, and education are not. We will look at the estimated coefficients in more detail below.
For shoplifting, Repeat and RepOther are again positively significant, suggesting that not only shoplifting but also other non-burglary crimes in a quarter increase shoplifting in the next quarter. Dpopden is also positively significant, suggesting that shoplifting increases in central areas where people concentrate during the daytime. Retail, Fpopden, UnEmp, Univ, and the two group effects are insignificant. For the significant variables, the selected coefficient types are as follows: S&NVC for Repeat; NVC for RepOther and Dpopden; constant for the others. In sum, the repeat tendency and the daytime population are significant determinants of shoplifting.
Figure 10 plots the estimated S&NVC on Repeat. The coefficients for bicycle theft decrease in the central area and to the south-west of the center, which are affluent areas; bicycle theft is expected to be repeated less there. Conversely, in the middle area, the coefficients increase along the Chuo line, a major railway line (Figure 8), and the repetitive tendency is especially strong near major stations on this line. Because subways and bus routes are densely spread across the affluent area (although not shown in the figure), bicycle users are relatively limited there, and most residents in this area can afford to buy a bicycle; these properties might lower the number of potential offenders for bicycle theft. By contrast, in the middle area, railways and bus routes are relatively limited, and many people use bicycles for shopping or for going to railway stations, so bicycles are truly needed in this area. In addition, the income level in this area is low relative to the affluent area. These properties might increase the number of potential offenders.
As for shoplifting, the coefficients on Repeat increase in the central area where the Yamanote line runs (see Figure 8). This area includes central commercial districts near Shinjuku station and Tokyo station. A wide variety of products are sold in these districts. Besides, luxury stores are concentrated in these areas. Such properties might attract potential offenders and increase near repeat victimization.
Figure 11 displays the estimated NVCs. For both bicycle theft and shoplifting, the NVC on Repeat, which is part of the S&NVC, takes high values where Repeat is high (see the left two panels of Figure 11). This means that the repetitive tendency becomes stronger where many crimes occurred in the previous quarter. In contrast, smaller RepOther values have higher coefficients, meaning that the numbers of bicycle thefts and shoplifting cases are correlated with other crimes mainly in low-risk areas (i.e., areas with low RepOther values). Higher Popden (for bicycle theft) and Dpopden (for shoplifting) have lower coefficients; that is, these crimes increase as the population grows, but the rate of increase declines as the population increases.
Finally, Table 4 summarizes the estimated group effects on bicycle theft by quarter. The results for shoplifting are not shown because neither group effect was selected. One interesting finding is that bicycle theft cases decrease in the first quarter (January–March) and increase in the second quarter (April–June), probably because the warm climate increases the number of bicycle users and/or because of changes in lifestyle and routine activities in the spring season (see [43]).
While spatial regression models have been used for crime modeling, their results depend strongly on the model assumptions. For example, studies using GWR always discuss coefficient estimates that vary smoothly over geographical space, while spatial econometric studies discuss constant coefficient estimates in the presence of spatial dependence. In contrast, our results, in which the coefficient types are selected from the data, are less dependent on such modeling assumptions and are therefore more reliable. In addition, the strong impact of NVCs, which have been overlooked in spatial statistics relative to SVCs, suggests the danger of using a spatial regression model without model selection. Our model selection approach will be useful for increasing the reliability of spatial regression analysis.

5.3. Application to Crime Prediction

Crime prediction for the near future is important for preventing crimes. Here, we apply our model (SNMSel) to predict the number of bicycle theft and shoplifting cases in the minor municipal districts in the first quarter of 2019, using the model trained with the 2017–2018 data. The accuracy is compared with kernel density estimation (KDE), which is a popular crime prediction method in this field [44]. The ks package (https://cran.r-project.org/web/packages/ks/index.html) in R was used for the KDE estimation and prediction, and the plug-in selector of [45] was used to optimize the kernel bandwidth.
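A hedged R sketch of this KDE baseline is given below; the crime coordinates are simulated placeholders, while Hpi() is the plug-in bandwidth selector of [45] and kde() the kernel density estimator of the ks package.

```r
# Hedged R sketch of the KDE baseline with the ks package: a two-dimensional
# kernel density estimate with the plug-in bandwidth selector of [45].
# The crime coordinates are simulated placeholders.
library(ks)
set.seed(1)
crime_xy <- cbind(runif(500), runif(500))   # coordinates of past crime events
H        <- Hpi(x = crime_xy)               # plug-in bandwidth matrix (Wand & Jones)
dens     <- kde(x = crime_xy, H = H)        # kernel density estimate
plot(dens)                                  # density surface used as the prediction
```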
Figure 12 and Figure 13 compare the predicted and actual number of bicycle theft and shoplifting cases per area, respectively. For both cases, the KDE results are oversmoothed. The RMSE values are 5.77 and 6.93, respectively. Conversely, owing to the inclusion of spatial and non-spatial effects, our approach appropriately detects local hot spots. The resulting map pattern is quite similar to the observed numbers. The RMSEs are 3.53 for bicycle theft and 1.62 for shoplifting, which are considerably lower than those of KDE. This suggests that our approach is useful for crime prevention, such as the design of efficient patrol routes.
In short, we have demonstrated the usefulness of our approach, taking crime modeling as an example. While we accounted for data noise by introducing a noise term, we could not consider potential data bias attributable to under-recording/under-reporting, the wording of crime-survey questions (minor changes to which can dramatically alter the answers), and other factors (e.g., see [46,47]). Considering bias in crime data will be an important next step toward more reliable crime modeling. In addition, we considered repetitiveness at the district level; however, studies on repeat victimization [48,49] suggest that such repetitiveness is more prominent at finer spatial scales, such as individual properties, street segments, and 250-m grids. Considering more disaggregated spatial crime data rather than district-level data is another important future task. Given that PredPol (https://www.predpol.com/) and many other crime prediction systems operate at a finer spatial scale, such as 250-m grids, extending our crime model to spatially finer prediction also remains a task toward practical deployment.

6. Concluding Remarks

This study develops a model (coefficient-type) selection approach for spatial additive mixed models. The simulation experiments suggest that the present simple approach accurately selects the true model. Moreover, even though model selection usually increases computational time, our model selection dramatically reduces it. This property will be valuable for estimating complex models that involve many candidate effects from large samples. The crime analysis demonstrates that our approach provides intuitively reasonable results. Our approach highlights and quantifies numerous hidden effects behind geographic phenomena. These prominent features will be useful for a wide spectrum of analyses, such as crime analysis, ecological studies, and environmental studies.
However, there are many issues to be addressed. First, we need to incorporate spatio-temporal variations in regression coefficients or residuals. Spatially and temporally varying coefficients (STVCs) have been studied by [50,51,52], among others. However, the discussion on selecting SVC, STVC, or other coefficient types is still limited in spatial statistics. The spatio-temporal process, which comprises pure-spatial, pure-temporal, and spatio-temporal interaction processes, is much more complex than a purely spatial variation. A well-designed selection of SVC, STVC, or other coefficient types will therefore be very important for improving the accuracy and stability of spatio-temporal modeling; the consideration of dynamic temporal behavior is important therein [53,54]. Second, it is important to consider a larger number of coefficients, such as over 100. Sparse modeling approaches, such as the LASSO [55] and the smoothly clipped absolute deviation (SCAD) penalty [56], will be helpful in estimating SVCs, NVCs, or other varying coefficients while avoiding overfitting. Third, we need to extend our approach to non-Gaussian data modeling while maintaining the computational efficiency. For example, extensions to Poisson or negative binomial regression will be important future tasks for modeling crime counts rather than crime density (e.g., see [1]), while regression with skewed distributions will be useful for modeling maximum temperature or other data related to extreme values (e.g., see [57]).
The developed model selection procedure was implemented in an R package spmoran (https://cran.r-project.org/web/packages/spmoran/index.html).

Author Contributions

Conceptualization, Mami Kajita and Seiji Kajita; Methodology, Daisuke Murakami; Formal analysis, Daisuke Murakami and Seiji Kajita; Writing—review and editing, Daisuke Murakami, Mami Kajita, and Seiji Kajita; Funding acquisition, Mami Kajita. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Institute of Information and Communications Technology (NICT), Japan, grant number 20005.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Restricted Log-Likelihood Function of The Spatial Additive Mixed Model

The fast REML maximizes the marginal likelihood l o g l i k R ( Θ ) defined by integrating { b , U } from the full likelihood, which is formulated as
$$ loglik_R(\Theta) = -\frac{1}{2} \log \left| \begin{bmatrix} X'X & X'\tilde{E}(\Theta) \\ \tilde{E}(\Theta)'X & \tilde{E}(\Theta)'\tilde{E}(\Theta) + I \end{bmatrix} \right| - \frac{N-K}{2} \left( 1 + \log \frac{2\pi\, d(\Theta)}{N-K} \right) \qquad (A1) $$
where Ẽ(Θ) = [x_1 ∘ E_1V_1(θ_1), …, x_P ∘ E_PV_P(θ_P)]. In the second term of Equation (A1), d(Θ) = ε̂′ε̂ + ∑_{p=1}^{P} û_p′û_p balances the noise variance and the random-effects variance, where
$$ \hat{\varepsilon} = y - X\hat{b} - \tilde{E}(\Theta)\hat{U} \qquad (A2) $$
$$ \begin{bmatrix} \hat{b} \\ \hat{U} \end{bmatrix} = \begin{bmatrix} X'X & X'\tilde{E}(\Theta) \\ \tilde{E}(\Theta)'X & \tilde{E}(\Theta)'\tilde{E}(\Theta) + I \end{bmatrix}^{-1} \begin{bmatrix} X'y \\ \tilde{E}(\Theta)'y \end{bmatrix} \qquad (A3) $$
The additive mixed model is readily estimated by (i) estimating Θ̂ by maximizing loglik_R(Θ), and (ii) estimating the fixed and random coefficients [b̂, Û] by substituting Θ̂ into Equation (A3). One major difficulty is the computational cost of estimating Θ̂ in step (i), because the cost increases exponentially with the number of parameters in Θ. The computational time can be disappointingly long even for 10 variance parameters. Unfortunately, to flexibly capture the influence of many covariates, 10 or more variance parameters are typically required.

Appendix B. Details of Model Selection Approaches

This appendix explains the computational details of the simple model selection method. Then, the details of the MC method are explained.
In step (a) of the simple method (see Section 3.2.1), the data matrices {y, X, E_1, …, E_P} are replaced with the following inner products: M_{0,0} = X′X, M_{0,p} = X′(x_p ∘ E_p), M_{p,p̃} = (x_p ∘ E_p)′(x_{p̃} ∘ E_{p̃}), m_0 = X′y, m_p = (x_p ∘ E_p)′y, and m_{y,y} = y′y, where a ∘ B multiplies the column vector a element-wise with each column of the matrix B.
The restricted log-likelihood maximized in step (b) (see Section 3.2.1) is rewritten by substituting these inner products into Equation (A1) [19]:
$$ loglik_R(\Theta) = -\frac{1}{2} \ln |P| - \frac{N-K}{2} \left( 1 + \ln \frac{2\pi \left( \hat{\varepsilon}'\hat{\varepsilon} + \sum_{p=1}^{P} \hat{u}_p'\hat{u}_p \right)}{N-P} \right), \qquad (A4) $$
where
$$ P = \begin{bmatrix} M_{0,0} & M_{0,1}V_1(\theta_1) & \cdots & M_{0,P}V_P(\theta_P) \\ V_1(\theta_1)M_{1,0} & V_1(\theta_1)M_{1,1}V_1(\theta_1) + I & \cdots & V_1(\theta_1)M_{1,P}V_P(\theta_P) \\ \vdots & \vdots & \ddots & \vdots \\ V_P(\theta_P)M_{P,0} & V_P(\theta_P)M_{P,1}V_1(\theta_1) & \cdots & V_P(\theta_P)M_{P,P}V_P(\theta_P) + I \end{bmatrix}, \qquad (A5) $$
$$ \hat{\varepsilon}'\hat{\varepsilon} = m_{y,y} - 2\,[\hat{b}', \hat{u}_1', \ldots, \hat{u}_P'] \begin{bmatrix} m_0 \\ V_1(\theta_1)m_1 \\ \vdots \\ V_P(\theta_P)m_P \end{bmatrix} + [\hat{b}', \hat{u}_1', \ldots, \hat{u}_P']\, P_0 \begin{bmatrix} \hat{b} \\ \hat{u}_1 \\ \vdots \\ \hat{u}_P \end{bmatrix}, \qquad (A6) $$
$$ \begin{bmatrix} \hat{b} \\ \hat{u}_1 \\ \vdots \\ \hat{u}_P \end{bmatrix} = P^{-1} \begin{bmatrix} m_0 \\ V_1(\theta_1)m_1 \\ \vdots \\ V_P(\theta_P)m_P \end{bmatrix}. \qquad (A7) $$
Equations (A4)–(A7) do not include any matrix whose size depends on N. Thus, in step (b), the likelihood is evaluated with a computational cost independent of the sample size.
Consider the P-th iteration of step (b) as an example. In step (b–1) of the P-th iteration, loglik_R(θ_P^(s) | Θ̂_{−P}^(s)) is maximized with respect to θ_P^(s). This maximization involves iterative evaluation of |P| and P^{−1}, whose computational complexity equals O((∑_{p=1}^{P} L_p)³); this can be slow when many effects are considered, as in our case. To reduce the cost, Equation (A7), which includes P^{−1}, is expanded as follows:
$$ \begin{bmatrix} \hat{b} \\ \hat{u}_1 \\ \vdots \\ \hat{u}_P \end{bmatrix} = \begin{bmatrix} \tilde{V}_{-P}^{-1} & 0 \\ 0 & V_P(\theta_P^{(s)})^{-1} \end{bmatrix} Q^{-1} \begin{bmatrix} m_{-P} \\ m_P \end{bmatrix} - \begin{bmatrix} \tilde{V}_{-P}^{-1} Q^{*}_{-P,P} \\ V_P(\theta_P^{(s)})^{-1} Q^{*}_{P,P} \end{bmatrix} \left( V_P(\theta_P^{(s)})^{2} + Q^{*}_{P,P} \right)^{-1} \left( Q^{*}_{P,-P}\, m_{-P} + Q^{*}_{P,P}\, m_P \right), \qquad (A8) $$
where Ṽ_{−P} = blkdiag(I, V_1, …, V_{P−1}) with V_p ≡ V_p(θ̂_p), m_{−P} stacks m_0, m_1, …, m_{P−1}, and the fixed parameters Θ̂_{−P}^(s) ≡ {θ̂_1, …, θ̂_{P−1}, θ̂_P^(n)} are omitted from the notation for simplicity. $Q = \begin{bmatrix} \tilde{M}_{-P,-P} + \tilde{V}_{-P}^{-2} & \tilde{M}_{-P,P} \\ \tilde{M}_{P,-P} & M_{P,P} \end{bmatrix}$, where $\tilde{M}_{-P,-P} + \tilde{V}_{-P}^{-2} = \begin{bmatrix} M_{0,0} & M_{0,1} & \cdots & M_{0,P-1} \\ M_{1,0} & M_{1,1} + V_1^{-2} & \cdots & M_{1,P-1} \\ \vdots & \vdots & \ddots & \vdots \\ M_{P-1,0} & M_{P-1,1} & \cdots & M_{P-1,P-1} + V_{P-1}^{-2} \end{bmatrix}$ and $\tilde{M}_{P,-P} = [M_{P,0}, M_{P,1}, \ldots, M_{P,P-1}]$, and $\begin{bmatrix} Q^{*}_{-P,-P} & Q^{*}_{-P,P} \\ Q^{*}_{P,-P} & Q^{*}_{P,P} \end{bmatrix} = Q^{-1}$.
On the other hand, |P|, which is another tedious part, has the following expression:
$$ |P| = |\tilde{V}_{-P}|^2\, |V_P(\theta_P^{(s)})|^2\, |\tilde{M}_{-P,-P} + \tilde{V}_{-P}^{-2}|\, \left| V_P(\theta_P^{(s)})^{-2} + M_{P,P} - \tilde{M}_{P,-P}\left(\tilde{M}_{-P,-P} + \tilde{V}_{-P}^{-2}\right)^{-1} \tilde{M}_{-P,P} \right|. \qquad (A9) $$
To maximize the likelihood numerically, Equations (A8) and (A9) must be evaluated repeatedly while varying θ_P^(s). Fortunately, many elements in Equation (A8) are unchanged even when θ_P^(s) is changed. As a result, if the elements that are independent of θ_P^(s) are evaluated a priori, the computational complexity of the iterative evaluation of Equation (A8) is only O(L_P³), which is required for evaluating (V_P(θ_P^(s))² + Q*_{P,P})^{−1}. Likewise, the complexity of the iterative evaluation of Equation (A9) while varying θ_P^(s) is only O(L_P³). Thus, the model estimation step (b–1) scales extremely well for both the sample size N and the number of effects P. The same holds for the NVC estimation step (b–3).
In the model selection step (b–2), we need to compare the cost function value (e.g., BIC) of the model with the P-th SVC, which was estimated in step (b–1), with that of the model without the P-th SVC. For this, we also need to evaluate the likelihood of the latter model using Equation (A4). This likelihood can be evaluated by replacing Equations (A8) and (A9) with Equations (A10) and (A11), respectively.
$$ \begin{bmatrix} \hat{b} \\ \hat{u}_1 \\ \vdots \\ \hat{u}_{P-1} \end{bmatrix} = \tilde{V}_{-P}^{-1} \left( \tilde{M}_{-P,-P} + \tilde{V}_{-P}^{-2} \right)^{-1} m_{-P} \qquad (A10) $$
$$ |P| = |\tilde{V}_{-P}|^2\, |\tilde{M}_{-P,-P} + \tilde{V}_{-P}^{-2}| \qquad (A11) $$
All the elements in Equation (A11) are already evaluated in step (b–1) (see Equation (A9)). Thus, Equation (A11) is evaluated without any additional computational cost. Although (M̃_{−P,−P} + Ṽ_{−P}^{−2})^{−1} must additionally be calculated to evaluate Equation (A10), its computational complexity is O((∑_{p=1}^{P−1} L_p)³), which is still independent of the sample size. In addition, iterative evaluation is not needed in this part, so the computational cost of step (b–2) is trivial. The same holds for the other model selection step (b–4).
In summary, both the estimation and the model selection steps are performed in a computationally efficient manner.

References

  1. Osgood, D.W. Poisson-based regression analysis of aggregate crime rates. J. Quant. Criminol. 2000, 16, 21–43.
  2. Cahill, M.; Mulligan, G. Using geographically weighted regression to explore local crime patterns. Soc. Sci. Comput. Rev. 2007, 25, 174–193.
  3. Bernasco, W.; Block, R. Robberies in Chicago: A block-level analysis of the influence of crime generators, crime attractors, and offender anchor points. J. Res. Crime Delinq. 2011, 48, 33–57.
  4. Maguire, M.; McVie, S. Crime data and criminal statistics: A critical reflection. In The Oxford Handbook of Criminology; Maruna, S., McAra, L., Eds.; Oxford University Press: Oxford, UK, 2017; pp. 163–189.
  5. LeSage, J.P.; Pace, R.K. Introduction to Spatial Econometrics; CRC Press: Boca Raton, FL, USA, 2009.
  6. Cressie, N.; Wikle, C.K. Statistics for Spatio-Temporal Data; John Wiley & Sons: Hoboken, NJ, USA, 2011.
  7. Brunsdon, C.; Fotheringham, S.; Charlton, M. Geographically weighted regression. J. R. Stat. Soc. Ser. D (Stat.) 1998, 47, 431–443.
  8. Fotheringham, A.S.; Brunsdon, C.; Charlton, M. Geographically Weighted Regression: The Analysis of Spatially Varying Relationship; John Wiley & Sons: West Sussex, UK, 2002.
  9. Lee, S.; Kang, D.; Kim, M. Determinants of crime incidence in Korea: A mixed GWR approach. In Proceedings of the World Conference of the Spatial Econometrics Association, Barcelona, Spain, 8–10 July 2009; pp. 8–10.
  10. Arnio, A.N.; Baumer, E.P. Demography, foreclosure, and crime: Assessing spatial heterogeneity in contemporary models of neighborhood crime rates. Demogr. Res. 2012, 26, 449–486.
  11. Umlauf, N.; Adler, D.; Kneib, T.; Lang, S.; Zeileis, A. Structured additive regression models: An R interface to BayesX. J. Stat. Softw. 2015, 21, 63.
  12. Nakaya, T.; Fotheringham, S.; Charlton, M.; Brunsdon, C. Semiparametric geographically weighted generalised linear modelling in GWR 4.0. In Proceedings of the 10th International Conference on GeoComputation, Sydney, Australia, 30 November–2 December 2009.
  13. Wheeler, D.C. Simultaneous coefficient penalization and model selection in geographically weighted regression: The geographically weighted lasso. Environ. Plan. A 2009, 41, 722–742.
  14. Comber, A.; Brunsdon, C.; Charlton, M.; Dong, G.; Harris, R.; Lu, B.; Lu, Y.; Murakami, D.; Nakaya, T.; Wang, Y.; et al. The GWR route map: A guide to the informed application of Geographically Weighted Regression. arXiv 2020, arXiv:2004.06070.
  15. Huang, J.; Horowitz, J.L.; Wei, F. Variable selection in nonparametric additive models. Ann. Stat. 2010, 38, 2282.
  16. Amato, U.; Antoniadis, A.; De Feis, I. Additive model selection. Stat. Methods Appl. 2016, 25, 519–654.
  17. Mei, C.L.; He, S.Y.; Fang, K.T. A note on the mixed geographically weighted regression model. J. Reg. Sci. 2004, 44, 143–157.
  18. Li, Z.; Fotheringham, A.S.; Li, W.; Oshan, T. Fast Geographically Weighted Regression (FastGWR): A scalable algorithm to investigate spatial process heterogeneity in millions of observations. Int. J. Geogr. Inf. Sci. 2019, 33, 155–175.
  19. Murakami, D.; Griffith, D.A. Spatially varying coefficient modeling for large datasets: Eliminating N from spatial regressions. Spat. Stat. 2019, 30, 39–64.
  20. Murakami, D.; Griffith, D.A. A memory-free spatial additive mixed modeling for big spatial data. Jpn. J. Stat. Data Sci. 2020, 3, 215–241.
  21. Murakami, D.; Yoshida, T.; Seya, H.; Griffith, D.A.; Yamagata, Y. A Moran coefficient-based mixed effects approach to investigate spatially varying relationships. Spat. Stat. 2017, 19, 68–89.
  22. Griffith, D.A. Spatial Autocorrelation and Spatial Filtering: Gaining Understanding through Theory and Scientific Visualization; Springer Science & Business Media: Berlin, Germany, 2003.
  23. Tiefelsdorf, M.; Griffith, D.A. Semiparametric filtering of spatial autocorrelation: The eigenvector approach. Environ. Plan. A 2007, 39, 1193–1221.
  24. Murakami, D.; Griffith, D.A. Balancing spatial and non-spatial variation in varying coefficient modeling: A remedy for spurious correlation. arXiv 2020, arXiv:2005.09981.
  25. Wheeler, D.; Tiefelsdorf, M. Multicollinearity and correlation among local regression coefficients in geographically weighted regression. J. Geogr. Syst. 2005, 7, 161–187.
  26. Bates, D.; Mächler, M.; Bolker, B.; Walker, S. Fitting linear mixed-effects models using lme4. arXiv 2014, arXiv:1406.5823.
  27. Winter, B.; Wieling, M. How to analyze linguistic change using mixed models: Growth Curve Analysis and Generalized Additive Modeling. J. Lang. Evol. 2016, 1, 7–18.
  28. Baayen, H.; Vasishth, S.; Kliegl, R.; Bates, D. The cave of shadows: Addressing the human factor with generalized additive mixed models. J. Mem. Lang. 2017, 94, 206–234.
  29. Gurka, M.J. Selecting the best linear mixed model under REML. Am. Stat. 2006, 60, 19–26.
  30. Müller, S.; Scealy, J.L.; Welsh, A.H. Model selection in linear mixed models. Stat. Sci. 2013, 28, 135–167.
  31. Dimova, R.B.; Markatou, M.; Talal, A.H. Information methods for model selection in linear mixed effects models with application to HCV data. Comput. Stat. Data Anal. 2011, 55, 2677–2697.
  32. Sakamoto, W. Bias-reduced marginal Akaike information criteria based on a Monte Carlo method for linear mixed-effects models. Scand. J. Stat. 2019, 46, 87–115.
  33. Greven, S.; Kneib, T. On the behaviour of marginal and conditional AIC in linear mixed models. Biometrika 2010, 97, 773–789.
  34. Belitz, C.; Lang, S. Simultaneous selection of variables and smoothing parameters in structured additive regression models. Comput. Stat. Data Anal. 2008, 53, 61–81.
  35. Reiss, P.T.; Todd Ogden, R. Smoothing parameter selection for a class of semiparametric linear models. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 2009, 71, 505–523.
  36. Wood, S.N. Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 2011, 73, 3–36.
  37. Marra, G.; Wood, S.N. Practical variable selection for generalized additive models. Comput. Stat. Data Anal. 2011, 55, 2372–2387.
  38. Wood, S.N.; Li, Z.; Shaddick, G.; Augustin, N.H. Generalized additive models for gigadata: Modeling the UK black smoke network daily data. J. Am. Stat. Assoc. 2017, 112, 1199–1210.
  39. Felson, M. Crime and Everyday Life: Insights and Implications for Society (The Pine Forge Press Social Science Library); Pine Forge: Berks, PA, USA, 1994.
  40. Farrell, G. Preventing repeat victimization. Crime Justice 1995, 19, 469–534.
  41. Johnson, S.D. Repeat burglary victimisation: A tale of two theories. J. Exp. Criminol. 2008, 4, 215–240.
  42. Caplan, J.M.; Kennedy, L.W.; Miller, J. Risk terrain modeling: Brokering criminological theory and GIS methods for crime forecasting. Justice Q. 2011, 28, 360–381.
  43. Ranson, M. Crime, weather, and climate change. J. Environ. Econ. Manag. 2014, 67, 274–302.
  44. Harada, Y.; Shimada, T. Examining the impact of the precision of address geocoding on estimated density of crime locations. Comput. Geosci. 2006, 32, 1096–1107.
  45. Wand, M.P.; Jones, M.C. Multivariate plug-in bandwidth selection. Comput. Stat. 1994, 9, 97–116.
  46. Yu, O.; Zhang, L. The under-recording of crime by police in China: A case study. Police Int. J. 1999, 22, 252–264.
  47. Tabarrok, A.; Heaton, P.; Helland, E. The measure of vice and sin: A review of the uses, limitations, and implications of crime data. Handb. Econ. Crime 2010, 3, 53–81.
  48. Farrell, G.; Phillips, C.; Pease, K. Like taking candy-why does repeat victimization occur. Br. J. Criminol. 1995, 35, 384–399. [Google Scholar] [CrossRef]
  49. Farrell, G.; Pease, K. Repeat Victimization; Criminal Justice Press: New York, NY, USA, 2001. [Google Scholar]
  50. Gelfand, A.E.; Kim, H.J.; Sirmans, C.F.; Banerjee, S. Spatial modeling with spatially varying coefficient processes. J. Am. Stat. Assoc. 2003, 98, 387–396. [Google Scholar] [CrossRef]
  51. Huang, B.; Wu, B.; Barry, M. Geographically and temporally weighted regression for modeling spatio-temporal variation in house prices. Int. J. Geogr. Inf. Sci. 2010, 24, 383–401. [Google Scholar] [CrossRef]
  52. Fotheringham, A.S.; Crespo, R.; Yao, J. Geographical and temporal weighted regression (GTWR). Geogr. Anal. 2015, 47, 431–452. [Google Scholar] [CrossRef] [Green Version]
  53. Mohler, G. Modeling and estimation of multi-source clustering in crime and security data. Ann. Appl. Stat. 2013, 7, 1525–1539. [Google Scholar] [CrossRef] [Green Version]
  54. Kajita, M.; Kajita, S. Crime prediction by data-driven Green’s function method. Int. J. Forecast. 2020, 36, 480–488. [Google Scholar] [CrossRef] [Green Version]
  55. Hastie, T.; Tibshirani, R.; Wainwright, M. Statistical Learning with Sparsity: The Lasso and Generalizations; CRC Press: New York, NY, USA, 2015. [Google Scholar]
  56. Fan, J.; Li, R. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 2001, 96, 1348–1360. [Google Scholar] [CrossRef]
  57. Cooley, D. Extreme value analysis and the study of climate change. Clim. Chang. 2009, 97, 77. [Google Scholar] [CrossRef]
Figure 1. Examples of the coefficients we will consider. S&NVC (spatially and non-spatially varying coefficient) is the sum of SVC (spatially varying coefficient) and NVC (non-spatially varying coefficient). As illustrated here, SVC and NVC are defined by non-linear functions in a geographical space and a feature space, respectively.
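In equation form (generic notation chosen here for illustration; the symbols are not necessarily those used elsewhere in the paper), the coefficient of covariate p for observation i decomposes into a constant, a spatial smooth, and a non-spatial smooth:

```latex
\beta_{i,p}
  = \underbrace{b_p}_{\text{constant}}
  + \underbrace{f_p^{(s)}(s_i)}_{\text{SVC: smooth over location } s_i}
  + \underbrace{f_p^{(n)}(x_{i,p})}_{\text{NVC: smooth over covariate value } x_{i,p}} .
```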
Figure 2. Coefficients generated from the assumed data-generating processes. The patterns change across iterations.
Figure 3. Root mean squared errors (RMSEs) of the coefficient estimates. N denotes the sample size. P = 1 denotes a data-generating process (Equations (7)–(10)) with one constant coefficient, one SVC, and one NVC, while P = 3 denotes a process with three constant coefficients, three SVCs, and three NVCs.
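For reference, the RMSE summarized in Figure 3 follows the standard definition assumed here (the paper may additionally average over Monte Carlo iterations), computed for each coefficient over the N local estimates:

```latex
\mathrm{RMSE}\!\left(\hat{\beta}_p\right)
  = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\hat{\beta}_{i,p} - \beta_{i,p}\right)^{2}} .
```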
Figure 4. Bias of the standard error estimates.
Figure 5. RMSE of the standard error estimates.
Figure 6. Boxplot of the RMSEs for the coefficient estimates (LM: linear model; S&NVC: S&NVC model; Simple: simple S&NVC model selection; Mgcv: Mgcv without model selection; Mgcvsel: Mgcv with double-penalty-based model selection).
Figure 7. Computation time. See Figure 6 for the model names. Computations were run on a Mac Pro (3.5 GHz, 6-core Intel Xeon E5 processor with 64 GB of memory) using R (version 4.0.0; https://cran.r-project.org/) for model estimation.
Figure 8. Crime density by minor municipal district (number of incidents/km²), first quarter of 2017. Lines denote railways other than subways. The black squares in the top panel are Shinjuku Station (north-west) and Tokyo Station (south-east), which are major railway stations.
Figure 9. Histograms of the logged crime density per district per quarter.
Figure 10. Estimated S&NVC on Repeat.
Figure 11. Estimated NVCs. Solid lines represent coefficient estimates and the grey areas represent 95% confidence intervals. For Repeat, NVC is extracted from S&NVC.
Figure 12. Number of bicycle theft cases in the first quarter of 2019 (labeled "True") and the predicted results.
Figure 13. Number of shoplifting cases in the first quarter of 2019 (labeled "True") and the predicted results.
Table 1. Specifications for $\{E_p, V_p(\theta_p), \theta_p\}$.

Model type | $E_p$                 | $V_p(\theta_p)$                                                                      | $\theta_p$
Constant   | $0$                   | $0$                                                                                  | N.A.
SVC        | $E^{(s)}$             | $\tau_p^{(s)}\sigma \Lambda_{\alpha_p}$                                              | $\{\tau_p^{(s)}\sigma,\ \alpha_p\}$
NVC        | $E^{(n)}$             | $\tau_p^{(n)}\sigma I_p$                                                             | $\tau_p^{(n)}\sigma$
S&NVC      | $[E^{(s)},\ E^{(n)}]$ | $\mathrm{blockdiag}\!\left(\tau_p^{(s)}\sigma \Lambda_{\alpha_p},\ \tau_p^{(n)}\sigma I_p\right)$ | $\{\tau_p^{(s)}\sigma,\ \alpha_p,\ \tau_p^{(n)}\sigma\}$
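The structure in Table 1 can be illustrated with a small simulation. The R sketch below is not the authors' spmoran implementation; it builds a spatial basis E^(s) from Moran-type eigenvectors of a doubly centered distance kernel and a non-spatial basis E^(n) from natural splines, then composes the four coefficient types. The kernel range, basis sizes, and variance parameters are arbitrary choices for illustration only.

```r
set.seed(1)
n  <- 200
xy <- cbind(runif(n), runif(n))          # point coordinates s_i
x  <- runif(n)                           # covariate x_ip

## Spatial basis E^(s): leading eigenvectors of the doubly centered kernel MCM
D   <- as.matrix(dist(xy))
C   <- exp(-D / 0.25)                    # exponential kernel; range chosen ad hoc
M   <- diag(n) - matrix(1 / n, n, n)
eig <- eigen(M %*% C %*% M, symmetric = TRUE)
Es  <- eig$vectors[, 1:10]               # 10 leading (positive-eigenvalue) vectors

## Non-spatial basis E^(n): natural cubic splines of the covariate value
En <- splines::ns(x, df = 5)

## The four coefficient types of Table 1
b_const <- 1.0                                        # constant coefficient
b_svc   <- b_const + Es %*% rnorm(10, sd = 0.3)       # SVC: varies over space
b_nvc   <- b_const + En %*% rnorm(5,  sd = 0.3)       # NVC: varies with x
b_snvc  <- b_const + Es %*% rnorm(10, sd = 0.3) +
                     En %*% rnorm(5,  sd = 0.3)       # S&NVC: sum of both parts

## A response generated under the S&NVC specification
y <- b_snvc * x + rnorm(n, sd = 0.1)
```

Retaining only the leading positive-eigenvalue vectors mirrors the Moran eigenvector idea of representing positive spatial dependence with a low-rank basis, which is what keeps this kind of model scalable.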
Table 2. Summary of coefficient estimates.

Bicycle theft
               Intercept  Repeat  RepOther  Popden  Retail  Fpopden  UnEmp  Univ
Minimum        −0.469     0.702   0.144     0.008   0.014   −0.015   0.028  −0.035
1st quartile   −0.387     0.765   0.160     0.017
Median         −0.313     0.787   0.188     0.021
3rd quartile   −0.283     0.805   0.208     0.025
Maximum        −0.208     0.872   0.279     0.031

Shoplifting
               Intercept  Repeat  RepOther  Dpopden    Retail     Fpopden  UnEmp  Univ
Minimum        −0.312     0.854   0.125     8.18×10⁻⁵  5.83×10⁻⁴  −0.027   0.135  0.035
1st quartile   −0.295     0.886
Median         −0.286     0.894
3rd quartile   −0.261     0.903
Maximum        −0.226     0.934
Table 3. Proportion of each statistical significance level for the coefficients.

Bicycle theft
            Intercept  Repeat  RepOther  Popden  Retail  Fpopden  UnEmp  Univ
10% level   0.000      0.000   0.000     0.000   0.000   0.000    0.000  0.000
5% level    0.000      0.000   0.000     0.000
1% level    1.000      1.000   1.000     1.000

Shoplifting
            Intercept  Repeat  RepOther  Dpopden  Retail  Fpopden  UnEmp  Univ
10% level   0.000      0.000   0.000     0.000    0.000   0.000    0.000  0.000
5% level    0.000      0.000   0.000     0.000
1% level    1.000      1.000   1.000     1.000
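Proportions such as those in Table 3 can be tabulated from local coefficient estimates and their standard errors. The helper below is a hypothetical illustration, not part of spmoran or the paper's code; it assumes two-sided z-tests and that the three rows are mutually exclusive significance bands, as the 0/1 pattern in the table suggests.

```r
## Share of districts whose local coefficient falls in each significance band,
## based on a two-sided z-test (illustrative helper only)
prop_significance <- function(est, se) {
  p <- 2 * pnorm(-abs(est / se))
  c("10% level" = mean(p < 0.10 & p >= 0.05),
    "5% level"  = mean(p < 0.05 & p >= 0.01),
    "1% level"  = mean(p < 0.01))
}

## Example with simulated local estimates (e.g., an SVC over 100 districts)
est <- rnorm(100, mean = 0.8, sd = 0.05)
se  <- rep(0.05, 100)
prop_significance(est, se)
```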
Table 4. Estimated group effects.

Bicycle theft
                    Estimate  Standard error  t-value
January–March       −0.104    0.058           −1.812
April–June           0.075    0.058            1.292
July–September       0.028    0.057            0.493
October–December     0.001    NA              NA
