2.1. A Brief Review of Crime Prediction Models
A large number of different crime models have been developed over time. It is not possible to include all of them in a brief review, and more comprehensive reviews exist, e.g., [9]. Here, we aim to highlight models that exemplify some of the important types of models that have been proposed.
Early approaches were primarily based on time series analysis and attempted to study how crime rates, as well as factors that could influence crime (e.g., unemployment rates, drug use, deterrence and legislative changes), evolved over time in order to explain the level of crime, e.g., [10,11,12,13]. Such models have limited utility as predictive models because the causal structure is often weak or partially incorrect [14] and because they make community-level predictions, which may not be as actionable as models that make space- and time-specific predictions.
The terms retrospective and prospective have been used to classify crime models [15,16]. While such a classification has no formal statistical meaning, the terminology is useful for distinguishing models based on the predictive rationale used.
Retrospective models use past crime data to predict future crime. These include hotspot-based approaches, which assume that yesterday's hotspots are also tomorrow's hotspots. This assumption has empirical justification: research has shown that while hotspots may flare up and cool down over relatively short periods of time, they tend to recur in the same places [17]. Hotspot models have typically been purely spatial, not explicitly accounting for temporal variation, e.g., [10], so seasonal or cyclical patterns could be missed. Retrospective time series models have also been proposed, e.g., [18,19], and while the more complex of these methods can capture various patterns in crime over time, they also become increasingly less user friendly and have to be aggregated to a community level [20], limiting their use for informing patrol patterns.
Prospective models use not just past data but attempt to understand the root causes of crime and build a mathematical relationship between those causes and the level of crime. Prospective models are based on criminological theories and model the likely prospect of crime from its underlying causes. It is therefore expected that these models may be more meaningful and provide predictions that are more 'enduring' [15]. Prospective models developed so far are based on either socio-economic factors (e.g., RTM [15]) or the near-repeat phenomenon (e.g., Promap [16]; PredPol [21]). The term near-repeat refers to the widely observed phenomenon (especially in relation to crimes such as burglary) where a property, or neighboring properties or places, is targeted again shortly after the first crime incident [16].
Employing a near-repeat approach, Johnson et al. [22] modeled the near-repeat phenomenon (i.e., for how far and for how long there is an increased risk of crime) and produced a predictive model named Promap. Mohler et al. [21] modeled the near-repeat phenomenon using self-exciting point processes, which had earlier been used to predict earthquake aftershocks; this model is available in the software package PredPol. While these two models consider the near-repeat phenomenon, they do not consider longer-term historical data that captures the overarching spatial and temporal patterns. They also do not take into account the socio-demographic factors that can result in crime, or the long-term changing dynamics of suburbs/communities.
In contrast, Risk Terrain Modeling (RTM) [15] combines a number of socio-demographic and environmental factors using a regression-based model to predict the likelihood of crime in each grid cell. However, this model does not consider historical crime data and thus may not accurately capture the overarching spatial and temporal patterns in crime. It also does not take near repeats into account and thus does not consider short-term risks at specific locations. A meta-analysis [23] found that RTM is an effective forecasting method for a number of different crime types. However, research has also demonstrated that RTM can be less accurate than machine learning methods, such as Random Forest, that better model the complexity of interactions between input variables [24].
Ratcliffe et al. [25] argue that a model that includes both short-term (near-repeat) and long-term (socio-demographic factors and past crime data) components has superior 'parsimony and accuracy' compared to models that include only one of them. While this argument is logical, their assertion is based on comparing models using their BIC (Bayesian Information Criterion) values. Although BIC is a standard statistical measure for comparing models, it measures how well a given model 'fits' or 'explains' the data (e.g., [26]) and does not directly measure predictive accuracy (e.g., [27]). We discuss this point further in Section 2.2. The assertion made by Ratcliffe et al. [25] therefore remains to be verified.
In recent years, several attempts have been made to build predictive crime models using artificial neural network-based machine learning algorithms [6,28,29,30]. These studies report encouraging results, indicating that neural-network-based models could play an important role in predicting crime in the future. However, neural networks are often considered 'black box' models, and a common criticism of such models is that they cannot explain causal relationships. Thus, while a neural network model may be able to predict crime with good accuracy, it may not be able to highlight the underlying causal factors and could lack transparency in how it works.
Lee et al. [1] argued that transparency in exactly how an algorithm works is just as important a criterion as predictive accuracy and operational efficiency. They point out that many of the available crime models are complex, proprietary and lack transparency, and they propose a new Excel-based algorithm that is fully transparent and editable. It combines the principle of population heterogeneity in the space–crime context with the principle of state dependence (near-repeat victimization). The authors claim their algorithm outperforms existing crime models on operational efficiency, but not on accuracy; however, they point to further improvements that could potentially lead to better accuracy.
While individual authors have argued the strengths of their respective methods, there have been few independent comparative evaluations. Perry [31] and Uchida [32] concluded that the statistical techniques used in predictive crime analytics are largely untested and are yet to be evaluated rigorously and independently. Bennett Moses and Chan [33] reviewed the assumptions made when using predictive crime models and the issues regarding their evaluation and accountability, concluding by emphasizing the need to develop better understanding, testing and governance of predictive crime models. Similarly, Meijer and Wessels [34] concluded that the current thrust of predictive policing initiatives is based on convincing arguments and anecdotal evidence rather than on systematic empirical research, and called for independent tests to assess the benefits and drawbacks of predictive policing models. Most recently, a systematic review of spatial crime models [9] concluded that studies often lack a clear report of study experiments and feature engineering procedures, and use inconsistent terminology to address similar problems. The findings of a recent randomized experiment [35] suggested that the use of predictive policing software can reduce certain types of crime, but also highlighted the challenges of estimating and preventing crime in small areas. Collectively, these studies support the need for a robust, comprehensive and independent evaluation of predictive crime models.
2.2. Measures for Comparing Crime Models
Focusing crime prevention efforts effectively relies on identifying models that accurately forecast where crime is likely to occur. Note, however, that the reported accuracy of a model depends on the data to which it was applied: it could be more or less accurate on a different dataset. More importantly, the accuracy also depends on how it is measured, and not all measures are created equal.
Criminology research has employed both standard statistical measures and measures developed specifically for predictive crime models. There is, however, little consensus as to how best to measure and compare the performance of predictive models [36]. Here, we provide a brief review of some of these measures. It is not an exhaustive review, but it highlights measures that exemplify the different types of approaches that have been proposed.
Some of these measures assess the ability of predictive models to accurately predict crime (i.e., whether the predictions come true), while others assess their ability to yield operationally efficient patrolling patterns: minimizing patrol distance whilst maximizing potential prevention gain.
In predictive modeling, one typically uses two sets of data: a training dataset (the data used to fit the model) and a testing dataset (the data used to test the model's predictions). Model assessment is typically based on comparing predictions derived from the training dataset to the crimes observed in the testing dataset. In applications such as crime modeling, where the process evolves over time, the training and testing datasets typically correspond to data from two distinct time periods. To account for variability in predictive accuracy over time, the reported value is often the average value of the measure over several test time periods. Using separate testing data ensures that predictive accuracy is correctly measured.
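The train/test procedure just described can be sketched as a rolling evaluation: fit on the periods seen so far, score predictions on the next period, and average the scores. This is a hedged sketch only; `fit` and `accuracy` are hypothetical placeholders, not any specific crime model's interface, and the counts are toy data.

```python
# A hedged sketch of rolling train/test evaluation over time periods.
# `fit` and `accuracy` are hypothetical placeholders (assumptions), not any
# specific crime model's API; the per-period counts are toy data.

def rolling_average_accuracy(periods, fit, accuracy):
    """periods: per-period datasets ordered in time; returns mean test score."""
    scores = []
    for t in range(1, len(periods)):
        model = fit(periods[:t])                    # train on earlier periods
        scores.append(accuracy(model, periods[t]))  # test on the next period
    return sum(scores) / len(scores)                # average over test periods

# Toy usage: the "model" is the mean of past counts; accuracy is the negative
# absolute error against the test period's mean count.
periods = [[3, 5], [4, 6], [5, 7], [6, 8]]
fit = lambda hist: sum(sum(p) for p in hist) / sum(len(p) for p in hist)
accuracy = lambda m, test: -abs(m - sum(test) / len(test))
avg_score = rolling_average_accuracy(periods, fit, accuracy)
```

Averaging over several test periods, as in the text, smooths out the period-to-period variability in any single test score.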
One of the reasons why standard statistical model fitting measures such as AIC, BIC and R² cannot measure the predictive accuracy of a model very accurately [27] is that they only use the training data. Their main objective is to measure how well a model explains a given set of data, not how well it can predict unseen or future data. Because we specifically focus on measures of predictive accuracy or operational efficiency, we do not include measures of model fit, such as the BIC, in this review.
The average logarithmic score (ALS) was first proposed by Good [37] and later advocated by Gneiting et al. [38] for its technical mathematical properties. ALS computes the average joint probability of observing the testing data under a given model. In simple terms, if the ALS for model A is higher than the ALS for model B, then model A is more likely to produce the testing data than model B. Thus, ALS directly measures the predictive accuracy of a model:

ALS = (1/n) ∑_{i=1}^{n} log p(y_i | θ),

where y_1, …, y_n denote the testing data, θ denotes the model parameters and p(y_i | θ) denotes the probability of observing the testing observation y_i under the model. ALS was used [39] to measure the predictive accuracy of a spatio-temporal point process model to predict ambulance demand in Toronto, Canada. Similar models have been used for crime (e.g., PredPol [21]). Thus, ALS is a natural candidate to measure accuracy for such crime models.
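As a minimal sketch, the ALS of a model that assigns a probability to each held-out observation is just the mean log-probability of the testing data. The Poisson models and toy counts below are illustrative assumptions, not data or models from the cited studies.

```python
import math

# A minimal sketch of the average logarithmic score (ALS): the mean
# log-probability of the testing observations under a candidate model.
# The Poisson rates and the toy counts are illustrative assumptions.

def average_log_score(test_data, prob):
    """ALS = (1/n) * sum_i log p(y_i); higher is better."""
    return sum(math.log(prob(y)) for y in test_data) / len(test_data)

def poisson_pmf(lam):
    """Probability mass function of a Poisson(lam) count model."""
    return lambda y: math.exp(-lam) * lam ** y / math.factorial(y)

test_counts = [2, 0, 3, 1, 2]                             # held-out counts
als_a = average_log_score(test_counts, poisson_pmf(1.6))  # model A: rate 1.6
als_b = average_log_score(test_counts, poisson_pmf(4.0))  # model B: rate 4.0
# Model A matches the testing data better, so its ALS is higher.
```

Comparing `als_a` and `als_b` implements the rule in the text: the model under which the testing data are more probable wins.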
Another approach to assessing the predictive accuracy of a model is to look at the distribution of True Positives (TPs), False Positives (FPs), True Negatives (TNs) and False Negatives (FNs). Several of the accuracy measures proposed in the crime literature are indeed based on one or more of these quantities. These include the 'hit rate' [4,6,21,22,31,36,40,41,42,43], which is the proportion of crimes that were correctly predicted by the model out of the total number of crimes committed in a given time period (TP/(TP + FN)), and is typically applied to hotspots identified by the model. Similarly, a measure termed 'precision', defined as the proportion of crimes that were correctly predicted by the model out of the total number of crimes predicted by the model (TP/(TP + FP)), has also been proposed [6,44]. Finally, a measure termed 'predictive accuracy' (PA), which measures the proportion of crimes correctly classified out of the total number of crimes ((TP + TN)/(TP + FP + TN + FN)), has also been used [45,46,47].
Note that while the terms used, namely hit rate, precision and predictive accuracy, may appear to be novel, the measures themselves are well established in the statistical literature (e.g., [48]). Hit rate refers to what is commonly known as the 'sensitivity' of the model, whereas precision refers to what is commonly called the 'positive predictive value', or PPV. Finally, predictive accuracy refers to what is often referred to simply as 'accuracy' [48].
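The confusion-matrix measures above follow directly from the four cell counts. The sketch below uses the statistical names noted in the text; the counts themselves are illustrative, not taken from any cited study.

```python
# Minimal sketches of the confusion-matrix measures discussed above
# (hit rate = sensitivity, precision = positive predictive value,
# predictive accuracy = accuracy). The cell counts are illustrative.

def hit_rate(tp, fn):
    """Sensitivity: TP / (TP + FN)."""
    return tp / (tp + fn)

def precision(tp, fp):
    """Positive predictive value (PPV): TP / (TP + FP)."""
    return tp / (tp + fp)

def predictive_accuracy(tp, fp, tn, fn):
    """Accuracy: (TP + TN) / (TP + FP + TN + FN)."""
    return (tp + tn) / (tp + fp + tn + fn)

def false_positive_rate(fp, tn):
    """1 - specificity: FP / (FP + TN), the x-axis of a ROC curve."""
    return fp / (fp + tn)

tp, fp, tn, fn = 30, 10, 50, 10   # illustrative cell counts
```

Evaluating `hit_rate` and `false_positive_rate` at a series of classification thresholds yields the points of a ROC curve.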
A contingency table approach, which takes into account all four quantities (TP, TN, FP and FN), has also been used [19], and statistical tests of association on such contingency tables have also been applied [15,25,42]. Rummens et al. [6] used receiver operating characteristic (ROC) analysis, which is fairly common in other domains such as science and medicine (e.g., [44,48]). ROC analysis plots the hit rate (sensitivity) against the false positive rate, 1 − specificity (FP/(TN + FP)), at different classification thresholds.
A related measure is the Predictive Accuracy Index (PAI) [4], which is defined as

PAI = (n/N) / (a/A),

where a denotes the area of the hotspot/s, A the total area under consideration, n the number of crimes observed in the hotspot/s and N the total number of crimes observed in the area under consideration. Thus, a/A denotes the relative area, namely the proportion of area covered by the hotspot/s, and n/N denotes the hit rate, namely the proportion of crimes in that hotspot/s. The hit rate does not take into account the operational efficiency associated with patrolling the identified hotspot. For example, a large hotspot may have a high hit rate simply because it accounts for more crime, yet such a hotspot will have very little practical value in terms of preventing crime because it may not be effectively patrolled. PAI overcomes this drawback by scaling the hit rate by the coverage area: if two hotspots have a similar hit rate, the one with the smaller coverage area will have the higher PAI. Thus, PAI factors in both the predictive accuracy and the operational efficiency of the model. PAI has been widely used in the crime literature [6,36,41,49,50,51,52,53].
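As a minimal sketch, PAI (the hit rate scaled by the relative area) can be computed as follows; the crime counts and areas are illustrative.

```python
# A minimal sketch of the Predictive Accuracy Index: the hit rate (n/N)
# divided by the relative area (a/A). All numbers are illustrative.

def pai(n, N, a, A):
    """n: crimes in the hotspot/s; N: total crimes;
    a: hotspot area; A: total area under consideration."""
    return (n / N) / (a / A)

# Two hotspot sets with the same hit rate (40% of crimes) but different areas:
small = pai(n=40, N=100, a=2.0, A=100.0)    # covers 2% of the area
large = pai(n=40, N=100, a=10.0, A=100.0)   # covers 10% of the area
# The smaller set earns the higher PAI, reflecting its operational efficiency.
```

This shows the trade-off discussed above: equal hit rates, but the more compact prediction is rewarded.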
Several other attempts have been made to incorporate operational efficiency into a measure. Bowers et al. [40] proposed measures such as the Search Efficiency Rate (SER), which measures the number of crimes successfully predicted per km², and the Area-to-Perimeter Ratio (APR), which measures how compact a hotspot is and gives higher scores to more compact hotspots. Hotspots may be compact, but if they are evenly dispersed over a wide area, they would still be operationally more difficult to patrol than if they were clumped together. The Clumpiness Index (CI; [1,36,43,54]) attempts to solve this problem by measuring the dispersion of the hotspots: a model that renders hotspots clustered together will achieve a higher CI score than a model that predicts more dispersed hotspots. The Nearest Neighbour Index (NNI; [22,55]) provides an alternative approach to measuring dispersion, based on the nearest neighbour clustering algorithm.
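The published definitions of SER, APR and NNI are given in the cited works; the sketch below follows the common textbook formulations (for the NNI, the Clark–Evans form: observed mean nearest-neighbour distance divided by 0.5·sqrt(A/n), its expectation under complete spatial randomness). All coordinates and values are illustrative assumptions.

```python
import math

# Hedged sketches of the operational-efficiency measures named above; the
# authoritative definitions are in the cited works. The NNI follows the
# standard Clark-Evans form. All inputs are illustrative.

def search_efficiency_rate(hits, hotspot_area_km2):
    """SER: crimes successfully predicted per km^2 of hotspot area."""
    return hits / hotspot_area_km2

def area_perimeter_ratio(area, perimeter):
    """APR: larger for more compact hotspots."""
    return area / perimeter

def nearest_neighbour_index(points, region_area):
    """NNI < 1 indicates clustering; NNI > 1 indicates dispersion."""
    n = len(points)
    observed = sum(
        min(math.dist(p, q) for q in points if q is not p) for p in points
    ) / n
    expected = 0.5 * math.sqrt(region_area / n)  # mean NN distance under CSR
    return observed / expected

hotspot_centres = [(0.0, 0.0), (0.1, 0.1), (0.0, 0.2), (9.0, 9.0), (9.1, 9.2)]
nni = nearest_neighbour_index(hotspot_centres, region_area=100.0)  # clustered
```

Here the five hotspot centres form two tight clumps in a 100-unit region, so the NNI comes out well below 1.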
A model that predicts hotspots that change little over consecutive time periods may be operationally preferred over a model whose predicted hotspots vary more. Measures have been proposed to quantify the variation of the hotspots over time, including the Dynamic Variability Index (DVI; [36]) and the Recapture Rate Index (RRI; [41,49,50,51,52,53,56]). One advantage of the DVI is that it is straightforward to calculate and does not require specialized software. However, if actual crime exhibits spatial variation over time, then one would expect a good predictive model to capture it, and hence the DVI would be higher for that model than for (say) another model that did not capture this variation and thus had lower predictive accuracy. Measures such as the DVI therefore need to be considered in conjunction with the predictive accuracy of the model, not independently. Finally, Complementarity [36,57] is a visual method to investigate how a number of different crime models complement each other by predicting different crime hotspots. Here, a Venn diagram is used to display the hotspots that are jointly predicted by all the models, as well as the hotspots that are uniquely predicted by each individual model.
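The DVI and RRI formulas are given in the cited works; as a purely illustrative proxy for hotspot stability over time (not the published definitions), the overlap between the hotspot cell sets of consecutive periods can be measured with a Jaccard index. The grid-cell identifiers below are hypothetical.

```python
# Illustrative proxy for hotspot stability across consecutive periods:
# Jaccard overlap of the predicted hotspot cell sets. This is NOT the
# published DVI or RRI; cell identifiers are hypothetical.

def jaccard_overlap(cells_t, cells_next):
    """Share of hotspot cells common to two consecutive periods (1 = identical)."""
    a, b = set(cells_t), set(cells_next)
    return len(a & b) / len(a | b)

stable = jaccard_overlap({"c1", "c2", "c3"}, {"c1", "c2", "c3"})    # identical
variable = jaccard_overlap({"c1", "c2", "c3"}, {"c3", "c4", "c5"})  # mostly new
# High overlap signals hotspots that change little between periods.
```

As the text cautions, a high stability score is only desirable when the actual crime pattern is itself stable.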
Some measures may arguably be superior to others because they account for more aspects of accuracy. For example, PAI could be considered superior to the hit rate because it also accounts for the corresponding hotspot area. Other measures, however, capture different aspects of accuracy: as mentioned above, hit rate equates to sensitivity, while precision equates to PPV. In such cases, which measure is more appropriate depends on the particular application and the subjective opinions of the analysts. Rather than aiming to find just one measure, we recommend using multiple measures to ensure that all aspects of accuracy are assessed. Kounadi et al. [9] also argued in favor of including complementary measures. In fact, as we illustrate later in this paper, some measures can be combined in the desired way using an expected utility function; the model with the highest expected utility can be considered the best model.
The predictive accuracy of a given model will vary over time because the data used to build the model and the actual number of crimes that occurred during the prediction period both vary with time. In practice, therefore, the accuracy obtained over time has to be summarized somehow. The mean value is important but does not capture the variation in accuracy over time, so it is also important to consider the standard deviation of the accuracy.
A final issue is testing whether the differences in accuracy (however measured) between models are statistically significant. Adepeju et al. [36] employed the Wilcoxon signed-rank (WSR) test to compare the predictive performance of two different models over a series of time periods. The WSR test is a non-parametric hypothesis test that can be applied to crime models under the assumption that the difference in the predictive accuracy of the two methods is independent of the underlying crime rate. When comparing multiple models, a correction method such as Bonferroni's has to be applied to ensure that the probability of false positives (in relation to whether a difference is significant) is maintained at the desired (usually 5%) level.
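The paired comparison over time periods can be sketched as follows. This is a minimal standard-library implementation of the WSR test (two-sided, normal approximation, assuming no zero or tied differences) together with a Bonferroni correction; `scipy.stats.wilcoxon` provides a vetted implementation with exact p-values. The accuracy series are illustrative.

```python
import math

# Minimal stdlib sketch of the Wilcoxon signed-rank test (two-sided, normal
# approximation; assumes no zero or tied differences) plus a Bonferroni
# correction. The per-period accuracy series are illustrative.

def wilcoxon_signed_rank_p(x, y):
    """Approximate two-sided p-value for paired accuracy series x and y."""
    d = [a - b for a, b in zip(x, y) if a != b]   # drop zero differences
    n = len(d)
    order = sorted(range(n), key=lambda i: abs(d[i]))  # rank by |difference|
    w_plus = sum(rank + 1 for rank, i in enumerate(order) if d[i] > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    return math.erfc(abs(z) / math.sqrt(2))       # two-sided normal tail

def bonferroni_reject(p_values, alpha=0.05):
    """Reject each hypothesis at the corrected level alpha / m."""
    m = len(p_values)
    return [p < alpha / m for p in p_values]

model_a = [0.62, 0.58, 0.65, 0.60, 0.63, 0.59, 0.64, 0.61]  # accuracy per period
model_b = [0.61, 0.56, 0.62, 0.56, 0.58, 0.53, 0.57, 0.53]
p = wilcoxon_signed_rank_p(model_a, model_b)   # small p: A consistently beats B
```

With several pairwise comparisons, each p-value is judged against the Bonferroni-corrected threshold alpha/m rather than alpha itself.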