1. Introduction
With the continuous advancement of society and the steady improvement in living standards, public expectations regarding social security and public safety have grown substantially. Modern governance systems are therefore required to deliver more effective, timely, and comprehensive protection. In this context, enhancing the capacity of public safety governance under resource constraints has become a pressing challenge for governments and law enforcement agencies.
To address this challenge, advanced information technologies and data-driven analytical approaches have been increasingly integrated into public safety decision-making processes [1,2,3]. By facilitating systematic analysis of large-scale heterogeneous data, these technologies improve the efficiency and precision of safety management, strengthen anticipatory capabilities, and alleviate operational and human resource burdens. Consequently, the development of reliable risk forecasting and decision-support systems has emerged as a critical research direction in public safety and intelligent governance.
Among the various dimensions of public safety, criminal activities constitute a primary factor affecting social stability and citizens’ perceived sense of security. Crime events exhibit complex spatial and temporal regularities and are influenced by multiple interacting factors, including temporal dynamics, geographic contexts, population distribution patterns, and socio-economic conditions. Therefore, accurately modeling and forecasting the spatio-temporal distribution of crime is essential for promoting evidence-based and proactive public safety governance.
Despite substantial investments in crime prevention and control, the existing safety governance strategies remain predominantly reactive, with interventions often implemented only after criminal incidents have taken place. Such response-driven approaches face inherent limitations in mitigating risks at an early stage and are frequently associated with high social and operational costs. These limitations highlight the necessity of shifting from passive response mechanisms toward proactive and preventive strategies, thereby further underscoring the importance of effective crime prediction methods.
In practice, constraints related to manpower, budget, and technical capacity make it impractical to maintain consistently intensive interventions across all locations and time periods [4,5,6]. Under these conditions, accurately identifying potential high-risk areas and critical time windows becomes essential for improving decision-making efficiency and maximizing the effectiveness of limited resources. This practical requirement fundamentally underscores the importance of crime forecasting within modern public safety governance frameworks.
Moreover, crime occurrence cannot be attributed to a single causal factor. Theories in environmental criminology emphasize that criminal behavior emerges from interactions among individual behavior, environmental context, and routine social activities, resulting in complex and dynamic mechanisms [7,8]. These characteristics render crime prediction a challenging yet essential task. To effectively capture latent relationships and spatio-temporal dependency structures embedded in crime data, it is necessary to adopt data-based modeling approaches that can jointly represent multiple spatial and temporal factors, thereby providing reliable and anticipatory decision support for public safety governance.
However, unlike general spatio-temporal modeling and prediction tasks, which often assume relatively dense and balanced data distributions, crime forecasting in real-world settings is characterized by pronounced sparsity. For most locations and time periods, no criminal incidents occur, resulting in a large proportion of zero-valued observations. This inherent sparsity poses substantial challenges for conventional predictive models.
In addition to sparsity, different crime data categories exhibit significant heterogeneity and diverse tendencies in terms of their spatial distribution and temporal occurrence. These patterns are rarely uniform in statistical distributions. For instance, in regions with relatively stable public order and favorable socio-economic conditions, property-related crimes such as theft are more prevalent, while violent crimes occur less frequently. In contrast, areas facing greater security challenges often show a higher likelihood of violent incidents and personal injury. These category-specific disparities further complicate crime prediction as models must simultaneously account for uneven distributions across space, time, and crime types.
Figure 1 illustrates the spatial distribution of four crime categories in the public NYC dataset using region-wise histograms. The horizontal axis represents discretized spatial regions, while the vertical axis denotes the total number of crimes occurring within each region. For clarity, regions with zero crime occurrences are omitted from the histogram. As observed, crime incidents are highly unevenly distributed across space: the majority of regions exhibit very low crime frequencies, and a substantial proportion of regions experience no recorded crimes at all. In contrast, only a small number of regions account for a disproportionately large share of crime incidents, forming a pronounced long-tailed distribution.
This spatial imbalance is particularly evident for property-related crimes, such as larceny. As shown in Figure 1b, crime occurrences are extremely concentrated in a few regions, with the most active region alone contributing nearly 95% of the total incidents within this category. Similar, although less extreme, patterns can also be observed for burglary, robbery, and assault, where a limited subset of regions consistently exhibit significantly higher crime frequencies than the rest.
Figure 2 presents the distribution of crime occurrences across space and time in the NYC dataset from two complementary perspectives. The left panel illustrates the crime counts for each region and time unit across four crime categories, where each point corresponds to the number of incidents observed in a specific spatial region at a given time step. As shown, the vast majority of region–time pairs are associated with very low crime counts, often close to one or zero, while only a small fraction exhibit substantially higher values. This distribution highlights the pronounced sparsity and skewness of crime data at fine-grained spatio-temporal resolutions.
The right panel depicts the total number of crimes aggregated over time for each spatial region using a symmetric logarithmic scale. Consistent with the observations at the region–time level, crime occurrences at the regional level also follow a strongly imbalanced long-tailed distribution. A limited number of regions accumulate disproportionately large crime counts over the entire observation period, forming persistent crime hotspots, whereas most regions remain relatively inactive. Notably, this pattern is observed across all four crime categories, with property-related crimes, such as larceny, and violent crimes, such as assault, exhibiting particularly pronounced regional concentration. Together, these results demonstrate that crime data are characterized by simultaneous spatio-temporal sparsity and spatial heterogeneity, posing significant challenges for conventional prediction models.
In addition, the spatial and temporal heterogeneity of crime data is commonly characterized by the tendency of geographically or temporally proximate units to exhibit similar crime patterns. Consequently, most existing machine learning and deep learning approaches focus on modeling local dependencies under the assumption that nearby regions and adjacent time intervals provide the most relevant contextual information. While such locality-based modeling strategies have achieved promising results [9,10,11,12,13], they largely overlook the potential correlations among spatial regions or temporal periods that are distant from each other.
In real-world urban environments, geographically distant areas with similar functional roles (such as commercial districts) may display highly comparable crime patterns despite their physical separation. Similarly, temporally distant periods can share analogous crime cycles due to recurrent human activity rhythms and long-term periodic structures. Neglecting these long-range spatial and temporal similarities may limit the capacity of existing models to fully capture the underlying structure of crime dynamics.
To address these challenges, we propose GL-MoPA, a novel modeling framework with the following contributions:
We employ convolutional neural networks to capture local feature associations among adjacent spatial regions, temporal intervals, and crime categories while leveraging an attention mechanism to model long-range dependencies beyond local neighborhoods.
We aggregate fine-grained regions into a set of coarse-grained prototypes. Each prototype summarizes a group of spatial/temporal units with similar characteristics, allowing the attention mechanism to operate at the level of prototype interactions.
We introduce an occurrence intensity decomposition strategy that explicitly models crime occurrence and crime intensity in a unified framework, imposing constraints on the binary prediction task to improve the model accuracy and robustness on the sparse dataset.
We evaluate the proposed model on a real-world crime dataset collected from New York City. The experimental results show that our approach consistently outperforms state-of-the-art baseline methods across multiple crime categories.
3. Problem Formulation
In this section, we present the problem formulation, which formally defines the crime prediction task and establishes the notation used throughout the paper. The study area is partitioned into a regular grid of size $H \times W$, where H and W denote the number of rows and columns of spatial regions, respectively. Each grid cell corresponds to a fine-grained spatial region. The model input is a three-dimensional tensor $\mathbf{X} \in \mathbb{R}^{R \times T \times K}$, where $R = H \times W$ denotes the total number of spatial regions obtained by flattening the spatial grid, T represents the number of time steps, and K denotes the number of crime categories. Each element $x_{r,t,k}$ corresponds to the observed number of crime incidents of category k occurring in region r at time step t.
The model aims to predict the crime hotspot map at the next time step $T+1$, which is represented as a two-dimensional tensor $\hat{\mathbf{Y}} \in \mathbb{R}^{R \times K}$. Each element $\hat{y}_{r,k}$ denotes the predicted crime intensity of category k in region r at time step $T+1$. More detailed symbol definitions are given in Table 1.
Based on the above notations, our objective is to learn a predictive function $f(\cdot)$ that maps historical crime observations to future crime distributions. Formally, given the observed crime tensor $\mathbf{X} \in \mathbb{R}^{R \times T \times K}$, we aim to estimate the crime intensity map at the next time step,
$$\hat{\mathbf{Y}} = f(\mathbf{X}),$$
where $\hat{\mathbf{Y}} \in \mathbb{R}^{R \times K}$ represents the predicted spatial distribution of crime intensities over all regions and crime categories at time $T+1$.
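To make the notation concrete, the following numpy sketch instantiates the tensors above with small illustrative sizes (the grid, horizon, and category counts here are hypothetical, not the NYC configuration used later):

```python
import numpy as np

# Illustrative sizes (hypothetical, not the NYC configuration):
H, W = 4, 4          # spatial grid rows and columns
R = H * W            # number of flattened regions
T = 7                # historical time steps
K = 4                # crime categories

# Input: observed crime counts x[r, t, k]; a low-rate Poisson draw
# mimics the sparsity of real crime data.
X = np.random.default_rng(0).poisson(0.1, size=(R, T, K))

# Output: predicted intensity map at the next time step, one value
# per (region, category) pair.
Y_hat = np.zeros((R, K))

assert X.shape == (16, 7, 4)
assert Y_hat.shape == (16, 4)
assert (X == 0).mean() > 0.5   # most region-time-category entries are zero
```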
4. Method
The overall architecture of the proposed GL-MoPA framework is illustrated in Figure 3. Given multi-category crime sequences as input, the model is designed to jointly capture local spatial–temporal patterns and global dependency structures while explicitly accounting for data sparsity through an occurrence-aware prediction mechanism.
As shown in Figure 3, the framework consists of four main components: a spatial–temporal embedding layer, a local dependency modeling module, a global dependency modeling module based on prototype-aware attention, and a two-stage prediction head that integrates occurrence estimation with crime intensity forecasting. These components are organized in a global–local cooperative manner, enabling complementary feature learning at different spatial–temporal scales.
To further clarify the operational process of GL-MoPA and illustrate how the framework functions in a real-world forecasting scenario, we provide a step-by-step description of a typical inference procedure for a target spatial region. This example demonstrates how historical crime observations are progressively transformed through different modules of the architecture to generate the final prediction.
1. The historical multi-category crime sequences over the past T days are collected as input for the target region.
2. The spatial–temporal embedding layer transforms the raw sequences into latent feature representations that encode temporal dynamics and spatial grid information.
3. The local dependency modeling module aggregates information from neighboring regions within the 16 × 16 grid, capturing short-range spatial interactions.
4. The global prototype-aware attention mechanism aligns region-level embeddings with learned global crime prototypes to model long-range dependencies.
5. The two-stage prediction head first estimates the crime occurrence probability and then predicts the crime intensity conditioned on the occurrence likelihood.
Through this sequential procedure, GL-MoPA integrates global trend modeling and fine-grained spatial refinement while explicitly addressing data sparsity.
We begin by introducing the spatial–temporal embedding layer, which serves as the foundation for subsequent local and global dependency modeling by transforming raw crime observations into unified latent representations.
4.1. Embedding Layer
Given the inherent sparsity and discreteness of crime count data, directly modeling raw crime values may limit the expressiveness of subsequent feature extraction modules. To address this issue, we introduce a lightweight embedding layer that lifts the original crime observations into a continuous latent space.
Specifically, the raw crime tensor $\mathbf{X} \in \mathbb{R}^{R \times T \times K}$ is first expanded along the channel dimension and then projected into a d-dimensional feature space via a point-wise three-dimensional convolution with kernel size $1 \times 1 \times 1$. This operation performs an independent linear transformation at each region–time–crime location without altering the spatial or temporal resolution. Formally, for each spatio-temporal crime entry $x_{r,t,k}$, the embedding process is defined as
$$\mathbf{e}_{r,t,k} = \phi(x_{r,t,k}),$$
where $\phi(\cdot)$ denotes the learnable embedding function implemented by the $1 \times 1 \times 1$ convolution. As a result, the embedding layer produces an embedded crime tensor $\mathbf{E} \in \mathbb{R}^{R \times T \times K \times d}$.
It is worth noting that this embedding layer is intentionally designed to be structure-agnostic, serving solely as a feature lifting mechanism. By decoupling the embedding stage from subsequent spatial and temporal modeling, the proposed framework allows later modules to flexibly capture both local patterns and long-range dependencies in a unified manner.
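As a minimal sketch of this feature-lifting step, note that a point-wise 1 × 1 × 1 convolution is equivalent to applying one shared linear map to the scalar count at every region–time–category location; the sizes and weights below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
R, T, K, d = 16, 7, 4, 8       # illustrative sizes (hypothetical)

X = rng.poisson(0.1, size=(R, T, K)).astype(float)

# A point-wise (1x1x1) convolution applies the same learnable linear
# map to the scalar count at every (region, time, category) location:
# weight w and bias b lift 1 input channel to d output channels.
w = rng.normal(size=(d,))
b = rng.normal(size=(d,))

E = X[..., None] * w + b       # broadcast: (R, T, K, 1) * (d,) -> (R, T, K, d)

# Spatial/temporal resolution is unchanged; only the channel dim is lifted.
assert E.shape == (R, T, K, d)
```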
4.2. Local Dependency Modeling
To capture fine-grained local patterns in crime dynamics, we design a local modeling module that explicitly decouples spatial and temporal dependencies while preserving interactions across crime categories. Instead of jointly modeling space and time in a single convolutional operation, the proposed module adopts a two-stage design that sequentially models spatial and temporal correlations.
4.2.1. Spatial Modeling
Given the embedded crime tensor $\mathbf{E}$, we first focus on local spatial dependencies among neighboring regions. For each time step, the spatial module applies convolutional filters over adjacent grid cells while jointly considering multiple crime categories. Specifically, our spatial dependency modeling is formulated as
$$\mathbf{S}^{(i)}_{r,t} = \sigma\big(\mathbf{W}^{(i)}_{s} * \mathbf{E}_{\mathcal{N}(r),t,i}\big),$$
where $\mathcal{N}(r)$ represents the spatial neighborhood of region r, $\mathbf{W}^{(i)}_{s}$ is the spatial convolution kernel corresponding to the type-i crime, $*$ represents the convolution operation over the spatial neighborhood, and $\sigma(\cdot)$ is a nonlinear activation. We concatenate all crime types to form the final local spatial crime dependency representation:
$$\mathbf{S} = \mathrm{Concat}\big(\mathbf{S}^{(1)}, \ldots, \mathbf{S}^{(K)}\big). \quad (3)$$
Equation (3) yields a joint representation that integrates local spatial context with crime-type-aware features.
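A minimal numpy sketch of the per-category spatial branch for a single time step, assuming a 3 × 3 neighborhood, zero padding, a ReLU activation, and randomly drawn kernels (all illustrative choices, not the trained model's):

```python
import numpy as np

def spatial_conv(E_grid, kernels):
    """Per-category 3x3 spatial convolution over the H x W grid.

    E_grid  : (K, H, W) feature map for one time step
    kernels : (K, 3, 3) one spatial kernel per crime category
    Returns : (K, H, W) locally aggregated features (zero padding).
    """
    K, H, W = E_grid.shape
    padded = np.pad(E_grid, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(E_grid)
    for i in range(K):                      # one branch per crime type
        for dy in range(3):
            for dx in range(3):
                out[i] += kernels[i, dy, dx] * padded[i, dy:dy + H, dx:dx + W]
    return np.maximum(out, 0.0)             # ReLU nonlinearity

rng = np.random.default_rng(2)
E_grid = rng.random((4, 4, 4))              # K=4 categories on a 4x4 grid
kernels = rng.normal(size=(4, 3, 3))
S = spatial_conv(E_grid, kernels)
assert S.shape == (4, 4, 4)
```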
4.2.2. Temporal Modeling
While spatial modeling reveals where crimes are likely to occur, understanding when they happen requires explicit temporal modeling. To this end, we introduce a temporal convolutional module that captures short-term dependencies in the crime time series. Specifically, the local temporal modeling is formulated as
$$\mathbf{U}^{(i)}_{r,t} = \sigma\big(\mathbf{W}^{(i)}_{\tau} * \mathbf{E}_{r,\mathcal{N}(t),i}\big),$$
where $\mathcal{N}(t)$ is the temporal neighborhood of the current time step t and $\mathbf{W}^{(i)}_{\tau}$ is the temporal convolution kernel of the type-i crime. Similarly, our multi-branch temporal aggregation is expressed as
$$\mathbf{U} = \mathrm{Concat}\big(\mathbf{U}^{(1)}, \ldots, \mathbf{U}^{(K)}\big).$$
To jointly fuse and adaptively adjust the combined spatial–temporal features, we employ a learnable $1 \times 1$ convolution followed by group normalization, producing the output of the local modeling module:
$$\mathbf{F}_{loc} = \mathrm{GN}\big(\mathrm{Conv}_{1 \times 1}([\mathbf{S}; \mathbf{U}])\big).$$
The local feature representation integrates spatial continuity, short-term temporal dependency, and crime-type interactions through convolutional operations, thereby capturing fine-grained spatio-temporal patterns at the region level.
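The temporal branch and the 1 × 1 fusion can be sketched analogously; the kernel width of 3, the random weights, and the omission of group normalization are simplifications for illustration:

```python
import numpy as np

def temporal_conv(E_seq, kernels):
    """Per-category temporal convolution along the time axis.

    E_seq   : (K, T) series for one region
    kernels : (K, 3) one temporal kernel per crime category
    Returns : (K, T) short-term temporal features (zero padding).
    """
    K, T = E_seq.shape
    padded = np.pad(E_seq, ((0, 0), (1, 1)))
    out = np.zeros_like(E_seq)
    for i in range(K):
        for dt in range(3):
            out[i] += kernels[i, dt] * padded[i, dt:dt + T]
    return np.maximum(out, 0.0)

rng = np.random.default_rng(3)
S = rng.random((4, 7))                      # spatial-branch features (K=4, T=7)
U = temporal_conv(rng.random((4, 7)), rng.normal(size=(4, 3)))

# 1x1 fusion: mix the concatenated spatial/temporal channels with a
# learned matrix (group normalization omitted in this sketch).
fused = rng.normal(size=(4, 8)) @ np.concatenate([S, U], axis=0)
assert U.shape == (4, 7)
assert fused.shape == (4, 7)
```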
However, crime dynamics are not solely governed by local proximity. Regions that are geographically distant may still exhibit highly similar crime patterns due to shared functional roles (e.g., commercial or transportation zones), and temporally distant periods may follow comparable crime cycles. Such long-range and non-local dependencies cannot be sufficiently captured by purely convolutional modeling.
4.3. Global Dependency Modeling
In our model, beyond modeling region-level dependencies, we aim to equip the framework with a global perspective that enables it to capture potential correlations among geographically distant regions and temporally separated time steps. Such long-range dependencies are crucial for crime prediction as regions with similar functional roles and time periods following comparable crime cycles may exhibit highly correlated patterns despite being far apart in space or time.
However, conventional self-attention mechanisms typically compute attention scores at a fine-grained region level, which limits their ability to explicitly model interactions among large and irregular spatial structures. Inspired by clustering-based representation learning, we introduce a prototype-based attention mechanism that aggregates fine-grained regional representations into a set of coarse-grained spatial entities, referred to as zones. This design allows the model to focus on relationships among large-scale and irregular areas on the map, thereby facilitating more effective global dependency modeling.
4.3.1. Prototype Attention
We propose a prototype attention mechanism that abstracts the fine-grained spatio-temporal feature representations into a set of learnable prototypes. Each prototype corresponds to an irregular large-scale zone composed of multiple spatial regions and time steps, enabling the model to reason about interactions at the zone level rather than treating each region in isolation. This design encourages the model to move beyond local spatio-temporal points and instead capture latent similarities between geographically distant but functionally analogous regions or temporally separated yet behaviorally similar periods.
Existing prototype-based or clustering-driven approaches typically rely on explicit grouping strategies, such as k-means clustering [36,37,38,39], graph partitioning [40,41,42], or metric-based assignment [43,44], where prototypes are either precomputed or updated through iterative optimization. Such methods often require carefully designed similarity measures and are sensitive to initialization.
In contrast, our approach does not perform explicit clustering. The prototypes are implicitly learned through convolutional aggregation along the spatio-temporal crime dimension, allowing the model to jointly learn both region representations and their assignments to latent zones in an end-to-end manner. This design avoids rigid partitioning and enables flexible data-driven abstraction of large-scale crime patterns.
The input after position encoding is denoted by $\mathbf{Z} \in \mathbb{R}^{R \times T \times K \times d}$, where R represents the number of regions before prototype aggregation, T is the number of time steps, and K is the number of crime categories. We flatten the input into a one-dimensional sequence $\mathbf{Z}' \in \mathbb{R}^{N \times d}$, where $N = R \cdot T \cdot K$. The query, key, and value matrices $\mathbf{Q}$, $\mathbf{K}$, and $\mathbf{V}$ are then obtained via independent linear projections of $\mathbf{Z}'$.
Directly applying self-attention on the flattened sequence may obscure higher-level structural regularities in large-scale urban environments. We abstract fine-grained region–time–crime representations into a compact set of latent prototypes. Each prototype represents an irregular large-scale zone composed of multiple regions and time steps, enabling the model to reason over zone-level dependencies while preserving region-level expressiveness.
Specifically, instead of computing attention directly between $\mathbf{Q}$ and $\mathbf{K}$, we first aggregate the key and value representations into a set of prototypes through a learnable aggregation operator $g(\cdot)$. The resulting prototype representations serve as anchors for global attention, allowing each query to selectively attend to semantically meaningful zones. The prototype aggregation and the subsequent attention computation are defined as follows:
$$\tilde{\mathbf{K}} = g(\mathbf{K}), \qquad \tilde{\mathbf{V}} = g(\mathbf{V}), \qquad \mathbf{A} = \mathrm{softmax}\!\left(\frac{\mathbf{Q}\tilde{\mathbf{K}}^{\top}}{\sqrt{d}}\right), \qquad \mathbf{O} = \mathbf{A}\tilde{\mathbf{V}}.$$
In this manner, each query corresponding to a region–time–crime unit attends to a small number of semantically meaningful prototypes rather than all individual regions. Each element $a_{ij}$ of $\mathbf{A}$ reflects the degree to which the i-th region-level representation attends to the j-th prototype, enabling flexible many-to-many interactions between regions and zones.
The resulting representation encodes global contextual information by integrating zone-level crime patterns into each region-level representation. Specifically, it captures long-range spatial similarity among distant regions, recurring temporal patterns across time steps, and cross-category interactions summarized by the learned prototypes, while preserving the original region-level resolution.
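The prototype attention computation can be sketched as follows; for illustration, the learnable aggregation operator is shown as a dense matrix acting on the flattened sequence, whereas the paper's model learns it through convolutional aggregation:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def prototype_attention(Q, K_, V, agg):
    """Attention against learned prototypes instead of all positions.

    Q, K_, V : (N, d) projections of the flattened region-time-crime
               sequence (N = R*T*K positions)
    agg      : (P, N) aggregation operator summarizing the N keys and
               values into P prototype representations
    Returns  : output (N, d) and attention map A (N, P)
    """
    K_proto = agg @ K_                       # (P, d) prototype keys
    V_proto = agg @ V                        # (P, d) prototype values
    d = Q.shape[1]
    A = softmax(Q @ K_proto.T / np.sqrt(d))  # (N, P) region-to-zone weights
    return A @ V_proto, A

rng = np.random.default_rng(4)
N, d, P = 24, 8, 4                           # illustrative sizes
Q, K_, V = (rng.normal(size=(N, d)) for _ in range(3))
out, A = prototype_attention(Q, K_, V, rng.normal(size=(P, N)))
assert out.shape == (N, d) and A.shape == (N, P)
```

Because each of the N queries attends to only P prototypes, the attention cost grows linearly in N rather than quadratically.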
4.3.2. Adaptive Global Feature Fusion
To integrate global representations with local context, we introduce a lightweight modulation mechanism. Specifically, local features are first transformed into adaptive weights via a sigmoid activation, reflecting the relative importance of region-level information. The global features are then modulated by these weights and combined with the original global representation through a residual connection, followed by a channel-wise mixing operation. This design allows global dependencies to be adaptively adjusted according to local crime characteristics while maintaining stable feature propagation.
The global features are adaptively adjusted under the guidance of local information, and a residual aggregation scheme is employed to ensure stable feature propagation. Subsequently, a $1 \times 1$ convolution is applied to perform channel-wise mixing and re-calibration, yielding the final fused representation.
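A compact sketch of the gating scheme described above, with a plain matrix standing in for the 1 × 1 channel-mixing convolution (shapes and weights are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def adaptive_fusion(local, global_, W_mix):
    """Gate global features by local context, then channel-mix.

    local, global_ : (R, d) local and global feature maps
    W_mix          : (d, d) weights standing in for the 1x1 convolution
    """
    gate = sigmoid(local)           # local features -> weights in (0, 1)
    modulated = global_ * gate      # modulate global features by local gate
    fused = modulated + global_     # residual connection for stability
    return fused @ W_mix            # channel-wise mixing / re-calibration

rng = np.random.default_rng(5)
R, d = 16, 8
F = adaptive_fusion(rng.normal(size=(R, d)), rng.normal(size=(R, d)),
                    rng.normal(size=(d, d)))
assert F.shape == (R, d)
```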
4.4. Two-Stage Crime Prediction
Despite the rich spatio-temporal representations obtained through local modeling and global dependency aggregation, directly regressing crime counts remains challenging due to the inherent sparsity and zero inflation of crime data. In many regions and time intervals, crime events do not occur at all, while a small number of regions account for most incidents.
To explicitly account for this characteristic, we further introduce an occurrence-aware prediction mechanism, which decouples crime prediction into occurrence estimation and intensity modeling.
Specifically, for each region r and crime type k, the model first estimates the probability of crime occurrence $p_{r,k}$, which reflects whether a crime event is likely to happen. Conditioned on the occurrence, the model then predicts the potential crime intensity $\lambda_{r,k}$. The two components are combined through an element-wise product:
$$\hat{\mathbf{Y}} = \mathbf{P} \odot \boldsymbol{\Lambda}. \quad (16)$$
In Equation (16), $\hat{\mathbf{Y}} \in \mathbb{R}^{R \times K}$ denotes the final predicted crime intensity map at the future time step $T+1$, $\mathbf{P} \in \mathbb{R}^{R \times K}$ represents the predicted crime occurrence probabilities, and $\boldsymbol{\Lambda} \in \mathbb{R}^{R \times K}$ denotes the predicted conditional crime intensities given event occurrence, which are obtained by applying a lightweight prediction layer to the fused global representation. Here, $\odot$ is the element-wise multiplication operator, R is the number of spatial regions, and K is the number of crime categories.
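The two-stage head can be sketched as follows, with hypothetical single-layer heads standing in for the paper's lightweight prediction layers:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def two_stage_predict(F, W_occ, W_int):
    """Occurrence-aware prediction: Y_hat = P (occurrence) * Lambda (intensity).

    F            : (R, d) fused representation per region
    W_occ, W_int : (d, K) hypothetical single-layer prediction heads
    """
    P = sigmoid(F @ W_occ)            # occurrence probabilities in (0, 1)
    Lam = np.maximum(F @ W_int, 0.0)  # non-negative conditional intensity
    return P * Lam                    # element-wise product, as in Eq. (16)

rng = np.random.default_rng(6)
R, d, K = 16, 8, 4
Y_hat = two_stage_predict(rng.normal(size=(R, d)),
                          rng.normal(size=(d, K)), rng.normal(size=(d, K)))
assert Y_hat.shape == (R, K)
assert (Y_hat >= 0).all()             # intensities cannot be negative
```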
4.5. Model Optimization
To mitigate bias induced by the extreme sparsity and imbalance in crime counts, we adopt a normalized weighted mean squared error that upweights rare but significant events.
Let $\hat{y}_{b,r,k}$ and $y_{b,r,k}$ denote the predicted and ground-truth crime intensities for the k-th crime type in region r of batch sample b, respectively. We define a binary indicator
$$m_{b,r,k} = \mathbb{1}\big[y_{b,r,k} > 0\big],$$
which identifies non-zero crime locations. Based on this indicator, a re-weighting factor is assigned to each prediction target as
$$w_{b,r,k} = 1 + \alpha \, m_{b,r,k},$$
where $\alpha$ controls the relative emphasis on non-zero crime observations. The sparsity-aware regression loss is then formulated as
$$\mathcal{L}_{reg} = \frac{\sum_{b,r,k} w_{b,r,k}\,\big(\hat{y}_{b,r,k} - y_{b,r,k}\big)^{2}}{\sum_{b,r,k} w_{b,r,k}},$$
which normalizes the weighted error by the total effective weight. We define the binary occurrence label from the intensity label as follows:
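A runnable sketch of this sparsity-aware loss; the specific weight form `w = 1 + alpha * [y > 0]` is one plausible instantiation of the re-weighting factor described above:

```python
import numpy as np

def sparsity_weighted_mse(y_pred, y_true, alpha=4.0):
    """Normalized weighted MSE that up-weights non-zero crime targets.

    The weight form w = 1 + alpha * [y > 0] is an assumed instantiation
    of the re-weighting factor; alpha controls the emphasis on the rare
    non-zero observations.
    """
    mask = (y_true > 0).astype(float)      # binary non-zero indicator
    w = 1.0 + alpha * mask                 # re-weighting factor
    return np.sum(w * (y_pred - y_true) ** 2) / np.sum(w)

y_true = np.array([0.0, 0.0, 0.0, 3.0])
y_pred = np.array([0.5, 0.0, 0.0, 1.0])
loss = sparsity_weighted_mse(y_pred, y_true, alpha=4.0)
# weights = [1, 1, 1, 5]; weighted SSE = 0.25 + 5*4 = 20.25; total w = 8
assert np.isclose(loss, 20.25 / 8)
```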
Here, $y_{b,r,k}$ denotes the ground-truth crime count of category k in region r of batch sample b. The indicator function $\mathbb{1}[\cdot]$ maps the continuous crime intensity into a binary occurrence label:
$$o_{b,r,k} = \mathbb{1}\big[y_{b,r,k} > 0\big].$$
Thus, $o_{b,r,k}$ explicitly represents whether a crime event occurs, enabling the model to decouple event occurrence modeling from crime intensity regression.
Here, $N_{+}$ and $N_{-}$ denote the total numbers of positive (crime-occurring) and negative (non-occurring) samples, respectively. Due to the sparsity of crime data, negative samples usually dominate the dataset.
To mitigate this imbalance, we introduce a dynamic positive weighting factor $w_{+}$, computed as the ratio between negative and positive samples:
$$w_{+} = \frac{N_{-}}{N_{+} + \epsilon},$$
where the small constant $\epsilon$ ensures numerical stability.
This adaptive weighting mechanism increases the contribution of rare crime occurrences during optimization, preventing the model from being biased toward predicting non-occurrence everywhere. The occurrence loss is
$$\mathcal{L}_{occ} = -\frac{1}{N}\sum_{b,r,k}\Big[\, w_{+}\, o_{b,r,k} \log \hat{p}_{b,r,k} + \big(1 - o_{b,r,k}\big) \log\big(1 - \hat{p}_{b,r,k}\big) \Big],$$
where $\hat{p}_{b,r,k}$ represents the predicted probability that a crime of type k occurs in region r of batch sample b. The loss follows a weighted binary cross-entropy formulation, in which positive samples are up-weighted by $w_{+}$ while negative samples retain unit weight. This design encourages the model to accurately detect rare crime occurrences while maintaining robustness on abundant zero-crime regions.
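The occurrence loss can be sketched as a weighted binary cross-entropy with the dynamic positive weight computed from the batch itself (the example values are illustrative):

```python
import numpy as np

def occurrence_loss(p_pred, y_true, eps=1e-8):
    """Weighted binary cross-entropy with a dynamic positive weight.

    w_pos = (#negatives) / (#positives + eps), so rare crime occurrences
    contribute more to the loss.
    """
    o = (y_true > 0).astype(float)        # binary occurrence labels
    n_pos, n_neg = o.sum(), (1 - o).sum()
    w_pos = n_neg / (n_pos + eps)
    p = np.clip(p_pred, eps, 1 - eps)     # numerical stability
    terms = w_pos * o * np.log(p) + (1 - o) * np.log(1 - p)
    return -terms.mean()

y_true = np.array([0.0, 0.0, 0.0, 2.0])   # one positive, three negatives
loss_good = occurrence_loss(np.array([0.1, 0.1, 0.1, 0.9]), y_true)
loss_bad = occurrence_loss(np.array([0.1, 0.1, 0.1, 0.1]), y_true)
assert loss_bad > loss_good               # missing the rare positive costs more
```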
The overall training objective of the proposed model is defined as the sum of three complementary loss terms:
$$\mathcal{L} = \mathcal{L}_{local} + \mathcal{L}_{global} + \mathcal{L}_{occ}. \quad (25)$$
Specifically, the regression losses on the local prediction branch ($\mathcal{L}_{local}$) and the global prediction branch ($\mathcal{L}_{global}$) supervise the model to capture fine-grained regional patterns and long-range dependencies, respectively. Meanwhile, the occurrence loss $\mathcal{L}_{occ}$ explicitly guides the model to distinguish between crime-occurring and non-occurring instances, alleviating the severe data sparsity and class imbalance inherent in crime datasets.
By jointly optimizing these objectives in an end-to-end manner, the proposed framework achieves balanced learning of local structure, global context, and event occurrence, leading to more accurate and robust crime prediction.
5. Experiments
In this section, we first introduce the datasets and experimental settings. We then compare the proposed method with state-of-the-art baselines and analyze the results in detail. In addition, case studies and ablation experiments are presented to evaluate the effectiveness of individual model components.
5.1. Dataset Description
We conduct experiments on a publicly available real-world crime dataset collected from New York City, using the same preprocessed data as STSHN [29]. The dataset is derived from the official NYPD crime records released via NYC Open Data and covers reported crime incidents from January 2014 to December 2015. It spans 731 consecutive days and includes four major crime categories: burglary, robbery, assault, and larceny. Following a chronological splitting strategy, the dataset is divided into training and testing sets with a ratio of 7:1.
In addition, a subset corresponding to one month of data is held out from the training period as a validation set for hyperparameter tuning and early stopping. All predictions are performed at a daily temporal resolution. From a spatial perspective, the geographic area of New York City is partitioned into a uniform grid with a resolution of 3 km × 3 km, resulting in 256 non-overlapping spatial regions. Each region aggregates the crime incidents occurring within its boundaries. Detailed statistics of the dataset are summarized in Table 2.
5.2. Implementation Details
All experiments were carried out on a workstation equipped with an NVIDIA TITAN RTX GPU (24 GB memory; NVIDIA Corporation, Santa Clara, CA, USA) and an Intel Xeon Gold 5118 processor (Intel Corporation, Santa Clara, CA, USA). Model hyperparameters were tuned via a systematic grid search to ensure fair and stable evaluation. Specifically, the network depth, the hidden feature dimension, and the number of prototypes were each searched over a range of candidate values. The convolution kernel size was fixed to 3 based on empirical performance, and the batch size was set to 1 due to memory constraints. To stabilize training, gradient accumulation was adopted with an accumulation step of 16.
Through extensive validation, we observed that a network depth of 3, a hidden dimension of 256, and a reduced dimension of 2048 provided the most favorable balance between training stability and predictive accuracy. These settings were therefore used as the default configuration in all experiments.
5.3. Performance Metrics
To quantitatively assess the predictive performance of different methods, we employ Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE) as the evaluation criteria. These two metrics are commonly adopted in continuous spatio-temporal forecasting tasks due to their interpretability and effectiveness.
MAE measures the average absolute difference between predicted values and ground truth, providing a direct indication of overall prediction accuracy without being affected by the scale of the data. In contrast, MAPE evaluates prediction errors in a relative manner by normalizing the absolute error with respect to the true value, which facilitates fair comparisons across regions and time periods with varying crime intensities.
As both metrics quantify prediction errors, lower MAE and MAPE values correspond to better crime prediction performance.
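Both metrics admit straightforward implementations; in the sketch below, MAPE is computed only over non-zero ground-truth entries, a common convention for sparse count data (the paper's exact handling of zeros is an assumption here):

```python
import numpy as np

def mae(y_pred, y_true):
    """Mean Absolute Error over all entries."""
    return np.abs(y_pred - y_true).mean()

def mape(y_pred, y_true, eps=1e-8):
    """Mean Absolute Percentage Error over non-zero ground-truth entries,
    avoiding division by zero in sparse crime data."""
    nz = y_true > 0
    return np.abs((y_pred[nz] - y_true[nz]) / (y_true[nz] + eps)).mean()

y_true = np.array([0.0, 2.0, 4.0])
y_pred = np.array([0.5, 1.0, 5.0])
assert np.isclose(mae(y_pred, y_true), (0.5 + 1.0 + 1.0) / 3)
assert np.isclose(mape(y_pred, y_true), (0.5 + 0.25) / 2)
```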
5.4. Baseline Methods
To comprehensively evaluate the effectiveness of the proposed approach, we compare our GL-MoPA model against a diverse set of representative baseline methods covering traditional statistical models, classical machine learning approaches, and recent deep learning-based spatio-temporal frameworks. All baseline results are obtained from their original implementations or reported performance to ensure a fair comparison.
ARIMA [45]: The Autoregressive Integrated Moving Average (ARIMA) model is a classical statistical approach for modeling temporal dynamics in sequential data. It leverages autoregressive terms, differencing operations, and moving-average components to describe linear temporal dependencies in historical observations. In crime forecasting, ARIMA utilizes past crime records to extrapolate future trends, serving as a conventional temporal baseline for comparison with data-driven models.
SVM [46]: Support Vector Machine (SVM) is a supervised learning method that has been employed for both classification and regression tasks in crime analysis. By constructing decision functions in high-dimensional feature spaces, SVM can capture complex patterns through kernel-based transformations. This capability enables it to model non-linear relationships within crime-related features, making it a commonly used baseline in spatio-temporal prediction studies.
ST-ResNet [47]: A deep spatio-temporal residual network initially developed for city-scale flow forecasting. ST-ResNet decomposes temporal dynamics into multiple components, including short-term proximity, periodic regularity, and long-term trends, which are modeled through parallel residual convolutional branches. Spatial correlations are captured by convolutional operations, and the outputs of different temporal branches are fused together with external contextual factors to enhance prediction performance.
DCRNN [48]: The Diffusion Convolutional Recurrent Neural Network (DCRNN) is a graph-based spatio-temporal model that combines diffusion convolution with recurrent architectures. Spatial dependencies are modeled by simulating information propagation along directed graph structures via random walk processes, while temporal dependencies are learned using recurrent units within an encoder–decoder framework. This design enables effective sequence-to-sequence forecasting on structured spatial domains.
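The forward-direction diffusion step at the core of this family of models can be sketched as follows, assuming a small directed graph and illustrative filter weights `thetas` (hypothetical values, standing in for learned parameters).

```python
import numpy as np

def diffusion_conv(A, X, thetas):
    """Forward-direction diffusion convolution: sum_k theta_k (D_O^-1 A)^k X,
    where D_O^-1 A is the random-walk transition matrix of a directed graph."""
    P = A / A.sum(axis=1, keepdims=True)   # row-normalised transition matrix
    out = np.zeros_like(X, dtype=float)
    Pk = np.eye(A.shape[0])                # P^0 = identity
    for theta in thetas:
        out += theta * (Pk @ X)            # k-th diffusion term
        Pk = Pk @ P                        # advance to the next power of P
    return out

A = np.array([[0, 1, 1],
              [1, 0, 0],
              [0, 1, 0]], dtype=float)     # toy directed adjacency
X = np.array([[1.0], [2.0], [3.0]])        # one feature per node
print(diffusion_conv(A, X, thetas=[0.5, 0.3]))
```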
DeepCrime [49]: A deep learning framework specifically tailored for crime prediction that augments convolutional neural networks with attention mechanisms. DeepCrime applies attention to selectively emphasize informative spatial regions and critical temporal intervals, allowing the model to focus on salient spatio-temporal patterns embedded in historical crime data.
STDN [50]: The Spatio-Temporal Dynamic Network explicitly models dynamic temporal influences by jointly considering local temporal dependencies and periodic patterns. STDN introduces a flow-gating mechanism to adaptively integrate temporal contexts while leveraging convolutional structures to capture spatial correlations, enabling fine-grained spatio-temporal forecasting in urban environments.
ST-MetaNet [51]: A meta-learning-based spatio-temporal prediction model designed to address heterogeneity across urban regions. ST-MetaNet learns meta-knowledge from multiple regions and temporal contexts, allowing the forecasting model to dynamically adjust its parameters and improve generalization when transferring across diverse spatial areas.
STTrans [52]: A transformer-based spatio-temporal forecasting framework that models long-range dependencies through hierarchical attention mechanisms. STTrans captures spatial interactions and temporal evolution at multiple resolutions, enabling the model to learn both global and local spatio-temporal representations for complex urban prediction tasks.
STGCN [53]: The Spatio-Temporal Graph Convolutional Network models forecasting problems on graph-structured time series using a fully convolutional architecture. By interleaving graph convolution layers for spatial dependency modeling with temporal convolutions for sequence learning, STGCN effectively captures localized spatial interactions and temporal dynamics while maintaining high computational efficiency.
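The spatial building block of such graph-convolutional models can be illustrated with a first-order graph convolution over a symmetrically normalized adjacency matrix; the two-region graph and identity weight matrix below are purely illustrative, not the model's actual configuration.

```python
import numpy as np

def graph_conv(A, X, W):
    """First-order graph convolution: ReLU(D^{-1/2} (A + I) D^{-1/2} X W),
    the kind of spatial layer interleaved with temporal convolutions."""
    A_hat = A + np.eye(A.shape[0])             # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))     # symmetric normalisation
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt
    return np.maximum(A_norm @ X @ W, 0.0)     # ReLU activation

A = np.array([[0, 1], [1, 0]], dtype=float)    # two connected regions
X = np.array([[1.0, 0.0], [0.0, 1.0]])         # toy node features
W = np.eye(2)                                  # identity weights for clarity
print(graph_conv(A, X, W))                     # each node averages its neighbourhood
```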
GWN [54]: Graph WaveNet is a graph-based spatio-temporal forecasting framework that integrates adaptive graph learning with dilated temporal convolutions. It dynamically infers hidden spatial relationships among nodes while employing causal and dilated convolutions to capture long-range temporal dependencies, enabling flexible modeling without relying on a fixed predefined graph structure.
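The self-adaptive adjacency idea can be sketched as a row-wise softmax over ReLU-activated products of two node-embedding matrices, following the formulation popularized by Graph WaveNet; the random embeddings below stand in for learned parameters.

```python
import numpy as np

def adaptive_adjacency(E1, E2):
    """Self-learned adjacency: A_adp = softmax(ReLU(E1 @ E2.T), axis=1),
    where E1 and E2 are learnable source/target node embeddings."""
    logits = np.maximum(E1 @ E2.T, 0.0)              # ReLU keeps scores non-negative
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)      # row-wise softmax

rng = np.random.default_rng(0)
E1 = rng.normal(size=(4, 3))   # 4 nodes, embedding dimension 3 (illustrative)
E2 = rng.normal(size=(4, 3))
A_adp = adaptive_adjacency(E1, E2)
print(A_adp.sum(axis=1))       # each row is a probability distribution
```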
GMAN [32]: The Graph Multi-Attention Network (GMAN) leverages attention mechanisms to model complex spatio-temporal dependencies in graph-structured data. GMAN adopts an encoder–decoder architecture composed of stacked attention blocks, allowing the model to dynamically capture both spatial interactions across nodes and temporal influences over historical sequences, thereby improving long-horizon prediction stability.
AGCRN [55]: The Adaptive Graph Convolutional Recurrent Network (AGCRN) is designed to learn spatial dependencies directly from data by constructing node-adaptive graph representations. By integrating adaptive graph convolutions with recurrent neural units, AGCRN jointly models evolving spatial structures and temporal patterns, offering enhanced flexibility for multivariate time-series forecasting.
MTGNN [27]: The Multivariate Time Series Graph Neural Network introduces a learnable graph structure to explicitly capture inter-variable dependencies in multivariate temporal data. MTGNN combines graph convolution operations with temporal modeling modules, enabling the framework to uncover latent relational structures among time series while effectively modeling their temporal evolution.
STSHN [29]: The Spatial–Temporal Sequential Hypergraph Network formulates crime forecasting by representing regions and crime types within a unified hypergraph structure. Through dynamic hypergraph construction, STSHN models high-order spatial interactions among regions as well as relational dependencies across crime categories. Sequential hypergraph propagation combined with multi-channel routing enables the model to capture complex spatio-temporal dependencies in urban crime data.
DMSTGCN [28]: The Dynamic and Multi-faceted Spatio-Temporal Graph Convolutional Network (DMSTGCN) is designed to handle time-evolving spatio-temporal patterns by dynamically adapting graph-based representations. By jointly modeling diverse temporal characteristics and adjusting spatial dependencies over time, DMSTGCN effectively captures the evolving relationships inherent in complex spatio-temporal sequences.
ST-HSL [56]: The Spatial–Temporal Hypergraph Self-Supervised Learning framework mitigates supervision sparsity in crime prediction by leveraging hypergraph-based representation learning together with self-supervised objectives. ST-HSL integrates local and global spatio-temporal information while introducing auxiliary self-supervised tasks to enhance regional representation robustness under limited labeled data.
Following prior studies, ST-HSL is adopted as the primary hypergraph-based baseline, and all hyperparameters are set according to the configurations reported in the original work.
5.5. Experimental Results
As shown in Table 3, GL-MoPA demonstrates consistent performance gains over the hypergraph-based baseline ST-HSL across all crime categories on the NYC dataset. Specifically, compared with ST-HSL, GL-MoPA reduces MAE and MAPE by 1.9% and 3.8% for burglary, 0.4% and 1.2% for larceny, 4.9% and 1.7% for robbery, and 9.3% and 9.6% for assault, respectively.
Overall, GL-MoPA achieves average relative improvements of 4.1% in MAE and 4.6% in MAPE compared with ST-HSL. These results suggest that the proposed model more effectively balances global and local spatio-temporal representations, leading to improved robustness across diverse crime patterns.
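The relative improvements reported here follow the usual definition of relative error reduction over a baseline metric value; the per-category numbers below are hypothetical placeholders for illustration, not the actual measurements from Table 3.

```python
def relative_reduction(baseline, model):
    # Relative error reduction (%) of a model over a baseline metric value
    return (baseline - model) / baseline * 100

# hypothetical per-category MAE values, for illustration only
baseline_mae = {"burglary": 1.05, "larceny": 2.50}
model_mae = {"burglary": 1.03, "larceny": 2.49}
for crime, b in baseline_mae.items():
    print(crime, round(relative_reduction(b, model_mae[crime]), 1))
```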
5.6. Robustness Analysis
We additionally investigate the robustness of the proposed GL-MoPA under different levels of data sparsity. To this end, regions are grouped according to their crime occurrence density, quantified as the proportion of non-zero entries in the regional crime time series. Based on this criterion, low-activity regions are further partitioned into two intervals, a highly sparse interval and a moderately sparse interval. The corresponding performance comparisons across different sparsity levels are presented in Figure 4.
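The density criterion can be computed directly from the regional count series. In the sketch below, the interval edges are hypothetical placeholders chosen for illustration; the exact thresholds are fixed by the experimental protocol.

```python
import numpy as np

def region_density(series):
    """Crime occurrence density of a region: the proportion of
    non-zero entries in its crime-count time series."""
    series = np.asarray(series)
    return np.count_nonzero(series) / series.size

def bucket(density, edges=(0.25, 0.5)):
    # edges are illustrative assumptions, not the paper's thresholds
    if density <= edges[0]:
        return "highly sparse"
    if density <= edges[1]:
        return "moderately sparse"
    return "dense"

s = [0, 1, 0, 0, 2, 0, 0, 0]   # 2 non-zero entries out of 8
print(region_density(s), bucket(region_density(s)))
```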
As illustrated in Figure 4, GL-MoPA exhibits clear robustness advantages over ST-HSL under sparse-data conditions. In the highly sparse interval, GL-MoPA reduces MAE, MAPE, and RMSE by 10.3%, 11.3%, and 8.7%, respectively, compared with ST-HSL.
The performance gap becomes more pronounced in the moderately sparse interval, where GL-MoPA achieves relative reductions of 17.2% in MAE, 21.0% in MAPE, and 10.8% in RMSE. These results indicate that GL-MoPA is particularly effective at mitigating the prediction degradation caused by data sparsity, consistently delivering lower errors across multiple evaluation metrics.
5.7. Ablation Study
To investigate the contribution of individual components in GL-MoPA, we perform a series of ablation studies by selectively removing or modifying key modules of the proposed architecture. This experimental setting allows us to isolate the effect of each component and assess its impact on overall prediction performance.
As illustrated in Figure 5, the complete GL-MoPA model achieves strong overall performance across all the evaluation metrics. Among the examined components, the prototype attention module contributes most significantly to reducing overall error compared with the variant without prototype attention, leading to decreases of 23.4% in RMSE, 6.5% in MAE, and 6.1% in MAPE. This highlights its critical role in capturing global spatio-temporal dependencies and stabilizing prediction magnitude.
The local modeling module also provides notable performance gains compared with the variant without local modeling, yielding reductions of 6.9% in MAE and 14.6% in MAPE, together with a lower RMSE. In particular, its substantial impact on MAPE indicates that local representations are especially effective in mitigating relative errors and fine-grained deviations across regions.
It is worth noting that the variant without the two-stage module achieves slightly lower overall RMSE, MAE, and MAPE. However, the primary purpose of introducing the two-stage mechanism is not to optimize global metrics but rather to enhance robustness under data sparsity. As demonstrated in Figure 6, the complete GL-MoPA model consistently outperforms its counterpart without the occurrence-aware module in sparse regions, confirming the necessity of this design for reliable crime forecasting.
In addition, we examine the contribution of each module to the prediction of individual crime types.
Figure 7 presents a fine-grained ablation analysis of GL-MoPA across different crime categories, reporting RMSE, MAE, and MAPE for burglary, larceny, robbery, and assault under various module removal settings. From the RMSE perspective, removing the prototype attention module leads to the most pronounced performance degradation, particularly for larceny and assault. This indicates that prototype-level global interaction modeling is crucial for stabilizing prediction magnitude and capturing long-range spatio-temporal dependencies across regions, whereas local representations mainly serve as complementary refinements rather than the primary source of magnitude correction.
A similar trend can be observed in the MAE results. The removal of prototype attention consistently yields higher absolute errors across all crime categories, confirming its dominant contribution to overall prediction accuracy. The local modeling module also plays an important role in reducing MAE, especially for robbery and assault, where fine-grained spatial cues are more informative. Notably, eliminating the two-stage module only slightly affects MAE, reinforcing the observation that this component does not primarily target overall accuracy metrics.
The impact of different modules becomes more distinctive when evaluated using MAPE. Removing the local modeling module leads to the largest relative error increase across all the crime types, highlighting its effectiveness in mitigating proportional and fine-grained deviations. This observation suggests that local modeling is particularly important for controlling relative prediction error, especially in regions with varying crime intensities. The two-stage module, in turn, contributes more to reducing MAPE than to the other metrics.
Overall, the ablation results reveal a clear division of responsibilities among the different components: prototype attention primarily stabilizes global prediction magnitude, local modeling effectively reduces relative error, and the two-stage crime prediction module enhances robustness under sparse-data scenarios. Collectively, these results demonstrate that each module plays a complementary and indispensable role in the overall framework, validating the rationality of our model design.
6. Conclusions
In this paper, we propose GL-MoPA, a novel global–local spatio-temporal framework for city-scale crime prediction. By explicitly integrating prototype attention for global dependency modeling, local modeling for fine-grained spatial refinement, and a two-stage crime prediction mechanism for occurrence-aware learning, GL-MoPA is designed to jointly address the challenges of complex spatio-temporal correlations and severe data sparsity in real-world crime datasets.
Extensive experiments conducted on the NYC crime dataset demonstrate that GL-MoPA consistently outperforms a wide range of classical statistical models and state-of-the-art deep learning baselines across multiple crime categories. The quantitative results show notable improvements in RMSE, MAE, and MAPE, confirming the effectiveness of the proposed framework in capturing both global trends and local variations in crime dynamics.
Furthermore, robustness evaluations under different sparsity levels reveal that GL-MoPA maintains stable predictive performance in sparse regions, where conventional models often suffer from severe degradation. From a practical perspective, this robustness reduces the risk of overestimation or missed detection in low-incident areas, thereby improving the reliability of resource allocation and patrol planning decisions. Ablation studies provide additional insights into the functional roles of individual components, showing that prototype attention primarily stabilizes prediction magnitude, local modeling effectively reduces relative error, and the two-stage mechanism significantly enhances robustness under sparse-data scenarios. These findings collectively validate the rationality and necessity of each module in the proposed architecture.
Overall, the improved forecasting stability and reduced relative error achieved by GL-MoPA enhance decision support for urban safety management, enabling more informed deployment strategies and proactive crime prevention measures.
In future work, we plan to extend GL-MoPA to incorporate additional contextual factors, such as socio-economic indicators and mobility patterns, and to explore its applicability to other spatio-temporal forecasting tasks, including traffic flow and emergency event prediction.