1. Introduction
With the continuous advancement of society and the steady improvement in living standards, public expectations regarding social security and public safety have grown substantially. Modern governance systems are therefore required to deliver more effective, timely, and comprehensive protection. In this context, enhancing the capacity of public safety governance under resource constraints has become a pressing challenge for governments and law enforcement agencies.
To address this challenge, advanced information technologies and data-driven analytical approaches have been increasingly integrated into public safety decision-making processes [1,2,3]. By facilitating systematic analysis of large-scale heterogeneous data, these technologies improve the efficiency and precision of safety management, strengthen anticipatory capabilities, and alleviate operational and human resource burdens. Consequently, the development of reliable risk forecasting and decision-support systems has emerged as a critical research direction in public safety and intelligent governance.
Among the various dimensions of public safety, criminal activities constitute a primary factor affecting social stability and citizens’ perceived sense of security. Crime events exhibit complex spatial and temporal regularities and are influenced by multiple interacting factors, including temporal dynamics, geographic contexts, population distribution patterns, and socio-economic conditions. Therefore, accurately modeling and forecasting the spatio-temporal distribution of crime is essential for promoting evidence-based and proactive public safety governance.
Despite substantial investments in crime prevention and control, the existing safety governance strategies remain predominantly reactive, with interventions often implemented only after criminal incidents have taken place. Such response-driven approaches face inherent limitations in mitigating risks at an early stage and are frequently associated with high social and operational costs. These limitations highlight the necessity of shifting from passive response mechanisms toward proactive and preventive strategies, thereby further underscoring the importance of effective crime prediction methods.
In practice, constraints related to manpower, budget, and technical capacity make it impractical to maintain consistently intensive interventions across all locations and time periods [4,5,6]. Under these conditions, accurately identifying potential high-risk areas and critical time windows becomes essential for improving decision-making efficiency and maximizing the effectiveness of limited resources. This practical requirement fundamentally underscores the importance of crime forecasting within modern public safety governance frameworks.
Moreover, crime occurrence cannot be attributed to a single causal factor. Theories in environmental criminology emphasize that criminal behavior emerges from interactions among individual behavior, environmental context, and routine social activities, resulting in complex and dynamic mechanisms [7,8]. These characteristics render crime prediction a challenging yet essential task. To effectively capture latent relationships and spatio-temporal dependency structures embedded in crime data, it is necessary to adopt data-based modeling approaches that can jointly represent multiple spatial and temporal factors, thereby providing reliable and anticipatory decision support for public safety governance.
However, unlike general spatio-temporal modeling and prediction tasks, which often assume relatively dense and balanced data distributions, crime forecasting in real-world settings is characterized by pronounced sparsity. For most locations and time periods, no criminal incidents occur, resulting in a large proportion of zero-valued observations. This inherent sparsity poses substantial challenges for conventional predictive models.
In addition to sparsity, different crime data categories exhibit significant heterogeneity and diverse tendencies in terms of their spatial distribution and temporal occurrence. These patterns are rarely uniform in statistical distributions. For instance, in regions with relatively stable public order and favorable socio-economic conditions, property-related crimes such as theft are more prevalent, while violent crimes occur less frequently. In contrast, areas facing greater security challenges often show a higher likelihood of violent incidents and personal injury. These category-specific disparities further complicate crime prediction as models must simultaneously account for uneven distributions across space, time, and crime types.
Figure 1 illustrates the spatial distribution of four crime categories in the public NYC dataset using region-wise histograms. The horizontal axis represents discretized spatial regions, while the vertical axis denotes the total number of crimes occurring within each region. For clarity, regions with zero crime occurrences are omitted from the histogram. As observed, crime incidents are highly unevenly distributed across space: the majority of regions exhibit very low crime frequencies, and a substantial proportion of regions experience no recorded crimes at all. In contrast, only a small number of regions account for a disproportionately large share of crime incidents, forming a pronounced long-tailed distribution.
This spatial imbalance is particularly evident for property-related crimes, such as larceny. As shown in Figure 1b, crime occurrences are extremely concentrated in a few regions, with the most active region alone contributing nearly 95% of the total incidents within this category. Similar, although less extreme, patterns can also be observed for burglary, robbery, and assault, where a limited subset of regions consistently exhibit significantly higher crime frequencies than the rest.
Figure 2 presents the distribution of crime occurrences across space and time in the NYC dataset from two complementary perspectives. The left panel illustrates the crime counts for each region and time unit across four crime categories, where each point corresponds to the number of incidents observed in a specific spatial region at a given time step. As shown, the vast majority of region–time pairs are associated with very low crime counts, often close to one or zero, while only a small fraction exhibit substantially higher values. This distribution highlights the pronounced sparsity and skewness of crime data at fine-grained spatio-temporal resolutions.
The right panel depicts the total number of crimes aggregated over time for each spatial region using a symmetric logarithmic scale. Consistent with the observations at the region–time level, crime occurrences at the regional level also follow a strongly imbalanced long-tailed distribution. A limited number of regions accumulate disproportionately large crime counts over the entire observation period, forming persistent crime hotspots, whereas most regions remain relatively inactive. Notably, this pattern is observed across all four crime categories, with property-related crimes, such as larceny, and violent crimes, such as assault, exhibiting particularly pronounced regional concentration. Together, these results demonstrate that crime data are characterized by simultaneous spatio-temporal sparsity and spatial heterogeneity, posing significant challenges for conventional prediction models.
In addition, the spatial and temporal heterogeneity of crime data is commonly characterized by the tendency of geographically or temporally proximate units to exhibit similar crime patterns. Consequently, most existing machine learning and deep learning approaches focus on modeling local dependencies under the assumption that nearby regions and adjacent time intervals provide the most relevant contextual information. While such locality-based modeling strategies have achieved promising results [9,10,11,12,13], they largely overlook the potential correlations among spatial regions or temporal periods that are distant from each other.
In real-world urban environments, geographically distant areas with similar functional roles (such as commercial districts) may display highly comparable crime patterns despite their physical separation. Similarly, temporally distant periods can share analogous crime cycles due to recurrent human activity rhythms and long-term periodic structures. Neglecting these long-range spatial and temporal similarities may limit the capacity of existing models to fully capture the underlying structure of crime dynamics.
To address these challenges, we propose GL-MoPA, a novel modeling framework with the following contributions:
We employ convolutional neural networks to capture local feature associations among adjacent spatial regions, temporal intervals, and crime categories while leveraging an attention mechanism to model long-range dependencies beyond local neighborhoods.
We aggregate fine-grained regions into a set of coarse-grained prototypes. Each prototype summarizes a group of spatial/temporal units with similar characteristics, allowing the attention mechanism to operate at the level of prototype interactions.
We introduce an occurrence intensity decomposition strategy that explicitly models crime occurrence and crime intensity in a unified framework, imposing constraints on the binary prediction task to improve the model accuracy and robustness on the sparse dataset.
We evaluate the proposed model on a real-world crime dataset collected from New York City. The experimental results show that our approach consistently outperforms state-of-the-art baseline methods across multiple crime categories.
3. Problem Formulation
In this section, we present the problem formulation, which formally defines the crime prediction task and establishes the notation used throughout the paper. The study area is partitioned into a regular grid of size $H \times W$, where H and W denote the number of rows and columns of spatial regions, respectively. Each grid cell corresponds to a fine-grained spatial region. The model input is a three-dimensional tensor $\mathbf{X} \in \mathbb{R}^{R \times T \times K}$, where $R = H \times W$ denotes the total number of spatial regions obtained by flattening the spatial grid, T represents the number of time steps, and K denotes the number of crime categories. Each element $x_{r,t,k}$ corresponds to the observed number of crime incidents of category k occurring in region r at time step t.
The model aims to predict the crime hotspot map at the next time step $T+1$, which is represented as a two-dimensional tensor $\hat{\mathbf{Y}} \in \mathbb{R}^{R \times K}$. Each element $\hat{y}_{r,k}$ denotes the predicted crime intensity of category k in region r at time step $T+1$. More detailed symbol definitions are given in Table 1.
Based on the above notations, our objective is to learn a predictive function $f(\cdot)$ that maps historical crime observations to future crime distributions. Formally, given the observed crime tensor $\mathbf{X} \in \mathbb{R}^{R \times T \times K}$, we aim to estimate the crime intensity map at the next time step,
$$\hat{\mathbf{Y}} = f(\mathbf{X}),$$
where $\hat{\mathbf{Y}} \in \mathbb{R}^{R \times K}$ represents the predicted spatial distribution of crime intensities over all regions and crime categories at time $T+1$.
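To make the notation concrete, the following numpy sketch instantiates the tensors above with small illustrative sizes (the grid, horizon, and category counts here are hypothetical, not the NYC configuration used later):

```python
import numpy as np

# Illustrative sizes (hypothetical, not the NYC configuration):
H, W = 4, 4          # spatial grid rows and columns
R = H * W            # number of flattened regions
T = 7                # historical time steps
K = 4                # crime categories

# Input: observed crime counts x[r, t, k]; a low-rate Poisson draw
# mimics the sparsity of real crime data.
X = np.random.default_rng(0).poisson(0.1, size=(R, T, K))

# Output: predicted intensity map at the next time step, one value
# per (region, category) pair.
Y_hat = np.zeros((R, K))

assert X.shape == (16, 7, 4)
assert Y_hat.shape == (16, 4)
assert (X == 0).mean() > 0.5   # most region-time-category entries are zero
```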
4. Method
The overall architecture of the proposed GL-MoPA framework is illustrated in Figure 3. Given multi-category crime sequences as input, the model is designed to jointly capture local spatial–temporal patterns and global dependency structures while explicitly accounting for data sparsity through an occurrence-aware prediction mechanism.
As shown in Figure 3, the framework consists of four main components: a spatial–temporal embedding layer, a local dependency modeling module, a global dependency modeling module based on prototype-aware attention, and a two-stage prediction head that integrates occurrence estimation with crime intensity forecasting. These components are organized in a global–local cooperative manner, enabling complementary feature learning at different spatial–temporal scales.
To further clarify the operational process of GL-MoPA and illustrate how the framework functions in a real-world forecasting scenario, we provide a step-by-step description of a typical inference procedure for a target spatial region. This example demonstrates how historical crime observations are progressively transformed through different modules of the architecture to generate the final prediction.
1. The historical multi-category crime sequences over the past T days are collected as input for the target region.
2. The spatial–temporal embedding layer transforms the raw sequences into latent feature representations that encode temporal dynamics and spatial grid information.
3. The local dependency modeling module aggregates information from neighboring regions within the 16 × 16 grid, capturing short-range spatial interactions.
4. The global prototype-aware attention mechanism aligns region-level embeddings with learned global crime prototypes to model long-range dependencies.
5. The two-stage prediction head first estimates the crime occurrence probability and then predicts the crime intensity conditioned on the occurrence likelihood.
Through this sequential procedure, GL-MoPA integrates global trend modeling and fine-grained spatial refinement while explicitly addressing data sparsity.
We begin by introducing the spatial–temporal embedding layer, which serves as the foundation for subsequent local and global dependency modeling by transforming raw crime observations into unified latent representations.
4.1. Embedding Layer
Given the inherent sparsity and discreteness of crime count data, directly modeling raw crime values may limit the expressiveness of subsequent feature extraction modules. To address this issue, we introduce a lightweight embedding layer that lifts the original crime observations into a continuous latent space.
Specifically, the raw crime tensor $\mathbf{X} \in \mathbb{R}^{R \times T \times K}$ is first expanded along the channel dimension and then projected into a d-dimensional feature space via a point-wise three-dimensional convolution with kernel size $1 \times 1 \times 1$. This operation performs an independent linear transformation at each region–time–crime location without altering the spatial or temporal resolution. Formally, for each spatio-temporal crime entry $x_{r,t,k}$, the embedding process is defined as
$$\mathbf{e}_{r,t,k} = \phi(x_{r,t,k}),$$
where $\phi(\cdot)$ denotes the learnable embedding function implemented by the $1 \times 1 \times 1$ convolution. As a result, the embedding layer produces an embedded crime tensor $\mathbf{E} \in \mathbb{R}^{R \times T \times K \times d}$.
It is worth noting that this embedding layer is intentionally designed to be structure-agnostic, serving solely as a feature lifting mechanism. By decoupling the embedding stage from subsequent spatial and temporal modeling, the proposed framework allows later modules to flexibly capture both local patterns and long-range dependencies in a unified manner.
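As a minimal sketch of this feature-lifting step, note that a point-wise 1 × 1 × 1 convolution is equivalent to applying one shared linear map to the scalar count at every region–time–category location; the sizes and weights below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
R, T, K, d = 16, 7, 4, 8       # illustrative sizes (hypothetical)

X = rng.poisson(0.1, size=(R, T, K)).astype(float)

# A point-wise (1x1x1) convolution applies the same learnable linear
# map to the scalar count at every (region, time, category) location:
# weight w and bias b lift 1 input channel to d output channels.
w = rng.normal(size=(d,))
b = rng.normal(size=(d,))

E = X[..., None] * w + b       # broadcast: (R, T, K, 1) * (d,) -> (R, T, K, d)

# Spatial/temporal resolution is unchanged; only the channel dim is lifted.
assert E.shape == (R, T, K, d)
```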
4.2. Local Dependency Modeling
To capture fine-grained local patterns in crime dynamics, we design a local modeling module that explicitly decouples spatial and temporal dependencies while preserving interactions across crime categories. Instead of jointly modeling space and time in a single convolutional operation, the proposed module adopts a two-stage design that sequentially models spatial and temporal correlations.
4.2.1. Spatial Modeling
Given the embedded crime tensor $\mathbf{E}$, we first focus on local spatial dependencies among neighboring regions. For each time step, the spatial module applies convolutional filters over adjacent grid cells while jointly considering multiple crime categories. Specifically, our spatial dependency modeling is formulated as
$$\mathbf{S}^{(i)}_{r,t} = \sigma\big(\mathbf{W}^{(i)}_{s} * \mathbf{E}_{\mathcal{N}(r),t,i}\big),$$
where $\mathcal{N}(r)$ represents the spatial neighborhood of region r, $\mathbf{W}^{(i)}_{s}$ is the spatial convolution kernel corresponding to the type-i crime, $*$ represents the convolution operation over the spatial neighborhood, and $\sigma(\cdot)$ is a nonlinear activation. We concatenate all crime types to form the final local spatial crime dependency representation:
$$\mathbf{S} = \mathrm{Concat}\big(\mathbf{S}^{(1)}, \ldots, \mathbf{S}^{(K)}\big). \quad (3)$$
Equation (3) yields a joint representation that integrates local spatial context with crime-type-aware features.
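A minimal numpy sketch of the per-category spatial branch for a single time step, assuming a 3 × 3 neighborhood, zero padding, a ReLU activation, and randomly drawn kernels (all illustrative choices, not the trained model's):

```python
import numpy as np

def spatial_conv(E_grid, kernels):
    """Per-category 3x3 spatial convolution over the H x W grid.

    E_grid  : (K, H, W) feature map for one time step
    kernels : (K, 3, 3) one spatial kernel per crime category
    Returns : (K, H, W) locally aggregated features (zero padding).
    """
    K, H, W = E_grid.shape
    padded = np.pad(E_grid, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(E_grid)
    for i in range(K):                      # one branch per crime type
        for dy in range(3):
            for dx in range(3):
                out[i] += kernels[i, dy, dx] * padded[i, dy:dy + H, dx:dx + W]
    return np.maximum(out, 0.0)             # ReLU nonlinearity

rng = np.random.default_rng(2)
E_grid = rng.random((4, 4, 4))              # K=4 categories on a 4x4 grid
kernels = rng.normal(size=(4, 3, 3))
S = spatial_conv(E_grid, kernels)
assert S.shape == (4, 4, 4)
```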
4.2.2. Temporal Modeling
While spatial modeling reveals where crimes are likely to occur, understanding when they happen requires explicit temporal modeling. To this end, we introduce a temporal convolutional module that captures short-term dependencies in the crime time series. Specifically, the local temporal modeling is formulated as
$$\mathbf{U}^{(i)}_{r,t} = \sigma\big(\mathbf{W}^{(i)}_{\tau} * \mathbf{E}_{r,\mathcal{N}(t),i}\big),$$
where $\mathcal{N}(t)$ is the temporal neighborhood of the current time step t and $\mathbf{W}^{(i)}_{\tau}$ is the temporal convolution kernel of the type-i crime. Similarly, our multi-branch temporal aggregation is expressed as
$$\mathbf{U} = \mathrm{Concat}\big(\mathbf{U}^{(1)}, \ldots, \mathbf{U}^{(K)}\big).$$
To jointly fuse and adaptively adjust the combined spatial–temporal features, we employ a learnable $1 \times 1$ convolution followed by group normalization, producing the output of the local modeling module:
$$\mathbf{F}_{loc} = \mathrm{GN}\big(\mathrm{Conv}_{1 \times 1}([\mathbf{S}; \mathbf{U}])\big).$$
The local feature representation integrates spatial continuity, short-term temporal dependency, and crime-type interactions through convolutional operations, thereby capturing fine-grained spatio-temporal patterns at the region level.
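The temporal branch and the 1 × 1 fusion can be sketched analogously; the kernel width of 3, the random weights, and the omission of group normalization are simplifications for illustration:

```python
import numpy as np

def temporal_conv(E_seq, kernels):
    """Per-category temporal convolution along the time axis.

    E_seq   : (K, T) series for one region
    kernels : (K, 3) one temporal kernel per crime category
    Returns : (K, T) short-term temporal features (zero padding).
    """
    K, T = E_seq.shape
    padded = np.pad(E_seq, ((0, 0), (1, 1)))
    out = np.zeros_like(E_seq)
    for i in range(K):
        for dt in range(3):
            out[i] += kernels[i, dt] * padded[i, dt:dt + T]
    return np.maximum(out, 0.0)

rng = np.random.default_rng(3)
S = rng.random((4, 7))                      # spatial-branch features (K=4, T=7)
U = temporal_conv(rng.random((4, 7)), rng.normal(size=(4, 3)))

# 1x1 fusion: mix the concatenated spatial/temporal channels with a
# learned matrix (group normalization omitted in this sketch).
fused = rng.normal(size=(4, 8)) @ np.concatenate([S, U], axis=0)
assert U.shape == (4, 7)
assert fused.shape == (4, 7)
```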
However, crime dynamics are not solely governed by local proximity. Regions that are geographically distant may still exhibit highly similar crime patterns due to shared functional roles (e.g., commercial or transportation zones), and temporally distant periods may follow comparable crime cycles. Such long-range and non-local dependencies cannot be sufficiently captured by purely convolutional modeling.
4.3. Global Dependency Modeling
In our model, beyond modeling region-level dependencies, we aim to equip the framework with a global perspective that enables it to capture potential correlations among geographically distant regions and temporally separated time steps. Such long-range dependencies are crucial for crime prediction as regions with similar functional roles and time periods following comparable crime cycles may exhibit highly correlated patterns despite being far apart in space or time.
However, conventional self-attention mechanisms typically compute attention scores at a fine-grained region level, which limits their ability to explicitly model interactions among large and irregular spatial structures. Inspired by clustering-based representation learning, we introduce a prototype-based attention mechanism that aggregates fine-grained regional representations into a set of coarse-grained spatial entities, referred to as zones. This design allows the model to focus on relationships among large-scale and irregular areas on the map, thereby facilitating more effective global dependency modeling.
4.3.1. Prototype Attention
We propose a prototype attention mechanism that abstracts the fine-grained spatio-temporal feature representations into a set of learnable prototypes. Each prototype corresponds to an irregular large-scale zone composed of multiple spatial regions and time steps, enabling the model to reason about interactions at the zone level rather than treating each region in isolation. This design encourages the model to move beyond local spatio-temporal points and instead capture latent similarities between geographically distant but functionally analogous regions or temporally separated yet behaviorally similar periods.
Existing prototype-based or clustering-driven approaches typically rely on explicit grouping strategies, such as k-means clustering [36,37,38,39], graph partitioning [40,41,42], or metric-based assignment [43,44], where prototypes are either precomputed or updated through iterative optimization. Such methods often require carefully designed similarity measures and are sensitive to initialization.
In contrast, our approach does not perform explicit clustering. The prototypes are implicitly learned through convolutional aggregation along the spatio-temporal crime dimension, allowing the model to jointly learn both region representations and their assignments to latent zones in an end-to-end manner. This design avoids rigid partitioning and enables flexible data-driven abstraction of large-scale crime patterns.
The input after position encoding is denoted by $\mathbf{Z} \in \mathbb{R}^{R \times T \times K \times d}$, where R represents the number of regions before prototype aggregation, T is the number of time steps, and K is the number of crime categories. We flatten the input into a one-dimensional sequence $\mathbf{Z}' \in \mathbb{R}^{N \times d}$, where $N = R \cdot T \cdot K$. The query, key, and value matrices $\mathbf{Q}$, $\mathbf{K}$, and $\mathbf{V}$ are then obtained via independent linear projections of $\mathbf{Z}'$.
Directly applying self-attention on the flattened sequence may obscure higher-level structural regularities in large-scale urban environments. We abstract fine-grained region–time–crime representations into a compact set of latent prototypes. Each prototype represents an irregular large-scale zone composed of multiple regions and time steps, enabling the model to reason over zone-level dependencies while preserving region-level expressiveness.
Specifically, instead of computing attention directly between $\mathbf{Q}$ and $\mathbf{K}$, we first aggregate the key and value representations into a set of prototypes through a learnable aggregation operator $g(\cdot)$. The resulting prototype representations serve as anchors for global attention, allowing each query to selectively attend to semantically meaningful zones. The prototype aggregation and the subsequent attention computation are defined as follows:
$$\tilde{\mathbf{K}} = g(\mathbf{K}), \qquad \tilde{\mathbf{V}} = g(\mathbf{V}), \qquad \mathbf{A} = \mathrm{softmax}\!\left(\frac{\mathbf{Q}\tilde{\mathbf{K}}^{\top}}{\sqrt{d}}\right), \qquad \mathbf{O} = \mathbf{A}\tilde{\mathbf{V}}.$$
In this manner, each query corresponding to a region–time–crime unit attends to a small number of semantically meaningful prototypes rather than all individual regions. Each element $a_{ij}$ of $\mathbf{A}$ reflects the degree to which the i-th region-level representation attends to the j-th prototype, enabling flexible many-to-many interactions between regions and zones.
The resulting representation encodes global contextual information by integrating zone-level crime patterns into each region-level representation. Specifically, it captures long-range spatial similarity among distant regions, recurring temporal patterns across time steps, and cross-category interactions summarized by the learned prototypes, while preserving the original region-level resolution.
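The prototype attention computation can be sketched as follows; for illustration, the learnable aggregation operator is shown as a dense matrix acting on the flattened sequence, whereas the paper's model learns it through convolutional aggregation:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def prototype_attention(Q, K_, V, agg):
    """Attention against learned prototypes instead of all positions.

    Q, K_, V : (N, d) projections of the flattened region-time-crime
               sequence (N = R*T*K positions)
    agg      : (P, N) aggregation operator summarizing the N keys and
               values into P prototype representations
    Returns  : output (N, d) and attention map A (N, P)
    """
    K_proto = agg @ K_                       # (P, d) prototype keys
    V_proto = agg @ V                        # (P, d) prototype values
    d = Q.shape[1]
    A = softmax(Q @ K_proto.T / np.sqrt(d))  # (N, P) region-to-zone weights
    return A @ V_proto, A

rng = np.random.default_rng(4)
N, d, P = 24, 8, 4                           # illustrative sizes
Q, K_, V = (rng.normal(size=(N, d)) for _ in range(3))
out, A = prototype_attention(Q, K_, V, rng.normal(size=(P, N)))
assert out.shape == (N, d) and A.shape == (N, P)
```

Because each of the N queries attends to only P prototypes, the attention cost grows linearly in N rather than quadratically.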
4.3.2. Adaptive Global Feature Fusion
To integrate global representations with local context, we introduce a lightweight modulation mechanism. Specifically, local features are first transformed into adaptive weights via a sigmoid activation, reflecting the relative importance of region-level information. The global features are then modulated by these weights and combined with the original global representation through a residual connection, followed by a channel-wise mixing operation. This design allows global dependencies to be adaptively adjusted according to local crime characteristics while maintaining stable feature propagation.
The global features are adaptively adjusted under the guidance of local information, and a residual aggregation scheme is employed to ensure stable feature propagation. Subsequently, a $1 \times 1$ convolution is applied to perform channel-wise mixing and re-calibration, yielding the final fused representation.
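A compact sketch of the gating scheme described above, with a plain matrix standing in for the 1 × 1 channel-mixing convolution (shapes and weights are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def adaptive_fusion(local, global_, W_mix):
    """Gate global features by local context, then channel-mix.

    local, global_ : (R, d) local and global feature maps
    W_mix          : (d, d) weights standing in for the 1x1 convolution
    """
    gate = sigmoid(local)           # local features -> weights in (0, 1)
    modulated = global_ * gate      # modulate global features by local gate
    fused = modulated + global_     # residual connection for stability
    return fused @ W_mix            # channel-wise mixing / re-calibration

rng = np.random.default_rng(5)
R, d = 16, 8
F = adaptive_fusion(rng.normal(size=(R, d)), rng.normal(size=(R, d)),
                    rng.normal(size=(d, d)))
assert F.shape == (R, d)
```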
4.4. Two-Stage Crime Prediction
Despite the rich spatio-temporal representations obtained through local modeling and global dependency aggregation, directly regressing crime counts remains challenging due to the inherent sparsity and zero inflation of crime data. In many regions and time intervals, crime events do not occur at all, while a small number of regions account for most incidents.
To explicitly account for this characteristic, we further introduce an occurrence-aware prediction mechanism, which decouples crime prediction into occurrence estimation and intensity modeling.
Specifically, for each region r and crime type k, the model first estimates the probability of crime occurrence $p_{r,k}$, which reflects whether a crime event is likely to happen. Conditioned on the occurrence, the model then predicts the potential crime intensity $\lambda_{r,k}$. The two components are combined through an element-wise product:
$$\hat{\mathbf{Y}} = \mathbf{P} \odot \boldsymbol{\Lambda}. \quad (16)$$
In Equation (16), $\hat{\mathbf{Y}} \in \mathbb{R}^{R \times K}$ denotes the final predicted crime intensity map at the future time step $T+1$, $\mathbf{P} \in \mathbb{R}^{R \times K}$ represents the predicted crime occurrence probabilities, and $\boldsymbol{\Lambda} \in \mathbb{R}^{R \times K}$ denotes the predicted conditional crime intensities given event occurrence, which are obtained by applying a lightweight prediction layer to the fused global representation. Here, $\odot$ is the element-wise multiplication operator, R is the number of spatial regions, and K is the number of crime categories.
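The two-stage head can be sketched as follows, with hypothetical single-layer heads standing in for the paper's lightweight prediction layers:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def two_stage_predict(F, W_occ, W_int):
    """Occurrence-aware prediction: Y_hat = P (occurrence) * Lambda (intensity).

    F            : (R, d) fused representation per region
    W_occ, W_int : (d, K) hypothetical single-layer prediction heads
    """
    P = sigmoid(F @ W_occ)            # occurrence probabilities in (0, 1)
    Lam = np.maximum(F @ W_int, 0.0)  # non-negative conditional intensity
    return P * Lam                    # element-wise product, as in Eq. (16)

rng = np.random.default_rng(6)
R, d, K = 16, 8, 4
Y_hat = two_stage_predict(rng.normal(size=(R, d)),
                          rng.normal(size=(d, K)), rng.normal(size=(d, K)))
assert Y_hat.shape == (R, K)
assert (Y_hat >= 0).all()             # intensities cannot be negative
```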
4.5. Model Optimization
To mitigate bias induced by the extreme sparsity and imbalance in crime counts, we adopt a normalized weighted mean squared error that upweights rare but significant events.
Let $\hat{y}_{b,r,k}$ and $y_{b,r,k}$ denote the predicted and ground-truth crime intensities for the k-th crime type in region r of batch sample b, respectively. We define a binary indicator
$$m_{b,r,k} = \mathbb{1}\big[y_{b,r,k} > 0\big],$$
which identifies non-zero crime locations. Based on this indicator, a re-weighting factor is assigned to each prediction target as
$$w_{b,r,k} = 1 + \alpha \, m_{b,r,k},$$
where $\alpha$ controls the relative emphasis on non-zero crime observations. The sparsity-aware regression loss is then formulated as
$$\mathcal{L}_{reg} = \frac{\sum_{b,r,k} w_{b,r,k}\,\big(\hat{y}_{b,r,k} - y_{b,r,k}\big)^{2}}{\sum_{b,r,k} w_{b,r,k}},$$
which normalizes the weighted error by the total effective weight. We define the binary occurrence label from the intensity label as follows:
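A runnable sketch of this sparsity-aware loss; the specific weight form `w = 1 + alpha * [y > 0]` is one plausible instantiation of the re-weighting factor described above:

```python
import numpy as np

def sparsity_weighted_mse(y_pred, y_true, alpha=4.0):
    """Normalized weighted MSE that up-weights non-zero crime targets.

    The weight form w = 1 + alpha * [y > 0] is an assumed instantiation
    of the re-weighting factor; alpha controls the emphasis on the rare
    non-zero observations.
    """
    mask = (y_true > 0).astype(float)      # binary non-zero indicator
    w = 1.0 + alpha * mask                 # re-weighting factor
    return np.sum(w * (y_pred - y_true) ** 2) / np.sum(w)

y_true = np.array([0.0, 0.0, 0.0, 3.0])
y_pred = np.array([0.5, 0.0, 0.0, 1.0])
loss = sparsity_weighted_mse(y_pred, y_true, alpha=4.0)
# weights = [1, 1, 1, 5]; weighted SSE = 0.25 + 5*4 = 20.25; total w = 8
assert np.isclose(loss, 20.25 / 8)
```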
Here, $y_{b,r,k}$ denotes the ground-truth crime count of category k in region r of batch sample b. The indicator function $\mathbb{1}[\cdot]$ maps the continuous crime intensity into a binary occurrence label:
$$o_{b,r,k} = \mathbb{1}\big[y_{b,r,k} > 0\big].$$
Thus, $o_{b,r,k}$ explicitly represents whether a crime event occurs, enabling the model to decouple event occurrence modeling from crime intensity regression.
Here, $N_{+}$ and $N_{-}$ denote the total numbers of positive (crime-occurring) and negative (non-occurring) samples, respectively. Due to the sparsity of crime data, negative samples usually dominate the dataset.
To mitigate this imbalance, we introduce a dynamic positive weighting factor $w_{+}$, computed as the ratio between negative and positive samples:
$$w_{+} = \frac{N_{-}}{N_{+} + \epsilon},$$
where the small constant $\epsilon$ ensures numerical stability.
This adaptive weighting mechanism increases the contribution of rare crime occurrences during optimization, preventing the model from being biased toward predicting non-occurrence everywhere. The occurrence loss is
$$\mathcal{L}_{occ} = -\frac{1}{N}\sum_{b,r,k}\Big[\, w_{+}\, o_{b,r,k} \log \hat{p}_{b,r,k} + \big(1 - o_{b,r,k}\big) \log\big(1 - \hat{p}_{b,r,k}\big) \Big],$$
where $\hat{p}_{b,r,k}$ represents the predicted probability that a crime of type k occurs in region r of batch sample b. The loss follows a weighted binary cross-entropy formulation, in which positive samples are up-weighted by $w_{+}$ while negative samples retain unit weight. This design encourages the model to accurately detect rare crime occurrences while maintaining robustness on abundant zero-crime regions.
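The occurrence loss can be sketched as a weighted binary cross-entropy with the dynamic positive weight computed from the batch itself (the example values are illustrative):

```python
import numpy as np

def occurrence_loss(p_pred, y_true, eps=1e-8):
    """Weighted binary cross-entropy with a dynamic positive weight.

    w_pos = (#negatives) / (#positives + eps), so rare crime occurrences
    contribute more to the loss.
    """
    o = (y_true > 0).astype(float)        # binary occurrence labels
    n_pos, n_neg = o.sum(), (1 - o).sum()
    w_pos = n_neg / (n_pos + eps)
    p = np.clip(p_pred, eps, 1 - eps)     # numerical stability
    terms = w_pos * o * np.log(p) + (1 - o) * np.log(1 - p)
    return -terms.mean()

y_true = np.array([0.0, 0.0, 0.0, 2.0])   # one positive, three negatives
loss_good = occurrence_loss(np.array([0.1, 0.1, 0.1, 0.9]), y_true)
loss_bad = occurrence_loss(np.array([0.1, 0.1, 0.1, 0.1]), y_true)
assert loss_bad > loss_good               # missing the rare positive costs more
```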
The overall training objective of the proposed model is defined as the sum of three complementary loss terms:
$$\mathcal{L} = \mathcal{L}_{local} + \mathcal{L}_{global} + \mathcal{L}_{occ}. \quad (25)$$
Specifically, the regression losses on the local prediction branch ($\mathcal{L}_{local}$) and the global prediction branch ($\mathcal{L}_{global}$) supervise the model to capture fine-grained regional patterns and long-range dependencies, respectively. Meanwhile, the occurrence loss $\mathcal{L}_{occ}$ explicitly guides the model to distinguish between crime-occurring and non-occurring instances, alleviating the severe data sparsity and class imbalance inherent in crime datasets.
By jointly optimizing these objectives in an end-to-end manner, the proposed framework achieves balanced learning of local structure, global context, and event occurrence, leading to more accurate and robust crime prediction.
5. Experiments
In this section, we first introduce the datasets and experimental settings. We then compare the proposed method with state-of-the-art baselines and analyze the results in detail. In addition, case studies and ablation experiments are presented to evaluate the effectiveness of individual model components.
5.1. Dataset Description
We conduct experiments on a publicly available real-world crime dataset collected from New York City, using the same preprocessed data as STSHN [29]. The dataset is derived from the official NYPD crime records released via NYC Open Data and covers reported crime incidents from January 2014 to December 2015. It spans 731 consecutive days and includes four major crime categories: burglary, robbery, assault, and larceny. Following a chronological splitting strategy, the dataset is divided into training and testing sets with a ratio of 7:1.
In addition, a subset corresponding to one month of data is held out from the training period as a validation set for hyperparameter tuning and early stopping. All predictions are performed at a daily temporal resolution. From a spatial perspective, the geographic area of New York City is partitioned into a uniform grid with a resolution of 3 km × 3 km, resulting in 256 non-overlapping spatial regions. Each region aggregates the crime incidents occurring within its boundaries. Detailed statistics of the dataset are summarized in Table 2.
5.2. Implementation Details
All experiments were carried out on a workstation equipped with an NVIDIA TITAN RTX GPU (24 GB memory; NVIDIA Corporation, Santa Clara, CA, USA) and an Intel Xeon Gold 5118 processor (Intel Corporation, Santa Clara, CA, USA). Model hyperparameters were tuned via a systematic grid search to ensure fair and stable evaluation. Specifically, the network depth, the hidden feature dimension, and the number of prototypes were each searched over a range of candidate values. The convolution kernel size was fixed to 3 based on empirical performance, and the batch size was set to 1 due to memory constraints. To stabilize training, gradient accumulation was adopted with an accumulation step of 16.
Through extensive validation, we observed that a network depth of 3, a hidden dimension of 256, and a reduced dimension of 2048 provided the most favorable balance between training stability and predictive accuracy. These settings were therefore used as the default configuration in all experiments.
5.3. Performance Metrics
To quantitatively assess the predictive performance of different methods, we employ Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE) as the evaluation criteria. These two metrics are commonly adopted in continuous spatio-temporal forecasting tasks due to their interpretability and effectiveness.
MAE measures the average absolute difference between predicted values and ground truth, providing a direct indication of overall prediction accuracy without being affected by the scale of the data. In contrast, MAPE evaluates prediction errors in a relative manner by normalizing the absolute error with respect to the true value, which facilitates fair comparisons across regions and time periods with varying crime intensities.
As both metrics quantify prediction errors, lower MAE and MAPE values correspond to better crime prediction performance.
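Both metrics admit straightforward implementations; in the sketch below, MAPE is computed only over non-zero ground-truth entries, a common convention for sparse count data (the paper's exact handling of zeros is an assumption here):

```python
import numpy as np

def mae(y_pred, y_true):
    """Mean Absolute Error over all entries."""
    return np.abs(y_pred - y_true).mean()

def mape(y_pred, y_true, eps=1e-8):
    """Mean Absolute Percentage Error over non-zero ground-truth entries,
    avoiding division by zero in sparse crime data."""
    nz = y_true > 0
    return np.abs((y_pred[nz] - y_true[nz]) / (y_true[nz] + eps)).mean()

y_true = np.array([0.0, 2.0, 4.0])
y_pred = np.array([0.5, 1.0, 5.0])
assert np.isclose(mae(y_pred, y_true), (0.5 + 1.0 + 1.0) / 3)
assert np.isclose(mape(y_pred, y_true), (0.5 + 0.25) / 2)
```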
5.4. Baseline Methods
To comprehensively evaluate the effectiveness of the proposed approach, we compare our GL-MoPA model against a diverse set of representative baseline methods covering traditional statistical models, classical machine learning approaches, and recent deep learning-based spatio-temporal frameworks. All baseline results are obtained from their original implementations or reported performance to ensure a fair comparison.
ARIMA [45]: The Autoregressive Integrated Moving Average (ARIMA) model is a classical statistical approach for modeling temporal dynamics in sequential data. It leverages autoregressive terms, differencing operations, and moving-average components to describe linear temporal dependencies in historical observations. In crime forecasting, ARIMA utilizes past crime records to extrapolate future trends, serving as a conventional temporal baseline for comparison with data-driven models.
SVM [46]: Support Vector Machine (SVM) is a supervised learning method that has been employed for both classification and regression tasks in crime analysis. By constructing decision functions in high-dimensional feature spaces, SVM can capture complex patterns through kernel-based transformations. This capability enables it to model non-linear relationships within crime-related features, making it a commonly used baseline in spatio-temporal prediction studies.
ST-ResNet [47]: A deep spatio-temporal residual network initially developed for city-scale flow forecasting. ST-ResNet decomposes temporal dynamics into multiple components, including short-term proximity, periodic regularity, and long-term trends, which are modeled through parallel residual convolutional branches. Spatial correlations are captured by convolutional operations, and the outputs of different temporal branches are fused together with external contextual factors to enhance prediction performance.
DCRNN [48]: The Diffusion Convolutional Recurrent Neural Network (DCRNN) is a graph-based spatio-temporal model that combines diffusion convolution with recurrent architectures. Spatial dependencies are modeled by simulating information propagation along directed graph structures via random walk processes, while temporal dependencies are learned using recurrent units within an encoder–decoder framework. This design enables effective sequence-to-sequence forecasting on structured spatial domains.
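The forward-direction diffusion step at the core of this family of models can be sketched as follows, assuming a small directed graph and illustrative filter weights `thetas` (hypothetical values, standing in for learned parameters).

```python
import numpy as np

def diffusion_conv(A, X, thetas):
    """Forward-direction diffusion convolution: sum_k theta_k (D_O^-1 A)^k X,
    where D_O^-1 A is the random-walk transition matrix of a directed graph."""
    P = A / A.sum(axis=1, keepdims=True)   # row-normalised transition matrix
    out = np.zeros_like(X, dtype=float)
    Pk = np.eye(A.shape[0])                # P^0 = identity
    for theta in thetas:
        out += theta * (Pk @ X)            # k-th diffusion term
        Pk = Pk @ P                        # advance to the next power of P
    return out

A = np.array([[0, 1, 1],
              [1, 0, 0],
              [0, 1, 0]], dtype=float)     # toy directed adjacency
X = np.array([[1.0], [2.0], [3.0]])        # one feature per node
print(diffusion_conv(A, X, thetas=[0.5, 0.3]))
```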
DeepCrime [49]: A deep learning framework specifically tailored for crime prediction that augments convolutional neural networks with attention mechanisms. DeepCrime applies attention to selectively emphasize informative spatial regions and critical temporal intervals, allowing the model to focus on salient spatio-temporal patterns embedded in historical crime data.
STDN [50]: The Spatio-Temporal Dynamic Network explicitly models dynamic temporal influences by jointly considering local temporal dependencies and periodic patterns. STDN introduces a flow-gating mechanism to adaptively integrate temporal contexts while leveraging convolutional structures to capture spatial correlations, enabling fine-grained spatio-temporal forecasting in urban environments.
ST-MetaNet [51]: A meta-learning-based spatio-temporal prediction model designed to address heterogeneity across urban regions. ST-MetaNet learns meta-knowledge from multiple regions and temporal contexts, allowing the forecasting model to dynamically adjust its parameters and improve generalization when transferring across diverse spatial areas.
STTrans [52]: A transformer-based spatio-temporal forecasting framework that models long-range dependencies through hierarchical attention mechanisms. STTrans captures spatial interactions and temporal evolution at multiple resolutions, enabling the model to learn both global and local spatio-temporal representations for complex urban prediction tasks.
STGCN [53]: The Spatio-Temporal Graph Convolutional Network models forecasting problems on graph-structured time series using a fully convolutional architecture. By interleaving graph convolution layers for spatial dependency modeling with temporal convolutions for sequence learning, STGCN effectively captures localized spatial interactions and temporal dynamics while maintaining high computational efficiency.
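The spatial building block of such graph-convolutional models can be illustrated with a first-order graph convolution over a symmetrically normalized adjacency matrix; the two-region graph and identity weight matrix below are purely illustrative, not the model's actual configuration.

```python
import numpy as np

def graph_conv(A, X, W):
    """First-order graph convolution: ReLU(D^{-1/2} (A + I) D^{-1/2} X W),
    the kind of spatial layer interleaved with temporal convolutions."""
    A_hat = A + np.eye(A.shape[0])             # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))     # symmetric normalisation
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt
    return np.maximum(A_norm @ X @ W, 0.0)     # ReLU activation

A = np.array([[0, 1], [1, 0]], dtype=float)    # two connected regions
X = np.array([[1.0, 0.0], [0.0, 1.0]])         # toy node features
W = np.eye(2)                                  # identity weights for clarity
print(graph_conv(A, X, W))                     # each node averages its neighbourhood
```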
GWN [54]: Graph WaveNet is a graph-based spatio-temporal forecasting framework that integrates adaptive graph learning with dilated temporal convolutions. It dynamically infers hidden spatial relationships among nodes while employing causal and dilated convolutions to capture long-range temporal dependencies, enabling flexible modeling without relying on a fixed predefined graph structure.
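The self-adaptive adjacency idea can be sketched as a row-wise softmax over ReLU-activated products of two node-embedding matrices, following the formulation popularized by Graph WaveNet; the random embeddings below stand in for learned parameters.

```python
import numpy as np

def adaptive_adjacency(E1, E2):
    """Self-learned adjacency: A_adp = softmax(ReLU(E1 @ E2.T), axis=1),
    where E1 and E2 are learnable source/target node embeddings."""
    logits = np.maximum(E1 @ E2.T, 0.0)              # ReLU keeps scores non-negative
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)      # row-wise softmax

rng = np.random.default_rng(0)
E1 = rng.normal(size=(4, 3))   # 4 nodes, embedding dimension 3 (illustrative)
E2 = rng.normal(size=(4, 3))
A_adp = adaptive_adjacency(E1, E2)
print(A_adp.sum(axis=1))       # each row is a probability distribution
```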
GMAN [32]: The Graph Multi-Attention Network (GMAN) leverages attention mechanisms to model complex spatio-temporal dependencies in graph-structured data. GMAN adopts an encoder–decoder architecture composed of stacked attention blocks, allowing the model to dynamically capture both spatial interactions across nodes and temporal influences over historical sequences, thereby improving long-horizon prediction stability.
AGCRN [55]: The Adaptive Graph Convolutional Recurrent Network (AGCRN) is designed to learn spatial dependencies directly from data by constructing node-adaptive graph representations. By integrating adaptive graph convolutions with recurrent neural units, AGCRN jointly models evolving spatial structures and temporal patterns, offering enhanced flexibility for multivariate time-series forecasting.
MTGNN [27]: The Multivariate Time Series Graph Neural Network introduces a learnable graph structure to explicitly capture inter-variable dependencies in multivariate temporal data. MTGNN combines graph convolution operations with temporal modeling modules, enabling the framework to uncover latent relational structures among time series while effectively modeling their temporal evolution.
STSHN [29]: The Spatial–Temporal Sequential Hypergraph Network formulates crime forecasting by representing regions and crime types within a unified hypergraph structure. Through dynamic hypergraph construction, STSHN models high-order spatial interactions among regions as well as relational dependencies across crime categories. Sequential hypergraph propagation combined with multi-channel routing enables the model to capture complex spatio-temporal dependencies in urban crime data.
DMSTGCN [28]: The Dynamic and Multi-faceted Spatio-Temporal Graph Convolutional Network (DMSTGCN) is designed to handle time-evolving spatio-temporal patterns by dynamically adapting graph-based representations. By jointly modeling diverse temporal characteristics and adjusting spatial dependencies over time, DMSTGCN effectively captures the evolving relationships inherent in complex spatio-temporal sequences.
ST-HSL [56]: The Spatial–Temporal Hypergraph Self-Supervised Learning framework mitigates supervision sparsity in crime prediction by leveraging hypergraph-based representation learning together with self-supervised objectives. ST-HSL integrates local and global spatio-temporal information while introducing auxiliary self-supervised tasks to enhance regional representation robustness under limited labeled data.
Following prior studies, ST-HSL is adopted as the primary hypergraph-based baseline, and all hyperparameters are set according to the configurations reported in the original work.
5.5. Experimental Results
As shown in Table 3, GL-MoPA demonstrates consistent performance gains over the hypergraph-based baseline ST-HSL across all crime categories on the NYC dataset. Specifically, compared with ST-HSL, GL-MoPA reduces MAE and MAPE by 1.9% and 3.8% for burglary, 0.4% and 1.2% for larceny, 4.9% and 1.7% for robbery, and 9.3% and 9.6% for assault, respectively.
Overall, GL-MoPA achieves average relative improvements of 4.1% in MAE and 4.6% in MAPE compared with ST-HSL. These results suggest that the proposed model more effectively balances global and local spatio-temporal representations, leading to improved robustness across diverse crime patterns.
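The relative improvements reported here follow the usual definition of relative error reduction over a baseline metric value; the per-category numbers below are hypothetical placeholders for illustration, not the actual measurements from Table 3.

```python
def relative_reduction(baseline, model):
    # Relative error reduction (%) of a model over a baseline metric value
    return (baseline - model) / baseline * 100

# hypothetical per-category MAE values, for illustration only
baseline_mae = {"burglary": 1.05, "larceny": 2.50}
model_mae = {"burglary": 1.03, "larceny": 2.49}
for crime, b in baseline_mae.items():
    print(crime, round(relative_reduction(b, model_mae[crime]), 1))
```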
5.6. Robustness Analysis
We additionally investigate the robustness of the proposed GL-MoPA under different levels of data sparsity. To this end, regions are grouped according to their crime occurrence density, quantified as the proportion of non-zero entries in the regional crime time series. Based on this criterion, low-activity regions are further partitioned into two intervals, a highly sparse interval and a moderately sparse interval. The corresponding performance comparisons across different sparsity levels are presented in Figure 4.
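The density criterion can be computed directly from the regional count series. In the sketch below, the interval edges are hypothetical placeholders chosen for illustration; the exact thresholds are fixed by the experimental protocol.

```python
import numpy as np

def region_density(series):
    """Crime occurrence density of a region: the proportion of
    non-zero entries in its crime-count time series."""
    series = np.asarray(series)
    return np.count_nonzero(series) / series.size

def bucket(density, edges=(0.25, 0.5)):
    # edges are illustrative assumptions, not the paper's thresholds
    if density <= edges[0]:
        return "highly sparse"
    if density <= edges[1]:
        return "moderately sparse"
    return "dense"

s = [0, 1, 0, 0, 2, 0, 0, 0]   # 2 non-zero entries out of 8
print(region_density(s), bucket(region_density(s)))
```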
As illustrated in Figure 4, GL-MoPA exhibits clear robustness advantages over ST-HSL under sparse-data conditions. In the highly sparse interval, GL-MoPA reduces MAE, MAPE, and RMSE by 10.3%, 11.3%, and 8.7%, respectively, compared with ST-HSL.
The performance gap becomes more pronounced in the moderately sparse interval, where GL-MoPA achieves relative reductions of 17.2% in MAE, 21.0% in MAPE, and 10.8% in RMSE. These results indicate that GL-MoPA is particularly effective at mitigating the prediction degradation caused by data sparsity, consistently delivering lower errors across multiple evaluation metrics.
5.7. Ablation Study
To investigate the contribution of individual components in GL-MoPA, we perform a series of ablation studies by selectively removing or modifying key modules of the proposed architecture. This experimental setting allows us to isolate the effect of each component and assess its impact on overall prediction performance.
As illustrated in Figure 5, the complete GL-MoPA model achieves strong overall performance across all the evaluation metrics. Among the examined components, the prototype attention module contributes most significantly to reducing overall error compared with the variant without prototype attention, leading to decreases of 23.4% in RMSE, 6.5% in MAE, and 6.1% in MAPE. This highlights its critical role in capturing global spatio-temporal dependencies and stabilizing prediction magnitude.
The local modeling module also provides notable performance gains compared with the variant without local modeling, yielding reductions of 6.9% in MAE and 14.6% in MAPE, together with a lower RMSE. In particular, its substantial impact on MAPE indicates that local representations are especially effective in mitigating relative errors and fine-grained deviations across regions.
It is worth noting that the variant without the two-stage module achieves slightly lower overall RMSE, MAE, and MAPE. However, the primary purpose of introducing the two-stage mechanism is not to optimize global metrics but rather to enhance robustness under data sparsity. As demonstrated in Figure 6, the complete GL-MoPA model consistently outperforms its counterpart without the occurrence-aware module in sparse regions, confirming the necessity of this design for reliable crime forecasting.
In addition, we examine the contribution of each module to the prediction of individual crime types.
Figure 7 presents a fine-grained ablation analysis of GL-MoPA across different crime categories, reporting RMSE, MAE, and MAPE for burglary, larceny, robbery, and assault under various module removal settings. From the RMSE perspective, removing the prototype attention module leads to the most pronounced performance degradation, particularly for larceny and assault. This indicates that prototype-level global interaction modeling is crucial for stabilizing prediction magnitude and capturing long-range spatio-temporal dependencies across regions, whereas local representations mainly serve as complementary refinements rather than the primary source of magnitude correction.
A similar trend can be observed in the MAE results. The removal of prototype attention consistently yields higher absolute errors across all crime categories, confirming its dominant contribution to overall prediction accuracy. The local modeling module also plays an important role in reducing MAE, especially for robbery and assault, where fine-grained spatial cues are more informative. Notably, eliminating the two-stage module only slightly affects MAE, reinforcing the observation that this component does not primarily target overall accuracy metrics.
The impact of different modules becomes more distinctive when evaluated using MAPE. Removing the local modeling module leads to the largest relative error increase across all the crime types, highlighting its effectiveness in mitigating proportional and fine-grained deviations. This observation suggests that local modeling is particularly important for controlling relative prediction error, especially in regions with varying crime intensities. The two-stage module, in turn, contributes more to reducing MAPE than to the other metrics.
Overall, the ablation results reveal a clear division of responsibilities among the different components: prototype attention primarily stabilizes global prediction magnitude, local modeling effectively reduces relative error, and the two-stage crime prediction module enhances robustness under sparse-data scenarios. Collectively, these results demonstrate that each module plays a complementary and indispensable role in the overall framework, validating the rationality of our model design.
6. Conclusions
In this paper, we propose GL-MoPA, a novel global–local spatio-temporal framework for city-scale crime prediction. By explicitly integrating prototype attention for global dependency modeling, local modeling for fine-grained spatial refinement, and a two-stage crime prediction mechanism for occurrence-aware learning, GL-MoPA is designed to jointly address the challenges of complex spatio-temporal correlations and severe data sparsity in real-world crime datasets.
Extensive experiments conducted on the NYC crime dataset demonstrate that GL-MoPA consistently outperforms a wide range of classical statistical models and state-of-the-art deep learning baselines across multiple crime categories. The quantitative results show notable improvements in RMSE, MAE, and MAPE, confirming the effectiveness of the proposed framework in capturing both global trends and local variations in crime dynamics.
Furthermore, robustness evaluations under different sparsity levels reveal that GL-MoPA maintains stable predictive performance in sparse regions, where conventional models often suffer from severe degradation. From a practical perspective, this robustness reduces the risk of overestimation or missed detection in low-incident areas, thereby improving the reliability of resource allocation and patrol planning decisions. Ablation studies provide additional insights into the functional roles of individual components, showing that prototype attention primarily stabilizes prediction magnitude, local modeling effectively reduces relative error, and the two-stage mechanism significantly enhances robustness under sparse-data scenarios. These findings collectively validate the rationality and necessity of each module in the proposed architecture.
Overall, the improved forecasting stability and reduced relative error achieved by GL-MoPA enhance decision support for urban safety management, enabling more informed deployment strategies and proactive crime prevention measures.
In future work, we plan to extend GL-MoPA to incorporate additional contextual factors, such as socio-economic indicators and mobility patterns, and to explore its applicability to other spatio-temporal forecasting tasks, including traffic flow and emergency event prediction.