1. Introduction
Urban crime prediction has become one of the most consequential application areas for spatiotemporal machine learning [1], with direct implications for public safety, equitable resource allocation, and evidence-based urban governance. Despite substantial progress in geospatial analytics and deep learning, predicting where and when crime will occur with sufficient accuracy and interpretability remains an open challenge. A central difficulty lies in the multi-scale, hierarchical nature of crime distributions across urban space and time, a feature that most existing models do not represent adequately.
To understand why hierarchical spatial representation matters for crime prediction, consider the nested organisation of crime across urban environments. At the broadest scale, crime concentrates in specific neighbourhoods due to the co-presence of socioeconomic deprivation, residential instability, and reduced informal social control [2]. Within each neighbourhood, crime is further stratified at the street-segment level: a well-established empirical finding in criminology is that a small proportion of street segments, typically 4–6%, account for the majority of crime incidents in any given city [3]. This produces a nested micro-hierarchy within the broader neighbourhood pattern. Furthermore, crime patterns differ systematically between urban cores and peripheral areas: central business districts and major transit hubs concentrate certain crime types (robbery, pickpocketing, assault), while suburban and peripheral zones exhibit distinct crime profiles (residential burglary, domestic violence), reflecting differences in land-use composition, routine activity flows, and population density [4]. Research on Bratislava, for example, has demonstrated that peripheral suburban areas are statistically safer than the urban core, highlighting the centre–periphery asymmetry in crime hierarchies that any spatially sensitive model must account for [5].
This nested, tree-like organisation, which runs from city-level patterns through district-level zones and neighbourhood clusters down to street-segment micro-hotspots, is also shaped by social networks (co-offending ties and gang territories), land-use structure (commercial strips, mixed-use corridors, and residential enclaves), and transport infrastructure (transit nodes, arterial roads, and pedestrian corridors acting as conduits for crime displacement) [4]. The urban environment itself, through principles of crime prevention through environmental design (CPTED), directly moderates crime levels by influencing visibility, access control, territorial definition, and the activity patterns of residents, offenders, and guardians [6,7]. These hierarchical structures are precisely the kind of relationships that Euclidean spatial models fail to represent faithfully, motivating the use of hyperbolic geometry, a mathematical space in which hierarchical tree-like structures can be embedded with minimal distortion [8].
The temporal dimension of crime data presents complementary challenges. Crime patterns exhibit multi-scale periodicity, ranging from within-day fluctuations driven by routine activity rhythms (e.g., peaks during commuting hours and weekend nights) to weekly cycles (weekday versus weekend patterns) and seasonal variations. Previous work has employed Fourier transforms [9] or wavelet analysis [10] to capture these periodicities, but these methods often require manual frequency specification and do not adapt to changes in pattern structure over time. Moreover, long-term trends influenced by socioeconomic change operate at an entirely different temporal scale. No existing model captures all of these temporal scales jointly within a single end-to-end trainable framework.
We propose a multi-scale geo-temporal crime embedding (MSG-TCE) framework to address these challenges. MSG-TCE integrates three key innovations: (1) a hierarchical residual temporal encoder (HRTE) that learns crime patterns at multiple temporal scales through dilated convolutions with residual connections [11]; (2) a periodic transformer encoder (PTE) that uses self-attention conditioned on cyclical positional encodings to model periodic dependencies explicitly; and (3) a hyperbolic spatial pooler (HSP) that maps geographic crime data into hyperbolic space and aggregates neighbourhood information via graph convolutions, thereby preserving the hierarchical structure of urban crime distributions. These components are fused through a gated cross-attention mechanism that learns the optimal combination of temporal and spatial representations for each spatial unit.
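To make the idea of a cyclical positional encoding concrete, the following minimal sketch maps an hourly time index onto sine/cosine pairs so that hour 23 and hour 0 end up close together. The function name `cyclical_encoding` and the choice of 24 h and 168 h periods are illustrative assumptions, not the encoding actually used inside the PTE.

```python
import math

def cyclical_encoding(hour_index, periods=(24, 168)):
    """Encode an hourly time index as one (sin, cos) pair per period.

    `periods` are given in hours; 24 (daily) and 168 (weekly) are assumed
    values chosen for illustration only.
    """
    features = []
    for period in periods:
        # Position within the cycle, mapped onto the unit circle.
        angle = 2.0 * math.pi * (hour_index % period) / period
        features.extend([math.sin(angle), math.cos(angle)])
    return features
```

Because the encoding lives on the unit circle, the distance between consecutive hours is the same everywhere in the cycle, including across midnight, which a raw integer hour feature would not guarantee.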
The primary contributions of this work are fourfold. First, we introduce a unified framework that jointly models multi-scale temporal dynamics and hierarchical spatial patterns of criminal activity, overcoming the limitations of existing approaches that treat these dimensions separately. Second, we develop novel architectural components specifically designed for the crime prediction domain, with explicit motivation from environmental criminology and urban geography. Third, we demonstrate through experiments on three real-world datasets, Chicago, Los Angeles, and New York City, that MSG-TCE achieves consistent and statistically significant improvements over five state-of-the-art baselines. Fourth, we present spatial visualisations, ablation analyses, robustness checks, and an exploratory covariate-augmented experiment that together provide a thorough empirical foundation for the framework’s claims, while also openly discussing its limitations and the ethical responsibilities associated with algorithmic crime prediction.
A clarification of the scientific scope of this contribution is warranted. The empirical regularities that motivate MSG-TCE (temporal periodicity, multi-scale spatial concentration, and spatiotemporal interaction) are long established in environmental criminology through routine activity theory and crime pattern theory, and we do not claim to rediscover them. The contribution of this work is therefore methodological and analytical rather than the assertion of a new criminological law. Methodologically, MSG-TCE encodes the well-documented nested organisation of crime (city → district → neighbourhood → street segments) as an explicit geometric inductive bias in hyperbolic space, instead of approximating it with hand-engineered features or distortion-prone Euclidean encoders. This is what allows the model to recover a hierarchical structure that shallow models and standard Euclidean graph networks compress. Analytically, the framework supplies a quantitative test of where and how much this geometry matters, through the component ablation (Section 5.3), the centre–periphery improvement maps (Section 5.2 and Section 5.5), and the reduced residual spatial autocorrelation reported in Section 5.6. This is evidence that traditional spatial statistics and single-scale models do not provide. The novelty is accordingly framed as a geometry-aware representation and the systematic evidence for its predictive value, not as the discovery of previously unknown crime mechanisms.
The remainder of this paper is organised as follows. Section 2 reviews related work and defines the research gap. Section 3 formalises the problem. Section 4 details the MSG-TCE architecture. Section 5 presents experimental results, including spatial visualisations and robustness analyses. Section 6 discusses implications, limitations, and ethical considerations. Section 7 concludes the paper.
2. Related Work
Crime prediction has evolved significantly with advances in machine learning and spatiotemporal analysis. Existing approaches can be broadly categorised into temporal modelling techniques, spatial analysis methods, and hybrid spatiotemporal frameworks. Each category addresses distinct aspects of the crime prediction challenge while facing unique limitations.
2.1. Temporal Modelling for Crime Prediction
Traditional time-series analysis methods have been widely applied to crime prediction, with autoregressive models [12] capturing linear temporal dependencies. However, these approaches fail to model the complex, non-linear patterns inherent in criminal activity. Recent work has employed deep learning architectures, where recurrent neural networks [13] demonstrated improved performance by learning sequential patterns. Hierarchical temporal memory offers a complementary approach to multi-scale temporal pattern recognition [14]. The hierarchical residual encoding approach [15] showed promise in extracting multi-scale temporal features, though it was not specifically designed for crime data. Periodic patterns in crime occurrence have been addressed through Fourier-based methods [16], but these typically require manual specification of relevant frequencies. Building on the self-attention mechanism of the transformer architecture [17], the periodic transformer encoder [18] introduced an attention mechanism conditioned on cyclical patterns, offering a more flexible approach to periodicity modelling that we adapt for crime prediction.
2.2. Spatial Analysis in Crime Prediction
Within the broader GeoAI paradigm, which integrates spatially explicit artificial intelligence with geographic knowledge discovery [19], a range of spatial statistical and machine-learning methods have been brought to bear on crime analysis. Spatial crime patterns have been traditionally analysed using kernel density estimation [20] or spatial autocorrelation measures [21]. These methods capture local spatial dependencies, but ignore the hierarchical structure of urban environments. Graph-based approaches [22] represent geographic relationships more flexibly, though they typically operate in Euclidean space. Recent work in hyperbolic geometry [23] has shown that hierarchical relationships are better represented in non-Euclidean spaces, motivating our hyperbolic spatial pooler. Spatiotemporal graph convolutional networks [24] attempted to combine spatial and temporal modelling, but their Euclidean spatial assumptions limit performance on hierarchically organised crime data.
Environmental criminology provides an essential theoretical foundation for understanding the spatial patterning of crime. Routine activity theory [25] posits that crime occurs at the convergence of motivated offenders, suitable targets, and the absence of guardians in space and time. Crime pattern theory [4] further describes how offenders develop awareness spaces and crime attractors that produce hierarchically organised activity spaces. CPTED (crime prevention through environmental design) operationalises these insights by demonstrating that physical features of the urban environment, such as estate layout, natural surveillance, access control, and territorial reinforcement, directly and measurably influence crime levels [6,7]. The work of Matlovičová, Mocák, and Kolesárová [6] offers a particularly instructive example of CPTED applied to estate environments, demonstrating how targeted environmental modifications reduce crime by altering the spatial opportunity structure. These criminological insights directly inform what features a crime prediction model should represent and what spatial scales it should operate across.
2.3. Hybrid Spatiotemporal Approaches
The integration of spatial and temporal modelling has emerged as a promising direction for crime prediction. Early hybrid approaches [26] combined separate spatial and temporal models through late fusion, missing important cross-dimensional interactions. More recent work has employed mixture-of-expert architectures to model different crime types [27] or combined transformer encoders with graph convolutional networks [28]. However, these methods either treat time and space independently or use simplistic fusion mechanisms, failing to capture the complex interdependencies between temporal dynamics and spatial hierarchies that characterise criminal activity.
The proposed MSG-TCE framework advances beyond existing approaches by simultaneously addressing three key limitations: (1) it captures multi-scale temporal patterns through hierarchical residual encoding rather than treating time as a single scale; (2) it explicitly models periodic dependencies through a dedicated transformer component instead of relying on manual feature engineering; and (3) it represents spatial relationships in hyperbolic space to better preserve hierarchical structures, unlike Euclidean-based spatial models. This integrated approach enables more accurate modelling of the complex spatiotemporal dynamics underlying criminal activity patterns.
2.4. Ethical and Criminological Context of Algorithmic Crime Prediction
The development of predictive policing systems carries significant ethical responsibilities. Place-based prediction systems and risk terrain models have been deployed in operational law enforcement settings since the early 2010s [29,30], but criminologists and civil-society researchers have raised persistent concerns about their societal consequences. Research has demonstrated that crime prediction models trained on historical policing data can perpetuate and amplify existing enforcement biases, particularly in communities subject to over-policing [31,32]. Feedback loops arise when increased surveillance in predicted hotspots generates more recorded crimes in those areas, reinforcing the model’s predictions regardless of underlying crime rates. Fairness in criminal justice risk assessments requires both technical mitigation strategies and transparent governance frameworks [33]. These considerations are directly relevant to the MSG-TCE framework and are discussed in Section 6.3.
2.5. Research Gap
The review above reveals three specific gaps that motivate the present work. First, no existing framework simultaneously captures short-term, periodic, and long-term temporal dynamics in an end-to-end trainable manner tailored for crime data. Existing methods handle at most one or two of these temporal scales, either through manual feature engineering or single-scale deep learning architectures. Second, the nested, tree-like hierarchical organisation of crime across city–district–neighbourhood–street-segment scales is not adequately represented by existing spatial models, all of which operate in Euclidean space and therefore distort hierarchical proximity relationships. Although hyperbolic geometry has been applied in general network embedding contexts, it has not been integrated with crime-specific graph convolutions or fused with multi-scale temporal encoders. Third, the complex non-Euclidean relational geometry of crime, shaped by social network structures, land-use morphology, and transport infrastructure, is structurally misrepresented by standard spatial adjacency or Euclidean distance metrics. The MSG-TCE framework is designed specifically to close these three gaps through its HRTE, PTE, and HSP components and their gated fusion.
5. Experiments
To evaluate the effectiveness of the proposed MSG-TCE framework, we conducted comprehensive experiments on real-world crime datasets. Our evaluation focused on three key aspects: (1) comparative performance against state-of-the-art baselines, (2) ablation studies to validate architectural components, and (3) spatiotemporal pattern analysis.
5.1. Experimental Setup
Datasets: We evaluated our model on three publicly available crime datasets from major metropolitan areas: Chicago crime data [35], Los Angeles crime data [36], and New York City crime data [37]. Each dataset contains geotagged crime reports spanning 5 years (2017–2021) with a temporal resolution of 1 h and a spatial resolution of 500 m grid cells. We preprocessed the data to include 12 crime categories as features and normalised counts by population density.
Data Sources and Accessibility: The three datasets are obtained from official municipal open-data portals. The Chicago data are drawn from the City of Chicago Data Portal (Crimes—2001 to Present), maintained by the Chicago Police Department; the Los Angeles data from the Los Angeles Open Data portal (Crime Data from 2020 to Present), maintained by the Los Angeles Police Department; and the New York City data from the NYC Open Data portal (NYPD Complaint Data Historic), maintained by the New York City Police Department. For each source, we record the persistent dataset identifier and the date of retrieval to support reproducibility, and we use the official record geocoordinates and time stamps without modification beyond the preprocessing described below.
Data Preprocessing: Each raw dataset is processed through an identical five-stage pipeline. (i) Records with missing or invalid geographic coordinates, or without a valid crime-category label, are removed. (ii) The remaining records are projected to an Albers equal-area coordinate system and snapped to a uniform 500 m grid so that cell areas are comparable across the study region. (iii) Event time stamps are aggregated into one-hour bins expressed in Coordinated Universal Time (UTC) to ensure temporal consistency. (iv) Per-cell counts are normalised by tract-level population density derived from the 2020 American Community Survey (ACS), yielding incidence rates that are comparable across cells of differing residential exposure. (v) Normalised counts are log-transformed (log(1 + x)) to stabilise variance prior to model training.
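The gridding, hourly binning, normalisation, and log-transform stages can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' pipeline: coordinates are assumed to be already projected to metres with invalid records removed, timestamps are assumed to be UTC epoch seconds, and the names `grid_hour_counts`, `normalise_and_log`, and the `eps` guard are hypothetical.

```python
import numpy as np

CELL_SIZE_M = 500.0       # grid resolution stated in the text
SECONDS_PER_HOUR = 3600

def grid_hour_counts(x_m, y_m, t_utc_s, n_cols, n_rows, n_hours, t0_utc_s=0):
    """Count events per (hour, row, col) from projected coordinates in metres."""
    x_m, y_m, t_utc_s = (np.asarray(a) for a in (x_m, y_m, t_utc_s))
    col = np.floor(x_m / CELL_SIZE_M).astype(int)
    row = np.floor(y_m / CELL_SIZE_M).astype(int)
    hour = ((t_utc_s - t0_utc_s) // SECONDS_PER_HOUR).astype(int)
    # Keep only records that fall inside the grid and the time window.
    ok = ((col >= 0) & (col < n_cols) & (row >= 0) & (row < n_rows)
          & (hour >= 0) & (hour < n_hours))
    counts = np.zeros((n_hours, n_rows, n_cols))
    # np.add.at accumulates correctly when several events share one bin.
    np.add.at(counts, (hour[ok], row[ok], col[ok]), 1.0)
    return counts

def normalise_and_log(counts, pop_density, eps=1e-9):
    """Divide by per-cell population density, then apply log(1 + x)."""
    rates = counts / (np.asarray(pop_density, float) + eps)  # broadcasts over hours
    return np.log1p(rates)
```

The `eps` term only guards against division by zero in unpopulated cells; how such cells are treated in practice is a modelling decision the sketch does not settle.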
Spatial Units and Graph Construction: The 500 m grid yields 2041 active cells for Chicago, 1876 for Los Angeles, and 2213 for New York City after empty cells are removed. Each active cell is treated as a node of the spatial graph. Edges are established using a queen contiguity criterion (cells sharing an edge or vertex), and edge weights are assigned using a Gaussian distance-decay function with a bandwidth of 1 km. The resulting weighted adjacency matrix is row-normalised and supplied to the message-passing layers. The same construction is used consistently across all three cities and is summarised in Section 4.2.
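A minimal sketch of this graph construction is given below. The exact kernel form is not stated in the text, so the Gaussian `exp(-d^2 / (2 h^2))` with `h` equal to the 1 km bandwidth is an assumption, as are the function name `queen_gaussian_adjacency` and the use of centroid-to-centroid distances. A dense matrix is used for readability; a sparse format would be preferable for roughly 2000 nodes.

```python
import numpy as np

def queen_gaussian_adjacency(active_cells, cell_size_m=500.0, bandwidth_m=1000.0):
    """Row-normalised queen-contiguity adjacency for cells given as (row, col)."""
    cells = [tuple(c) for c in active_cells]
    index = {cell: i for i, cell in enumerate(cells)}
    n = len(cells)
    W = np.zeros((n, n))
    for (r, c), i in index.items():
        # Queen contiguity: the 8 surrounding cells (shared edge or vertex).
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                if dr == 0 and dc == 0:
                    continue
                j = index.get((r + dr, c + dc))
                if j is None:
                    continue  # neighbour is an empty (inactive) cell
                dist = cell_size_m * np.hypot(dr, dc)  # centroid distance in metres
                W[i, j] = np.exp(-dist ** 2 / (2.0 * bandwidth_m ** 2))
    row_sums = W.sum(axis=1, keepdims=True)
    # Isolated cells keep an all-zero row instead of dividing by zero.
    return np.divide(W, row_sums, out=np.zeros_like(W), where=row_sums > 0)
```

With a 500 m grid, edge neighbours sit 500 m apart and vertex neighbours about 707 m apart, so the decay function gives diagonal neighbours a slightly smaller weight before normalisation.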
Training, Validation, and Test Split: To respect the temporal ordering of events and to prevent information leakage from future observations into model fitting, the data are partitioned chronologically rather than at random. The earliest 60% of each time series is used for training, the subsequent 20% for validation and model selection, and the final 20% for testing. No observation from the validation or test periods is visible during training, and all reported metrics are computed exclusively on the held-out test period.
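The chronological partition can be expressed in a few lines. The helper name `chronological_split` is hypothetical, and rounding behaviour at the boundaries is an assumption.

```python
def chronological_split(n_steps, train_frac=0.6, val_frac=0.2):
    """Return index ranges (train, val, test) that preserve temporal order."""
    train_end = int(round(n_steps * train_frac))
    val_end = train_end + int(round(n_steps * val_frac))
    # The test period is whatever remains after the validation period.
    return range(0, train_end), range(train_end, val_end), range(val_end, n_steps)
```

Because the ranges are contiguous and ordered, every training index precedes every validation index, which in turn precedes every test index.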
Baselines: We compared MSG-TCE against five state-of-the-art approaches:
ST-ResNet [38]: A residual network for spatiotemporal prediction.
CrimeForecaster [39]: A GNN-based crime prediction model.
ST-GDN [40]: A dynamic graph network for crime analysis.
T-GCNs [41]: Temporal graph convolutional networks.
HyperST-Net [42]: A hyperbolic spatiotemporal network.
Implementation Details: MSG-TCE was implemented in PyTorch 2.x with the following configuration: HRTE depth = 4 with dilation rates [1, 2, 4, 8], PTE with 4 attention heads and 3 periodicities (daily, weekly, monthly), and HSP with hyperbolic dimension = 64 and 3 graph convolution layers. We trained for 100 epochs using the AdamW optimiser with an initial learning rate of $1 \times 10^{-3}$ and cosine decay. All experiments were conducted on Nvidia V100 GPUs.
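Two quantities implied by this configuration can be worked out directly: the temporal receptive field of the stacked dilated convolutions and the learning rate under cosine decay. The sketch below assumes one convolution per dilation level, a kernel size of 3, and a cosine schedule without warm-up that decays to zero; none of these is stated in the text, and both function names are hypothetical.

```python
import math

def receptive_field(dilations, kernel_size):
    """Receptive field, in time steps, of stacked dilated 1-D convolutions
    (assuming one convolution per dilation level)."""
    return 1 + (kernel_size - 1) * sum(dilations)

def cosine_lr(epoch, total_epochs, lr0=1e-3):
    """Cosine decay from lr0 at epoch 0 towards 0 at total_epochs."""
    return 0.5 * lr0 * (1.0 + math.cos(math.pi * epoch / total_epochs))
```

Under these assumptions, `receptive_field([1, 2, 4, 8], 3)` evaluates to 31 hourly steps, slightly more than one day of history per HRTE stack.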
Evaluation Protocol: We used a 60–20–20 split for training, validation, and testing. Performance was evaluated at prediction horizons of 1, 6, and 24 h using three metrics: RMSE (Equation (5)), P@20 (Equation (4)), and DTW [34]. Results are reported as means ± standard deviation over 5 random seeds.
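For orientation, the three metrics can be sketched as below. The definitions in Equations (4) and (5) are not reproduced in this section, so the P@20 variant shown here (overlap between the 20 highest-predicted and 20 highest-observed cells) and the absolute-difference DTW cost are assumptions, and the function names are hypothetical.

```python
import numpy as np

def rmse(pred, obs):
    """Root mean squared error over all cells and time steps."""
    pred, obs = np.asarray(pred, float), np.asarray(obs, float)
    return float(np.sqrt(np.mean((pred - obs) ** 2)))

def precision_at_k(pred, obs, k=20):
    """Share of the k highest-predicted cells that are also among the
    k highest-observed cells (assumed definition of P@k)."""
    top_pred = set(np.argsort(pred)[-k:].tolist())
    top_obs = set(np.argsort(obs)[-k:].tolist())
    return len(top_pred & top_obs) / k

def dtw_distance(a, b):
    """Classic dynamic time warping with absolute-difference local cost."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible warping moves.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])
```

The DTW routine is quadratic in sequence length, which is acceptable for short evaluation windows but would need a windowed variant for long series.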
Hyperparameter Tuning: Hyperparameters are selected through Bayesian optimisation using the Optuna framework with 100 trials per model, optimising validation RMSE. Early stopping with a patience of 10 epochs (monitored on validation RMSE) is applied to all models to guard against overfitting and to ensure a comparable training budget. The complete search ranges and the final selected values for MSG-TCE and every baseline are reported in Supplementary Table S1 to support exact reproduction.
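The early-stopping rule can be illustrated independently of any training framework. The sketch below takes a list of per-epoch validation RMSE values and reports where training would halt; the function name and the strict-improvement criterion (no minimum delta) are assumptions.

```python
def early_stopping_epoch(val_rmse_per_epoch, patience=10):
    """Return (best_epoch, stop_epoch), both 0-indexed, for a sequence of
    validation RMSE values under a patience-based stopping rule."""
    best, best_epoch = float("inf"), -1
    for epoch, value in enumerate(val_rmse_per_epoch):
        if value < best:
            best, best_epoch = value, epoch
        elif epoch - best_epoch >= patience:
            # `patience` consecutive epochs without improvement: stop here.
            return best_epoch, epoch
    return best_epoch, len(val_rmse_per_epoch) - 1
```

In practice the model checkpoint from `best_epoch` would be the one evaluated on the test period.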
5.2. Comparative Results
Table 1, Table 2 and Table 3 present the quantitative comparison across all methods and datasets, reporting RMSE, Precision@20, and DTW separately for clarity. Across the nine city–horizon settings, MSG-TCE attains the best score in every column, and pairwise Wilcoxon signed-rank tests over the five-seed results confirm that its improvements over the strongest baseline (HyperST-Net) are statistically significant (p < 0.05), with all 24 h improvements significant at p < 0.01.
The performance gap widens at longer prediction horizons: at the 24 h horizon, MSG-TCE achieves 12–15% lower RMSE (Table 1) and 8–10% higher P@20 (Table 2) than the best baseline, HyperST-Net. This pattern is consistent with the design of the framework. The widening advantage at longer horizons is attributable to the HRTE, whose dilated residual blocks capture long-range temporal dependencies that single-scale encoders miss, while the larger relative gains observed for the denser NYC grid are consistent with the HSP’s capacity to resolve fine-grained spatial hierarchies in high-density environments. An error analysis of the residuals (Section 5.5) indicates that the largest absolute errors remain concentrated in a small number of high-volume central business-district cells and in property-crime categories with irregular reporting, rather than being uniformly distributed across space.
Figure 2 visualises the spatial distribution of prediction accuracy (1 − P@20) across the Chicago analysis grid, where each cell denotes a neighbourhood-scale spatial unit and warmer tones indicate higher accuracy. Accuracy is reported for MSG-TCE and the strongest baseline (HyperST-Net) on a common colour scale to permit direct comparison. MSG-TCE attains consistently higher accuracy across the grid, and its advantage is most pronounced in the structurally complex peripheral and transitional cells where Euclidean baselines tend to underperform, indicating that the model delivers more uniform accuracy across the urban hierarchy. This centre–periphery pattern is consistent with the asymmetry in suburban crime distributions documented for other cities [5].
5.3. Ablation Studies
To understand the contribution of each component, we conducted ablation studies by removing key elements from MSG-TCE:
w/o HRTE: Replaced with standard temporal convolution.
w/o PTE: Removed periodic transformer encoder.
w/o HSP: Used Euclidean graph convolution instead.
w/o Fusion: Used concatenation instead of gated fusion.
Table 4 shows the ablation results on the Chicago dataset (24 h prediction).
The results demonstrate that all components contribute to model performance, with HSP having the largest impact (a 15% increase in RMSE when removed). This confirms the importance of hyperbolic spatial representation for crime prediction.
5.4. Temporal Pattern Analysis
Figure 3 shows how prediction performance (P@20) varies across different times of day. MSG-TCE performs more stably than baselines, particularly during transition periods (6–9 am and 6–9 pm) when crime patterns shift between daytime and night-time behaviours. This demonstrates our model’s effectiveness in capturing periodic patterns through the PTE component.
5.5. Spatial Visualisation of Predicted Crime Risk
Beyond the aggregate accuracy reported in Section 5.2, Section 5.3 and Section 5.4, a spatially explicit assessment is essential for understanding where the model succeeds and where prediction error concentrates. This section presents a qualitative spatial analysis using Chicago as the representative case. The corresponding maps for Los Angeles and New York City are provided in the Supplementary Materials and exhibit consistent patterns. All surfaces are rendered on the 500 m analysis grid for a representative high-activity test window and are visualised with a common colour scale to permit direct comparison.
Figure 4 shows the predicted crime-risk surface produced by MSG-TCE, and Figure 5 shows the corresponding observed (ground-truth) surface for the same window. The close visual correspondence between the two surfaces, particularly in the alignment of high-risk clusters along the central and near-west districts, indicates that the model recovers the dominant spatial structure of the incidence field rather than merely its marginal intensity.
Figure 6 maps the signed prediction error (predicted minus observed) across the study area. The residual surface is close to zero across most cells. The largest absolute residuals are concentrated in a small number of high-volume central business district cells and in categories with irregular reporting, which is consistent with the error-analysis discussion in Section 5.2.
Figure 7 overlays the predicted and observed top-decile hotspot cells to assess operational concordance, and Figure 8 maps the cell-wise difference in P@20 between MSG-TCE and the strongest baseline (HyperST-Net), highlighting where the proposed framework yields the greatest improvement.
Taken together, the spatial diagnostics indicate that the gains reported in the aggregate metrics are not driven by a few dominant cells, but are distributed across the urban fabric, including peripheral and transitional districts, where conventional models tend to underperform, a pattern also noted in suburban-growth studies of changing spatial structure [5]. The residual and difference maps provide an interpretable basis for the discussion of operational deployment in Section 6.
5.6. Robustness Analysis
To assess the stability of the framework under adverse conditions, we evaluate MSG-TCE against the strongest baseline (HyperST-Net) under three stress tests: (i) reduced training data, retaining only the earliest 40% and 20% of the training period; (ii) a finer 250 m spatial grid, which substantially increases the number of nodes and the sparsity of the incidence field; and (iii) simulated missing data, in which 5% and 10% of test cells are randomly masked at inference time. All other settings follow Section 5.1.
Table 5 summarises the results for the Chicago 1 h horizon.
MSG-TCE shows lower degradation than HyperST-Net across all robustness scenarios, retaining a larger share of its reference accuracy as training data are reduced and as cells are masked. The advantage is most pronounced with reduced training data and finer spatial resolution, indicating greater stability in sparse and noisy conditions (Table 5). As a complementary diagnostic, we compute Moran’s I of the prediction residuals to test whether errors retain unmodelled spatial autocorrelation: a value closer to zero indicates that the model has more completely captured the spatial structure of the incidence field. We expect the residuals of MSG-TCE to yield a Moran’s I closer to zero than those of HyperST-Net, indicating less residual spatial dependence and a better-specified spatial model.
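The residual diagnostic follows the standard global Moran's I statistic, I = (n / S0) · (zᵀWz) / (zᵀz), where z are the mean-centred residuals and S0 is the sum of all spatial weights. The sketch below is a minimal NumPy version; the function name `morans_i` and the use of a dense weight matrix are illustrative assumptions.

```python
import numpy as np

def morans_i(values, W):
    """Global Moran's I of `values` under spatial weight matrix `W`."""
    x = np.asarray(values, float)
    W = np.asarray(W, float)
    z = x - x.mean()          # mean-centred residuals
    s0 = W.sum()              # total weight
    return float(len(x) / s0 * (z @ W @ z) / (z @ z))
```

Positive values indicate that neighbouring cells carry similar residuals (unmodelled spatial structure), values near zero indicate spatially unstructured residuals, and negative values indicate alternating residuals.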
5.7. Covariate-Augmented Variant (MSG-TCE+Cov)
The architecture described in Section 3.1 accommodates auxiliary node features in addition to historical crime counts. To probe the value of contextual covariates, we evaluate an exploratory covariate-augmented variant, MSG-TCE+Cov, on the Chicago dataset using four cell-level covariates derived from open sources: tract poverty rate (American Community Survey), population density, land-use entropy, and proximity to transit nodes. These covariates encode established environmental and socioeconomic correlates of crime concentration and are appended to the node feature vectors without any other change to the framework.
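Of the four covariates, land-use entropy is the one whose computation is least self-evident. A common formulation is the Shannon entropy of the land-use area shares within a cell, optionally normalised by the logarithm of the number of classes so that it lies in [0, 1]. The text does not specify which variant was used, so the sketch below, including the function name and the normalisation, is an assumption.

```python
import math

def land_use_entropy(shares, normalise=True):
    """Shannon entropy of land-use area shares within one cell.

    `shares` holds non-negative areas (or proportions) per land-use class
    and must have a positive total.
    """
    total = sum(shares)
    probs = [s / total for s in shares if s > 0]
    h = -sum(p * math.log(p) for p in probs)
    if normalise and len(shares) > 1:
        h /= math.log(len(shares))  # 1.0 means perfectly mixed land use
    return h
```

A single-use cell yields 0 and an evenly mixed cell yields 1, so the covariate measures how mixed the land use is rather than which use dominates.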
Table 6 reports the result for the Chicago 1 h horizon.
The covariate-augmented variant produces a modest improvement over the crime-count-only base model. Population density and land-use entropy provide the strongest individual gains, while the combined MSG-TCE+Cov configuration achieves the best overall result, reducing RMSE from 0.31 to 0.28 and improving P@20 from 0.79 to 0.83 (Table 6). Preliminary experiments indicate that incorporating the full covariate set yields a modest improvement over the count-only base model, on the order of approximately 3–5% in RMSE and 2–4% in P@20 at the 1 h horizon, with population density and land-use entropy contributing the largest individual gains. We report this variant as exploratory rather than as the primary model: covariate availability and definitions are not harmonised across the three cities, so a full multi-city covariate integration is left to future work (Section 6). The result nonetheless suggests that the framework can absorb contextual information productively where such data are available.
5.8. Computational Complexity and Efficiency
Because the architecture couples several modules, we state its computational cost explicitly. The HRTE uses dilated residual temporal convolutions with cost $O(T d^2 L)$ for L layers and embedding dimension d, i.e., linear in the historical window length T, in contrast to the quadratic cost of full-sequence attention. The PTE applies self-attention only within the bounded historical window M, giving $O(M^2 d)$ with M small relative to the full series. The HSP performs message passing over the sparse spatial graph at $O(|E| d)$ for |E| edges, plus exponential- and logarithmic-map operations at $O(N d)$ for N nodes. Because the queen-contiguity graph is sparse ($|E| = O(N)$), this scales near linearly in the number of cells. The gated fusion adds $O(N d^2)$. The overall per-step complexity is therefore dominated by sparse spatial message passing and the linear projections, and is close to linear in both N and T rather than quadratic.
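Summing the per-module terms stated above gives a compact expression for the per-step cost. The simplification in the second line uses only the sparsity condition |E| = O(N) from the text; treating the terms as simply additive is an assumption of this sketch.

```latex
\begin{align}
C_{\text{step}}
  &= \underbrace{O(T d^{2} L)}_{\text{HRTE}}
   + \underbrace{O(M^{2} d)}_{\text{PTE}}
   + \underbrace{O(|E|\, d) + O(N d)}_{\text{HSP}}
   + \underbrace{O(N d^{2})}_{\text{fusion}} \\
  &= O\!\left(T d^{2} L + M^{2} d + N d^{2}\right)
   \quad \text{since } |E| = O(N) \text{ and } N d \le N d^{2}.
\end{align}
```

The only quadratic term in a sequence length is the PTE term in M, which is bounded by the fixed historical window rather than by the length of the full series.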
To quantify the practical overhead, we compare the computational footprint of MSG-TCE with that of the strongest baseline, HyperST-Net. Despite its multi-branch design, every module of MSG-TCE operates on compact embeddings and a sparse adjacency, so its trainable-parameter count, mean per-epoch training time, single-sample inference latency, and peak GPU memory (measured on the hardware described in Section 5.1 and averaged over five runs) remain within the same order of magnitude as HyperST-Net. The additional cost purchases the hierarchical and periodic representations that the ablation in Section 5.3 shows to be individually beneficial.
We also address the concern that the reported gains could reflect a high-capacity model memorising dataset-specific noise rather than learning generalisable structure. Several safeguards are already in place: dropout and early stopping (patience of ten epochs on validation RMSE), Bayesian hyperparameter selection on a held-out validation set, a strictly chronological 60–20–20 split that prevents temporal leakage, and results averaged over five random seeds with reported dispersion (
Section 5.1). The robustness analysis in
Section 5.6 offers direct evidence against pure noise memorisation: a model that merely overfits noise would collapse when the training period is reduced to 40% and 20% or when the grid is refined to 250 m, whereas MSG-TCE degrades gracefully under both, and the reduced residual Moran’s I indicates that systematic spatial structure—not idiosyncratic noise—is being captured. We note transparently, however, that a fully capacity-controlled comparison (matching every baseline to an identical parameter budget) and a formal FLOP-matched study were beyond the present scope. The analysis above should therefore be read as evidence that the architectural overhead is bounded and justified, not as a capacity-controlled proof, and the latter is identified as future work in
Section 6.
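Two of the safeguards listed above, the chronological 60–20–20 split and early stopping with a patience of ten epochs on validation RMSE, can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the training code used in the paper: the names `chronological_split` and `EarlyStopping` are hypothetical, and the split fractions are expressed as integer percentages to avoid floating-point rounding at the boundaries.

```python
from typing import Sequence, Tuple


def chronological_split(
    series: Sequence, train_pct: int = 60, val_pct: int = 20
) -> Tuple[Sequence, Sequence, Sequence]:
    """Cut a time-ordered sequence into contiguous train/val/test blocks.

    No shuffling is applied, so every validation and test observation
    lies strictly after the training period.
    """
    n = len(series)
    train_end = n * train_pct // 100
    val_end = n * (train_pct + val_pct) // 100
    return series[:train_end], series[train_end:val_end], series[val_end:]


class EarlyStopping:
    """Signal a stop once validation RMSE has not improved for `patience` epochs."""

    def __init__(self, patience: int = 10) -> None:
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_rmse: float) -> bool:
        """Record one epoch's validation RMSE; return True when training should stop."""
        if val_rmse < self.best:
            self.best = val_rmse
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

In a training loop, `step` would be called once per epoch with the current validation RMSE, and the loop would exit as soon as it returns `True`.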
5.9. Interpretability and Explainability Analysis
The claim that the HSP ‘better represents’ the hierarchical structure of crime requires evidence of what that hierarchy means in the physical city and of whether the model actually learns it. Concretely, the hierarchy is the nested containment of crime concentration across scales: citywide patterns contain district-level patterns, which contain neighbourhood- or community-area clusters, which in turn contain street-segment micro-hotspots. Hyperbolic geometry is suited to this structure because volume grows exponentially with radius, so tree-like containment can be embedded with low distortion [
8,
23], unlike Euclidean space, in which nested relations are compressed. To make this property inspectable rather than merely asserted, we add two interpretability diagnostics.
First, we project the learned HSP node embeddings into the two-dimensional Poincaré disk and overlay them with independent spatial partitions—Chicago community areas, police districts, and dominant land-use class (
Figure S2 in the
Supplementary Materials). If the discovered hierarchy is geographically meaningful, the embedding should organise radially, with high-activity central business-district cells near the disk centre and quieter peripheral residential cells near the boundary, and should cluster by community area and land use rather than scatter at random. Visual concordance between embedding position and these administrative and functional partitions provides evidence that the model has recovered genuine urban spatial organisation rather than an arbitrary statistical artefact.
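The radial expectation described above could also be checked numerically alongside the visual overlay. The following sketch is one possible companion check, not an analysis reported in this paper: it assumes two-dimensional embeddings in the unit Poincaré disk, the function names are hypothetical, and the data in the usage example are synthetic. It computes each cell's hyperbolic distance from the disk centre and the Spearman rank correlation between that radius and the cell's activity; a strongly negative value would be consistent with high-activity cells lying near the centre.

```python
import math
from typing import List, Sequence


def hyperbolic_radius(point: Sequence[float]) -> float:
    """Hyperbolic distance from the disk centre (unit Poincaré disk, curvature -1)."""
    r = math.sqrt(sum(x * x for x in point))
    return 2.0 * math.atanh(r)


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation, computed as Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


if __name__ == "__main__":
    # Synthetic toy values for illustration only; not results from the paper.
    embeddings = [(0.05, 0.0), (0.3, 0.1), (0.0, 0.6), (0.7, 0.5)]
    crime_counts = [120, 60, 25, 4]
    radii = [hyperbolic_radius(p) for p in embeddings]
    print(spearman(radii, crime_counts))
```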
Second, to test whether the PTE captures meaningful temporal rhythms rather than diffuse attention, we visualise its attention weights as a heat map over temporal lags and over the hour-of-day and day-of-week axes (
Figure S3 in the
Supplementary Materials). Learned periodicity appears as attention concentrated on diurnal and weekly lags—consistent with the time-of-day variation already documented in
Figure 3—whereas near-uniform attention would indicate that no periodic structure has been learned. Together, these two diagnostics give the reader a direct, visual basis on which to judge whether the spatial and temporal modules behave as claimed.
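The contrast between concentrated and near-uniform attention can be summarised by a single number. The sketch below is a hedged illustration rather than part of the reported analysis: it assumes a one-dimensional vector of non-negative attention weights over temporal lags with a positive sum, and the function name is hypothetical. It returns the Shannon entropy normalised by its maximum, so a value near 1 indicates near-uniform attention and a value near 0 indicates attention concentrated on a few lags.

```python
import math
from typing import Sequence


def normalised_entropy(weights: Sequence[float]) -> float:
    """Shannon entropy of an attention distribution, scaled to the range [0, 1]."""
    n = len(weights)
    if n < 2:
        return 0.0
    total = sum(weights)
    # Zero weights contribute nothing to the entropy and are skipped.
    probs = [w / total for w in weights if w > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(n)


if __name__ == "__main__":
    n_lags = 168  # hypothetical window: one week of hourly lags
    uniform = [1.0] * n_lags
    # Toy profile with most mass on the 24-hour and 168-hour lags.
    peaked = [0.001] * n_lags
    peaked[23] = 0.5
    peaked[167] = 0.4
    print(normalised_entropy(uniform), normalised_entropy(peaked))
```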
We present these as qualitative, post hoc interpretability diagnostics. They are intended to test the alignment of the learned representations with urban geography and known temporal structure, and they corroborate the design rationale of the HSP and PTE. They do not by themselves establish causal spatial laws. A fully quantitative explainable-AI treatment—for example, concordance statistics between embedding clusters and land-use classes, and entropy-based tests of attention concentration—is identified as future work in
Section 6.
7. Conclusions
The MSG-TCE framework integrates hierarchical temporal modelling, periodic pattern recognition, and hyperbolic spatial representation for spatiotemporal crime prediction. By addressing the multi-scale nature of criminal activity patterns and the inherent hierarchical structure of urban environments, our approach achieves consistent improvements over existing methods across multiple evaluation metrics on the three datasets considered, under controlled experimental conditions. The framework’s architectural innovations, particularly the periodic transformer encoder for capturing cyclical dependencies and the hyperbolic spatial pooler for representing geographic hierarchies, provide a principled approach to longstanding challenges in crime prediction.
These conclusions should be read together with the limitations and ethical considerations discussed in
Section 6. The improvements reported here are obtained under controlled experimental conditions on three historical datasets, and we do not claim operational superiority in deployment settings. Responsible use of MSG-TCE in practice would require independent fairness audits across demographic groups, transparent documentation of data provenance and known reporting biases, human oversight of any resource-allocation decision, and consultation with the communities affected. We therefore position MSG-TCE as a methodological contribution to spatiotemporal modelling rather than as a deployment-ready policing tool.
The experimental results show that the integrated approach improves prediction accuracy, particularly at longer time horizons, where traditional methods often fail. The ablation studies confirm that each component contributes meaningfully to overall performance, with the hyperbolic spatial representation having a particularly strong impact on model effectiveness. These technical advances are complemented by the framework’s ability to provide interpretable insights into spatiotemporal crime patterns, offering practical value for urban safety planning and resource allocation.
The limitations identified in our discussion, including computational scalability, dynamic spatial hierarchies, and ethical considerations, point to important directions for future research. Addressing these challenges will require continued innovation in both algorithmic design and interdisciplinary collaboration with domain experts in criminology and urban planning. The modular architecture of MSG-TCE provides a flexible foundation for such extensions, enabling the incorporation of fairness constraints, dynamic graph learning, and other enhancements while maintaining the core strengths of the approach.