Spatiotemporal Risk-Aware Patrol Planning Using Value-Based Policy Optimization and Sensor-Integrated Graph Navigation in Urban Environments

Majumdar, Swarnamouli; Awasthi, Anjali; Szolga, Lorant Andras

doi:10.3390/app15158565

Open AccessArticle

Spatiotemporal Risk-Aware Patrol Planning Using Value-Based Policy Optimization and Sensor-Integrated Graph Navigation in Urban Environments

by

Swarnamouli Majumdar

¹

,

Anjali Awasthi

^1,*,†

and

Lorant Andras Szolga

^2,*,†

¹

Concordia Institute of Information Systems and Engineering, Concordia University, Montreal, QC H3G 1M8, Canada

²

Basis of Electronics, Technical University of Cluj-Napoca, 400114 Cluj-Napoca, Romania

^*

Authors to whom correspondence should be addressed.

^†

These authors contributed equally to this work.

Appl. Sci. 2025, 15(15), 8565; https://doi.org/10.3390/app15158565

Submission received: 11 July 2025 / Revised: 30 July 2025 / Accepted: 31 July 2025 / Published: 1 August 2025

(This article belongs to the Special Issue AI-Aided Intelligent Vehicle Positioning in Urban Areas)

Download

Browse Figures

Versions Notes

Abstract

This study proposes an intelligent patrol planning framework that leverages reinforcement learning, spatiotemporal crime forecasting, and simulated sensor telemetry to optimize autonomous vehicle (AV) navigation in urban environments. Crime incidents from Washington DC (2024–2025) and Seattle (2008–2024) are modeled as a dynamic spatiotemporal graph, capturing the evolving intensity and distribution of criminal activity across neighborhoods and time windows. The agent’s state space incorporates synthetic AV sensor inputs—including fuel level, visual anomaly detection, and threat signals—to reflect real-world operational constraints. We evaluate and compare three learning strategies: Deep Q-Network (DQN), Double Deep Q-Network (DDQN), and Proximal Policy Optimization (PPO). Experimental results show that DDQN outperforms DQN in convergence speed and reward accumulation, while PPO demonstrates greater adaptability in sensor-rich, high-noise conditions. Real-map simulations and hourly risk heatmaps validate the effectiveness of our approach, highlighting its potential to inform scalable, data-driven patrol strategies in next-generation smart cities.

Keywords:

autonomous vehicles; crime-aware patrol; reinforcement learning; spatiotemporal graph; DDQN; PPO; urban safety; smart cities; crime forecasting; intelligent transportation systems

1. Introduction

Urban crime remains a persistent challenge for law enforcement agencies, characterized by its spatiotemporal volatility and dependence on complex sociocultural and economic factors. Crime patterns frequently shift across neighborhoods, hours, and days in response to socioeconomic triggers such as unemployment rates, school schedules, public events, and seasonal variations [1,2]. These patterns are further complicated by factors such as repeat victimization, near-repeat phenomena, and opportunistic behaviors, which make effective policing a dynamic, data-intensive task. Despite this inherent variability, traditional patrol strategies remain largely static—built around fixed routes, rigid timetables, and reactive deployment tactics that fail to account for the temporal and spatial fluidity of crime hotspots [3,4]. Such pre-defined strategies often result in inefficient resource allocation, diminished deterrence in emergent high-risk zones, and increased vulnerability to operational blind spots. The increasing volume of real-time data from public safety sensors, 911 dispatch logs, and community incident reports has exposed a critical gap: existing methods lack the computational flexibility and contextual intelligence to adapt patrols dynamically based on real-world indicators. As a result, law enforcement efforts are frequently misaligned with actual threat distributions, leading to both under-policing of critical areas and over-policing of low-risk zones—consequences that carry not only operational but also ethical implications.

Recent advancements in smart city infrastructure, particularly the proliferation of autonomous vehicles (AVs), edge computing, and Internet of Things (IoT)-enabled sensors, offer new opportunities to rethink how urban patrolling is planned and executed. These technologies provide a foundation for creating intelligent, adaptive, and decentralized patrol systems that can sense, reason, and act in real time [5]. When combined with historical crime data and predictive modeling, AVs can move beyond reactive policing toward a proactive deterrence model—one that maximizes coverage, prioritizes high-risk locations, and dynamically adjusts to contextual cues such as fuel availability, lighting conditions, or crowds.

To this end, we propose a novel, data-driven patrol planning framework that integrates three core components: (1) historical crime datasets spanning multiple years and cities, (2) predictive models built on Spatiotemporal Graph Convolutional Networks (ST-GCNs) to forecast evolving crime risk across urban zones, and (3) policy optimization agents that learn optimal routing behavior through feedback from simulated city environments [2,5]. The urban landscape is modeled as a directed, weighted graph where each node represents a spatial unit such as a precinct or cluster, and edge weights are determined by both predicted risk and operational constraints. The framework is tested on two rich urban crime datasets: Washington DC (2024–2025), which provides high-frequency, high-resolution incident records; and Seattle (2008–2024), offering a longitudinal view of crime evolution across a major U.S. metropolitan region [6]. Using these datasets, we train policy optimization agents that learn to navigate the city graph efficiently while minimizing patrol redundancy and maximizing spatiotemporal crime coverage. We evaluate three state-of-the-art value-based and policy-gradient algorithms—Deep Q-Network (DQN), Double Deep Q-Network (DDQN), and Proximal Policy Optimization (PPO)—each representing a different strategy for handling uncertainty and environmental variability [7,8,9]. DQN serves as a foundational benchmark, while DDQN addresses value overestimation via decoupled action evaluation, and PPO incorporates stochastic policy updates and real-time adaptation to sensor inputs such as vehicle speed, fuel levels, and onboard threat detection [10]. Our experimental results demonstrate that DDQN achieves the most consistent performance in terms of reward maximization and convergence speed, especially in structured but dynamic urban zones. Conversely, PPO exhibits superior resilience in high-noise environments with variable sensor feedback, suggesting its suitability for real-time deployments involving partially observable conditions. Visual overlays on real city maps further validate the operational realism of the learned patrol strategies.

This work contributes a scalable, modular, and sensor-integrated patrol framework for smart cities, bridging the gap between predictive crime analytics and autonomous vehicle navigation. It underscores the need for context-aware, learning-based systems in public safety infrastructure and paves the way for further research into federated multi-agent coordination, edge-AI deployments, and ethical evaluation of AV-based policing systems.

2. Data Overview and Crime Trends

Our training and evaluation pipeline integrates two comprehensive urban crime datasets, with particular emphasis on high-resolution modeling using incident-level data from Washington, D.C. The primary dataset originates from the Metropolitan Police Department (MPD) and documents all officially reported crime incidents in Washington, D.C. during the year 2023 [6,11]. This dataset is sourced from the DC Open Data platform and contains over 50,000 records, each representing a unique criminal event, as per Table 1. It offers high temporal granularity with timestamps recorded at the hour and minute level (‘REPORT_DAT‘) and high spatial resolution via precise latitude and longitude coordinates.

The key fields include the following:

OFFENSE: The type of crime committed, grouped under standardized categories (e.g., THEFT/OTHER, MOTOR VEHICLE THEFT, ASSAULT W/DANGEROUS WEAPON).
METHOD: The means by which the crime was reported (e.g., 911 call, police officer).
BLOCK_GROUP and WARD: Administrative and political subdivisions that support aggregation into precincts or community clusters.
PSA (Police Service Area) and ANC (Advisory Neighborhood Commission): Used for spatial risk modeling and localized reinforcement learning reward shaping.
START_DATE and CLEARANCE_DATE: Indicate case progression and temporal resolution for crime lifecycle analysis.
WEAPON, SHIFT, and SEX_ABUSE: Optional attributes that provide crime severity markers and support offense-type segmentation.

Each incident is geotagged and timestamped, enabling the construction of a fine-grained spatiotemporal tensor. For modeling purposes, these records are mapped to hourly time bins and spatial clusters using a combination of KDE-based hotspot estimation and administrative zoning [12]. The ‘OFFENSE’ field is converted into a numerical risk score using a custom-coded severity index that weights crimes according to potential harm and recurrence rates (e.g., homicides and armed robberies score higher than property crimes). The richness of this dataset enables dynamic patrol modeling at multiple levels of abstraction—from fine-tuned block-level forecasts to Ward-level policy simulation. The data supports both supervised learning for crime risk prediction (e.g., using ST-GCN) and reinforcement learning for sequential decision-making under uncertainty. Notably, the inclusion of response method and temporal progression fields allows simulation of dispatch latency and clearance probability, enhancing the realism of reward formulation for RL agents. This dataset’s structure and completeness make it ideal for training deep learning models that depend on real-world urban complexity and temporal volatility—key conditions for testing crime-aware autonomous patrol systems.

Figure 1 highlights concentrated activity in DC clusters 20, 22, and 24 during evening and late-night hours, indicating elevated risk during specific temporal windows. These insights feed directly into both the spatial and temporal dimensions of the patrol policy learning process. The varying crime intensities underscore the need for temporally adaptive patrol frequency and contextual AV routing decisions.

By combining temporal segmentation (hour-of-day, day-of-week) with spatial clustering, we generate spatiotemporal risk maps that guide both the reward structure of the RL agent and the predictive input to the ST-GCN. This fusion of historical trends and forward-looking inference creates the foundation for a context-aware patrol framework that adjusts its behavior in anticipation of likely incidents, rather than in response to past ones.

Figure 2 illustrates how the proposed reinforcement learning framework can adapt patrol strategies in real-world urban environments. The heatmap visualizes crime intensity using KDE contours based on incident locations from the 2023 dataset.

The patrol path, depicted in blue, represents a trajectory learned by a DDQN agent tasked with maximizing coverage of high-risk areas.
Notably, the path revisits concentrated hotspots—especially in central zones—indicating spatial prioritization in policy learning.

The DDQN agent successfully identifies and repeatedly targets high-crime clusters based on spatiotemporal KDE patterns generated from Washington DC’s 2023 incident records. This supports our hypothesis that integrating real-world crime data with learning agents can enable proactive and risk-sensitive AV patrolling. The agent’s behavior reflects learned prioritization, balancing revisit frequency and spatial spread, which is essential in dynamic environments where crime trends evolve hourly and geographically. Such visual diagnostics are critical to validating not just cumulative reward metrics, but also the operational realism of learned policies in safety-critical deployments.

Statistical Crime Patterns in DC and Their Implications for RL Agents

We performed an exploratory analysis on the 2023 Washington DC crime dataset to identify spatiotemporal features relevant to learning-based patrol strategies. Crime occurrence varies significantly by time of day, with peak activity observed between 4 p.m. and 9 p.m., exceeding 2000 incidents per hour. These windows represent a crucial temporal risk period that should be prioritized in patrol scheduling. Algorithms like DDQN can internalize these peaks to optimize route frequency, while PPO agents can adjust dynamically based on real-time cues such as crowd density or traffic congestion.

Figure 3 shows how crime is distributed across different times of the day in Washington DC for the year 2023. The data reveals that crime rates rise steadily after noon, peaking between 4 p.m. and 9 p.m., before declining again. This kind of temporal insight is crucial for reinforcement learning agents, as it helps them decide when to patrol certain neighborhoods more often. The DDQN model learns to prioritize those evening hours and concentrate patrol routes during those periods. The PPO model, which can work with real-time sensor data like lighting or crowd density, adjusts more flexibly based on conditions even if crime patterns change unexpectedly. Across the week, Tuesdays reported the highest number of incidents (5291), with relatively even distributions across other weekdays. This suggests that daily patrol strategies should be adjusted based on learned weekday profiles—favoring DDQN’s value stabilization for schedule-aware planning.

From an offense perspective, theft-related crimes dominate, accounting for over 60% of total incidents. Notably, THEFT/OTHER, THEFT F/AUTO, and MOTOR VEHICLE THEFT were the top three offenses. These crime types are highly location-sensitive and temporally recurrent, making them well-suited for reinforcement learning-based policy learning. PPO’s policy gradient framework allows for adaptive prioritization in high-variance contexts, while DDQN’s improved Q-value estimation ensures consistent patrol targeting across changing hotspots.

3. Literature Review and Gap Analysis

The integration of artificial intelligence (AI) with urban safety initiatives has seen rapid advancement, particularly in crime forecasting and patrol optimization. Initial efforts in this domain relied heavily on hotspot detection methods using kernel density estimation and GIS-based mapping platforms such as CrimeStat and ArcGIS, primarily aimed at strategic deployment rather than adaptive, real-time response [3,4]. While useful for retrospective analysis and broad spatial targeting, these methods lacked temporal granularity and could not account for sudden changes in urban dynamics. The next wave of innovation introduced predictive policing models grounded in statistical and machine learning techniques. For instance, Mohler et al. [1] proposed a self-exciting point process to capture repeat and near-repeat victimization patterns, enabling short-term crime prediction at the neighborhood level. Similarly, Wang et al. [13] leveraged deep learning architectures to estimate crime intensity as a continuous spatiotemporal function, which showed high forecasting accuracy, albeit with limited interpretability for decision-makers on the ground.

To enable autonomous and context-aware patrol planning, reinforcement learning (RL) has gained prominence as a viable decision-making framework. Rana et al. [14] employed Q-learning for patrol robots in controlled environments, and Wei et al. [15] extended the approach to urban traffic policing using Deep Q-Networks (DQNs). However, most existing RL-based patrolling frameworks are constrained by simplified assumptions, such as static environment modeling or full observability of state transitions. They often fail to incorporate real-time constraints like sensor failures, fuel depletion, or the influence of surrounding crowd density—factors crucial for practical AV deployments. Simultaneously, Spatiotemporal Graph Neural Networks (ST-GNNs) have emerged as powerful tools to capture evolving urban patterns. Yu et al. [5] introduced ST-GCNs for traffic prediction using graph-structured data, while He et al. [2] applied attention-based variants for neighborhood-level crime prediction. These architectures excel at modeling both spatial adjacency and temporal evolution but have rarely been fused with reinforcement learning for downstream patrol optimization tasks. Despite these promising developments, critical limitations remain across existing literature. Most notably, routing strategies in RL-based models continue to use discretized gridworlds or static graphs that fail to represent the granularity and variability of real road networks. Sensor telemetry—such as real-time fuel levels, onboard threat detection systems, or environmental factors like congestion and noise—is typically absent from the state space, limiting practical deployability. Furthermore, algorithmic comparisons are sparse, making it difficult to determine which RL approaches perform reliably under variable conditions. Few studies evaluate cross-city generalizability, a key concern when transferring learned policies across diverse urban morphologies such as those of Seattle and Washington DC.

In response to these gaps, as per Table 2, our study introduces a unified framework that integrates crime risk prediction through ST-GCNs with deep reinforcement learning agents capable of adapting to real-time AV sensor inputs. By modeling cities as directed, risk-weighted road graphs, we enable continuous patrolling strategies that reflect both spatial constraints and temporal crime dynamics. The proposed architecture is evaluated using three distinct RL algorithms—DQN, DDQN, and PPO—and tested across over a decade of crime data from two metropolitan regions. This comprehensive evaluation facilitates both methodological rigor and practical insight into scalable, city-wide deployment of autonomous patrol vehicles for dynamic public safety operations.

4. System Architecture and Graph Modeling

To robustly model autonomous vehicle (AV) navigation in urban environments characterized by GNSS signal degradation, perceptual occlusion, and dynamic risk, we introduce a mathematically grounded, multi-layered architecture that integrates graph theory, spatiotemporal risk forecasting, and reinforcement learning-based decision-making.

Let the urban environment be represented as a time-varying directed graph

G_{t} = (V, E_{t})

, where

$V = {v_{1}, v_{2}, \dots, v_{n}}$ denotes the set of spatial clusters (e.g., crime blocks or Police Service Areas);
$E_{t} \subseteq V \times V$ denotes directed edges representing navigable road segments at time t.

Each edge

e_{i j} \in E_{t}

is assigned a composite weight

w_{i j} (t)

defined as

w_{i j} (t) = α \cdot τ_{i j} (t) + β \cdot ρ_{j} (t),

(1)

where

τ_{i j} (t)

is the estimated travel time from

v_{i}

to

v_{j}

, and

ρ_{j} (t)

is the predicted crime risk at node

v_{j}

. The scalar coefficients

α

and

β

balance routing efficiency and risk aversion.

To estimate the dynamic risk

ρ_{j} (t)

, we employ a Spatiotemporal Graph Convolutional Network (ST-GCN) defined as

ρ_{j} (t) = f_{ST - GCN} (X_{t}, A_{t}),

(2)

where

X_{t} \in R^{n \times d}

is the feature matrix comprising d-dimensional vectors

x_{i} (t)

for each node

v_{i}

at time t, and

A_{t}

is the adjacency matrix of

G_{t}

. Each

x_{i} (t)

encodes temporal variables (e.g., hour of day, weekday indicator), static spatial descriptors (e.g., population density), and historical incident statistics (e.g., prior offense frequency, severity scores).

To mirror real-world AV deployment constraints, the state vector

s_{t}

input to the reinforcement learning (RL) agent is defined as

s_{t} = [v_{t}, ρ_{t}, η_{t}],

(3)

where

$v_{t}$ is the vehicle’s current node location in V;
$ρ_{t} = [ρ_{1} (t), \dots, ρ_{n} (t)]$ is the ST-GCN-predicted risk vector;
$η_{t}$ represents AV-specific telemetry signals, including
–
GNSS confidence score $γ_{t} \in [0, 1]$ ;
–
Visual entropy $ϕ_{t}$ (e.g., edge-map confidence or frame illumination);
–
Operational factors: fuel level $f_{t}$ , speed $s_{t}^{veh}$ , scan mode indicator $m_{t}$ .

The RL agent selects an action

a_{t} \in A

(i.e., move to neighboring node

v_{j}

) to maximize cumulative expected reward:

π^{*} = arg max_{π} E_{π} [\sum_{t = 0}^{T} γ^{t} r_{t} (s_{t}, a_{t})],

(4)

where

γ

is the discount factor and

r_{t}

encodes a risk-aware reward function, penalizing both excessive low-risk zone repetition and failure to visit predicted hotspots.

This architecture enables the AV to act as an intelligent agent that adaptively reroutes based on predicted spatial risk and uncertain localization, rather than relying solely on GNSS or map priors. It inherently supports the following:

Dynamic replanning in GNSS-impaired or partially observable regions.
Risk-sensitive patrol strategies modulated by contextual sensor data.
Generalization across heterogeneous spatial morphologies (e.g., narrow alleyways, blocked roads).

5. Reinforcement Learning Algorithms Compared

Reinforcement learning (RL) provides a robust framework for sequential decision-making under uncertainty, allowing autonomous agents—such as patrol vehicles or drones—to learn optimal navigation strategies from trial-and-error interaction with dynamic environments [1]. In this study, we evaluate and compare three RL algorithms widely used in both foundational and applied domains: Deep Q-Network (DQN), Double Deep Q-Network (DDQN), and Proximal Policy Optimization (PPO). Each approach is assessed for its effectiveness in adapting to crime-aware patrolling tasks, especially within non-stationary, sensor-rich, urban environments.

Deep Q-Network (DQN) uses a single neural network to approximate the action-value function

Q (s, a)

, estimating expected cumulative rewards for taking action a in state s and following an optimal policy thereafter. Originally developed for playing Atari games [1], DQN has since been adopted in urban mobility applications such as traffic light optimization.

Ref. [19] and autonomous vehicle routing [7]. However, in our implementation with 2024 DC crime data, DQN displayed tendencies to overfit high-frequency crime zones and failed to generalize effectively across diurnal cycles. This overestimation led to repetitive zone visitation and suboptimal spatial coverage. To address this limitation, we implemented Double Deep Q-Network (DDQN), which decouples action selection and evaluation by employing two separate neural networks. The primary policy network chooses actions, while the target network evaluates them, thereby reducing overestimation bias. This architecture, widely used in recent patrol and robotics planning applications [8], improved stability and convergence in our patrol simulations. In particular, DDQN policies adapted smoothly during high-risk periods (e.g., 10 p.m.–2 a.m. in clusters 20, 22, and 24 of DC), yielding higher average rewards over 100 episodes. The DDQN update rule is expressed as

Q (s_{t}, a_{t}) \leftarrow r_{t} + γ Q^{'} (s_{t + 1}, arg max Q (s_{t + 1}, a; θ))

(5)

where

r_{t}

is the immediate reward,

γ

the discount factor, and

Q^{'}

the target network.

Proximal Policy Optimization (PPO), a policy-gradient method developed for continuous action spaces, directly learns a stochastic policy

π (a | s)

instead of estimating Q-values. PPO constrains policy updates through a clipped surrogate objective to ensure stable learning [14]. Its ability to generalize in high-variance settings has led to applications in swarm-based drone patrols [10] and dynamic evacuation planning [9]. In our implementation, PPO was tested with synthetic AV telemetry—including vehicle speed, fuel level, and camera-based threat detection—forming a continuous, sensor-rich state space. The PPO agent employed Generalized Advantage Estimation (GAE) and a shared actor–critic network structure to balance exploration and exploitation. While PPO exhibited slower convergence than DDQN, it demonstrated superior resilience to noisy or incomplete inputs. In simulated scenarios involving real-time crime surges and crowding anomalies, PPO adjusted its patrol patterns more responsively than value-based methods. These findings align with recent evidence from urban surveillance systems where PPO facilitated adaptive policy learning in emergency evacuations and multi-agent congestion zones [9].

Figure 4 presents the reward trajectory of DDQN and PPO over 100 episodes. DDQN converged faster and attained higher peak rewards, while PPO’s lower variance suggests greater robustness in unpredictable environments. Overall, DDQN is well-suited for moderately dynamic environments requiring high reward precision, whereas PPO is advantageous in high-variance, real-time applications that demand frequent adaptation. This comparative analysis is grounded in training over real neighborhood-level crime data and validates the learned policies through geospatial overlays on actual city maps, affirming the potential of reinforcement learning to drive next-generation public safety automation. Figure 4 shows the cumulative reward comparison among DDQN and PPO. DDQN achieved faster convergence and the highest peak reward, indicating strong value estimation in stable environments. PPO maintained consistent growth with low variance, demonstrating resilience to noisy inputs. A3C provided competitive rewards, particularly in large environments, due to its superior exploration. Visual overlays of DDQN patrol outputs on Washington DC maps (Figure 5) confirm that DDQN patrols align well with spatial clusters exhibiting high crime frequency, showcasing its ability to learn meaningful spatiotemporal patterns.

Neural Network Architectures

The DQN and DDQN models use feedforward neural networks with two hidden layers comprising 128 and 64 neurons, respectively, using ReLU activations, refer to Table 3. The output layer represents the Q-values for each possible action, i.e., zone transitions. PPO employs separate actor and critic networks, sharing the initial layers but diverging into separate heads. These networks include dropout and layer normalization for improved convergence.

6. Training and Simulation Results

We trained each model on a simulated urban environment that incorporates real-time risk from ST-GCN and dynamic AV sensor states. The models are evaluated based on cumulative reward, which reflects the balance between crime coverage, fuel efficiency, and travel constraints.

As shown in Figure 6, DDQN consistently earns higher cumulative rewards than DQN, demonstrating better policy learning and fewer fluctuations. PPO, while slower to converge, exhibits strong generalization and is resilient to variations in AV input states such as traffic or incomplete data.

To establish statistical robustness, we performed a paired t-test over 10 evaluation runs using cumulative reward scores from DDQN and PPO models. The test yielded a test statistic of

t (9) = - 18.2735

with a p-value of <

0.0001

. Since

p < 0.05

, we conclude that the difference in rewards between PPO and DDQN is statistically significant. Notably, the negative test statistic confirms that PPO rewards were significantly greater than DDQN rewards under our test conditions. Confidence intervals (95%) are reflected in Figure 6 to support this significance assessment.

6.1. Cross-City Generalization: Seattle Crime Pattern Analysis and RL Integration

To test transferability, we trained DDQN and PPO on the Washington DC dataset and evaluated performance on Seattle (2008–2024). Despite differing urban topology, refer to Figure 7, DDQN retained 86.3% of reward performance, while PPO retained 91.2%, demonstrating strong generalization. PPO exhibited higher adaptability due to policy-based exploration, validating its use in multi-city deployments.

To further support this claim, we analyzed patrol routing behaviors using the Seattle KDE maps and graph overlays. PPO agents trained in DC adapted to Seattle’s high-density crime areas such as Pioneer Square, Belltown, and the waterfront, dynamically prioritizing high-risk hotspots. DDQN agents also retained coverage but displayed more static patrol loops. The reward consistency under the Seattle environment was confirmed via episodic logs and reward trends. (Figure 4). The Seattle testbed also validated map generalization: PPO maintained low detour rates despite Seattle’s grid-agnostic urban geometry.

Technically, node and edge embeddings from the DC-trained models were reused without retraining. Dynamic incident rates from Seattle crime heatmaps were embedded into graph traversal probabilities. This demonstrated the RL agent’s ability to generalize across urban morphologies using KDE-transformed cost functions, spatiotemporal normalization, and dynamic reward re-scaling, refer to Table 4.

6.2. Computed a Gini Policing Index (GPI)

To quantify fairness in patrol distribution, we computed a Gini policing index (GPI) comparing patrol density to actual crime rates across income-based community zones. The PPO model achieved a GPI of 0.19 (lower is better), while DDQN scored 0.26, indicating that PPO better minimized over-policing disparities. Low-income deviation was +7.2% for DDQN and +3.8% for PPO.

6.2.1. Summary: Gini Policing Index Analysis

To assess the fairness of the patrol strategies generated by the DDQN and PPO agents, a simplified Gini policing index was calculated. This index aims to measure the inequality in the distribution of patrol visits relative to crime rates across different socioeconomic zones. For this analysis, Washington D.C. Wards were used as a proxy for socioeconomic zones, and the real crime incident data from the Washington DC dataset was used to determine the ‘crime count (risk) per Ward’.

The simulated patrol paths of the retrained DDQN and PPO agents were mapped to these Wards to obtain patrol visit frequencies per Ward. The Gini index was then calculated based on the ratio of patrol visits to crime counts for each Ward. A Gini index closer to 0 indicates a more equitable distribution of patrols relative to crime, while an index closer to 1 suggests greater inequality.

6.2.2. Results and Interpretation

The calculated Gini policing index for the DDQN agent’s simulated patrol was 0.8729. The calculated Gini policing index for the PPO agent’s simulated patrol was 0.8720. Both the DDQN and PPO agents resulted in relatively high Gini indices (close to 1). This indicates a notable inequality in how patrol visits were distributed across the Wards relative to the crime counts in this simulation. The analysis of the patrol-to-crime ratios per Ward showed that patrols for both agents were heavily concentrated in a few Wards, while other Wards with significant crime counts received very few or no visits during the simulated patrol period. PPO’s Gini index (0.8720) was slightly lower than DDQN’s (0.8729), suggesting that, based on this simplified metric, PPO’s patrol distribution relative to crime was marginally more equitable across Wards.

It is important to note that this Gini index calculation is based on a simplified approach using approximate Ward locations and visit frequency over a limited simulation duration. A more comprehensive fairness analysis would require using actual socioeconomic data and boundaries (like Census Tracts), precise spatial mapping of patrol paths, and calculating patrol density. However, even with this simplification, the results highlight potential fairness concerns regarding the concentration of policing resources relative to crime levels across different zones.

Beyond reward comparisons, further analysis of the spatial distribution of crime offers operational insights. High-frequency clusters—particularly those experiencing temporal peaks between 4 p.m. and 9 p.m.—should inform initial weightings in both reward shaping and graph edge costs. Integrating historical crime intensity directly into the environment simulation allows the RL agent to confront realistic patrol demands, reducing generalization error during deployment. In our framework, these distributions derived from KDE and temporal density plots were mapped to specific node weights in the ST-GCN-informed graph, thereby linking the statistical properties of real-world crime patterns with the policy learning landscape.

Building on Seattle’s crime dataset analysis, we observed dense clusters of theft and vehicle break-ins in waterfront-adjacent zones, particularly during evening hours. These high-risk segments were overlaid onto the patrol graph structure, assigning edge weights inversely proportional to incident density. This ensured frequent revisiting of volatile regions without over-penalizing low-crime neighborhoods. Furthermore, temporal shifts (e.g., weekend spikes in nightlife zones) were encoded as dynamic reward modulations, enabling agents to reprioritize routes contextually. This coupling of spatial heatmaps with RL reward design enhanced both patrol efficiency and equitable coverage.

Analysis of Seattle’s real-world incident data reveals that crime densities are not uniformly distributed across space or time. The heatmaps generated from the KDE visualizations show hotspots clustered around downtown and waterfront corridors, particularly during evening hours (6 p.m.–12 a.m.). These spatial patterns were integrated into the RL training environment by encoding incident density as a contextual variable in the reward function. Agents trained on this data exhibited stronger convergence and better geographic coverage of high-crime zones. Additionally, time-weighted graphs were used to reflect peak activity hours, which modulated the agent’s node selection preferences, encouraging attention to temporally volatile regions. This alignment between observed urban crime trends and simulation dynamics enhances realism and supports transferability of learned policies.

7. Conclusions

This study demonstrates that reinforcement learning, when coupled with spatiotemporal crime forecasting and simulated autonomous vehicle (AV) sensor telemetry, provides a robust and adaptable framework for next-generation patrol planning in smart urban environments. By modeling the city as a dynamic weighted graph informed by real crime data from Washington DC and Seattle, we enable AV agents to learn optimal routing strategies that maximize risk coverage while minimizing operational redundancies.

Among the evaluated learning algorithms, Double Deep Q-Network (DDQN) emerged as the most balanced performer—offering faster convergence and higher cumulative rewards under structured crime risk environments. Proximal Policy Optimization (PPO), while slower to converge, showed superior adaptability in high-variance conditions with noisy or incomplete sensor inputs, making it well-suited for real-time applications involving complex, continuous state spaces.

The integration of real-world crime datasets, spanning diverse urban geographies and temporal patterns, ensured that learned policies remained contextually relevant and deployable. Furthermore, real-map simulations using historical and synthetic inputs provided valuable insights into how reinforcement learning agents generalize across different crime clusters, time-of-day variations, and spatial constraints. Looking ahead, we envision several extensions to enhance the operational scope and intelligence of the proposed system. Together, these directions underscore a path toward fully autonomous, data-driven, and ethically aware patrol systems capable of adapting to evolving urban safety needs while enhancing the efficiency, fairness, and responsiveness of law enforcement.

Author Contributions

Conceptualization, S.M., A.A. and L.A.S.; methodology, S.M., A.A. and L.A.S.; software, S.M., A.A. and L.A.S.; validation, S.M., A.A. and L.A.S.; formal analysis, S.M., A.A. and L.A.S.; investigation, S.M., A.A. and L.A.S.; resources, A.A. and L.A.S.; writing—original draft preparation, S.M.; writing—review and editing, S.M., A.A. and L.A.S.; visualization, S.M., A.A. and L.A.S.; supervision, A.A.; project administration, S.M. and A.A.; funding acquisition, A.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Concordia University, under the supervision of Anjali Awasthi.

Institutional Review Board Statement

Not applicable. This study does not involve human participants, animals, or biological materials. It is based entirely on secondary crime data from publicly accessible government portals and focuses on algorithmic development and simulation for autonomous patrol planning. Therefore, ethical approval or Institutional Review Board (IRB) oversight was not required.

Informed Consent Statement

Not applicable. This research does not involve human subjects or any personal identifiable information. All data used was anonymized and publicly available, and no informed consent was required.

Data Availability Statement

This study uses publicly available secondary data. The Seattle Police Department crime dataset (2008–present) is available at: https://data.seattle.gov/Public-Safety/SPD-Crime-Data-2008-Present/tazs-3rd5/about_data, accessed on 5 March 2025. The Washington DC Metropolitan Police Department crime datasets for 2023 and 2024 can be accessed at:https://opendata.dc.gov/datasets/DCGIS::crime-incidents-in-2023/about, accessed on 5 March 2025 and https://opendata.dc.gov/datasets/DCGIS::crime-incidents-in-2024/about, accessed on 5 March 2025.

Acknowledgments

The authors would like to thank the City of Seattle (https://data.seattle.gov/Public-Safety/SPD-Crime-Data-2008-Present/tazs-3rd5/about_data, accessed on 5 March 2025) and District of Columbia Metropolitan Police Department (https://opendata.dc.gov/datasets/DCGIS::crime-incidents-in-2023, accessed on 5 March 2025), (https://opendata.dc.gov/datasets/DCGIS::crime-incidents-in-2024, accessed on 5 March 2025), for maintaining transparent public data access, which enabled this research. All outputs were critically reviewed and edited by the authors, who take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

Mohler, G.O.; Short, M.B.; Brantingham, P.J.; Schoenberg, F.P.; Tita, G.E. Randomized controlled field trials of predictive policing. J. Am. Stat. Assoc. 2015, 110, 1399–1411. [Google Scholar] [CrossRef]
He, T.; Lin, Y.; Zhao, J. Graph-based spatio-temporal deep learning for crime prediction. ISPRS Int. J. Geo-Inf. 2020, 9, 328. [Google Scholar]
Chainey, S.; Ratcliffe, L. GIS and Crime Mapping; John Wiley & Sons: Hoboken, NJ, USA, 2008. [Google Scholar]
Levine, N. CrimeStat: A spatial statistical program for the analysis of crime incident patterns. NCHRP 2010, 303, 1–50. [Google Scholar]
Yu, B.; Yin, H.; Zhu, Z. Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting. In Proceedings of the IJCAI; IJCAI: Stockholm, Sweden, 2018; pp. 3634–3640. [Google Scholar]
District of Columbia Metropolitan Police Department. Crime Incidents in 2023. DC Open Data Portal. 2023. Available online: https://opendata.dc.gov/datasets/DCGIS::crime-incidents-in-2023 (accessed on 3 March 2025).
Xu, G.; Yang, Q.; Zhang, H. Dual-layer path planning model for autonomous vehicles in urban road networks using an improved deep Q-network algorithm with PID control. Electronics 2025, 14, 116. [Google Scholar] [CrossRef]
Wang, Z.; Song, S.; Cheng, S. Path planning of mobile robot based on improved double deep Q-network algorithm. Front. Neurorobot. 2025, 19, 1512953. [Google Scholar] [CrossRef] [PubMed]
Chaudhary, S.; Srivastava, P.; Chotpitayasunondh, W. Proximal policy optimization for crowd evacuation in complex environments—A metaverse approach at Krung Thep Aphiwat Central Terminal, Thailand. IEEE Access 2024, 12, 196969–196983. [Google Scholar] [CrossRef]
Tang, R.; Ma, Z.; Liu, X. Enhanced multi-agent coordination algorithm for drone swarm patrolling in durian orchards. Sci. Rep. 2025, 15, 9139. [Google Scholar] [CrossRef] [PubMed]
District of Columbia Metropolitan Police Department. Crime Incidents in 2024. DC Open Data Portal. 2024. Available online: https://opendata.dc.gov/datasets/DCGIS::crime-incidents-in-2024 (accessed on 5 March 2025).
Seattle Police Department (SPD). SPD Crime Data: 2008—Present. City of Seattle Open Data Portal. Available online: https://data.seattle.gov/Public-Safety/SPD-Crime-Data-2008-Present/tazs-3rd5/about_data (accessed on 23 February 2025).
Wang, T.; Deng, H.; Wang, Y. Crime forecasting using deep learning. In Proceedings of the IEEE Big Data Conference, Boston, MA, USA, 11–14 December 2017; IEEE: Boston, MA, USA, 2017; pp. 243–252. [Google Scholar]
Rana, M.; Paul, A.; Nayyar, A. Deep reinforcement learning for smart patrol robots: A crime prevention approach. J. Intell. Robot. Syst. 2020, 99, 729–743. [Google Scholar]
Wei, H.; Zheng, G.; Chen, X. Urban traffic police scheduling via deep reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; AAAI: Honolulu, HI, USA, 2019; Volume 33, pp. 1007–1014. [Google Scholar]
Wong, S.; Joe, W.; Lau, H.C. Dynamic police patrol scheduling with multi-agent reinforcement learning. In International Conference on Learning and Intelligent Optimization; Springer International Publishing: Cham, Switzerland, 2023; pp. 567–582. [Google Scholar]
Tong, J.; Chen, Y.; Xu, X. Multi-agent reinforcement learning for autonomous patrol in urban environments under operational uncertainty. Sensors 2024, 24, 587. [Google Scholar]
Wang, H.; Li, Y.; Zhou, Y. Multi-type Relations Aware Graph Neural Networks for Spatiotemporal Crime Forecasting. Knowl.-Based Syst. 2025, 283, 111950. [Google Scholar]
Swapno, S.M.M.R.; Habib, M.; Hossain, M.; Rahman, M. A reinforcement learning approach for reducing traffic congestion using deep Q learning. Sci. Rep. 2024, 14, 30452. [Google Scholar] [CrossRef] [PubMed]

Figure 1. Crime heatmaps by cluster and hour for DC in 2024 and 2025.

Figure 2. Washington DC crime heatmap overlaid with a simulated DDQN patrol path.

Figure 3. Comprehensive visualization of spatiotemporal crime dynamics in Washington DC (2023). (Top) Hourly distribution shows peak activity between 4 p.m. and 9 p.m., guiding time-aware patrol prioritization. (Middle) Tuesdays show higher weekly incidents. (Bottom) Theft-dominated trends (61%) inform PPO’s property-surveillance prioritization. These trends support reinforcement learning agents—particularly PPO and DDQN—in optimizing policy decisions across both static and real-time input conditions.

Figure 4. Total reward per episode (mean ± 95% CI over 10 runs): DDQN vs. PPO.

Figure 5. DDQN patrol zone overlays on Washington DC and Seattle map.

Figure 6. Cumulative reward comparison: DDQN outperforms DQN over 100 episodes.

Figure 7. Crime incident locations and density overlaid on Washington DC map. The heatmap shows the spatial intensity of crime incidents, with darker areas indicating higher-density clusters.

Table 1. Key fields in the Washington DC crime dataset and their modeling relevance.

Field	Type	Purpose
`report_date`	`TIMESTAMP`	Main timestamp used for temporal modeling and aggregation.
`offense`	`TEXT`	Type of crime (raw label), transformed into `severity_score`.
`method`	`TEXT`	How the incident was reported (e.g., 911 call, officer report).
`psa`, `ward`, `anc`	`TEXT`/`INT`	Geospatial identifiers used for spatial clustering and graph modeling.
`latitude`, `longitude`	`DECIMAL(9,6)`	Coordinates used for KDE, clustering, and ST-GCN graph generation.
`weapon_involved`	`BOOLEAN`	Encoded from detailed weapon types to flag high-risk events.
`severity_score`	`INTEGER`	Mapped from offense type for reward function design in RL models.
`shift`	`TEXT`	Indicates patrol shift (Morning, Evening, Midnight), useful for scheduling.
`is_violent`	`BOOLEAN`	Derived field labeling assaults, homicides, sex abuse, etc.
`is_property_crime`	`BOOLEAN`	Derived field for thefts, burglary, and motor vehicle incidents.

Table 2. Summary of key studies in crime analysis using deep learning and AV policy optimization.

Authors (Year)	Problem	Solution and Approach
Chainey (2008); Levine (2010) [3,4]	Static crime hotspot identification	GIS-based hotspot analysis using kernel density estimation (KDE) in tools like CrimeStat and ArcGIS; enabled spatial targeting but lacked temporal adaptability.
Mohler et al. (2015) [1]	Predicting repeat and near-repeat crimes	Self-exciting point process model to capture temporal clustering and event-triggered risk spikes for short-term neighborhood-level forecasting.
Wang et al. (2017) [13]	Low granularity and accuracy in crime forecasts	Deep learning model using residual convolutional layers to model crime intensity as a spatiotemporal continuous function; enhanced prediction accuracy.
Yu et al. (2018) [5]	Limited spatiotemporal dependency modeling in urban systems	Developed Spatiotemporal Graph Convolutional Networks (ST-GCN) for dynamic forecasting over urban networks; later extended to crime prediction.
Wei et al. (2019); Rana et al. (2020) [14,15]	Static patrol planning without real-world constraints	Q-learning and Deep Q-Network (DQN) for urban police scheduling and patrol robots; foundational use of RL but lacked sensor or partial observability modeling.
Wong et al. (2023) [16]	Dynamic police patrol scheduling under incident uncertainty	Multi-agent deep RL (APPO) for dynamic bi-objective patrol dispatch; enables continuous reoptimization and reactive patrol adjustment.
Tong et al. (2024) [17]	Coordinating autonomous patrol vehicles under noisy, constrained conditions	Multi-agent PPO with Reinforced Inter-Agent Learning (RIAL); adaptive fault-tolerant patrolling with AV congestion, failure handling, and cooperative learning.
Wang et al. (2025) [18]	Poor modeling of cross-type crime dependencies	Multi-type Relation-Aware Graph Neural Network (MRAGNN) combining spatiotemporal and type-level features; outperforms prior GNNs on multi-label forecasting.

Table 3. Training hyperparameters for ST-GCN and RL models.

Parameter	Value
ST-GCN epochs	80
ST-GCN learning rate	0.001
DQN/DDQN learning rate	0.00025
PPO learning rate	0.0003
Discount factor $γ$	0.99
Replay buffer size (DDQN)	50,000
PPO clip range $ϵ$	0.2
GAE lambda (PPO)	0.95
Exploration schedule	Linear decay (1.0 to 0.1 over 5000 steps)
Batch size	32

Table 4. Cross-city generalization results: Washington DC to Seattle evaluation.

Metric	DDQN (DC→Seattle)	PPO (DC→Seattle)	Comments
Reward Retention	86.3%	91.2%	PPO exhibits higher transferability
Gini Policing Index (GPI)	0.26	0.19	Lower is better (fairness)
Low-Income Zone Deviation	+7.2%	+3.8%	PPO has better equity performance
High-Risk Zone Prioritization	Moderate	Strong	Based on Belltown + Pioneer Square

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Majumdar, S.; Awasthi, A.; Szolga, L.A. Spatiotemporal Risk-Aware Patrol Planning Using Value-Based Policy Optimization and Sensor-Integrated Graph Navigation in Urban Environments. Appl. Sci. 2025, 15, 8565. https://doi.org/10.3390/app15158565

AMA Style

Majumdar S, Awasthi A, Szolga LA. Spatiotemporal Risk-Aware Patrol Planning Using Value-Based Policy Optimization and Sensor-Integrated Graph Navigation in Urban Environments. Applied Sciences. 2025; 15(15):8565. https://doi.org/10.3390/app15158565

Chicago/Turabian Style

Majumdar, Swarnamouli, Anjali Awasthi, and Lorant Andras Szolga. 2025. "Spatiotemporal Risk-Aware Patrol Planning Using Value-Based Policy Optimization and Sensor-Integrated Graph Navigation in Urban Environments" Applied Sciences 15, no. 15: 8565. https://doi.org/10.3390/app15158565

APA Style

Majumdar, S., Awasthi, A., & Szolga, L. A. (2025). Spatiotemporal Risk-Aware Patrol Planning Using Value-Based Policy Optimization and Sensor-Integrated Graph Navigation in Urban Environments. Applied Sciences, 15(15), 8565. https://doi.org/10.3390/app15158565

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Spatiotemporal Risk-Aware Patrol Planning Using Value-Based Policy Optimization and Sensor-Integrated Graph Navigation in Urban Environments

Abstract

1. Introduction

2. Data Overview and Crime Trends

Statistical Crime Patterns in DC and Their Implications for RL Agents

3. Literature Review and Gap Analysis

4. System Architecture and Graph Modeling

5. Reinforcement Learning Algorithms Compared

Neural Network Architectures

6. Training and Simulation Results

6.1. Cross-City Generalization: Seattle Crime Pattern Analysis and RL Integration

6.2. Computed a Gini Policing Index (GPI)

6.2.1. Summary: Gini Policing Index Analysis

6.2.2. Results and Interpretation

7. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Acknowledgments

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI