A Comparative Simulation Study of the Fairness and Accuracy of Predictive Policing Systems in Baltimore City

Semsar, Samin; Prabhu, Kiran Laxmikant; Waters, Gabriella; Foulds, James

doi:10.3390/a19050398

¹ Department of Information Systems, University of Maryland, Baltimore County (UMBC), 1000 Hilltop Cir., Baltimore, MD 21250, USA

² Center for Responsible AI, Virginia State University (VSU), 21101 Barnes St, Petersburg, VA 23806, USA

* Authors to whom correspondence should be addressed.

† These authors contributed equally to this work.

Algorithms 2026, 19(5), 398; https://doi.org/10.3390/a19050398

Submission received: 13 February 2026 / Revised: 23 April 2026 / Accepted: 24 April 2026 / Published: 16 May 2026

(This article belongs to the Special Issue Algorithms for Smart Cities (3rd Edition))

Abstract

There are ongoing discussions about predictive policing systems being unfair, for example, by exhibiting racial bias. Law enforcement in some cities, such as Los Angeles, California, and Baltimore, Maryland, have initiated the integration of these systems into their decision-making processes, and some of these systems were advertised as being unbiased. However, later studies discovered that these methods could also be unfair due to feedback loops and being trained on historically biased recorded data. Comparative studies on predictive policing systems are few and insufficiently comprehensive. Crucially, the relative fairness of predictive policing methods with regard to traditional hot spot-based policing has not been established. Moreover, the relationship between fairness and accuracy is complex and requires further study. Furthermore, the case of Baltimore City, Maryland, USA, has not yet been systematically analyzed despite its relevance as an early adopter of predictive policing technologies with a fraught history of social justice concerns around policing. An improved understanding of these questions could better inform policy decisions around predictive policing technologies both in Baltimore and beyond. Therefore, in this work we perform a comprehensive comparative simulation study on the fairness and accuracy of predictive policing technologies in Baltimore. Our results suggest that the situation around bias in predictive policing is more complex than previously assumed. While we find that predictive policing exhibits bias due to feedback loops, as previously reported, we also find traditional hot spot-based policing to have similar issues. Although predictive policing is found to be more fair and accurate than hot spot policing in the short term, it also amplifies bias more quickly, suggesting the potential for worse long-run behavior. 
In Baltimore, the bias in these systems tended toward over-policing White neighborhoods in some cases, unlike in previous studies. However, when the analysis was restricted to some specific crime types, this tendency differed. Overall, this work demonstrates a methodology for city-specific evaluation and compares behavioral tendencies of predictive policing systems, showing how such simulations can reveal inequities and long-term tendencies. We recommend that authorities and community stakeholders use simulation methodologies to assist in collaboratively navigating the complexities around fairness in predictive policing.

Keywords:

predictive policing; hot spots policing; AI fairness; PredPol; kernel density estimation; artificial intelligence; agent-based modeling; simulation studies

1. Introduction

Advancements in AI have led to the development of predictive policing systems with the potential for higher accuracy in predicting crime [1]. Predictive policing has been shown to provide improvements over simpler prior approaches [2], referred to as hot spots or hotspot policing [3,4], a strategy that uses data and simple statistical models to find areas with high crime rates and allocates officers accordingly. This strategy is based on the assumption that crime distribution is affected by environmental factors such as a neighborhood not being patrolled enough or presenting profitable burglary targets, as well as by successful crime attempts on the part of the criminals; that is, there is a higher chance of crime happening where it has been previously observed [5]. These simple models have evolved into more sophisticated predictive policing models that can take into account more factors, including changes in crime distribution over time, allowing them to predict future crime distributions [6]. These predictive models promise higher accuracy in identifying future crime locations [4,7,8].

If police resources are allocated based on these more accurate predictions, one might also expect that the resulting crime distribution would better align with the underlying crime patterns, and consequently produce more proportionate allocations across neighborhoods. However, theoretical and simulation studies have suggested that feedback loops between police allocations and recorded crimes may amplify racial and regional disparities over time [9,10,11]. These competing perspectives suggest that predictive policing may promise higher accuracy while simultaneously posing risks for the development of long-term bias. Understanding how these dynamics unfold remains an important question. Most existing studies have examined these dynamics in relatively simplified settings, often focusing on a single model, limited fairness metrics (either racial or regional), or synthetic data [10,11,12]. The relative fairness and accuracy of predictive policing methods compared to traditional hot spot policing approaches remain insufficiently understood in realistic urban contexts. In particular, prior work has not examined how different policing algorithms compare when evaluated simultaneously across multiple fairness metrics and over extended operational periods. Existing studies typically evaluate a single algorithm or focus on a small number of performance indicators, and none have had the goal of demonstrating a comparative, localized, and multi-metric simulation example. To address these gaps, we propose a comparative simulation framework for evaluating policing systems that jointly analyzes fairness and accuracy across multiple metrics over time using real-world data. Our comparative study provides a basis upon which these models’ long-term behavioral tendencies can be evaluated against one another using real crime data. Our interpretation of the results provides insight into the complications of using real crime data and into how the observations can be explained.

Along with comparing the performance of predictive policing and hot spots policing, we investigate the long-term potential of using these methods in Baltimore City, Maryland, USA. Baltimore provides an interesting and important context for several reasons:

Its history of reports of violence [13,14,15].
The evident geographical racial divide around the city [16,17].
The animosity that arose among police and citizens around 2015 after the case of Freddie Carlos Gray Jr. [18,19].
Its use of predictive policing models in law enforcement, starting around 2018 [20,21].

Neighborhoods in Baltimore remain largely segregated. This is not merely an after-effect of historical segregation policies and resulting “White flight”, which kept the Black majority in inner Baltimore while driving White residents toward outer suburbs [22]. Rather, according to the “New Jim Crow” theory [23], mass incarceration in Black neighborhoods functions as a modern way of maintaining segregation by ensuring that people of color reside in the low-income part of the city [23]. This segregated status of neighborhoods reinforces racial and economic disparities among the neighborhoods, which in turn makes it increasingly vital to address the fairness of predictive policing algorithms.

Further complicating fairness evaluations is the difference in the effect of police presence in Black neighborhoods vs. White neighborhoods. In her book “The New Jim Crow,” Alexander points out that policing in Black neighborhoods is militarized to the extent that Black youth started referring to police presence in their community as “The Occupation”. Alexander mentions that the aggressive tactics used in Black neighborhoods are not even newsworthy, while the same tactics would be “political suicide” in White urban neighborhoods. According to Alexander, the reason for this is the lower political influence of the Black community [23].

These policing tactics affect the psychological and social atmosphere in these neighborhoods. In White communities, police presence is often perceived to bring safety and order. In Black neighborhoods, however, it causes fear, anxiety, and instability—conditions that can escalate tension and end up in aggression by both police forces and residents. Therefore, the same presence can have varied effects in different communities [23]. If policing decisions made using the results of predictive models do not consider these dynamics, they might reinforce harm rather than reduce crime. Pre-evaluation of these systems has uncovered tendencies towards concentration of policing, which could help to promote better-informed police decision-making.

The death of Freddie Carlos Gray Jr., a 25-year-old Black man, was caused by injuries sustained while in police custody and led to riots and city-wide protests, heightening the tension between the police and the Black community [15]. The protests were rooted in the over-policing of Black neighborhoods through heavy surveillance and wrongful arrests, as well as under-protection in the form of delayed and insufficient police response to calls for service [24,25]. In the wake of these protests, there was a spike in crime rates, especially in homicides and non-fatal shootings, in Baltimore [26,27]. Two reasons have been reported as probable causes of this spike:

1. Police pullback: The police department underwent a month-long retreat in response to the protests [26]. There was a drop in the number of arrests [27].
2. Opportunists and emboldened criminals: A public survey by the Pew Research Center showed that about 60% of respondents believed the main cause of the crime rise was that some people took advantage of the situation [28]. During the unrest and subsequent police pullback, opportunists looted approximately 30 pharmacies and drug clinics, stealing substantial amounts of narcotics, which some believed caused a rise in crimes as well [29,30,31].

The grassroots movements following the death of Freddie Gray and the mass arrest of protesters [32] resulted in the establishment of several community-led organizations advocating for police reform and racial justice, namely, the Baltimore Action Legal Team (BALT) [33], the No Boundaries Coalition Community [34], and Leaders of a Beautiful Struggle (LBS) [35]. These local organizations, with the help of activists, lawyers, and community members, are working toward reform by providing legal assistance, public education, and policy advocacy for Black residents affected by biased policing practices. Following the goals of these grassroots movements, our study extends their vision by addressing the technological aspect of policing reform.

The segregated nature of Baltimore’s neighborhoods along with its history of aggressive law enforcement tactics may have taken root in Baltimore’s crime records, on which predictive policing algorithms are trained. Baltimore’s law enforcement has been using predictive policing tools without prior independent evaluations and analysis or estimating potential long-term performance tendencies in Baltimore’s specific context, which is a problem that needs to be addressed.

With this study, our contributions are:

1. A simulation study of the fairness of predictive policing in the context of Baltimore.
2. A comparative analysis of hot spot policing and predictive policing.
3. A logical basis on which predictive policing models can be compared.
4. Visualization of results with suitable graphics, promoting the comprehensibility of our comparisons and the interpretability of related findings.
5. Last and most importantly, a demonstration of the necessity of localized evaluation of policing systems prior to real-world implementation, together with a method with which to achieve this.

In our 300-day simulation period, predictive policing developed less bias on average in most scenarios compared to hot spot policing while maintaining higher accuracy; however, this could change if the simulation were run for longer because of predictive policing’s higher rate of bias development. Contrary to the prevailing wisdom, the feedback loop in Baltimore trends toward assigning a higher average share of officers to White neighborhoods, as determined for Kernel Density Estimation (KDE) and PredPol when trained on all crime records from 2018-2019.

2. Related Work

Since the emergence of predictive policing systems, research in this field has focused on four main areas: accuracy improvement [2,8,36], fairness enhancements [9,10,12], comparative studies [37,38], and contextualized implementation studies [9,11,12,39,40], with some works contributing to more than one aspect.

2.1. Accuracy Improvement

Advancements in predictive policing have grown with improved data collection methods and more sophisticated techniques. Earlier predictive systems in this field relied on hot spot mapping methods such as Kernel Density Estimation (KDE) [41,42] to predict crime densities [43]. Temporal-aware crime mapping claimed accuracy improvement over non-aware static mapping by taking into account the time distance between prior crimes and the prediction target time [4]. By accounting for the time and space dependency of crime distribution, the proposal of spatiotemporal KDE for improved hot spots policing set the foundation for more enhanced predictive models [8]. The Epidemic-Type Aftershock Sequence (ETAS) crime forecasting model, which makes use of the Expectation Maximization (EM) algorithm, showed improved accuracy in assigning crime risk scores to different geographical areas [2,36]. This ETAS-based framework was adopted and operationalized by the PredPol predictive policing system to generate hot spot forecasts [44].

More recent years have seen the use of textual data for crime analysis and prediction using data from social media [45,46]. Vivek and Prathap [45] used crime-related tweet counts and compared the predictions of crime counts by LSTM, ARIMA, and SARIMA. They showed that ARIMA performed best at predicting crime counts while LSTM performed worst, probably due to limited training data. They also showed that crime tweet counts fairly matched real-world crime counts and could be used as a proxy for real-world crime data in certain crime categories. Similarly, S. Tam et al. [46] combined twitter sentiment analysis with historical crime data and proposed the ConvBiLSTM model, which integrates CNN and BiLSTM networks. They fused processed textual and non-textual data, which they fed into a Bidirectional LSTM. The resulting model predicted whether a tweet could be indicative of a crime with about 97% accuracy.

Reinforcement learning has contributed to the enhancement of police allocation logics [39,40,47,48,49,50]. Barbosa et al. proposed a model-free RL approach for optimizing patrol agent positioning. The model learns from digital “pheromones” left over from previous events. After repositioning an agent, the model is rewarded if an agent is present where a crime is happening and penalized if a crime happens where there is no patrol [40]. Joe et al. [47] used deep reinforcement learning and temporal difference learning for patrol and incident response. Their model learns dispatch and rescheduling jointly with two main objectives: maximizing successful incident response and minimizing patrol schedule disruption. Another work on patrolling improvement used IDQN and deep reinforcement learning to learn dynamic patrol routes by maximizing patrol coverage and prioritizing high-risk areas. The city is represented by a graph, with locations as nodes and roads as edges. Each road has two attributes: crime risk and distance. Patrolling a high-risk road is rewarded, while revisiting a road diminishes the reward [48,49]. Multi-Agent Reinforcement Learning (MARL) has also been used to optimize patrolling and dispatching by a joint policy. The model is rewarded if the decision results in faster response and better coverage, and is penalized if a decision results in delays or gaps in patrols [39].

The aforementioned studies did not address fairness; other studies that have evaluated fairness are discussed in the following subsection.

2.2. Fairness Studies

In 2016, the unfairness tendencies of predictive policing models were brought to light when a simulation study by Lum and Isaac [9] on drug-related crime in Oakland, California demonstrated how police would frequent minority and low-income neighborhoods when these models were applied. This motivated more studies focusing on fairness analysis of these systems.

Brantingham ran a study explaining how bias enters into data when an officer downgrades or upgrades a crime based on personal bias [51]. In this study, simulations were run after downgrading crime counts by removing a percentage of crimes from the actual dataset and after upgrading counts by adding an additional percentage of crimes, using settings of 2, 5, 10, 15, and 20 percent. They compared model parameters and demonstrated that for downgrading, the bias needs to be substantial in order to have a noticeable impact on the model parameters. By “noticeable impact” they meant that the estimated parameter values were more than one standard deviation from what they would have been had there been no downgrading. When upgrading crime counts, the impact became noticeable at a lower percentage (10 percent) than when downgrading [51]. In 2018, Ensign et al. mathematically analyzed how this bias could happen using a Polya-urn model in four different scenarios involving allocating a single officer to one of two neighborhoods:

1. High-crime-rate neighborhood vs. low-crime-rate neighborhood, training only on discovered crimes.
2. The same, except this time training only on reported crimes.
3. Two neighborhoods with high crime rates, training only on discovered crimes.
4. The same neighborhoods, now trained only on reported crimes [10].
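The urn dynamics behind the discovered-crime scenarios can be illustrated with a minimal sketch. It assumes a generalized Polya urn in which one officer is sent each day to a neighborhood with probability proportional to its ball count, and a ball is added only when a crime is discovered there. The function name, crime rates, and starting counts are hypothetical and are not taken from [10].

```python
import random


def simulate_urn(rate_a, rate_b, days, rng, start_a=1, start_b=1):
    """Generalized Polya urn for one officer and two neighborhoods.

    Each day the officer visits a neighborhood with probability
    proportional to its ball count; a ball is added only when a crime is
    discovered there, which happens with that neighborhood's crime rate.
    """
    balls = {"A": start_a, "B": start_b}
    rates = {"A": rate_a, "B": rate_b}
    for _ in range(days):
        share_a = balls["A"] / (balls["A"] + balls["B"])
        visited = "A" if rng.random() < share_a else "B"
        if rng.random() < rates[visited]:
            balls[visited] += 1
    return balls


# Hypothetical crime rates; the seed only makes the run repeatable.
rng = random.Random(0)
balls = simulate_urn(rate_a=0.6, rate_b=0.4, days=5000, rng=rng)
print(balls["A"] / (balls["A"] + balls["B"]))
```

Comparing the printed share of visits with the true crime share, 0.6 / (0.6 + 0.4), across several seeds gives a feel for how far allocation can drift when only discovered crimes update the urn.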

The authors proposed several remedies that could improve that specific model [10]. The same year, a study by Brantingham and Mohler et al. investigated evidence of bias in a randomized controlled trial of the ETAS model, which their results suggested was nonexistent [52]. However, their bias measurement was different from what is measured in the fairness of policing distribution when using predictive policing algorithms in simulations [5,11,52]. They did not look at police concentration after long-term use of the predictive policing model or at the effect of the feedback loop. Their bias detection focused on detecting significant racial differences in arrests between a control (day 20 hot spots chosen by human analysts) and treatment days (day 20 hot spots chosen by the predictive policing algorithm) [52]. In another paper, Mohler introduced a new version of the ETAS model with a penalized likelihood method that incorporated demographic parity, demonstrating give-and-take between accuracy and fairness [12].

Akpinar et al. contributed to the discussion of fairness by demonstrating that even when using victim-reported data instead of arrest data (i.e., discovered crime), the models would show bias because crimes are under-reported in some areas compared to others [11]. They pointed out that the unfair rankings were data-driven and not model-specific. They used a synthetic crime dataset generated by the Self-Exciting Point Process model (SEPP) over Bogota, Colombia. They thinned the generated dataset by the victimization rate of each district in Bogota to form a synthetic true crime dataset, then thinned it further using the victim report rate of each district to create the reported crime dataset. A simulation was run once using the Moving Average (MAVG) model and once using the SEPP model. In both cases, it was shown that fewer hot spots were predicted in low-reporting districts regardless of the actual crime rate. While this study demonstrates the inherent risk of data-driven bias even across different model types, it did not simulate police deployment or detection feedback, nor did it compare models in terms of predictive performance. Our work builds on this insight by using real-world crime data, modeling officer deployment and detection, and examining how retraining based on observed and reported crime influences the fairness and accuracy of hot spot-based policing and predictive policing over time [11].

In 2022, Chapman et al. argued that the cause of bias lies in the theoretical assumptions behind model development [5]. They asserted that criminological theories are categorized into two beliefs:

1. Neo-classical: Crime is the product of rational choice, and police presence is preventive. Because a successful burglary can motivate another one nearby, there is a relation among crimes within space and time.
2. Positivist: Crime is rooted in genetics, social setting, and biology rather than in choice; thus, its distribution is mostly random.

To show that bias arises from the model, Chapman et al. created three sets of synthetic crime data: one with a uniform and random distribution, the second mostly uniform with two hot spot regions added, and the third consisting of real-world data. They attempted to show that simulating PredPol over each of these datasets caused clearer hotspots to form after 48 days of simulation. Police were distributed among the ten top-ranked regions.

We note that even if the model had ranked them the same, only ten regions would be allocated police. From then on, hot spots would appear. Whether this demonstrates bias caused by the model or bias due to limited policing resources is unclear. The authors also tried to remove the effects of feedback in the simulation by using a random crime dataset. They observed that no hot spots were formed, countering their assumption that the model was the cause of bias. In this experiment, the authors changed the data fed into the model, not the model itself. They concluded that bias is model-driven, unlike Akpinar [5,11]. In 2023, Mashiat et al. presented an extended abstract proposing the use of a counterfactual causal model to assure fairness. They compared their PropFair model with Ensign’s PolyaUrn model based on police allocation between two neighborhoods with constant crime rates of 0.3 and 0.7, claiming that their model better matched the fairness line by allocating about 0.45 of police to beat 1 and the rest to beat 2 [37]. There have also been fairness studies on some investigative tools [38] and criminal recidivism prediction systems [53,54,55]. Here, we focus on fairness studies concerning crime prediction and patrolling-related tools.

2.3. Comparative Analysis

There have been numerous works comparing the accuracy of two models with noticeable differences in their underlying logics [4,8,40,46,47,48,50]. Among these, several proposed new models for incident response or route-patrolling scheduling [40,47,48,50]. Others have proposed crime prediction models for helping with policing decision-making [2,4,8,46]. To show the enhancement in accuracy provided by their proposed framework, these studies compare a proposed model to other models or consider specific features by comparison to the same base model without optimization.

Other studies have focused on comparing an unbiased version of an algorithm to its original version in terms of fairness [10,12,39]. Ensign theoretically compared PredPol with a modified version of it, with the comparison focusing on allocating an officer between two regions [10]. Mohler [12] compared a Hawkes process-based model to a version of it with a penalized likelihood, while Repaskey studied fairness in the officer-response time of a Multi-Agent Reinforcement Learning (MARL) model and an optimized model by rewarding even patrol coverage [39].

There are also some existing works comparing both the fairness and accuracy of two different algorithms with different levels of complexity. In an extended abstract, Mashiat et al. presented a theoretical fairness and accuracy comparison between PropFair, their proposed counterfactual causal model, and Ensign’s PolyaUrn model, using the same two-region police allocation scenario as in Ensign’s 2018 study [37]. Griffard et al. also studied fairness and accuracy, finding a pattern between newly police-reported crime and previous cases by looking at the rate at which patterns were found in minority neighborhoods versus others. They compared Patternizr, an investigative tool deployed by the NYPD, with baseline models such as gradient-boosting decision trees [38].

2.4. Contextualized Studies

There have been a number of studies analyzing the performance and impact of predictive policing systems in certain regions, some through simulation and others through real-world experiments. The simulation studies have investigated model behavior in cities such as Oakland, Indianapolis, Atlanta, Denver, and Bogota [9,11,12,39,40]. Real-world performance analyses have been conducted in Los Angeles, Kent (UK), and Memphis [7,52,56]. Baltimore has not been systematically studied before, although the city is currently making more use of AI tools in their law enforcement than previously [57,58]. We believe it is imperative to be aware of the unfairness tendencies of such models specific to the context prior to real-world application, since any city policies can further impact the demographic distribution and be a cause of economic, social, and racial divide, as seen in the past [22]. Therefore, one of the aims of this study is to analyze model performance in Baltimore, an important deployed real-world context which is also representative of areas with a history of racial segregation.

As these studies show, fairness distortions in predictive policing systems can emerge from both modeling assumptions and biased data inputs. Yet, few have examined how these biases evolve over time in real-world deployment settings, another gap we aim to address through simulation in the Baltimore context. For more information on the body of work in this field, see previous literature reviews by Raji et al. [59] and Mandalapu et al. [60].

3. Materials and Methods

In this section, we describe the datasets used and some basic terminology that facilitates our explanation of the simulated scenarios, followed by the details of those simulations.

3.1. Data Description

We use the crime and neighborhood dataset available on the Open Baltimore website [61,62,63] (data source: https://data.baltimorecity.gov/datasets/, accessed on: 11 March 2023, 23 August 2022, 15 March 2023, respectively). As the Open Baltimore website periodically updates these datasets and its directory structure may change, we also maintain archived copies to ensure reproducibility (downloaded version: https://github.com/saminsemsar/Data_Analysis_Portfolio/tree/main/PredictivePolicing/Data, accessed on 1 April 2026). The crime dataset is a spatiotemporal dataset which includes information such as the date and time of the crime, latitude, longitude, neighborhood, and crime description. For the study, crime locations outside residential areas were excluded. After performing data preprocessing, a total of 207,447 crime records that occurred in Baltimore City in the years 2019–2020 were utilized. Crime records were sorted chronologically by their timestamp. Note that these crime records are only a subset of the incidents that occurred in that period, as they are the only ones documented and known to law enforcement, and this is likely to impact the behavior and potential bias in predictive policing algorithms [9].

3.2. Definitions

In this paper, several key terms will be used, which we have defined below.

Black neighborhoods: We define a neighborhood as Black if the number of Black residents in that neighborhood exceeds the number of residents of any other racial group. For example, if a neighborhood has 21 Black residents, 20 White residents, and 10 Latino residents, it is flagged as a Black neighborhood. This majority-based definition allows us to consistently categorize neighborhoods by their predominant racial composition for fairness analysis.
White neighborhoods: Neighborhoods with majority White residents.
Neither Black nor White neighborhoods: Neighborhoods where the majority of the residents are neither Black nor White. These areas are populated primarily by groups including the Latinx community, Alaskans, and other racial or ethnic minorities.
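The three neighborhood categories can be expressed as a short rule. The sketch below assumes the plurality reading given in the worked example for Black neighborhoods (the largest group decides the label); the group names used as dictionary keys are illustrative and are not fields from the source dataset. Ties are not addressed in the definitions, so here they simply fall to whichever group Python's `max` encounters first.

```python
def classify_neighborhood(counts):
    """Label a neighborhood by its largest racial group (plurality rule).

    `counts` maps a group name to its resident count; the keys "Black"
    and "White" are illustrative names, not fields from the source data.
    """
    largest = max(counts, key=counts.get)
    if largest in ("Black", "White"):
        return largest
    return "Neither"


# The worked example from the definition above.
example = {"Black": 21, "White": 20, "Latino": 10}
print(classify_neighborhood(example))
```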
Noisy-OR: We use a Noisy-OR function to model the probability of an effect Y given binary causes $X_{1}, \dots, X_{n}$ . Here, $X_{i}$ indicates whether cause i is present or not. Each cause $X_{i}$ is associated with a failure probability $q_{i}$ . The Noisy-OR formulation assumes conditional independence and is defined as follows:

$P(Y = 1 \mid X_{1}, X_{2}, \dots, X_{n}) = 1 - \prod_{i = 1}^{n} q_{i}^{X_{i}} .$

(1)

We incorporate the Noisy-OR probabilistic model to introduce uncertainty into crime detection. Noisy-OR models the independent contribution of multiple officers in detecting a crime by increasing the likelihood of detection as the number of officers within the detection radius increases. An event (e.g., crime detection) is then determined as having occurred if any of several coin flips comes up heads; in practice, if a uniform random draw is lower than the Noisy-OR calculated probability, the crime is marked as “detected”.
Detected crimes: The algorithm determines whether a crime is detected by applying the Noisy-OR model with two inputs: the number of police officers within the neighborhood, k, and the probability of a single officer detecting a crime, p, which is a hyperparameter set to $p = 0.5$ . This detection probability parameter (p) is selected as a neutral baseline in the absence of reliable empirical estimates of an officer’s detection rate. Choosing $p = 0.5$ allows us to model detection uncertainty while avoiding strong assumptions that could bias results toward specific outcomes.

$P(\text{Crime Detected}) = 1 - (1 - p)^{k}$

(2)

A crime is labeled as “detected” based on a coin flip with the probability in Equation (2).
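As a minimal sketch of Equation (2) and the coin flip that follows it, the code below computes the detection probability for k officers and draws one detection outcome. The function names are hypothetical, and treating a uniform draw below the probability as “detected” is an assumption about how the coin flip is implemented.

```python
import random


def detection_probability(k, p=0.5):
    """Equation (2): chance that at least one of k officers detects a crime."""
    return 1.0 - (1.0 - p) ** k


def is_detected(k, p=0.5, rng=random):
    """Flip one coin with the Noisy-OR probability; True means detected."""
    return rng.random() < detection_probability(k, p)


# Detection probability for zero to three officers at p = 0.5.
for k in range(4):
    print(k, detection_probability(k))
```

With $p = 0.5$, each additional officer halves the remaining chance of a miss, so the probability rises from 0 with no officers to 0.5, 0.75, and 0.875 with one, two, and three officers.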
Reported crimes: We create a report dataset by flipping a coin with 0.4 probability to decide whether or not each crime is reported (Equation (4)). The value of 0.4 was calculated using a weighted average based on the number of each type of crime in our dataset and the probability of reporting that crime based on the 2019 report of the Bureau of Justice Statistics (BJS) [64]. We weight each crime category’s count by its national reporting probability from the BJS. The estimate is computed as follows:

$\text{Average Report Probability of Crime} = \frac{\sum_{i \in CT} N_{i} \cdot r_{i}}{\sum_{i \in CT} N_{i}} \simeq 0.4 \text{ in our case}.$

(3)

$\text{Crime is Reported if } (0.4 > \text{rand})$

(4)

where $CT$ is the set of crime types, $N_{i}$ is the number of crimes of type i that occurred after 2019 in Baltimore, and $r_{i}$ is the national reporting rate of crime type i (see Figure 1).
To clarify the concepts of reported and detected crimes, consider a simple example. Suppose that ten crimes actually happen in a neighborhood, but only four are reported to the police due to under-reporting. A predictive system trained on historical data will only observe these four reported incidents, not the full set of ten true crimes. In our simulation, “detected” crimes refers to the subset of true crimes that become visible to the system through policing activity, which may partially overlap with reported crimes.
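Equations (3) and (4) can be sketched in a few lines. The crime counts and reporting rates below are hypothetical values chosen only to make the arithmetic easy to follow; they are not the BJS figures or the Baltimore counts, and the function names are illustrative.

```python
import random


def average_report_probability(counts, rates):
    """Equation (3): count-weighted mean of per-type reporting rates."""
    total = sum(counts.values())
    return sum(counts[c] * rates[c] for c in counts) / total


def thin_by_reporting(crimes, report_prob, rng):
    """Equation (4): keep each crime when report_prob > rand."""
    return [crime for crime in crimes if report_prob > rng.random()]


# Hypothetical counts and rates, chosen only to illustrate the arithmetic:
# (300 * 0.5 + 100 * 0.5 + 600 * 0.3) / 1000 = 0.38
counts = {"assault": 300, "burglary": 100, "larceny": 600}
rates = {"assault": 0.5, "burglary": 0.5, "larceny": 0.3}
p_report = average_report_probability(counts, rates)
print(round(p_report, 2))

# Thin 1000 placeholder crime records with the resulting probability.
rng = random.Random(0)
reported = thin_by_reporting(list(range(1000)), p_report, rng)
print(len(reported))
```

Because a single average probability is applied to every record, this thinning ignores differences in reporting between crime types and neighborhoods, which matches the single 0.4 value described above.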
KDE: Kernel Density Estimation (KDE) is a statistical technique used to estimate the underlying Probability Density Function (PDF) of a set of data points. It involves creating a smooth and continuous function by placing a kernel (a predefined shape, such as a Gaussian) on each data point and then summing them. The resulting estimated density function provides insights into the distribution and intensity of the data across the entire range. We implemented Kernel Density Estimation using the KernelDensity class from the scikit-learn Python library (version: 1.1.3) [65], which provides a computationally efficient implementation of the density estimation method originally introduced by Rosenblatt [42] and Parzen [41].
Short-term KDE: KDE which receives short-term crime history. In our simulations, we set this short term to equal a month’s worth of crime history.
Long-term KDE: KDE which receives a longer history of crime compared to short-term KDE. In our simulations, we set it equal to one year’s worth of crime history to make it comparable to PredPol in terms of the data they receive.
PredPol: PredPol is a self-exciting point process model for crime prediction introduced by Mohler et al. [2,7]. The model is based on the Epidemic-Type Aftershock Sequence (ETAS) framework originally developed in seismology, in which each past event increases the short-term likelihood of nearby future events. In the context of crime forecasting, this formulation captures near-repeat victimization by modeling crime intensity as a combination of a background rate and a self-exciting component. We implemented the simplified formulation explained in Mohler’s 2015 work [7].
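The detection and reporting steps defined above (Equations (2)–(4)) can be illustrated with a minimal sketch. The function names, the crime-type labels, and the counts and reporting rates below are hypothetical placeholders chosen so that the weighted average comes out near 0.4; they are not the Baltimore counts or the BJS figures.

```python
import random

def detection_probability(k, p=0.5):
    """Equation (2): Noisy-OR probability that at least one of k officers detects a crime."""
    return 1.0 - (1.0 - p) ** k

def is_detected(k, rng, p=0.5):
    """Coin flip with the Noisy-OR probability; k = 0 officers can never detect."""
    return rng.random() < detection_probability(k, p)

def average_report_probability(counts, report_rates):
    """Equation (3): reporting rates weighted by the count of each crime type."""
    total = sum(counts.values())
    return sum(counts[c] * report_rates[c] for c in counts) / total

def is_reported(report_prob, rng):
    """Equation (4): a crime is reported if report_prob > rand."""
    return report_prob > rng.random()

# Hypothetical counts N_i and reporting rates r_i (illustrative only).
counts = {"type_a": 500, "type_b": 300, "type_c": 200}
report_rates = {"type_a": 0.3, "type_b": 0.5, "type_c": 0.5}

rng = random.Random(0)  # fixed seed, assumed here only for repeatability
avg_report_prob = average_report_probability(counts, report_rates)
detection_by_officers = {k: detection_probability(k) for k in range(4)}
```

With $p = 0.5$, the detection probability rises quickly with the number of officers (0, 0.5, 0.75, and 0.875 for zero to three officers), which is the mechanism through which police allocation feeds back into the observed crime data.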

3.3. Methods

A simulation-based framework was developed to investigate how policing algorithms perform over time with respect to fairness and accuracy. Real crime data and neighborhood data from Baltimore were applied in the simulation, including neighborhood identifiers, timestamps, and racial demographic indicators. The goal was to observe and quantify the temporal dynamics of bias and accuracy in policing models. For this purpose, three variables were calculated in each scenario setting during the course of the simulation for each day and for each neighborhood:

1. Number of real crimes in the Baltimore city dataset.
2. Number of police officers assigned.
3. Number of crimes detected.

The simulation was run 20 times in each scenario setting; therefore, the number of police assigned and number of crimes detected are both averaged over 20 runs.

The simulations involved running PredPol, short-term KDE, and long-term KDE over the eight distinct settings summarized in Table 1. Our implementation approximates hot spot policing using a spatial kernel density estimation model (short-term and long-term KDE) applied to crimes observed within a rolling historical window. While KDE is a spatial smoothing technique, the simulation introduces temporal dynamics by updating the input crime data sequentially over time. This approach reflects how hot spot policing is commonly implemented operationally through analysts who generate hotspot maps based on recent crime observations. Let $t$ denote the current simulation day and $\tau_{i}$ the occurrence time of crime $i$. We define the set of crimes available for prediction at day $t$ as

$H_{t}^{(W)} = \{x_{i} = (lat_{i}, lon_{i}) : t - W \leq \tau_{i} < t\},$

(5)

where $(lat_{i}, lon_{i})$ represents the spatial coordinates of crime $i$ in terms of latitude and longitude and $W$ is the historical window length. In our simulations, $W =$ one month for short-term KDE and $W =$ one year for long-term KDE.

Using the historical events, the KDE model estimates a spatial crime intensity surface

$\hat{f}_{t}^{KDE}(x) = \frac{1}{|H_{t}^{(W)}|} \sum_{x_{i} \in H_{t}^{(W)}} K_{h}(x - x_{i}),$

(6)

where $x = (lat, lon)$ and $K_{h}(\cdot)$ is a Gaussian kernel with bandwidth $h$.

For a Gaussian kernel in two dimensions,

$K_{h}(x - x_{i}) = \frac{1}{2 \pi h^{2}} \exp\left(-\frac{\| x - x_{i} \|^{2}}{2 h^{2}}\right).$

(7)
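Equations (5)–(7) can be evaluated directly, as in the following sketch. It is a plain-Python illustration rather than the scikit-learn KernelDensity implementation used in the study; the event coordinates, the 30-day window, the bandwidth of 0.01 degrees, and the treatment of latitude and longitude as Euclidean coordinates are all assumptions made for the example.

```python
import math

def history_window(crimes, t, W):
    """Equation (5): coordinates of crimes with t - W <= tau_i < t.

    crimes is a list of (tau, lat, lon) tuples, with tau measured in days.
    """
    return [(lat, lon) for tau, lat, lon in crimes if t - W <= tau < t]

def gaussian_kernel(dx, dy, h):
    """Equation (7): isotropic two-dimensional Gaussian kernel with bandwidth h."""
    return math.exp(-(dx * dx + dy * dy) / (2.0 * h * h)) / (2.0 * math.pi * h * h)

def kde_intensity(x, history, h):
    """Equation (6): mean kernel value at x over the events in the window."""
    if not history:
        return 0.0
    lat, lon = x
    return sum(gaussian_kernel(lat - a, lon - b, h) for a, b in history) / len(history)

# Hypothetical events (day, lat, lon); illustrative values only.
crimes = [
    (0, 39.29, -76.61),
    (10, 39.30, -76.62),
    (30, 39.29, -76.61),
    (31, 39.35, -76.70),
]
hist = history_window(crimes, t=31, W=30)  # keeps the events on days 10 and 30
near = kde_intensity((39.29, -76.61), hist, h=0.01)
far = kde_intensity((39.40, -76.80), hist, h=0.01)
```

Changing `W` from roughly one month to one year is the only difference between the short-term and long-term KDE variants in this sketch.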

Unlike hot spot-based policing systems, the PredPol model that represents predictive policing in our study models how crimes influence the likelihood of future crimes nearby in space and time. PredPol is modeled as a self-exciting point process in which past crimes increase the short-term likelihood of future crimes. Let $n$ index neighborhoods and let $H_{t}$ denote the set of crimes observed prior to day $t$. The predicted crime intensity for neighborhood $n$ at day $t$ is written as

$\lambda_{n}^{PredPol}(t) = \mu_{n} + \sum_{i \in H_{t}} \mathbf{1}\{n_{i} = n\} \, \theta \omega e^{-\omega (t - \tau_{i})},$

where $\mu_{n}$ is the background crime rate for neighborhood $n$, $\theta$ controls the strength of self-excitation, $\omega$ controls temporal decay, $n_{i}$ is the neighborhood in which crime $i$ occurred, and $\tau_{i}$ is the occurrence time of crime $i$. The neighborhood-level intensities are then normalized into a probability distribution over neighborhoods

$p_{n}^{PredPol}(t) = \frac{\lambda_{n}^{PredPol}(t)}{\sum_{k} \lambda_{k}^{PredPol}(t)},$

and police officers are allocated by sampling from this distribution.
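The intensity, normalization, and sampling steps above can be sketched as follows. The parameter values for $\mu_{n}$, $\theta$, and $\omega$ are hypothetical and fixed by hand; how these parameters are estimated from data is not shown here. Sampling officers independently with replacement is also an assumption of this sketch.

```python
import math
import random

def predpol_intensity(n, t, history, mu, theta, omega):
    """Background rate plus exponentially decaying excitation from past crimes in n."""
    excitation = sum(
        theta * omega * math.exp(-omega * (t - tau))
        for tau, hood in history
        if hood == n and tau < t
    )
    return mu[n] + excitation

def allocation_distribution(t, history, mu, theta, omega):
    """Normalize neighborhood intensities into a probability distribution."""
    lam = {n: predpol_intensity(n, t, history, mu, theta, omega) for n in mu}
    total = sum(lam.values())
    return {n: value / total for n, value in lam.items()}

def allocate_officers(probs, num_officers, rng):
    """Sample one neighborhood per officer from the distribution."""
    hoods = list(probs)
    draws = rng.choices(hoods, weights=[probs[n] for n in hoods], k=num_officers)
    return {n: draws.count(n) for n in hoods}

# Hypothetical parameters and history of (day, neighborhood) events.
mu = {"A": 0.2, "B": 0.2}
history = [(8, "A"), (9, "A"), (2, "B")]
probs = allocation_distribution(10, history, mu, theta=0.5, omega=0.3)
officers = allocate_officers(probs, 40, random.Random(0))
```

Because the two recent events in neighborhood A have decayed less than the older event in B, A receives the larger share of the distribution even though both neighborhoods have the same background rate.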

PredPol is applied to a dataset of all known crimes to predict future crime distributions. This model gives higher weights to more recent crimes; hot spot-based policing, on the other hand, typically receives only recent crimes and only calculates the crime density of different areas, without performing any temporal weighting. Thus, short-term KDE more accurately represents the practice of hot spot policing than long-term KDE. However, we include long-term KDE for a fair comparison with PredPol, as it uses the same amount of data while short-term KDE only uses the most recent data. The algorithms were simulated over various settings defined by the following three variables:

1. Number of police officers: Distribution of 40 or 400 police officers.
2. Probability of reporting a crime: Using only detected crimes (probability of report = 0) or using both detected and reported crimes (probability of report = 0.4). The value $r = 0.4$ was calculated as the weighted average of reporting probabilities across crime categories, where the weights correspond to the observed frequency of each crime type in the dataset (Section 3.2). It is used as a simplified uniform reporting rate across neighborhoods and crime types to represent under-reporting of crimes. This assumption allows us to examine how under-reporting affects policing algorithms. We note that in reality, reporting probabilities may vary across neighborhoods and crime types; we discuss this impact in Section 6.
3. Crime type: Using all crime types in the dataset ('TOTAL') or only aggravated assault records ('AGG. ASSAULT').
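As a small illustration, the eight settings can be enumerated as the full cross of the three variables above. That Table 1 corresponds exactly to this cross product is an assumption of the sketch, and the variable names are hypothetical.

```python
from itertools import product

officer_counts = [40, 400]
report_probabilities = [0.0, 0.4]
crime_types = ["TOTAL", "AGG. ASSAULT"]
models = ["PredPol", "short-term KDE", "long-term KDE"]

# 2 x 2 x 2 = 8 scenario settings
settings = list(product(officer_counts, report_probabilities, crime_types))

# Each model is run in every setting
runs = [(model, *setting) for model in models for setting in settings]
```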

The simulation horizon was set to 300 days. This duration was chosen to be long enough for feedback effects between crime detections and police deployments to emerge and influence algorithm behavior. At the same time, extremely long simulations can become difficult to interpret because the underlying environment is unlikely to remain stationary over extended periods. For instance, the demographic map, crime patterns, and reporting behavior may change over time. In such settings, very long simulated horizons may show recurring reversals in the bias direction, or even show no bias at all if examining average police concentration over time. Therefore, the 300 day horizon represents a compromise that allows for the study of long-term feedback tendencies while maintaining a reasonably stable environment for interpreting the results.

PredPol and long-term KDE used all available detected and reported crime data, while short-term KDE used detected and reported crimes within one month before the prediction date. The crime events were taken from the Baltimore dataset, while the simulation modeled officer allocation using alternative policing algorithms, crime detection, and prediction updates based on the observed events.

During the simulation, police locations were determined for each day over a period of 300 days, starting from 1 January 2019. Note that the police assignment for the first date was determined by considering all real crimes that occurred before that date, assuming that all the crimes were detected. However, for the subsequent days we appended that crime history to a subset of the real crimes from the previous days that were specifically labeled as detected by the simulation, plus the reported crimes of the previous days. Noisy-OR was applied to the number of police officers within the neighborhood in which each crime happened in order to generate the detected crimes dataset for the current day. This dataset was then appended along with the reported crimes dataset and used to determine police locations for the following day. Predictions were generated on each simulation day, after which police officers were allocated according to the model predictions, crimes were detected according to the detection technique (Section 3.2), and detected crimes were added to the crime dataset before generating the next day’s predictions. The pseudocode in Algorithms 1 and 2 clarifies our strategy.

Algorithm 1: Creating a Reported Crimes Dataset

Input: Crime dataset crimes, Report probability report_probability

Output: None

reported_crimes = an empty crime dataset;

save CSV file of reported_crimes;
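A possible Python rendering of Algorithm 1 is sketched below, with the selection step following Equation (4). It is an illustration under stated assumptions rather than the authors' code: the record fields (`id`, `neighborhood`) and the optional `path` argument are hypothetical, and unlike Algorithm 1 the function also returns the reported crimes so that the result can be inspected without writing a file.

```python
import csv
import random

def create_reported_crimes(crimes, report_probability, rng, path=None):
    """Keep each crime with probability report_probability (Equation (4))."""
    reported_crimes = [c for c in crimes if report_probability > rng.random()]
    if path is not None and reported_crimes:
        # Save the reported crimes as a CSV file, one row per crime
        with open(path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(reported_crimes[0].keys()))
            writer.writeheader()
            writer.writerows(reported_crimes)
    return reported_crimes

# Hypothetical crime records (illustrative only)
crimes = [{"id": i, "neighborhood": "A" if i % 2 else "B"} for i in range(10)]
reported = create_reported_crimes(crimes, 0.4, random.Random(0))
```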

Algorithm 2: Police Allocation Simulation

Input: Crime dataset crimes; prediction algorithm Alg; start date start_date; end date end_date; detection probability detectProb; reported-crimes dataset reported_crimes
Output: Result dataset res containing daily neighborhood-level police allocations and detected crimes

Initialize crimes_h ← all crimes before start_date;

Initialize detected_crimes ← crimes_h;

Initialize res with columns: date, neighborhood, crime_num, police_num, detected_crime_num;

Fill res with simulation dates, neighborhoods, and number of crimes from crimes;

return res;
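The daily loop described in the text (predict, allocate, detect with Noisy-OR, append observed crimes) can be sketched as follows. This is a simplified illustration and not a reconstruction of Algorithm 2: `count_based_alg` is a hypothetical stand-in for PredPol or KDE, days are integers, crimes are `(id, day, neighborhood)` tuples, and a crime that is both detected and reported is appended to the history only once, which is an assumption.

```python
import random
from collections import Counter

def count_based_alg(history, neighborhoods):
    """Hypothetical stand-in predictor: weight = 1 + number of observed crimes."""
    counts = Counter(hood for _, _, hood in history)
    return {h: 1 + counts[h] for h in neighborhoods}

def simulate(crimes, reported_ids, alg, neighborhoods, days,
             num_officers, detect_prob, rng):
    # All crimes before the first simulated day are assumed to be detected
    history = [c for c in crimes if c[1] < days[0]]
    res = []
    for day in days:
        # Predict, then sample one neighborhood per officer
        weights = alg(history, neighborhoods)
        draws = rng.choices(neighborhoods,
                            weights=[weights[h] for h in neighborhoods],
                            k=num_officers)
        police = Counter(draws)
        # Noisy-OR detection based on officers in each crime's neighborhood
        todays = [c for c in crimes if c[1] == day]
        detected_ids = {
            cid for cid, _, hood in todays
            if rng.random() < 1.0 - (1.0 - detect_prob) ** police[hood]
        }
        # Observed crimes (detected or reported) feed the next day's prediction
        history.extend(c for c in todays
                       if c[0] in detected_ids or c[0] in reported_ids)
        for h in neighborhoods:
            todays_h = [c for c in todays if c[2] == h]
            res.append({
                "date": day,
                "neighborhood": h,
                "crime_num": len(todays_h),
                "police_num": police[h],
                "detected_crime_num": sum(c[0] in detected_ids for c in todays_h),
            })
    return res

# Hypothetical data (illustrative only)
neighborhoods = ["A", "B", "C"]
crimes = [(0, 0, "A"), (1, 0, "A"), (2, 0, "B"), (3, 1, "A"),
          (4, 1, "C"), (5, 2, "B"), (6, 3, "A"), (7, 3, "A")]
res = simulate(crimes, reported_ids={4, 5}, alg=count_based_alg,
               neighborhoods=neighborhoods, days=[1, 2, 3],
               num_officers=40, detect_prob=0.5, rng=random.Random(0))
```

Because only observed crimes enter `history`, neighborhoods that receive more officers contribute more data to the next prediction, which is the feedback loop examined in the study.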

3.4. Fairness and Accuracy Metrics

The results gathered from running KDE and PredPol in different scenarios were analyzed using different visualizations and statistical assessments.

To improve legibility and reduce redundancy, we use the notation $\bar{x}$ and $Avg(x)$ to denote the arithmetic mean (i.e., the sum of values divided by the number of elements). This notation is used interchangeably throughout the formulas depending on which better improves clarity and visual flow.

To quantify fairness and accuracy in our data analysis, we focus on five metrics to analyze three major concepts: the racial fairness gap, the neighborhood-level fairness gap, and coverage accuracy. For both racial group fairness and individual neighborhood fairness, equality of treatment is a combination of having equal resources in general and equal resources assigned to similar individuals. For this purpose, both racial and neighborhood-level fairness consider the equality of police share or police officer number as general resource equity and the equality of police share to crime share ratio as a measure for equity in treatment of individuals having similar crime rates, i.e., treatment proportional to crime rate.

The most fair model or simulation setting in terms of any fairness gap metric is defined as the one with the minimum average value for that metric during simulation, while the most accurate is defined as the one with the maximum average accuracy value during the simulation.

1. Racial Fairness Gap

Racial fairness in this study is a group fairness metric. In this work, the racial disparity or fairness gap is defined as an absolute difference in averages. The use of a difference or absolute difference is a standard approach in quantifying group disparity (cf. Equation (5) in [66] and Definition 6.3 in [67]).

Inequality of Average Police Share between White and Black Neighborhoods
This measure determines the disparity of average police share between the two races, meaning the groups’ general equality of resource allocation:

${RacialFairnessGap}_{PoliceShare} = |\bar{P}_{Black} - \bar{P}_{White}|$

(8)

where $\bar{P}_{Race}$ is the average police share in neighborhoods of the given race. Based on the explanation above, the most fair model or simulation setting would be determined by

${MostFair}_{PoliceShare} = \arg\min (Avg({RacialFairnessGap}_{PoliceShare})).$

(9)
Inequality of Average Police-to-Crime Ratio (PCR) between White and Black Neighborhoods
The racial fairness gap is defined as the absolute difference of the average police-to-crime ratio between groups, analyzing proportional treatment or similar treatment of individuals with similar crime rates (see Equations (10)–(14), where $N = | Neighborhoods |$ ).

$PoliceShareSmoothed_{i} = \frac{PoliceNum_{i} + ε}{TotalPoliceNum + N ε}, \quad i \in Neighborhoods$

(10)

$CrimeShareSmoothed_{i} = \frac{CrimesNum_{i} + ε}{TotalCrimesNum + N ε}, \quad i \in Neighborhoods$

(11)

${PCR}_{Race} = Avg\left(\frac{PoliceShareSmoothed_{i}}{CrimeShareSmoothed_{i}}\right), \quad i \in Neighborhoods_{Race}$

(12)

${RacialFairnessGap}_{PCR} = |{PCR}_{Black} - {PCR}_{White}|$

(13)

${MostFair}_{PCR} = \arg\min (Avg(|{RacialFairnessGap}_{PCR}|))$

(14)
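The two racial fairness gaps in Equations (8) and (10)–(13) can be computed for a single day as in the sketch below. The neighborhood labels and counts are hypothetical, the smoothing constant `eps = 1e-6` is an assumed value, and using the unsmoothed police share for Equation (8) is also an assumption.

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values)

def smoothed_shares(counts, eps=1e-6):
    """Equations (10) and (11): additive smoothing of per-neighborhood shares."""
    total = sum(counts.values())
    n = len(counts)
    return {k: (v + eps) / (total + n * eps) for k, v in counts.items()}

def racial_fairness_gaps(police, crimes, race, eps=1e-6):
    """Return (gap in average police share, gap in average police-to-crime ratio)."""
    total_police = sum(police.values())
    p_smooth = smoothed_shares(police, eps)
    c_smooth = smoothed_shares(crimes, eps)
    avg_share, pcr = {}, {}
    for group in ("Black", "White"):
        hoods = [h for h in police if race[h] == group]
        avg_share[group] = mean(police[h] / total_police for h in hoods)
        pcr[group] = mean(p_smooth[h] / c_smooth[h] for h in hoods)  # Equation (12)
    gap_share = abs(avg_share["Black"] - avg_share["White"])  # Equation (8)
    gap_pcr = abs(pcr["Black"] - pcr["White"])                # Equation (13)
    return gap_share, gap_pcr

# Hypothetical single-day data (illustrative only)
race = {"A": "Black", "B": "Black", "C": "White", "D": "White"}
crimes = {"A": 5, "B": 5, "C": 5, "D": 5}
even = racial_fairness_gaps({"A": 10, "B": 10, "C": 10, "D": 10}, crimes, race)
skewed = racial_fairness_gaps({"A": 30, "B": 10, "C": 0, "D": 0}, crimes, race)
```

Averaging these daily gaps over the simulation and taking the minimum across models gives the selections in Equations (9) and (14).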

2. Neighborhood-Level Fairness Gap

To study inequality of treatment among individual neighborhoods, we use the Gini coefficient. This measure of inequality has been used in the health [68], education [69], and especially economics literature [70]. The Gini coefficient measures the distance from the equality line; the higher the coefficient, the more unequal the values. The Gini coefficient is calculated using the trapezoidal method:

$Gini(X) = \frac{1}{n} \left(n + 1 - 2 \cdot \frac{\sum_{i = 1}^{n} \sum_{j = 1}^{i} x_{(j)}}{\sum_{i = 1}^{n} x_{(i)}}\right),$ where $x_{(i)}$ are the sorted values.

(15)

Inequality of Police Distribution
This metric calculates the Gini coefficient of police numbers in each neighborhood during the simulation in order to determine the overall inequality of the resource distribution.

${Gini}_{Police} = Gini(\{P_{i}\}_{i = 1}^{N})$

(16)

${MostFair}_{Gini, Police} = \arg\min (Avg({Gini}_{Police}))$

(17)
Inequality of Police-to-Crime Ratio (PCR) Across Neighborhoods
We measure the Gini coefficient of the neighborhoods’ police-to-crime ratios, which captures how unequally officers are distributed among neighborhoods with similar crime levels, i.e., the inequality of officer shares relative to crime shares.

${PCR}_{i} = \frac{P_{i} + ε}{C_{i} + ε}, \quad {Gini}_{PCR} = Gini(\{{PCR}_{i}\}_{i = 1}^{N})$

(18)

${MostFair}_{Gini, PCR} = \arg\min (Avg({Gini}_{PCR}))$

(19)
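The Gini computation in Equation (15) and its two uses in Equations (16) and (18) can be sketched in plain Python as follows. This is an illustration only and does not reproduce the inequalipy-based implementation listed in Section 3.5; the officer and crime counts and the value of `eps` are hypothetical.

```python
def gini(values):
    """Equation (15), evaluated on the values sorted in ascending order."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    running, sum_of_cumulative = 0.0, 0.0
    for x in xs:
        running += x                  # inner sum over j <= i
        sum_of_cumulative += running  # outer sum over i
    return (n + 1 - 2.0 * sum_of_cumulative / total) / n

def gini_police(police):
    """Equation (16): inequality of officer counts across neighborhoods."""
    return gini(police)

def gini_pcr(police, crimes, eps=1e-6):
    """Equation (18): inequality of smoothed police-to-crime ratios."""
    return gini([(p + eps) / (c + eps) for p, c in zip(police, crimes)])

# Hypothetical per-neighborhood counts (illustrative only)
police = [10, 10, 0, 20]
crimes = [5, 5, 5, 10]
g_police = gini_police(police)
g_pcr = gini_pcr(police, crimes)
```

A perfectly even allocation gives a coefficient of 0, while placing all officers in one of four neighborhoods gives 0.75, the maximum for $n = 4$ under this formula.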

3. Coverage Accuracy: Proportion of Crimes Detected

Coverage accuracy measures the effectiveness of the police distribution relative to the actual crime distribution. It is defined as the proportion of detected crimes.

$CoverageAccuracy = \frac{\text{Total Detected Crimes}}{\text{Total Crimes}}$

(20)

$MostAccurate = \arg\max (Avg(CoverageAccuracy))$

(21)
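Equations (20) and (21) amount to a ratio followed by an arg max over models, as in the short sketch below. The model labels and daily counts are hypothetical placeholders and do not correspond to any result reported in this study.

```python
def coverage_accuracy(detected, total):
    """Equation (20): share of the day's crimes that were detected."""
    return detected / total if total else 0.0

# Hypothetical daily (detected, total) pairs per model (illustrative only)
daily = {
    "model_a": [(6, 10), (5, 10)],
    "model_b": [(3, 10), (4, 10)],
}

avg_accuracy = {
    model: sum(coverage_accuracy(d, t) for d, t in days) / len(days)
    for model, days in daily.items()
}

# Equation (21): the model with the highest average coverage accuracy
most_accurate = max(avg_accuracy, key=avg_accuracy.get)
```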

3.5. Experimental Environment

All experiments were implemented in Python (version 3.9.13) within an Anaconda environment, using libraries including scikit-learn (version 1.1.3), NumPy (version 1.22.3), pandas (version 1.5.1), GeoPandas (version 0.12.1), Shapely (version 1.8.5.post1), inequalipy (version 1.0.5), Matplotlib (version 3.6.2), Plotly (version 6.5.2), Folium (version 0.13.0), and seaborn (version 0.12.2). The primary simulation results presented in the main text were obtained using standard computing resources. To support robustness analysis over extended parameter ranges, additional large-scale experiments reported in the appendix were conducted on the UMBC High Performance Computing Facility (HPCF) cluster (chip). Code and processed data are publicly available to facilitate reproducibility.

4. Results

This section describes the results of our experiments. The results dataset and code for the analysis are available online (accessed on 1 April 2026) (https://github.com/saminsemsar/Data_Analysis_Portfolio/tree/main/PredictivePolicing). To summarize, across the scenarios we observe three main and mostly consistent patterns. First, PredPol achieves higher average coverage accuracy and fairness than KDE-based methods in most of the scenarios. Second, PredPol often shows faster deterioration in neighborhood-level fairness. Third, crime type affects police concentration patterns. In what follows, we provide the details of these findings. The results are organized to address a series of guiding questions aimed at examining key aspects of our research objectives:

Which algorithm is fairer or more accurate in our specific location and time period?
Do different systems react differently to feedback loops?
Is the bias model-driven or data-driven?
How do the police concentration and distribution change over the course of the simulation in different scenarios?

Before presenting the findings, we begin with a map-based visualization to provide an intuitive overview of how police were distributed on the first and last days of simulations in an example scenario (40 police officers, 300 days) (Figure 2). The red circles are crime hot spots in Black neighborhoods, the black circles are those in White neighborhoods, and the blue markers are police locations. It can be observed that police are allocated more evenly among neighborhoods on the first day of allocation compared to the last day (about a year later), as shown in Figure 2, illustrating the feedback loop phenomenon [9].

We present the detailed findings organized by themes corresponding to the guiding questions listed above. Some of these findings have been further analyzed for robustness to different report probabilities; this analysis can be found in Appendix A.

1. Comparative Fairness and Accuracy Across Models

This subsection compares short-term KDE, long-term KDE, and PredPol in terms of overall fairness and accuracy within the Baltimore study region and time period. The findings summarize average and aggregate performance across scenarios, including whether the improvements in fairness affected accuracy.

Finding 1.1: PredPol was the Most Accurate at the Beginning
In 50% of scenarios, PredPol had the highest coverage accuracy compared to the other two on the simulation’s first day. However, on the last day of the simulation, long-term KDE became the most accurate in 75% of the scenarios, cf. Figure 3.
Finding 1.2: PredPol was generally the most accurate algorithm.
PredPol was the most accurate algorithm in 87.5% of scenarios and the second-most accurate in the rest. Conversely, short-term KDE was the least accurate in 75% of scenarios (Figure 4). Further analysis of this result’s robustness against different report probabilities (Appendix A) showed that although the extent of the difference in accuracy between these models was influenced by $pReport$, their relative ranking remained mostly the same (Figure A1).
Finding 1.3: PredPol was the most racially fair, while long-term KDE was the least.
PredPol was the most racially fair, in terms of equality of police-share to crime-share ratio, in 75% of scenarios, and long-term KDE was the least racially fair in 87.5% of scenarios (Figure 5). In terms of average police share, PredPol was the most racially fair model in 62.5% of scenarios. Long-term KDE, on the other hand, was the least racially fair model in 75% of scenarios (Figure 6 and Figure 7). Further analysis of this result’s robustness against different report probabilities (Appendix A) showed that although the extent of the difference in racial fairness gap between these models was influenced by $pReport$ (Figure A1, right column), their relative ranking remained mostly the same (Figure A1, left column), with PredPol having a higher probability of achieving the best performance across $pReport$ values.
Finding 1.4: PredPol was the most fair in regard to neighborhood-level fairness, followed by long-term KDE, while short-term KDE was the least fair at the neighborhood level.
Based on both equality of average police share and equality of average police-to-crime ratio, PredPol (in 75% of scenarios) and long-term KDE (in the other 25%) were the most fair models at the neighborhood level. According to both metrics, short-term KDE was the least fair at the neighborhood level in 100% of scenarios (Figure 8 and Figure 9). Further analysis of this result’s robustness against different report probabilities (Appendix A) showed that although the extent of the difference between these models in terms of the neighborhood-level fairness gap was influenced by $pReport$ (Figure A1, right column), their relative ranking remained mostly the same, with PredPol maintaining the highest probability of achieving the best performance across $pReport$ values (with the exception of $pReport = 0$, where long-term KDE and PredPol swapped places); see Figure A1, left column. In general, long-term KDE and PredPol performed more alike in terms of the extent of the neighborhood-level fairness gap, but with a larger decrease in the gap for PredPol compared to long-term KDE as the report probability increased (Figure A1, right column).
To reinforce the scenario percentage pattern reporting, we also computed paired differences between PredPol and each KDE-based method across the eight main scenarios. The average differences are provided in Table 2.
Finding 1.5: Higher accuracy is accompanied by a lower neighborhood-level fairness gap.
In Figure 10, it can be observed that higher neighborhood-level fairness gaps (lower fairness) were accompanied by lower accuracy. The accuracy–fairness correlation is clearer when we group the data points by the number of officers allocated, which was the variable creating the most distinct clusters (Appendix B). Further correlation analysis can be found in Appendix B.
Finding 1.6: Racial fairness is not correlated with accuracy.
Unlike neighborhood-level fairness, we did not observe any correlations between racial fairness and accuracy (Figure 11, Appendix B). All levels of racial fairness were observed over all levels of accuracy. Further correlation analysis can be found in Appendix B.

2. Differential Responses to Feedback Loops

This subsection examines how the different algorithms responded to feedback loops over time. These findings focus on temporal trends, including the direction and speed of change in fairness and accuracy, rather than on average outcomes. Overall, we found that long-term KDE had the slowest pace of change (slope of trend-line) for accuracy and fairness metrics over the days of simulation, with PredPol and short-term KDE competing for the least stable. In some cases where this trend moved towards focusing officers on a race different from the one at the start of simulation, the trend became positive, showing a temporary improvement in bias. This happened more for KDE-based models. A more detailed report is presented below.
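The pace of change referred to above is the slope of a trend line fitted to a daily metric series. A minimal sketch of such a slope is given below, assuming an ordinary least-squares fit against the day index; whether the reported trend lines were fitted exactly this way is an assumption, and the series are hypothetical.

```python
def trend_slope(values):
    """Least-squares slope of a daily series against the day index 0..n-1."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two observations are needed")
    x_mean = (n - 1) / 2.0
    y_mean = sum(values) / n
    numerator = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    denominator = sum((i - x_mean) ** 2 for i in range(n))
    return numerator / denominator

# Hypothetical daily fairness-gap series (illustrative only)
series = {
    "model_a": [0.10, 0.20, 0.30, 0.40],
    "model_b": [0.10, 0.12, 0.14, 0.16],
}
slopes = {name: trend_slope(vals) for name, vals in series.items()}

# The slowest pace of change is the smallest absolute slope
slowest = min(slopes, key=lambda name: abs(slopes[name]))
```

Under this convention, a positive slope of a fairness gap indicates growing disparity over the simulation, and a negative slope indicates a shrinking gap.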

Finding 2.1

Long-Term KDE had the slowest pace of change in most scenarios, especially for neighborhood-level fairness.

Long-term KDE had the smallest slope of the trend line in:

50% of scenarios for racial fairness in terms of average police share (Figure 12).
37.5% of scenarios for racial fairness in terms of police-to-crime proportionality (Figure 13).
75% of scenarios for neighborhood-level fairness in terms of police-share equality (Figure 14 and Figure 15).
75% of scenarios for neighborhood-level fairness in terms of police–crime proportionality (Figure 16).

Finding 2.2

Short-term KDE had the fastest trend of racial bias based on average police share in most of the scenarios.

Short-term KDE had the fastest trend of racial bias in 62.5% of scenarios for equality of average police share (Figure 12). Taking a closer look at individual scenarios, although short-term KDE ranked second in pace of neighborhood-level bias in most of the scenarios, its change in neighborhood-level fairness over the days showed a more volatile pattern compared to the other two models (Figure 15; to see the other scenarios, refer to the code [71]).

Finding 2.3

PredPol had the fastest trend of neighborhood-level bias, and showed a trend in racial bias metrics for most of the scenarios.

We have already established that PredPol was generally the fairest model in terms of both racial and neighborhood-level fairness; however, looking at the pace of change for each metric over different scenarios, PredPol shows a fast neighborhood-level fairness gap trend (Figure 14 and Figure 16) and racial fairness trend in terms of police-to-crime ratio in most of the scenarios (Figure 13). This suggests that it is indeed vulnerable to bias from feedback loops, as previously reported by [9].

Finding 2.4

Bias amplification was observed in a higher percentage of scenarios for PredPol compared to the other two models.

Although PredPol was fairer in most of the scenarios, amplification of bias was seen in its trends in a higher percentage of scenarios than for the two KDE approaches. PredPol’s neighborhood-level fairness dropped in 100% of scenarios, while the drop for short-term KDE and long-term KDE occurred in 75% and 37.5%, respectively (Figure 17 and Figure 18).

The drop in racial fairness for PredPol happened in 75% of scenarios for police–crime proportionality and 50% of scenarios for average police share, while these numbers were 50% and 25% for short-term KDE and 75% and 25% for long-term KDE, respectively (Figure 19 and Figure 20). In Figure 19 and Figure 20, the hatched or striped bars are those in which the race receiving focus at the start of the simulation contrasted with the race the trend was toward; this demonstrates how all the fairness gaps with negative slopes were those where the race focus at the beginning of the simulation contradicted the trend. For instance, refer to the models’ behaviors in the example scenario in Figure 7.

Finding 2.5

Short-term KDE experienced an accuracy drop in a higher percentage of scenarios compared to the other two models.

Although we observed previously that neighborhood-level fairness and accuracy seemed to be correlated to some extent, here we saw that short-term KDE, rather than the model with the most consistent and fastest drop in neighborhood-level fairness, experienced accuracy drops in 75% of the scenarios. The corresponding drop percentages were 62.5% for PredPol and 37.5% for long-term KDE (Figure 21).

Finding 2.6

Distribution uniformity did not guarantee the defined racial fairness.

While the percentage of scenarios with a drop in neighborhood-level fairness ranged from 75 to 100 for PredPol and short-term KDE, the percentage of scenarios with a drop in racial fairness ranged from 25 to 75 (Figure 17, Figure 18, Figure 19 and Figure 20). On the other hand, for long-term KDE racial fairness based on police–crime proportionality fell in 75% of scenarios, while both neighborhood-level fairness metrics worsened in 37.5% of scenarios. Long-term KDE’s racial fairness based on average police share only fell in 25% of scenarios. This indicates that the models with the least police distribution uniformity might stay racially fair in areas with certain demographic maps and crime records.

Finding 2.7

Bias typically worsened over time except when the trend and the current state contradicted each other.

We observed a consistent pattern across scenarios: in cases where racial fairness appeared to improve unexpectedly, the demographic group receiving the greatest police attention at the start of the simulation was different from the group receiving the most attention at the end. In the case of Baltimore and using all crime records, this focus at the end was surprisingly on White neighborhoods. This shift largely accounted for the observed improvement (see Figure 19 and Figure 20). In these contradictory cases, where the initial state and the direction of change did not align with bias amplification, the trajectory can only cross the parity point and begin moving toward amplification within the simulation if the slope of the bias trend is sufficiently steep relative to the initial disparity. Otherwise, longer simulation horizons are required for this transition to appear.

These apparent improvements hint at the possibility of seasonal cycles of bias reduction and subsequent amplification when simulations are run for extended durations. Changes in the data distribution for real crimes may arise from multiple underlying factors. Two such factors are illustrated below:

i. Seasonality: Any variable with cyclic effect, such as weather, could cause seasonal changes in the distribution.
ii. Police policy change: Changes in police distribution or enforcement practices could reduce crime in a given location or suppress specific crime types affected by that policy, which would then induce a shift in crime distribution.

Therefore, when data on real crimes are used and when both reported and detected crimes contribute to the updates, the distribution of reported crimes may begin to favor a neighborhood that was not previously predicted as high-risk. As officers gradually shift their attention from earlier high crime rate areas towards this newly emergent hot spot, a temporary reduction in measured bias can occur. This is typically followed by renewed amplification once police resources become concentrated in the new focal area, until the distribution of reported crimes eventually shifts again.

3. The Effects of Data Variation on Bias

This subsection examines how different crime data affect the fairness and accuracy outcomes across the algorithms. We compared scenarios using aggravated assault records with those using all crime records in order to analyze the change in fairness and accuracy.

Finding 3.1: Expanding from aggravated assault to all crimes increased inequality of the police-to-crime ratio and reduced accuracy for long-term KDE and PredPol.
When changing the crime type from aggravated assault to all crimes, both neighborhood-level fairness based on the police-to-crime ratio and accuracy dropped in most scenarios (see Figure 22 and Figure 23).
Finding 3.2: Different models could have different bias outcomes when applied to different data, and vice versa.
When looking at the slope of the trendline to determine which race is receiving a higher average police share (Figure 24) or higher average police-to-crime ratio (Figure 25) as the simulation days pass, we observe that when feeding aggravated assault crime records to PredPol, the focus is more on Black neighborhoods, whereas the focus flips to White neighborhoods when using all crime records. This is observed in most scenarios and for both metrics. This result is noteworthy because the debate around bias in predictive policing typically makes the implicit assumption that algorithms are solely biased against Black neighborhoods, following the work of Lum and Isaac [9].

4. Police Concentration Areas in Baltimore

This subsection focuses on how police resources are distributed across neighborhoods over the course of the simulation. To be more precise, we looked at the number of scenarios in which each of the top-10 most frequently over-policed neighborhoods were ranked as top-3.

Finding 4.1: Assigning officers based on aggravated assault led to over-policing more Black neighborhoods than White neighborhoods
Over-policing is defined as having a higher police share compared to crime share. Looking at the top-10 most frequently over-policed neighborhoods by each algorithm when using aggravated assault records, we observed that other than Downtown, which is the most frequently over-policed neighborhood in the scenarios, all other top-10 frequently over-policed neighborhoods were Black, such as Belair-Edison and Sandtown-Winchester (Figure 26).
Finding 4.2: Assigning officers based on all crime records by PredPol over-policed mostly White neighborhoods such as Mount Vernon and Canton, unlike the other two models.
Looking at the top-10 most frequently over-policed neighborhoods by each algorithm when all crime records were used, we saw that other than Sandtown-Winchester and Cherry Hill, all other top-10 frequently over-policed neighborhoods by PredPol were White, such as Mount Vernon, Canton, and Brooklyn (Figure 27). Among short-term KDE’s over-policed neighborhoods, only two of six (Paterson Park and Inner Harbor) were White; among long-term KDE’s over-policed neighborhoods, one out of four (Brooklyn) was White.

5. Discussion

We performed simulations in a range of scenarios comparing different policing systems using a Baltimore crime dataset. Based on our simulations and analyses, we draw the following overall conclusions:

Fairness and Accuracy Comparison
By comparing simulation results of representative predictive policing and hot spot policing algorithms, we discovered that predictive policing via PredPol was generally more accurate and had higher racial and neighborhood-level fairness than hot spot policing via short-term and long-term KDE across most scenarios.
Long-term KDE, where we widened KDE’s data input to use the same data received by PredPol, improved accuracy and neighborhood-level fairness over that of short-term KDE but worsened racial fairness. The improvement in accuracy did not overtake PredPol’s level of accuracy but almost matched its neighborhood-level fairness.
The theory of hot spot policing claims that the distribution of recent crimes can approximate their near-future distribution. Recent crimes are often defined as those within either a month or a year of the prediction date [72]. In our experiment, both long-term KDE and PredPol received the crime history from a year before the starting date of the simulation. The simulation’s starting prediction date was 1 January 2019, and the crime history included all crimes after 1 January 2018. If the crime history extended back more than a year, we would expect the accuracy at the starting date to drop, but overall accuracy might improve because of our finding that a longer crime history slows down the effect of neighborhood-level bias for KDE, which in turn slows the drop in accuracy.
Our results show that higher accuracy does not guarantee a racially fairer model. Both fair and unfair outcomes were observed across the fairness spectrum for high-accuracy scenarios (Figure 11). However, neighborhood-level fairness seemed to correlate with accuracy (Figure 10). These results extend the work of Mohler et al. [12] on accuracy–fairness tradeoffs by providing evidence from extended simulations on real crime data. Unlike the observation of Mohler et al. that fairness comes at a cost in accuracy, some of our scenarios were both fairer and more accurate than others. This allows us to infer that the tradeoff between accuracy and fairness depends on both the fairness metric and the demographic geography of the location the system is customized for (Figure 10 and Figure 11).
Temporal Feedback Loop Effects
We saw a trend of bias amplification in most of the scenarios for all three systems, except when a change in the pattern of the data during the simulation caused the police focus to shift from certain neighborhoods to others. In scenarios where bias did not become worse, the race that received the highest focus at the start of the simulation differed from the race that received it at the end, as shown in Figure 19 and Figure 20. Note that negative slopes of the fairness gaps indicate scenarios that improved with regard to that metric. For these scenarios, bias amplification might have happened later on if the simulation had been continued for a longer period, unless the reported crime distribution changed drastically again. Therefore, when using real crime data we saw that bias amplification was not constant, instead depending on changes in the distribution of newly discovered and reported crimes. For neighborhood-level fairness, we saw a continual increase in bias for most of the scenarios, especially for PredPol and short-term KDE (Figure 17 and Figure 18). For racial fairness, fewer scenarios experienced bias amplification than for neighborhood-level fairness, especially for the metric related to average police share (Figure 19). However, it should be noted that all scenarios with bias improvement also showed a mismatch between the race the initial police focus was on and the race the focus was trending towards.
It is important to consider that we defined bias amplification as a falling trend in a system’s equality of treatment. In all scenarios where we saw improvement in equality, we also observed that the race receiving a higher policing share at the beginning of the simulation was different from the race that the trend was towards; therefore, a longer simulation duration is needed to rule out a future point of racial equality after which bias amplification towards the trending race occurs. We did not extend the simulation further, as longer durations introduce a shift in the underlying crime distribution that interacts with the feedback effect, making it difficult to attribute the changes in fairness to a single mechanism. Whereas the theoretical study of Ensign et al. showed constant bias amplification when assigning an officer between two neighborhoods [10], our results suggest that in a more realistic situation such amplification is not constant and could temporarily subside before intensifying again.
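The trend-based reading of bias amplification described above can be sketched as the least-squares slope of a daily fairness-gap series: a positive slope means a widening gap (amplification) and a negative slope means improvement. The function names, the tolerance, and the example series are hypothetical; the authors' exact trendline fitting procedure is not specified in this section.

```python
def trend_slope(values):
    """Ordinary least-squares slope of `values` against day index 0..n-1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two daily values")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxy = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    sxx = sum((i - mean_x) ** 2 for i in range(n))
    return sxy / sxx


def classify_trend(gap_series, tolerance=1e-9):
    """Label a fairness-gap series by the sign of its trendline slope."""
    slope = trend_slope(gap_series)
    if slope > tolerance:
        return "amplifying"  # gap widens over the simulation days
    if slope < -tolerance:
        return "improving"   # gap narrows over the simulation days
    return "flat"


# Hypothetical daily fairness gaps (not taken from the paper's results).
widening = [0.10, 0.12, 0.15, 0.19]
label = classify_trend(widening)
```

A single slope over the whole run cannot reveal a gap that narrows and then widens again, which is the pattern the discussion warns about; fitting the slope over sliding windows would be one way to check for it.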
Although PredPol was generally fairer and more accurate on average in most of the scenarios, its speed of bias amplification was higher for most of the metrics in most of the scenarios, while the speed of bias amplification for long-term KDE was substantially slower compared to the other two in most scenarios. This provides further evidence that different models respond differently to feedback loops.
Crime Type Effect
In the context of Baltimore, switching the input data from aggravated assault records to all crime records flipped the police concentration from one demographic to another for PredPol, and in some scenarios for short-term and long-term KDE as well. This result shows that the direction of racial bias can be affected not only by the predictive policing system but also by the data it is applied to.
We also observed that the average accuracy and average neighborhood-level equality based on the police-to-crime ratio dropped for long-term KDE and PredPol. This suggests that systems with more focus on the long-term effects of present crime on predictions of future crime might become less accurate when daily records are sparse. Alternatively, it could be that data records had a higher change in distribution over a short period of time. The sparseness of crime data and its effect on different systems could be studied further in future work.
Data-Driven vs. Model-Driven Bias
Our analyses found that both data and model matter: the same data fed into different models (KDE vs. PredPol) produced different fairness outcomes as previously shown by Chapman et al. [5]. The same model fed different data (aggravated assault vs. total crime) can affect the speed of bias development, or even flip the direction of bias.
Baltimore-Specific Insights
Black neighborhoods received more policing on average in most simulations, although this varied with crime type and algorithm. In most scenarios, the trend was toward assigning a higher average share of officers in general, and also toward a higher average share of officers per share of crime to White neighborhoods when all crime records were used. This finding counters the widely held assumption that predictive policing is biased specifically against Black neighborhoods, as found by Lum and Isaac in data from Oakland, California [9]. These patterns are not solely explainable by neighborhood count (Baltimore has more Black than White neighborhoods), suggesting that model behavior plays a larger role than geography alone. Furthermore, when examining the top-5 highest crime neighborhoods in different period lengths before the start of the simulation (cf. Figure 28), we observed the percentage of crimes occurring in White neighborhoods to be substantially higher for total crimes as compared to aggravated assault. This indicates that the crime distribution differs by crime type, and could affect the direction of bias.
However, when comparing the police allocation of different models with crime percentage concentrations (cf. Figure 29), we observe that the models produce different police distributions even on the first day of the simulation. These differences reflect inherent differences in model behavior. Since longer-term results include the influence of feedback loops, these initial discrepancies are likely to compound over time, leading to the observed long-run differences in speed or extent of bias.
Thus, both data and model behavior appear to affect not only the magnitude of bias in Baltimore, but also its direction and how the direction evolves over time.
Downtown, Sandtown-Winchester, and Blair-Edison were repeatedly among the top over-policed neighborhoods when aggravated assault records were used. This consistency across models suggests that some neighborhoods are structurally favored or targeted regardless of which predictive algorithm is used.
However, when looking at over-policed neighborhoods when using all crime records, we see that the models are not as consistent. This is another demonstration of how data and algorithms can both affect the results. The only neighborhood the algorithms commonly over-policed when using total crime records was Sandtown-Winchester, a Black neighborhood.
These Baltimore predictive policing simulation results are consistent with previous studies on other cities, including both real experimental studies [7] and simulation studies [9], in that police concentration tendencies occurred in the models due to feedback loops.
Based on the duration of the simulation, these systems might change rank in terms of each bias metric or even average accuracy. A short simulation duration might indicate PredPol as causing the most uniform police distribution (based on police Gini coefficient), followed by long-term KDE and then short-term KDE, while a longer simulation duration might cause a rank swap between long-term KDE and PredPol. These variables could be studied in future system-comparative works.
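For readers unfamiliar with the Gini coefficient used for the police-distribution metric mentioned above, the following is a minimal sketch of the standard formula applied to per-neighborhood officer counts. It assumes the conventional definition (0 for a perfectly even distribution, approaching 1 when one neighborhood receives everything); whether the authors apply any additional normalization is not stated in this section.

```python
def gini(values):
    """Gini coefficient of non-negative values.

    Uses the sorted-data form G = sum_i (2i - n - 1) * x_i / (n * sum(x)),
    with i running from 1 to n over values in ascending order.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1))
    return weighted / (n * total)


# Hypothetical officer counts across four neighborhoods.
even_allocation = gini([5, 5, 5, 5])           # perfectly uniform allocation
concentrated_allocation = gini([0, 0, 0, 20])  # all officers in one neighborhood
```

The same function could be applied to per-neighborhood police-to-crime ratios instead of raw officer counts to obtain a ratio-based variant of the metric.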
For Baltimore City, our recommendations are as follows:
– Although predictive policing remained fairer and more accurate than hot spots-based policing, it had a higher speed of bias amplification than hot spots policing in most scenarios. Hence, we advise the city to be aware of the long-term tendencies of any predictive policing system they might use.
– Evaluation studies like the one performed here should be performed prior to real-world implementation. These studies can highlight bias issues arising from the combination of a particular algorithm and data, leading to decision-making that is better informed.
– When distributing resources using predictive policing models based on aggravated assault records, authorities should be mindful of assigning too many officers to the Downtown, Blair-Edison, and Sandtown-Winchester neighborhoods. We advise careful interpretation of results when applying all crime records to the predictive policing model, since Sandtown-Winchester, Brooklyn, Mount Vernon, and Canton could incorrectly appear to have higher crime rates due to feedback loop phenomena. Similarly, the city could prepare a list of neighborhoods that might be under-rated, under-policed, and in need of more attention.
– Our results highlight the importance of actively monitoring long-term fairness trends when deploying predictive policing systems. In particular, it is important to distinguish between estimated or predicted trends in model behavior obtained through pre-deployment evaluation and observed trends that emerge from real-world implementation. In practice, this could involve periodic reassessment of both predicted and observed system behavior over time. At the end of each evaluation period, stakeholders could conduct audits of police allocation distributions and fairness indicators (e.g., racial or neighborhood-level fairness) to detect emerging bias. These audits should compare observed trends with previously estimated tendencies and current outcomes with those from previous periods. Such comparisons could help to identify deviations between expected and actual system behavior as well as shifts in bias over time. Decisions made based on estimated trends may also influence future outcomes, which should be documented and studied.
– Engaging domain experts and community stakeholders in defining additional fairness metrics and reviewing these trends could bring to light new aspects involving fairness and community welfare and provide critical context for interpreting system behaviors.
Together, these steps can help to ensure that predictive policing systems remain aligned with fairness objectives over extended deployment periods.
Policy Implications
Based on our examination of various aspects of social fairness, policymakers should look beyond the promised ability to reduce and prevent crime when authorizing deployment or continuation of any smart policing system. More broadly, achieving a society where every individual has equal opportunity to succeed, and in turn to help their community prosper, requires coordinated efforts among stakeholders along with a continuous evaluation process to ensure that these systems operate as intended.
Our results demonstrate that predictive policing systems exhibit dynamic behavior over time, which affects how deployments of these systems need to be regulated. Policies regarding their evaluation should be ongoing, meaning that continuous pre- and post-deployment evaluations through periodic audits need to be performed rather than a one-time pre-deployment evaluation. Appropriate laws and policies should establish a cyclic evaluation framework that includes the following steps:
1. Defining/updating fairness and accuracy metrics.
2. Defining/updating the system dynamics and behavioral responses, e.g., how police presence changes crime distribution, how different environmental factors change the probability of crime being reported, etc.
3. Defining/updating policies and regulations.
4. Pre-deployment evaluation of the policing system’s behavior.
5. Monitoring real-world outcomes.
6. Conducting causal and experimental analysis of outcome data to better understand interactions among the metrics and the system’s responses.
7. Interpretation of the results by community stakeholders and domain experts.
This approach shifts the focus from selecting the best policing system to establishing a robust evaluation and governance process that directs the development of the policing system, simulation framework, and evaluation framework itself. Such a process can adapt to long-term system dynamics while ensuring that the system meets fairness objectives over time.
Furthermore, a dynamic governance process and continuous re-evaluation that engages community stakeholders to promote transparency and accountability will result in higher levels of community cooperation, thereby building trust between citizens and law enforcement agencies. When stakeholders are aware that system performance is regularly assessed and corrective actions are being taken, confidence in the system will improve and tolerance for inadequacies and enforcement errors will increase.

6. Limitations

Limitations of this work include:

Every individual scenario parameter (e.g., crime type, number of officers, probability of report) and hyperparameter (such as the base probability of crime detection by an officer within the detection radius, i.e., the crime’s neighborhood) could be explored over a wider range of values to assess robustness and their influence on the results.
We assumed that the crimes in the database represented all crimes that actually happened, which is untrue. These records could be the result of an already biased system, such as by having less crime reported or discovered in certain areas.
We used real crime data, which we filtered by using a fixed report probability for all neighborhoods. The approach would be more realistic if different crimes were filtered based on their report probability instead of using an average report probability for all crimes. Using varying report probabilities across neighborhoods and/or crime types could change the distribution of observed crimes (detected + reported) and potentially influence the fairness and accuracy outcomes of the algorithms. Exploring neighborhood-specific and/or crime type-specific reporting rates would be a valuable expansion for future work.
We did not account for how the presence of police in a neighborhood could affect the crime distribution. The only change in crime distribution over the course of our simulations consisted of what was hidden in the original crime data, without any calculated manual changes. Currently, we expect that by running the simulation for longer periods, these systems would potentially experience a flip in their bias direction tendency when the distribution of the filtered reported crimes eventually changes by suddenly having several days of high crime rates in neighborhoods with different demographic majorities.
In this work, we considered a detection formula that implicitly assumes crime detection to be determined exclusively by the number of deployed officers. Nevertheless, in real-world deployments detection rates may also be affected by the model-estimated level of crime risk itself. Officers entering a predicted high-crime area typically behave with elevated vigilance, which can inform how they evaluate behavior and distribute attention [73]. As a result, detection probability may be determined not only by officer count but also by the broader sociotechnical context induced by the predictive mechanism.
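The report-probability limitation above suggests replacing the single fixed report probability with neighborhood-specific values. A minimal sketch of how such a filter might look is given below. The record layout (dictionaries with a "neighborhood" key), the function name, and all probability values are hypothetical, and the sketch covers only the reporting step, not detection by officers.

```python
import random


def filter_reported(crimes, default_prob, prob_by_neighborhood=None, seed=None):
    """Keep each crime record with its neighborhood's report probability.

    `prob_by_neighborhood` maps a neighborhood name to a report probability;
    neighborhoods without an entry fall back to `default_prob`, which
    reproduces the fixed-probability setup when no mapping is given.
    """
    rng = random.Random(seed)  # seeded for reproducible simulation runs
    probs = prob_by_neighborhood or {}
    reported = []
    for crime in crimes:
        p = probs.get(crime["neighborhood"], default_prob)
        if rng.random() < p:
            reported.append(crime)
    return reported


# Hypothetical records; real inputs would come from the crime dataset.
records = [{"id": i, "neighborhood": "A" if i % 2 else "B"} for i in range(10)]
subset = filter_reported(records, default_prob=0.5,
                         prob_by_neighborhood={"A": 0.9}, seed=42)
```

The same mapping could be keyed by crime type, or by a (neighborhood, crime type) pair, to explore the crime type-specific reporting rates mentioned above.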

7. Conclusions

In this study, we develop an agent-based modeling simulation to investigate whether the use of hot spot policing (short-term and long-term KDE) and predictive policing (PredPol) leads to feedback loops and hence to racial bias in Baltimore. We define accuracy and fairness metrics to evaluate and compare the systems’ behavior in this locality. In doing so, we extend the body of existing literature by providing a framework consisting of fairness and accuracy metrics to compare and evaluate different systems’ long-term localized tendencies. This approach provides insights into the resource distribution tendencies of each model, identifying neighborhoods that might be over- or under-rated as crime hotspots by looking at which neighborhoods are over-policed or under-policed during a 300-day simulation period. Our results using 2019–2020 data from Baltimore City show that predictive policing and hot spot-based policing are indeed subject both to feedback loops and to racial and neighborhood-level bias. Although the speed of bias amplification was higher for predictive policing, it was still the fairer and more accurate system on average compared to hot spot-based policing systems in our 300-day simulations. However, our results suggest that the relatively high vulnerability of predictive policing to feedback loops could result in poorer fairness and accuracy than hot spot policing over the long term.

Our findings in these 300-day simulation scenarios of Baltimore show that, contrary to the generally assumed bias trends, the police concentration trends when using all crime records were mostly towards White neighborhoods. Downtown Baltimore is a mostly White-dominant area, and appeared as the neighborhood receiving the highest policing shares (with a rising trend) in most of the simulation scenarios. This shows that fairness issues in predictive policing algorithms have impacts beyond the Black community, which could motivate a broader set of stakeholders to work together to find solutions that benefit the entire community. It is also crucial to continuously monitor the fairness dynamics in real-world deployments. In particular, comparing predicted model behavior with observed outcomes over time, along with periodic auditing of fairness metrics, would help practitioners to identify emerging biases and adjust their decision-making strategies accordingly. Overall, our results emphasize the need for continuous localized fairness assessments and cautious use of predictive policing tools in sensitive environments.

Author Contributions

Conceptualization, J.F., K.L.P. and S.S.; data curation, S.S.; formal analysis, S.S.; funding acquisition, J.F.; investigation, S.S. and K.L.P.; methodology, J.F. and S.S.; project administration, J.F.; resources, S.S.; software, S.S. and K.L.P.; supervision, J.F. and G.W.; validation, S.S. and G.W.; visualization, S.S.; writing—original draft preparation, S.S., K.L.P. and J.F.; writing—review and editing, S.S., J.F., K.L.P. and G.W. All authors have read and agreed to the published version of the manuscript.

Funding

This material is based upon work supported by the National Science Foundation under Grant Nos. IS1927486; IIS2046381. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Institutional Review Board Statement

No new data were collected for this study, and all data used are publicly available. Therefore, this study did not require an IRB.

Data Availability Statement

The datasets used in this study were originally retrieved from the Baltimore City Open Data Portal. As they are no longer available at the original links, archived copies along with the code are provided at: https://github.com/saminsemsar/Data_Analysis_Portfolio/tree/main/PredictivePolicing (accessed on: 1 May 2026).

Acknowledgments

We extend our gratitude to the undergraduate and graduate students who previously contributed to this project, laying the groundwork for the research presented here. Although their specific contributions did not appear in this paper, their efforts were valuable in preparing for this study. We would like to thank the graduate student researchers Sambhaw Sharma, Ashwathy Samivel Sureshkumar, Harish Ramamoorthy, Akarshika Singhal, Vamshi Krishna Yenmangandla, Pranvat Singh, and Dharmil Shah, whose coding contributions aided the preliminary exploratory analysis, as well as the undergraduate student researchers Aminat Alabi and Shaniah Reece for the literature review and critical analysis of the local context.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:

KDE	Kernel Density Estimation
Avg.	Average
Abs.	Absolute

Appendix A. Sensitivity Analysis to Reporting Probabilities

To assess the robustness of our findings to the report probability, we conducted additional experiments with two other report probabilities (0.2 and 0.6), each of which was run 20 times. Using $pReport$ values from the range 0.0–0.6, we performed statistical tests to check the sensitivity of accuracy and fairness outcomes to report probabilities; Table A1 includes the mean relative difference for each pair of algorithms based on a metric, the low and high values from the 95% confidence interval, and the F-value and p-value from the ANOVA test. Looking at these numbers, it is apparent that even with other report probabilities, PredPol was still the most accurate and fair, as reported before. However, the algorithm with the highest speed of bias development was short-term KDE for two of the metrics and PredPol for the other two. The ANOVA test scores show that the extent of difference between each algorithm pair was significantly affected by the report probability for nearly all metrics; the exception is the PredPol vs. long-term KDE pair for RacialFairnessGap_avgPolShare (p = 0.353). Note that a high F-value indicates a significant difference among the grouped values of $pct\_diff$ based on $pReport$, while a low p-value shows the statistical significance of the difference. To calculate the F-value and p-value, we used the one-way F-statistic implemented in the scipy.stats library.
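To make the ANOVA step concrete, the following sketch computes the one-way F statistic by hand for relative-difference values grouped by report probability. It is a pure-Python illustration, not the authors' code: the text says a scipy.stats routine was used (presumably `scipy.stats.f_oneway`, which also returns the p-value), and the grouped values below are hypothetical.

```python
def one_way_f(groups):
    """One-way ANOVA F statistic for a list of groups of numbers.

    F = (between-group mean square) / (within-group mean square).
    """
    k = len(groups)
    if k < 2 or any(len(g) == 0 for g in groups):
        raise ValueError("need at least two non-empty groups")
    n_total = sum(len(g) for g in groups)
    if n_total <= k:
        raise ValueError("need more observations than groups")
    means = [sum(g) / len(g) for g in groups]
    grand_mean = sum(sum(g) for g in groups) / n_total
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    if ss_within == 0:
        return float("inf")  # no within-group variation at all
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


# Hypothetical pct_diff values for one algorithm pair, grouped by pReport.
pct_diff_by_preport = {
    0.2: [21.0, 23.5, 22.1],
    0.4: [25.2, 26.0, 24.4],
    0.6: [29.8, 31.1, 30.2],
}
f_value = one_way_f(list(pct_diff_by_preport.values()))
```

A large F value means the group means differ by much more than the spread within each group, which is the sense in which the report probability "affects" the pairwise difference.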

Table A1. Pairwise relative difference of models, low and high confidence interval, and P and F values of the ANOVA test across metrics.

Metric	Algorithm_A	Algorithm_B	Mean_pct_diff	CI_low_pct	CI_high_pct	ANOVA_F_pReport	ANOVA_p_pReport
Accuracy	KDE_LongTerm	KDE_ShortTerm	23	20.4	25.5	85.9	1.11 × $10^{- 40}$
Accuracy	PredPol	KDE_LongTerm	4.64	4.17	5.1	26.2	3.47 × $10^{- 15}$
Accuracy	PredPol	KDE_ShortTerm	27.5	25.1	29.9	71.6	2.29 × $10^{- 35}$
NeighborhoodFairnessGap_pGini	KDE_LongTerm	KDE_ShortTerm	−11.2	−12.2	−10.2	23.3	1.22 × $10^{- 13}$
NeighborhoodFairnessGap_pGini	PredPol	KDE_LongTerm	−0.326	−0.5	−0.151	84.9	2.58 × $10^{- 40}$
NeighborhoodFairnessGap_pGini	PredPol	KDE_ShortTerm	−11.5	−12.5	−10.5	11.5	3.52 × $10^{- 7}$
NeighborhoodFairnessGap_pcrGini	KDE_LongTerm	KDE_ShortTerm	−8.75	−9.69	−7.81	10.4	1.42 × $10^{- 6}$
NeighborhoodFairnessGap_pcrGini	PredPol	KDE_LongTerm	−0.22	−0.327	−0.113	46.3	8 × $10^{- 25}$
NeighborhoodFairnessGap_pcrGini	PredPol	KDE_ShortTerm	−8.97	−9.92	−8.01	6.14	0.000455
NeighborhoodFairness_pGini_absTrend	KDE_ShortTerm	KDE_LongTerm	126	121	131	49.2	3.95 × $10^{- 26}$
NeighborhoodFairness_pGini_absTrend	KDE_ShortTerm	PredPol	8.12	2.46	13.8	59.5	1.68 × $10^{- 30}$
NeighborhoodFairness_pGini_absTrend	PredPol	KDE_LongTerm	122	116	128	7.46	7.67 × $10^{- 5}$
NeighborhoodFairness_pcrGini_absTrend	KDE_ShortTerm	KDE_LongTerm	79.1	72.2	85.9	52.6	1.29 × $10^{- 27}$
NeighborhoodFairness_pcrGini_absTrend	PredPol	KDE_LongTerm	89.6	84.5	94.7	18.5	4.55 × $10^{- 11}$
NeighborhoodFairness_pcrGini_absTrend	PredPol	KDE_ShortTerm	8.76	3.55	14	136	1.52 × $10^{- 56}$
RacialFairnessGap_PCR	KDE_ShortTerm	KDE_LongTerm	−4.4	−5.41	−3.38	28.1	3.94 × $10^{- 16}$
RacialFairnessGap_PCR	PredPol	KDE_LongTerm	−9.65	−10.5	−8.82	5.38	0.00127
RacialFairnessGap_PCR	PredPol	KDE_ShortTerm	−5.26	−6.04	−4.48	20.7	2.93 × $10^{- 12}$
RacialFairnessGap_PCR_absTrend	KDE_LongTerm	KDE_ShortTerm	23.7	13.2	34.3	9.21	7.4 × $10^{- 6}$
RacialFairnessGap_PCR_absTrend	PredPol	KDE_LongTerm	3.74	−5.78	13.3	6.37	0.000335
RacialFairnessGap_PCR_absTrend	PredPol	KDE_ShortTerm	30.1	20.6	39.7	5.17	0.00169
RacialFairnessGap_avgPolShare	KDE_ShortTerm	KDE_LongTerm	−6.57	−9.03	−4.11	10	2.54 × $10^{- 6}$
RacialFairnessGap_avgPolShare	PredPol	KDE_LongTerm	−11.6	−13	−10.2	1.09	0.353
RacialFairnessGap_avgPolShare	PredPol	KDE_ShortTerm	−4.95	−7	−2.91	14	1.31 × $10^{- 8}$
RacialFairnessGap_avgPolShare_absTrend	KDE_ShortTerm	KDE_LongTerm	74.6	63.4	85.8	23.8	6.24 × $10^{- 14}$
RacialFairnessGap_avgPolShare_absTrend	KDE_ShortTerm	PredPol	59.6	50.2	69	5.38	0.00126
RacialFairnessGap_avgPolShare_absTrend	PredPol	KDE_LongTerm	26.5	14.6	38.4	14.8	4.89 × $10^{- 9}$

The $\chi^{2}$ analysis used the chi-square contingency function implemented in the scipy.stats library. The results shown in Table A2 contain consistently low p-values and moderate-to-high normalized $\chi^{2}$ values, indicating that the distribution of the better-performing algorithms varies significantly across $pReport$ values for all metrics. This means that the report probability influences how frequently each algorithm achieves the best performance.
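The chi-square test of dependence can be sketched on a contingency table whose rows are report probabilities and whose columns are algorithms, with each cell counting how often that algorithm performed best. The sketch below computes the statistic, the degrees of freedom, and their ratio by hand. Reading the "Chi2/DoF" column of Table A2 as this ratio is an assumption, the win counts are hypothetical, and the text indicates that a scipy.stats routine (presumably `scipy.stats.chi2_contingency`, which also returns the p-value) was used rather than code like this.

```python
def chi2_per_dof(table):
    """Pearson chi-square statistic for a contingency table.

    `table` is a list of rows of observed counts; every row and column total
    must be positive. Returns (chi2, dof, chi2 / dof). No continuity
    correction is applied.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if winning algorithm were independent of pReport.
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return chi2, dof, chi2 / dof


# Hypothetical win counts out of 20 runs per report probability.
# Columns: PredPol, long-term KDE, short-term KDE.
wins = [
    [12, 6, 2],  # pReport = 0.2
    [15, 4, 1],  # pReport = 0.4
    [9, 8, 3],   # pReport = 0.6
]
chi2, dof, normalized = chi2_per_dof(wins)
```

Dividing by the degrees of freedom makes values comparable across tables of different shapes, which is one plausible reason for reporting the normalized statistic.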

Table A2. Chi-square test results for dependence between reporting probability and winning algorithm.

Metric	Chi2/DoF	p-Value
Accuracy	12.4	5.35 × $10^{- 14}$
RacialFairnessGap_avgPolShare	4.11	0.000395
RacialFairnessGap_PCR	8.18	7.1 × $10^{- 9}$
NeighborhoodFairnessGap_pGini	35.1	1.15 × $10^{- 22}$
NeighborhoodFairnessGap_pcrGini	27.4	6.21 × $10^{- 33}$
RacialFairnessGap_avgPolShare_absTrend	7.71	2.63 × $10^{- 8}$
RacialFairnessGap_PCR_absTrend	5.91	3.55 × $10^{- 6}$
NeighborhoodFairness_pGini_absTrend	8.18	7.19 × $10^{- 9}$
NeighborhoodFairness_pcrGini_absTrend	9.71	1.01 × $10^{- 10}$

For the fairness and accuracy metrics in Figure A1, the left column shows that although the probability of achieving the best performance differs across $pReport$ values, the relative ranking of the algorithms remains largely stable. The right column also illustrates the magnitude of difference in the average value and 95% confidence interval of each metric for each model across $pReport$ values, confirming near-fully consistent higher performance for PredPol. This shows a level of robustness in the reported advantage of PredPol over the other two models against the report probability for average fairness and accuracy during the 300-day simulation.

Figure A1. Sensitivity of averaged-value metrics to reporting probability. The left column shows dominance curves, indicating the probability of each algorithm attaining the best performance at each reporting probability. The right column shows the average value of the corresponding metrics across the report probabilities, indicating how the metric values themselves vary with the reporting probability.

For the trend-related metrics, the same plots in Figure A2 support robustness to the report probability only for the lower speed of bias development of long-term KDE and the near-consistently higher speed of neighborhood-level bias development of PredPol. We cannot claim $pReport$ robustness of the results for the speed of racial bias development.

Figure A2. Sensitivity of temporal trend metrics to reporting probability. The left column shows dominance curves, indicating the probability of each algorithm attaining the best value at each reporting probability. The right column shows the value of the corresponding metrics across the report probabilities for the absolute trend measures.

Appendix B. Correlation Analysis of Fairness and Accuracy

We analyzed the fairness–accuracy relation in all fairness–accuracy spaces. To do this, the silhouette score for each scenario parameter was calculated to determine the rankings of the parameters that yielded the most distinct clusters of points in each of the four fairness–accuracy spaces (Figure A3).
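As an illustration of the silhouette-based ranking described above, the sketch below computes the mean silhouette coefficient of points in a two-dimensional accuracy–fairness space when they are labeled by the values of one scenario parameter, and then ranks parameters by that score. It is a pure-Python version with Euclidean distance; a library routine such as scikit-learn's `silhouette_score` would normally be used, and the library, distance metric, and any scaling the authors applied are not stated here. All data values are hypothetical.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient of `points` grouped by `labels`."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("need at least two distinct labels")
    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the point's own cluster.
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: mean distance to the nearest other cluster.
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


def rank_parameters(points, labels_by_parameter):
    """Sort scenario parameters by how cleanly their values separate points."""
    scored = [(name, silhouette_score(points, labels))
              for name, labels in labels_by_parameter.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical (accuracy, fairness gap) points for four scenarios.
space = [(0.30, 0.10), (0.32, 0.11), (0.60, 0.40), (0.62, 0.41)]
ranking = rank_parameters(space, {
    "crime_type": ["assault", "assault", "all", "all"],
    "number_of_police": [50, 100, 50, 100],
})
```

In this toy example the points separate cleanly by `crime_type` but not by `number_of_police`, so `crime_type` is ranked first.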

We then computed correlations within progressively refined groupings, starting from the highest silhouette score and incrementally adding parameters while the subgroups contained sufficient data points (at least three) for correlation estimation (Table A3).

The results show the same negative correlation between neighborhood-level fairness gap and accuracy in the numbers (Table A3) as was visible in Figure 10. This is the case even after multiple levels of groupings, although with reduced magnitude. In contrast, little to no consistent correlation is found between racial fairness gap and accuracy across groupings (see Table A3, Figure 11).
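The progressively refined grouping can be sketched as follows: scenarios are grouped by a list of parameters, the accuracy–fairness correlation is computed within each group that has at least three points, and the mean and spread of those correlations are reported. Pearson correlation, the population standard deviation, and the field names are assumptions made for illustration; the section does not state which correlation coefficient was used.

```python
import math
from collections import defaultdict


def pearson(xs, ys):
    """Pearson correlation, or None if either variable is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    return sxy / math.sqrt(sxx * syy)


def grouped_correlations(rows, group_keys, x_key="accuracy",
                         y_key="fairness_gap", min_points=3):
    """Return (number of groups used, mean correlation, std of correlations)."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in group_keys)].append(row)
    corrs = []
    for members in groups.values():
        if len(members) < min_points:
            continue  # too few points for a correlation estimate
        r = pearson([m[x_key] for m in members], [m[y_key] for m in members])
        if r is not None:
            corrs.append(r)
    if not corrs:
        return 0, None, None
    avg = sum(corrs) / len(corrs)
    std = math.sqrt(sum((c - avg) ** 2 for c in corrs) / len(corrs))
    return len(corrs), avg, std


# Hypothetical scenario summaries; real rows would hold simulation outputs.
rows = [
    {"number_of_police": 50, "accuracy": 0.30, "fairness_gap": 0.40},
    {"number_of_police": 50, "accuracy": 0.35, "fairness_gap": 0.33},
    {"number_of_police": 50, "accuracy": 0.42, "fairness_gap": 0.25},
    {"number_of_police": 100, "accuracy": 0.45, "fairness_gap": 0.30},
    {"number_of_police": 100, "accuracy": 0.50, "fairness_gap": 0.22},
    {"number_of_police": 100, "accuracy": 0.58, "fairness_gap": 0.15},
]
overall = grouped_correlations(rows, [])                       # one group
by_police = grouped_correlations(rows, ["number_of_police"])   # two groups
```

Passing an empty list of grouping keys treats all scenarios as one group, which corresponds to the ungrouped first row for each metric in Table A3.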

Figure A3. Silhouette scores of fairness metrics for clustering in accuracy–fairness space.

Table A3. Fairness–accuracy correlation analysis under progressively refined groupings. For each fairness metric, grouping variables are added sequentially in order of their silhouette score, with the correlation between accuracy and that fairness metric calculated in the resulting subgroups. The table reports the number of groups (Num Gs), average correlation (Avg. Corr), and its interpretation (Trend) plus the correlation variability (Std. Corr) and its interpretation (consistency).

Metric	Grouping Variables	Num Gs	Avg Corr	Std. Corr	Trend	Consistency
avg_police_gini		1	−0.976	0	strong negative	high
avg_police_gini	number_of_police	2	−0.864	0.075	strong negative	high
avg_police_gini	number_of_police + crime_type	4	−0.929	0.067	strong negative	high
avg_police_gini	number_of_police + crime_type + Algorithm	12	−0.61	0.533	moderate negative	low
avg_policeCrime_ratio_gini		1	−0.957	0	strong negative	high
avg_policeCrime_ratio_gini	number_of_police	2	−0.842	0.118	strong negative	high
avg_policeCrime_ratio_gini	number_of_police + crime_type	4	−0.606	0.603	moderate negative	low
avg_policeCrime_ratio_gini	number_of_police + crime_type + Algorithm	12	−0.429	0.67	moderate negative	low
avg_racial_fairnessGap_PCR		1	−0.253	0	weak negative	high
avg_racial_fairnessGap_PCR	crime_type	2	−0.314	0.523	weak negative	low
avg_racial_fairnessGap_PCR	crime_type + number_of_police	4	−0.072	0.214	very weak negative	medium
avg_racial_fairnessGap_PCR	crime_type + number_of_police + Algorithm	12	0.09	0.431	very weak positive	low
avg_racial_fairnessGap_avgPolShare		1	−0.131	0	very weak negative	high
avg_racial_fairnessGap_avgPolShare	crime_type	2	−0.27	0.522	weak negative	low
avg_racial_fairnessGap_avgPolShare	crime_type + number_of_police	4	−0.094	0.518	very weak negative	low
avg_racial_fairnessGap_avgPolShare	crime_type + number_of_police + Algorithm	12	−0.065	0.566	very weak negative	low

References

Zubair, T.; Fatima, S.K.; Ahmed, N.; Khan, A. Crime Hotspot Prediction Using Deep Graph Convolutional Networks. arXiv 2025, arXiv:2506.13116. [Google Scholar]
Mohler, G.O.; Short, M.B.; Brantingham, P.J.; Schoenberg, F.P.; Tita, G.E. Self-exciting point process modeling of crime. J. Am. Stat. Assoc. 2011, 106, 100–108. [Google Scholar] [CrossRef]
Braga, A.A.; Turchan, B.; Papachristos, A.V.; Hureau, D.M. Hot spots policing of small geographic areas effects on crime. Campbell Syst. Rev. 2019, 15, e1046. [Google Scholar] [CrossRef]
Bowers, K.J.; Johnson, S.D.; Pease, K. Prospective hot-spotting: The future of crime mapping? Br. J. Criminol. 2004, 44, 641–658. [Google Scholar] [CrossRef]
Chapman, A.; Grylls, P.; Ugwudike, P.; Gammack, D.; Ayling, J. A data-driven analysis of the interplay between criminological theory and predictive policing algorithms. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency; ACM: New York, NY, USA, 2022; pp. 36–45. [Google Scholar]
Perry, W.L.; McInnis, B.; Price, C.C.; Smith, S.C.; Hollywood, J.S. Predictive Policing: The Role of Crime Forecasting in Law Enforcement Operations; Rand Corporation: Santa Monica, CA, USA, 2013. [Google Scholar]
Mohler, G.O.; Short, M.B.; Malinowski, S.; Johnson, M.; Tita, G.E.; Bertozzi, A.L.; Brantingham, P.J. Randomized controlled field trials of predictive policing. J. Am. Stat. Assoc. 2015, 110, 1399–1411. [Google Scholar] [CrossRef]
Hu, Y.; Wang, F.; Guin, C.; Zhu, H. A spatio-temporal kernel density estimation framework for predictive crime hotspot mapping and evaluation. Appl. Geogr. 2018, 99, 89–97. [Google Scholar] [CrossRef]
Lum, K.; Isaac, W. To predict and serve? Significance 2016, 13, 14–19. [Google Scholar] [CrossRef]
Ensign, D.; Friedler, S.A.; Neville, S.; Scheidegger, C.; Venkatasubramanian, S. Runaway feedback loops in predictive policing. In Proceedings of the Conference on Fairness, Accountability and Transparency, New York, NY, USA, 23–24 February 2018; pp. 160–171. [Google Scholar]
Akpinar, N.J.; De-Arteaga, M.; Chouldechova, A. The effect of differential victim crime reporting on predictive policing systems. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency; ACM: New York, NY, USA, 2021; pp. 838–849. [Google Scholar]
Mohler, G.; Raje, R.; Carter, J.; Valasik, M.; Brantingham, J. A penalized likelihood method for balancing accuracy and fairness in predictive policing. In 2018 IEEE International Conference on Systems, Man, and Cybernetics (SMC); IEEE: New York, NY, USA, 2018; pp. 2454–2459. [Google Scholar]
American Civil Liberties Union. Plaintiffs Win Justice in Illegal Arrests Lawsuit Settlement with Baltimore City Police. 2010. Available online: https://www.aclu.org/press-releases/plaintiffs-win-justice-illegal-arrests-lawsuit-settlement-baltimore-city-police (accessed on 8 September 2025).
American Civil Liberties Union. ACLU Condemns Baltimore Police Department for Failing to Comply with Settlement Agreement in Illegal Arrests Case. 2012. Available online: https://www.aclu.org/press-releases/aclu-condemns-baltimore-police-department-failing-comply-settlement-agreement-illegal (accessed on 8 September 2025).
United States Department of Justice. Investigation of the Baltimore City Police Department. 2016. Available online: https://www.justice.gov/archives/opa/file/883366/dl?inline (accessed on 8 September 2025).
Brown, L.T. The Black Butterfly: The Harmful Politics of Race and Space in America; JHU Press: Baltimore, MD, USA, 2021. [Google Scholar]
Pietila, A. Not in My Neighborhood: How Bigotry Shaped a Great American City; Bloomsbury Publishing USA: New York, NY, USA, 2010. [Google Scholar]
Time Magazine. State of Emergency Is Declared in Baltimore as Riots Erupt. 27 April 2015. Available online: https://time.com/3837454/baltimore-looting-clashes-freddie-gray-police-protesters (accessed on 1 April 2026).
Makarechi, K. The Clock Didn’t Start with the Riots: Baltimore and Freddie Gray. 2015. Available online: https://www.vanityfair.com/news/2015/04/baltimore-riots-freddie-gray (accessed on 8 September 2025).
Prudente, T. Baltimore Mayor to Bring in Crime Fighting Strategist with High-Tech Policing Model. 2018. Available online: https://web.archive.org/web/20180201101655/http://www.baltimoresun.com/news/maryland/crime/bs-md-ci-sean-malinowski-20180123-story.html (accessed on 1 April 2026).
Zumer, B. Baltimore Police Department to Launch Predictive Policing Strategy. 2018. Available online: https://foxbaltimore.com/news/local/baltimore-police-to-launch-predictive-policing-strategy (accessed on 9 January 2025).
Hanlon, B.; Vicino, T.J. The fate of inner suburbs: Evidence from metropolitan Baltimore. Urban Geogr. 2007, 28, 249–275. [Google Scholar] [CrossRef]
Alexander, M. The New Jim Crow: Mass Incarceration in the Age of Colorblindness, revised ed.; The New Press: New York, NY, USA, 2012. [Google Scholar]
No Boundaries Coalition. Over-Policed, Yet Underserved: The People’s Findings Regarding Police Encounters and Accountability in Central West Baltimore. Technical Report, No Boundaries Coalition, 2016. Available online: https://www.noboundariescoalition.com/wp-content/uploads/2016/03/No-Boundaries-Layout-Web-1.pdf (accessed on 1 April 2026).
Densley, J.A. Over-policed and under-protected: Police violence as a symptom and cause of urban violence in America’s Black communities. In Public Health, Mental Health, and Mass Atrocity Prevention; Routledge: Abingdon, UK, 2021; pp. 71–88. [Google Scholar]
CBS News. Violent Crime Rate Spikes in Baltimore After Freddie Gray’s Death in Police Custody. 2015. Available online: https://www.cbsnews.com/news/violent-crime-rate-spikes-baltimore-freddie-gray-death-police-custody-2015/ (accessed on 12 November 2025).
Koerth-Baker, M.; Bronner, L. Charts: Baltimore Crime Before and After Freddie Gray’s Funeral. FiveThirtyEight. 2015. Available online: https://www.fivethirtyeight.com/features/charts-baltimore-crime-before-and-after-freddie-grays-funeral/ (accessed on 12 November 2025).
Pew Research Center. Multiple Causes Seen for Baltimore Unrest. 2015. Available online: https://www.pewresearch.org/politics/2015/05/04/multiple-causes-seen-for-baltimore-unrest/ (accessed on 1 April 2026).
CBC News. Baltimore Riots Prompt State of Emergency After Freddie Gray’s Funeral. 2015. Available online: https://www.cbc.ca/news/world/baltimore-riots-prompt-state-of-emergency-after-freddie-gray-funeral-1.3051048 (accessed on 12 November 2025).
Los Angeles Times. CVS Pharmacy Emerges as Symbolic Flashpoint of Baltimore Riot. 2015. Available online: https://www.latimes.com/nation/la-na-cvs-pharmacy-baltimore-riots-20150428-story.html (accessed on 12 November 2025).
CNN. DEA: Prescription Drugs Stolen in Baltimore Flooding the Streets. 2015. Available online: https://www.cnn.com/2015/06/25/politics/baltimore-drug-market-freddie-gray (accessed on 12 November 2025).
The Guardian. Baltimore Timeline: The Year Since Freddie Gray’s Arrest. The Guardian, 27 April 2016. Available online: https://www.theguardian.com/us-news/2016/apr/27/baltimore-freddie-gray-arrest-protest-timeline (accessed on 10 November 2025).
Baltimore Action Legal Team. About Us. 2024. Available online: https://www.baltimoreactionlegal.org/aboutus (accessed on 10 November 2025).
No Boundaries Coalition. About Us. 2024. Available online: https://www.noboundariescoalition.com/about-us/ (accessed on 10 November 2025).
Leaders of a Beautiful Struggle. About. 2024. Available online: https://lbsbaltimore.com/about/ (accessed on 10 November 2025).
Mohler, G. Marked point process hotspot maps for homicide and gun crime prediction in Chicago. Int. J. Forecast. 2014, 30, 491–497. [Google Scholar] [CrossRef]
Mashiat, T.; Gitiaux, X.; Rangwala, H.; Das, S. Counterfactually fair dynamic assignment: A case study on policing. In Proceedings of the 2023 International Conference on Autonomous Agents and Multiagent Systems; International Foundation for Autonomous Agents and Multiagent Systems (IFAAMAS): Richland, SC, USA, 2023; pp. 2526–2528. [Google Scholar]
Griffard, M. A Bias-Free Predictive Policing Tool: An Evaluation of the NYPD’s Patternizr. Fordham Urban Law J. 2019, 47, 43. [Google Scholar]
Repasky, M.; Wang, H.; Xie, Y. Multi-Agent Reinforcement Learning for Joint Police Patrol and Dispatch. arXiv 2024, arXiv:2409.02246. [Google Scholar] [CrossRef]
Barbosa, S.E.; Petty, M.D. Exploiting spatio-temporal patterns using partial-state reinforcement learning in a synthetically augmented environment. Prog. Artif. Intell. 2015, 3, 55–71. [Google Scholar] [CrossRef]
Parzen, E. On estimation of a probability density function and mode. Ann. Math. Stat. 1962, 33, 1065–1076. [Google Scholar] [CrossRef]
Rosenblatt, M. Remarks on Some Nonparametric Estimates of a Density Function. Ann. Math. Stat. 1956, 27, 832–837. [Google Scholar] [CrossRef]
Chainey, S.; Tompson, L.; Uhlig, S. The utility of hotspot mapping for predicting spatial patterns of crime. Secur. J. 2008, 21, 4–28. [Google Scholar] [CrossRef]
Mohler, G.O.; Short, M.B.; Malinowski, S.; Johnson, M.; Tita, G. Systems and Methods for Predictive Policing. US Patent US8949164B1, 3 February 2015. Available online: https://patents.google.com/patent/US8949164B1/en (accessed on 12 November 2025).
Vivek, M.; Prathap, B.R. Spatio-temporal crime analysis and forecasting on twitter data using machine learning algorithms. SN Comput. Sci. 2023, 4, 383. [Google Scholar] [CrossRef]
Tam, S.; Tanriöver, Ö.Ö. Multimodal deep learning crime prediction using tweets. IEEE Access 2023, 11, 93204–93214. [Google Scholar] [CrossRef]
Joe, W.; Lau, H.C.; Pan, J. Reinforcement learning approach to solve dynamic bi-objective police patrol dispatching and rescheduling problem. In Proceedings of the International Conference on Automated Planning and Scheduling; AAAI Press: Palo Alto, CA, USA, 2022; Volume 32, pp. 453–461. [Google Scholar]
Chen, H.; Wu, Y.; Wang, W.; Zheng, Z.; Ma, J.; Zhou, B. A risk-aware multi-objective patrolling route optimization method using reinforcement learning. In Proceedings of the 2023 IEEE 29th International Conference on Parallel and Distributed Systems (ICPADS); IEEE: New York, NY, USA, 2023; pp. 1637–1644. [Google Scholar]
Chen, H.; Wu, Y.; Wang, W.; Zheng, Z.; Ma, J.; Zhou, B. Optimizing Patrolling Route with a Risk-Aware Reinforcement Learning Model. Preprint. SSRN 4752931, 2023. Available online: https://ssrn.com/abstract=4752931 (accessed on 1 April 2026).
Joe, W.; Lau, H.C. Learning to Send Reinforcements: Coordinating Multi-Agent Dynamic Police Patrol Dispatching and Rescheduling via Reinforcement Learning. 2023. Available online: https://dl.acm.org/doi/10.24963/ijcai.2023/18 (accessed on 11 January 2025).
Brantingham, P.J. The logic of data bias and its impact on place-based predictive policing. Ohio State J. Crim. Law 2017, 15, 473. [Google Scholar]
Brantingham, P.J.; Valasik, M.; Mohler, G.O. Does predictive policing lead to biased arrests? Results from a randomized controlled trial. Stat. Public Policy 2018, 5, 1–6. [Google Scholar] [CrossRef]
Lagioia, F.; Rovatti, R.; Sartor, G. Algorithmic fairness through group parities? The case of COMPAS-SAPMOC. AI Soc. 2023, 38, 459–478. [Google Scholar] [CrossRef]
Wang, H.; Grgic-Hlaca, N.; Lahoti, P.; Gummadi, K.P.; Weller, A. An empirical study on learning fairness metrics for compas data with human supervision. arXiv 2019, arXiv:1910.10255. [Google Scholar] [CrossRef]
Dressel, J.; Farid, H. The accuracy, fairness, and limits of predicting recidivism. Sci. Adv. 2018, 4, eaao5580. [Google Scholar] [CrossRef]
Helms, J.M.; Madden, A. Assessment of Data-Driven Deployment by the Memphis Police Department. Fall 2020 Report. Technical Report, Public Safety Institute, University of Memphis, Memphis, TN, USA, Fall 2020. Available online: https://memphiscrime.org/wp-content/uploads/2020/02/PSI-MPD-Data-Driven-Assessment.pdf (accessed on 20 February 2025).
Baltimore Police Department. New Technology Initiatives. 2024. Available online: https://www.baltimorepolice.org/resources-and-reports/new-technology-initiatives (accessed on 6 February 2026).
Maryland Crime Research and Innovation Center. MCRIC Partners with Baltimore City on Data-Driven Policing Research. Available online: https://bsos.umd.edu/academics-research/maryland-crime-research-and-innovation-center-mcric-mcric-partners-baltimore (accessed on 6 February 2026).
Raji, I.; Sholademi, D.B. Predictive Policing: The Role of AI in Crime Prevention. Int. J. Comput. Appl. Technol. Res. 2024, 13, 66–78. [Google Scholar]
Mandalapu, V.; Elluri, L.; Vyas, P.; Roy, N. Crime prediction using machine learning and deep learning: A systematic review and future directions. IEEE Access 2023, 11, 60153–60170. [Google Scholar] [CrossRef]
Baltimore Police Department. Part 1 Crime Data. Available online: https://data.baltimorecity.gov/datasets/baltimore::part-1-crime-data (accessed on 11 March 2023).
City of Baltimore. Neighborhood Demographic and Spatial Data. Available online: https://data.baltimorecity.gov/datasets/neighborhood-1 (accessed on 23 August 2022).
City of Baltimore. Neighborhood Boundary KML File. Available online: https://data.baltimorecity.gov/datasets/baltimore::neighborhood-1 (accessed on 15 March 2023).
Morgan, R.E.; Truman, J.L. Criminal Victimization, 2019. Technical Report NCJ 255113, Bureau of Justice Statistics, U.S. Department of Justice, 2020. Available online: https://bjs.ojp.gov/content/pub/pdf/cv19.pdf (accessed on 13 June 2025).
Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830. [Google Scholar]
Dwork, C.; Hardt, M.; Pitassi, T.; Reingold, O.; Zemel, R. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference; ACM: New York, NY, USA, 2012; pp. 214–226. [Google Scholar]
Friedler, S.A.; Scheidegger, C.; Venkatasubramanian, S.; Choudhary, S.; Hamilton, E.P.; Roth, D. A comparative study of fairness-enhancing interventions in machine learning. In Proceedings of the Conference on Fairness, Accountability, and Transparency; ACM: New York, NY, USA, 2019; pp. 329–338. [Google Scholar]
Abeles, J.; Conway, D.J. The Gini coefficient as a useful measure of malaria inequality among populations. Malar. J. 2020, 19, 444. [Google Scholar] [CrossRef]
Thomas, V.; Wang, Y.; Fan, X. Measuring Education Inequality: Gini Coefficients of Education for 140 Countries, 1960–2000. J. Educ. Plan. Adm. 2003, 17, 5–33. Available online: https://www.niepa.ac.in/download/Publications/JEPA_(15%20years)/JEPA%202003_Vol-17%20(1-4)/JEPA_JAN-2003-VOL17_1%20Final.pdf#page=5 (accessed on 1 April 2026).
De Maio, F.G. Income inequality measures. J. Epidemiol. Community Health 2007, 61, 849–852. [Google Scholar] [CrossRef] [PubMed]
Semsar, S. Predictive Policing Project Code and Data. 2025. Available online: https://github.com/saminsemsar/Data_Analysis_Portfolio/tree/main/PredictivePolicing (accessed on 1 April 2026).
Halford, E.; Giannoulis, M.; Condon, C.; Keningale, P. Do hotspot policing interventions against optimal foragers cause crime displacement? Int. J. Law Crime Justice 2024, 77, 100654. [Google Scholar] [CrossRef]
Ferguson, A.G. Predictive policing and reasonable suspicion. Emory Law J. 2012, 62, 259. [Google Scholar] [CrossRef]

Figure 1. Crime reporting probabilities based on data from Bureau of Justice Statistics, 2019 report [64].

Figure 2. Simulation maps comparing three algorithms (columns). The top row shows the police distribution on day 1, while the bottom row shows the distribution on day 300. All were simulated using a report probability of 0.4 and 40 officers.

Figure 3. First day vs. last day coverage accuracy of each algorithm over different scenarios.

Figure 4. Average accuracy by scenario and algorithm.

Figure 5. Average racial fairness gap based on police-to-crime ratios by scenario and algorithm.

Figure 6. Average racial fairness gap based on average police share by scenario and algorithm.

Figure 7. Model behavior based on average police share assigned to Black and White neighborhoods over the days of the simulation, in a scenario example (distributing 400 officers and using all crime records with crime report probability = 0.4).

Figure 8. Average neighborhood-level fairness gap based on Gini coefficient of police number, by scenario and algorithm.

Figure 9. Average neighborhood-level fairness gap based on Gini coefficient of police-to-crime ratios, by scenario and algorithm.

Figure 10. Tradeoffs between accuracy and neighborhood-level fairness across algorithms; marker size indicates number of police, while marker shape indicates crime type.

Figure 11. Tradeoffs between accuracy and racial fairness across scenarios; marker size indicates number of police, marker shape indicates crime type.

Figure 12. Racial fairness gap (average police share), absolute slope by algorithm.

Figure 13. Racial fairness gap (average police-to-crime ratio), absolute slope by algorithm.

Figure 14. Neighborhood-level fairness gap (average police share), absolute slope by algorithm.

Figure 15. Model behavior based on neighborhood-level fairness gap in terms of Gini coefficient of police numbers over the days of simulation, from a scenario example (distributing 400 officers and using all crime records with crime report probability = 0.4).

Figure 16. Neighborhood-level fairness gap (police-to-crime ratio), absolute slope by algorithm.

Figure 17. Trend-line slope of neighborhood-level fairness gap based on Gini coefficient of police–crime proportionality over the days of simulation across scenarios and algorithms.

Figure 18. Trend-line slope of neighborhood-level fairness gap based on Gini coefficient of police share over the days of simulation across scenarios and algorithms.

Figure 19. Trend-line slope of racial fairness gap based on average police share over the days of simulation across scenarios and algorithms.

Figure 20. Trend-line slope of racial fairness gap based on police–crime proportionality over the days of simulation across scenarios and algorithms.

Figure 21. Trend-line slope of accuracy over the days of simulation across scenarios and algorithms.

Figure 22. Comparison of average neighborhood-level fairness gaps (based on police-to-crime ratios) of similar scenarios with different crime data.

Figure 23. Comparison of the average accuracy of similar scenarios with different crime data.

Figure 24. Comparison of the trend-line slope of Black vs. White neighborhoods for the average police share across scenarios with different algorithms and crime data.

Figure 25. Comparison of the trend-line slope of Black vs. White neighborhoods for the police-to-crime ratio across scenarios with different algorithms and crime data.

Figure 26. Top-10 neighborhoods across scenarios that appeared in the list of top-3 over-policed neighborhoods in each individual scenario when applying aggravated assault records with each algorithm.

Figure 27. Top-10 neighborhoods across scenarios that appeared in the list of top-3 over-policed neighborhoods in each individual scenario when applying total crime records with each algorithm.

Figure 28. Crime distribution sensitivity across temporal lookback windows. Each heatmap shows the percentage of crimes occurring in White neighborhoods among the top-k highest-crime neighborhoods, where the top-k set is recomputed separately for each lookback window (1, 2, 5, 10, 30, and 365 days prior to the start of the simulation). Results are shown for aggravated assault (left) and total crime (right).

Figure 29. Comparison of crime concentration and initial police allocation across top-k neighborhoods. The left column shows the percentage of crime in top-k neighborhoods during the 30 day pre-simulation period, while the right column shows the percentage of police assigned to these neighborhoods on day 1 of the simulation for each algorithm. Although all models use the same crime baseline, their initial allocations differ, demonstrating that model structures, independent of feedback loops, can introduce disparities in police distributions.

Table 1. The eight simulation scenario settings described in the study.

Scenario	Crime Input	Number of Officers	Report Setting
S1	Aggravated assault	40	Detected only ($r = 0$)
S2	Aggravated assault	40	Detected + reported ($r = 0.4$)
S3	Aggravated assault	400	Detected only ($r = 0$)
S4	Aggravated assault	400	Detected + reported ($r = 0.4$)
S5	Total crime	40	Detected only ($r = 0$)
S6	Total crime	40	Detected + reported ($r = 0.4$)
S7	Total crime	400	Detected only ($r = 0$)
S8	Total crime	400	Detected + reported ($r = 0.4$)

Table 2. Paired scenario-level differences between PredPol and KDE-based methods across the eight main scenarios. Positive values indicate higher accuracy for PredPol; negative values indicate lower fairness gaps (better fairness) for PredPol.

Metric	Avg. Diff. vs. Long-Term KDE	Avg. Diff. vs. Short-Term KDE
Average accuracy	$+0.0127$	$+0.1306$
Racial fairness gap (PCR)	$-3.3727$	$-1.8443$
Racial fairness gap (avg. police share)	$-0.0003$	$-0.0002$
Neighborhood fairness gap (PCR Gini)	$-0.0390$	$-0.1908$
Neighborhood fairness gap (police Gini)	$-0.0329$	$-0.2171$

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Semsar, S.; Prabhu, K.L.; Waters, G.; Foulds, J. A Comparative Simulation Study of the Fairness and Accuracy of Predictive Policing Systems in Baltimore City. Algorithms 2026, 19, 398. https://doi.org/10.3390/a19050398

