A Comparative Simulation Study of the Fairness and Accuracy of Predictive Policing Systems in Baltimore City
Abstract
1. Introduction
- 1.
- 2.
- Opportunists and emboldened criminals: A public survey by the Pew Research Center showed that about 60% of respondents believed the main cause of the crime rise was that some people took advantage of the situation [28]. During the unrest and subsequent police pullback, opportunists looted approximately 30 pharmacies and drug clinics, stealing substantial amounts of narcotics, which some believed contributed to the rise in crime as well [29,30,31].
- 1.
- A simulation study of the fairness of predictive policing in the context of Baltimore.
- 2.
- A comparative analysis of hot spot policing and predictive policing.
- 3.
- Introducing a logical basis on which predictive policing models can be compared.
- 4.
- Visualization of results with suitable graphics, promoting the comprehensibility of our comparisons and the interpretability of related findings.
- 5.
- Last and most importantly, demonstrating the necessity of localized evaluation of policing systems prior to real-world implementation and providing a method with which to achieve this.
2. Related Work
2.1. Accuracy Improvement
2.2. Fairness Studies
- 1.
- High-crime-rate neighborhood vs. low-crime-rate, training only on discovered crimes.
- 2.
- The same, except this time training only on reported crimes.
- 3.
- Two neighborhoods with high crime rates, training only on discovered crimes.
- 4.
- The same neighborhoods, now trained only on reported crimes [10].
- 1.
- Neo-classical: Crime is the product of rational choice, and police presence is preventive. Because a successful burglary can motivate another one nearby, there is a relation among crimes within space and time.
- 2.
- Positivist: Crime is based in genetics, social setting, and biology, rather than by choice; thus, its distribution is mostly random.
2.3. Comparative Analysis
2.4. Contextualized Studies
3. Materials and Methods
3.1. Data Description
3.2. Definitions
- Black neighborhoods: We define a neighborhood as Black if the number of Black residents in that neighborhood exceeds the number of residents of any other racial group. For example, if a neighborhood has 21 Black residents, 20 White residents, and 10 Latino residents, it is flagged as a Black neighborhood. This majority-based definition allows us to consistently categorize neighborhoods by their predominant racial composition for fairness analysis.
- White neighborhoods: Neighborhoods with majority White residents.
- Neither Black nor White neighborhoods: Neighborhoods where the majority of the residents are neither Black nor White. These areas are populated primarily by groups including the Latinx community, Alaskans, and other racial or ethnic minorities.
- Noisy-OR: We use a Noisy-OR function to model the probability of an effect $Y$ given binary causes $X_1, \dots, X_n$. Here, $X_i \in \{0, 1\}$ indicates whether cause $i$ is present or not. Each cause is associated with a failure probability $q_i$. The Noisy-OR formulation assumes conditional independence and is defined as follows: $$P(Y = 1 \mid X_1, \dots, X_n) = 1 - \prod_{i=1}^{n} q_i^{X_i}.$$ We incorporate the Noisy-OR probabilistic model to introduce uncertainty into crime detection. Noisy-OR models the independent contribution of multiple officers in detecting a crime by increasing the likelihood of detection as the number of officers within the detection radius increases. Then, an event (e.g., crime detection) is determined as having occurred if any of several coin flips come up as heads, i.e., if a random probability is higher than the Noisy-OR calculated probability, that crime is marked as “detected”.
- Detected crimes: The algorithm determines whether a crime is detected by applying the Noisy-OR model to the probability of detection, taking into account the number of police officers within the neighborhood, $k$, and the probability of an officer detecting a crime, which is a hyperparameter $p$. This detection probability parameter ($p$) is selected as a neutral baseline in the absence of reliable empirical estimates of an officer’s detection rate. Choosing this value allows us to model detection uncertainty while avoiding strong assumptions that could bias results toward specific outcomes. A crime is labeled as “detected” based on a coin flip with the probability in Equation (2): $$P(\text{detected}) = 1 - (1 - p)^{k}.$$
- Reported crimes: We create a report dataset by flipping a coin with 0.4 probability to decide whether or not each crime would be reported (Equation (4)). The value of 0.4 was calculated using a weighted average based on the number of each type of crime in our dataset and the probability of reporting that crime based on the 2019 report of the Bureau of Justice Statistics (BJS) [64]. We weight each crime category’s count by its national reporting probability from the BJS. The estimate is computed as follows: $$\hat{p}_{\text{report}} = \frac{\sum_{i} n_i \, r_i}{\sum_{i} n_i},$$ where $n_i$ is the number of crimes of type $i$ that occurred after 2019 in Baltimore and $r_i$ is the national reporting rate of crime type $i$ (see Figure 1). To clarify the concepts of reported and detected crimes, consider a simple example. Suppose that ten crimes actually happen in a neighborhood, but only four are reported to the police due to under-reporting. A predictive system trained on historical data will only observe these four reported incidents, not the full set of ten true crimes. In our simulation, “detected” crimes refers to the subset of true crimes that become visible to the system through policing activity, which may partially overlap with reported crimes.
- KDE: Kernel Density Estimation (KDE) is a statistical technique used to estimate the underlying Probability Density Function (PDF) of a set of data points. It involves creating a smooth and continuous function by placing a kernel (a predefined shape, such as a Gaussian) on each data point and then summing them. The resulting estimated density function provides insights into the distribution and intensity of the data across the entire range. We implemented Kernel Density Estimation using the KernelDensity class from the scikit-learn Python library (version: 1.1.3) [65], which provides a computationally efficient implementation of the density estimation method originally introduced by Rosenblatt [42] and Parzen [41].
- Short-term KDE: KDE which receives short-term crime history. In our simulations, we set this short term to equal a month’s worth of crime history.
- Long-term KDE: KDE which receives a longer history of crime compared to short-term KDE. In our simulations, we set it equal to one year’s worth of crime history to make it comparable to PredPol in terms of the data they receive.
- PredPol: PredPol is a self-exciting point process model for crime prediction introduced by Mohler et al. [2,7]. The model is based on the Epidemic-Type Aftershock Sequence (ETAS) framework originally developed in seismology, in which each past event increases the short-term likelihood of nearby future events. In the context of crime forecasting, this formulation captures near-repeat victimization by modeling crime intensity as a combination of a background rate and a self-exciting component. We implemented the simplified formulation explained in Mohler’s 2015 work [7].
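The Noisy-OR detection rule and the weighted reporting estimate defined in the list above can be illustrated with a minimal Python sketch. This is not the authors' code: the function names, the example detection probability of 0.5, and the per-category counts and rates are assumptions chosen only to make the arithmetic visible.

```python
import random


def noisy_or_probability(failure_probs):
    """Noisy-OR: 1 minus the product of the failure probabilities of the active causes."""
    prob_all_fail = 1.0
    for q in failure_probs:
        prob_all_fail *= q
    return 1.0 - prob_all_fail


def detection_probability(num_officers, p_detect):
    """Each of the k officers independently fails to detect with probability 1 - p."""
    return noisy_or_probability([1.0 - p_detect] * num_officers)


def is_detected(num_officers, p_detect, rng):
    """Coin flip that succeeds with the Noisy-OR detection probability."""
    return rng.random() < detection_probability(num_officers, p_detect)


def weighted_report_rate(counts, rates):
    """Weighted average of per-category reporting rates, weighted by crime counts."""
    total = sum(counts.values())
    return sum(counts[c] * rates[c] for c in counts) / total


# Illustrative values only (assumptions, not figures from the study).
p_three_officers = detection_probability(3, 0.5)  # 1 - 0.5**3
example_counts = {"type_a": 100, "type_b": 300}   # hypothetical category counts
example_rates = {"type_a": 0.2, "type_b": 0.6}    # hypothetical reporting rates
example_rate = weighted_report_rate(example_counts, example_rates)
```

With no officers present the detection probability is zero, and it approaches one as officers are added, which is the property the detection definition relies on.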
3.3. Methods
- 1.
- Number of real crimes in the Baltimore city dataset.
- 2.
- Number of police officers assigned.
- 3.
- Number of crimes detected.
- 1.
- Number of police officers: Distribution of 40 or 400 police officers.
- 2.
- Probability of reporting a crime: Using only detected crimes (probability of report = 0) or using both detected and reported crimes (probability of report = 0.4). The value of 0.4 was calculated as the weighted average of reporting probabilities across crime categories, where the weights correspond to the observed frequency of each crime type in the dataset (Section 3.2). It is used as a simplified uniform reporting rate across neighborhoods and crime types to represent under-reporting of crimes. This assumption allows us to examine how under-reporting affects policing algorithms. We note that in reality, reporting probabilities may vary across neighborhoods and crime types; we discuss this impact in Section 6.
- 3.
- Crime type: Using all crime types in the dataset ('TOTAL') or only aggravated assault records ('AGG. ASSAULT').
| Algorithm 1: Creating a Reported Crimes Dataset |
| Input: Crime dataset crimes, Report probability report_probability |
| Output: None |
| reported_crimes = an empty crime dataset; |
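| for each crime in crimes: flip a coin with probability report_probability; if heads, append crime to reported_crimes; |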
| save CSV file of reported_crimes; |
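Algorithm 1 can be sketched in a few lines of Python. This is a hedged illustration rather than the authors' implementation: crimes are assumed to be dictionaries with identical keys, and the function names, the `seed` parameter, and the output path are hypothetical.

```python
import csv
import random


def create_reported_crimes(crimes, report_probability, seed=None):
    """Keep each crime independently with probability report_probability."""
    rng = random.Random(seed)
    return [crime for crime in crimes if rng.random() < report_probability]


def save_csv(rows, path):
    """Write a list of dictionaries (all sharing the same keys) to a CSV file."""
    if not rows:
        return
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


# Hypothetical records; the real dataset has more columns.
example_crimes = [{"id": i, "neighborhood": "A"} for i in range(10000)]
example_reported = create_reported_crimes(example_crimes, 0.4, seed=1)
```

Calling `save_csv(example_reported, "reported_crimes.csv")` would then write the sampled subset to disk; the file name is a placeholder.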
| Algorithm 2: Police Allocation Simulation |
|
| Initialize crimes_h ← all crimes before start_date; |
| Initialize detected_crimes ← crimes_h; |
| Initialize res with columns: date, neighborhood, crime_num, police_num, detected_crime_num; |
| Fill res with simulation dates, neighborhoods, and number of crimes from crimes; |
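| for each simulation date: allocate officers, detect crimes, and update detected_crimes and res (loop body was an image in the source and is not recoverable in detail); |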
| return res; |
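The feedback loop that Algorithm 2 describes can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' procedure: `proportional_allocation` is a hypothetical stand-in for KDE or PredPol, officer counts are allowed to be fractional, dates are ISO strings, and reported crimes are left out.

```python
import random
from collections import Counter


def proportional_allocation(history, neighborhoods, num_police):
    """Hypothetical allocator: officers proportional to crimes already known."""
    counts = Counter(c["neighborhood"] for c in history)
    total = sum(counts[n] for n in neighborhoods)
    if total == 0:
        return {n: num_police / len(neighborhoods) for n in neighborhoods}
    return {n: num_police * counts[n] / total for n in neighborhoods}


def simulate(crimes, dates, neighborhoods, num_police, p_detect, allocate, seed=0):
    """Run the allocate -> detect -> update loop and return one row per date and neighborhood."""
    rng = random.Random(seed)
    # History starts with every crime that happened before the first simulated date.
    known = [c for c in crimes if c["date"] < dates[0]]
    res = []
    for day in dates:
        police = allocate(known, neighborhoods, num_police)
        todays = [c for c in crimes if c["date"] == day]
        newly_detected = []
        for n in neighborhoods:
            local = [c for c in todays if c["neighborhood"] == n]
            # Noisy-OR detection probability for the officers assigned to n.
            p = 1.0 - (1.0 - p_detect) ** police[n]
            found = [c for c in local if rng.random() < p]
            newly_detected.extend(found)
            res.append({"date": day, "neighborhood": n, "crime_num": len(local),
                        "police_num": police[n], "detected_crime_num": len(found)})
        # Feedback: only detected crimes inform the next day's allocation.
        known.extend(newly_detected)
    return res


# Tiny hypothetical example with two neighborhoods.
example_crimes = [
    {"date": "2018-12-31", "neighborhood": "A"},
    {"date": "2019-01-01", "neighborhood": "A"},
    {"date": "2019-01-01", "neighborhood": "B"},
    {"date": "2019-01-02", "neighborhood": "B"},
]
example_res = simulate(example_crimes, ["2019-01-01", "2019-01-02"], ["A", "B"],
                       num_police=40, p_detect=1.0, allocate=proportional_allocation)
```

In this toy run neighborhood B never receives officers, so its crimes are never detected and never enter the history, which is the runaway effect the study measures.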
3.4. Fairness and Accuracy Metrics
- 1.
- Racial Fairness Gap: Racial fairness in this study is a group fairness metric. In this work, the racial disparity or fairness gap is defined as an absolute difference in averages. The use of a difference or absolute difference is a standard approach in quantifying group disparity (cf. Equation (5) in [66] and Definition 6.3 in [67]).
- Inequality of Average Police Share between White and Black Neighborhoods: This measure determines the disparity of average police share between the two races, meaning the groups’ general equality of resource allocation: $$\text{Gap}_{\text{share}} = \left| \bar{S}_{\text{Black}} - \bar{S}_{\text{White}} \right|,$$ where $\bar{S}_{r}$ is the average police share in neighborhoods of race $r$. Based on the explanation above, the most fair model or simulation setting is the one that minimizes this gap.
- Inequality of Average Police-to-Crime Ratio (PCR) between White and Black Neighborhoods: The racial fairness gap is defined as the absolute difference of the average police-to-crime ratio between groups, analyzing proportional treatment or similar treatment of individuals with similar crime rates (see Equations (10)–(14)).
- 2.
- Neighborhood-Level Fairness Gap: To study inequality of treatment among individual neighborhoods, we use the Gini coefficient. This measure of inequality has been used in health [68], education [69], and especially the economic-related literature [70]. The Gini coefficient measures the distance from the equality line; the higher the coefficient, the more unequal the values. The Gini coefficient is calculated using the trapezoidal method: $$G = 1 - \sum_{k=1}^{n} (X_k - X_{k-1})(Y_k + Y_{k-1}),$$ where neighborhoods are sorted in ascending order of the measured value, $X_k$ is the cumulative share of neighborhoods, $Y_k$ is the cumulative share of the measured value, and $X_0 = Y_0 = 0$.
- Inequality of Police Distribution: This metric calculates the Gini coefficient of police numbers in each neighborhood during the simulation in order to determine the overall inequality of the resource distribution.
- Inequality of Police-to-Crime Ratio (PCR) Across Neighborhoods: We compute the Gini coefficient over the neighborhoods’ police-to-crime ratios. This captures how unequally officers are allocated relative to each neighborhood’s share of crime, i.e., whether neighborhoods with similar crime shares receive similar officer shares.
- 3.
- Coverage Accuracy (Proportion of Crimes Detected): Coverage accuracy measures the effectiveness of the police distribution relative to the actual crime distribution. It is defined as the proportion of detected crimes.
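The three families of metrics listed above can be sketched in Python as follows. This is a hedged illustration, not the study's code: the function names are hypothetical, the Gini routine uses the standard trapezoidal approximation of the area under the Lorenz curve, and the handling of empty or all-zero inputs is an assumption.

```python
def mean(values):
    return sum(values) / len(values)


def racial_fairness_gap(values_group_a, values_group_b):
    """Absolute difference of group averages (police share or police-to-crime ratio)."""
    return abs(mean(values_group_a) - mean(values_group_b))


def police_to_crime_ratio(police_share, crime_share):
    """Officer share divided by crime share; assumes crime_share is greater than 0."""
    return police_share / crime_share


def gini(values):
    """Gini coefficient from the Lorenz curve, integrated with the trapezoidal rule."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0  # assumption: treat empty or all-zero input as perfectly equal
    area = 0.0
    cumulative = 0.0
    prev_y = 0.0
    for x in xs:
        cumulative += x
        y = cumulative / total
        area += (prev_y + y) / (2 * n)  # trapezoid of width 1/n
        prev_y = y
    return 1.0 - 2.0 * area


def coverage_accuracy(num_detected, num_crimes):
    """Proportion of actual crimes that were detected."""
    return num_detected / num_crimes if num_crimes else 0.0
```

A perfectly even officer distribution gives a Gini coefficient of 0, while placing every officer in one of n neighborhoods gives (n - 1) / n.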
3.5. Experimental Environment
4. Results
- Which algorithm is fairer or more accurate in our specific location and time period?
- Do different systems react differently to feedback loops?
- Is the bias model-driven or data-driven?
- How do the police concentration and distribution change over the course of the simulation in different scenarios?
- 1.
- Comparative Fairness and Accuracy Across Models: This subsection compares short-term KDE, long-term KDE, and PredPol in terms of overall fairness and accuracy within the Baltimore study region and time period. The findings summarize average and aggregate performance across scenarios, including whether the improvements in fairness affected accuracy.
- Finding 1.1
- PredPol was the most accurate at the beginning. In 50% of scenarios, PredPol had the highest coverage accuracy of the three models on the simulation’s first day. However, on the last day of the simulation, long-term KDE became the most accurate in 75% of the scenarios, cf. Figure 3.
- Finding 1.2
- PredPol was generally the most accurate algorithm. PredPol was the most accurate algorithm in 87.5% of scenarios and the second-most accurate in the rest. Conversely, short-term KDE was the least accurate in 75% of scenarios (Figure 4). Further analysis of this result’s robustness against different report probabilities (Appendix A) showed that although the size of the accuracy difference between these models was influenced by the report probability, their relative ranking remained mostly the same (Figure A1).
- Finding 1.3
- PredPol was the most racially fair, while long-term KDE was the least. PredPol was the most racially fair, in terms of equality of police-share to crime-share ratio, in 75% of scenarios, and long-term KDE was the least racially fair in 87.5% of scenarios (Figure 5). In terms of average police share, PredPol was the most racially fair model in 62.5% of scenarios. Long-term KDE, on the other hand, was the least racially fair model in 75% of scenarios (Figure 6 and Figure 7). Further analysis of this result’s robustness against different report probabilities (Appendix A) showed that although the size of the difference in racial fairness gap between these models was influenced by the report probability (Figure A1, right column), their relative ranking remained mostly the same (Figure A1, left column), with PredPol having a higher probability of achieving the best performance across report probabilities.
- Finding 1.4
- PredPol was the most fair in regard to neighborhood-level fairness, followed by long-term KDE, while short-term KDE was the least fair at the neighborhood level. Based on both equality of average police share and equality of average police-to-crime ratio, PredPol (in 75% of scenarios) and long-term KDE (in the other 25%) were the most fair models at the neighborhood level. According to both metrics, short-term KDE was the least fair at the neighborhood level in 100% of scenarios (Figure 8 and Figure 9). Further analysis of this result’s robustness against different report probabilities (Appendix A) showed that although the size of the difference between these models in terms of the neighborhood-level fairness gap was influenced by the report probability (Figure A1, right column), their relative ranking remained mostly the same, with PredPol maintaining the highest probability of achieving the best performance across report probabilities (with the exception of one report probability setting, where long-term KDE and PredPol swapped places); see Figure A1, left column. In general, long-term KDE and PredPol performed more alike in terms of the size of the neighborhood-level fairness gap, but the gap decreased more for PredPol than for long-term KDE as the report probability increased (Figure A1, right column). To reinforce the scenario percentage pattern reporting, we also computed paired differences between PredPol and each KDE-based method across the eight main scenarios. The average differences are provided in Table 2.
- Finding 1.5
- Higher accuracy is accompanied by a lower neighborhood-level fairness gap. In Figure 10, it can be observed that higher neighborhood-level fairness gaps (lower fairness) were accompanied by lower accuracy. The accuracy–fairness correlation is clearer when we group the data points by the number of officers allocated, which was the variable creating the most distinct clusters. Further correlation analysis can be found in Appendix B.
- Finding 1.6
- Racial fairness is not correlated with accuracy. Unlike neighborhood-level fairness, we did not observe any correlations between racial fairness and accuracy (Figure 11, Appendix B). All levels of racial fairness were observed over all levels of accuracy. Further correlation analysis can be found in Appendix B.
- 2.
- Differential Responses to Feedback Loops: This subsection examines how the different algorithms responded to feedback loops over time. These findings focus on temporal trends, including the direction and speed of change in fairness and accuracy, rather than on average outcomes. Overall, we found that long-term KDE had the slowest pace of change (slope of trend-line) for accuracy and fairness metrics over the days of simulation, with PredPol and short-term KDE competing for the least stable. In some cases where this trend moved towards focusing officers on a race different from the one at the start of simulation, the trend became positive, showing a temporary improvement in bias. This happened more for KDE-based models. A more detailed report is presented below.
- Finding 2.1
- Long-term KDE had the slowest pace of change in most scenarios, especially for neighborhood-level fairness. Long-term KDE had the smallest slope of the trend line in:
- Finding 2.2
- Short-term KDE had the fastest trend of racial bias based on average police share in most of the scenarios. Short-term KDE had the fastest trend of racial bias in 62.5% of scenarios for equality of average police share (Figure 12). Taking a closer look at individual scenarios, although short-term KDE ranked second in pace of neighborhood-level bias in most of the scenarios, its change in neighborhood-level fairness over the days showed a more volatile pattern compared to the other two models (Figure 15; to see the other scenarios, refer to the code [71]).
- Finding 2.3
- PredPol had the fastest trend of neighborhood-level bias, and showed a trend in racial bias metrics for most of the scenarios. We have already established that PredPol was generally the fairest model in terms of both racial and neighborhood-level fairness; however, looking at the pace of change for each metric over different scenarios, PredPol shows a fast neighborhood-level fairness gap trend (Figure 14 and Figure 16) and racial fairness trend in terms of police-to-crime ratio in most of the scenarios (Figure 13). This suggests that it is indeed vulnerable to bias from feedback loops, as previously reported by Lum and Isaac [9].
- Finding 2.4
- Bias amplification was observed in a higher percentage of scenarios for PredPol compared to the other two models. Although PredPol was fairer in most of the scenarios, amplification of bias was seen in its trends in a higher percentage of scenarios than for the two KDE approaches. PredPol’s neighborhood-level fairness dropped in 100% of scenarios, while the drop for short-term KDE and long-term KDE occurred in 75% and 37.5%, respectively (Figure 17 and Figure 18). The drop in racial fairness for PredPol happened in 75% of scenarios for police–crime proportionality and 50% of scenarios for average police share, while these numbers were 50% and 25% for short-term KDE and 75% and 25% for long-term KDE, respectively (Figure 19 and Figure 20). In Figure 19 and Figure 20, the hatched or striped bars are those in which the race receiving focus at the start of the simulation differed from the race the trend was toward; this demonstrates how all the fairness gaps with negative slopes were those where the race focus at the beginning of the simulation contradicted the trend. For instance, refer to the models’ behaviors in the example scenario in Figure 7.
- Finding 2.5
- Short-term KDE experienced an accuracy drop in a higher percentage of scenarios compared to the other two models. Although we observed previously that neighborhood-level fairness and accuracy seemed to be correlated to some extent, here we saw that short-term KDE, rather than the model with the fastest and most consistent drop in neighborhood-level fairness, experienced accuracy drops in 75% of the scenarios. The corresponding drop percentages were 62.5% for PredPol and 37.5% for long-term KDE (Figure 21).
- Finding 2.6
- Distribution uniformity did not guarantee the defined racial fairness. While the percentage of scenarios with a drop in neighborhood-level fairness ranged from 75 to 100 for PredPol and short-term KDE, the percentage of scenarios with a drop in racial fairness ranged from 25 to 75 (Figure 17, Figure 18, Figure 19 and Figure 20). On the other hand, for long-term KDE racial fairness based on police–crime proportionality fell in 75% of scenarios, while both neighborhood-level fairness metrics worsened in 37.5% of scenarios. Long-term KDE’s racial fairness based on average police share only fell in 25% of scenarios. This indicates that the models with the least police distribution uniformity might stay racially fair in areas with certain demographic maps and crime records.
- Finding 2.7
- Bias typically worsened over time except when the trend and the current state contradicted each other. We observed a consistent pattern across scenarios: in cases where racial fairness appeared to improve unexpectedly, the demographic group receiving the greatest police attention at the start of the simulation was different from the group receiving the most attention at the end. In the case of Baltimore and using all crime records, this focus at the end was surprisingly on White neighborhoods. This shift largely accounted for the observed improvement (see Figure 19 and Figure 20). For these contradictory cases where the initial state and the direction of change did not align with bias amplification, the slope of the bias trend must have been sufficiently steep relative to the initial disparity for the trajectory to have crossed the parity point and begun moving toward amplification. In such situations, longer simulation horizons are required for this transition to appear. These apparent improvements hint at the possibility of seasonal cycles of bias reduction and subsequent amplification when simulations are run for extended durations. Changes in the data distribution for real crimes may arise from multiple underlying factors. Two such factors are illustrated below:
- i.
- Seasonality: Any variable with cyclic effect, such as weather, could cause seasonal changes in the distribution.
- ii.
- Police policy change: Changes in police distribution or enforcement practices could reduce crime in a given location or suppress specific crime types affected by that policy, which would then induce a shift in crime distribution.
Therefore, when data on real crimes are used and when both reported and detected crimes contribute to the updates, the distribution of reported crimes may begin to favor a neighborhood that was not previously predicted as high-risk. As officers gradually shift their attention from earlier high crime rate areas towards this newly emergent hot spot, a temporary reduction in measured bias can occur. This is typically followed by renewed amplification once police resources become concentrated in the new focal area, until the distribution of reported crimes eventually shifts again.
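The trend-based reasoning in the findings above can be made concrete with a small Python sketch. It fits an ordinary least-squares line to a daily fairness-gap series and reads the sign of the slope. The function names, the tolerance, and the example series are assumptions for illustration, not the study's analysis code or data.

```python
def trend_slope(values):
    """Least-squares slope of values against day index 0, 1, 2, ... (needs 2+ points)."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxy = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    sxx = sum((i - mean_x) ** 2 for i in range(n))
    return sxy / sxx


def classify_trend(gap_series, tol=1e-9):
    """Label a fairness-gap series by the sign of its trend-line slope."""
    slope = trend_slope(gap_series)
    if slope > tol:
        return "amplifying"
    if slope < -tol:
        return "improving"
    return "flat"


# Hypothetical signed disparity (one group's average share minus the other's).
signed = [0.3, 0.2, 0.1, 0.0, -0.1, -0.2]
absolute_gap = [abs(v) for v in signed]

early = classify_trend(absolute_gap[:4])  # gap shrinks while focus shifts
late = classify_trend(absolute_gap[3:])   # gap grows again after parity
```

In this toy series the signed disparity trends steadily toward the other group, so the absolute gap first shrinks and then grows once parity is crossed. A short simulation window would report only the improving phase.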
- 3.
- The Effects of Data Variation on Bias: This subsection examines how different crime data affect the fairness and accuracy outcomes across the algorithms. We compared scenarios using aggravated assault records with those using all crime records in order to analyze the change in fairness and accuracy.
- Finding 3.1
- Expanding from aggravated assault to all crimes increased inequality of the police-to-crime ratio and reduced accuracy for long-term KDE and PredPol.
- Finding 3.2
- Different models could have different bias outcomes when applied to different data, and vice versa. When looking at the slope of the trendline to determine which race is receiving a higher average police share (Figure 24) or higher average police-to-crime ratio (Figure 25) as the simulation days pass, we observe that when feeding aggravated assault crime records to PredPol, the focus is more on Black neighborhoods, whereas the focus flips to White neighborhoods when using all crime records. This is observed in most scenarios and for both metrics. This result is noteworthy because the debate around bias in predictive policing typically makes the implicit assumption that algorithms are solely biased against Black neighborhoods, following the work of Lum and Isaac [9].
- 4.
- Police Concentration Areas in Baltimore: This subsection focuses on how police resources are distributed across neighborhoods over the course of the simulation. To be more precise, we looked at the number of scenarios in which each of the top-10 most frequently over-policed neighborhoods was ranked in the top 3.
- Finding 4.1
- Assigning officers based on aggravated assault led to over-policing more Black neighborhoods than White neighborhoods. Over-policing is defined as having a higher police share compared to crime share. Looking at the top-10 most frequently over-policed neighborhoods by each algorithm when using aggravated assault records, we observed that other than Downtown, which is the most frequently over-policed neighborhood in the scenarios, all other top-10 frequently over-policed neighborhoods were Black, such as Blair-Edison and Sandtown-Winchester (Figure 26).
- Finding 4.2
- Assigning officers based on all crime records by PredPol over-policed mostly White neighborhoods such as Mount Vernon and Canton, unlike the other two models. Looking at the top-10 most frequently over-policed neighborhoods by each algorithm when all crime records were used, we saw that other than Sandtown-Winchester and Cherry Hill, all other top-10 frequently over-policed neighborhoods by PredPol were White, such as Mount Vernon, Canton, and Brooklyn (Figure 27). Among short-term KDE’s over-policed neighborhoods, only two of six (Patterson Park and Inner Harbor) were White; among long-term KDE’s over-policed neighborhoods, one out of four (Brooklyn) was White.
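The over-policing definition used in this subsection (police share higher than crime share) can be sketched in Python as follows. The ranking by excess share, the function name, and the example counts are assumptions for illustration; the text does not specify how neighborhoods were ordered within the top ranks.

```python
def over_policed(police_counts, crime_counts, top_k=3):
    """Return up to top_k neighborhoods whose police share exceeds their crime share.

    Neighborhoods are ordered by excess share (police share minus crime share),
    which is an assumed ranking rule. Both totals are assumed to be positive.
    """
    total_police = sum(police_counts.values())
    total_crimes = sum(crime_counts.values())
    excess = {
        n: police_counts[n] / total_police - crime_counts.get(n, 0) / total_crimes
        for n in police_counts
    }
    flagged = [n for n in excess if excess[n] > 0]
    flagged.sort(key=lambda n: excess[n], reverse=True)
    return flagged[:top_k]


# Hypothetical neighborhoods and counts.
example_police = {"N1": 50, "N2": 30, "N3": 20}
example_crimes = {"N1": 30, "N2": 30, "N3": 40}
example_top = over_policed(example_police, example_crimes)
```

Counting how often each neighborhood appears in this list across scenarios gives the kind of frequency ranking described above.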
5. Discussion
- Fairness and Accuracy Comparison: By comparing simulation results of representative predictive policing and hot spot policing algorithms, we discovered that predictive policing via PredPol was generally more accurate and had higher racial and neighborhood-level fairness than hot spot policing via short-term and long-term KDE across most scenarios. Long-term KDE, where we widened KDE’s data input to use the same data received by PredPol, improved accuracy and neighborhood-level fairness over that of short-term KDE but worsened racial fairness. Its accuracy did not overtake PredPol’s, but its neighborhood-level fairness almost matched PredPol’s. The theory of hot spot policing claims that distribution of recent crimes can approximate their near-future distribution. Recent crimes are often defined as those within either a month or a year of the prediction date [72]. In our experiment, both long-term KDE and PredPol received the crime history from a year before the starting date of the simulation. The simulation’s starting prediction date was 1 January 2019, and the crime history included all crimes after 1 January 2018. If the crime history extended back more than a year, we would expect the accuracy at the starting date to drop, but overall accuracy might improve because of our finding that a longer crime history slows down the effect of neighborhood-level bias for KDE, which in turn slows the drop in accuracy. Our results show that higher accuracy does not guarantee a racially fairer model. Both fair and unfair outcomes were observed across the fairness spectrum for high-accuracy scenarios (Figure 11). However, neighborhood-level fairness seemed to correlate with accuracy (Figure 10). These results extend the work of Mohler et al. [12] on accuracy–fairness tradeoffs by providing evidence from extended simulations on real crime data.
Unlike Mohler’s observation that fairness comes at a cost in accuracy, some of our scenarios were both fairer and more accurate than others. This allows us to infer that the give-or-take between accuracy and fairness depends on both the fairness metric and the demographic geography of the location the system is customized for (Figure 10 and Figure 11).
- Temporal Feedback Loop Effects: We saw a trend of bias amplification in most of the scenarios for all three systems, except when there was a change in the pattern of data during the simulation causing the police focus to change from certain neighborhoods to others. In scenarios where bias did not become worse, the race that received the highest focus at the start of the simulation differed from the race receiving it at the end, as shown in Figure 19 and Figure 20. Note that the negative slopes of the fairness gaps indicate those scenarios with improvements in regard to that metric. For these scenarios, bias amplification might have happened later on if the simulation had been continued for a longer period, unless reported crime distribution changed drastically again. Therefore, when using real crime data we saw that bias amplification was not constant, instead depending on changes in the distribution of newly discovered and reported crimes. For neighborhood-level fairness, we saw a continual increase in bias for most of the scenarios, especially for PredPol and short-term KDE (Figure 17 and Figure 18). For racial fairness, a lower number of scenarios experienced bias amplification compared to neighborhood-level fairness, especially for the metric related to average police share (Figure 19). However, it should be noted that all scenarios with bias improvement also showed a contradiction between which race the initial police focus was on and which race the focus was trending towards. It is important to consider that we defined bias amplification as a falling trend in a system’s equality of treatment.
In all scenarios where we saw improvement in equality, we also observed that the race receiving a higher policing share at the beginning of the simulation was different from the race that the trend was towards; therefore, a longer simulation duration is needed in order to make sure there is not a point of racial equality in the future after which a bias amplification towards the trending race occurs. We did not extend the simulation further, as longer durations introduce a shift in the underlying crime distribution that interacts with the feedback effect, making it difficult to attribute the changes in fairness to a single mechanism. Unlike Ensign’s assumptive theoretical study that showed constant bias amplification when assigning an officer between two neighborhoods [10], in a more realistic situation such amplification is not constant, and could temporarily subside before intensifying again. Although PredPol was generally fairer and more accurate on average in most of the scenarios, its speed of bias amplification was higher for most of the metrics in most of the scenarios, while the speed of bias amplification for long-term KDE was substantially slower compared to the other two in most scenarios. This provides further evidence that different models respond differently to feedback loops.
- Crime Type Effect: In the context of Baltimore, changing the input data between aggravated assault records and all crime records flipped the police concentration from one demographic to another for PredPol, and in some scenarios for short-term and long-term KDE as well. This result shows that the direction of racial bias can be affected not only by the predictive policing system but also by the data it was applied to. We also observed that the average accuracy and average neighborhood-level equality based on the police-to-crime ratio dropped for long-term KDE and PredPol. This suggests that systems with more focus on the long-term effects of present crime on predictions of future crime might become less accurate when daily records are sparse. Alternatively, it could be that data records had a higher change in distribution over a short period of time. The sparseness of crime data and its effect on different systems could be studied further in future work.
- Data-Driven vs. Model-Driven Bias: Our analyses found that both data and model matter: the same data fed into different models (KDE vs. PredPol) produced different fairness outcomes, as previously shown by Chapman et al. [5]. The same model fed different data (aggravated assault vs. total crime) can affect the speed of bias development, or even flip the direction of bias.
- Baltimore-Specific Insights: Black neighborhoods received more policing on average in most simulations, although this varied with crime type and algorithm. In most scenarios, the trend was toward assigning a higher average share of officers in general, and also toward a higher average share of officers per share of crime to White neighborhoods when all crime records were used. This finding counters the widely held assumption that predictive policing is biased specifically against Black neighborhoods, as found by Lum and Isaac in data from Oakland, California [9]. These patterns are not solely explainable by neighborhood count (Baltimore has more Black than White neighborhoods), suggesting that model behavior plays a larger role than geography alone. Furthermore, when examining the top-5 highest crime neighborhoods in different period lengths before the start of the simulation (cf. Figure 28), we observed the percentage of crimes occurring in White neighborhoods to be substantially higher for total crimes as compared to aggravated assault. This indicates that the crime distribution differs by crime type, and could affect the direction of bias. However, when comparing the police allocation of different models with crime percentage concentrations (cf. Figure 29), we observe that the models produce different police distributions even on the first day of the simulation. These differences reflect inherent differences in model behavior. Since longer-term results include the influence of feedback loops, these initial discrepancies are likely to compound over time, leading to the observed long-run differences in speed or extent of bias. Thus, both data and model behavior appear to affect not only the magnitude of bias in Baltimore, but also its direction and how the direction evolves over time. Downtown, Sandtown-Winchester, and Blair-Edison were repeatedly among the top over-policed neighborhoods when aggravated assault records were used.
This consistency across models suggests that some neighborhoods are structurally favored or targeted regardless of which predictive algorithm is used. However, when looking at over-policed neighborhoods when using all crime records, we see that the models are not as consistent. This is another demonstration of how data and algorithms can both affect the results. The only neighborhood the algorithms commonly over-policed when using total crime records was Sandtown-Winchester, a Black neighborhood. These Baltimore predictive policing simulation results are consistent with previous studies on other cities, including both real experimental studies [7] and simulation studies [9], in that police concentration tendencies occurred in the models due to feedback loops. Depending on the duration of the simulation, these systems might change rank in terms of each bias metric or even average accuracy. A short simulation duration might indicate PredPol as causing the most uniform police distribution (based on police Gini coefficient), followed by long-term KDE and then short-term KDE, while a longer simulation duration might cause a rank swap between long-term KDE and PredPol. These variables could be studied in future system-comparative works. For Baltimore City, our recommendations are as follows:
- Although predictive policing remained fairer and more accurate than hot spots-based policing, it had a higher speed of bias amplification than hot spots policing in most scenarios. Hence, we advise the city to be aware of the long-term tendencies of any predictive policing system they might use.
- Evaluation studies like the one performed here should be performed prior to real-world implementation. These studies can highlight bias issues arising from the combination of a particular algorithm and data, leading to decision-making that is better informed.
- When distributing resources using predictive policing models based on aggravated assault records, authorities should be mindful of assigning too many officers to the Downtown, Blair-Edison, and Sandtown-Winchester neighborhoods. We advise careful interpretation of results when applying all crime records to the predictive policing model, since Sandtown-Winchester, Brooklyn, Mount Vernon, and Canton could incorrectly appear to have higher crime rates due to feedback loop phenomena. Similarly, the city could prepare a list of neighborhoods that might be under-rated, under-policed, and in need of more attention.
- Our results highlight the importance of actively monitoring long-term fairness trends when deploying predictive policing systems. In particular, it is important to distinguish between estimated or predicted trends in model behavior obtained through pre-deployment evaluation and observed trends that emerge from real-world implementation. In practice, this could involve periodic reassessment of both predicted and observed system behavior over time. At the end of each evaluation period, stakeholders could conduct audits of police allocation distributions and fairness indicators (e.g., racial or neighborhood-level fairness) to detect emerging bias. These audits should compare observed trends with previously estimated tendencies and current outcomes with those from previous periods. Such comparisons could help to identify deviations between expected and actual system behavior as well as shifts in bias over time. Decisions made based on estimated trends may also influence future outcomes, which should be documented and studied.
- Engaging domain experts and community stakeholders in defining additional fairness metrics and reviewing these trends could bring to light new aspects involving fairness and community welfare and provide critical context for interpreting system behaviors.
Together, these steps can help to ensure that predictive policing systems remain aligned with fairness objectives over extended deployment periods.

- Policy Implications: Based on our examination of various aspects of social fairness, policymakers should look beyond the promised ability to reduce and prevent crime when authorizing deployment or continuation of any smart policing system. More broadly, achieving a society where every individual has equal opportunity to succeed, and in turn to help their community prosper, requires coordinated efforts among stakeholders along with a continuous evaluation process to ensure that these systems operate as intended.

Our results demonstrate that predictive policing systems exhibit dynamic behavior over time, which affects how their deployment needs to be regulated. Policies regarding their evaluation should be ongoing: continuous pre- and post-deployment evaluations through periodic audits need to be performed rather than a one-time pre-deployment evaluation. Appropriate laws and policies should establish a cyclic evaluation framework that includes the following steps:
1. Defining/updating fairness and accuracy metrics.
2. Defining/updating the system dynamics and behavioral responses, e.g., how police presence changes crime distribution, how different environmental factors change the probability of crime being reported, etc.
3. Defining/updating policies and regulations.
4. Pre-deployment evaluation of the policing system’s behavior.
5. Monitoring real-world outcomes.
6. Conducting causal and experimental analysis of outcome data to better understand interactions among the metrics and the system’s responses.
7. Interpretation of the results by community stakeholders and domain experts.
This approach shifts the focus from selecting the best policing system to establishing a robust evaluation and governance process that directs the development of the policing system, simulation framework, and evaluation framework itself. Such a process can adapt to long-term system dynamics while ensuring that the system meets fairness objectives over time.

Furthermore, a dynamic governance process and continuous re-evaluation that engages community stakeholders to promote transparency and accountability will result in higher levels of community cooperation, thereby building trust between citizens and law enforcement agencies. When stakeholders are aware that system performance is regularly assessed and corrective actions are being taken, confidence in the system will improve and tolerance for inadequacies and enforcement errors will increase.
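The periodic audits recommended above could be prototyped along the following lines. This is a minimal sketch under stated assumptions, not the evaluation code used in this study: the function names (`gini`, `linear_trend`, `audit`), the tolerance threshold, and the officer counts are all hypothetical, and the trend is estimated with a plain least-squares slope. The sketch computes a police Gini coefficient for each audit period and flags the system when the observed trend deviates from a pre-deployment estimate by more than the tolerance.

```python
from statistics import mean


def gini(values):
    """Gini coefficient of non-negative values (0 means a uniform allocation)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted formula applied to the sorted sample.
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return (2.0 * weighted) / (n * total) - (n + 1) / n


def linear_trend(series):
    """Least-squares slope of a series against its index (change per period)."""
    n = len(series)
    if n < 2:
        return 0.0
    x_bar = (n - 1) / 2
    y_bar = mean(series)
    num = sum((i - x_bar) * (y - y_bar) for i, y in enumerate(series))
    den = sum((i - x_bar) ** 2 for i in range(n))
    return num / den


def audit(observed_allocations, estimated_trend, tolerance=0.01):
    """Compare the observed police-Gini trend with a pre-deployment estimate."""
    ginis = [gini(period) for period in observed_allocations]
    observed_trend = linear_trend(ginis)
    deviation = observed_trend - estimated_trend
    return {
        "gini_per_period": ginis,
        "observed_trend": observed_trend,
        "deviation": deviation,
        "flag": abs(deviation) > tolerance,
    }


# Hypothetical officer counts per neighborhood for three audit periods.
periods = [[10, 10, 10, 10], [14, 10, 9, 7], [20, 10, 6, 4]]
report = audit(periods, estimated_trend=0.02)
```

In this illustration, allocation becomes more concentrated in each period, so the observed slope exceeds the assumed estimate and the audit raises a flag. A real audit would also track the racial fairness gaps and accuracy, and the tolerance would need to be agreed with stakeholders.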
6. Limitations
- Every individual scenario parameter (e.g., crime type, number of officers, probability of report) and hyperparameter (such as the base probability of crime detection by an officer within the detection radius, i.e., the crime’s neighborhood) could be explored over a wider range of values to assess robustness and their influence on the results.
- We assumed that the crimes in the database represented all crimes that actually happened, which does not hold in practice. These records could be the result of an already biased system, for example one in which less crime is reported or discovered in certain areas.
- We used real crime data, which we filtered by using a fixed report probability for all neighborhoods. The approach would be more realistic if different crimes were filtered based on their report probability instead of using an average report probability for all crimes. Using varying report probabilities across neighborhoods and/or crime types could change the distribution of observed crimes (detected + reported) and potentially influence the fairness and accuracy outcomes of the algorithms. Exploring neighborhood-specific and/or crime type-specific reporting rates would be a valuable expansion for future work.
- We did not account for how the presence of police in a neighborhood could affect the crime distribution. The only changes in crime distribution over the course of our simulations were those already contained in the original crime data; we made no manual adjustments. We expect that, if the simulations were run for longer periods, these systems could flip their bias direction once the distribution of the filtered reported crimes changes, for example after several consecutive days of high crime rates in neighborhoods with different demographic majorities.
- In this work, we considered a detection formula that implicitly assumes crime detection to be determined exclusively by the number of deployed officers. Nevertheless, in real-world deployments detection rates may also be affected by the model-estimated level of crime risk itself. Officers entering a predicted high-crime area typically behave with elevated vigilance, which can inform how they evaluate behavior and distribute attention [73]. As a result, detection probability may be determined not only by officer count but also by the broader sociotechnical context induced by the predictive mechanism.
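The neighborhood-specific and crime-type-specific reporting rates suggested above could be explored with a filter like the one sketched below. This is an illustrative sketch only: the function `filter_reported`, the record fields, the crime-type labels, and every probability value are hypothetical and are not taken from the simulation code or from any reporting statistics. Each record is kept with a probability looked up first by (neighborhood, crime type), then by neighborhood, and finally from a global default.

```python
import random


def filter_reported(crimes, report_prob, default_prob, rng):
    """Keep each crime with the report probability of its neighborhood and type."""
    kept = []
    for crime in crimes:
        key = (crime["neighborhood"], crime["crime_type"])
        # Fall back to a neighborhood-level rate, then to the global default.
        p = report_prob.get(key, report_prob.get(crime["neighborhood"], default_prob))
        if rng.random() < p:
            kept.append(crime)
    return kept


# Hypothetical records and reporting rates (illustrative values only).
crimes = [
    {"neighborhood": "Downtown", "crime_type": "AGG. ASSAULT"},
    {"neighborhood": "Canton", "crime_type": "LARCENY"},
    {"neighborhood": "Brooklyn", "crime_type": "BURGLARY"},
]
report_prob = {
    ("Downtown", "AGG. ASSAULT"): 1.0,  # always reported in this toy example
    "Canton": 0.0,                      # never reported in this toy example
}
rng = random.Random(0)  # seeded so that repeated runs give the same filter
observed = filter_reported(crimes, report_prob, default_prob=0.4, rng=rng)
```

Passing an empty `report_prob` dictionary reduces the filter to a single fixed report probability for all records, which makes it straightforward to compare the two settings on the same data.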
7. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| KDE | Kernel Density Estimation |
| Avg. | Average |
| Abs. | Absolute |
Appendix A. Sensitivity Analysis to Reporting Probabilities
| Metric | Algorithm_A | Algorithm_B | Mean_pct_diff | CI_low_pct | CI_high_pct | ANOVA_F_pReport | ANOVA_p_pReport |
|---|---|---|---|---|---|---|---|
| Accuracy | KDE_LongTerm | KDE_ShortTerm | 23 | 20.4 | 25.5 | 85.9 | 1.11 × |
| Accuracy | PredPol | KDE_LongTerm | 4.64 | 4.17 | 5.1 | 26.2 | 3.47 × |
| Accuracy | PredPol | KDE_ShortTerm | 27.5 | 25.1 | 29.9 | 71.6 | 2.29 × |
| NeighborhoodFairnessGap_pGini | KDE_LongTerm | KDE_ShortTerm | −11.2 | −12.2 | −10.2 | 23.3 | 1.22 × |
| NeighborhoodFairnessGap_pGini | PredPol | KDE_LongTerm | −0.326 | −0.5 | −0.151 | 84.9 | 2.58 × |
| NeighborhoodFairnessGap_pGini | PredPol | KDE_ShortTerm | −11.5 | −12.5 | −10.5 | 11.5 | 3.52 × |
| NeighborhoodFairnessGap_pcrGini | KDE_LongTerm | KDE_ShortTerm | −8.75 | −9.69 | −7.81 | 10.4 | 1.42 × |
| NeighborhoodFairnessGap_pcrGini | PredPol | KDE_LongTerm | −0.22 | −0.327 | −0.113 | 46.3 | 8 × |
| NeighborhoodFairnessGap_pcrGini | PredPol | KDE_ShortTerm | −8.97 | −9.92 | −8.01 | 6.14 | 0.000455 |
| NeighborhoodFairness_pGini_absTrend | KDE_ShortTerm | KDE_LongTerm | 126 | 121 | 131 | 49.2 | 3.95 × |
| NeighborhoodFairness_pGini_absTrend | KDE_ShortTerm | PredPol | 8.12 | 2.46 | 13.8 | 59.5 | 1.68 × |
| NeighborhoodFairness_pGini_absTrend | PredPol | KDE_LongTerm | 122 | 116 | 128 | 7.46 | 7.67 × |
| NeighborhoodFairness_pcrGini_absTrend | KDE_ShortTerm | KDE_LongTerm | 79.1 | 72.2 | 85.9 | 52.6 | 1.29 × |
| NeighborhoodFairness_pcrGini_absTrend | PredPol | KDE_LongTerm | 89.6 | 84.5 | 94.7 | 18.5 | 4.55 × |
| NeighborhoodFairness_pcrGini_absTrend | PredPol | KDE_ShortTerm | 8.76 | 3.55 | 14 | 136 | 1.52 × |
| RacialFairnessGap_PCR | KDE_ShortTerm | KDE_LongTerm | −4.4 | −5.41 | −3.38 | 28.1 | 3.94 × |
| RacialFairnessGap_PCR | PredPol | KDE_LongTerm | −9.65 | −10.5 | −8.82 | 5.38 | 0.00127 |
| RacialFairnessGap_PCR | PredPol | KDE_ShortTerm | −5.26 | −6.04 | −4.48 | 20.7 | 2.93 × |
| RacialFairnessGap_PCR_absTrend | KDE_LongTerm | KDE_ShortTerm | 23.7 | 13.2 | 34.3 | 9.21 | 7.4 × |
| RacialFairnessGap_PCR_absTrend | PredPol | KDE_LongTerm | 3.74 | −5.78 | 13.3 | 6.37 | 0.000335 |
| RacialFairnessGap_PCR_absTrend | PredPol | KDE_ShortTerm | 30.1 | 20.6 | 39.7 | 5.17 | 0.00169 |
| RacialFairnessGap_avgPolShare | KDE_ShortTerm | KDE_LongTerm | −6.57 | −9.03 | −4.11 | 10 | 2.54 × |
| RacialFairnessGap_avgPolShare | PredPol | KDE_LongTerm | −11.6 | −13 | −10.2 | 1.09 | 0.353 |
| RacialFairnessGap_avgPolShare | PredPol | KDE_ShortTerm | −4.95 | −7 | −2.91 | 14 | 1.31 × |
| RacialFairnessGap_avgPolShare_absTrend | KDE_ShortTerm | KDE_LongTerm | 74.6 | 63.4 | 85.8 | 23.8 | 6.24 × |
| RacialFairnessGap_avgPolShare_absTrend | KDE_ShortTerm | PredPol | 59.6 | 50.2 | 69 | 5.38 | 0.00126 |
| RacialFairnessGap_avgPolShare_absTrend | PredPol | KDE_LongTerm | 26.5 | 14.6 | 38.4 | 14.8 | 4.89 × |
| Metric | Chi2/DoF | p-Value |
|---|---|---|
| Accuracy | 12.4 | 5.35 × |
| RacialFairnessGap_avgPolShare | 4.11 | 0.000395 |
| RacialFairnessGap_PCR | 8.18 | 7.1 × |
| NeighborhoodFairnessGap_pGini | 35.1 | 1.15 × |
| NeighborhoodFairnessGap_pcrGini | 27.4 | 6.21 × |
| RacialFairnessGap_avgPolShare_absTrend | 7.71 | 2.63 × |
| RacialFairnessGap_PCR_absTrend | 5.91 | 3.55 × |
| NeighborhoodFairness_pGini_absTrend | 8.18 | 7.19 × |
| NeighborhoodFairness_pcrGini_absTrend | 9.71 | 1.01 × |




Appendix B. Correlation Analysis of Fairness and Accuracy

| Metric | Grouping Variables | Num. Groups | Avg. Corr. | Std. Corr. | Trend | Consistency |
|---|---|---|---|---|---|---|
| avg_police_gini | | 1 | −0.976 | 0 | strong negative | high |
| avg_police_gini | number_of_police | 2 | −0.864 | 0.075 | strong negative | high |
| avg_police_gini | number_of_police + crime_type | 4 | −0.929 | 0.067 | strong negative | high |
| avg_police_gini | number_of_police + crime_type + Algorithm | 12 | −0.61 | 0.533 | moderate negative | low |
| avg_policeCrime_ratio_gini | | 1 | −0.957 | 0 | strong negative | high |
| avg_policeCrime_ratio_gini | number_of_police | 2 | −0.842 | 0.118 | strong negative | high |
| avg_policeCrime_ratio_gini | number_of_police + crime_type | 4 | −0.606 | 0.603 | moderate negative | low |
| avg_policeCrime_ratio_gini | number_of_police + crime_type + Algorithm | 12 | −0.429 | 0.67 | moderate negative | low |
| avg_racial_fairnessGap_PCR | | 1 | −0.253 | 0 | weak negative | high |
| avg_racial_fairnessGap_PCR | crime_type | 2 | −0.314 | 0.523 | weak negative | low |
| avg_racial_fairnessGap_PCR | crime_type + number_of_police | 4 | −0.072 | 0.214 | very weak negative | medium |
| avg_racial_fairnessGap_PCR | crime_type + number_of_police + Algorithm | 12 | 0.09 | 0.431 | very weak positive | low |
| avg_racial_fairnessGap_avgPolShare | | 1 | −0.131 | 0 | very weak negative | high |
| avg_racial_fairnessGap_avgPolShare | crime_type | 2 | −0.27 | 0.522 | weak negative | low |
| avg_racial_fairnessGap_avgPolShare | crime_type + number_of_police | 4 | −0.094 | 0.518 | very weak negative | low |
| avg_racial_fairnessGap_avgPolShare | crime_type + number_of_police + Algorithm | 12 | −0.065 | 0.566 | very weak negative | low |
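One way to produce grouped summaries of the kind shown in the table above is sketched below. It rests on several assumptions that the table itself does not confirm: that each fairness metric is correlated with accuracy using the Pearson coefficient, that the correlation is computed separately within each group defined by the grouping variables, and that the spread across groups is a population standard deviation (which gives 0 for a single group). The function names and all data values are hypothetical.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, pstdev


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


def grouped_correlation(runs, metric, group_keys):
    """Correlate accuracy with `metric` inside each group, then summarize."""
    groups = defaultdict(list)
    for run in runs:
        groups[tuple(run[k] for k in group_keys)].append(run)
    corrs = [
        pearson([r["accuracy"] for r in g], [r[metric] for r in g])
        for g in groups.values()
    ]
    return {
        "num_groups": len(corrs),
        "avg_corr": mean(corrs),
        # Population standard deviation, so a single group yields 0.
        "std_corr": pstdev(corrs),
    }


# Hypothetical simulation summaries (values are illustrative only).
runs = [
    {"number_of_police": 40, "accuracy": 0.10, "avg_police_gini": 0.90},
    {"number_of_police": 40, "accuracy": 0.20, "avg_police_gini": 0.80},
    {"number_of_police": 40, "accuracy": 0.30, "avg_police_gini": 0.75},
    {"number_of_police": 400, "accuracy": 0.40, "avg_police_gini": 0.60},
    {"number_of_police": 400, "accuracy": 0.50, "avg_police_gini": 0.45},
    {"number_of_police": 400, "accuracy": 0.60, "avg_police_gini": 0.40},
]

overall = grouped_correlation(runs, "avg_police_gini", [])
by_police = grouped_correlation(runs, "avg_police_gini", ["number_of_police"])
```

An empty list of grouping variables places every run in one group, which corresponds to the rows with a single group and a standard deviation of 0.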
References
1. Zubair, T.; Fatima, S.K.; Ahmed, N.; Khan, A. Crime Hotspot Prediction Using Deep Graph Convolutional Networks. arXiv 2025, arXiv:2506.13116.
2. Mohler, G.O.; Short, M.B.; Brantingham, P.J.; Schoenberg, F.P.; Tita, G.E. Self-exciting point process modeling of crime. J. Am. Stat. Assoc. 2011, 106, 100–108.
3. Braga, A.A.; Turchan, B.; Papachristos, A.V.; Hureau, D.M. Hot spots policing of small geographic areas effects on crime. Campbell Syst. Rev. 2019, 15, e1046.
4. Bowers, K.J.; Johnson, S.D.; Pease, K. Prospective hot-spotting: The future of crime mapping? Br. J. Criminol. 2004, 44, 641–658.
5. Chapman, A.; Grylls, P.; Ugwudike, P.; Gammack, D.; Ayling, J. A Data-driven analysis of the interplay between Criminological theory and predictive policing algorithms. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency; ACM: New York, NY, USA, 2022; pp. 36–45.
6. Perry, W.L.; McInnis, B.; Price, C.C.; Smith, S.C.; Hollywood, J.S. Predictive Policing: The Role of Crime Forecasting in Law Enforcement Operations; Rand Corporation: Santa Monica, CA, USA, 2013.
7. Mohler, G.O.; Short, M.B.; Malinowski, S.; Johnson, M.; Tita, G.E.; Bertozzi, A.L.; Brantingham, P.J. Randomized controlled field trials of predictive policing. J. Am. Stat. Assoc. 2015, 110, 1399–1411.
8. Hu, Y.; Wang, F.; Guin, C.; Zhu, H. A spatio-temporal kernel density estimation framework for predictive crime hotspot mapping and evaluation. Appl. Geogr. 2018, 99, 89–97.
9. Lum, K.; Isaac, W. To predict and serve? Significance 2016, 13, 14–19.
10. Ensign, D.; Friedler, S.A.; Neville, S.; Scheidegger, C.; Venkatasubramanian, S. Runaway feedback loops in predictive policing. In Proceedings of the Conference on Fairness, Accountability and Transparency, New York, NY, USA, 23–24 February 2018; pp. 160–171.
11. Akpinar, N.J.; De-Arteaga, M.; Chouldechova, A. The effect of differential victim crime reporting on predictive policing systems. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency; ACM: New York, NY, USA, 2021; pp. 838–849.
12. Mohler, G.; Raje, R.; Carter, J.; Valasik, M.; Brantingham, J. A penalized likelihood method for balancing accuracy and fairness in predictive policing. In 2018 IEEE International Conference on Systems, Man, and Cybernetics (SMC); IEEE: New York, NY, USA, 2018; pp. 2454–2459.
13. American Civil Liberties Union. Plaintiffs Win Justice in Illegal Arrests Lawsuit Settlement with Baltimore City Police. 2010. Available online: https://www.aclu.org/press-releases/plaintiffs-win-justice-illegal-arrests-lawsuit-settlement-baltimore-city-police (accessed on 8 September 2025).
14. American Civil Liberties Union. ACLU Condemns Baltimore Police Department for Failing to Comply with Settlement Agreement in Illegal Arrests Case. 2012. Available online: https://www.aclu.org/press-releases/aclu-condemns-baltimore-police-department-failing-comply-settlement-agreement-illegal (accessed on 8 September 2025).
15. United States Department of Justice. Investigation of the Baltimore City Police Department. 2016. Available online: https://www.justice.gov/archives/opa/file/883366/dl?inline (accessed on 8 September 2025).
16. Brown, L.T. The Black Butterfly: The Harmful Politics of Race and Space in America; JHU Press: Baltimore, MD, USA, 2021.
17. Pietila, A. Not in My Neighborhood: How Bigotry Shaped a Great American City; Bloomsbury Publishing USA: New York, NY, USA, 2010.
18. Time Magazine. State of Emergency Is Declared in Baltimore as Riots Erupt. 27 April 2015. Available online: https://time.com/3837454/baltimore-looting-clashes-freddie-gray-police-protesters (accessed on 1 April 2026).
19. Makarechi, K. The Clock Didn’t Start with the Riots: Baltimore and Freddie Gray. 2015. Available online: https://www.vanityfair.com/news/2015/04/baltimore-riots-freddie-gray (accessed on 8 September 2025).
20. Prudente, T. Baltimore Mayor to Bring in Crime Fighting Strategist with High-Tech Policing Model. 2018. Available online: https://web.archive.org/web/20180201101655/http://www.baltimoresun.com/news/maryland/crime/bs-md-ci-sean-malinowski-20180123-story.html (accessed on 1 April 2026).
21. Zumer, B. Baltimore Police Department to Launch Predictive Policing Strategy. 2018. Available online: https://foxbaltimore.com/news/local/baltimore-police-to-launch-predictive-policing-strategy (accessed on 9 January 2025).
22. Hanlon, B.; Vicino, T.J. The fate of inner suburbs: Evidence from metropolitan Baltimore. Urban Geogr. 2007, 28, 249–275.
23. Alexander, M. The New Jim Crow: Mass Incarceration in the Age of Colorblindness, revised ed.; The New Press: New York, NY, USA, 2012.
24. No Boundaries Coalition. Over-Policed, Yet Underserved: The People’s Findings Regarding Police Encounters and Accountability in Central West Baltimore. Technical Report, No Boundaries Coalition, 2016. Available online: https://www.noboundariescoalition.com/wp-content/uploads/2016/03/No-Boundaries-Layout-Web-1.pdf (accessed on 1 April 2026).
25. Densley, J.A. Over-policed and under-protected: Police violence as a symptom and cause of urban violence in America’s Black communities. In Public Health, Mental Health, and Mass Atrocity Prevention; Routledge: Abingdon, UK, 2021; pp. 71–88.
26. CBS News. Violent Crime Rate Spikes in Baltimore After Freddie Gray’s Death in Police Custody. 2015. Available online: https://www.cbsnews.com/news/violent-crime-rate-spikes-baltimore-freddie-gray-death-police-custody-2015/ (accessed on 12 November 2025).
27. Koerth-Baker, M.; Bronner, L. Charts: Baltimore Crime Before and After Freddie Gray’s Funeral. FiveThirtyEight. 2015. Available online: https://www.fivethirtyeight.com/features/charts-baltimore-crime-before-and-after-freddie-grays-funeral/ (accessed on 12 November 2025).
28. Pew Research Center. Multiple Causes Seen for Baltimore Unrest. 2015. Available online: https://www.pewresearch.org/politics/2015/05/04/multiple-causes-seen-for-baltimore-unrest/ (accessed on 1 April 2026).
29. CBC News. Baltimore Riots Prompt State of Emergency After Freddie Gray’s Funeral. 2015. Available online: https://www.cbc.ca/news/world/baltimore-riots-prompt-state-of-emergency-after-freddie-gray-funeral-1.3051048 (accessed on 12 November 2025).
30. Los Angeles Times. CVS Pharmacy Emerges as Symbolic Flashpoint of Baltimore Riot. 2015. Available online: https://www.latimes.com/nation/la-na-cvs-pharmacy-baltimore-riots-20150428-story.html (accessed on 12 November 2025).
31. CNN. DEA: Prescription Drugs Stolen in Baltimore Flooding the Streets. 2015. Available online: https://www.cnn.com/2015/06/25/politics/baltimore-drug-market-freddie-gray (accessed on 12 November 2025).
32. The Guardian. Baltimore Timeline: The Year Since Freddie Gray’s Arrest. The Guardian, 27 April 2016. Available online: https://www.theguardian.com/us-news/2016/apr/27/baltimore-freddie-gray-arrest-protest-timeline (accessed on 10 November 2025).
33. Baltimore Action Legal Team. About Us. 2024. Available online: https://www.baltimoreactionlegal.org/aboutus (accessed on 10 November 2025).
34. No Boundaries Coalition. About Us. 2024. Available online: https://www.noboundariescoalition.com/about-us/ (accessed on 10 November 2025).
35. Leaders of a Beautiful Struggle. About. 2024. Available online: https://lbsbaltimore.com/about/ (accessed on 10 November 2025).
36. Mohler, G. Marked point process hotspot maps for homicide and gun crime prediction in Chicago. Int. J. Forecast. 2014, 30, 491–497.
37. Mashiat, T.; Gitiaux, X.; Rangwala, H.; Das, S. Counterfactually fair dynamic assignment: A case study on policing. In Proceedings of the 2023 International Conference on Autonomous Agents and Multiagent Systems; International Foundation for Autonomous Agents and Multiagent Systems (IFAAMAS): Richland, SC, USA, 2023; pp. 2526–2528.
38. Griffard, M. A Bias-Free Predictive Policing Tool: An Evaluation of the NYPD’s Patternizr. Fordham Urban Law J. 2019, 47, 43.
39. Repasky, M.; Wang, H.; Xie, Y. Multi-Agent Reinforcement Learning for Joint Police Patrol and Dispatch. arXiv 2024, arXiv:2409.02246.
40. Barbosa, S.E.; Petty, M.D. Exploiting spatio-temporal patterns using partial-state reinforcement learning in a synthetically augmented environment. Prog. Artif. Intell. 2015, 3, 55–71.
41. Parzen, E. On estimation of a probability density function and mode. Ann. Math. Stat. 1962, 33, 1065–1076.
42. Rosenblatt, M. Remarks on Some Nonparametric Estimates of a Density Function. Ann. Math. Stat. 1956, 27, 832–837.
43. Chainey, S.; Tompson, L.; Uhlig, S. The utility of hotspot mapping for predicting spatial patterns of crime. Secur. J. 2008, 21, 4–28.
44. Mohler, G.O.; Short, M.B.; Malinowski, S.; Johnson, M.; Tita, G. Systems and Methods for Predictive Policing. US Patent US8949164B1, 3 February 2015. Available online: https://patents.google.com/patent/US8949164B1/en (accessed on 12 November 2025).
45. Vivek, M.; Prathap, B.R. Spatio-temporal crime analysis and forecasting on twitter data using machine learning algorithms. SN Comput. Sci. 2023, 4, 383.
46. Tam, S.; Tanriöver, Ö.Ö. Multimodal deep learning crime prediction using tweets. IEEE Access 2023, 11, 93204–93214.
47. Joe, W.; Lau, H.C.; Pan, J. Reinforcement learning approach to solve dynamic bi-objective police patrol dispatching and rescheduling problem. In Proceedings of the International Conference on Automated Planning and Scheduling; AAAI Press: Palo Alto, CA, USA, 2022; Volume 32, pp. 453–461.
48. Chen, H.; Wu, Y.; Wang, W.; Zheng, Z.; Ma, J.; Zhou, B. A risk-aware multi-objective patrolling route optimization method using reinforcement learning. In Proceedings of the 2023 IEEE 29th International Conference on Parallel and Distributed Systems (ICPADS); IEEE: New York, NY, USA, 2023; pp. 1637–1644.
49. Chen, H.; Wu, Y.; Wang, W.; Zheng, Z.; Ma, J.; Zhou, B. Optimizing Patrolling Route with a Risk-Aware Reinforcement Learning Model. Preprint. SSRN 4752931, 2023. Available online: https://ssrn.com/abstract=4752931 (accessed on 1 April 2026).
50. Joe, W.; Lau, H.C. Learning to Send Reinforcements: Coordinating Multi-Agent Dynamic Police Patrol Dispatching and Rescheduling via Reinforcement Learning. 2023. Available online: https://dl.acm.org/doi/10.24963/ijcai.2023/18 (accessed on 11 January 2025).
51. Brantingham, P.J. The logic of data bias and its impact on place-based predictive policing. Ohio State J. Crim. Law 2017, 15, 473.
52. Brantingham, P.J.; Valasik, M.; Mohler, G.O. Does predictive policing lead to biased arrests? Results from a randomized controlled trial. Stat. Public Policy 2018, 5, 1–6.
53. Lagioia, F.; Rovatti, R.; Sartor, G. Algorithmic fairness through group parities? The case of COMPAS-SAPMOC. AI Soc. 2023, 38, 459–478.
54. Wang, H.; Grgic-Hlaca, N.; Lahoti, P.; Gummadi, K.P.; Weller, A. An empirical study on learning fairness metrics for compas data with human supervision. arXiv 2019, arXiv:1910.10255.
55. Dressel, J.; Farid, H. The accuracy, fairness, and limits of predicting recidivism. Sci. Adv. 2018, 4, eaao5580.
56. Helms, J.M.; Madden, A. Assessment of Data-Driven Deployment by the Memphis Police Department. Fall 2020 Report. Technical Report, Public Safety Institute, University of Memphis, Memphis, TN, USA, Fall 2020. Available online: https://memphiscrime.org/wp-content/uploads/2020/02/PSI-MPD-Data-Driven-Assessment.pdf (accessed on 20 February 2025).
57. Baltimore Police Department. New Technology Initiatives. 2024. Available online: https://www.baltimorepolice.org/resources-and-reports/new-technology-initiatives (accessed on 6 February 2026).
58. Maryland Crime Research and Innovation Center. MCRIC Partners with Baltimore City on Data-Driven Policing Research. Available online: https://bsos.umd.edu/academics-research/maryland-crime-research-and-innovation-center-mcric-mcric-partners-baltimore (accessed on 6 February 2026).
59. Raji, I.; Sholademi, D.B. Predictive Policing: The Role of AI in Crime Prevention. Int. J. Comput. Appl. Technol. Res. 2024, 13, 66–78.
60. Mandalapu, V.; Elluri, L.; Vyas, P.; Roy, N. Crime prediction using machine learning and deep learning: A systematic review and future directions. IEEE Access 2023, 11, 60153–60170.
61. Baltimore Police Department. Part 1 Crime Data. Available online: https://data.baltimorecity.gov/datasets/baltimore::part-1-crime-data (accessed on 11 March 2023).
62. City of Baltimore. Neighborhood Demographic and Spatial Data. Available online: https://data.baltimorecity.gov/datasets/neighborhood-1 (accessed on 23 August 2022).
63. City of Baltimore. Neighborhood Boundary KML File. Available online: https://data.baltimorecity.gov/datasets/baltimore::neighborhood-1 (accessed on 15 March 2023).
64. Morgan, R.E.; Truman, J.L. Criminal Victimization, 2019. Technical Report NCJ 255113, Bureau of Justice Statistics, U.S. Department of Justice, 2020. Available online: https://bjs.ojp.gov/content/pub/pdf/cv19.pdf (accessed on 13 June 2025).
65. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
66. Dwork, C.; Hardt, M.; Pitassi, T.; Reingold, O.; Zemel, R. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference; ACM: New York, NY, USA, 2012; pp. 214–226.
67. Friedler, S.A.; Scheidegger, C.; Venkatasubramanian, S.; Choudhary, S.; Hamilton, E.P.; Roth, D. A comparative study of fairness-enhancing interventions in machine learning. In Proceedings of the Conference on Fairness, Accountability, and Transparency; ACM: New York, NY, USA, 2019; pp. 329–338.
68. Abeles, J.; Conway, D.J. The Gini coefficient as a useful measure of malaria inequality among populations. Malar. J. 2020, 19, 444.
69. Thomas, V.; Wang, Y.; Fan, X. Measuring Education Inequality: Gini Coefficients of Education for 140 Countries, 1960–2000. J. Educ. Plan. Adm. 2003, 17, 5–33. Available online: https://www.niepa.ac.in/download/Publications/JEPA_(15%20years)/JEPA%202003_Vol-17%20(1-4)/JEPA_JAN-2003-VOL17_1%20Final.pdf#page=5 (accessed on 1 April 2026).
70. De Maio, F.G. Income inequality measures. J. Epidemiol. Community Health 2007, 61, 849–852.
71. Semsar, S. Predictive Policing Project Code and Data. 2025. Available online: https://github.com/saminsemsar/Data_Analysis_Portfolio/tree/main/PredictivePolicing (accessed on 1 April 2026).
72. Halford, E.; Giannoulis, M.; Condon, C.; Keningale, P. Do hotspot policing interventions against optimal foragers cause crime displacement? Int. J. Law Crime Justice 2024, 77, 100654.
73. Ferguson, A.G. Predictive policing and reasonable suspicion. Emory Law J. 2012, 62, 259.






























| Scenario | Crime Input | Number of Officers | Report Setting |
|---|---|---|---|
| S1 | AGG. ASSAULT | 40 | Detected only () |
| S2 | AGG. ASSAULT | 40 | Detected + reported () |
| S3 | AGG. ASSAULT | 400 | Detected only () |
| S4 | AGG. ASSAULT | 400 | Detected + reported () |
| S5 | TOTAL | 40 | Detected only () |
| S6 | TOTAL | 40 | Detected + reported () |
| S7 | TOTAL | 400 | Detected only () |
| S8 | TOTAL | 400 | Detected + reported () |
| Metric | Avg. Dif. Long-Term KDE | Avg. Dif. Short-Term KDE |
|---|---|---|
| Average accuracy | ||
| Racial fairness gap (PCR) | ||
| Racial fairness gap (avg. police share) | ||
| Neighborhood fairness gap (PCR Gini) | ||
| Neighborhood fairness gap (police Gini) |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Semsar, S.; Prabhu, K.L.; Waters, G.; Foulds, J. A Comparative Simulation Study of the Fairness and Accuracy of Predictive Policing Systems in Baltimore City. Algorithms 2026, 19, 398. https://doi.org/10.3390/a19050398



