A Spatial and Cluster-Based Framework for Identifying Railroad Trespassing Hotspots

Mohammed, Habeeb; Liu, Rongfang; Jiang, Steven

doi:10.3390/systems14040396

by Habeeb Mohammed ¹, Rongfang Liu ²,³ and Steven Jiang ¹,*

¹ Department of Industrial & Systems Engineering, North Carolina A&T State University, Greensboro, NC 27411, USA

² Transportation Institute, North Carolina A&T State University, Greensboro, NC 27411, USA

³ Department of Marketing and Supply Chain Management, North Carolina A&T State University, Greensboro, NC 27411, USA

* Author to whom correspondence should be addressed.

Systems 2026, 14(4), 396; https://doi.org/10.3390/systems14040396

Submission received: 14 February 2026 / Revised: 29 March 2026 / Accepted: 31 March 2026 / Published: 3 April 2026

(This article belongs to the Special Issue Multimodal and Intermodal Transportation Systems in the AI Era)

Abstract

Rail trespassing remains a persistent safety challenge at the system level in the United States, with a 24% increase in incidents within the last decade (2016–2025). Identifying hotspots proactively is difficult due to limited incident data and strong spatial dependencies within the built environment. This study thus creates a ZIP-code–level geospatial analytics framework to identify current and emerging trespassing hotspots across North Carolina by combining land-use composition, rail exposure metrics, and historical Federal Railroad Administration (FRA) trespassing records. Geospatial layers were integrated within a GIS workflow to derive attributes such as rail miles, grade crossings, population density, and land-use types. Exploratory spatial analysis showed significant clustering of trespassing incidents, with Global Moran’s I indicating positive spatial autocorrelation across multiple neighborhood sizes. Permutation z-scores confirmed non-random hotspot formation along major rail corridors. A k-means clustering method also identified four structural risk environments, and a Composite Risk Index (CRI) was developed from weighted, standardized exposure and land-use variables to quantify latent risk, independent of raw casualty counts. Results indicate that clusters characterized by higher rail infrastructure exposure and mixed land-use environments exhibit the highest CRI values and elevated hotspot probabilities. In contrast, clusters with limited rail infrastructure, including predominantly commercial and rural ZIP codes, show substantially lower risk levels. The findings highlight that trespassing risk is more strongly associated with structural exposure conditions than with isolated historical incident counts. The resulting risk surfaces and hotspots provide an interpretable and scalable framework for statewide safety planning, early hotspot detection, and targeted interventions by transportation agencies.

Keywords:

rail trespassing; spatial autocorrelation; hotspot analysis; Composite Risk Index; systems modeling; spatial data; transportation safety

1. Introduction

Railroad trespassing persists as a significant and perilous issue, causing numerous fatalities, injuries, and operational interruptions globally. In the United States, trespassing on railroads results in hundreds of deaths annually, representing a considerable proportion of rail-related casualties [1]. Despite increased awareness and ongoing safety initiatives, the prevalence of railroad trespassing continues to be a serious public safety concern. While behavioral and demographic aspects are commonly investigated in relation to trespassing, the physical and spatial environment where these incidents occur has received less scrutiny. A literature review reveals that the built and natural environment—including elements such as pedestrian crossings, fencing, land use, and proximity to rail tracks—significantly influences the likelihood of trespassing [2,3]. However, comprehensive investigations of these spatial correlations, particularly through geographic information systems, remain limited. Most existing studies rely on aggregated data or anecdotal evidence, providing little insight into the micro-level characteristics that elevate the probability of trespassing occurrences. There is a critical need for location-specific, data-driven analyses that incorporate spatial and environmental attributes to enhance the effectiveness of prevention strategies. Spatial correlation refers to the degree to which observations located near one another exhibit similar values, reflecting the spatial dependence described by Tobler’s First Law of Geography [4]. It is commonly quantified using global or local measures such as Moran’s I [5], Geary’s C [6], or Local Indicators of Spatial Association (LISA) statistics [7].

This study addresses this need by investigating the relationship between the physical characteristics of the railroad environment and the spatial distribution of trespassing incidents. Using GIS data and spatial modeling methods, this research identifies high-risk locations based on established environmental features and constructs a framework for identifying trespassing hotspots. By employing spatial analysis and unsupervised learning methodologies, this study aims to go beyond retrospective analysis and support proactive safety planning. The primary objectives of this study are: (1) to analyze the spatial distribution of railroad trespassing incidents within North Carolina, USA; (2) to identify the physical and environmental factors most closely associated with these incidents; and (3) to develop a model to identify potential hotspots to aid rail authorities in targeting preventive measures. Ultimately, the findings are expected to facilitate more effective, geographically targeted interventions, decrease trespassing-related injuries and fatalities, and contribute to the expanding body of knowledge on spatial risk modeling in transportation safety. The next section discusses previous studies that have used these techniques in different contexts. Subsequent sections detail the methodology employed in this study, the results of the analysis, and their implications for transportation safety policy.

2. Literature Review

Railroad trespassing poses a persistent and complex challenge, endangering individuals, disrupting rail services, and jeopardizing public safety. Understanding its root causes is crucial for developing effective prevention strategies [8]. Research highlights a complex interplay of factors: railroad infrastructure design, surrounding land use, and human activity patterns as key drivers of trespassing incidents [9]. Silla & Luoma [10] found that residents near tracks often lack sufficient legal crossings, necessitating trespassing, especially in districts where homes are separated from city centers by railroad lines. Furthermore, commercial areas near stations often experience higher rates of crime, including trespassing, although with different patterns than residential areas, varying by day and time [11].

The extent of rail infrastructure, typically measured in rail miles or track length, is a foundational exposure variable in trespassing risk analysis. Studies have shown that by simply expanding the physical interface between rail operations and the surrounding environment, the opportunity for unauthorized access is increased [12,13]. This aligns with the exposure-opportunity framework in injury prevention, where risk is a function of both the frequency of potential encounters and the inherent hazards present [14]. A study by Searcy et al. [14] further found that location-specific characteristics, including rail mileage, explained nearly half (48.9%) of the variation in daily pedestrian trespassing events in a 10-site subset analyzed in the study. Similarly, Kang et al. [15] used a mixed-effects negative binomial model at the county level across the United States and found that rail track length significantly influenced the frequency of trespassing crashes, alongside demographic factors such as population density and age structure.

Pedestrian crossings such as grade crossings, underpasses, overpasses, and unauthorized (informal) paths represent critical points of interface between rail infrastructure and public movement [14]. Findings from the Federal Railroad Administration (FRA) indicate that a substantial proportion of pedestrian trespasser fatalities occur within 1000 feet of a grade crossing, underscoring the importance of crossing density and design in risk mitigation [16]. According to a report by the Florida Department of Transportation Freight and Multi-modal Operations Office [17], the presence, type, and accessibility of crossings shape pedestrian behavior, influencing whether individuals use legal routes or resort to trespassing as a shortcut. Searcy et al. [14] further showed that proximity to authorized crossings was inversely related to trespassing frequency: sites with distant or inaccessible crossings experienced higher rates of illegal crossing, while those with nearby, well-designed crossings saw reduced trespassing events. Crossings-per-mile and crossings-per-area are commonly used as exposure metrics in predictive models and cluster analysis, and their inclusion in risk models is essential for capturing the accessibility landscape, understanding behavioral motivations, and designing targeted interventions [14].

Another factor considered is population density, which reflects the concentration of people living, working, or moving near rail infrastructure, thereby modulating the frequency of potential rail–pedestrian interactions. Consequently, high-density areas are hypothesized to experience greater trespassing risk due to increased pedestrian flows, land-use pressures, and the likelihood of informal access points [14]. A substantial body of research confirms that population density is a significant predictor of trespassing incidents. Kang et al. [15] found that at the county level in the U.S., higher population density was associated with increased rail trespass crash frequency, even after controlling for rail miles and train traffic. Similarly, in the Czech Republic, urbanized areas with high residential and industrial development reported trespassing frequencies as high as 10 cases per hour. Grabušić and Barić [18] quantified the effect, noting that a population density of 100 people per 1.5 km² led to an increase in trespassing accidents from 4.8% to 8.18%. Surveys indicate that the majority of trespassers are local residents, and that incidents are more likely to occur close to home, especially in urban environments [13].

Closely related to population density is the zoning and land-use composition of an area. They define the functional character of areas adjacent to rail infrastructure, shaping both the motivations for and patterns of trespassing. Research consistently demonstrates that land use and zoning are key determinants of trespassing behavior. Skládaná et al. [8] found that the pattern of functional area types—especially the combination of housing, shopping, industrial, and public services—was a crucial factor in the motivation for railroad trespassing in the Czech Republic. Regression models also showed that the density of pedestrian attractors—such as schools, social services, and restaurants—within one mile of observation sites was significantly associated with increased trespassing events [14]. For instance, in Florida, trespassing hotspots were frequently located where residential neighborhoods bordered recreational facilities without reasonable legal pedestrian routes, prompting shortcut behavior [17]. Land-use variables are also critical in risk-based prioritization and intervention design. Agencies such as the Long Island Rail Road (LIRR) and New Jersey Transit (NJT) incorporate land-use context into hazard analyses and fencing policies, recognizing that proximity to schools, parks, and commercial establishments increases the need for targeted mitigation [12].

In terms of analysis, Geographic Information Systems (GIS) provide a comprehensive framework for managing and exploring spatially referenced data, enabling researchers to examine geographic patterns and relationships across multiple scales [19]. GIS has been extensively utilized in transportation safety for various purposes, including risk mapping and hotspot prediction. Li et al. [20], for instance, use GIS to display locations of intra-city motor vehicle crashes as well as the hotspots. Bilim [21] similarly uses the same tool for spatial autocorrelation and kernel density estimation to obtain the distribution and identification of the most critical locations for pedestrian road crashes. GIS techniques such as Kernel Density Estimation (KDE), Getis-Ord Gi*, and Moran’s I have proven useful in identifying accident-prone areas and hotspots by aiding in clustering accidents and identifying black spots, which are crucial for planning and safety interventions [22,23]. By combining physical data such as infrastructure layouts, topography, and land use with human-related data, including population density, behavioral patterns, and socioeconomic indicators, GIS facilitates a comprehensive understanding of the factors contributing to risks in a given area. This integrative capacity makes it possible to analyze how environmental and human variables interact to create or exacerbate hazardous conditions. Moreover, GIS enhances decision-making processes by offering detailed, location-specific insights into potential hazards, areas of vulnerability, and high-risk zones [24].

Generalized Linear Models (GLMs), particularly those employing the negative binomial distribution, are commonly used to analyze count data such as crash frequencies. These models are especially suited for handling overdispersion, a frequent characteristic of crash data. To further enhance the predictive power and account for spatial characteristics inherent in safety data, researchers have increasingly integrated spatial regression techniques and a combination of generalized maximum likelihood estimation (GMLE) and generalized extreme value mixture model (GEVMM), which has been shown to be robust in complex probabilistic modeling problems [25]. Notably, Geographically Weighted Poisson Regression (GWPR) and Geographically Weighted Negative Binomial Regression (GWNBR) have been applied to capture spatial dependency and heterogeneity in crash occurrences [26]. These approaches allow model parameters to vary across geographic space, thus providing localized insights that traditional global models may overlook.

Although railroad trespassing is increasingly recognized as a significant safety concern, current research predominantly emphasizes behavioral, demographic, or temporal dimensions, with inadequate consideration of physical and spatial determinants. Existing spatial analyses often provide descriptive insights but fail to integrate data pertaining to physical infrastructure, such as access routes, barriers, land utilization, or adjacency to urban areas. This deficiency curtails a holistic understanding of how the built environment influences trespassing incidents, thereby impeding the formulation of targeted preventative measures. Furthermore, the application of unsupervised learning within a GIS to identify potential hotspots based on environmental attributes remains limited. This study addresses these limitations by synthesizing granular spatial data with analytical techniques to identify physical factors correlated with trespassing and to produce actionable hotspot maps. By merging physical infrastructure characteristics with geospatial behavioral patterns, this research promotes a more holistic and forward-thinking strategy for railroad safety management.

3. Methodology

To achieve our research objectives, we develop a spatially explicit, ZIP-code-level model that identifies current and emerging railroad trespassing hotspots across North Carolina. The methodological framework integrates geospatial data processing, exploratory spatial analysis, and cluster-based risk characterization for hotspot identification.

3.1. Study Area Description

Situated in the southeastern United States, North Carolina features a complex rail network that extends across diverse urban, suburban, and rural terrains. The state’s railroad infrastructure is crucial for both freight and passenger transport, incorporating over 3600 miles of active lines managed by Class I railroads, regional carriers, and short-line operators [27]. Key rail corridors link major urban centers, including Charlotte, Raleigh, Greensboro, and Wilmington, while smaller branch lines serve industrial, agricultural, and rural sectors. This network, characterized by both high-traffic urban sections and isolated rural tracks, offers a varied geographical context for studying railroad trespassing patterns. This study includes all accessible rail segments within North Carolina, utilizing geospatial data from the North Carolina Department of Transportation (NCDOT), Federal Railroad Administration (FRA) trespassing incident records, and supplementary spatial layers like land use and demographic overlays. The statewide scope allows for a comprehensive analysis that considers regional differences in rail environments, providing broader insights into trespassing risk factors beyond specific hotspots.

3.2. Data Sources and Preprocessing

3.2.1. Trespassing Incident Data

This study integrates multiple publicly available datasets from the FRA to construct a spatial dataset for railroad trespassing risk analysis across North Carolina. Two primary data sources were used: the FRA Injury and Illness Summary database (Form 55A) [1] and the FRA Crossing Inventory Database (Form 71) [28]. The casualty dataset contains detailed records of railroad-related injuries and fatalities reported nationwide. It includes detailed information for each recorded event, such as geographic coordinates, location descriptions, and the exact date and time of occurrence, among other variables. Geo-coding of incidents did not start until 2011; therefore, our analysis uses data from 2011 to 2024. After filtering for incidents by trespassers within North Carolina, the dataset used in this study contains 511 observations and 64 attributes. These records provide the basis for identifying historical patterns of railroad trespassing incidents.

3.2.2. Railroad Crossings

The second dataset, the FRA Crossing Inventory (Form 71), provides detailed information on railroad crossings across the United States. The North Carolina subset used in this study contains 12,819 crossings and 257 variables describing crossing characteristics, including crossing control devices, roadway characteristics, train traffic levels, and surrounding land-use conditions. This dataset aids in quantifying rail–pedestrian interface exposure at the ZIP-code level. The raw dataset contains point-based records of public and private crossings, including attributes related to location, crossing type, and operational status. Initial preprocessing involved filtering the dataset to retain only active and public crossings relevant to the study period. All crossing locations were converted to the projected coordinate reference system consistent with the rest of the spatial datasets (NAD83 State Plane, ftUS) to ensure spatial alignment and accurate aggregation.

To generate a ZIP-level measure of crossing exposure, the cleaned crossing points were spatially joined to ZIP code polygons. For each ZIP code, the total number of crossings was computed by counting the crossing points falling within its boundary. This aggregation step produced a single, interpretable variable representing the crossing count at the ZIP level. The resulting crossing count variable was subsequently merged with other ZIP-level attributes, including population density, rail mileage, and land-use composition, forming a unified analytical dataset. This crossing metric captures the frequency of potential interaction points between rail operations and roadway or pedestrian traffic and serves as a key explanatory variable in subsequent clustering and hotspot characterization.
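The point-in-polygon count described above can be sketched in a few lines of standard-library Python. This is an illustrative sketch only, not the QGIS workflow used in the study: the rectangular ZIP outlines, the crossing coordinates, and the names `point_in_polygon` and `count_points_per_zip` are all hypothetical, and real ZIP polygons with holes or multiple parts would need a proper GIS library.

```python
from collections import Counter


def point_in_polygon(x, y, ring):
    """Ray-casting test: True if (x, y) lies inside the ring of (x, y) vertices."""
    inside = False
    n = len(ring)
    for k in range(n):
        x1, y1 = ring[k]
        x2, y2 = ring[(k + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            # x-coordinate at which the edge crosses that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def count_points_per_zip(points, zip_polygons):
    """Count the points falling inside each ZIP polygon (zero counts are kept)."""
    counts = Counter({zip_code: 0 for zip_code in zip_polygons})
    for x, y in points:
        for zip_code, ring in zip_polygons.items():
            if point_in_polygon(x, y, ring):
                counts[zip_code] += 1
                break  # assume ZIP polygons do not overlap
    return dict(counts)


# Hypothetical data: two adjacent square ZIP polygons in projected feet.
zip_polygons = {
    "ZIP_A": [(0, 0), (10, 0), (10, 10), (0, 10)],
    "ZIP_B": [(10, 0), (20, 0), (20, 10), (10, 10)],
}
crossings = [(2, 3), (5, 5), (12, 4), (25, 5)]  # the last point is outside both

crossing_counts = count_points_per_zip(crossings, zip_polygons)
print(crossing_counts)  # {'ZIP_A': 2, 'ZIP_B': 1}
```

The same counting logic applies to any point layer aggregated to ZIP polygons, provided all layers share one projected coordinate system.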

The majority of railroad crossings in the study area are highway crossings, accounting for nearly 99% of all recorded locations. Pedestrian-related crossings—including pathway and station pedestrian crossings—represent a small fraction of the inventory. This distribution reflects the dominant role of roadway–rail interfaces in the regional rail network and underscores the importance of highway crossings as primary points of interaction between rail operations and public activity. Although pedestrian crossings are relatively rare, their presence may still warrant targeted consideration due to the elevated vulnerability associated with pedestrian exposure. Also, highway crossings serve as an alternative legal route for pedestrians in the absence of pedestrian crossings, according to the FRA safety policy [29].

3.2.3. Rail Mileage Calculation

Rail network data for North Carolina were obtained from the North Carolina Department of Transportation (NCDOT) online portal [30] as polyline features representing active railroad segments and were processed in QGIS® Desktop version 3.42 [31] to quantify rail exposure at the ZIP-code level. All rail geometries were first projected to NAD83 State Plane, ftUS to ensure accurate length calculations. The rail layer was then spatially intersected with ZIP code polygon boundaries, resulting in segmented rail line features corresponding to individual ZIP codes. For each intersected rail segment, geometric length was calculated in feet and subsequently converted to miles. Rail miles were then aggregated by ZIP code by summing the lengths of all rail segments within each ZIP boundary. This procedure produced a continuous ZIP-level rail mileage variable representing the total extent of railroad infrastructure present within each ZIP code. The resulting rail mileage measure was merged with other ZIP-level land-use and safety variables and used as a key exposure indicator in subsequent analysis.
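As a rough illustration of the final aggregation step, the sketch below sums intersected segment lengths (in feet) by ZIP code and converts them to miles. The segment records and the function name `rail_miles_by_zip` are hypothetical; the sketch assumes the intersection with ZIP boundaries has already been done in a GIS and that lengths are in the same foot unit as the projection.

```python
from collections import defaultdict

FEET_PER_MILE = 5280.0


def rail_miles_by_zip(segments):
    """Sum segment lengths (feet) per ZIP code and return totals in miles."""
    totals = defaultdict(float)
    for zip_code, length_ft in segments:
        totals[zip_code] += length_ft / FEET_PER_MILE
    return dict(totals)


# Hypothetical (zip_code, length_in_feet) records after intersecting rail lines
# with ZIP boundaries.
segments = [("ZIP_A", 5280.0), ("ZIP_A", 2640.0), ("ZIP_B", 10560.0)]

rail_miles = rail_miles_by_zip(segments)
print(rail_miles)  # {'ZIP_A': 1.5, 'ZIP_B': 2.0}
```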

3.2.4. Land Use Composition

Land-use composition within the state was obtained from NEXTGIS® [32] and derived using polygon-based land-use data processed in QGIS® to quantify the spatial distribution of land-use categories within each ZIP code. Land-use polygons were first intersected with ZIP code boundaries, producing subdivided land-use features representing the portion of each land-use category contained within individual ZIP codes. Following the intersection, the area of each land-use polygon segment was calculated in square feet. For each ZIP code, land-use areas were aggregated by category and divided by the total ZIP code area to compute proportional land-use coverage. This process yielded percentage measures of residential, commercial, industrial, agricultural, and other land-use types for each ZIP code. The resulting land-use composition variables capture contextual environmental characteristics and were subsequently integrated with rail infrastructure, demographic, and safety data for use in clustering and hotspot characterization. Proportional land-use metrics were employed to ensure comparability across ZIP codes of varying size and to capture the relative dominance of land-use types relevant to contextual exposure and risk. This approach is consistent with prior transportation and land-use research [33,34], which has shown that relative land-use composition better explains travel behavior, exposure patterns, and safety outcomes than absolute measures.
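The proportional land-use calculation reduces to a grouped sum divided by the ZIP area. The following is a minimal sketch under that reading; the records, the areas, and the name `land_use_shares` are hypothetical and stand in for the intersected polygon attributes exported from the GIS.

```python
from collections import defaultdict


def land_use_shares(pieces, zip_area_sqft):
    """Return {zip: {category: percent of ZIP area}} from intersected pieces."""
    area = defaultdict(lambda: defaultdict(float))
    for zip_code, category, sqft in pieces:
        area[zip_code][category] += sqft  # aggregate area by ZIP and category
    return {
        z: {cat: 100.0 * a / zip_area_sqft[z] for cat, a in cats.items()}
        for z, cats in area.items()
    }


# Hypothetical (zip_code, land_use_category, area_sqft) pieces.
pieces = [
    ("ZIP_A", "residential", 400_000.0),
    ("ZIP_A", "residential", 200_000.0),
    ("ZIP_A", "commercial", 150_000.0),
    ("ZIP_A", "agricultural", 250_000.0),
]
zip_area_sqft = {"ZIP_A": 1_000_000.0}

shares = land_use_shares(pieces, zip_area_sqft)
print(shares["ZIP_A"])
```

Because the shares are percentages of each ZIP's own area, a large rural ZIP and a small urban ZIP become directly comparable, which is the motivation the text gives for using proportions.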

Table 1 presents an illustrative sample of the final ZIP-level dataset used for the hotspot characterization. The dataset integrates demographic characteristics, rail infrastructure exposure, safety outcomes, and proportional land-use composition for each ZIP code. The sample highlights substantial heterogeneity across ZIPs, ranging from dense urban areas with extensive rail infrastructure and multiple incidents to predominantly agricultural ZIPs with minimal rail exposure and no recorded incidents. This example dataset is provided for demonstration and methodological illustration purposes.

3.3. Spatial Autocorrelation Using Moran’s I

Moran’s I is a widely used method for quantifying spatial autocorrelation. It serves as a vital tool in various disciplines, including transportation safety, for determining the degree to which values of a variable, such as trespassing incident counts, are clustered in space [35]. This procedure is therefore used to answer the first research question: How are railroad trespassing incidents spatially distributed across North Carolina, and what geographic patterns or clusters emerge within the state? The statistic assesses whether observed spatial patterns are clustered, dispersed, or random [36,37]. Moran’s I operates on the principle of comparing the similarity of values at different locations with the spatial relationships between those locations [38]. A positive Moran’s I value suggests that similar values tend to cluster together, indicating positive spatial autocorrelation, where high values are located near other high values, and low values are located near other low values [39]. Conversely, a negative Moran’s I value suggests that dissimilar values tend to be located near one another, indicating negative spatial autocorrelation, where high values are located near low values and vice versa. A Moran’s I value close to zero indicates a random spatial pattern, suggesting that the values are distributed independently of their locations. The formula for Moran’s I is expressed as:

I = \frac{N}{\sum_{i} \sum_{j} w_{i j}} \cdot \frac{\sum_{i} \sum_{j} w_{i j} (x_{i} - \bar{x}) (x_{j} - \bar{x})}{\sum_{i} {(x_{i} - \bar{x})}^{2}}

(1)

where $N$ is the number of observations, $w_{i j}$ represents the spatial weight between trespassing locations $i$ and $j$, $\bar{x}$ is the mean number of incidents across all locations, and $x_{i}$ represents the number of incidents at location $i$. The spatial weights matrix defines the spatial relationships between observations and can be based on various criteria, such as adjacency, distance, or other measures of spatial proximity [40]. Moran’s I, in essence, measures the extent to which the presence of a phenomenon in one location influences its presence in neighboring locations [41]. It has been demonstrated that results are dependent on both the number of events in the data and the degree of spatial clustering, so a single ‘appropriate’ scale is not identified [42].
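To make Equation (1) concrete, the sketch below evaluates Global Moran’s I directly from its definition on a toy example. It assumes a binary adjacency weights matrix for four locations arranged in a chain; the weights, the values, and the function name `morans_i` are illustrative assumptions and do not reproduce the neighborhood definitions used in the study.

```python
def morans_i(values, weights):
    """Global Moran's I for a list of values and a square weights matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    w_sum = sum(sum(row) for row in weights)
    # Cross-product of deviations, weighted by spatial proximity.
    num = sum(weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / w_sum) * (num / den)


# Hypothetical chain of four locations: 0-1, 1-2, 2-3 are neighbors.
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

i_gradient = morans_i([1, 2, 3, 4], chain)     # similar values adjacent
i_alternating = morans_i([1, 4, 1, 4], chain)  # dissimilar values adjacent
print(round(i_gradient, 4), round(i_alternating, 4))  # 0.3333 -1.0
```

The smooth gradient yields a positive value and the alternating pattern a negative one, matching the interpretation of positive and negative spatial autocorrelation given above.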

Moran’s I can also be calculated locally, as Local Indicators of Spatial Association (LISA), to analyze the spatial autocorrelation in smaller areas [43]. A positive local Moran’s I value for a specific location indicates that the location has neighboring locations with similar values; negative values, on the other hand, indicate that a location has neighboring locations with dissimilar values [44]. The LISA method breaks down global indicators into individual components, illuminating the contribution of each observation to the total, and evaluates the degree of spatial clustering by pinpointing significant clusters or outliers [45]. These can be visualized in cluster maps.

Data Preparation

We compiled a ZIP-code–level dataset for North Carolina (N = 766 ZIP polygons) to quantify and map spatial patterns in rail-trespassing harm. Trespassing casualty events were geocoded as points and aggregated to ZIP polygons via a point-in-polygon join to obtain a count of events per ZIP (Incidents). Population denominators (POPULATION) and polygon area in square miles (SQMI) were used to construct a population-standardized outcome, the trespassing casualty rate per 10,000 residents (rate_per_10k = 10,000 × Incidents/POPULATION). ZIPs with zero or missing population (n = 3) were excluded from rate-based statistics to avoid division by zero, leaving 763 polygons for analysis. All spatial processing was performed on valid geometries, and all layers were projected to European Petroleum Survey Group (EPSG) code 2264 (NAD83 State Plane, ftUS) for analysis. The ZIP scale was selected to align with corridor/neighborhood phenomena while retaining statewide coverage and sufficient sample size for local spatial statistics. This procedure was carried out in QGIS®. Table 2 provides further description of the dataset.
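The rate construction and the exclusion rule can be expressed as a small helper. This is a sketch with made-up records; the function name `rate_per_10k` mirrors the variable name in the text, while the dictionary layout is an assumption for illustration.

```python
def rate_per_10k(incidents, population):
    """Casualty rate per 10,000 residents; None when population is zero or missing."""
    if population is None or population <= 0:
        return None  # excluded from rate-based statistics
    return 10_000 * incidents / population


# Hypothetical ZIP records: (Incidents, POPULATION).
zips = {
    "ZIP_A": (3, 15_000),
    "ZIP_B": (0, 4_200),
    "ZIP_C": (1, 0),      # zero population, dropped from rate analysis
    "ZIP_D": (0, None),   # missing population, dropped as well
}

rates = {z: rate_per_10k(inc, pop) for z, (inc, pop) in zips.items()}
usable = {z: r for z, r in rates.items() if r is not None}
print(usable)  # {'ZIP_A': 2.0, 'ZIP_B': 0.0}
```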

3.4. Cluster-Based Hotspot Analysis

Identifying future railroad trespassing hotspots is the central objective of this study, aimed at enhancing proactive safety planning and intervention strategies. In developing the framework, we also address the second research question: Which physical and environmental factors are most strongly associated with railroad trespassing incidents in North Carolina? To achieve this, the study uses a set of environmental and spatial proximity variables that prior studies have shown to influence the likelihood of trespassing behavior. These variables include the level of rail exposure in each ZIP code (rail miles and number of crossings), population density, and land-use composition (e.g., residential, industrial, commercial). Together, these factors provide a rich spatial context that reflects both accessibility and environmental opportunity for trespassing to occur. Areas with limited pedestrian infrastructure are expected to exhibit higher trespassing risk due to the perceived need for shortcuts or lack of alternatives. Land-use type offers additional explanatory power by indicating the functional nature of the surrounding space. For instance, segments adjacent to residential neighborhoods, informal pathways, or transient spaces such as encampments may exhibit higher vulnerability to unauthorized access. Conversely, areas near commercial zones with controlled access or industrial zones with restricted entry may show different risk patterns.

3.4.1. Cluster Analysis Using k-Means

k-means clustering is applied to group ZIP codes into distinct contextual typologies based on rail infrastructure exposure, population density, and land-use composition. The clustering variables included rail miles, number of crossings, population density (POPU_SQMI), and proportional land-use measures (Pct_Residential, Pct_Commercial, Pct_Industrial, and Pct_Agric). All variables were standardized prior to clustering to ensure equal contribution to the distance metric. A known limitation of the k-means clustering algorithm is its sensitivity to the selection of initial cluster centroids: different initializations may lead to different local optima of the within-cluster sum of squares objective function, as observed in Jain [46] and Celebi et al. [47]. To mitigate this issue, the implementation used in this study employs the k-means++ initialization method, which selects initial centroids probabilistically, favoring points that lie far from the centroids already chosen so that the starting centers are well spread out. This approach has been widely shown to improve clustering stability and convergence compared to random initialization [48,49].
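The two ideas in this paragraph, standardization and k-means++ seeding, can be illustrated with a compact standard-library sketch. It is not the scikit-learn implementation used in the study: the six toy ZIP rows (rail miles, crossings, population density), the function names, and the choice of population standard deviation for the z-scores are all assumptions made for illustration.

```python
import random
from statistics import mean, pstdev


def standardize(rows):
    """Column-wise z-scores so every variable contributes equally to distances."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # guard against constant columns
    return [[(v - m) / s for v, m, s in zip(r, mus, sds)] for r in rows]


def sq_dist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))


def kmeanspp_init(points, k, rng):
    """k-means++ seeding: later centers are drawn with probability ~ squared distance."""
    centers = [list(rng.choice(points))]
    while len(centers) < k:
        d2 = [min(sq_dist(p, c) for c in centers) for p in points]
        centers.append(list(rng.choices(points, weights=d2, k=1)[0]))
    return centers


def kmeans(points, k, seed=0, max_iter=100):
    """Lloyd's algorithm with k-means++ seeding; returns (labels, inertia)."""
    rng = random.Random(seed)
    centers = kmeanspp_init(points, k, rng)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster empties
                centers[c] = [mean(col) for col in zip(*members)]
    inertia = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, inertia


# Hypothetical ZIP rows: [rail miles, crossings, population per square mile].
rows = [
    [12.0, 30, 2500], [11.0, 28, 2300], [13.0, 33, 2700],  # high exposure
    [0.5, 1, 90], [0.8, 2, 120], [0.3, 0, 60],             # low exposure
]
z = standardize(rows)
labels, inertia = kmeans(z, k=2, seed=0)
print(labels, round(inertia, 3))
```

Without the standardization step, population density (in the thousands) would dominate the Euclidean distance and the rail variables would have almost no influence on cluster membership.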

In addition, the clustering algorithm was executed multiple times with different random seeds using the k-means implementation from the scikit-learn library in Python (v3.9.6), and the solution with the lowest within-cluster sum of squared distances was selected as the final clustering configuration. This repeated initialization strategy reduces the likelihood that the final clustering solution is influenced by a poor starting configuration. To assess the robustness of the clustering solution, a sensitivity analysis was conducted by repeating the k-means algorithm 50 times with different random initializations. Cluster stability was evaluated using the Adjusted Rand Index (ARI), which measures agreement between cluster assignments across runs.
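A minimal sketch of the selection and stability logic follows, using only the standard library. The ARI is computed from the pair-counting contingency formula; the three "runs" are invented label vectors with invented objective values, standing in for the repeated k-means fits, and the function name `adjusted_rand_index` is an assumption.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(a, b):
    """Adjusted Rand Index between two label vectors of equal length."""
    n = len(a)
    sum_ij = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(a).values())
    sum_b = sum(comb(c, 2) for c in Counter(b).values())
    expected = sum_a * sum_b / comb(n, 2)  # agreement expected by chance
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions are a single cluster
    return (sum_ij - expected) / (max_index - expected)


# Hypothetical results of three runs: (labels, within-cluster sum of squares).
runs = [
    ([0, 0, 0, 1, 1, 1], 4.2),
    ([1, 1, 1, 0, 0, 0], 4.2),  # same partition, cluster ids swapped
    ([0, 0, 1, 1, 1, 1], 6.9),  # a poorer local optimum
]

best_labels, best_wcss = min(runs, key=lambda run: run[1])
ari_vs_best = [adjusted_rand_index(best_labels, labels) for labels, _ in runs]
mean_ari = sum(ari_vs_best) / len(ari_vs_best)
print([round(v, 3) for v in ari_vs_best])
```

ARI is invariant to relabeling, so the second run scores 1.0 against the best run even though its cluster ids are swapped; values well below 1 flag runs that converged to a different partition.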

3.4.2. Derivation of Risk Indices

Hotspot Definition Using Getis–Ord $G_{i}^{*}$

Following the presence of significant spatial autocorrelation identified through Global Moran’s I, hotspot identification is conducted using the Getis–Ord $G_{i}^{*}$ statistic. Unlike global measures, the $G_{i}^{*}$ statistic provides a local indicator of spatial association, enabling the identification of statistically significant clusters of high or low values within the study area. The $G_{i}^{*}$ statistic evaluates whether a given ZIP code and its neighboring locations exhibit values that are significantly higher (hotspots) or lower (cold spots) than expected under spatial randomness. The statistic is computed as:

$$G_{i}^{*} = \frac{\sum_{j} w_{ij} x_{j} - \bar{x} \sum_{j} w_{ij}}{S \sqrt{\frac{n \sum_{j} w_{ij}^{2} - \left(\sum_{j} w_{ij}\right)^{2}}{n - 1}}}, \tag{2}$$

where $x_{j}$ represents the trespassing incidents at location $j$, $w_{ij}$ is a binary variable denoting whether locations $i$ and $j$ share a boundary, $\bar{x}$ is the global mean, $S$ is the standard deviation, and $n$ is the total number of ZIP codes. The resulting $G_{i}^{*}$ statistic is standardized as a z-score, allowing for statistical inference under the assumption of approximate normality. In this study, ZIP codes are classified as statistically significant hotspots if their corresponding z-scores exceeded a threshold of 1.96, corresponding to a 95% confidence level. Formally,

$$\mathrm{Hotspot}_{i} = \begin{cases} 1, & \text{if } z_{i} > 1.96 \\ 0, & \text{otherwise} \end{cases} \tag{3}$$

To provide a continuous measure of hotspot intensity, the $G_{i}^{*}$ z-scores are further transformed into probabilities using the standard normal cumulative distribution function:

$$P_{i} = \Phi(z_{i}), \tag{4}$$

where $\Phi(\cdot)$ denotes the cumulative distribution function of the standard normal distribution. This transformation yields a probability-like measure that reflects the likelihood of each ZIP code belonging to a spatial hotspot, with values approaching 1 indicating strong hotspot presence and values near 0.5 indicating spatial randomness. By combining a statistically grounded threshold with a continuous probability measure, this approach enables both binary hotspot classification and nuanced interpretation of spatial risk intensity across the study area.
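To make the three steps concrete (the Getis–Ord z-score, the 1.96 threshold, and the normal-CDF transform), the following sketch applies them to a toy chain of six spatial units. The incident counts and the weight matrix are hypothetical. Two assumptions are made that the text does not spell out: each unit is counted as its own neighbor (the usual convention for the starred statistic), and $S$ is the population standard deviation.

```python
import math
import numpy as np

def getis_ord_gi_star(x, W):
    """Gi* z-scores for values x and a binary weight matrix W (self included)."""
    x = np.asarray(x, dtype=float)
    W = np.asarray(W, dtype=float)
    n = x.size
    x_bar = x.mean()
    S = x.std()  # population standard deviation (assumption, ddof=0)
    w_sum = W.sum(axis=1)
    w_sq_sum = (W ** 2).sum(axis=1)
    numerator = W @ x - x_bar * w_sum
    denominator = S * np.sqrt((n * w_sq_sum - w_sum ** 2) / (n - 1))
    return numerator / denominator

def normal_cdf(z):
    """Standard normal CDF evaluated element-wise via the error function."""
    return np.array([0.5 * (1.0 + math.erf(v / math.sqrt(2.0))) for v in z])

# Toy example: six units on a line; neighbors are adjacent units plus self.
x = np.array([2, 1, 3, 2, 9, 10])  # hypothetical incident counts
n = x.size
W = np.eye(n)
for i in range(n - 1):
    W[i, i + 1] = W[i + 1, i] = 1.0

z = getis_ord_gi_star(x, W)
hotspot = z > 1.96     # binary classification at the 95% level
prob = normal_cdf(z)   # continuous hotspot intensity

for i in range(n):
    print(f"unit {i}: z={z[i]:+.3f}  hotspot={bool(hotspot[i])}  P={prob[i]:.3f}")
```

Units with z-scores near zero map to probabilities near 0.5, while strongly negative z-scores (cold spots) map to probabilities near 0.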

Relative Risk Calculation

Relative risk (RR) is computed to compare each cluster’s hotspot probability to the overall hotspot probability across all ZIP codes. Relative risk is defined as $RR_{c} = \frac{P(\text{hotspot} \mid c)}{P(\text{hotspot})}$, where values greater than one indicate clusters with above-average risk and values less than one indicate below-average risk. Relative risk provides a standardized measure of the strength of association between cluster membership and hotspot occurrence, but does not convey absolute risk magnitude on its own.
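As a small worked illustration of the relative-risk definition, the sketch below computes the ratio for four clusters from hypothetical cluster labels and binary hotspot flags (ten ZIP codes, invented for the example).

```python
import numpy as np

# Hypothetical cluster labels and hotspot flags for ten ZIP codes.
cluster = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, 3])
hotspot = np.array([1, 0, 0, 1, 1, 1, 0, 0, 0, 0])

p_overall = hotspot.mean()  # P(hotspot) across all ZIP codes

relative_risk = {}
for c in np.unique(cluster):
    p_given_c = hotspot[cluster == c].mean()  # P(hotspot | c)
    relative_risk[int(c)] = p_given_c / p_overall

for c, rr in relative_risk.items():
    print(f"cluster {c}: RR = {rr:.3f}")
```

A useful sanity check is that the relative risks, weighted by each cluster's share of ZIP codes, average to exactly one.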

Composite Risk Index Calculation

The Composite Risk Index (CRI) is used to capture both the likelihood and the intensity of railroad trespassing risk within each ZIP code. The index is defined as $CRI_{i} = P_{i} \times RR_{i}$, where $P_{i}$ represents the probability that location $i$ belongs to a hotspot cluster and $RR_{i}$ denotes the relative risk level compared with the spatial average.

The multiplicative formulation follows standard principles of quantitative risk assessment, in which risk is commonly expressed as the product of the probability of occurrence and the magnitude or severity of the associated outcome [50,51]. This formulation ensures that both components jointly influence the resulting risk score. Locations with high values in both probability and relative risk receive the highest CRI scores, while locations with a low value in either component receive lower scores. This interaction structure prevents locations with moderate values across both dimensions from being masked. Alternative formulations such as additive or weighted-average combinations are less dimensionally consistent, according to Haimes [52] and Pejovic [53]. For instance, under an additive formulation, a location with a very high relative risk but low hotspot probability could receive a score similar to that of a location with moderate values for both components. The multiplicative form preserves the joint influence of both factors and therefore provides a more meaningful representation of spatial risk intensity. Similar formulations are widely used in epidemiology, reliability engineering, and safety analysis.
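The contrast between additive and multiplicative scoring can be checked with a two-location example. The probability and relative-risk values below are hypothetical and chosen only so that the additive scores coincide.

```python
# Hypothetical (P, RR) pairs chosen only to illustrate the argument.
locations = {
    "A": (0.1, 1.9),  # low hotspot probability, very high relative risk
    "B": (0.5, 1.5),  # moderate values for both components
}

additive = {name: p + rr for name, (p, rr) in locations.items()}
multiplicative = {name: p * rr for name, (p, rr) in locations.items()}

for name in locations:
    print(f"{name}: additive={additive[name]:.2f}  product={multiplicative[name]:.2f}")
```

Both locations receive the same additive score, whereas the product ranks location B well above location A, which is the behavior the multiplicative CRI is meant to provide.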

For visualization and comparison purposes, CRI values are normalized to a 0–1 scale using min–max normalization, defined as $CRI_{c}^{norm} = \frac{CRI_{c} - \min(CRI)}{\max(CRI) - \min(CRI)}$. The normalized index facilitates direct comparison across clusters and supports the identification of priority contexts for rail-safety interventions. Lastly, the CRI is intended as an exploratory early-warning indicator rather than a predictive classifier. It provides a spatial prioritization tool for identifying locations that may require additional monitoring or targeted safety interventions. The index can therefore support proactive safety management by highlighting areas where elevated incident probability and risk intensity coincide.
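A minimal sketch of the CRI computation and its min–max normalization is given below. The probability and relative-risk inputs are hypothetical, and the guard for a zero range is an added assumption for the degenerate case in which all values are equal.

```python
import numpy as np

def composite_risk_index(p, rr):
    """CRI as the element-wise product of hotspot probability and relative risk."""
    return np.asarray(p, dtype=float) * np.asarray(rr, dtype=float)

def min_max(values):
    """Rescale values to [0, 1]; returns zeros if all values are identical."""
    v = np.asarray(values, dtype=float)
    span = v.max() - v.min()
    if span == 0:
        return np.zeros_like(v)
    return (v - v.min()) / span

# Hypothetical hotspot probabilities and relative risks for four clusters.
p = np.array([0.98, 0.60, 0.50, 0.20])
rr = np.array([1.8, 1.2, 0.6, 0.1])

cri = composite_risk_index(p, rr)
cri_norm = min_max(cri)

print("raw CRI:       ", np.round(cri, 3))
print("normalized CRI:", np.round(cri_norm, 3))
```

Min–max scaling preserves the ranking of the raw scores, so the normalized index changes only the scale used for mapping and comparison.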

4. Results of Spatial Autocorrelation

4.1. Global Spatial Autocorrelation

Global Moran’s I measures the overall spatial dependence in a numeric variable $y$ observed on $n$ spatial units with weights $W = \{w_{ij}\}$. Using row-standardized weights ($\sum_{j} w_{ij} = 1$), we compute

$$I = \frac{n}{S_{0}} \cdot \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} z_{i} z_{j}}{\sum_{i=1}^{n} z_{i}^{2}}, \quad z_{i} = y_{i} - \bar{y}, \quad S_{0} = \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij},$$

where $S_{0} = n$ under row standardization, so the leading factor equals one.

Positive I indicates that neighboring units tend to have similar values (positive autocorrelation), negative I indicates dissimilarity (a checkerboard pattern), and for a random pattern the expectation satisfies $E[I] \approx -\frac{1}{n - 1}$.

Weights: k-nearest neighbors (KNN) (k = 10) with row standardization is used to ensure connectivity and comparable neighbor influence across ZIPs. Sensitivity is also assessed at $k \in \{8, 12\}$.

Hypothesis 1.

H₀: No spatial autocorrelation (spatial randomness); y is exchangeable across locations.
H₁: Positive spatial autocorrelation exists (one-sided alternative).

Permutation inference: A conditional randomization test with 999 permutations was performed using the following procedure:

1. Hold the spatial topology (the weight matrix W) fixed.
2. Randomly permute y across locations to generate the reference distribution of I under $H_{0}$.
3. Compute a pseudo p-value as the proportion of permuted statistics at least as extreme as the observed I (per the chosen alternative). With 999 permutations, the minimum attainable p-value is $1 / (999 + 1) = 0.001$.
4. Report I, the permutation-based p-value, and a z-score from the permutation distribution.
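The permutation procedure listed above can be sketched in plain NumPy as follows. This is an illustrative implementation under stated assumptions, not the esda routine used in the study: the data are a synthetic 10 × 10 grid with a west-to-east gradient, k = 4 is chosen for the toy grid, and the helper names (`knn_weights`, `morans_i`, `permutation_test`) are hypothetical.

```python
import numpy as np

def knn_weights(coords, k):
    """Dense row-standardized k-nearest-neighbor weight matrix."""
    coords = np.asarray(coords, dtype=float)
    n = len(coords)
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a unit is not its own neighbor
    nearest = np.argsort(dist, axis=1)[:, :k]
    W = np.zeros((n, n))
    W[np.repeat(np.arange(n), k), nearest.ravel()] = 1.0 / k
    return W

def morans_i(y, W):
    """Global Moran's I; with row-standardized W the factor n / sum(W) is 1."""
    z = y - y.mean()
    return (len(y) / W.sum()) * (z @ W @ z) / (z @ z)

def permutation_test(y, W, n_perm=999, seed=0):
    """One-sided (positive autocorrelation) conditional randomization test."""
    rng = np.random.default_rng(seed)
    observed = morans_i(y, W)              # W is held fixed throughout
    sims = np.array([morans_i(rng.permutation(y), W) for _ in range(n_perm)])
    p_value = (np.sum(sims >= observed) + 1) / (n_perm + 1)
    z_score = (observed - sims.mean()) / sims.std()
    return observed, p_value, z_score

# Synthetic example: values increase from west to east, plus noise.
rng = np.random.default_rng(42)
coords = np.array([(i, j) for i in range(10) for j in range(10)], dtype=float)
y = coords[:, 0] + rng.normal(scale=0.5, size=len(coords))

W = knn_weights(coords, k=4)
I_obs, p_value, z_score = permutation_test(y, W, n_perm=999)
print(f"I = {I_obs:.3f}, pseudo p = {p_value:.3f}, z = {z_score:.2f}")
```

Adding one to both the count and the number of permutations treats the observed statistic as a member of the reference set, which is why the pseudo p-value cannot fall below 1/(999 + 1).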

The incident rate data were exported as a GeoPackage and analyzed with GeoPandas, libpysal, and esda in a Python environment using Anaconda® desktop version 2.5.0 [54]. Global Moran’s I for trespassing casualty rate (per 10,000 population) using KNN (k = 10), row-standardized weights, and 999 permutations indicated positive spatial autocorrelation (I = 0.101, z = 7.025, p = 0.001). Also, the mean distance to the 10th neighbor is 80,849.1 ft, equivalent to approximately 15.3 miles.

The Moran’s I values in Table 3 illustrate how the global autocorrelation signal varies with the neighborhood definition. Moran’s I declines from about 0.124 at $k = 6$ to about 0.084 at $k = 20$, as larger k smooths local contrasts by averaging over more neighbors. In contrast, the permutation z-score generally increases with k (rising from $\sim 6.5$ at $k = 6$ to $\sim 8$–$8.5$ by $k = 18$–20), indicating that, even as the effect size becomes more conservative, the statistic remains highly atypical under spatial randomness and thus strongly significant across all reasonable k. Together, the plots imply a robust, modest positive autocorrelation. The precise magnitude of I depends on k, but the inference ($p \approx 0.001$ throughout) is stable. Thus, $k = 10$ offers a balanced neighborhood size (local enough to preserve corridor structure, large enough to avoid isolated units), with corroborating sensitivity at $k = 8$ and $k = 12$.

4.2. Local Indicators of Spatial Association (LISA)

This section uses Local Moran’s I to move from the global question, does clustering exist, to the local question: where, exactly, are the clusters and spatial outliers? For each ZIP, we evaluate the trespassing casualty rate per 10,000 residents against the rates of its neighbors, using k-nearest neighbors (k = 10) with row-standardized weights to represent local spatial context. Significance is obtained by permutation testing (999 permutations) for each unit; because many tests are run simultaneously, we controlled for multiple testing using the Benjamini–Hochberg false discovery rate (FDR, $\alpha = 0.05$). The resulting LISA map classifies places as High–High (HH) and Low–Low (LL) clusters, interpreted as hotspots and cold spots, alongside High–Low (HL) and Low–High (LH) spatial outliers that may indicate emerging risk or protective pockets. These outputs guide prioritization, such as engineering, enforcement, or education, and provide a basis for robustness checks, such as alternative k, distance bands, or exposure metrics, such as per rail-miles or per crossing.
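The Benjamini–Hochberg step-up rule used for the multiple-testing correction can be sketched as below. The p-values are hypothetical, and the function name `benjamini_hochberg` is illustrative rather than taken from the study's code.

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Boolean mask of tests rejected at FDR level alpha (step-up rule)."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                        # indices of ascending p-values
    thresholds = alpha * np.arange(1, m + 1) / m
    passed = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        cutoff = np.max(np.nonzero(passed)[0])   # largest rank meeting its threshold
        reject[order[:cutoff + 1]] = True        # reject everything up to that rank
    return reject

# Hypothetical local pseudo p-values for ten spatial units (unsorted).
p_local = np.array([0.041, 0.001, 0.212, 0.008, 0.060,
                    0.039, 0.205, 0.042, 0.074, 0.216])

significant_raw = p_local <= 0.05
significant_fdr = benjamini_hochberg(p_local, alpha=0.05)

print("raw p <= 0.05:  ", int(significant_raw.sum()), "units")
print("FDR-significant:", int(significant_fdr.sum()), "units")
```

In this toy case five units pass the raw 0.05 threshold but only two survive the FDR correction, which mirrors how correction can shrink the set of reported local clusters.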

The resulting classification shows that the majority of ZIP codes were not significant (ns = 483), indicating no detectable local association after correction. A sizeable set formed low–low clusters (LL = 244), i.e., areas with lower rates surrounded by similarly low neighbors (cold spots). In contrast, high–high clusters (HH = 13) were relatively rare but represent the most defensible hotspots: locations with elevated rates embedded within high-rate neighborhoods. We also observed a small number of spatial outliers: low–high (LH = 21), suggesting comparatively low-rate ZIPs adjacent to high-rate neighbors (potential protective pockets), and high–low (HL = 2), indicating isolated high-rate ZIPs amid low-rate surroundings (possible emerging hotspots). Overall, these counts imply that while most of the state does not exhibit significant local association after FDR adjustment, a compact set of hotspots and a narrow band of outliers merit targeted investigation. Figure 1 and Figure 2 visualize the contrast between the global intensity hotspots (Gi*) and local similarity clusters (LISA).

As presented in Figure 3, the positive Local Moran’s I values indicate locations that resemble their neighbors (clustering), while negative values would indicate spatial outliers. Points higher on the plot have smaller permutation p-values (greater statistical evidence). The orange, annotated points identify ZIPs that form High–High clusters after FDR adjustment; these represent the most defensible hotspots where elevated trespassing casualty rates are embedded within similarly high-rate neighborhoods. Most ZIPs cluster near $I \approx 0$ with lower $-\log_{10}(p)$, indicating no detectable local association after multiple-testing correction, consistent with a sparse landscape punctuated by a compact set of statistically significant hotspots.

4.3. Sensitivity Analysis

To assess robustness to the neighborhood definition, we repeated the global test with k-nearest-neighbor weights over $k \in \{8, 10, 12\}$. Results were stable: Moran’s I ranged from 0.095 to 0.114 (k = 8: $I \approx 0.114$, $z \approx 7.23$, $p = 0.001$; k = 10: $I = 0.101$, $z = 7.025$, $p = 0.001$; k = 12: $I \approx 0.095$, $z \approx 7.32$, $p = 0.001$), indicating a consistent, modest positive spatial autocorrelation irrespective of reasonable changes in k. As expected, a larger k slightly smooths local variation and reduces I, but significance remains unchanged. In local analyses (LISA), cluster detection proved more sensitive to k and multiple-testing control: with FDR at $\alpha = 0.05$, k = 10 yielded a small set of hotspots (HH = 13), whereas k = 8 and k = 12 produced no FDR-significant clusters despite similar raw ($p \leq 0.05$) patterns. Accordingly, we report k = 10 (row-standardized) as the primary specification and include k = 8/12 and raw vs. FDR-adjusted results as sensitivity checks. Thus, conclusions about the presence of global clustering are robust, while the exact set of local hotspots is sensitive to neighborhood choice and correction method.

5. Results of Hotspot Analysis

Using the k-means clustering approach, the optimal number of clusters was determined using the elbow method, which evaluates the within-cluster sum of squares (inertia) as a function of the number of clusters. As shown in Figure 4, inertia decreases sharply as the number of clusters increases from k = 1 to k = 4, after which the rate of improvement diminishes substantially. This inflection point indicates that additional clusters beyond k = 4 yield only marginal reductions in within-cluster variance. Based on this pattern, a four-cluster solution is selected as the most parsimonious and interpretable representation of the data. A silhouette score of 0.50 was obtained, which indicates reasonable cluster cohesion and separation and supports the interpretability of the selected k-means clustering solution.
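The elbow and silhouette diagnostics can be reproduced in outline with scikit-learn. The sketch below uses synthetic two-dimensional data with four well-separated groups (an assumption made so the elbow is easy to see); it is not the study's ZIP-code data, and the silhouette values it produces are unrelated to the 0.50 reported above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: four well-separated groups (hypothetical, for illustration).
rng = np.random.default_rng(1)
true_centers = np.array([[0, 0], [8, 0], [0, 8], [8, 8]], dtype=float)
X = np.vstack([c + rng.normal(scale=1.0, size=(50, 2)) for c in true_centers])

inertias = {}
silhouettes = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_           # within-cluster sum of squares
    if k >= 2:                             # silhouette needs at least 2 clusters
        silhouettes[k] = silhouette_score(X, model.labels_)

# Reduction in inertia gained by moving from k-1 to k clusters.
drops = {k: inertias[k - 1] - inertias[k] for k in range(2, 9)}
best_k = max(silhouettes, key=silhouettes.get)

for k in range(2, 9):
    print(f"k={k}: inertia={inertias[k]:8.1f}  drop={drops[k]:8.1f}  "
          f"silhouette={silhouettes[k]:.3f}")
print("k with highest silhouette:", best_k)
```

The elbow is read off as the last k for which the drop in inertia is still large; the silhouette score provides an independent check on the same choice.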

Table 4 presents the centroid values for each cluster, which represent the average characteristics of ZIP codes belonging to each cluster group. These centroids provide insight into the structural environments associated with varying levels of rail trespassing risk. Cluster 0 is characterized by relatively high population density ($\approx 1217$ persons per square mile) and a predominance of residential land use (approximately 71%). Rail exposure in this cluster is moderate, with an average of approximately 3.9 rail miles and 6 grade crossings per ZIP code. The combination of dense residential development and moderate rail infrastructure suggests that this cluster represents urban or suburban residential areas where pedestrian interactions with rail infrastructure may occur frequently.

Cluster 1 exhibits the highest levels of rail infrastructure exposure among the four clusters, with an average of approximately 6.9 rail miles and 11 crossings per ZIP code. Land-use composition in this cluster is dominated by industrial activity (approximately 51%), with relatively low residential presence. Population density is moderate ($\approx 438$ persons per square mile). These characteristics indicate that Cluster 1 likely represents industrial rail corridors or freight-oriented environments where rail infrastructure is heavily concentrated. Cluster 2 displays the lowest levels of rail exposure, with approximately 1.0 rail mile and 3 crossings per ZIP code on average. Land use is overwhelmingly commercial (approximately 89%), while residential and industrial land uses are minimal. Population density is moderate ($\approx 418$ persons per square mile). This cluster appears to represent commercial districts or retail corridors with limited rail infrastructure presence.

Cluster 3 is characterized by predominantly agricultural land use (approximately 82%) and the lowest population density among the clusters ($\approx 349$ persons per square mile). Rail exposure is relatively low to moderate, with approximately 2.3 rail miles and 4 crossings per ZIP code. These characteristics suggest that Cluster 3 corresponds to rural or agricultural environments where rail lines traverse sparsely populated areas. Overall, the clustering results reveal four structurally distinct rail environments: (1) dense residential corridors with moderate rail exposure, (2) industrial rail corridors with high infrastructure concentration, (3) commercial areas with limited rail presence, and (4) rural agricultural regions with low population density. These environmental archetypes provide a useful framework for analyzing spatial variation in trespassing risk across the study area.

5.1. Cluster Sensitivity Analysis

To assess the robustness of the clustering solution, a sensitivity analysis was conducted by repeating the k-means clustering procedure across multiple runs with different random initializations. Because the k-means algorithm relies on random centroid initialization, different starting points can potentially produce different cluster assignments. Evaluating the stability of the clustering results ensures that the identified clusters represent inherent structure in the data rather than artifacts of the initialization process. The clustering procedure was therefore repeated 50 times using identical input variables but varying the random seed controlling centroid initialization. For each run, ZIP-code observations were assigned to one of four clusters ($k = 4$). The similarity between cluster assignments across runs was quantified using the Adjusted Rand Index (ARI), a widely used metric for measuring agreement between two clustering solutions while correcting for chance. ARI equals 1 for identical clusterings and is close to 0 for chance-level agreement; it can be slightly negative when agreement is worse than chance.

Pairwise ARI values were computed for all combinations of clustering runs, producing a distribution of stability scores. The results indicate extremely high clustering stability. The average ARI across all pairwise comparisons was 0.999, with a minimum ARI of 0.992 and a maximum ARI of 1.000. These values indicate that cluster assignments remained nearly identical across repeated runs, demonstrating that the identified clusters are highly robust to variations in centroid initialization.
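For readers who want to see how the pairwise stability scores are formed, the sketch below computes the Adjusted Rand Index from a contingency table and summarizes it over all pairs of runs. The three label vectors are hypothetical stand-ins for repeated k-means runs, and the hand-written `adjusted_rand_index` is an illustrative equivalent of a library routine, not the code used in the study. Note that the index does not depend on how clusters are numbered and can fall below zero when agreement is worse than chance.

```python
from itertools import combinations
from math import comb

import numpy as np

def adjusted_rand_index(a, b):
    """ARI between two labelings, computed from their contingency table."""
    a = np.asarray(a)
    b = np.asarray(b)
    _, a_idx = np.unique(a, return_inverse=True)
    _, b_idx = np.unique(b, return_inverse=True)
    table = np.zeros((a_idx.max() + 1, b_idx.max() + 1), dtype=int)
    np.add.at(table, (a_idx, b_idx), 1)

    sum_cells = sum(comb(int(v), 2) for v in table.ravel())
    sum_rows = sum(comb(int(v), 2) for v in table.sum(axis=1))
    sum_cols = sum(comb(int(v), 2) for v in table.sum(axis=0))
    expected = sum_rows * sum_cols / comb(a.size, 2)
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:   # degenerate case, e.g. a single cluster
        return 1.0
    return (sum_cells - expected) / (max_index - expected)

# Hypothetical label vectors from three runs over the same eight ZIP codes.
runs = [
    np.array([0, 0, 1, 1, 2, 2, 3, 3]),
    np.array([2, 2, 0, 0, 3, 3, 1, 1]),   # same partition, clusters renumbered
    np.array([0, 0, 1, 1, 2, 2, 3, 2]),   # one ZIP code assigned differently
]

pairwise = [adjusted_rand_index(x, y) for x, y in combinations(runs, 2)]
print("mean ARI:", round(float(np.mean(pairwise)), 3))
print("min ARI: ", round(float(np.min(pairwise)), 3))
print("max ARI: ", round(float(np.max(pairwise)), 3))
```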

5.2. Hotspot Probability and Risk Metrics

Table 5 provides details of the various indices, while Figure 5 illustrates the spatial distribution of the Composite Risk Index (CRI) across ZIP codes in North Carolina. Higher CRI values (shown in darker red) correspond to ZIP codes that belong to cluster typologies characterized by both a high probability of severe rail-related incidents and elevated relative risk compared to the statewide average. The map reveals a clear spatial concentration of high-risk ZIP codes, particularly in dense urban areas and industrial rail corridors, while rural regions generally exhibit low CRI values.

Further, the distribution of CRI across ZIP codes is shown in Figure 6. The distribution exhibits near symmetry with slight positive skewness (0.179), indicating a modest concentration of higher-risk ZIP codes. The negative kurtosis (−1.61) suggests a platykurtic distribution, reflecting a relatively even spread of risk across the study area with limited extreme outliers. Visual inspection of the CRI distribution reveals multiple distinct peaks, corresponding to the cluster-based segmentation of ZIP codes, which further supports the presence of structurally differentiated risk environments. Also, thresholds based on empirical percentiles were used to classify risk levels, with the 90th and 95th percentiles representing high-risk and priority intervention zones, respectively. This percentile-based approach enables data-driven identification of critical areas while preserving the relative distribution of risk across the study area. Overall, the CRI map provides a concise, policy-relevant visualization that integrates clustering and hotspot analysis results into a unified decision-support framework.
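The percentile-based classification can be sketched as follows. The normalized CRI values are randomly generated placeholders, and the tier labels ("priority", "high", "other") are hypothetical names for the zones at or above the 95th percentile, between the 90th and 95th percentiles, and below the 90th percentile.

```python
import numpy as np

# Placeholder normalized CRI values for 200 ZIP codes (hypothetical).
rng = np.random.default_rng(7)
cri_norm = rng.uniform(0.0, 1.0, size=200)

# Empirical thresholds at the 90th and 95th percentiles.
p90, p95 = np.percentile(cri_norm, [90, 95])

# Assign each ZIP code to a tier; the 95th-percentile test takes precedence.
tier = np.where(cri_norm >= p95, "priority",
                np.where(cri_norm >= p90, "high", "other"))

for name in ("priority", "high", "other"):
    print(f"{name:>8}: {int((tier == name).sum())} ZIP codes")
```

Because the thresholds are empirical percentiles, the share of ZIP codes in each tier is fixed by construction (about 5%, 5%, and 90% here); the thresholds rank locations relative to one another rather than against an absolute risk level.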

6. Discussion

Recent advances in spatial analysis, particularly the integration of spatial autocorrelation and cluster-based hotspot characterization, have enabled researchers such as Habib et al. [55] and Mekonnen et al. [56] to systematically examine the spatial structure of incident spots at granular levels such as ZIP codes, facilitating targeted interventions and resource allocation. Results from spatial autocorrelation analysis showed a highly significant positive spatial dependence of rail casualty occurrence, indicating that high-risk ZIP codes do not occur in isolation but are geographically clustered. This pattern suggests that the risk of rail trespassing is influenced by shared context and external environmental factors beyond ZIP codes, such as continuity of the rail corridor, urban development characteristics, and regional land use structure. The presence of spatial clustering also reinforces the need to move beyond independent-unit assumptions and supports the use of spatially informed methods for identifying and prioritizing high-risk areas. From a practical perspective, these findings imply that targeted interventions may be more effective when coordinated across neighboring ZIP codes along the same high-risk corridors or urban areas.

The cluster analysis also confirmed the presence of heterogeneity in the underlying causes for this spatial clustering of incidents, rather than a single dominant risk factor. Rural ZIP codes with low rail mileage and land-use dominated by agriculture consistently had very low hotspot probabilities, supporting the idea that low risk of trespassing results from both limited rail exposure and sparse population density. Conversely, two distinct high-risk cluster types emerged: industrial rail corridors and dense urban mixed-use ZIP codes. While both clusters showed increased hotspot probabilities, the mechanisms driving risk differed substantially between them. High-risk areas in industrial rail corridors were mostly linked to long stretches of railroad and numerous crossings, where regular encounters between railroad activities and the surrounding environment are common. These areas are probably impacted by freight operations, switching yards, and at-grade crossings that increase exposure independently of residential density.

In contrast, dense urban ZIPs, despite having a moderate amount of rail mileage, exhibited high hotspot probabilities, suggesting that population densities and mixed land-use patterns play key roles in increasing trespass risk. In these contexts, pedestrian activity, proximity of residential and commercial uses to rail infrastructure, and informal access points may contribute more strongly to risk than infrastructure volume alone. The integration of cluster types with hotspot probabilities and relative risks through the CRI offers a cohesive approach to turning spatial patterns into actionable information. While spatial autocorrelation indicates where clustering occurs, the CRI shows which context consistently generates risk. The significant difference in CRI between clusters—ranging from negligible risk in rural ZIP codes to almost certain hotspot occurrence in urban and industrial clusters—underscores the need for interventions tailored to local conditions.

The findings of this study are broadly consistent with prior research on transportation safety and spatial risk analysis. The observed association between rail infrastructure characteristics and increased trespassing risk aligns with previous studies that identify infrastructure exposure as a key determinant of accident occurrence [12,13]. The study’s focus on North Carolina is supported by a growing body of empirical research on rail trespassing in the United States and internationally. Searcy et al. [14] and the Institute for Transportation Research and Education (ITRE) have documented the prevalence and spatial distribution of trespassing events across North Carolina, using both FRA data and direct observation to identify high-risk corridors and communities. Their findings corroborate this study’s identification of urban–industrial corridors and mixed-use environments as primary hotspots. Silla & Luoma [10] and Grabušić et al. [18] additionally highlight the role of insufficient legal crossings, poor urban planning, and land-use barriers in driving trespassing behavior, reinforcing the importance of structural exposure metrics in risk assessment. Comparative analyses also reveal the influence of reporting practices, data quality, and local context on observed patterns, underscoring the need for standardized methodologies and cross-jurisdictional collaboration in rail safety research.

A key contribution of this study is the development of a framework that is both interpretable and scalable for practical transportation safety analysis. Practitioners and decision makers need to understand how model inputs influence spatial risk assessments. Unlike black-box machine learning approaches, the proposed framework relies on transparent statistical and spatial analytical methods, including spatial autocorrelation metrics such as Moran’s I, and geographically referenced variables such as rail density, population density, and land-use characteristics. Because the model parameters directly correspond to measurable environmental and demographic variables, the contribution of each factor to predicted trespassing risk can be readily interpreted by analysts and policymakers. In this regard, the present framework aligns with interpretable engineering modeling paradigms that emphasize physically meaningful variables and transparent probabilistic structures linking inputs to outputs, which support operational decision-making and system accountability [57]. Consequently, the methodology is extendable beyond North Carolina to other states or national-scale railroad safety analyses.

7. Conclusions

This study combined spatial autocorrelation analysis with cluster-based hotspot characterization to examine the spatial structure and contextual drivers of rail trespassing risk at the ZIP-code level. The results provide converging evidence that incidents are not randomly distributed in space, but instead exhibit statistically significant spatial clustering shaped by distinct combinations of rail infrastructure, land-use composition, and population density. Together, these findings advance understanding of how and why rail trespassing hotspots emerge across different spatial contexts. These results have direct implications for rail safety planning and resource allocation. The findings suggest that a single approach to preventing trespassing may not be effective. In industrial rail corridors, interventions might need to focus on engineering controls, crossing treatments, and coordination with freight operations; dense urban areas could benefit more from pedestrian-focused strategies such as access control, urban design modifications, and targeted public education. Identifying adjacent high-risk ZIP codes also points to opportunities for corridor-level strategies rather than site-specific interventions. By combining spatial autocorrelation with cluster-based hotspot detection, this study offers a multi-level analytical framework for assessing trespassing risks along rail tracks.

However, this study is still subject to several limitations that also point to important avenues for future research. The results presented in this study are based on observational data and statistical associations, and therefore should not be interpreted as causal relationships. While the clustering analysis and spatial risk modeling identify patterns and correlations among environmental, demographic, and infrastructure variables and railroad trespassing incidents, these relationships do not imply direct causation. Several potential confounding factors may influence the observed associations. For instance, regional population mobility, pedestrian accessibility to rail corridors, proximity to public service facilities, and informal crossing behavior may affect trespassing risk but are not explicitly captured in the current dataset. These unobserved variables may partially explain the spatial patterns identified in the analysis. Thus, future research should extend this work by incorporating causal inference frameworks, such as quasi-experimental designs or agent-based simulation models, to better understand the mechanisms driving railroad trespassing behavior and to evaluate the effectiveness of targeted interventions.

Author Contributions

H.M.: Conceptualization, Writing—original draft, Methodology, Data analysis, Visualization, Validation. R.L.: Supervision, Writing—review and editing, Validation. S.J.: Supervision, Writing—review and editing, Validation. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data used for this study is publicly available.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

FRA Safety Admin. Injury/Illness Summary—Casualty Source Data (Form 55A). 2023. Available online: https://data.transportation.gov/d/kuvg-3uwp (accessed on 4 April 2025).
Grabušić, S.; Barić, D.; Grabušić, S.; Barić, D. A Systematic Review of Railway Trespassing: Problems and Prevention Measures. Sustainability 2024, 16, 6743, Correction in Sustainability 2023, 15, 13878. https://doi.org/10.3390/su16166743. [Google Scholar] [CrossRef]
da Silva, M.; Barron, R.; George, B. Railroad Infrastructure Trespass Detection Performance Guidelines; Report No. DOT/FRA/ORD-11/19; U.S. Department of Transportation, Federal Railroad Administration: Washington, DC, USA, 2011. Available online: https://railroads.dot.gov/elibrary/railroad-infrastructure-trespass-detection-performance-guidelines (accessed on 30 May 2025).
Tobler, W.R. A Computer Movie Simulating Urban Growth in the Detroit Region. Econ. Geogr. 1970, 46, 234–240. [Google Scholar] [CrossRef]
Moran, P.A.P. Notes on Continuous Stochastic Phenomena. Biometrika 1950, 37, 17–23. [Google Scholar] [CrossRef]
Geary, R.C. The Contiguity Ratio and Statistical Mapping. Inc. Stat. 1954, 5, 115–146. [Google Scholar] [CrossRef]
Anselin, L. Local Indicators of Spatial Association—LISA. Geogr. Anal. 1995, 27, 93–115. [Google Scholar] [CrossRef]
Skládaná, P.; Havlíček, M.; Dostál, I.; Skládaný, P.; Tučka, P.; Perůtka, J. Land Use as a Motivation for Railway Trespassing: Experience from the Czech Republic. Land 2018, 7, 1. [Google Scholar] [CrossRef]
Sasidharan, M.; Burrow, M.P.N.; Ghataora, G.S.; Marathu, R. A risk-informed decision support tool for the strategic asset management of railway track infrastructure. Proc. Inst. Mech. Eng. Part F J. Rail Rapid Transit 2021, 236, 183–197. [Google Scholar] [CrossRef]
Silla, A.; Luoma, J. Opinions on railway trespassing of people living close to a railway line. Saf. Sci. 2012, 50, 62–67. [Google Scholar] [CrossRef]
Zhang, H.; Zahnow, R.; Liu, Y.; Corcoran, J. Crime at train stations: The role of passenger presence. Appl. Geogr. 2022, 140, 102666. [Google Scholar] [CrossRef]
Stanchak, K.; Foderaro, F.; DaSilva, M. High-Security Fencing for Rail Right-of-Way Applications: Current Use and Best Practices; Technical Report; U.S. Department of Transportation: Washington, DC, USA, 2015.
Savage, I. Trespassing on the Railroad. Res. Transp. Econ. 2007, 20, 199–224. [Google Scholar] [CrossRef]
Searcy, S.; Vaughan, C.; Coble, D.; Poslusny, J.; Cunningham, C. Rail Network Trespass Statewide Severity Assessment and Predictive Modeling; Technical Report; Institute for Transportation Research and Education: Raleigh, NC, USA, 2020. [Google Scholar]
Kang, Y.; Iranitalab, A.; Khattak, A. Modeling railroad trespassing crash frequency using a mixed-effects negative binomial model. Int. J. Rail Transp. 2018, 7, 208–218. [Google Scholar] [CrossRef]
U.S. Government Accountability Office. Railway-Highway Crossings: Improvements Needed to Federal Technical Assistance About Pedestrian Projects Related to Trespassing; Technical Report GAO-25-107115; Report to Congressional Committees; GAO: Washington, DC, USA, 2025.
Florida Department of Transportation. Strategies for Reducing Railroad Trespassing (SRRT): Florida East Coast Railway (FEC) Trespass Report; Final Report; Technical Report; FDOT Rail and Motor Carrier Operations Office: Tallahassee, FL, USA, 2021.
Grabušić, S.; Barić, D.; Ricci, S. Understanding Spatial–Temporal Patterns in Trespassing on Railway Property. Safety 2025, 11, 55.
Longley, P.A.; Goodchild, M.F.; Maguire, D.J.; Rhind, D.W. Geographic Information Science and Systems, 4th ed.; Wiley: Hoboken, NJ, USA, 2015.
Li, L.; Zhu, L.; Sui, D.Z. A GIS-based Bayesian approach for analyzing spatial–temporal patterns of intra-city motor vehicle crashes. J. Transp. Geogr. 2007, 15, 274–285.
Bilim, A. Identifying unsafe locations for pedestrians in Konya with spatio-temporal analyses. Cities 2025, 156, 105523.
Zhang, H.; Zhang, M.; Zhang, C.; Hou, L. Formulating a GIS-based geometric design quality assessment model for Mountain highways. Accid. Anal. Prev. 2021, 157, 106172.
Chandra, S.; Nguyen, H.; Nguyen, A. Evaluating critical gas pipeline crossings for freight truck routes. Case Stud. Transp. Policy 2019, 7, 680–688.
Tomaszewski, B. Geographic Information Systems (GIS) for Disaster Management; Routledge: Abingdon, UK, 2020.
Zhang, X.; Ding, Y.; Zhao, H.; Yi, L.; Guo, T.; Li, A.; Zou, Y. Mixed Skewness Probability Modeling and Extreme Value Predicting for Physical System Input–Output Based on Full Bayesian Generalized Maximum-Likelihood Estimation. IEEE Trans. Instrum. Meas. 2024, 73, 1–16.
Gomes, M.J.T.L.; Cunto, F.; da Silva, A.R. Geographically weighted negative binomial regression applied to zonal level safety performance models. Accid. Anal. Prev. 2017, 106, 254–261.
North Carolina Department of Transportation Rail Division. The Economic Contribution of Rail in North Carolina; North Carolina Department of Transportation: Raleigh, NC, USA, 2021. Available online: https://www.ncdot.gov/divisions/rail/Pages/economic-benefits-rail-report.aspx (accessed on 20 March 2026).
USDOT. Crossing Inventory Data (Form 71) - Current. U.S. Department of Transportation Data Portal. Available online: https://data.transportation.gov/Railroads/Crossing-Inventory-Data-Form-71-Current/m2f8-22s6/about_data (accessed on 20 December 2025).
Federal Railroad Administration. Pedestrian/Motorist. Available online: https://www.fra.dot.gov/Page/P0843#:~:text=AS%20A%20MOTORIST,At%20a%20Passive%20Crossing (accessed on 24 December 2025).
North Carolina Department of Transportation Rail Division. NCDOT North Carolina Railroads; NC OneMap: Raleigh, NC, USA, 2024. Available online: https://www.nconemap.gov/datasets/NCDOT::ncdot-north-carolina-railroads/explore?location=35.223450%2C-80.108350%2C7 (accessed on 20 March 2026).
QGIS Development Team. QGIS Geographic Information System, version 3.42; QGIS Association: Gossau, Switzerland, 2025. Available online: https://www.qgis.org (accessed on 30 March 2026).
NextGIS. North Carolina (US-NC) Base Layers: Geospatial Data Sets. NextGIS Data. 2026. Available online: https://data.nextgis.com/en/region/US-NC/base/ (accessed on 20 March 2026).
Washington, S.P.; Karlaftis, M.G.; Mannering, F.L. Statistical and Econometric Methods for Transportation Data Analysis; CRC Press: Boca Raton, FL, USA, 2011.
Ewing, R.; Cervero, R. Travel and the Built Environment: A Meta-Analysis. J. Am. Plan. Assoc. 2010, 76, 265–294.
Iyer, N.; Menezes, R.; Barbosa, H. Does Transport Inequality Perpetuate Housing Insecurity? arXiv 2023.
Zhu, H.; Zhao, H.; Ou, R.; Xiang, H.; Hu, L.; Jing, D.; Sharma, M.; Ye, M. Epidemiological Characteristics and Spatiotemporal Analysis of Mumps from 2004 to 2018 in Chongqing, China. Int. J. Environ. Res. Public Health 2019, 16, 3052.
Zhu, S. Analysis and Evaluation of the Inequality of the Spatial Distribution of Medical Resources in Jinan. arXiv 2021.
Chaney, R.A.; Rojas-Guyler, L. Spatial Analysis Methods for Health Promotion and Education. Health Promot. Pract. 2015, 17, 408.
Martin, D.C. Spatial Patterns in Residential Burglary. J. Contemp. Crim. Justice 2002, 18, 132.
Duncan, E.; White, N.; Mengersen, K. Spatial smoothing in Bayesian models: A comparison of weights matrix specifications and their impact on inference. Int. J. Health Geogr. 2017, 16, 47.
Goodchild, M.F. What Problem? Spatial Autocorrelation and Geographic Information Science. Geogr. Anal. 2009, 41, 411.
Malleson, N.; Steenbeek, W.; Andresen, M.A. Identifying the appropriate spatial resolution for the analysis of crime patterns. PLoS ONE 2019, 14, e0218324.
Bhattacharyya, A.; Haldar, S.K.; Banerjee, S. Determinants of Crime Against Women in India: A Spatial Panel Data Regression Analysis. Millenn. Asia 2021, 13, 411.
Gao, Y.; He, Q.; Liu, Y.; Zhang, L.; Wang, H.; Cai, E. Imbalance in Spatial Accessibility to Primary and Secondary Schools in China: Guidance for Education Sustainability. Sustainability 2016, 8, 1236.
Muriuki, J.; Hudson, D.; Fuad, S.; March, R.J.; Lacombe, D.J. Spillover effect of violent conflicts on food insecurity in sub-Saharan Africa. Food Policy 2023, 115, 102417.
Jain, A.K. Data clustering: 50 years beyond k-means. Pattern Recognit. Lett. 2010, 31, 651–666.
Celebi, M.E.; Kingravi, H.A.; Vela, P.A. A comparative study of efficient initialization methods for the k-means clustering algorithm. Expert Syst. Appl. 2013, 40, 200–210.
Arthur, D.; Vassilvitskii, S. k-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, New Orleans, LA, USA, 7–9 January 2007; SODA ’07. pp. 1027–1035.
Bahmani, B.; Moseley, B.; Vattani, A.; Kumar, R.; Vassilvitskii, S. Scalable k-means++. Proc. VLDB Endow. 2012, 5, 622–633.
Kaplan, S.; Garrick, B.J. On The Quantitative Definition of Risk. Risk Anal. 1981, 1, 11–27.
Cox, L.A., Jr. What’s Wrong with Risk Matrices? Risk Anal. 2008, 28, 497–512.
Haimes, Y.Y. Risk Modeling, Assessment, and Management; Wiley: Hoboken, NJ, USA, 2008.
Pejovic, T. Composite risk index: The new Safety Performance Indicator of risk exposure. MATEC Web Conf. 2020, 314, 01007.
Anaconda, Inc. Anaconda Software Distribution, version 2-2.6.6; Anaconda, Inc.: Austin, TX, USA, 2025. Available online: https://www.anaconda.com (accessed on 30 March 2026).
Habib, M.F.; Bridgelall, R.; Motuba, D.; Rahman, B. Exploring the Robustness of Alternative Cluster Detection and the Threshold Distance Method for Crash Hot Spot Analysis: A Study on Vulnerable Road Users. Safety 2023, 9, 57.
Mekonnen, A.A.; Sipos, T.; Krizsik, N. Identifying Hazardous Crash Locations Using Empirical Bayes and Spatial Autocorrelation. ISPRS Int. J. Geo Inf. 2023, 12, 85.
Zhao, H.; Zhang, X.; Ding, Y.; Guo, T.; Li, A.; Soh, C.K. Probabilistic mixture model driven interpretable modeling, clustering, and predicting for physical system data. Eng. Appl. Artif. Intell. 2025, 160, 112069.

Figure 1. Getis–Ord $G_{i}^{*}$ hotspots (feet band; 90–99% confidence).

Figure 2. LISA hotspots (k = 10; FDR $\leq 0.05$): High–High (HH).

Figure 3. LISA volcano plot of Local Moran’s I (x) versus $-\log_{10}(p)$ (y). Orange points mark High–High (HH) clusters that remain significant after FDR correction ($\alpha = 0.05$); labels indicate ZIP codes. The dashed horizontal line at $-\log_{10}(0.05) \approx 1.30$ corresponds to the nominal $p = 0.05$ threshold.

Figure 4. Elbow method for selecting the optimal number of clusters in the k-means analysis.

Figure 5. Spatial distribution of the Composite Risk Index (CRI) across ZIP codes in North Carolina.

Figure 6. Distribution of CRI with Extreme Value Thresholds.

Table 1. Sample ZIP-Level Dataset.

ZIP	Population/sq. mi	Rail Miles	Crossings	Incidents	Residential	Commercial	Industrial	Agric./Rural	Other Land Use
27101	2150.4	18.25	22	6	0.32	0.28	0.21	0.12	0.07
27514	1480.6	9.40	11	2	0.41	0.24	0.12	0.18	0.05
28301	890.3	6.75	8	1	0.27	0.19	0.16	0.33	0.05
27896	95.2	0.00	1	0	0.05	0.02	0.01	0.90	0.02
28403	610.7	2.10	4	0	0.18	0.11	0.07	0.58	0.06

Note. Land-use variables are proportional coverages that sum to 1.00.

Table 2. Descriptive statistics of the ZIP code dataset.

Variable	N Valid	Mean	SD	Min	Median	Max
POPULATION	766	14,054	16,632	0	6838	85,514
POP_SQMI	766	605	1562	0	126	22,950
SQMI	766	65.246	60.962	0.02	47.845	428.130
Incidents	766	0.663	1.903	0	0	17
rate_per_10k	763	0.374	1.253	0	0	14.6

Note. rate_per_10k excludes 3 ZIPs with zero/unknown population. Many ZIPs report zero incidents (median = 0), so rate distributions are right-skewed; permutation-based spatial statistics are therefore used throughout to avoid normality assumptions.

Table 3. Global Moran’s I across k-nearest-neighbor specifications (k = 6–20).

k	Moran’s I	z-Score	p-Value
6	0.124407	6.536211	0.001
7	0.115609	6.603078	0.001
8	0.113817	7.049733	0.001
9	0.106806	6.886261	0.001
10	0.101180	7.174451	0.001
11	0.091619	6.359131	0.001
12	0.095258	7.209565	0.001
13	0.087108	6.742585	0.001
14	0.086150	6.862160	0.001
15	0.093052	8.265287	0.001
16	0.087048	7.556754	0.001
17	0.086203	7.652760	0.001
18	0.088630	8.241578	0.001
19	0.086769	8.115465	0.001
20	0.084399	8.533200	0.001
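
The statistics in Table 3 can be illustrated with a minimal sketch of Global Moran's I under row-standardized k-nearest-neighbor weights and permutation inference. The grid coordinates, the toy values, the function names, and the choice of 999 permutations are all assumptions made for illustration; they are not the study's data or code. The one arithmetic link to the table is that the smallest pseudo p-value attainable with 999 permutations is 1/1000 = 0.001, which is consistent with, but does not confirm, that setting.

```python
import math
import random


def knn_weights(coords, k):
    """Row-standardized k-nearest-neighbour weights as {i: [(j, w), ...]}."""
    weights = {}
    for i, (xi, yi) in enumerate(coords):
        # Distance ties are broken by index so the result is deterministic.
        dists = sorted(
            (math.hypot(xi - xj, yi - yj), j)
            for j, (xj, yj) in enumerate(coords) if j != i
        )
        weights[i] = [(j, 1.0 / k) for _, j in dists[:k]]
    return weights


def morans_i(values, weights):
    """Global Moran's I: (n / S0) * sum_ij w_ij d_i d_j / sum_i d_i^2."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(w for nbrs in weights.values() for _, w in nbrs)
    num = sum(dev[i] * w * dev[j] for i, nbrs in weights.items() for j, w in nbrs)
    den = sum(d * d for d in dev)
    return (n / s0) * (num / den)


def permutation_test(values, weights, n_perm=999, seed=0):
    """Observed I, permutation z-score, and one-sided pseudo p-value."""
    rng = random.Random(seed)
    observed = morans_i(values, weights)
    shuffled = list(values)
    sims = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # reassign values to locations at random
        sims.append(morans_i(shuffled, weights))
    mean_sim = sum(sims) / n_perm
    sd_sim = math.sqrt(sum((s - mean_sim) ** 2 for s in sims) / (n_perm - 1))
    z = (observed - mean_sim) / sd_sim
    p = (sum(s >= observed for s in sims) + 1) / (n_perm + 1)
    return observed, z, p


# Hypothetical 6 x 6 grid of centroids with a smooth spatial gradient.
coords = [(x, y) for x in range(6) for y in range(6)]
values = [x + y for x, y in coords]
w = knn_weights(coords, k=4)
observed, z_score, p_value = permutation_test(values, w)
print(f"I = {observed:.3f}, z = {z_score:.2f}, p = {p_value:.3f}")
```

Repeating the call for each k from 6 to 20 would give a sensitivity table with the same layout as Table 3. Because the reference distribution comes from shuffling, no normality assumption is needed for the zero-inflated incident rates.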

Table 4. k-means Cluster Centroids Based on Rail Exposure, Population Density, and Land-Use Composition.

Cluster	Rail Miles	Crossings	Popu./sq.mi	Pct_Res.	Pct_Comm.	Pct_Ind.	Pct_Agric.
0	3.923	6.148	1217	0.706	0.113	0.055	0.097
1	6.908	10.627	438	0.076	0.132	0.510	0.115
2	1.025	3.206	418	0.031	0.888	0.030	0.048
3	2.266	4.267	349	0.047	0.073	0.039	0.816

Note. Popu./sq.mi = ZIP population per square mile. Pct = land-use share, expressed as a proportion; Res = Residential, Comm = Commercial, Ind = Industrial, Agric = Agricultural/Rural.
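
As a rough illustration of how centroids like those in Table 4 and an elbow curve like Figure 4 arise, the following sketch runs Lloyd's k-means with k-means++ seeding on z-scored features. The six feature rows (rail miles, crossings, population density) are invented toy values, and the function names, the seed, and the standardization step are assumptions; the study's actual feature set and software settings are not reproduced here.

```python
import math
import random


def zscore(rows):
    """Standardize each column to mean 0 and unit (population) variance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
           for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(r, means, sds)] for r in rows]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_init(points, k, rng):
    """k-means++ seeding: later seeds are drawn with probability ~ D(x)^2."""
    centroids = [list(rng.choice(points))]
    while len(centroids) < k:
        d2 = [min(sq_dist(p, c) for c in centroids) for p in points]
        centroids.append(list(rng.choices(points, weights=d2, k=1)[0]))
    return centroids


def kmeans(points, k, seed=0, max_iter=100):
    """Lloyd's algorithm; returns labels, centroids, within-cluster SSE."""
    rng = random.Random(seed)
    centroids = kmeans_pp_init(points, k, rng)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    inertia = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, inertia


# Hypothetical rows: [rail miles, crossings, population per sq. mi].
raw = [
    [18.0, 22, 2100], [16.5, 20, 1900], [17.2, 21, 2000],  # high exposure
    [0.5, 1, 100], [1.0, 2, 150], [0.8, 1, 120],           # low exposure
]
features = zscore(raw)
labels, centroids, _ = kmeans(features, k=2)
# Elbow curve: within-cluster SSE for k = 1..3.
inertia_by_k = {k: kmeans(features, k)[2] for k in (1, 2, 3)}
print(labels, inertia_by_k)
```

Standardizing first keeps population density, which is orders of magnitude larger than the land-use shares, from dominating the Euclidean distances. The elbow is read as the k beyond which the within-cluster sum of squares stops falling sharply.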

Table 5. Hotspot Probability, Relative Risk, and Composite Risk Index (CRI) by ZIP Code Cluster.

Cluster	ZIPs	Hotspots	Hotspot Prob.	${CI}_{L}$	${CI}_{U}$	Rel. Risk	CRI	${CRI}_{norm}$
0	169	2	0.5539	0.0033	0.0421	1.1016	0.61	1.00
1	228	3	0.5227	0.0045	0.0380	1.0394	0.54	0.63
2	102	3	0.4698	0.0101	0.0829	0.9343	0.44	0.05
3	243	2	0.4650	0.0023	0.0295	0.9247	0.43	0.00
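
The derived columns of Table 5 can be cross-checked with a short sketch. The formulas below are inferences from the tabulated numbers, not definitions taken from the methodology: the bounds ${CI}_{L}$ and ${CI}_{U}$ are numerically consistent with a 95% Wilson score interval on Hotspots/ZIPs (so they bracket the raw hotspot share rather than the Hotspot Prob. column), Rel. Risk is consistent with each cluster's hotspot probability divided by the unweighted mean across the four clusters, and ${CRI}_{norm}$ is consistent with min-max scaling of CRI. The function names are hypothetical, and the definitions in Section 3.4.2 take precedence.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (95% for z = 1.96)."""
    p_hat = successes / n
    denom = 1 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def min_max(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Values transcribed from Table 5 (clusters 0 to 3).
zips = [169, 228, 102, 243]
hotspots = [2, 3, 3, 2]
hotspot_prob = [0.5539, 0.5227, 0.4698, 0.4650]
cri = [0.61, 0.54, 0.44, 0.43]

intervals = [wilson_interval(h, n) for h, n in zip(hotspots, zips)]
mean_prob = sum(hotspot_prob) / len(hotspot_prob)
rel_risk = [p / mean_prob for p in hotspot_prob]
cri_norm = min_max(cri)

for c, ((lo, hi), rr, cn) in enumerate(zip(intervals, rel_risk, cri_norm)):
    print(f"cluster {c}: CI = ({lo:.4f}, {hi:.4f}), RR = {rr:.4f}, CRI_norm = {cn:.2f}")
```

Under these assumptions the interval bounds and relative risks agree with the table to about four decimals, while ${CRI}_{norm}$ agrees only to within the rounding of the two-decimal CRI inputs.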

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Mohammed, H.; Liu, R.; Jiang, S. A Spatial and Cluster-Based Framework for Identifying Railroad Trespassing Hotspots. Systems 2026, 14, 396. https://doi.org/10.3390/systems14040396

