1. Introduction
Preventing crashes has long been a key challenge in the road traffic sector. As the number of vehicles continues to increase, the human and social toll of crashes remains significant. According to the World Health Organization (WHO), approximately 1.35 million people die and tens of millions are injured in crashes each year [1]. These losses go beyond human casualties and impose extensive social and economic burdens, including increased medical and insurance costs and decreased productivity [2]. Consequently, identifying high-risk road sections to allocate limited traffic safety resources effectively has become a key topic in traffic safety research [3,4,5,6]. Crash hotspots (also known as black spots) are generally defined as road sections where crashes occur repeatedly within a defined spatial range. Accurately identifying these sections is essential for targeting safety improvement areas and maximizing policy effectiveness [7,8,9].
Traditionally, various methods have been used to identify crash hotspots based on the tendency of crashes to concentrate around specific points [10,11]. Representative statistical approaches include the Safety Performance Function (SPF) and the Empirical Bayes (EB) family of methods, which estimate the expected number of crashes by accounting for exposure factors such as road geometry and traffic volume [12]. While these approaches have the advantage of reflecting the probabilistic nature of crash occurrence, their applicability is limited by the need for additional data, such as traffic volume and detailed road attributes [13]. Various studies have proposed combining SPF and EB estimation to identify crash hotspots [14,15]. By combining observed and model-based expected crash counts, the EB technique can mitigate estimation variability compared with crash-frequency-based approaches [16]. However, recent studies have pointed out that EB-based approaches may have limitations depending on the application environment [17]. In particular, when spatial analysis units, such as road segments, are subdivided, direct utilization of existing traffic volume data becomes difficult, which makes the analysis results sensitive to model settings or spatial aggregation methods [18]. These limitations pose a practical constraint on quantitatively assessing crash risk in environments where traffic volume information is insufficient [19].
Along with statistical approaches, spatially based crash hotspot identification methods that utilize the spatial distribution of crash locations have also been widely used. Kernel Density Estimation (KDE) [20] has been used to identify areas with high crash concentrations by representing individual crash points as a continuous density distribution [21]. However, point-based KDE has the limitation of not explicitly reflecting the actual road network structure at crash sites [22]. To address this limitation, Network Kernel Density Estimation (NKDE), which extends KDE to the road network [23], was proposed. NKDE projects crash points onto the road network and applies a kernel function based on network distance, enabling the estimation of crash density at the road link or segment level. Recent studies have increasingly moved beyond using KDE merely as a visualization tool for crash concentration and have sought to improve interpretability by integrating mobility and exposure information. In a Global Positioning System (GPS)-based framework, KDE is first used to identify spatial concentration patterns of crashes involving vulnerable road users, and the resulting density patterns are then interpreted in the context of traffic activity by combining floating-car-data (FCD)-based driver behavior analysis with Space Syntax-based pedestrian flow analysis [24]. Other studies use KDE heatmaps to pre-screen potentially problematic intersections and subsequently incorporate FCD-derived traffic-volume proxies (e.g., average daily traffic estimates) to compute exposure-adjusted crash rates at the intersection level [25].
Previous studies have reported that NKDE can more realistically represent crash-concentrated sections in road network environments than point-based KDE. However, NKDE is also a density estimation technique based on crash location or frequency, and thus does not explicitly consider differences in traffic exposure across road segments. Traditional KDE and NKDE-based approaches model crashes as point-based events and infer risk from the spatial distribution of observed crash points. Consequently, risk is typically expressed as a spatially diffuse high-risk area rather than a specific road segment, limiting the direct interpretation of risk factors at the individual road segment level [22]. Moreover, NKDE performance and the spatial extent of detected risk segments can be highly sensitive to the bandwidth parameter, raising concerns about parameter-driven conclusions when the bandwidth is not selected by a data-driven protocol.
To address the limitations of existing approaches, this study proposes a path-based analysis framework that characterizes crash-related road segments from a network perspective, even under limited traffic volume data. This approach provides a foundation for comparing and analyzing relative segment-level traversal patterns and associated risk. Unlike existing approaches that rely on point-based density or model-based estimation, this study explicitly models crash risk as a function of path-level movement and segment-level exposure. In particular, the proposed framework complements NKDE by incorporating movement context through reconstructed origin–crash paths and by normalizing segment risk with an exposure proxy derived from simulated origin–destination paths when traffic volumes are unavailable.
The contributions of this paper are as follows. First, it goes beyond crash-centered analysis and presents a road segment-level risk analysis procedure based on the crash vehicle’s travel path. Second, it proposes a method for constructing relative risk at the road segment level using a path-based exposure proxy in environments where traffic volume data are missing. Third, it visualizes the results of the proposed method on a map using a case study in Daejeon Metropolitan City. It also illustrates differences in segment risk identification characteristics by comparing them with the existing NKDE approach.
This paper is organized as follows. Section 3 describes the proposed method. Section 4 presents the experimental setup and results, along with a comparative analysis with NKDE. Finally, Section 5 summarizes the discussion and conclusion.
3. Methodology
Crashes are typically recorded as single-point events. However, this point-based representation alone fails to adequately capture the vehicle movement process leading up to the crash. This study interprets crashes not as static point events, but as the result of interactions between vehicle movement along a road segment and the surrounding traffic environment. Specifically, we define crashes as path-based events and explicitly integrate the pre-crash movement process into the analysis by reconstructing each crash’s vehicle path before the crash.
Figure 1 illustrates the overall analysis pipeline for assessing road segment risk by integrating crash-based segment frequency derived from observed crash data with a Monte Carlo-based exposure proxy. The proposed framework consists of three main steps: estimating crash frequency and exposure for each road segment, computing relative risk, and identifying high-risk segments.
First, to capture crash-based traversal characteristics of individual road segments, Origin–Crash (OC) coordinate pairs are derived from the origin and crash locations recorded in crash data. Using these pairs, crash-related travel paths are reconstructed. Aggregating these paths at the road-segment level yields a road network that reflects historical crash occurrences.
Using the same procedure, Monte Carlo-generated Origin–Destination (OD) coordinate pairs are employed to reconstruct travel paths under normal traffic conditions, from which segment-level traversal patterns are derived.
Next, considering the positive correlation between traffic volume and crash frequency, the risk of each road segment is calculated by comparing the crash-based network with the Monte Carlo-based network. Finally, based on the calculated risk score distribution, predefined ratio criteria are applied to identify high-risk road segments.
3.1. Path Generation on Road Networks
To derive a path, the road network is defined as a graph $G = (V, E)$. Here, $V$ is a set of nodes, $E$ is a set of road segments (edges), and the function $\phi: V \to \mathbb{R}^2$ maps each node to the planar coordinate system of the road network. In real-world situations, crash coordinates recorded during police investigations often contain GPS errors or are reported as locations of adjacent buildings rather than on the road [46]. Therefore, all location coordinates are aligned with the road network via snapping before path search. An arbitrary query point $q$ is mapped to the nearest node $v$ on the road network as follows:

$$v = \arg\min_{u \in V} d\big(q, \phi(u)\big)$$

Here, $d(\cdot, \cdot)$ is a distance function. In this study, the Euclidean distance $d(a, b) = \lVert a - b \rVert_2$ is used. Similarly, the coordinates $q_s$ and $q_t$, corresponding to the two endpoints of the path, are mapped to nodes $v_s$ and $v_t$ on the road network, respectively.
Consequently, given the starting node $v_s$ and the destination node $v_t$, the travel path $P$ is defined as follows:

$$P = \mathcal{F}(v_s, v_t) \tag{3}$$

In Equation (3), $\mathcal{F}$ is a path finding function on the road network, which can be calculated based on the shortest distance, minimum time, or other cost functions. A path $P$ consists of a series of road segments $(e_1, \dots, e_n)$ connecting $v_s$ and $v_t$. Depending on the definition of the road network, it can be set as a directed path or an undirected path.
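The snapping and path search steps can be illustrated with a minimal sketch. It uses only the Python standard library and assumes a toy network given as a coordinate dictionary and a weighted adjacency dictionary; the function names `snap_to_node` and `shortest_path_edges` are hypothetical and are not taken from the study's implementation, which relies on OSMnx and NetworkX.

```python
import heapq
import math


def snap_to_node(query, coords):
    """Return the node whose planar coordinate is nearest (Euclidean) to `query`."""
    return min(coords, key=lambda v: math.dist(query, coords[v]))


def shortest_path_edges(adj, source, target):
    """Dijkstra on {u: {v: cost}}; returns the path as a list of (u, v) edges, or None."""
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    done = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        if u == target:
            break
        for v, w in adj.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if target not in dist:
        return None  # no route exists; the caller would drop this pair
    edges = []
    node = target
    while node != source:
        edges.append((prev[node], node))
        node = prev[node]
    edges.reverse()
    return edges
```

In this sketch, a path is obtained by snapping both endpoints first and then calling `shortest_path_edges(adj, snap_to_node(q_s, coords), snap_to_node(q_t, coords))`. Pairs for which the function returns `None` correspond to routing failures.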
3.2. Crash-Based Segment Representation
Accordingly, we indirectly estimate the travel path of a crash vehicle by utilizing OC information recorded in crash data. In this paper, “Origin” is defined as the network-mapped trip start location extracted (or inferred) from crash reports for OC path reconstruction. While the OC path partially reflects the vehicle’s actual driving process before the crash, the Crash-Destination (CD) path represents the hypothetical movement after the crash. The post-crash path is not causally linked to the crash mechanism, and including it in the aggregation could bias exposure estimates. Therefore, this study focuses on the OC path to capture the driving process leading up to the crash.
After mapping the OC coordinate pair to the node pair on the road network, a crash-based path is generated, and the passage frequency of each road segment is aggregated. Specifically, whether link $e$ is included in each crash path is determined and used as an input indicator for subsequent risk calculations.
The cumulative crash path frequency $C_e$ is defined by accumulating whether each path $P$ in the crash path set $\mathcal{P}_{\mathrm{crash}}$ includes link $e$. Specifically, the cumulative crash frequency is calculated by adding 1 if link $e$ is included in the crash path $P$, which is expressed as follows:

$$C_e = \sum_{P \in \mathcal{P}_{\mathrm{crash}}} \mathbb{1}\left[e \in P\right]$$

Here, the indicator function $\mathbb{1}[\cdot]$ returns 1 if the condition is true and 0 if it is false. A simple accumulation method without weighting was applied so that every crash is treated as equally severe. The resulting $C_e$ represents the intensity of crash occurrences on each road segment, reflecting the concentration of crash-related movements rather than exposure.
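As a minimal sketch of this accumulation, assuming each path is represented as a list of hashable link identifiers, the indicator-based count can be computed as follows. The function name `segment_frequency` is hypothetical.

```python
from collections import Counter


def segment_frequency(paths):
    """Count, for every link, the number of paths that include it.

    Each path contributes at most 1 per link (indicator-style accumulation),
    so a link repeated within one path is still counted once for that path.
    """
    counts = Counter()
    for path in paths:
        counts.update(set(path))
    return counts
```

Because the rule depends only on path membership, the same function could be reused for simulated paths, which keeps crash-based and simulation-based counts on the same segment-level footing.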
The road network representation captures the geometric characteristics of real roads beyond a simple intersection-based structure. Unlike typical road network models that segment links at intersections, this approach defines segments based on changes in road curvature and branching [47]. For example, for sections that are difficult to represent with a single straight link, such as loops or curved roads, nodes are added at points where the road geometry significantly changes, thereby dividing the road into multiple segments. This allows complex road geometries, such as curves, turns, and elevated loops, to be more precisely represented on the network. Furthermore, since road branching and junctions change driving path options, segmenting them based on these points clearly reflects the network’s connectivity structure. This curvature- and junction-based segmentation method helps minimize errors that may arise when estimating path overlap frequency and segment risk [48]. This study does not consider additional weights, such as length or travel time, for road network links.
3.3. Monte Carlo-Based Exposure Proxy Representation
In this study, exposure is defined as the frequency with which a road segment is traversed. Since actual traffic volume data are unavailable, we approximate exposure using a Monte Carlo-based exposure proxy. This proxy represents segment-level exposure derived from simulated paths and serves as a substitute for true traffic volume in risk estimation.
The proposed exposure measure does not represent absolute traffic volume, but rather captures relative differences in utilization across road segments. As it is derived from simulated paths on the road network, it captures structural patterns of segment-level exposure shaped by network connectivity and the spatial distribution of demand in urban areas. Therefore, it serves as a suitable denominator for relative risk normalization.
To construct this measure, a Monte Carlo-based approach is employed under normal traffic conditions. Specifically, multiple OD pairs are generated to simulate the spatial distribution of urban traffic, which tends to concentrate in specific areas. The number of samples is determined based on the convergence of path overlap distributions and the stability of segment-level exposure values. The sample size is considered sufficient when the exposure distribution no longer changes significantly with additional samples.
Based on the generated normal traffic paths, the exposure for each segment is calculated by counting how many times it appears in those paths. This exposure serves as a relative indicator of how frequently each segment is used under normal traffic flow. It is then combined with crash-based indicators to serve as an input variable in the calculation of road risk.
The simulated path cumulative frequency $X_e$ is defined as the number of times link $e$ is traversed across paths $P$ within the set of normal traffic paths $\mathcal{P}_{\mathrm{sim}}$. It serves as an exposure indicator and acts as a surrogate for actual traffic volume, and is expressed as follows:

$$X_e = \sum_{P \in \mathcal{P}_{\mathrm{sim}}} \mathbb{1}\left[e \in P\right]$$
The simulated cumulative trip frequency reflects typical road-use patterns unrelated to crash occurrences. It is defined at the same segment level as the cumulative crash frequency, allowing a direct comparison between the two metrics.
3.4. Relative Risk Definition and High-Risk Road Segment Identification
To assess the crash risk of a road segment, the number of crashes on that segment must be adjusted for the segment’s traffic volume. Typically, the number of crashes increases naturally on heavily trafficked road segments, making it difficult to fairly compare risk levels across road segments based solely on crash frequency. The purpose of this study is not simply to identify segments with a high number of crashes, but to assess relative risk by accounting for segment-level exposure.
To achieve this, cumulative crash frequency is normalized by segment-level exposure under normal traffic conditions, and relative risk is defined as the rate of crash occurrence per unit of exposure. The relative risk, $R_e$, of link $e$ is defined as follows, using the crash-based path cumulative frequency, $C_e$, and the Monte Carlo-based exposure, $X_e$:

$$R_e = \frac{C_e + \alpha}{X_e + \beta}$$

Here, $\alpha$ and $\beta$ serve distinct smoothing roles. The term $\beta$ acts as a pseudo-exposure regularizer in the denominator, preventing undefined ratios when $X_e = 0$ and suppressing low-exposure inflation by reducing the variability of $R_e$ when $X_e$ is extremely small. In contrast, $\alpha$ is an additive pseudo-count on crash frequency, encoding a weak prior for zero-crash segments ($C_e = 0$) and avoiding overly extreme rankings driven by single events on rarely traversed segments. Because the same $(\alpha, \beta)$ are applied uniformly across all segments, the risk score remains comparable across the network. At the same time, extreme outliers induced by near-zero exposure or near-zero crash counts are effectively mitigated.
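The effect of the two smoothing terms can be seen in a small sketch. It assumes the smoothed ratio form (crash count plus a pseudo-count, divided by exposure plus a pseudo-exposure); the parameter names `alpha` and `beta` and their default values of 1.0 are placeholders chosen for illustration, not the values used in the study.

```python
def relative_risk(crash_counts, exposure, edges, alpha=1.0, beta=1.0):
    """Smoothed crash-per-exposure ratio for every edge.

    alpha: additive pseudo-count on the crash frequency (numerator).
    beta:  pseudo-exposure regularizer (denominator); keeps the ratio
           defined when an edge has zero simulated traversals.
    """
    return {
        e: (crash_counts.get(e, 0) + alpha) / (exposure.get(e, 0) + beta)
        for e in edges
    }
```

With these placeholder values, an edge with 2 crash paths and 0 simulated traversals scores 3.0, whereas an edge with 5 crash paths and 99 traversals scores 0.06. Increasing `beta` shrinks the first score much more than the second, which is the low-exposure damping described above.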
The proposed risk metric is conceptually similar to the traditional crash rate, but differs in how exposure is defined. Unlike existing crash rates that rely on traffic volume metrics with limited spatial resolution, such as AADT or VKT, the proposed exposure proxy is derived from path-level aggregation on the same road network representation as the crash paths. This allows for direct segment-level comparison and interpretation, even in environments where segment-level traffic volume data are lacking.
Figure 2 illustrates the proposed road risk assessment concept. The blue line represents the OC path of the crash vehicle reconstructed from the crash report. In contrast, the red line represents the cumulative result of paths generated by simulation under normal traffic conditions. The thickness of each path is proportional to the number of times a road segment is included, allowing intuitive comparison of relative exposure levels across segments. High-risk road segments are defined as those with relatively low exposure under normal traffic conditions but high concentrations of crash-related paths.
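To identify high-risk road segments, we use a quantile-based threshold computed from the empirical distribution of risk scores over the entire network. A segment $e$ is labeled as high-risk if

$$h_e = \mathbb{1}\left[R_e \ge \tau_p\right] = 1$$

where $h_e$ is the high-risk indicator and $\tau_p$ is the empirical $(1-p)$-quantile of $\{R_e\}_{e \in E}$. With scores sorted in descending order $R_{(1)} \ge \dots \ge R_{(|E|)}$, we set $\tau_p = R_{(\lceil p|E| \rceil)}$. Here, $p$ denotes the selection ratio.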
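A minimal sketch of this selection rule is given below. It assumes that the threshold is the score of the element at position ceil(p * n) in descending order and that ties at the threshold are all retained; both conventions are assumptions made for illustration, and `select_high_risk` is a hypothetical name.

```python
import math


def select_high_risk(risk, p):
    """Return (high_risk_set, threshold) for selection ratio p in (0, 1]."""
    ranked = sorted(risk, key=lambda e: risk[e], reverse=True)
    k = max(1, math.ceil(p * len(ranked)))  # number of segments to keep
    threshold = risk[ranked[k - 1]]
    # Ties at the threshold are kept, so the set can exceed k elements.
    return {e for e in risk if risk[e] >= threshold}, threshold
```

Under this convention, a network of 21,605 segments with p = 0.05 yields ceil(1080.25) = 1081 selected segments when no ties occur at the threshold.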
This quantile-based thresholding method has the advantage of automatically adjusting the number of high-risk road segments by reflecting the relative distribution characteristics of risk scores, rather than relying on the absolute magnitude of the risk score. This allows for consistent identification of risk sections even when network size or simulation conditions vary.
The proposed framework directly assigns risk levels to discrete road segments actually traversed by crash-related travel trajectories, using them as the basic units of analysis. This allows analysis results to be presented in clearly defined road segment units rather than continuous spatial density or diffuse areas, and can be used immediately to derive priorities for policy interventions such as traffic safety facility placement, road geometry improvement, and speed management.
3.5. Evaluation Metrics
To evaluate the performance of the identified high-risk segments, we use two metrics, namely point-based and path-based hit rates.
The point-based hit rate ($\mathrm{HR}_{\mathrm{point}}$) is commonly used to assess hotspot detection performance. It measures the proportion of crash locations that fall within the identified high-risk segments and is defined as follows:

$$\mathrm{HR}_{\mathrm{point}} = \frac{1}{m} \sum_{k=1}^{m} I_k$$

where $m$ is the total number of crashes in the evaluation dataset, and $I_k$ takes a value of 1 if the $k$-th crash location is included in the high-risk segments, and 0 otherwise. A higher $\mathrm{HR}_{\mathrm{point}}$ indicates that the identified risk segments cover a larger proportion of crash locations.
However, the point-based hit rate has a limitation in that it only considers the inclusion of crash locations. Importantly, selecting larger risk regions tends to yield higher values, as more crash points are likely to be included. As a result, while it reflects the coverage of risk segments, it does not capture how closely those segments align with actual crash-related travel paths.
To address this limitation, we propose the path-based hit rate ($\mathrm{HR}_{\mathrm{path}}$), which incorporates path-level movement information:

$$\mathrm{HR}_{\mathrm{path}} = \frac{1}{m} \sum_{k=1}^{m} \frac{|P_k \cap H|}{|P_k|}$$

where $P_k$ denotes the travel path associated with the $k$-th crash and $H$ represents the set of top-$p$ high-risk segments. Here, $|P_k \cap H|$ denotes the number of segments on the $k$-th path that are also included in $H$, and $|P_k|$ is the total number of segments in the $k$-th path. A higher value indicates that a larger portion of crash-involved paths overlaps with the identified high-risk segments.
4. Experimental Results
4.1. Study Area and Data
This study focuses on Daejeon Metropolitan City, Republic of Korea. The city covers approximately and has a complex urban structure comprising commercial and business districts, residential areas, research and development complexes, and suburban regions. Administratively, it comprises five districts, with approximately 57% of the population concentrated in Seo-gu and Yuseong-gu, resulting in spatially uneven travel demand.
Daejeon serves as a major transportation hub in Korea, but traffic demand has been rapidly increasing relative to the existing road infrastructure. Previous studies have reported that the city experiences low average travel speeds and severe congestion on major arterial roads [49]. In addition, the number of registered vehicles exceeds 640,000 and continues to increase by more than 13,000 vehicles annually, suggesting that the current road network is insufficient to accommodate growing traffic demand. Notably, congestion during peak commuting hours has become a significant concern.
Furthermore, compared with other metropolitan areas, Daejeon has a relatively underdeveloped ring road system, which limits the effective distribution of traffic. This structural limitation leads to traffic concentration in specific corridors and is closely associated with increased crash risk. A regional study further shows that approximately 52.9% of congested road segments operating at Level of Service (LOS) E or worse are concentrated in major cities in the Chungcheong region, including Daejeon, Cheonan, and Cheongju [50]. This indicates that Daejeon represents a typical case of high traffic congestion and elevated traffic demand within the region.
Accordingly, Daejeon is selected as a representative study area characterized by complex urban structure, rapidly increasing traffic demand, and structural congestion issues.
A total of 1352 crash records collected between 2018 and 2023 from the Korean National Police Agency (KNPA) were used. Each record includes the crash location (latitude and longitude) and narrative descriptions of the crash environment.
4.2. Experimental Setup
To construct both crash-based segment frequency and simulation-based exposure, we generated a road network and the corresponding paths. The road network was extracted from OpenStreetMap using OSMnx v2.1.0 and uniformly discretized into lixels for aggregation. Shortest paths were computed using Dijkstra’s algorithm in NetworkX v3.6.1 [51] with the OSMnx edge-length attribute as the routing weight. Real-time traffic conditions and travel-time weights were not incorporated in this baseline setting. Crash locations, as well as origin and destination coordinates, were snapped to the nearest network nodes within a tolerance radius, and records exceeding this tolerance were excluded. Detailed implementation settings and parameter values are provided in Appendix A (Table A1) for reproducibility.
First, crash-based segment representations were constructed by generating OC paths from the KNPA crash dataset and aggregating them at the lixel level to obtain segment-level crash counts. In this process, we excluded 59 OC path reconstructions that failed due to routing errors, representing 4.36% of the 1352 attempted OC paths. The remaining 1293 crash records were used in the experiments.
Next, to approximate traffic volume, an exposure proxy was generated using Monte Carlo-based OD paths. Origin and destination points were sampled within Daejeon’s administrative boundary with a bias toward major activity centers, namely Daejeon Station, Dunsan-dong, and Yuseong-gu, using a Gaussian Mixture Model (GMM).
We used one Gaussian component per activity center, resulting in three components. The component means were set to representative coordinates of each center, namely Daejeon Station, Dunsan-dong, and Yuseong-gu. Because reliable prior demand or traffic volume information was unavailable to calibrate center-specific weights, we assigned equal mixture weights of 0.3333 to all components. We assumed an isotropic covariance structure with a shared variance parameter , which corresponds to approximately under the adopted coordinate interpretation.
OD points were sampled only within Daejeon’s administrative boundary to focus on relative internal network usage, given limited information about external demand. We discarded OD pairs whose Euclidean distance was less than , which rejected 0.11% of attempts, corresponding to 110 per 100,000 samples. We also discarded invalid OD routes that failed during routing, accounting for 7.03% of attempts, or 7028 per 100,000 samples.
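The sampling procedure described above can be sketched with the standard library alone. The sketch assumes planar coordinates, an equal-weight isotropic Gaussian mixture, and a hypothetical `inside` predicate standing in for the administrative boundary test. The centers, standard deviation, and minimum OD distance passed to it are placeholders supplied by the caller, not the values used in the study.

```python
import math
import random


def sample_point(rng, centers, sigma, inside):
    """Draw one point from an equal-weight isotropic Gaussian mixture,
    resampling until it falls inside the study area."""
    while True:
        cx, cy = rng.choice(centers)  # equal mixture weights
        point = (rng.gauss(cx, sigma), rng.gauss(cy, sigma))
        if inside(point):
            return point


def sample_od_pairs(n, centers, sigma, inside, min_dist, seed=0):
    """Return (pairs, rejected): n OD pairs and the count of pairs
    discarded because origin and destination were closer than min_dist."""
    rng = random.Random(seed)
    pairs, rejected = [], 0
    while len(pairs) < n:
        origin = sample_point(rng, centers, sigma, inside)
        destination = sample_point(rng, centers, sigma, inside)
        if math.dist(origin, destination) < min_dist:
            rejected += 1
            continue
        pairs.append((origin, destination))
    return pairs, rejected
```

Fixing the seed makes the generated OD set reproducible, and the `rejected` counter allows the share of discarded short-distance pairs to be reported alongside the routing failures.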
Figure 3 illustrates both the crash-based segment representation and the Monte Carlo-based exposure proxy. The exposure proxy was computed for $|E|$ = 21,605 edges in the study area, and its distribution is highly right-skewed. As shown in Table 1, the median (84) is far smaller than the mean (658.49), and upper-tail quantiles increase sharply (e.g., Q90 = 1985.60 and Q99 = 7721.84), indicating that simulated traversals concentrate on a small subset of links. This heavy-tailed pattern supports interpreting the exposure values as a relative route-usage proxy and motivates stability and smoothing analyses to avoid over-ranking low-exposure links in exposure-normalized risk screening.
4.3. Stability Analysis
Because the proposed road risk measure relies on an exposure proxy generated via Monte Carlo simulation, insufficient simulation size N can lead to unstable risk rankings. In addition, the set of top high-risk segments obtained at each N may vary due to ranking fluctuations of segments near the decision boundary.
To evaluate this effect, we use the Jaccard similarity. For two sets $A$ and $B$, it is defined as

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|} \tag{10}$$

where $0 \le J(A, B) \le 1$, and larger values indicate greater similarity between the two sets.
Based on Equation (10), we evaluate both local stability and global convergence.
Local stability measures how consistently high-risk segment sets are preserved as the simulation size increases. Let
denote the set of top
high-risk segments at simulation size
N. The change between successive sets is measured by
. Among a total of 21,605 segments,
corresponds to 1081 high-risk segments. As
N increases from
to
with
, the similarity stays above
across all settings and exceeds
once
N reaches
. At the same time, churn, defined as the number of replaced elements between successive sets, decreases from 46 to 4, indicating that the high-risk set becomes increasingly stable as
N increases. Global convergence evaluates how the set
approaches a reference set as
N increases. Using
as the reference,
increases from
at
to
at
, exceeding
after
. Meanwhile, churn based on the symmetric difference decreases from 60 to 4, confirming that the high-risk set gradually converges as the simulation size increases.
Figure 4a summarizes the local stability and global convergence trends across simulation scales.
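The set-based stability measures can be computed with a few lines of code. The sketch below assumes high-risk sets are Python sets of segment identifiers; `replaced_count` and `symmetric_churn` are hypothetical names for the two churn variants mentioned above.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B|; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def replaced_count(previous, current):
    """Number of elements of the previous set that were replaced."""
    return len(previous - current)


def symmetric_churn(current, reference):
    """Size of the symmetric difference with respect to a reference set."""
    return len(current ^ reference)
```

Local stability compares each set with the one obtained at the next simulation size, while global convergence compares every set with the one obtained at the largest simulation size.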
In addition to the set-based stability analysis, we further examine whether the exposure proxy itself stabilizes as N increases. Set-level overlap measures such as mainly reflect changes near the selection boundary of the top- set. They do not necessarily guarantee that the underlying segment-level exposure values or their induced rankings have converged over the entire network.
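Let $r_N$ denote the rank vector over edges induced by the exposure proxy $X^{(N)}$ computed using $N$ simulated OD routes, where rank 1 corresponds to the largest exposure. We quantify rank-level drift between two simulation sizes $N_a$ and $N_b$ by the mean absolute rank change $\bar{\Delta}(N_a, N_b)$:

$$\bar{\Delta}(N_a, N_b) = \frac{1}{|E|} \lVert r_{N_a} - r_{N_b} \rVert_1 \tag{11}$$

where $\lVert \cdot \rVert_1$ denotes the $\ell_1$-norm and $|E|$ denotes the total number of edges in the road network.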
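As a minimal sketch of this rank-level comparison, the code below ranks edges by exposure and averages the absolute rank differences. Ties are broken by the order of the `edges` list, which is an assumption made here because the tie-handling rule is not stated; the function names are hypothetical.

```python
def rank_vector(exposure, edges):
    """Map each edge to its rank, where rank 1 is the largest exposure.

    Ties keep the order of `edges` because Python's sort is stable.
    """
    order = sorted(edges, key=lambda e: -exposure.get(e, 0))
    return {e: i + 1 for i, e in enumerate(order)}


def mean_abs_rank_change(exposure_a, exposure_b, edges):
    """L1 distance between two rank vectors divided by the number of edges."""
    rank_a = rank_vector(exposure_a, edges)
    rank_b = rank_vector(exposure_b, edges)
    return sum(abs(rank_a[e] - rank_b[e]) for e in edges) / len(edges)
```

For four edges in which only the top two swap places between two simulation sizes, the measure equals (1 + 1 + 0 + 0) / 4 = 0.5.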
Using Equation (
11), the local rank-level drift between adjacent simulation sizes decreases monotonically as
N increases. Specifically,
is 639.59 and gradually decreases, reaching 102.35 at
. This trend indicates that the exposure distribution and the ranking it induces become progressively more stable as simulation size increases, and that rank changes in later stages are increasingly localized while the overall ordering structure is preserved. We also evaluate global rank-level drift using a reference-based comparison. Taking
as the reference,
decreases as
N increases, confirming a consistent convergence trend in the exposure-induced ranking. Taken together, these results complement the Jaccard-based stability analysis by showing that not only the selected high-risk set but also the underlying exposure proxy rankings converge sufficiently as the simulation size grows.
Figure 4b visualizes the local and global rank-level drift trends as a function of
N.
Based on these findings, we set the simulation size to to mitigate under-sampling effects and ensure reproducibility.
4.4. Low-Exposure Diagnostics for Sensitivity Analysis
To quantify whether the smoothing parameter induces an over-selection of extremely low-exposure segments, we additionally compute the prevalence of low-exposure edges within the selected high-risk set. The risk score is defined as , where provides additive smoothing in low-count regimes (e.g., ). In this diagnostic, we fix and vary while keeping all other settings unchanged. We define as the set of segments selected by the top- selection ratio according to , and we report results at unless stated otherwise.
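Along with fixed-threshold diagnostics, we define a quantile-based low-exposure prevalence as

$$\pi_q = \frac{\left|\{e \in H : X_e \le Q_q(X)\}\right|}{|H|}$$

where $H$ is the selected high-risk set, $X_e$ is the exposure proxy of edge $e$, and $Q_q(X)$ denotes the $q$-quantile of the exposure distribution $\{X_e\}_{e \in E}$ over the full edge set $E$. Thus, $\pi_{0.05}$ represents the fraction of selected edges whose exposure falls within the bottom 5% of the overall exposure distribution, and $\pi_{0.10}$ is defined analogously for the bottom 10%.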
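This diagnostic can be sketched as follows. The empirical quantile is taken here as the value at position ceil(q * n) of the ascending exposure list, and edges at or below that value count as low-exposure; both conventions, along with the function names, are assumptions made for illustration.

```python
import math


def empirical_quantile(values, q):
    """Value at position ceil(q * n) (1-indexed) of the ascending list."""
    ordered = sorted(values)
    index = max(math.ceil(q * len(ordered)) - 1, 0)
    return ordered[index]


def low_exposure_prevalence(selected, exposure, q):
    """Fraction of selected edges whose exposure lies in the bottom q share
    of the exposure distribution over all edges."""
    cutoff = empirical_quantile(exposure.values(), q)
    low = sum(1 for e in selected if exposure[e] <= cutoff)
    return low / len(selected)
```

A prevalence far above `q` indicates that the selected set is dominated by rarely traversed edges, whereas a value near or below `q` suggests that the smoothing has removed that bias.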
Table 2 reports these prevalences for different
values. When
is small, the selected set is heavily dominated by extremely low-exposure edges. At
,
of the selected edges satisfy
and
satisfy
. As
increases, this dominance is substantially mitigated. At
and
, the prevalences drop to
for the bottom 5% criterion and
for the bottom 10% criterion. At
, the low-exposure prevalence becomes nearly negligible. It reaches
for the bottom 5% criterion and
for the bottom 10% criterion.
These results support the interpretation that acts as a pseudo-exposure regularizer that suppresses low-exposure inflation and improves the robustness of the exposure-normalized ranking. We select as a conservative compromise because it already achieves a substantial reduction in low-exposure dominance (comparable to in this diagnostic) while avoiding overly aggressive compression of risk scores among low-to-moderate exposure segments, which can reduce discrimination in the ranking. This choice is consistent with the overall stability results and remains robust under the tested combinations.
4.5. Risk Segmentation Results
Based on our stability analysis and low-exposure sensitivity diagnostics, we fixed the simulation size and smoothing parameters as , , and . Using these settings, was computed, and the top high-risk segments were identified. The proposed method was compared with NKDE, where bandwidths were applied.
Figure 5 compares the statistical characteristics of the top high-risk segments under different NKDE bandwidths and the proposed method, based on 1293 crash records. A risk cluster is defined as a set of spatially connected lixels exceeding a given threshold, forming a continuous road segment.
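The cluster definition corresponds to connected components of the high-risk lixels. A minimal sketch is given below, assuming a hypothetical lixel adjacency dictionary and a per-lixel length dictionary; it is an illustration of the definition, not the study's implementation.

```python
from collections import deque


def risk_clusters(adjacency, high_risk):
    """Group high-risk lixels into connected components (breadth-first search)."""
    seen, clusters = set(), []
    for start in sorted(high_risk):
        if start in seen:
            continue
        component, queue = [], deque([start])
        seen.add(start)
        while queue:
            u = queue.popleft()
            component.append(u)
            for v in adjacency.get(u, ()):
                if v in high_risk and v not in seen:
                    seen.add(v)
                    queue.append(v)
        clusters.append(component)
    return clusters


def cluster_summary(clusters, length):
    """Cluster count plus mean and maximum cluster length."""
    lengths = [sum(length[x] for x in c) for c in clusters]
    mean = sum(lengths) / len(lengths) if lengths else 0.0
    return {"count": len(lengths), "mean_length": mean,
            "max_length": max(lengths, default=0.0)}
```

Many short components indicate fragmentation, while a few very long components indicate that adjacent segments have merged, which are the two failure modes discussed for small and large bandwidths.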
With a small bandwidth (), the number of clusters is large (760), and the average length is short (), indicating over-fragmentation due to sensitivity to local density variations. In contrast, a large bandwidth () produces only 59 clusters, with a significantly increased average length () and maximum length (), reflecting over-smoothing as risk values spread over broader regions and adjacent segments merge into large clusters.
For an intermediate bandwidth (
), the results show a more balanced spatial structure, with 334 clusters and an average length of
. This observation is consistent with prior studies indicating that bandwidth selection critically affects result quality and that the optimal value depends on data characteristics [23,52].
In comparison, the proposed method produces 261 clusters with an average length of , yielding coherent risk segments without excessive fragmentation or over-smoothing. Since the proposed framework integrates crash-path accumulation with an exposure proxy, it tends to produce stable segment representations that are less sensitive to kernel bandwidth choices. These results suggest that the proposed method provides a consistent network-wide risk segmentation layer suitable for screening and prioritization.
4.6. Comparison with NKDE
We compare the proposed method with NKDE using two evaluation metrics, namely and . The experiments were conducted using a 10-fold cross-validation scheme. In each fold, 90% of the crash records in the KNPA dataset were randomly selected as the source dataset for risk estimation, and the top road segments were identified based on the computed risk scores. The remaining 10% of the crash data were used as the evaluation dataset.
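The evaluation protocol above can be sketched as follows. This minimal example assumes a standard partition into ten disjoint folds after a random shuffle. The text only states that 90% of the records are randomly selected in each fold, so the partitioning strategy, the function name, and the seed are assumptions.

```python
import random


def ten_fold_splits(record_ids, n_folds=10, seed=0):
    """Yield (source, held_out) id lists for each fold.

    Records are shuffled once and dealt into `n_folds` disjoint folds.
    Each fold serves once as the evaluation set, and the remaining
    records form the source set used for risk estimation.
    """
    ids = list(record_ids)
    random.Random(seed).shuffle(ids)
    folds = [ids[i::n_folds] for i in range(n_folds)]
    for i, held_out in enumerate(folds):
        source = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield source, held_out


# 1293 is the number of crash records reported in the paper.
splits = list(ten_fold_splits(range(1293)))
```

Because 1293 is not divisible by ten, the held-out folds contain 129 or 130 records each, so the split is only approximately 90/10.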
To determine an appropriate NKDE bandwidth empirically, we performed a data-driven tuning experiment on the training folds. We considered a candidate set
and for each
computed hit-rate curves over
for both
and
. An AUC-like score summarized each curve.
The bandwidth was selected by maximizing this criterion on the training folds.
Based on as the primary tuning objective, the best-performing NKDE bandwidth was with . For completeness, we also tuned NKDE using and obtained its maximum at with .
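The tuning procedure can be sketched as below. The exact form of the AUC-like score is not reproduced in this section, so the trapezoidal rule is an assumption, and the candidate bandwidths, budget levels, and hit rates are hypothetical placeholders rather than reported values.

```python
def auc_like(budgets, hit_rates):
    """Trapezoidal area under a hit-rate curve sampled at `budgets`."""
    area = 0.0
    for i in range(1, len(budgets)):
        width = budgets[i] - budgets[i - 1]
        area += 0.5 * (hit_rates[i] + hit_rates[i - 1]) * width
    return area


def select_bandwidth(curves, budgets):
    """Pick the bandwidth whose hit-rate curve has the largest area.

    curves: dict bandwidth -> list of hit rates, aligned with `budgets`
    """
    scores = {bw: auc_like(budgets, hr) for bw, hr in curves.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical budgets (top-k percentages) and training-fold hit rates.
budgets = [1, 5, 10]
curves = {
    50: [0.10, 0.40, 0.60],
    100: [0.05, 0.30, 0.70],
}
best_bw, scores = select_bandwidth(curves, budgets)
```

Summarizing each curve by a single area makes the choice depend on performance across the whole budget range rather than at one threshold. This matters because the curves for different bandwidths can cross, as they do in this toy example.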
4.6.1. Quantitative Results (Global Analysis)
In
Figure 6a,
increases with
for all methods and approaches saturation as the selected high-risk budget becomes large. Under point-based evaluation, NKDE with smaller bandwidths generally achieves higher hit rates at low
and attains the largest overall AUC (
Table 3). The proposed method shows comparable performance to NKDE (
) and the tuned NKDE baseline with
over
, and it consistently outperforms NKDE with a large bandwidth (
) in the low-
regime. Overall, these results suggest that the proposed method does not primarily improve point-based hotspot coverage. Instead, it yields a stable, monotonic trend as
increases under a fixed formulation, and it avoids bandwidth-dependent smoothing effects that can substantially alter NKDE results.
In
Figure 6b,
increases with
for all methods and approaches 1.0 as the selected high-risk budget becomes large. Under the path-based evaluation, the proposed method consistently achieves higher hit rates than NKDE across low-to-moderate
levels (
Table 4). The improvement is most pronounced in the low-
regime, where the proposed method yields markedly better overlap with crash-involved travel paths. For example, at
the proposed method achieves 0.067 compared with 0.008–0.015 for NKDE, and at
it achieves 0.631 compared with 0.235–0.278 for NKDE. The AUC-like summary score corroborates this trend. The proposed method attains the highest value of 66.003, whereas the tuned NKDE baseline with
achieves 55.810. The performance gains become more gradual beyond approximately
, which is consistent with diminishing marginal overlap as the selected set expands. Although NKDE can achieve higher values at larger
in some settings, such differences primarily reflect broader coverage under a larger selection budget. Overall, these results suggest that incorporating path-level movement context with exposure normalization improves the efficiency of network-wide risk screening in capturing crash-involved traversal patterns.
4.6.2. Case Study (Local Analysis)
To enable an intuitive comparison between density-based and path-based approaches, we select a region that includes both major arterial roads and crash hotspots. The study area corresponds to a radius centered at in Daejeon.
Figure 7 shows the point-based results. The point-based hit rate
was computed for 32 crash locations in the evaluation dataset. Red lines denote high-risk segments, and blue points indicate crash locations that fall within them. NKDE achieves
(21 hits) with
,
(21 hits) with
, and
(19 hits) with
, while the proposed method achieves
(17 hits). These results indicate that NKDE performs well in identifying crash locations based on spatial density when an appropriate bandwidth is selected. For example,
achieves the highest performance by focusing on localized crash clusters.
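As a quick check on the hit counts quoted above, the sketch below assumes the point-based hit rate is the share of evaluation crash locations that fall on a selected high-risk segment. The function name and the dictionary labels are hypothetical, and the resulting ratios are derived from the hit counts under that assumption rather than quoted from the paper.

```python
def point_hit_rate(hits, total):
    """Share of evaluation crash locations lying on selected segments."""
    if total <= 0:
        raise ValueError("total must be positive")
    return hits / total


# Hit counts and the 32 evaluation crash locations come from the text.
TOTAL_CRASHES = 32
case_study = {
    "NKDE, 21 hits": point_hit_rate(21, TOTAL_CRASHES),
    "NKDE, 19 hits": point_hit_rate(19, TOTAL_CRASHES),
    "proposed, 17 hits": point_hit_rate(17, TOTAL_CRASHES),
}
```

Under this assumption, the gap between the best NKDE setting and the proposed method amounts to four crash locations out of 32.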
However, from a spatial perspective, NKDE tends to spread risk values into surrounding areas via kernel diffusion, leading to over-smoothing and the inclusion of segments that are not directly related to crashes. In contrast, the proposed method leverages path-level information to identify risk segments along major travel corridors, producing a more structurally concentrated distribution.
As expected from the way point-based metrics are constructed, density-based approaches score higher under point-based evaluation. However, such scores do not reflect alignment with actual travel paths. We therefore evaluate the results using the path-based hit rate.
Figure 8 presents the path-based evaluation results in terms of
, comparing NKDE and the proposed method. As summarized in
Table 5, NKDE shows limited improvement in the path-based hit rate
as the bandwidth increases, ranging from
to
. Although larger bandwidths increase the number of selected segments, the gain in overlapping segments remains relatively small, indicating inefficient expansion of risk regions.
In contrast, the proposed method achieves a substantially higher of while selecting only 3084 segments. Notably, the number of overlapping segments (1117 hits) is comparable to or higher than those of NKDE despite the smaller selection size. This reflects the definition of , where the hit rate is determined by the proportion of crash-related path segments that overlap with the selected high-risk segments.
From this perspective, the proposed method improves both components of the metric, increasing the number of overlapping segments while reducing the number of selected segments, thereby achieving a higher overlap ratio. In contrast, NKDE primarily increases the number of selected segments without a proportional increase in overlap, resulting in lower values.
Furthermore, the higher ratio of overlapping to selected segments indicates that the proposed method more precisely captures segments actually traversed by crash-related paths. In contrast, NKDE tends to include many segments that are not aligned with actual travel paths, due to kernel-based spatial diffusion.
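The two overlap ratios discussed above can be computed from segment-id sets, as in the following sketch. The text refers both to the proportion of crash-related path segments that are covered and to the ratio of overlapping to selected segments, so both are returned. The function name, the dictionary keys, and the toy sets are assumptions.

```python
def path_overlap(selected, path_segments):
    """Overlap between selected high-risk segments and crash-path segments.

    Returns the hit count plus two ratios:
      share_of_path:     hits / number of crash-related path segments
      share_of_selected: hits / number of selected segments
    """
    selected, path_segments = set(selected), set(path_segments)
    hits = len(selected & path_segments)
    return {
        "hits": hits,
        "share_of_path": hits / len(path_segments) if path_segments else 0.0,
        "share_of_selected": hits / len(selected) if selected else 0.0,
    }


# Hypothetical toy sets of segment ids.
toy = path_overlap(selected={1, 2, 3, 4}, path_segments={3, 4, 5, 6, 7})

# Ratio of overlapping to selected segments from the counts reported above.
reported_ratio = 1117 / 3084
```

With the counts reported for the proposed method, 1117 overlapping segments out of 3084 selected, the overlap-to-selected ratio is approximately 0.362. Selecting more segments raises this ratio only if hits grow at least proportionally, which is the contrast drawn with NKDE.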
These results indicate that density-based approaches emphasize spatial coverage, whereas path-based approaches better capture alignment with actual travel behavior.
5. Discussion and Conclusions
This study proposes a path-based risk segmentation framework that estimates segment-level road risk by integrating path-derived traversal frequencies with simulation-based exposure. Unlike conventional approaches that treat crashes as independent points, the proposed method interprets crashes within the context of travel paths. It identifies risk patterns that reflect the sequential relationships among road segments.
The results show that the proposed method exhibits distinct characteristics compared with NKDE. While achieving comparable performance in point-based evaluation, it substantially outperforms NKDE in path-based evaluation, aligning more closely with crash-involved travel paths. In addition, whereas NKDE shows substantial variation in the spatial extent and shape of risk segments across bandwidth values, the proposed method produces stable, coherent risk patterns that are less sensitive to bandwidth-driven smoothing.
These results demonstrate the practical utility of the proposed ranking for network-wide screening, while also calling for careful interpretation of the identified high-risk segments.
We emphasize that the proposed framework is intended for screening and prioritization rather than causal attribution. Although the method highlights segments that frequently appear in crash-involved proxy paths after exposure normalization, such overlap should be interpreted as a risk-correlated indicator rather than evidence of a direct causal mechanism. A segment may appear frequently because it lies on an unavoidable route to a genuinely hazardous area, even if the segment itself is not the primary causal factor. Accordingly, the ranked outputs are best used as a first-stage filter to prioritize candidate segments for follow-up validation, including field inspection, geometric review, signal operation assessment, and near-miss analytics.
However, several limitations remain. First, the dataset is limited to 1293 crash records, which may introduce bias if crashes are concentrated in specific regions or road segments. Second, actual vehicle trajectories are not directly observed and are approximated using shortest paths, which may differ from real driving behavior, especially under congestion or user-equilibrium routing. We therefore interpret reconstructed routes as proxy paths derived under a consistent routing rule, approximating structural exposure rather than individual route choice under congestion. Third, exposure is estimated via simulation, which introduces uncertainty that depends on the simulation settings and scale. The exposure proxy is not intended to reproduce calibrated absolute volumes such as AADT or VKT and should be interpreted as a relative denominator for normalization. OD points are confined to the study area’s administrative boundary, and external inflows and outflows, including through traffic, are not explicitly modeled.
Future work will focus on improving realism and reducing uncertainty in exposure estimation by incorporating measured traffic volumes and operational data when available. When calibrated traffic counts, OD matrices, or signal timing and control information become available, the downstream risk formulation can be retained and the simulated exposure term replaced with a more realistic exposure model, such as an assignment-based or calibrated simulation-based exposure estimate. In addition, the OD prior can be refined by integrating population and employment distributions, point-of-interest density, census-tract variables, or floating car data (FCD) to calibrate mixture weights and spatial dispersion. Extending the OD generation process to account for external demand will further improve the representativeness of the exposure proxy, particularly for corridors influenced by through traffic. Finally, the generalizability of the proposed framework should be validated across multiple cities with heterogeneous network structures and traffic patterns.
Overall, this study provides a new perspective on crash risk analysis by shifting from point-based to path-based interpretation. The proposed framework enables the identification of high-risk segments at the network level and facilitates traffic safety management and risk-based prioritization of road interventions. We note that the proposed framework is primarily intended for network-wide screening and prioritization in data-limited settings. The identified high-risk candidates should be validated through subsequent engineering review and, where available, richer sensing and operational data.