Path-Based Risk Segmentation of Road Networks with Exposure Modeling

Yoon, Yeongho; Shin, Inkyoung; Lee, Yonggeol

doi:10.3390/electronics15102069


by Yeongho Yoon ¹, Inkyoung Shin ² and Yonggeol Lee ¹,*

¹ School of Computing and Artificial Intelligence, Hanshin University, Osan 18101, Republic of Korea

² Police Science Institute, Korean National Police University, Asan 31539, Republic of Korea

* Author to whom correspondence should be addressed.

Electronics 2026, 15(10), 2069; https://doi.org/10.3390/electronics15102069

Submission received: 9 April 2026 / Revised: 6 May 2026 / Accepted: 9 May 2026 / Published: 12 May 2026

(This article belongs to the Special Issue Automated Driving Systems: Latest Advances and Prospects)


Abstract

Crash hotspot analysis has been widely studied in road traffic safety. Conventional approaches primarily rely on the spatial density or frequency of crash locations but fail to capture vehicle traversal patterns and segment-level exposure. In addition, when detailed traffic volume data are unavailable, it becomes difficult to assess risk while accounting for road exposure. In particular, Network Kernel Density Estimation (NKDE) is sensitive to bandwidth selection and remains limited in representing exposure-normalized, path-consistent risk at the road-segment level. To overcome these limitations, this study proposes a path-based risk segmentation framework that integrates crash paths with simulation-based exposure. Origin–crash coordinate pairs are extracted from crash reports, and vehicle paths are reconstructed over a road network. Monte Carlo simulation is used to estimate a relative exposure proxy across road segments and combine it with path-derived traversal patterns to compute segment-level risk. A case study in Daejeon Metropolitan City demonstrates that the proposed method addresses key limitations of NKDE by yielding more coherent risk segments and improving path alignment, and it identifies high-risk segments more effectively than the conventional NKDE baseline, particularly under small top-$\alpha\%$ selection ratios, as measured by the path-based hit rate. This study provides a new perspective on crash risk analysis by shifting from point-based to path-based interpretation and by explicitly normalizing risk with an exposure proxy under data-limited conditions. It offers a practical framework for identifying high-risk segments at the road network level.

Keywords: traffic safety; crash analysis; network-based analysis; road segment risk; road network

1. Introduction

Preventing crashes has long been a key challenge in the road traffic sector. As the number of vehicles continues to increase, the human and social toll of crashes remains significant. According to the World Health Organization (WHO), approximately 1.35 million people die and tens of millions are injured in crashes each year [1]. These losses go beyond human casualties and impose extensive social and economic burdens, including increased medical and insurance costs and decreased productivity [2]. Consequently, identifying high-risk road sections to allocate limited traffic safety resources effectively has become a key topic in traffic safety research [3,4,5,6]. Crash hotspots (also known as black spots) are generally defined as road sections where crashes occur repeatedly within a defined spatial range. Accurately identifying these sections is essential for targeting safety improvement areas and maximizing policy effectiveness [7,8,9].

Traditionally, various methods have been used to identify crash hotspots based on crash characteristics that concentrate around specific points [10,11]. Representative statistical approaches include the Safety Performance Function (SPF) and the Empirical Bayes (EB) family of methods, which estimate the expected number of crashes by accounting for exposure factors such as road geometry and traffic volume [12]. While these approaches have the advantage of reflecting the probabilistic nature of crash occurrence, their applicability is limited by the need for additional data, such as traffic volume and detailed road attributes [13]. Various studies have proposed combining SPF and EB estimation to identify crash hotspots [14,15]. By combining observed and model-based expected crash counts, the EB technique can mitigate estimation variability relative to crash-frequency-based approaches [16]. However, recent studies have pointed out that EB-based approaches may have limitations depending on the application environment [17]. In particular, when spatial analysis units, such as road segments, are subdivided, existing traffic volume data become difficult to use directly, and analysis results become sensitive to model settings or spatial aggregation methods [18]. These limitations pose a practical constraint in quantitatively assessing crash risk in environments where traffic volume information is insufficient [19].

Along with statistical approaches, spatially based crash hotspot identification methods that utilize the spatial distribution of crash locations have also been widely used. Kernel Density Estimation (KDE) [20] has been used to identify areas with high crash concentrations by representing individual crash points as a continuous density distribution [21]. However, point-based KDE has the limitation of not explicitly reflecting the actual road network structure at crash sites [22]. To address this limitation, Network Kernel Density Estimation (NKDE), which extends KDE to the road network [23], was proposed. NKDE projects crash points onto the road network and applies a kernel function based on network distance, enabling the estimation of crash density at the road link or segment level. Recent studies have increasingly moved beyond using KDE merely as a visualization tool for crash concentration and have sought to improve interpretability by integrating mobility and exposure information. In a Global Positioning System (GPS)-based framework, KDE is first used to identify spatial concentration patterns of crashes involving vulnerable road users, and the resulting density patterns are then interpreted in the context of traffic activity by combining floating-car-data (FCD)-based driver behavior analysis with Space Syntax-based pedestrian flow analysis [24]. Other studies use KDE heatmaps to pre-screen potentially problematic intersections and subsequently incorporate FCD-derived traffic-volume proxies (e.g., average daily traffic estimates) to compute exposure-adjusted crash rates at the intersection level [25].

Previous studies have reported that NKDE can more realistically represent crash-concentrated sections in road network environments than point-based KDE. However, NKDE remains a density estimation technique based on crash location or frequency and thus does not explicitly consider differences in traffic exposure across road segments. Traditional KDE and NKDE-based approaches model crashes as point-based events and infer risk from the spatial distribution of observed crash points. Consequently, risk is typically calculated as a spatially diffuse high-risk area rather than a specific road segment, limiting the direct interpretation of risk factors at the individual road segment level [22]. Moreover, NKDE performance and the spatial extent of detected risk segments can be highly sensitive to the bandwidth parameter, raising concerns about parameter-driven conclusions when the bandwidth is not selected through a data-driven protocol.

To address the limitations of existing approaches, this study proposes a path-based analysis framework that characterizes crash-related road segments from a network perspective, even under limited traffic volume data. This approach provides a foundation for comparing and analyzing relative segment-level traversal patterns and associated risk. Unlike existing approaches that rely on point-based density or model-based estimation, this study explicitly models crash risk as a function of path-level movement and segment-level exposure. In particular, the proposed framework complements NKDE by incorporating movement context through reconstructed origin–crash paths and by normalizing segment risk with an exposure proxy derived from simulated origin–destination paths when traffic volumes are unavailable.

The contributions of this paper are as follows. First, it goes beyond crash-centered analysis and presents a road segment-level risk analysis procedure based on the crash vehicle’s travel path. Second, it proposes a method for constructing relative risk at the road segment level using a path-based exposure proxy in environments where traffic volume data are missing. Third, it visualizes the results of the proposed method on a map using a case study in Daejeon Metropolitan City. It also illustrates differences in segment risk identification characteristics by comparing them with the existing NKDE approach.

This paper is organized as follows. Section 2 reviews related work. Section 3 describes the proposed method. Section 4 presents the experimental setup and results, along with a comparative analysis with NKDE. Finally, Section 5 summarizes the discussion and conclusion.

2. Related Work

2.1. Trajectory Acquisition and Path Reconstruction

The most precise method for obtaining vehicle trajectories at the segment level is direct tracking based on position sensors. By collecting GPS coordinates chronologically from the vehicle or the driver’s smartphone, the trajectory can be reconstructed intuitively and accurately. Vehicle On-Board Diagnostics (OBD), Controller Area Network (CAN), and black-box GPS logs can also be used [26,27]. However, obtaining this data from actual crash vehicles presents practical limitations due to privacy concerns, legal restrictions, and the lack of on-board sensors.

Consequently, reconstructing vehicle paths from available data has become a practical alternative to direct trajectory measurement. Specifically, hybrid path estimation methods that combine crash data, video information, and road networks are widely used [28,29,30,31]. Several studies have explored path estimation approaches that utilize navigation Application Programming Interfaces (APIs) to reconstruct vehicle paths from location information, including origins, crash locations, and destinations, expressed either as textual descriptions (e.g., addresses or place names) or as latitude/longitude coordinates [29,32]. Identified location-related text is subsequently transformed into geographic coordinates $p = (\lambda, \phi)$ via geocoding.

2.2. Crash Counts and Segment-Level Utilization Proxies

The frequency of crashes observed on a specific road segment reflects the empirical record of repeated crashes on that segment. It can serve as an indicator of potential risk concentration due to structural or environmental factors. The repeated overlapping of multiple crash paths on the same segment is not a temporary coincidence but rather indicates the presence of persistent or potential risk factors. However, this cumulative crash frequency should be interpreted as an indirect indicator of the segment’s ongoing exposure to risk, rather than a direct quantification of crash risk.

To estimate segment-level exposure from the movement trajectories of vehicles involved in crashes, the actual driving trajectories of those vehicles are required. However, such data are generally difficult to obtain due to privacy protection and constraints in the collection environment. Therefore, approaches that utilize movement path information on the road network to estimate relative road use levels have been proposed. A typical method is to define segment-specific utilization by aggregating the frequency with which each road segment is included in a path [32]. If trajectory data are available, exposure can be calculated by map-matching them to the road network and aggregating them at the link or segment level [33,34].

2.3. Exposure Estimation Under Missing Traffic Volumes

Traffic volume on a road segment is an essential factor in normalizing crash frequency into exposure and calculating relative crash risk [35]. Typically, Annual Average Daily Traffic (AADT) or Vehicle Kilometers Traveled (VKT) have been used as denominators for crash rates [12,36]. However, in real-world traffic environments, reliable traffic volume information is often unavailable for all road segments. Especially on local roads, such data are frequently missing due to limitations in traffic measurement infrastructure [37].

To address these limitations, various exposure proxies have been proposed to estimate segment-level exposure without relying on actual traffic volume data. For example, AADT can be estimated by combining observations from limited traffic measurement points with external variables such as road structure and population density, and subsequently used as an exposure measure for crash rate estimation [38,39,40]. In addition, methods that estimate segment-level utilization by incorporating both road network structure and surrounding environmental characteristics have also been studied [41,42].

Meanwhile, in environments where actual movement data are unavailable, path-based approaches have been adopted to generate Monte Carlo-based traffic flows [43]. These approaches construct relative exposure by generating virtual Origin–Destination (OD) pairs and iteratively computing paths across the road network.

Recent studies further improve realism by incorporating urban concentration and land-use characteristics into probabilistic OD generation [44,45].

3. Methodology

Crashes are typically recorded as single-point events. However, this point-based representation alone fails to adequately capture the vehicle movement process leading up to the crash. This study interprets crashes not as static point events, but as the result of interactions between vehicle movement along a road segment and the surrounding traffic environment. Specifically, we define crashes as path-based events and explicitly integrate the pre-crash movement process into the analysis by reconstructing each crash’s vehicle path before the crash.

Figure 1 illustrates the overall analysis pipeline for assessing road segment risk by integrating crash-based segment frequency derived from observed crash data with a Monte Carlo-based exposure proxy. The proposed framework consists of three main steps: estimating crash frequency and exposure for each road segment, computing relative risk, and identifying high-risk segments.

First, to capture crash-based traversal characteristics of individual road segments, Origin–Crash (OC) coordinate pairs are derived from the origin and crash locations recorded in crash data. Using these pairs, crash-related travel paths are reconstructed. Aggregating these paths at the road-segment level yields a road network that reflects historical crash occurrences.

Using the same procedure, Monte Carlo-generated OD coordinate pairs are employed to reconstruct travel paths under normal traffic conditions, from which segment-level traversal patterns are derived.

Next, considering the positive correlation between traffic volume and crash frequency, the risk of each road segment is calculated by comparing the crash-based network with the Monte Carlo-based network. Finally, based on the calculated risk score distribution, predefined ratio criteria are applied to identify high-risk road segments.

3.1. Path Generation on Road Networks

To derive a path, the road network is defined as a graph $G = (V, E, \xi)$. Here, $V$ is a set of nodes, $E$ is a set of road segments (edges), and the function $\xi$ maps each node to the planar coordinate system of the road network. In real-world situations, crash coordinates recorded during police investigations often contain GPS errors or are reported as locations of adjacent buildings rather than on the road [46]. Therefore, all location coordinates are aligned with the road network via snapping before path search. An arbitrary query point $p \in \mathbb{R}^{2}$ is mapped to the nearest node $v$ on the road network as follows:

$$v = \arg\min_{v \in V} D(p, \xi(v)) \tag{1}$$

Here, $D$ is a distance function. In this study, the squared Euclidean distance $\| p - \xi(v) \|_{2}^{2}$ is used, which yields the same nearest node as the Euclidean distance. Similarly, the coordinates $p_{x}$ and $p_{y}$, corresponding to the two endpoints of the path, are mapped to nodes $v_{x}$ and $v_{y}$ on the road network, respectively.

$$\begin{aligned} v_{x} &= \arg\min_{v \in V} D(p_{x}, \xi(v)), \\ v_{y} &= \arg\min_{v \in V} D(p_{y}, \xi(v)) \end{aligned} \tag{2}$$
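The snapping step in Equations (1) and (2) can be illustrated with a minimal Python sketch. The function name `snap_to_nearest_node`, the optional `max_dist` tolerance argument, and the toy coordinates are assumptions made for illustration; they are not taken from the authors' implementation, and a real network would normally use a spatial index rather than a linear scan.

```python
import math

def snap_to_nearest_node(p, node_coords, max_dist=None):
    """Return the id of the node whose planar coordinate is closest to p.

    node_coords maps node id -> (x, y) in a projected (metric) coordinate system.
    If max_dist is given and the nearest node is farther away, return None.
    """
    best_node, best_d2 = None, math.inf
    for node, (x, y) in node_coords.items():
        # Squared Euclidean distance; the argmin matches the unsquared distance.
        d2 = (p[0] - x) ** 2 + (p[1] - y) ** 2
        if d2 < best_d2:
            best_node, best_d2 = node, d2
    if max_dist is not None and best_d2 > max_dist ** 2:
        return None
    return best_node

# Hypothetical toy node coordinates in metres (the role of xi in the text).
xi = {"a": (0.0, 0.0), "b": (100.0, 0.0), "c": (100.0, 80.0)}
v_x = snap_to_nearest_node((10.0, 5.0), xi)    # endpoint p_x
v_y = snap_to_nearest_node((95.0, 70.0), xi)   # endpoint p_y
too_far = snap_to_nearest_node((500.0, 500.0), xi, max_dist=50.0)
```

Comparing squared distances avoids a square root per node without changing which node is selected.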

Consequently, given the starting node $v_{x}$ and the destination node $v_{y}$, the travel path $P$ is defined as follows:

$$P(v_{x}, v_{y}) = \mathrm{path}(v_{x}, v_{y}) \tag{3}$$

In Equation (3), $\mathrm{path}(v_{x}, v_{y})$ is a path-finding function on the road network, which can be calculated based on the shortest distance, minimum time, or other cost functions. A path $P$ consists of a series of road segments $e \in E$ connecting $v_{x}$ and $v_{y}$. Depending on the definition of the road network, it can be set as a directed path ($P(v_{x}, v_{y}) \neq P(v_{y}, v_{x})$) or an undirected path ($P(v_{x}, v_{y}) = P(v_{y}, v_{x})$).
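As one possible instance of the path-finding function in Equation (3), the sketch below uses a shortest-distance cost and a plain Dijkstra search written with the standard library. The adjacency-dictionary layout, the function name `shortest_path_edges`, and the toy graph are illustrative assumptions rather than the authors' code.

```python
import heapq

def shortest_path_edges(adj, source, target):
    """Dijkstra's algorithm on a directed graph.

    adj maps node -> {neighbour: non-negative edge cost}.
    Returns the path as a list of (u, v) edges, or None if target is unreachable.
    """
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    visited = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in visited:
            continue
        visited.add(u)
        if u == target:
            break
        for v, w in adj.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if target not in dist:
        return None
    # Walk predecessors back from the target to recover the edge sequence.
    edges = []
    node = target
    while node != source:
        edges.append((prev[node], node))
        node = prev[node]
    edges.reverse()
    return edges

# Hypothetical directed toy network with edge lengths in metres.
adj = {"a": {"b": 100.0, "c": 250.0}, "b": {"c": 80.0}, "c": {}}
forward = shortest_path_edges(adj, "a", "c")
backward = shortest_path_edges(adj, "c", "a")
```

In this toy network the forward path exists while the reverse one does not, which is the directed case described above.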

3.2. Crash-Based Segment Representation

Because direct trajectories of crash vehicles are rarely available, we indirectly estimate the travel path of a crash vehicle from the OC information recorded in crash data. In this paper, “Origin” is defined as the network-mapped trip start location extracted (or inferred) from crash reports for OC path reconstruction. While the OC path partially reflects the vehicle’s actual driving process before the crash, the Crash-Destination (CD) path represents the hypothetical movement after the crash. The post-crash path is not causally linked to the crash mechanism, and including it in the aggregation could bias exposure estimates. Therefore, this study focuses on the OC path to capture the driving process leading up to the crash.

After mapping the OC coordinate pair $(p_{o}, p_{c})$ to the node pair $(v_{o}, v_{c})$ on the road network, a crash-based path is generated, and the passage frequency of each road segment is aggregated. Specifically, whether link $e \in E$ is included in the crash path is determined and used as an input indicator for subsequent risk calculations.

The cumulative crash path frequency $C_{e}$ is defined by accumulating whether each path $r_{i}$ in the crash path set $A$ includes link $e$. Specifically, the cumulative crash frequency is calculated by adding 1 if link $e$ is included in the crash path $r_{i}$, which is expressed as follows:

$$C_{e} = \sum_{i \in A} \mathbb{1}(e \in r_{i}) \tag{4}$$

Here, the indicator function $\mathbb{1}(\cdot)$ returns 1 if the condition is true and 0 if it is false. A simple accumulation method without weighting was applied to treat each crash with equal severity. The resulting $C_{e}$ represents the intensity of crash occurrences on each road segment, reflecting the concentration of crash-related movements rather than exposure.
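The accumulation in Equation (4) amounts to counting, for each link, how many crash paths contain it. A minimal sketch follows; the function name `segment_counts` and the toy paths are assumptions for illustration only.

```python
from collections import Counter

def segment_counts(paths):
    """Count, for every edge, the number of paths that include it.

    Each path is a list of edges. Converting a path to a set implements the
    indicator function: a path adds at most 1 to an edge's count.
    """
    counts = Counter()
    for path in paths:
        counts.update(set(path))
    return counts

# Hypothetical reconstructed OC paths, each a list of (u, v) edges.
crash_paths = [
    [("a", "b"), ("b", "c")],
    [("d", "b"), ("b", "c")],
    [("a", "b")],
]
C = segment_counts(crash_paths)
```

Edges that never appear in any path simply report a count of zero when looked up in the `Counter`.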

The road network representation captures the geometric characteristics of real roads beyond a simple intersection-based structure. Unlike typical road network models that segment links at intersections, this approach defines segments based on changes in road curvature and branching [47]. For example, for sections that are difficult to represent with a single straight link, such as loops or curved roads, nodes are added at points where the road geometry significantly changes, thereby dividing the road into multiple segments. This allows complex road geometries, such as curves, turns, and elevated loops, to be more precisely represented on the network. Furthermore, since road branching and junctions change driving path options, segmenting them based on these points clearly reflects the network’s connectivity structure. This curvature- and junction-based segmentation method helps minimize errors that may arise when estimating path overlap frequency and segment risk [48]. This study does not consider additional weights, such as length or travel time, for road network links.

3.3. Monte Carlo-Based Exposure Proxy Representation

In this study, exposure is defined as the frequency with which a road segment is traversed. Since actual traffic volume data are unavailable, we approximate exposure using a Monte Carlo-based exposure proxy. This proxy represents segment-level exposure derived from simulated paths and serves as a substitute for true traffic volume in risk estimation.

The proposed exposure measure does not represent absolute traffic volume, but rather captures relative differences in utilization across road segments. As it is derived from simulated paths on the road network, it captures structural patterns of segment-level exposure shaped by network connectivity and the spatial distribution of demand in urban areas. Therefore, it serves as a suitable denominator for relative risk normalization.

To construct this measure, a Monte Carlo-based approach is employed under normal traffic conditions. Specifically, multiple $(v_{o}, v_{d})$ pairs are generated to simulate the spatial distribution of urban traffic, which tends to concentrate in specific areas. The number of samples is determined based on the convergence of path overlap distributions and the stability of segment-level exposure values. A sufficient number of samples is achieved when the exposure distribution no longer changes significantly with additional samples.
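The text does not state a specific stopping rule, so the following sketch shows only one way such a stability check could be expressed: compare the normalized per-segment usage shares before and after adding a batch of samples and measure their L1 difference. The function names, the counts, and the tolerance of 0.05 are all hypothetical.

```python
def usage_shares(counts, edges):
    """Normalize raw traversal counts into shares that sum to 1 over edges."""
    total = sum(counts.get(e, 0) for e in edges)
    if total == 0:
        return [0.0] * len(edges)
    return [counts.get(e, 0) / total for e in edges]

def l1_change(prev_counts, curr_counts, edges):
    """L1 distance between two normalized exposure distributions."""
    p = usage_shares(prev_counts, edges)
    q = usage_shares(curr_counts, edges)
    return sum(abs(a - b) for a, b in zip(p, q))

# Hypothetical cumulative counts after two successive sampling stages.
edges = ["e1", "e2", "e3"]
after_first_batch = {"e1": 50, "e2": 30, "e3": 20}
after_second_batch = {"e1": 102, "e2": 59, "e3": 39}

change = l1_change(after_first_batch, after_second_batch, edges)
converged = change < 0.05  # hypothetical tolerance, not a value from the paper
```

Working with shares rather than raw counts keeps the comparison meaningful as the total number of sampled paths grows.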

Based on the generated normal traffic paths, the exposure for each segment is calculated by counting how many times it appears in those paths. This exposure serves as a relative indicator of how frequently each segment is used under normal traffic flow. It is then combined with crash-based indicators to serve as an input variable in the calculation of road risk.

The simulated path cumulative frequency $X_{e}$ is defined as the number of times link $e$ is traversed across paths $r_{j}$ within the set of normal traffic paths $B$. It serves as an exposure indicator and acts as a surrogate for actual traffic volume, and is expressed as follows:

$$X_{e} = \sum_{j \in B} \mathbb{1}(e \in r_{j}) \tag{5}$$

The simulated cumulative trip frequency reflects typical road-use patterns unrelated to crash occurrences. It is defined at the same segment level as the cumulative crash frequency, allowing a direct comparison between the two metrics.
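A compact end-to-end sketch of the exposure proxy is given below: sample OD node pairs, route each pair, and accumulate the per-edge indicator of Equation (5). For brevity it draws origins and destinations uniformly from the node set, which is a simplifying assumption and not the demand model used in the study; the helper names `route` and `simulate_exposure` and the toy network are likewise hypothetical.

```python
import heapq
import random
from collections import Counter

def route(adj, s, t):
    """Compact Dijkstra; returns a list of (u, v) edges or None if unreachable."""
    dist, prev, heap = {s: 0.0}, {}, [(0.0, s)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        if u == t:
            break
        for v, w in adj[u].items():
            if d + w < dist.get(v, float("inf")):
                dist[v], prev[v] = d + w, u
                heapq.heappush(heap, (d + w, v))
    if t not in dist:
        return None
    edges = []
    while t != s:
        edges.append((prev[t], t))
        t = prev[t]
    return edges[::-1]

def simulate_exposure(adj, n_samples, rng):
    """Return (X, n_valid): per-edge traversal counts and number of routed pairs."""
    nodes = sorted(adj)
    X = Counter()
    n_valid = 0
    for _ in range(n_samples):
        o, d = rng.sample(nodes, 2)   # distinct origin and destination
        path = route(adj, o, d)
        if path is None:
            continue                  # discard OD pairs that cannot be routed
        X.update(set(path))           # indicator: at most 1 per edge per path
        n_valid += 1
    return X, n_valid

# Hypothetical bidirectional chain a-b-c-d with unit edge costs.
adj = {
    "a": {"b": 1.0},
    "b": {"a": 1.0, "c": 1.0},
    "c": {"b": 1.0, "d": 1.0},
    "d": {"c": 1.0},
}
X, n_valid = simulate_exposure(adj, 2000, random.Random(0))
```

Because `X` is built from the same edge keys as a crash-path count, the two quantities can be compared link by link.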

3.4. Relative Risk Definition and High-Risk Road Segment Identification

To assess the crash risk of a road segment, the number of crashes on that segment must be adjusted for the segment’s traffic volume. Typically, the number of crashes increases naturally on heavily trafficked road segments, making it difficult to fairly compare risk levels across road segments based solely on crash frequency. The purpose of this study is not simply to identify segments with a high number of crashes, but to assess relative risk by accounting for segment-level exposure.

To achieve this, cumulative crash frequency is normalized by segment-level exposure under normal traffic conditions, and relative risk is defined as the rate of crash occurrence per unit of exposure. The relative risk, $R_{e}$, of link $e$ is defined as follows, using the crash-based path cumulative frequency, $C_{e}$, and the Monte Carlo-based exposure, $X_{e}$:

$$R_{e} = \frac{C_{e} + \epsilon}{X_{e} + \delta} \tag{6}$$

Here, $\delta$ and $\epsilon$ serve distinct smoothing roles. The term $\delta$ acts as a pseudo-exposure regularizer in the denominator, preventing undefined ratios when $X_{e} = 0$ and suppressing low-exposure inflation by reducing the variability of $R_{e}$ when $X_{e}$ is extremely small. In contrast, $\epsilon$ is an additive pseudo-count on crash frequency, encoding a weak prior for zero-crash segments ($C_{e} = 0$) and avoiding overly extreme rankings driven by single events on rarely traversed segments. Because the same $(\epsilon, \delta)$ are applied uniformly across all segments, the risk score remains comparable across the network. At the same time, extreme outliers induced by near-zero exposure or near-zero crash counts are effectively mitigated.
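A small numerical sketch may help show how the two smoothing terms in Equation (6) behave. The values `eps=0.1` and `delta=5.0` and the toy counts are chosen purely for illustration and are not the settings used in the study.

```python
def relative_risk(C, X, edges, eps=0.1, delta=5.0):
    """Smoothed crash-per-exposure ratio for every edge.

    eps and delta are hypothetical illustration values. Edges missing from
    C or X are treated as having zero crashes or zero simulated traversals.
    """
    return {e: (C.get(e, 0) + eps) / (X.get(e, 0) + delta) for e in edges}

# Hypothetical per-edge crash-path counts and simulated traversal counts.
C = {"e1": 5, "e3": 1, "e4": 50}
X = {"e1": 10, "e4": 1000}
edges = ["e1", "e2", "e3", "e4"]

R = relative_risk(C, X, edges)
ranking = sorted(edges, key=R.get, reverse=True)
```

In this toy case the edge with one crash and no simulated traversals (`e3`) has a finite score and ranks below `e1`, whereas the unsmoothed ratio would be undefined for it.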

The proposed risk metric is conceptually similar to the traditional crash rate, but differs in how exposure is defined. Unlike existing crash rates that rely on traffic volume metrics with limited spatial resolution, such as AADT or VKT, the proposed exposure proxy is derived from path-level aggregation on the same road network representation as the crash paths. This allows for direct segment-level comparison and interpretation, even in environments where segment-level traffic volume data are lacking.

Figure 2 illustrates the proposed road risk assessment concept. The blue line represents the OC path of the crash vehicle reconstructed from the crash report. In contrast, the red line represents the cumulative result of paths generated by simulation under normal traffic conditions. The thickness of each path is proportional to the number of times a road segment is included, allowing intuitive comparison of relative exposure levels across segments. High-risk road segments are defined as those with relatively low exposure under normal traffic conditions but high concentrations of crash-related paths.

To identify high-risk road segments, we use a quantile-based threshold computed from the empirical distribution of risk scores over the entire network. A segment $e$ is labeled as high-risk if

$$H_{e} = \begin{cases} 1, & R_{e} \geq Q_{1-\alpha}(R), \\ 0, & \text{otherwise}, \end{cases} \tag{7}$$

where $H_{e}$ is the high-risk indicator and $Q_{1-\alpha}(R)$ is the empirical $(1-\alpha)$-quantile of $\{R_{e}\}_{e \in E}$. With sorted scores $R_{(1)} \leq \dots \leq R_{(|E|)}$, we set $Q_{1-\alpha}(R) = R_{(\lceil (1-\alpha)|E| \rceil)}$. Here, $\alpha \in (0, 1)$ denotes the selection ratio.
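The thresholding rule in Equation (7) can be written directly from the order-statistic definition of the quantile. In the sketch below, the function name, the rounding guard against floating-point error, and the toy scores are assumptions added for illustration.

```python
import math

def high_risk_segments(R, alpha):
    """Return (H, threshold) using the order-statistic quantile from the text.

    The threshold is the score at 1-indexed rank ceil((1 - alpha) * |E|), and
    every segment with a score at or above it is labeled high-risk.
    """
    scores = sorted(R.values())
    n = len(scores)
    # round() guards against results such as 7.999999999 before taking ceil.
    k = math.ceil(round((1 - alpha) * n, 9))
    k = min(max(k, 1), n)
    threshold = scores[k - 1]
    H = {e for e, r in R.items() if r >= threshold}
    return H, threshold

# Hypothetical risk scores 0.1, 0.2, ..., 1.0 for ten segments.
R = {f"e{i}": i / 10 for i in range(1, 11)}
H, threshold = high_risk_segments(R, alpha=0.2)
```

With ten distinct scores and a selection ratio of 0.2, the rule as written includes the segment at the threshold rank itself, so three segments are returned rather than two; tied scores at the threshold would enlarge the set further.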

This quantile-based thresholding method has the advantage of automatically adjusting the number of high-risk road segments by reflecting the relative distribution characteristics of risk scores, rather than relying on the absolute magnitude of the risk score. This allows for consistent identification of risk sections even when network size or simulation conditions vary.

The proposed framework directly assigns risk levels to discrete road segments actually traversed by crash-related travel trajectories, using them as the basic units of analysis. This allows analysis results to be presented in clearly defined road segment units rather than continuous spatial density or diffuse areas, and can be used immediately to derive priorities for policy interventions such as traffic safety facility placement, road geometry improvement, and speed management.

3.5. Evaluation Metrics

To evaluate the performance of the identified high-risk segments, we use two metrics, namely point-based and path-based hit rates.

The point-based hit rate ($\mathrm{HR}_{po}@\alpha\%$) is commonly used to assess hotspot detection performance. It measures the proportion of crash locations that fall within the identified high-risk segments and is defined as follows:

$$\mathrm{HR}_{po}@\alpha\% = \frac{1}{m} \sum_{k=1}^{m} | p_{a}^{k} \cap H | \tag{8}$$

where $m$ is the total number of crashes in the evaluation dataset, and $| p_{a}^{k} \cap H |$ takes a value of 1 if the $k$-th crash location is included in the high-risk segments, and 0 otherwise. A higher $\mathrm{HR}_{po}@\alpha\%$ indicates that the identified risk segments cover a larger proportion of crash locations.

However, the point-based hit rate has a limitation in that it only considers the inclusion of crash locations. Importantly, selecting larger risk regions tends to yield higher values, as more crash points are likely to be included. As a result, while it reflects the coverage of risk segments, it does not capture how closely those segments align with actual crash-related travel paths.

To address this limitation, we propose the path-based hit rate ($\mathrm{HR}_{pa}@\alpha\%$), which incorporates path-level movement information:

$$\mathrm{HR}_{pa}@\alpha\% = \frac{1}{m} \sum_{k=1}^{m} \frac{| A^{k} \cap H |}{| A^{k} |} \tag{9}$$

where $A^{k}$ denotes the travel path associated with the $k$-th crash and $H$ represents the set of top $\alpha\%$ high-risk segments. Here, $| A^{k} \cap H |$ denotes the number of segments on the $k$-th path that are also included in $H$, and $| A^{k} |$ is the total number of segments in the $k$-th path. A higher value indicates that a larger portion of crash-involved paths overlaps with the identified high-risk segments.
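The two metrics in Equations (8) and (9) can be contrasted on a toy example. The sketch assumes each crash location has already been assigned to a segment identifier and that every path is non-empty; the function names and data are hypothetical.

```python
def point_hit_rate(crash_segments, H):
    """Share of crashes whose own segment lies in the high-risk set H."""
    return sum(1 for s in crash_segments if s in H) / len(crash_segments)

def path_hit_rate(paths, H):
    """Mean, over crashes, of the fraction of path segments that lie in H."""
    total = 0.0
    for path in paths:
        segments = set(path)          # assumes a non-empty path
        total += len(segments & H) / len(segments)
    return total / len(paths)

# Hypothetical high-risk set and three crash paths ending at the crash segment.
H = {"s2", "s3"}
paths = [["s1", "s2", "s3"], ["s4", "s5"], ["s3"]]
crash_segments = [path[-1] for path in paths]

hr_point = point_hit_rate(crash_segments, H)
hr_path = path_hit_rate(paths, H)
```

Here two of the three crash locations fall in `H`, giving a point-based value of 2/3, while the path-based value is (2/3 + 0 + 1)/3 = 5/9 because it also accounts for how much of each approach path is covered.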

4. Experimental Results

4.1. Study Area and Data

This study focuses on Daejeon Metropolitan City, Republic of Korea. The city covers approximately 539.8 km² and has a complex urban structure comprising commercial and business districts, residential areas, research and development complexes, and suburban regions. Administratively, it comprises five districts, with approximately 57% of the population concentrated in Seo-gu and Yuseong-gu, resulting in spatially uneven travel demand.

Daejeon serves as a major transportation hub in Korea, but traffic demand has been rapidly increasing relative to the existing road infrastructure. Previous studies have reported that the city experiences low average travel speeds and severe congestion on major arterial roads [49]. In addition, the number of registered vehicles exceeds 640,000 and continues to increase by more than 13,000 vehicles annually, suggesting that the current road network is insufficient to accommodate growing traffic demand. Notably, congestion during peak commuting hours has become a significant concern.

Furthermore, compared with other metropolitan areas, Daejeon has a relatively underdeveloped ring road system, which limits the effective distribution of traffic. This structural limitation leads to traffic concentration in specific corridors and is closely associated with increased crash risk. A regional study further shows that approximately 52.9% of congested road segments operating at Level of Service (LOS) E or worse are concentrated in major cities in the Chungcheong region, including Daejeon, Cheonan, and Cheongju [50]. This indicates that Daejeon represents a typical case of high traffic congestion and elevated traffic demand within the region.

Accordingly, Daejeon is selected as a representative study area characterized by complex urban structure, rapidly increasing traffic demand, and structural congestion issues.

A total of 1352 crash records collected between 2018 and 2023 from the Korean National Police Agency (KNPA) were used. Each record includes the crash location (latitude and longitude) and narrative descriptions of the crash environment.

4.2. Experimental Setup

To construct both crash-based segment frequency and simulation-based exposure, we generated a road network and the corresponding paths. The road network was extracted from OpenStreetMap using OSMnx v2.1.0 and uniformly discretized into 10 m lixels for aggregation. Shortest paths were computed using Dijkstra’s algorithm in NetworkX v3.6.1 [51] with the OSMnx edge-length attribute as the routing weight. Real-time traffic conditions and travel-time weights were not incorporated in this baseline setting. Crash locations, as well as origin and destination coordinates, were snapped to the nearest network nodes within a 50 m radius, and records exceeding this tolerance were excluded. Detailed implementation settings and parameter values are provided in Appendix A (Table A1) for reproducibility.
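The text does not spell out how an edge is divided into lixels, so the sketch below shows only one plausible reading: cut each edge into consecutive 10 m pieces measured from its start, leaving a shorter final piece when the length is not a multiple of 10 m. The function name and this remainder handling are assumptions.

```python
import math

def split_into_lixels(length_m, lixel_m=10.0):
    """Return (start, end) offsets in metres along a single edge.

    Assumed convention: fixed-length pieces from the edge start, with a
    shorter last piece if length_m is not a multiple of lixel_m.
    """
    if length_m <= 0:
        return []
    n = math.ceil(length_m / lixel_m)
    return [(i * lixel_m, min((i + 1) * lixel_m, length_m)) for i in range(n)]

lixels = split_into_lixels(25.0)
```

An alternative convention would stretch all pieces on an edge to equal length; which of the two the study used is not stated here.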

First, crash-based segment representations were constructed by generating OC paths from the KNPA crash dataset and aggregating them at the lixel level to obtain segment-level crash counts $C_{e}$. In this process, we excluded 59 OC path reconstructions that failed due to routing errors, representing 4.36% of the 1352 attempted OC paths. The remaining 1293 crash records were used in the experiments.

Next, to approximate traffic volume, an exposure proxy was generated using Monte Carlo-based OD paths. Origin and destination points were sampled within Daejeon’s administrative boundary with a bias toward major activity centers, namely Daejeon Station, Dunsan-dong, and Yuseong-gu, using a Gaussian Mixture Model (GMM).

We used one Gaussian component per activity center, resulting in $K = 3$ components. The component means were set to representative coordinates of each center, namely Daejeon Station $(36.3323, 127.4342)$, Dunsan-dong $(36.3510, 127.3849)$, and Yuseong-gu $(36.3622, 127.3560)$. Because reliable prior demand or traffic volume information was unavailable to calibrate center-specific weights, we assigned equal mixture weights of $1/3$ (approximately 0.3333) to all components. We assumed an isotropic covariance structure with a shared standard deviation $σ = 0.010$ (in degrees), which corresponds to approximately $1.11 km$ under the adopted coordinate interpretation.
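A minimal sketch of drawing points from such a mixture is given below, using only the standard library. The centers, weights, and $σ$ are taken from the text. The conversion factor of 111 km per degree, the seed, and the helper `sample_point` are assumptions made for illustration, and the boundary constraint is omitted here.

```python
import random

CENTERS = [
    (36.3323, 127.4342),  # Daejeon Station
    (36.3510, 127.3849),  # Dunsan-dong
    (36.3622, 127.3560),  # Yuseong-gu
]
WEIGHTS = [1 / 3, 1 / 3, 1 / 3]
SIGMA_DEG = 0.010
KM_PER_DEG_LAT = 111.0  # assumption: rough length of one degree of latitude

def sample_point(rng):
    """Draw one (lat, lon) point from the isotropic three-component mixture."""
    lat0, lon0 = rng.choices(CENTERS, weights=WEIGHTS, k=1)[0]
    return rng.gauss(lat0, SIGMA_DEG), rng.gauss(lon0, SIGMA_DEG)

rng = random.Random(0)  # fixed seed, chosen arbitrarily for repeatability
points = [sample_point(rng) for _ in range(1000)]

# 0.010 degrees of latitude expressed in kilometres.
sigma_km = SIGMA_DEG * KM_PER_DEG_LAT
```

Under the assumed conversion factor, `sigma_km` evaluates to about 1.11, matching the figure quoted above.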

OD points were sampled only within Daejeon’s administrative boundary to focus on relative internal network usage, given limited information about external demand. We discarded OD pairs whose Euclidean distance was less than $1 km$, which rejected 0.11% of attempts, corresponding to 110 per 100,000 samples. We also discarded invalid OD routes that failed during routing, accounting for 7.03% of attempts, or 7028 per 100,000 samples.
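The minimum-distance filter can be sketched as follows. This is an illustration under stated assumptions: distances are computed with a simple equirectangular approximation at 111 km per degree rather than in the projected CRS mentioned in Appendix A, and the helpers `planar_km` and `keep_pair` as well as the point `near_station` are hypothetical.

```python
import math

KM_PER_DEG = 111.0  # assumption: rough kilometres per degree of latitude

def planar_km(p, q):
    """Approximate planar distance in km between two (lat, lon) points."""
    mean_lat = math.radians((p[0] + q[0]) / 2)
    dy = (q[0] - p[0]) * KM_PER_DEG
    dx = (q[1] - p[1]) * KM_PER_DEG * math.cos(mean_lat)
    return math.hypot(dx, dy)

def keep_pair(origin, dest, min_km=1.0):
    """Keep an OD pair only if the two points are at least min_km apart."""
    return planar_km(origin, dest) >= min_km

station = (36.3323, 127.4342)       # Daejeon Station (from the text)
dunsan = (36.3510, 127.3849)        # Dunsan-dong (from the text)
near_station = (36.3330, 127.4350)  # hypothetical point close to the station

d_km = planar_km(station, dunsan)

# Rejection shares reported in the text, recomputed from the raw counts.
short_rejection_pct = 100 * 110 / 100_000
invalid_route_pct = 100 * 7028 / 100_000
```

With these inputs the station to Dunsan-dong pair is kept, while the pair formed with `near_station` falls well under the 1 km threshold and would be rejected.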

Figure 3 illustrates both the crash-based segment representation and the Monte Carlo-based exposure proxy. The exposure proxy $X_{e}$ was computed for $| E | = 21,605$ edges in the study area, and its distribution is highly right-skewed. As shown in Table 1, the median (84) is far smaller than the mean (658.49), and upper-tail quantiles increase sharply (e.g., Q90 = 1985.60 and Q99 = 7721.84), indicating that simulated traversals concentrate on a small subset of links. This heavy-tailed pattern supports interpreting $X_{e}$ as a relative route-usage proxy and motivates stability and smoothing analyses to avoid over-ranking low-exposure links in exposure-normalized risk screening.

4.3. Stability Analysis

Because the proposed road risk measure relies on an exposure proxy generated via Monte Carlo simulation, an insufficient simulation size N can lead to unstable risk rankings. In addition, the set of top $α %$ high-risk segments obtained at each N may vary due to ranking fluctuations of segments near the decision boundary.

To evaluate this effect, we use the Jaccard similarity. For two sets A and B, it is defined as

J (A, B) = \frac{| A \cap B |}{| A \cup B |}

(10)

where $0 \leq J (A, B) \leq 1$, and larger values indicate greater similarity between the two sets.

Based on Equation (10), we evaluate both local stability and global convergence.

Local stability measures how consistently high-risk segment sets are preserved as the simulation size increases. Let $S^{(N)}$ denote the set of top $α %$ high-risk segments at simulation size N. The change between successive sets is measured by $J (S^{(N)}, S^{(N + Δ)})$. Among a total of 21,605 segments, $α = 5 %$ corresponds to 1081 high-risk segments. As N increases from $20 k$ to $200 k$ with $Δ = 20 k$, the similarity stays above $0.95$ across all settings and exceeds $0.9945$ once N reaches $140 k$. At the same time, churn, defined as the number of replaced elements between successive sets, decreases from 46 to 4, indicating that the high-risk set becomes increasingly stable as N increases.

Global convergence evaluates how the set $S^{(N)}$ approaches a reference set as N increases. Using $S^{(200 k)}$ as the reference, $J (S^{(N)}, S^{(200 k)})$ increases from $0.9460$ at $20 k$ to $0.9963$ at $180 k$, exceeding $0.99$ after $140 k$. Meanwhile, churn based on the symmetric difference decreases from 60 to 4, confirming that the high-risk set gradually converges as the simulation size increases. Figure 4a summarizes the local stability and global convergence trends across simulation scales.
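The set-based measures can be sketched in a few lines. The example below is illustrative only: it assumes that the top-5% set size is obtained by rounding up, and it builds synthetic sets of that size rather than using the study's data. Under those assumptions, swapping 30 members (or 2 members) between two sets of 1081 elements yields Jaccard and churn values consistent with the reported 0.9460 and 60 (or 0.9963 and 4).

```python
import math

def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def churn(a, b):
    """Size of the symmetric difference between two sets."""
    return len(a ^ b)

n_segments = 21_605
k = math.ceil(0.05 * n_segments)  # assumption: the top-5% size is rounded up

def pair_with_swaps(size, swapped):
    """Two synthetic sets of equal size that differ in `swapped` members."""
    a = set(range(size))
    b = set(range(swapped, size + swapped))
    return a, b

s_early, s_ref = pair_with_swaps(k, 30)
s_late, s_ref_late = pair_with_swaps(k, 2)

j_early, churn_early = jaccard(s_early, s_ref), churn(s_early, s_ref)
j_late, churn_late = jaccard(s_late, s_ref_late), churn(s_late, s_ref_late)
```

For equal-sized sets the symmetric difference is twice the number of swapped members, so a churn of 60 corresponds to 30 replaced segments.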

In addition to the set-based stability analysis, we further examine whether the exposure proxy itself stabilizes as N increases. Set-level overlap measures such as $J (S^{(N)}, S^{(N + Δ)})$ mainly reflect changes near the selection boundary of the top-$α %$ set. They do not necessarily guarantee that the underlying segment-level exposure values or their induced rankings have converged over the entire network.

Let $I^{(N)}$ denote the rank vector over edges induced by the exposure proxy $X^{(N)}$ computed using N simulated OD routes, where rank 1 corresponds to the largest exposure. We quantify rank-level drift between two simulation sizes by the mean absolute rank change $D (\cdot, \cdot)$:

D (I^{(N)}, I^{(N + Δ)}) = \frac{\| I^{(N)} - I^{(N + Δ)} \|_{1}}{| E |},

(11)

where $\| \cdot \|_{1}$ denotes the $ℓ_{1}$-norm and $| E |$ denotes the total number of edges in the road network.
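A small sketch of the mean absolute rank change in Equation (11) follows. The tie-breaking rule (by edge identifier) is an assumption, since the text does not specify one, and the exposure values are made up for illustration.

```python
def ranks(exposure):
    """Rank edges by exposure, rank 1 = largest.

    Ties are broken by edge id, which is an assumed convention.
    """
    order = sorted(exposure, key=lambda e: (-exposure[e], e))
    return {edge: i + 1 for i, edge in enumerate(order)}

def rank_drift(x_a, x_b):
    """Mean absolute rank change between two exposure maps on the same edges."""
    r_a, r_b = ranks(x_a), ranks(x_b)
    return sum(abs(r_a[e] - r_b[e]) for e in r_a) / len(r_a)

# Hypothetical exposure counts at a small and a large simulation size.
x_small = {"e1": 50, "e2": 40, "e3": 5, "e4": 1}
x_large = {"e1": 480, "e2": 510, "e3": 12, "e4": 60}
drift = rank_drift(x_small, x_large)
```

In this toy case every edge moves by exactly one rank position, so the drift is 1.0. Note that only ranks matter: the absolute exposure counts can grow with N without affecting the measure.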

Using Equation (11), the local rank-level drift between adjacent simulation sizes decreases monotonically as N increases. Specifically, $D (I^{(20 k)}, I^{(40 k)})$ is 639.59, and the drift gradually decreases to 102.35 for $D (I^{(180 k)}, I^{(200 k)})$. This trend indicates that the exposure distribution and the ranking it induces become progressively stable as the simulation size increases, and that rank changes in later stages are increasingly localized while preserving the overall ordering structure. We also evaluate global rank-level drift using a reference-based comparison. Taking $N_{ref} = 200 k$ as the reference, $D (I^{(N)}, I^{(200 k)})$ decreases as N increases, confirming a consistent convergence trend in the exposure-induced ranking. Taken together, these results complement the Jaccard-based stability analysis by showing that not only the selected high-risk set but also the underlying exposure proxy rankings converge sufficiently as the simulation size grows. Figure 4b visualizes the local and global rank-level drift trends as a function of N.

Based on these findings, we set the simulation size to $N = 200,000$ to mitigate under-sampling effects and ensure reproducibility.

4.4. Low-Exposure Diagnostics for Sensitivity Analysis

To quantify whether the smoothing parameter $δ$ induces an over-selection of extremely low-exposure segments, we additionally compute the prevalence of low-exposure edges within the selected high-risk set. The risk score is defined as $R_{e} (ϵ, δ) = \frac{C_{e} + ϵ}{X_{e} + δ}$, where $ϵ$ provides additive smoothing in low-count regimes (e.g., $C_{e} = 0$). In this diagnostic, we fix $ϵ = 1$ and vary $δ$ while keeping all other settings unchanged. We define $H^{α} (ϵ, δ)$ as the set of segments selected by the top-$α %$ selection ratio according to $R_{e} (ϵ, δ)$, and we report results at $α = 5 %$ unless stated otherwise.
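The effect of $δ$ on the ranking can be seen in a toy example. The sketch below is a hedged illustration: the edge names, counts, and the rounding rule inside `top_alpha` are assumptions, and only the score formula with $ϵ = 1$ follows the text.

```python
def risk_score(c_e, x_e, eps=1.0, delta=1000.0):
    """Smoothed exposure-normalized risk (C_e + eps) / (X_e + delta).

    With delta = 0 the score is undefined for X_e = 0, so the toy data
    below keeps every exposure positive.
    """
    return (c_e + eps) / (x_e + delta)

def top_alpha(scores, alpha_pct):
    """Ids of the top alpha% edges by score (at least one edge is returned)."""
    k = max(1, round(len(scores) * alpha_pct / 100))
    return set(sorted(scores, key=lambda e: (-scores[e], e))[:k])

# edge id -> (crash-path count C_e, exposure X_e); all values are made up.
edges = {"quiet": (0, 1), "busy": (20, 5000), "mid": (3, 900), "minor": (0, 40)}

def select(delta):
    scores = {e: risk_score(c, x, delta=delta) for e, (c, x) in edges.items()}
    return top_alpha(scores, 25)

unsmoothed = select(0.0)
smoothed = select(1000.0)
```

Without smoothing, the edge with no crashes and almost no exposure ranks first purely because its denominator is tiny. With $δ = 1000$, the edge with many crash-path traversals is selected instead.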

Along with fixed-threshold diagnostics, we define a quantile-based low-exposure prevalence as

p_{\leq q} (τ; ϵ, δ) = \frac{|\{e \in H^{α} (ϵ, δ) : X_{e} \leq Q_{τ} (X)\}|}{|H^{α} (ϵ, δ)|},

(12)

where $Q_{τ} (X)$ denotes the $τ$-quantile of the exposure distribution $\{X_{e}\}_{e \in E}$ over the full edge set E. Thus, $p_{\leq q} (0.05; ϵ, δ)$ represents the fraction of selected edges whose exposure falls within the bottom 5% of the overall exposure distribution, and $p_{\leq q} (0.10; ϵ, δ)$ is defined analogously for the bottom 10%.
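The prevalence in Equation (12) can be computed as sketched below. The quantile convention (linear interpolation between order statistics) is an assumption, because the text does not state which estimator was used, and the exposure values and selected set are synthetic.

```python
def quantile(values, tau):
    """Empirical tau-quantile by linear interpolation (an assumed convention)."""
    xs = sorted(values)
    pos = tau * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])

def low_exposure_prevalence(selected, exposure, tau):
    """Share of selected edges whose exposure is <= Q_tau over all edges."""
    threshold = quantile(exposure.values(), tau)
    low = sum(1 for e in selected if exposure[e] <= threshold)
    return low / len(selected)

# 20 synthetic edges with exposure X_e = 1..20.
exposure = {f"e{i}": i for i in range(1, 21)}
selected = {"e1", "e2", "e15", "e20"}
p10 = low_exposure_prevalence(selected, exposure, 0.10)
```

Here the 10% quantile lies between 2 and 3, so two of the four selected edges count as low-exposure and the prevalence is 0.5.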

Table 2 reports these prevalences for different $δ$ values. When $δ$ is small, the selected set is heavily dominated by extremely low-exposure edges. At $δ = 0$, $94.45 %$ of the selected edges satisfy $X_{e} \leq Q_{0.05} (X)$ and $99.72 %$ satisfy $X_{e} \leq Q_{0.10} (X)$. As $δ$ increases, this dominance is substantially mitigated. At $δ = 500$ and $δ = 1000$, the prevalences drop to $1.85 %$ for the bottom 5% criterion and $2.96 %$ for the bottom 10% criterion. At $δ = 2000$, the low-exposure prevalence becomes nearly negligible, reaching $0.00 %$ for the bottom 5% criterion and $0.37 %$ for the bottom 10% criterion.

These results support the interpretation that $δ$ acts as a pseudo-exposure regularizer that suppresses low-exposure inflation and improves the robustness of the exposure-normalized ranking. We select $δ = 1000$ as a conservative compromise because it already achieves a substantial reduction in low-exposure dominance (comparable to $δ = 2000$ in this diagnostic) while avoiding overly aggressive compression of risk scores among low-to-moderate exposure segments, which can reduce discrimination in the ranking. This choice is consistent with the overall stability results and remains robust under the tested $(ϵ, δ)$ combinations.

4.5. Risk Segmentation Results

Based on our stability analysis and low-exposure sensitivity diagnostics, we fixed the simulation size and smoothing parameters as $N = 200,000$, $ϵ = 1$, and $δ = 1000$. Using these settings, $R_{e}$ was computed, and the top $α %$ high-risk segments $H_{e}$ were identified. The proposed method was compared with NKDE, where bandwidths $h \in \{100 m, 200 m, 1000 m\}$ were applied.

Figure 5 compares the statistical characteristics of the top $α = 5 %$ high-risk segments $H_{e}$ under different NKDE bandwidths and the proposed method, based on 1293 crash records. A risk cluster is defined as a set of spatially connected lixels exceeding a given threshold, forming a continuous road segment.
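The cluster definition amounts to finding connected components among the selected lixels. The following sketch illustrates this with a breadth-first search. The adjacency structure, lixel identifiers, and helper `clusters` are hypothetical; only the 10 m lixel length comes from the text.

```python
from collections import deque

LIXEL_M = 10.0  # lixel length used in the study

def clusters(selected, neighbors):
    """Group selected lixels into connected components via BFS."""
    seen, out = set(), []
    for start in sorted(selected):
        if start in seen:
            continue
        comp, queue = [], deque([start])
        seen.add(start)
        while queue:
            u = queue.popleft()
            comp.append(u)
            for v in neighbors.get(u, ()):
                if v in selected and v not in seen:
                    seen.add(v)
                    queue.append(v)
        out.append(comp)
    return out

# Toy chain of lixels 0-1-2-3-4-5; lixel 3 is not selected and splits the run.
neighbors = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 5] for i in range(6)}
selected = {0, 1, 2, 4, 5}

found = clusters(selected, neighbors)
lengths_m = [len(c) * LIXEL_M for c in found]
mean_length_m = sum(lengths_m) / len(lengths_m)
```

The number of components and the mean of `lengths_m` correspond to the cluster count and average cluster length statistics discussed for Figure 5.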

With a small bandwidth ($h = 100 m$), the number of clusters is large (760), and the average length is short ($266.6 m$), indicating over-fragmentation due to sensitivity to local density variations. In contrast, a large bandwidth ($h = 1000 m$) produces only 59 clusters, with a significantly increased average length ($3288.88 m$) and maximum length ($90,723.57 m$), reflecting over-smoothing as risk values spread over broader regions and adjacent segments merge into large clusters.

For an intermediate bandwidth ($h = 200 m$), the results show a more balanced spatial structure, with 334 clusters and an average length of $568.43 m$. This observation is consistent with prior studies indicating that bandwidth selection critically affects result quality and that the optimal value depends on data characteristics [23,52].

In comparison, the proposed method produces 261 clusters with an average length of $870.62 m$, yielding coherent risk segments without excessive fragmentation or over-smoothing. Since the proposed framework integrates crash-path accumulation with an exposure proxy, it tends to produce stable segment representations that are less sensitive to kernel bandwidth choices. These results suggest that the proposed method provides a consistent network-wide risk segmentation layer suitable for screening and prioritization.

4.6. Comparison with NKDE

We compare the proposed method with NKDE using two evaluation metrics, namely ${HR}_{po} @ α %$ and ${HR}_{pa} @ α %$. The experiments were conducted using a 10-fold cross-validation scheme. In each fold, 90% of the crash records in the KNPA dataset were randomly selected as the source dataset for risk estimation, and the top $α %$ road segments were identified based on the computed risk scores. The remaining 10% of the crash data were used as the evaluation dataset.
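A minimal sketch of the fold construction and of a point-based hit rate is shown below. It assumes a simple shuffled partition and treats a held-out crash as a hit when its segment belongs to the selected set. The seed, segment identifiers, and helper names are hypothetical, and the authors' exact splitting procedure may differ.

```python
import random

def k_fold_indices(n, k=10, seed=0):
    """Shuffle record indices and yield (train, test) index lists for k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]

def point_hit_rate(crash_segments, high_risk):
    """Fraction of held-out crashes whose segment is in the high-risk set."""
    hits = sum(1 for s in crash_segments if s in high_risk)
    return hits / len(crash_segments)

splits = list(k_fold_indices(1293))
sizes = sorted(len(test) for _, test in splits)

# Hypothetical held-out crash segments and selected high-risk set.
hr = point_hit_rate(["s1", "s2", "s9", "s1"], {"s1", "s2"})
```

Because 1293 is not divisible by 10, the held-out folds contain 129 or 130 records each, so the 90/10 split holds only approximately.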

To determine an appropriate NKDE bandwidth empirically, we performed a data-driven tuning experiment on the training folds. We considered a candidate set $H = \{100, 150, \dots, 1000\} m$ and, for each $h \in H$, computed hit-rate curves over $α \in \{1, 2, \dots, 100\} %$ for both ${HR}_{po} @ α %$ and ${HR}_{pa} @ α %$. Each curve was summarized by an AUC-like score:

AUC (HR) = \frac{1}{100} \sum_{α = 1}^{100} HR @ α % .

(13)

The bandwidth was selected by maximizing this criterion on the training folds.
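The tuning criterion in Equation (13) reduces to averaging a hit-rate curve and taking the arg max over bandwidths, as sketched below. The curves are synthetic and expressed in percent, which is an assumption about the scale; only the candidate grid from 100 m to 1000 m in 50 m steps follows the text.

```python
def auc_like(hr_curve):
    """Average of HR@alpha% over alpha = 1..100, as in Equation (13)."""
    if len(hr_curve) != 100:
        raise ValueError("expected one hit rate per alpha in 1..100")
    return sum(hr_curve) / 100

def best_bandwidth(curves):
    """Pick the bandwidth whose hit-rate curve has the largest AUC-like score."""
    return max(curves, key=lambda h: auc_like(curves[h]))

# Candidate bandwidths in metres: 100, 150, ..., 1000.
candidates = list(range(100, 1001, 50))

# Synthetic hit-rate curves in percent for three of the candidates.
curves = {
    100: [min(100.0, 1.5 * a) for a in range(1, 101)],
    200: [min(100.0, 2.0 * a) for a in range(1, 101)],
    1000: [float(a) for a in range(1, 101)],
}
scores = {h: auc_like(c) for h, c in curves.items()}
chosen = best_bandwidth(curves)
```

A curve that rises linearly from 1% to 100% scores 50.5, which gives a sense of scale for the reported values: scores above that level indicate that hits accumulate faster than the selection budget grows.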

Based on $AUC ({HR}_{po})$ as the primary tuning objective, the best-performing NKDE bandwidth was $h_{po}^{*} = 200 m$ with $AUC ({HR}_{po}) = 82.707$. For completeness, we also tuned NKDE using $AUC ({HR}_{pa})$ and obtained its maximum at $h_{pa}^{*} = 100 m$ with $AUC ({HR}_{pa}) = 56.990$.

4.6.1. Quantitative Results (Global Analysis)

In Figure 6a, ${HR}_{po} @ α %$ increases with $α$ for all methods and approaches saturation as the selected high-risk budget becomes large. Under point-based evaluation, NKDE with smaller bandwidths generally achieves higher hit rates at low $α$ and attains the largest overall AUC (Table 3). The proposed method shows comparable performance to NKDE ($h = 100 m$) and the tuned NKDE baseline with $h_{po}^{*} = 200 m$ over $α \in [1 %, 25 %]$, and it consistently outperforms NKDE with a large bandwidth ($h = 1000 m$) in the low-$α$ regime. Overall, these results suggest that the proposed method does not primarily improve point-based hotspot coverage. Instead, it yields a stable, monotonic trend as $α$ increases under a fixed formulation, and it avoids bandwidth-dependent smoothing effects that can substantially alter NKDE results.

In Figure 6b, ${HR}_{pa} @ α %$ increases with $α$ for all methods and approaches 1.0 as the selected high-risk budget becomes large. Under the path-based evaluation, the proposed method consistently achieves higher hit rates than NKDE across low-to-moderate $α$ levels (Table 4). The improvement is most pronounced in the low-$α$ regime, where the proposed method yields markedly better overlap with crash-involved travel paths. For example, at $α = 1 %$ the proposed method achieves 0.067 compared with 0.008–0.015 for NKDE, and at $α = 25 %$ it achieves 0.631 compared with 0.235–0.278 for NKDE. The AUC-like summary score corroborates this trend. The proposed method attains the highest value of 66.003, whereas the tuned NKDE baseline with $h_{pa}^{*} = 100 m$ achieves 55.810. The performance gains become more gradual beyond approximately $α = 25 %$, which is consistent with diminishing marginal overlap as the selected set expands. Although NKDE can achieve higher values at larger $α$ in some settings, such differences primarily reflect broader coverage under a larger selection budget. Overall, these results suggest that incorporating path-level movement context with exposure normalization improves the efficiency of network-wide risk screening in capturing crash-involved traversal patterns.

4.6.2. Case Study (Local Analysis)

To enable an intuitive comparison between density-based and path-based approaches, we select a region that includes both major arterial roads and crash hotspots. The study area corresponds to a $3 km$ radius centered at $(λ, ϕ) = (127.384294, 36.353589)$ in Daejeon.

Figure 7 shows the point-based results. The point-based hit rate ${HR}_{po} @ 5 %$ was computed for 32 crash locations in the evaluation dataset. Red lines denote high-risk segments, and blue points indicate crash locations that fall within them. NKDE achieves $0.6563$ (21 hits) with $h = 100 m$, $0.6563$ (21 hits) with $h = 200 m$, and $0.5938$ (19 hits) with $h = 1000 m$, while the proposed method achieves $0.5313$ (17 hits). These results indicate that NKDE performs well in identifying crash locations based on spatial density when an appropriate bandwidth is selected. For example, $h = 100 m$ attains the highest hit rate (tied with $h = 200 m$) by focusing on localized crash clusters.

However, from a spatial perspective, NKDE tends to spread risk values into surrounding areas via kernel diffusion, leading to over-smoothing and the inclusion of segments that are not directly related to crashes. In contrast, the proposed method leverages path-level information to identify risk segments along major travel corridors, producing a more structurally concentrated distribution.

Consistent with the limitations of point-based evaluation, density-based approaches achieve higher performance under point-based metrics. However, this does not reflect alignment with actual travel paths. We therefore evaluate the results using the path-based hit rate.

Figure 8 presents the path-based evaluation results in terms of ${HR}_{pa} @ 5 %$, comparing NKDE and the proposed method. As summarized in Table 5, NKDE shows limited improvement in the path-based hit rate ${HR}_{pa} @ 5 %$ as the bandwidth increases, ranging from $0.1223$ to $0.1787$. Although larger bandwidths increase the number of selected segments, the gain in overlapping segments remains relatively small, indicating inefficient expansion of risk regions.

In contrast, the proposed method achieves a substantially higher ${HR}_{pa} @ 5 %$ of $0.3622$ while selecting only 3084 segments. Notably, the number of overlapping segments (1117 hits) is comparable to or higher than those of NKDE despite the smaller selection size. This reflects the definition of ${HR}_{pa} @ 5 %$, where the hit rate is determined by the proportion of crash-related path segments that overlap with the selected high-risk segments.
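The reported figures are consistent with reading the case-study hit rate as the number of overlapping segments divided by the number of selected segments. The sketch below checks that arithmetic. This reading is an assumption inferred from the counts 1117 and 3084, and the helper `overlap_ratio` and the toy segment identifiers are hypothetical.

```python
def overlap_ratio(selected, path_segments):
    """Share of selected segments that lie on at least one crash-related path."""
    overlap = selected & path_segments
    return len(overlap) / len(selected)

# Toy example with made-up segment ids: 2 of the 4 selected segments overlap.
toy_ratio = overlap_ratio({1, 2, 3, 4}, {2, 3, 7})

# Counts reported for the proposed method in the case-study area.
reported_ratio = 1117 / 3084
```

Under this reading, a method can raise the ratio either by capturing more path segments or by selecting fewer segments that lie off the paths.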

From this perspective, the proposed method improves both components of the metric, increasing the number of overlapping segments while reducing the number of selected segments, thereby achieving a higher overlap ratio. In contrast, NKDE primarily increases the number of selected segments without a proportional increase in overlap, resulting in lower ${HR}_{pa} @ 5 %$ values.

Furthermore, the higher ratio of overlapping to selected segments indicates that the proposed method more precisely captures segments actually traversed by crash-related paths. In contrast, NKDE tends to include many segments that are not aligned with actual travel paths, due to kernel-based spatial diffusion.

These results indicate that density-based approaches emphasize spatial coverage, whereas path-based approaches better capture alignment with actual travel behavior.

5. Discussion and Conclusions

This study proposes a path-based risk segmentation framework that estimates segment-level road risk by integrating path-derived traversal frequencies with simulation-based exposure. Unlike conventional approaches that treat crashes as independent points, the proposed method interprets crashes within the context of travel paths. It identifies risk patterns that reflect the sequential relationships among road segments.

The results show that the proposed method exhibits distinct characteristics compared with NKDE. While achieving comparable performance in point-based evaluation, it substantially outperforms NKDE in path-based evaluation, aligning more closely with crash-involved travel paths. In addition, whereas NKDE shows substantial variation in the spatial extent and shape of risk segments across bandwidth values, the proposed method produces stable, coherent risk patterns that are less sensitive to bandwidth-driven smoothing.

These results demonstrate the practical utility of the proposed ranking for network-wide screening, while also calling for careful interpretation of the identified high-risk segments.

We emphasize that the proposed framework is intended for screening and prioritization rather than causal attribution. Although the method highlights segments that frequently appear in crash-involved proxy paths after exposure normalization, such overlap should be interpreted as a risk-correlated indicator rather than evidence of a direct causal mechanism. A segment may appear frequently because it lies on an unavoidable route to a genuinely hazardous area, even if the segment itself is not the primary causal factor. Accordingly, the ranked outputs are best used as a first-stage filter to prioritize candidate segments for follow-up validation, including field inspection, geometric review, signal operation assessment, and near-miss analytics.

However, several limitations remain. First, the dataset is limited to 1293 crash records, which may introduce bias if crashes are concentrated in specific regions or road segments. Second, actual vehicle trajectories are not directly observed and are approximated using shortest paths, which may differ from real driving behavior, especially under congestion or user-equilibrium routing. We therefore interpret reconstructed routes as proxy paths derived under a consistent routing rule, approximating structural exposure rather than individual route choice under congestion. Third, exposure is estimated via simulation, which introduces uncertainty that depends on the simulation settings and scale. The exposure proxy $X_{e}$ is not intended to reproduce calibrated absolute volumes such as AADT or VKT and should be interpreted as a relative denominator for normalization. OD points are confined to the study area’s administrative boundary, and external inflows and outflows, including through traffic, are not explicitly modeled.

Future work will focus on improving realism and reducing uncertainty in exposure estimation by incorporating measured traffic volumes and operational data when available. When calibrated traffic counts, OD matrices, or signal timing and control information become available, we can retain the downstream risk formulation and replace $X_{e}$ with a more realistic exposure model, such as an assignment-based or calibrated simulation-based exposure estimate. In addition, the OD prior can be refined by integrating population and employment distributions, point-of-interest density, census-tract variables, or FCD data to calibrate mixture weights and spatial dispersion. Extending the OD generation process to account for external demand will further improve the representativeness of the exposure proxy, particularly for corridors influenced by through traffic. Finally, the generalizability of the proposed framework should be validated across multiple cities with heterogeneous network structures and traffic patterns.

Overall, this study provides a new perspective on crash risk analysis by shifting from point-based to path-based interpretation. The proposed framework enables the identification of high-risk segments at the network level and facilitates traffic safety management and risk-based prioritization of road interventions. We note that the proposed framework is primarily intended for network-wide screening and prioritization in data-limited settings. The identified high-risk candidates should be validated through subsequent engineering review and, where available, richer sensing and operational data.

Author Contributions

Conceptualization, I.S. and Y.L.; methodology, I.S. and Y.L.; software, Y.Y.; formal analysis, Y.Y.; investigation, Y.Y.; resources, I.S.; writing—original draft preparation, Y.Y.; writing—review and editing, I.S. and Y.L.; visualization, Y.Y. and Y.L.; supervision, Y.L.; funding acquisition, I.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. RS-2024-00337489, Development of data drift management technology to overcome performance degradation of AI analysis models).

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author due to legal restrictions.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

AADT	Annual Average Daily Traffic
APIs	Application Programming Interfaces
CAN	Controller Area Network
EB	Empirical Bayes
FCD	Floating Car Data
GMM	Gaussian Mixture Model
GPS	Global Positioning System
KDE	Kernel Density Estimation
KNPA	Korean National Police Agency
LOS	Level of Service
NKDE	Network Kernel Density Estimation
OBD	On-Board Diagnostics
OC	Origin–Crash
OD	Origin–Destination
SPF	Safety Performance Function
VKT	Vehicle Kilometers Traveled
WHO	World Health Organization

Appendix A. Implementation Details for Reproducibility

This appendix summarizes the implementation choices required to reproduce the construction of the crash-based segment count $C_{e}$ and the simulation-based exposure proxy $X_{e}$. Unless otherwise stated, the same settings are used throughout all experiments; the full parameter list is provided in Table A1.

Table A1. Key implementation details and parameters for reproducibility.

Component	Parameter/Setting	Value Used in This Study (and Notes)
Road network (OSMnx)	Network type	`drive` (drivable roads only).
	Directionality/one-way	Directed graph; one-way restrictions preserved via OSM tags.
	Graph simplification	`simplify = True`.
	Disconnected components	`retain_all = False` (default; largest component retained).
	Distance weight attribute	`length` (meters) computed by OSMnx.
	Time weight	`travel_time` (seconds) derived from OSM speed estimates.
Snapping/map matching	Snap target	Nearest network node (OSMnx `nearest_nodes`).
	Distance metric	Euclidean distance in a projected CRS (recommended for reproducibility).
	Maximum tolerance	$d_{max} = 50 m$ ; records beyond $d_{max}$ are excluded.
	Failure handling	Unmatched crash/origin/destination points are dropped.
	Coordinate handling	WGS84 inputs projected to a planar CRS for distance computations.
Lixelization/segmentation	Lixel length	$10 m$ lixels for length normalization and aggregation.
	Segmentation rule	No additional segmentation thresholds; uniform $10 m$ lixelization only.
	Branching definition	Not used (no branching-point-based splitting beyond OSMnx default graph representation).
	Curvature threshold	Not used (no explicit curvature/turn-angle threshold $θ_{min}$ applied).
	Node insertion rule	Split edges uniformly into $10 m$ lixels; no additional node insertion rules.
Path reconstruction (NetworkX)	Algorithm	Dijkstra shortest path.
	Baseline routing weight	`weight = length` (distance-minimizing).
	Sensitivity routing weight	`weight = travel_time` (time-minimizing; OSM speed-based).
	Multi-edge handling	Directed MultiDiGraph supported; routing on directed network.
	Route failure handling	OD pairs with no feasible path are excluded.
	Interpretation	Proxy routing under a consistent rule (structural exposure, not user-equilibrium).
Exposure proxy (Monte Carlo)	OD sampling model	GMM prior over activity centers.
	Number of components	$K = 3$ (Daejeon Station, Dunsan-dong, Yuseong-gu).
	Means $(μ_{k})$	$μ_{1} = (36.3323, 127.4342)$ , $μ_{2} = (36.3510, 127.3849)$ , $μ_{3} = (36.3622, 127.3560)$ .
	Mixture weights $(π_{k})$	$π_{1} = π_{2} = π_{3} = 1 / 3$ .
	Covariance $(Σ_{k})$	Isotropic: $Σ_{k} = diag (σ^{2}, σ^{2})$ ; $σ = 0.010$ .
	Boundary constraint	OD points restricted to the Daejeon boundary via rejection sampling.
	Minimum OD distance	$∥ p_{o} - p_{d} ∥ \geq 1 km$ (Euclidean).
Counts & risk	Crash-path count	$C_{e}$ : number of crash-involved reconstructed paths traversing segment e.
	Exposure proxy count	$X_{e}$ : number of simulated OD routes traversing segment e.
	Risk score	$R_{e} = (C_{e} + ϵ) / (X_{e} + δ)$ with $ϵ = 1$ , $δ = 1000$ .
	High-risk set	Top- $α %$ selection ratio by $R_{e}$ ; deterministic tie-break at the boundary.
Filtering statistics	Rejected OD samples (<1 km)	0.11% (110/100,000)
	Invalid OD routes	7.03% (7028/100,000)
	Final simulation size	$N = 200,000$ valid Origin-Destination routes.

References

Ahmed, S.; Hossain, M.A.; Ray, S.K.; Bhuiyan, M.M.I.; Sabuj, S.R. A study on road accident prediction and contributing factors using explainable machine learning models: Analysis and performance. Transp. Res. Interdiscip. Perspect. 2023, 19, 100814.
Cañaveras Perea, R.M.; Tejada Ponce, Á.; Sánchez González, M.P. How to prevent 3 million deaths worldwide: A systematic review of occupational accident research—A factor-and cost-based approach. Eur. J. Public Health 2025, 35, 91–100.
Li, M.; Li, Z.; Xu, C.; Liu, T. Short-term prediction of safety and operation impacts of lane changes in oscillations with empirical vehicle trajectories. Accid. Anal. Prev. 2020, 135, 105345.
Bougna, T.; Hundal, G.; Taniform, P. Quantitative analysis of the social costs of road traffic crashes literature. Accid. Anal. Prev. 2022, 165, 106282.
Tandrayen-Ragoobur, V. The economic burden of road traffic accidents and injuries: A small island perspective. Int. J. Transp. Sci. Technol. 2025, 17, 109–119.
Bonera, M.; Barabino, B.; Yannis, G.; Maternini, G. Network-wide road crash risk screening: A new framework. Accid. Anal. Prev. 2024, 199, 107502.
Alkaabi, K. Identification of hotspot areas for traffic accidents and analyzing drivers’ behaviors and road accidents. Transp. Res. Interdiscip. Perspect. 2023, 22, 100929.
Khattak, M.W.; De Backer, H.; De Winne, P.; Brijs, T.; Pirdavani, A. Comparative evaluation of crash hotspot identification methods: Empirical Bayes vs. potential for safety improvement using variants of negative binomial models. Sustainability 2024, 16, 1537.
Mhetre, K.V.; Thube, A.D. Road safety, crash hot-spot, and crash cold-spot identification on a rural national highway in Maharashtra, India. Mater. Today Proc. 2023, 77, 780–787.
Wan, Y.; He, W.; Zhou, J. Urban road accident black spot identification and classification approach: A novel grey verhuls–Empirical bayesian combination method. Sustainability 2021, 13, 11198.
Ghadi, M.; Török, Á. A comparative analysis of black spot identification methods and road accident segmentation methods. Accid. Anal. Prev. 2019, 128, 1–7.
Lord, D.; Mannering, F. The statistical analysis of crash-frequency data: A review and assessment of methodological alternatives. Transp. Res. Part A Policy Pract. 2010, 44, 291–305.
Mendes, O.B.B.; Larocca, A.P.C.; Rodrigues Silva, K.; Pirdavani, A. Assessing the Performance of Highway Safety Manual (HSM) Predictive Models for Brazilian Multilane Highways. Sustainability 2023, 15, 10474.
Montella, A. A comparative analysis of hotspot identification methods. Accid. Anal. Prev. 2010, 42, 571–581.
Li, H.; Graham, D.J.; Ding, H.; Ren, G. Comparison of empirical Bayes and propensity score methods for road safety evaluation: A simulation study. Accid. Anal. Prev. 2019, 129, 148–155.
Hauer, E. Empirical Bayes approach to the estimation of “unsafety”: The multivariate regression method. Accid. Anal. Prev. 1992, 24, 457–477.
Zarei, M.; Hellinga, B.; Izadpanah, P. CGAN-EB: A non-parametric empirical Bayes method for crash frequency modeling using conditional generative adversarial networks as safety performance functions. Int. J. Transp. Sci. Technol. 2023, 12, 753–764.
Cui, H.; Dong, J.; Zhu, M.; Li, X.; Wang, Q. Identifying accident black spots based on the accident spacing distribution. J. Traffic Transp. Eng. Engl. Ed. 2022, 9, 1017–1026.
Mahmoud, N.; Abdel-Aty, M.; Cai, Q.; Zheng, O. Vulnerable road users’ crash hotspot identification on multi-lane arterial roads using estimated exposure and considering context classification. Accid. Anal. Prev. 2021, 159, 106294.
Chen, Y.C. A tutorial on kernel density estimation and recent advances. Biostat. Epidemiol. 2017, 1, 161–187.
Thakali, L.; Kwon, T.J.; Fu, L. Identification of crash hotspots using kernel density estimation and kriging methods: A comparison. J. Mod. Transp. 2015, 23, 93–106.
Xie, Z.; Yan, J. Detecting traffic accident clusters with network kernel density estimation and local spatial statistics: An integrated approach. J. Transp. Geogr. 2013, 31, 64–71.
Xie, Z.; Yan, J. Kernel density estimation of traffic accidents in a network space. Comput. Environ. Urban Syst. 2008, 32, 396–406.
Maestroni, D.; Cappelli, G.; Gagliardi, V.; Nardoianni, S.; Mahabadi, P.H.; Tika, T.P.; Caushaj, N.; Bella, F.; D’Apuzzo, M.; Misso, F.E. Mapping Risk Factors to Build Inclusive Roads: A Systematic Diagnosis for Enhancing Vulnerable Users and Persons with Reduced Mobility Safety. In Proceedings of the International Conference on Computational Science and Its Applications; Springer: Berlin/Heidelberg, Germany, 2025; pp. 87–104.
Grigonis, V.; Plačiakis, J. A Methodological Approach to Identifying Unsafe Intersections for Micromobility Users: A Case Study of Vilnius. Sustainability 2025, 17, 11053.
Bianchi, T.; Brighente, A.; Conti, M.; Valori, A. Your Car Tells Me Where You Drove: A Novel Path Inference Attack via CAN Bus and OBD-II Data. In Proceedings of the 2025 IEEE 10th European Symposium on Security and Privacy (EuroS&P), Venice, Italy, 30 June–4 July 2025; pp. 113–132.
Jain, A.; Kumar, R. Driving behavior analysis and classification by vehicle OBD data using machine learning. J. Supercomput. 2023, 79, 18800–18819.
Zygouras, N.; Panagiotou, N.; Li, Y.; Gunopulos, D.; Guibas, L. HTTE: A hybrid technique for travel time estimation in sparse data environments. In Proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems; Association for Computing Machinery: New York, NY, USA, 2019; pp. 99–108.
Cho, M.; Park, J.; Kim, S.; Lee, Y. Estimation of Driving Direction of Traffic Accident Vehicles for Improving Traffic Safety. Appl. Sci. 2023, 13, 7710.
Zhang, H.; Shang, Y. Analyzing road traffic crashes through multidisciplinary video data approaches. Front. Public Health 2025, 13, 1614017.
Dai, R.; Xu, S.; Gu, Q.; Ji, C.; Liu, K. Hybrid spatio-temporal graph convolutional network: Improving traffic prediction with navigation data. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining; Association for Computing Machinery: New York, NY, USA, 2020; pp. 3074–3082.
Jayasinghe, A.; Sano, K.; Abenayake, C.C.; Mahanama, P. A novel approach to model traffic on road segments of large-scale urban road networks. MethodsX 2019, 6, 1147–1163.
Dey, S.; Tomko, M.; Winter, S. Map-matching error identification in the absence of ground truth. ISPRS Int. J. Geo-Inf. 2022, 11, 538.
Kan, Z.; Liu, D.; Yang, X.; Lee, J. Measuring exposure and contribution of different types of activity travels to traffic congestion using GPS trajectory data. J. Transp. Geogr. 2024, 117, 103896. [Google Scholar] [CrossRef]
Ko, Y.G.; Jo, K.C.; Lee, J.S.; Yu, J.S. Vehicle Collision Frequency Prediction Using Traffic Accident and Traffic Volume Data with a Deep Neural Network. Appl. Sci. 2025, 15, 9884. [Google Scholar] [CrossRef]
Skaug, L.; Nojoumian, M.; Dang, N.; Yap, A. Road Crash Analysis and Modeling: A Systematic Review of Methods, Data, and Emerging Technologies. Appl. Sci. 2025, 15, 7115. [Google Scholar] [CrossRef]
Mimi, M.S.; Das, S.; Dutta, A.K. Non-spatial AI modeling to estimate traffic volume measures on local roadways. Int. J. Urban Sci. 2026, 1–30. [Google Scholar] [CrossRef]
Sfyridis, A.; Agnolucci, P. Annual average daily traffic estimation in England and Wales: An application of clustering and regression modelling. J. Transp. Geogr. 2020, 83, 102658. [Google Scholar] [CrossRef]
Pulugurtha, S.S.; Mathew, S. Modeling AADT on local functionally classified roads using land use, road density, and nearest nonlocal road data. J. Transp. Geogr. 2021, 93, 103071. [Google Scholar] [CrossRef]
Das, S.; Tsapakis, I. Interpretable machine learning approach in estimating traffic volume on low-volume roadways. Int. J. Transp. Sci. Technol. 2020, 9, 76–88. [Google Scholar] [CrossRef]
Jayasinghe, A.; Sano, K. Estimation of annual average daily traffic on road segments: Network centrality-based method for metropolitan areas. In Proceedings of the Transportation Research Board Annual Meeting Compendium of Papers, Washington, DC, USA, 8–12 January 2017. Number 17-03141. [Google Scholar]
Ma, L.; Al-Shukairi, R.; Stettler, M.; Graham, D. Estimating Annual Average Daily Traffic on Local Roads: Integrating Spatial Insights with Machine Learning 2025. Available online: https://www.researchsquare.com/article/rs-7189895/v1 (accessed on 8 May 2026).
Simini, F.; González, M.C.; Maritan, A.; Barabási, A.L. A universal model for mobility and migration patterns. Nature 2012, 484, 96–100. [Google Scholar] [CrossRef]
Mungthanya, W.; Phithakkitnukoon, S.; Demissie, M.G.; Kattan, L.; Veloso, M.; Bento, C.; Ratti, C. Constructing time-dependent origin-destination matrices with adaptive zoning scheme and measuring their similarities with taxi trajectory data. IEEE Access 2019, 7, 77723–77737. [Google Scholar] [CrossRef]
Afandizadeh Zargari, S.; Memarnejad, A.; Mirzahossein, H. Hourly origin–destination matrix estimation using intelligent transportation systems data and deep learning. Sensors 2021, 21, 7080. [Google Scholar] [CrossRef] [PubMed]
Green, E.R.; Agent, K.R. Evaluation of the Accuracy of GPS Coordinates Used on Traffic Collision Reporting Forms. In Kentucky Transportation Center Research Report; Technical Report; University of Kentucky: Lexington, KY, USA, 2004. [Google Scholar]
Pung, J.; D’Souza, R.M.; Ghosal, D.; Zhang, M. A road network simplification algorithm that preserves topological properties. Appl. Netw. Sci. 2022, 7, 79. [Google Scholar] [CrossRef]
Wang, L.; Wang, G.; Luo, X.; Wang, L.; Yu, W.; Zhang, Z.; Gao, H. Contour-based instance segmentation method of road scene. Sci. Rep. 2025, 15, 33692. [Google Scholar] [CrossRef] [PubMed]
Kim, M.; Kim, Y.; Kim, H.; Joe, S. Expansion measure of a Beltway to solve traffic jam in Daejeon. In The 77th Conference of Korean Society of Transportation; Korean Society of Transportation: Seoul, Republic of Korea, 2017; pp. 295–298. [Google Scholar]
Kang, M.J.; Oh, J.T.; Park, J.S. A study of main-road analysis for efficient road management: Focusing on the Chungcheong area. J. Korea Inst. Intell. Transp. Syst. 2021, 20, 132–145. [Google Scholar] [CrossRef]
Dijkstra, E.W. A note on two problems in connexion with graphs. In Edsger Wybe Dijkstra: His Life, Work, and Legacy; Association for Computing Machinery: New York, NY, USA, 2022; pp. 287–290. [Google Scholar]
Botev, Z.I.; Grotowski, J.F.; Kroese, D.P. Kernel density estimation via diffusion. Ann. Stat. 2010, 38, 2916–2957. [Google Scholar] [CrossRef]

Figure 1. Road segment risk assessment framework based on crash-derived segment frequency and Monte Carlo-based exposure proxy.

Figure 2. Illustration of path-based road segment risk assessment.

Figure 3. Comparison between crash-based segment frequency and Monte Carlo-based exposure proxy: (a) crash-based segment representation; (b) Monte Carlo-based exposure.

Figure 4. Stability across simulation scale N for the top 5% high-risk selection: (a) set-level Jaccard similarity; (b) mean absolute rank change.
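The two stability measures named in the Figure 4 caption can be illustrated with a minimal sketch. It assumes that each simulation scale yields one risk score per segment, that the high-risk set is the top fraction of segments by score, and that the rank change is averaged over a caller-supplied set of segments. The function names, segment ids, and scores below are hypothetical and are not taken from the paper.

```python
def top_fraction(scores, fraction=0.05):
    """Return the ids of the highest-scoring segments (at least one)."""
    k = max(1, int(len(scores) * fraction))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])


def jaccard(a, b):
    """Set-level Jaccard similarity |A ∩ B| / |A ∪ B|."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def ranks(scores):
    """Map each segment id to its 1-based rank (1 = highest score)."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return {seg: i + 1 for i, seg in enumerate(ranked)}


def mean_abs_rank_change(scores_a, scores_b, segments):
    """Average |rank_a - rank_b| over the given segments."""
    ra, rb = ranks(scores_a), ranks(scores_b)
    return sum(abs(ra[s] - rb[s]) for s in segments) / len(segments)


# Hypothetical risk scores from two simulation scales (toy data).
risk_small_n = {"e1": 0.90, "e2": 0.80, "e3": 0.50, "e4": 0.40,
                "e5": 0.30, "e6": 0.20, "e7": 0.10, "e8": 0.05}
risk_large_n = {"e1": 0.85, "e2": 0.45, "e3": 0.80, "e4": 0.40,
                "e5": 0.30, "e6": 0.20, "e7": 0.10, "e8": 0.05}

# A 25% cut is used here only so that the toy sets hold two segments each.
top_small = top_fraction(risk_small_n, fraction=0.25)
top_large = top_fraction(risk_large_n, fraction=0.25)

similarity = jaccard(top_small, top_large)
rank_shift = mean_abs_rank_change(risk_small_n, risk_large_n, top_small)
print(similarity, rank_shift)
```

A Jaccard value close to 1 together with a small rank shift would indicate that the selected set has stabilized with respect to N.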

Figure 5. Spatial distribution of top 5% risk segments across methods. The selected segments are shown in red: (a) NKDE ($h = 100$ m); (b) NKDE ($h = 200$ m); (c) NKDE ($h = 1000$ m); (d) proposed method.

Figure 6. Point- and path-based hit rates across $α$: (a) $\mathrm{HR}_{po}@α\%$; (b) $\mathrm{HR}_{pa}@α\%$.

Figure 7. Comparison of spatial risk distributions (top 5%) using point-based evaluation: (a) NKDE ($h = 200$ m); (b) proposed method.

Figure 8. Comparison of spatial risk distributions (top 5%) using path-based evaluation: (a) NKDE ($h = 100$ m); (b) proposed method.

Table 1. Summary statistics of the exposure proxy $X_{e}$ over edges in the study area.

Statistic	Value
Number of edges ($|E|$)	21,605
Min	0
Max	16,922
Mean	658.49
Std. dev.	1532.90
Median (50%)	84
25% quantile	19
75% quantile	483
90% quantile	1985.60
95% quantile	3371
99% quantile	7721.84
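In Table 1 the mean (658.49) lies far above the median (84), which points to a strongly right-skewed exposure proxy. The sketch below shows one way such a summary could be computed from per-edge traversal counts. The counts are toy values, the function names are hypothetical, and both the linearly interpolated quantile and the sample standard deviation are assumptions, since the convention used for Table 1 is not stated here.

```python
import statistics


def quantile(sorted_values, q):
    """Linearly interpolated quantile of an ascending list, with 0 <= q <= 1."""
    pos = q * (len(sorted_values) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    span = sorted_values[upper] - sorted_values[lower]
    return sorted_values[lower] + (pos - lower) * span


def summarize(exposure):
    """Summary statistics in the layout of Table 1."""
    values = sorted(exposure)
    summary = {
        "n_edges": len(values),
        "min": values[0],
        "max": values[-1],
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),  # sample std. dev. (assumption)
    }
    for p in (25, 50, 75, 90, 95, 99):
        summary[f"q{p}"] = quantile(values, p / 100)
    return summary


# Hypothetical per-edge traversal counts (not the study data).
toy_counts = [0, 3, 5, 8, 12, 20, 45, 400]
summary = summarize(toy_counts)
print(summary)
```

As in Table 1, a single heavily traversed edge pulls the mean well above the median in this toy example.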

Table 2. Low-exposure prevalence within the selected high-risk set under different $δ$ values.

Prevalence	$δ = 0$	$δ = 10$	$δ = 50$	$δ = 100$	$δ = 500$	$δ = 1000$	$δ = 2000$
$p_{\leq q}(0.05; ϵ, δ)$	0.9445	0.6707	0.1230	0.1230	0.0185	0.0185	0.0000
$p_{\leq q}(0.10; ϵ, δ)$	0.9972	0.7493	0.2017	0.2017	0.0296	0.0296	0.0037

Table 3. Point-based hit rate $\mathrm{HR}_{po}@α\%$ and cumulative AUC under different NKDE bandwidths and the proposed method.

$α$ (%)	NKDE ($h_{po} = 100$ m)		NKDE ($h_{po}^{*} = 200$ m)		NKDE ($h_{po} = 1000$ m)		Proposed Method
$α$ (%)	$\mathrm{HR}_{po}@α\%$	AUC	$\mathrm{HR}_{po}@α\%$	AUC	$\mathrm{HR}_{po}@α\%$	AUC	$\mathrm{HR}_{po}@α\%$	AUC
1	0.137	0.137	0.124	0.124	0.076	0.076	0.084	0.084
5	0.378	1.345	0.318	1.206	0.230	0.768	0.311	0.977
10	0.556	3.795	0.490	3.431	0.401	2.415	0.468	3.023
25	0.680	13.501	0.717	13.209	0.656	10.589	0.711	12.393
50	0.998	34.354	0.999	33.847	0.924	30.778	0.843	31.771
75	1.000	59.350	1.000	58.843	0.989	54.995	0.959	54.574
100	1.000	84.350	1.000	83.843	1.000	79.979	1.000	79.186

Table 4. Path-based hit rate $\mathrm{HR}_{pa}@α\%$ and cumulative AUC under different NKDE bandwidths and the proposed method.

$α$ (%)	NKDE ($h_{pa}^{*} = 100$ m)		NKDE ($h_{pa} = 200$ m)		NKDE ($h_{pa} = 1000$ m)		Proposed Method
$α$ (%)	$\mathrm{HR}_{pa}@α\%$	AUC	$\mathrm{HR}_{pa}@α\%$	AUC	$\mathrm{HR}_{pa}@α\%$	AUC	$\mathrm{HR}_{pa}@α\%$	AUC
1	0.009	0.009	0.008	0.008	0.015	0.015	0.067	0.067
5	0.047	0.137	0.046	0.133	0.048	0.165	0.229	0.767
10	0.094	0.510	0.090	0.493	0.089	0.527	0.359	2.364
25	0.278	3.242	0.253	3.086	0.235	3.014	0.631	10.176
50	0.604	15.009	0.594	14.669	0.515	12.566	0.667	26.260
75	0.826	33.309	0.826	32.887	0.736	28.401	0.773	44.171
100	1.000	55.810	1.000	55.523	1.000	50.183	1.000	66.003
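The AUC columns in Tables 3 and 4 appear to be running totals of the hit rate over integer values of $α$. At $α = 1$ the AUC equals the hit rate, and in Table 3 the NKDE (100 m) AUC grows by exactly 25.000 between $α = 75$ and $α = 100$ while its hit rate stays at 1.000. The sketch below assumes that reading, that is, a cumulative sum of the hit rate at each percent from 1 to $α$. The definition in Section 3.5 remains authoritative, and the function name and hit-rate values are hypothetical.

```python
from itertools import accumulate


def cumulative_auc(hit_rates):
    """Running sum of hit rates, where hit_rates[i] is the rate at alpha = i + 1 percent."""
    return list(accumulate(hit_rates))


# Hypothetical hit-rate curve sampled at alpha = 1, 2, 3, 4, 5 percent.
toy_hr = [0.10, 0.20, 0.25, 0.30, 0.40]
toy_auc = cumulative_auc(toy_hr)

for alpha, (hr, auc) in enumerate(zip(toy_hr, toy_auc), start=1):
    print(f"alpha={alpha}%  HR={hr:.3f}  AUC={auc:.3f}")
```

Under this reading, a method that captures crashes earlier (at smaller $α$) accumulates a larger AUC even when all methods reach a hit rate of 1.000 at $α = 100$.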

Table 5. Path-based performance at top 5%: comparison of selected and overlapping segments.

Method	Selected Segments	Overlapping Segments	$\mathrm{HR}_{pa}@5\%$
NKDE ($h_{pa}^{*} = 100$ m)	4777	584	0.1223
Proposed Method	3084	1117	0.3622
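The hit rates reported in Table 5 are consistent with the ratio of overlapping segments to selected segments. The short check below reproduces both values from the counts in the table. Reading the metric as this ratio is an inference from the numbers rather than a restatement of the formal definition, and the function name is hypothetical.

```python
def hit_rate(overlapping, selected):
    """Share of selected segments that overlap the reference paths."""
    return overlapping / selected


# (selected segments, overlapping segments) as listed in Table 5.
table5 = {
    "NKDE (h = 100 m)": (4777, 584),
    "Proposed method": (3084, 1117),
}

rates = {
    method: round(hit_rate(overlapping, selected), 4)
    for method, (selected, overlapping) in table5.items()
}
print(rates)
```

The proposed method selects fewer segments (3084 versus 4777) yet overlaps more of them (1117 versus 584), which is what drives the higher ratio.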

Share and Cite

MDPI and ACS Style

Yoon, Y.; Shin, I.; Lee, Y. Path-Based Risk Segmentation of Road Networks with Exposure Modeling. Electronics 2026, 15, 2069. https://doi.org/10.3390/electronics15102069
