Prediction of Rear-End Collision Risk in Urban Expressway Diverging Areas Under Rainy Weather Conditions

Xia, Xiaomei; Zhang, Tianyi; Yao, Jiao; Wang, Pujie; Zhu, Chenke; Zhu, Chenqiang

doi:10.3390/systems14010056


by Xiaomei Xia ¹, Tianyi Zhang ¹, Jiao Yao ¹, Pujie Wang ¹, Chenke Zhu ¹ and Chenqiang Zhu ¹,²,*

¹ Business School, University of Shanghai for Science and Technology, Shanghai 200093, China

² Laboratory of Computation and Analytics of Complex Management Systems (CACMS), Tianjin University, Tianjin 300072, China

* Author to whom correspondence should be addressed.

Systems 2026, 14(1), 56; https://doi.org/10.3390/systems14010056

Submission received: 21 November 2025 / Revised: 23 December 2025 / Accepted: 3 January 2026 / Published: 6 January 2026


Abstract

To mitigate the frequent occurrence of rear-end collisions on urban expressways under rainy weather conditions, firstly, accident risk levels were classified using traffic conflict indicators. Secondly, three machine learning models were employed to predict the accident severity across different scenarios. Furthermore, key influencing factors of rear-end collisions were identified and analyzed based on SHAP values. Case studies were conducted by simulating vehicle trajectory data under light, moderate, and heavy rain scenarios, using an open urban expressway dataset and car-following parameters for rainy conditions. Next, the Modified Time-to-Collision (MTTC) metric was calculated. Risk thresholds for low-, medium-, and high-risk levels were established for each rainfall category using percentile-based cumulative distribution analysis. Finally, real-time risk prediction under the three rainfall scenarios was conducted using XGBoost, LightGBM, and Random Forest models. The model performances were evaluated in terms of accuracy, recall, precision, and AUC. Overall, the study finds that the LightGBM model achieves the highest predictive capability, with AUC values exceeding 0.78 under all weather conditions. Moreover, the study concludes that factors ranked by SHAP values reveal that the minimum distance has the greatest influence in light rain scenarios. As rainfall intensity increases, the influences of minimum headway time and average vehicle speed are found to grow, highlighting an interaction pattern characterized by “speed-distance-flow” coupling.

Keywords:

traffic engineering; rear-end collision; risk prediction model; rainy weather condition; diverging areas of urban expressway; Modified Time-to-Collision (MTTC) metric

1. Introduction

With the continuous advancement of urbanization, expressways, as fast corridors connecting urban areas, have become increasingly significant. The merging and diverging zones along these expressways have emerged as critical areas for traffic flow transition between regions, directly affecting the operational efficiency and safety levels of urban transportation.

According to the 2020 traffic accident statistics in China [1], the number of traffic accidents on urban expressways was approximately 244,670, with rear-end collisions accounting for nearly 40%. The weaving zones of urban expressways are prone to rear-end collisions due to high traffic density, frequent lane changing, and uneven vehicle speeds. In merging areas, the probability of rear-end collisions is relatively low due to smaller speed variations and higher average speeds. In contrast, diverging areas are more prone to rear-end collisions, as drivers are more susceptible to route selection interference, leading to significant speed fluctuations.

On the other hand, rainy weather conditions contribute to slippery road surfaces, lower road friction coefficients, reduced visibility, water film effects, and changes in driver behavior, all of which further increase the probability of rear-end collisions. Therefore, it is particularly necessary to conduct risk prediction for rear-end collisions in diverging areas of expressways under rainy weather conditions.

Currently, research on rear-end collisions in diverging areas of expressways, both domestically and internationally, primarily focuses on rear-end collision risk identification and prediction. Rear-end collision risk identification aims to extract dynamic or static key indicators that influence rear-end collisions from complex traffic systems through data mining and feature analysis methods, revealing the underlying mechanisms of accidents. In contrast, rear-end collision risk prediction focuses on developing statistical models and algorithms to assess the probability and severity of accidents based on historical data and real-time information, enabling dynamic early warning. The following provides a detailed explanation from these two aspects.

Regarding research on rear-end collision risk identification, Chinese scholars have primarily utilized relevant theories such as the CHAID tree model [2], genetic algorithms [3], PC-Crash simulation [4], and binary Logit model [5] to assess the factors influencing rear-end collisions, aiming to improve traffic safety on various road segments. Liu Benmin et al. [6], based on rear-end collision data from the United States, utilized the SVM model to analyze chain-reaction rear-end collisions. They identified key factors such as the leading vehicle’s motion state, road speed limits, season, and the number of lanes, with higher speed limits (above 80 km/h), summer conditions, and multi-lane roads being more likely to lead to chain-reaction collisions. Zou, R. et al. [7] employed a mixed Logit model to examine the determinants of driver injury severity in rear-end collisions between passenger cars and pickup trucks on urban roadways. The analysis indicated that, across different collision configurations, alcohol involvement, roadway curvature, gender, and failure to wear seat belts were significantly associated with injury severity. Yuan Renteng et al. [8], based on the CROPHM model, analyzed the causal differences in the severity of rear-end collisions on highways between day and night. They found that accidents occurring during the early morning hours, involving multiple vehicles, or large vehicles significantly increased the probability of fatal accidents, and proposed targeted improvement recommendations. Ahmadi, A. et al. [9] integrated 5 years of data from California and employed Multinomial Logit, Mixed Multinomial Logit, and SVM models to analyze the severity and influencing factors of rear-end collisions. They clarified that the SVM model slightly outperformed the others in terms of predictive performance, providing important insights for improving driver safety education, as well as vehicle and road design. Qi, Y. et al. [10] analyzed the frequency and severity of rear-end collisions in construction zones using truncated count models and ordered Probit models. They found that factors such as construction type, traffic control methods, driving under the influence, and truck involvement significantly affected accident frequency and severity, providing important insights for safety management in construction areas.

Regarding research on rear-end collision risk prediction, scholars have primarily focused on studies of rear-end collision risks in foggy conditions, the selection of rear-end indicators, and the use of various machine learning techniques for model validation. Wen Huiying et al. [11] selected the TTC indicator and utilized the German HighD dataset to assess rear-end collision risks involving large trucks on highways using machine learning models. They found that the RF model achieved the highest prediction accuracy, with minimum headway distance, standard deviation of speed, and standard deviation of acceleration having the greatest impact on rear-end collision risk. Wu, Y. et al. [12] developed a novel algorithm to assess rear-end collision risk in foggy conditions by comparing the safe braking distances between leading and following vehicles. The findings demonstrate the algorithm’s effectiveness in identifying collision risk disparities across different lanes and vehicle types. Wang Jiali [13] used a driving simulator to study the risk of chain-reaction rear-end collisions on highways in foggy conditions. She found that the primary cause was the reduced visibility leading to insufficient following distance and developed a TTC-based risk propagation model with high prediction accuracy. Li Yun et al. [14] developed a car-following rear-end collision risk threshold model for highway construction zones based on driver characteristics. They quantitatively assessed the risk levels of different sections and found that the conflict risk was highest in the warning zone, with traffic volume increases significantly elevating the risk. Gecchele, G. et al. [15] applied extreme value theory combined with Time-to-Collision as an alternative safety indicator to assess rear-end collision risks on highways. This approach effectively addressed the limitations of accident data-based analysis and was validated for feasibility on Italian toll roads. Li, Z. et al. [16] proposed a Rear-End Collision Risk Index based on loop detector data and used a logistic regression model to assess rear-end collision risks in real time at highway bottleneck areas. The results indicated that the highest rear-end collision risks occurred when the upstream approached saturation and the downstream was severely congested.

In summary, both domestic and international scholars have made significant progress in the fields of rear-end collision risk identification and prediction. Existing studies primarily use multi-source data fusion and diversified models to reveal the influencing factors of rear-end collisions on specific road segments, and have established dynamic early-warning frameworks for certain environments. However, existing studies still have the following limitations:

(1): Scenario limitations: Most studies focus on specific environments such as foggy conditions, mountainous areas, or construction zones. In contrast, the complex scenario of rain-affected urban expressway diverging areas, characterized by reduced visibility, a sharp decrease in road surface friction, and frequent lane changes, presents risk mechanisms that differ significantly from those of conventional road sections. Meanwhile, traffic accidents in rainy weather account for as much as 10.03% of all traffic accidents and 14.53% of direct property damage [17], which underscores the need for further attention and in-depth research.
(2): Model limitations: Existing models often perform well under single-factor conditions, but their ability to represent multi-factor coupling scenarios, such as those in rain-affected urban expressway diverging areas, remains unclear and requires further investigation.

In light of this, this study considers the characteristics of high traffic density, frequent lane changes, uneven speeds, and driver susceptibility to route choice interference in urban expressway diverging areas under rainy conditions. Firstly, a scenario analysis of rear-end collisions in rainy weather is conducted, listing the relevant influencing factors and classifying rear-end collision risk levels based on the MTTC indicator thresholds. Secondly, three machine learning models are selected to build a dynamic prediction model for rear-end collision risk in urban expressway diverging areas under rainy conditions, with SHAP values used to analyze the related influencing factors. Finally, a simulation platform is developed for data collection and processing, aiming to provide a theoretical foundation for effectively predicting the probability of rear-end collisions in urban expressway diverging areas under rainy weather and for enhancing road capacity and safety.

The contributions of this paper are summarized as follows:

(1): A rainfall-specific rear-end collision risk classification framework is developed based on the Modified Time-to-Collision metric, with distinct risk thresholds established for light, moderate, and heavy rain conditions.
(2): Given that existing studies on rear-end collision scenarios in urban expressway diverging areas remain limited, this study systematically analyzes the rear-end collision risk mechanisms specific to this traffic environment.
(3): By integrating machine learning models with SHAP-based interpretability analysis, this study reveals the evolution of dominant risk factors under different rainfall intensities, providing insights for differentiated traffic safety management in rainy conditions.

2. Factors and Risk Level Classification

2.1. Rear-End Collision Scenarios on Expressways in Rainy Weather

(1): Classification of Rainfall Levels

Rain is a common natural phenomenon that, while nourishing all living things, also impacts the transportation system, including effects on people, vehicles, roads, and the environment. According to the “Grade of Precipitation” (GB/T 28592-2012) [18] by the China Meteorological Administration, the precipitation intensity is classified into six levels based on the total precipitation over 24 h, as shown in Table 1.

According to the “Technical Guidelines for Highway Traffic Meteorological Disaster Risk Assessment” by the Ministry of Transport, under light rain, the road surface becomes slightly wet with little to no water accumulation, the friction coefficient decreases by approximately 10–15%, and visibility is greater than 1 km. Under moderate rain, a thin water film tends to form on the road surface, the friction coefficient decreases by approximately 20–30%, and visibility ranges from 500 m to 1 km. Under heavy rain, the road surface water accumulation depth is approximately 2–5 mm, the friction coefficient decreases by approximately 40–50%, and visibility is reduced to between 200 m and 500 m. Due to the high frequency of occurrence of these three scenarios in daily life, this paper will provide a detailed description of these three types of rainy weather conditions.

(2): IDM Car-Following Model under Rainy Weather Conditions

In both domestic and international research on traffic flow and vehicle behavior, car-following models, as an important tool for characterizing the longitudinal interaction between leading and following vehicles, are widely used in traffic simulation and safety analysis. Distance-based models, speed-difference-based models, and optimization-based models are commonly applied. Among them, traditional models include the GM model and the OV model, which emphasize the relationship between inter-vehicle spacing and desired speed, respectively. The FVD model, on the other hand, excels in improving the speed response mechanism.

Yang et al. [19] pointed out that the Intelligent Driver Model (IDM) performs best in real-world freeway simulations and is highly consistent with the car-following behavior of drivers. Therefore, this paper selects this model for simulation research.

The IDM is a widely used microscopic car-following model designed to characterize the dynamic behavior of vehicles in traffic flow. By describing the drivers’ response to the distance and speed difference between the leading and following vehicles, and incorporating the safe distance and desired speed, the model calculates the real-time acceleration of the vehicle. The core of its equation primarily includes the free-flow acceleration term and the car-following deceleration term, as shown in Equation (1):

\begin{cases} a_{n}(t) = a \left\{ 1 - \left[ \frac{v_{n}(t)}{v_{0}} \right]^{δ} - \left[ \frac{s^{*}(v_{n}(t), Δv_{n,n-1}(t))}{Δx_{n}(t)} \right]^{2} \right\} \\ s^{*}(v_{n}(t), Δv_{n,n-1}(t)) = s_{0} + T v_{n}(t) + \frac{v_{n}(t) Δv_{n,n-1}(t)}{2 \sqrt{a b_{comf}}} \end{cases}

(1)

where $a$ denotes the maximum acceleration; $s^{*}(v_{n}(t), Δv_{n,n-1}(t))$ denotes the desired following distance; $b_{comf}$ denotes the comfortable deceleration; $v_{0}$ denotes the desired speed; $s_{0}$ denotes the standstill safety distance; $T$ denotes the desired headway time; $v_{n}(t)$ and $Δx_{n}(t)$ denote the speed of the following vehicle and its spacing to the leading vehicle; $Δv_{n,n-1}(t)$ denotes the relative speed between the leading and following vehicles; and $δ$ denotes the acceleration index.

Shan et al. [20], based on driving data collected under different weather conditions, calibrated the IDM parameters for rainy weather by incorporating changes in friction coefficient and visibility distance. The specific calibration results are shown in Table 2, and this paper adopts these calibration results.
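As a rough illustration of how Equation (1) can be evaluated, the following minimal Python sketch computes the desired following distance and the resulting acceleration. The names `IDMParams`, `desired_gap`, and `idm_acceleration` are hypothetical, the default parameter values are placeholders rather than the calibrated values of Table 2, and the relative speed is assumed to be the follower speed minus the leader speed.

```python
import math
from dataclasses import dataclass


@dataclass
class IDMParams:
    """IDM parameters. Defaults are illustrative placeholders only,
    NOT the rainy-weather calibration reported in Table 2."""
    a: float = 1.0        # maximum acceleration a (m/s^2)
    b_comf: float = 2.0   # comfortable deceleration b_comf (m/s^2)
    v0: float = 22.0      # desired speed v_0 (m/s)
    s0: float = 2.0       # standstill safety distance s_0 (m)
    T: float = 1.5        # desired headway time T (s)
    delta: float = 4.0    # acceleration index


def desired_gap(v, dv, p):
    """Desired following distance s* from Equation (1).

    Assumption: dv is follower speed minus leader speed (closing rate)."""
    return p.s0 + p.T * v + v * dv / (2.0 * math.sqrt(p.a * p.b_comf))


def idm_acceleration(v, dv, gap, p):
    """Acceleration of the following vehicle from Equation (1)."""
    free_term = (v / p.v0) ** p.delta                  # free-flow term
    follow_term = (desired_gap(v, dv, p) / gap) ** 2   # car-following term
    return p.a * (1.0 - free_term - follow_term)


params = IDMParams()
# Follower at 15 m/s closing at 2 m/s on a leader 20 m ahead.
print(idm_acceleration(v=15.0, dv=2.0, gap=20.0, p=params))
```

Swapping in a different parameter set per rainfall scenario would only require constructing a separate `IDMParams` instance for each scenario.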

2.2. Factors Contributing to Rear-End Collisions

Rear-end collisions in the diverging areas of urban expressways under rainy weather conditions result from the interaction of multiple factors, including human, vehicle, road, and environmental elements. This section reviews the key factors influencing rear-end collisions, based on the micro-level analysis conducted by relevant scholars, focusing on these four dimensions.

(1): Driver Factors

The driver-related factors influencing rear-end collisions in rainy weather can be broadly categorized into two main aspects: emotional and psychological characteristics, and driving behavior decisions. The specific human factors and their primary causes are shown in Table 3.

(2): Vehicle Factors

Vehicle-related influencing factors can be categorized into two main aspects: vehicle performance and measured vehicle trajectory data. The former, in the context of rainy weather, is reflected in the effectiveness of the braking system and the skid resistance of the tires. The latter plays a crucial role in relevant studies, which indicate that rear-end collisions are closely related to vehicle speed, acceleration, and traffic flow. Ultimately, this study identifies the following 10 predictive factors: traffic volume, average distance, minimum distance, average vehicle speed, standard deviation of vehicle speed, average headway time, minimum headway time, average acceleration, standard deviation of acceleration, and traffic density.

(3): Road Factors

Road-related influencing factors can be divided into two dimensions: the geometric design of the diverging areas and the functional condition of the road surface. The former primarily includes aspects such as the length of the diverging area, the number of lanes in the diverging area, and the gradient of the widening section, while the latter focuses on drainage performance and skid resistance.

The “Design Code for Urban Expressways” (CJJ 129-2009) states that the length of auxiliary lanes in diverging areas of expressways should be greater than 1000 m, and the number of lanes at the diverging point is given by Equation (2).

N_{C} \geq N_{F} + N_{E} - 1

(2)

where $N_{C}$ denotes the number of mainline lanes before divergence; $N_{F}$ denotes the number of mainline lanes after divergence; and $N_{E}$ denotes the number of ramp lanes.

(4): Environment Factors

Environment-related influencing factors include rainfall intensity, visibility, and spatiotemporal distribution. Among them, rainfall intensity is based on the three scenarios mentioned in Section 2.1, while spatiotemporal distribution refers to special scenarios such as morning and evening traffic peaks, and large-scale events, which are prone to traffic congestion and accidents.

In rear-end collision risk identification in expressway diverging areas, selecting appropriate accident influencing factors is crucial. On one hand, the factors should be closely related to the occurrence of accidents; on the other hand, they need to be easily accessible and measurable. Based on this, this study conducts a comprehensive analysis of the characteristic variables contributing to rear-end collision risk in expressway diverging areas and identifies the following 16 influencing factors, as shown in Table 4.

Considering the specific traffic environment of expressway diverging areas and the characteristics of rainy weather conditions, this study focuses on vehicle-related factors for subsequent modeling and analysis. Vehicle-level factors directly reflect drivers’ instantaneous behaviors and interaction dynamics, such as speed, spacing, and acceleration, which play a dominant role in rear-end collision formation, particularly under reduced visibility and degraded pavement conditions in rainy weather. Moreover, vehicle-related variables can be directly derived from trajectory data with higher temporal resolution and reliability, making them more suitable for fine-grained risk identification and machine-learning-based modeling in this study.

2.3. Rear-End Collision Risk Level Classification

In existing research, traffic conflict evaluation indicators are mainly divided into three categories: deceleration-based indicators, time-based indicators, and space-based indicators, as shown in Table 5.

In rear-end collision risk research, the most widely used indicator is the Time-to-Collision (TTC) metric. TTC is based on the relative speed and headway between the lead and following vehicles. Under the condition that both vehicles are traveling on the same lane with stable trajectories, TTC represents the time from the initiation of evasive behavior to the occurrence of a collision. This indicator is applicable for safety analysis in various scenarios. However, the TTC metric only considers longitudinal motion and is suitable for use in deterministic models.

In the study of rear-end collisions on urban expressways during rainy weather, the applicability of the traditional TTC metric is constrained due to the complex and variable external conditions. On one hand, the rainy environment reduces visibility and decreases the road surface friction coefficient, which leads to an increase in vehicle braking distance. On the other hand, the variability in drivers’ reaction times increases, and the uncertainty in driving behavior intensifies. As a result, vehicles often exhibit nonlinear deceleration patterns, weakening the ability of TTC to accurately reflect the actual collision risk.

To accurately reflect rear-end collision risk under rainy weather conditions, this study introduces the Modified Time-to-Collision (MTTC) metric. Built upon the TTC metric, MTTC incorporates acceleration as a reference, aligning with the changes in vehicle braking characteristics and evasive capabilities in rainy scenarios, thereby enhancing the dynamic adaptability of the metric. Previous studies have demonstrated [21] that, compared with the traditional TTC, the MTTC metric is more suitable for capturing rear-end collision risk under complex and dynamic traffic conditions. By explicitly incorporating relative acceleration into the collision assessment, MTTC is able to reflect nonlinear braking behavior and dynamic interaction processes between consecutive vehicles, which are particularly pronounced under adverse weather conditions. Therefore, MTTC provides a more realistic and dynamically adaptive surrogate safety indicator for rear-end collision risk analysis in rainy urban expressway environments. The criteria for determining rear-end collision risk are presented in Equation (3), and further organized in Equation (4):

v_{b} t + \frac{1}{2} a_{b} t^{2} \geq d + v_{l} t + \frac{1}{2} a_{l} t^{2}

(3)

\frac{1}{2} Δ a t^{2} + Δ v t - d \geq 0

(4)

where

\begin{cases} Δa = a_{b} - a_{l} \\ Δv = v_{b} - v_{l} \end{cases}

(5)

and $v_{b}$, $v_{l}$ denote the speeds of the following and leading vehicles, respectively; $a_{b}$, $a_{l}$ denote the accelerations of the following and leading vehicles, respectively; and $d$ denotes the initial headway between the following and leading vehicles.

Let the inequality in Equation (4) become an equality. By solving the quadratic equation, the two solutions for MTTC, $t_{1}$ and $t_{2}$, are as follows:

t_{1, 2} = \frac{- Δv \pm \sqrt{Δv^{2} + 2 Δa d}}{Δa}

(6)

Thus, the value of MTTC can be expressed as:

MTTC = \begin{cases} \min(t_{1}, t_{2}), & t_{1} > 0,\ t_{2} > 0 \text{ and } Δa \neq 0 \\ t_{1}, & t_{1} > 0,\ t_{2} \leq 0 \text{ and } Δa \neq 0 \\ t_{2}, & t_{2} > 0,\ t_{1} \leq 0 \text{ and } Δa \neq 0 \\ \frac{d}{Δv}, & Δv > 0 \text{ and } Δa = 0 \end{cases}

(7)

From a physical perspective, MTTC represents the estimated time to collision under a short-term uniform-acceleration assumption, where both relative speed and relative acceleration jointly describe the longitudinal interaction between the leading and following vehicles. Compared with the traditional TTC metric, which assumes constant relative speed, the inclusion of acceleration allows MTTC to capture transient braking and car-following adjustments that commonly occur in complex traffic environments.

In this study, the acceleration term used in MTTC is consistent with the longitudinal motion description adopted in the car-following model. By accounting for acceleration-induced changes in the relative motion state, MTTC provides a more dynamically responsive surrogate safety indicator for rear-end collision risk analysis under time-varying traffic conditions.
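Equations (6) and (7) can be sketched in a few lines of Python. The function name `mttc` is hypothetical, and two conventions are assumptions made here rather than statements from the paper: the function returns infinity when no positive solution exists (no collision predicted), and a small tolerance `eps` decides when the relative acceleration is treated as zero.

```python
import math


def mttc(dv, da, d, eps=1e-9):
    """Modified Time-to-Collision following Equations (6) and (7).

    dv: follower speed minus leader speed (m/s)
    da: follower acceleration minus leader acceleration (m/s^2)
    d:  initial headway between the two vehicles (m)
    Returns math.inf when no collision is predicted (assumed convention).
    """
    if abs(da) < eps:
        # Relative acceleration is zero: last case of Equation (7).
        return d / dv if dv > 0 else math.inf
    disc = dv * dv + 2.0 * da * d   # term under the square root in Equation (6)
    if disc < 0:
        return math.inf             # no real root: the gap never closes
    root = math.sqrt(disc)
    t1 = (-dv + root) / da
    t2 = (-dv - root) / da
    positive = [t for t in (t1, t2) if t > 0]
    return min(positive) if positive else math.inf


# Follower closing at 2 m/s while braking 1 m/s^2 harder than the leader,
# with 1.5 m of headway: the roots are 1 s and 3 s, so MTTC is 1 s.
print(mttc(dv=2.0, da=-1.0, d=1.5))
```

A value below the 4 s threshold used in this study would then be flagged as a potential conflict.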

A decrease in the MTTC metric indicates an increase in the probability of a rear-end collision between the two vehicles. Ozbay, K. et al. [21] noted that the threshold for MTTC is typically set at 4 s; when the MTTC value exceeds 4 s, drivers usually have sufficient time to make appropriate decisions. Therefore, MTTC = 4 s is set as the threshold for determining potential conflicts in this study. The percentile method based on the MTTC value is then used to classify rear-end collision risk levels. The total sample size is denoted as $N$, and the dataset $S = \{MTTC_{1}, MTTC_{2}, \dots, MTTC_{N}\}$ is sorted in ascending order. The cumulative distribution is calculated, and the MTTC values at the 15th, 50th, and 85th percentiles are denoted as $P_{15}$, $P_{50}$, and $P_{85}$, respectively [22]. The specific risk levels are then derived, as shown in Table 6.
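The percentile step can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, filtering the sample at the 4 s conflict threshold before taking percentiles is an assumption, and the mapping from percentile bands to risk labels in `risk_level` is a placeholder, since the authoritative mapping is the one given in Table 6.

```python
from statistics import quantiles


def mttc_thresholds(mttc_values, conflict_limit=4.0):
    """Return (P15, P50, P85) of the MTTC sample.

    Assumption: only samples below `conflict_limit` (4 s) are kept;
    pass conflict_limit=None to use the whole sample.
    """
    if conflict_limit is None:
        sample = sorted(mttc_values)
    else:
        sample = sorted(v for v in mttc_values if v < conflict_limit)
    # 99 cut points; index k-1 holds the k-th percentile.
    cuts = quantiles(sample, n=100, method="inclusive")
    return cuts[14], cuts[49], cuts[84]


def risk_level(value, p15, p50, p85):
    """Hypothetical band-to-label mapping (smaller MTTC = higher risk).
    Table 6 of the paper defines the mapping actually used."""
    if value <= p15:
        return "high"
    if value <= p50:
        return "medium"
    if value <= p85:
        return "low"
    return "none"


sample = [0.1 * i for i in range(1, 40)]   # toy MTTC values in seconds
p15, p50, p85 = mttc_thresholds(sample)
print(p15, p50, p85, risk_level(0.5, p15, p50, p85))
```

Computing the thresholds separately for the light, moderate, and heavy rain samples would give one threshold triple per rainfall category.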

3. Rear-End Collision Risk Prediction

3.1. Risk Prediction Model Selection

In this section, considering the high traffic density, large speed fluctuations, and limited visibility in the diverging areas of urban expressways under rainy weather conditions, three machine learning methods are selected: eXtreme Gradient Boosting (XGBoost), Light Gradient Boosting Machine (LightGBM), and Random Forest (RF). These methods integrate the influencing factors identified in Section 2.2 to construct a rear-end collision risk prediction model framework. The model hyperparameters are optimized using grid search and five-fold cross-validation, and the prediction performance of each model under different risk level scenarios is evaluated to identify the optimal model for predicting rear-end collision risk in rainy weather conditions on urban expressway diverging areas.

(1): Overview of the Prediction Models

(a): XGBoost Model

XGBoost is an optimized, distributed implementation of gradient boosting whose objective combines a loss function with a regularization term, which gives it high efficiency and flexibility. The complexity of each decision tree in the XGBoost model is determined by the number of leaf nodes and their corresponding weights. The optimal parameters are determined by minimizing the objective function at each iteration, which XGBoost approximates with a second-order Taylor expansion of the loss. For a squared-error loss, the objective takes the form of Equation (8):

obj^{t} = \sum_{i = 1}^{n} (y_{i} - ({\hat{y}}_{i} + f_{t}(x_{i})))^{2} + Ω(f_{t})

(8)

where $obj^{t}$ denotes the objective function at the t-th iteration; $y_{i}$ denotes the observed value of the i-th sample; ${\hat{y}}_{i}$ denotes the predicted value for the i-th sample from the preceding iterations; $f_{t}(x_{i})$ denotes the new function added at the t-th iteration; and $Ω(f_{t})$ denotes the complexity of that function.
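For reference, the second-order Taylor expansion mentioned above can be written out as below. This is the standard gradient-boosting derivation rather than a result stated in this paper; $l$ denotes a generic loss function, and $g_{i}$ and $h_{i}$ are symbols introduced here for its first and second derivatives.

```latex
\begin{aligned}
obj^{t} &= \sum_{i=1}^{n} l\bigl(y_{i},\ \hat{y}_{i} + f_{t}(x_{i})\bigr) + \Omega(f_{t}) \\
        &\approx \sum_{i=1}^{n} \Bigl[ l(y_{i}, \hat{y}_{i}) + g_{i} f_{t}(x_{i}) + \tfrac{1}{2} h_{i} f_{t}(x_{i})^{2} \Bigr] + \Omega(f_{t}), \\
g_{i} &= \frac{\partial l(y_{i}, \hat{y}_{i})}{\partial \hat{y}_{i}}, \qquad
h_{i} = \frac{\partial^{2} l(y_{i}, \hat{y}_{i})}{\partial \hat{y}_{i}^{2}}
\end{aligned}
```

For the squared-error loss $l = (y_{i} - \hat{y}_{i})^{2}$, this gives $g_{i} = -2 (y_{i} - \hat{y}_{i})$ and $h_{i} = 2$, and the expansion is exact, which recovers the form of Equation (8).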

(b): LightGBM Model

The Light Gradient Boosting Machine (LightGBM) algorithm, like XGBoost, uses the Taylor expansion of the loss function to generate decision trees and combines multiple weak learners into a model. Its advantage lies in the “Leaf-wise” growth mechanism, which at each step splits the leaf, feature, and threshold that provide the greatest gain for the current sample set; combined with a depth limit, this keeps tree depth under control and improves training efficiency. In addition, the model supports the Gradient-based One-Side Sampling algorithm, which down-samples low-gradient samples, reducing training cost while largely preserving accuracy.

(c): RF Model

The Random Forest algorithm is an ensemble learning method that draws bootstrap samples from the initial dataset and grows one decision tree on each sample. At each decision tree node, a random subset of feature variables is considered for splitting, which keeps the trees only weakly correlated with one another. The final result is obtained by aggregating the predictions of all decision trees, either through majority voting for classification or by averaging, as shown in Equation (9):

y^{'}(x) = \frac{1}{N} \sum_{n = 1}^{N} y_{n}(x)

(9)

where $y^{'}(x)$ denotes the final predicted result; $N$ denotes the number of trees; and $y_{n}(x)$ denotes the prediction result of the n-th tree.

(2): Model Applicability Analysis

The unique advantages and good adaptability of the three models for predicting rear-end collision risk in urban expressway diverging areas under rainy weather conditions are primarily reflected in the following aspects, as shown in Table 7.

(3): Evaluation Metrics for Predictive Performance

Using the three machine learning methods mentioned above, rear-end collision risk prediction for each risk level is performed. This study selects accuracy, precision, recall, F1-score, and AUC-ROC as the evaluation metrics for model prediction performance, as shown in Equation (10).

(a): Accuracy: The proportion of correctly predicted samples to total samples.
(b): Precision: The proportion of samples predicted as positive that are actually positive.
(c): Recall: The proportion of correctly predicted positive samples to the total actual positive samples.
(d): F1-score: The harmonic mean of precision and recall.
(e): AUC-ROC: A measure of the model’s ability to distinguish between positive and negative samples at different classification thresholds.

\begin{cases} accuracy = \frac{TP + TN}{TP + TN + FP + FN} \\ precision = \frac{TP}{TP + FP} \\ recall = \frac{TP}{TP + FN} \\ F1\text{-}score = 2 \times \frac{precision \times recall}{precision + recall} \end{cases}

(10)

where $TP$ denotes the number of true positives; $FP$ denotes the number of false positives; $TN$ denotes the number of true negatives; and $FN$ denotes the number of false negatives.
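The four formulas in Equation (10) can be checked with a short Python sketch. The helper names are hypothetical, and treating one risk level as the positive class against all others (one-vs-rest) is an assumption made here for illustration; the paper does not state how the multi-class metrics were aggregated. Returning 0.0 when a denominator is zero is likewise a convention chosen for this sketch.

```python
def confusion_counts(y_true, y_pred, positive):
    """Count TP, FP, TN, FN for one class treated as positive (one-vs-rest)."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and F1-score as in Equation (10)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels, not data from the study.
y_true = ["high", "high", "low", "medium", "high", "low"]
y_pred = ["high", "low", "low", "high", "high", "medium"]
print(classification_metrics(*confusion_counts(y_true, y_pred, "high")))
```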

3.2. SHAP Evaluation Index of Factors

The XGBoost and LightGBM models in Section 3.1 are both gradient boosting tree models, while the RF model is based on a Bagging ensemble tree approach; all three are tree-based models. However, the decision-making process of such machine learning models is non-transparent and functions as a ‘black box’, which limits interpretability and makes it difficult to determine whether a feature variable has a positive or negative effect on the rear-end collision prediction. Therefore, this study adopts SHAP (SHapley Additive exPlanations), a game-theoretic interpretation method. By calculating the marginal contribution of each feature variable to the prediction outcome, SHAP quantifies the importance and direction of the impact of each feature on rear-end collision risk, and it is applicable to various tree-based models.

In this study, machine learning models are combined with SHAP values, and the model interpretation logic is illustrated in Figure 1. This approach enables both global interpretation of all feature variables and different risk levels, as well as local interpretation of feature variables and predicted values for individual samples, significantly enhancing the interpretability of feature variables in risk prediction. The corresponding formula is shown in Equation (11).

ω_{i} = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|! (|F| - |S| - 1)!}{|F|!} [f_{S \cup \{i\}} (x_{S \cup \{i\}}) - f_{S} (x_{S})]

(11)

where $ω_{i}$ denotes the SHAP value of feature variable $i$, which measures its marginal contribution to the rear-end collision risk outcome; $S$ denotes the subset of feature variables excluding feature variable $i$; $F$ denotes the set of all feature variables; $f_{S} (x_{S})$ denotes the rear-end collision risk prediction value when only the feature variables in subset $S$ are applied; and $f_{S \cup \{i\}} (x_{S \cup \{i\}})$ denotes the rear-end collision risk prediction value after the inclusion of feature variable $i$.
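To make Equation (11) concrete, the sketch below evaluates it by brute force, enumerating every subset S of the features other than i. The value function `toy_value`, its coefficients, and the three feature names used are hypothetical stand-ins for a model's prediction on a feature subset; practical SHAP implementations for tree models use far more efficient algorithms than this enumeration.

```python
from itertools import combinations
from math import factorial


def shapley_values(features, value_fn):
    """Exact Shapley values: weighted marginal contributions over all subsets."""
    n = len(features)
    omega = {}
    for i in features:
        others = [f for f in features if f != i]
        total = 0.0
        for k in range(len(others) + 1):
            # Weight |S|! (|F| - |S| - 1)! / |F|! for subsets of size k.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value_fn(s | {i}) - value_fn(s))
        omega[i] = total
    return omega


def toy_value(subset):
    """Hypothetical value function: additive effects plus one interaction term."""
    base = {"min_distance": 0.30, "mean_speed": 0.20, "traffic_volume": 0.10}
    value = sum(base[f] for f in subset)
    if {"min_distance", "mean_speed"} <= subset:
        value += 0.10  # interaction shared between the two features
    return value


omega = shapley_values(["min_distance", "mean_speed", "traffic_volume"], toy_value)
```

In this toy case the interaction term is split equally between the two interacting features, and the three values sum to the difference between the full-feature value and the empty-set value.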

4. Simulation Analysis and Results Validation

4.1. Data Sources and Processing

(1): Data Sources

This study utilizes a publicly available trajectory dataset from urban expressways in Wuhan [23] as the data source for model construction and validation. The selected study segment and prevailing rainy-weather traffic conditions are consistent with the urban expressway scenarios considered in previous studies [20] where rainy-weather IDM parameters were calibrated, supporting the applicability of the adopted parameter set in this context. The dataset includes operational data from multiple expressway networks, such as vehicle ID, time frames, vehicle pixel coordinates, speed, vehicle length, and travel direction. This paper focuses on a representative segment of a diverging area, as shown in Figure 2, which spans approximately 414 m and contains data from about 2400 vehicles.

In the data preprocessing stage, basic cleaning procedures were conducted prior to simulation, including the removal of abnormal records and entries with missing key variables and the temporal ordering of the remaining data, to ensure trajectory consistency and continuity. Initial car-following pairs were then extracted from the dataset according to the features of the expressway diverging area. The simulation experiments were run in Python (version 3.8) with a time step of 0.1 s, applying the rainy-weather parameters of the IDM car-following model to generate trajectory data for the three weather scenarios on this segment. The extracted data include traffic volume, average distance, minimum distance, average vehicle speed, standard deviation of vehicle speed, average headway time, minimum headway time, average acceleration, standard deviation of acceleration, and traffic density for each scenario.
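The simulation step can be sketched as follows, assuming the standard form of the Intelligent Driver Model and the light-rain parameters listed in Table 2. The class and function names, the leader speed profile, the initial conditions, and the simple Euler-style update are illustrative assumptions and not the authors' implementation.

```python
import math
from dataclasses import dataclass


@dataclass
class IDMParams:
    v0: float     # desired speed (m/s)
    a: float      # maximum acceleration (m/s^2)
    delta: float  # acceleration exponent
    s0: float     # minimum standstill gap (m)
    T: float      # desired time headway (s)
    b: float      # comfortable deceleration (m/s^2)


# Light-rain values from Table 2; v0 is converted from km/h to m/s.
LIGHT_RAIN = IDMParams(v0=108.00 / 3.6, a=0.6331, delta=4.9999,
                       s0=1.8213, T=2.5639, b=0.4142)


def idm_acceleration(p, v, v_lead, gap):
    """Standard IDM acceleration for follower speed v, leader speed v_lead, and gap."""
    dv = v - v_lead  # approach rate
    s_star = p.s0 + max(0.0, v * p.T + v * dv / (2.0 * math.sqrt(p.a * p.b)))
    return p.a * (1.0 - (v / p.v0) ** p.delta - (s_star / gap) ** 2)


def simulate_follower(p, v_init, gap_init, lead_speeds, dt=0.1):
    """Advance the follower with a 0.1 s step; returns (speed, gap) per step."""
    v, gap = v_init, gap_init
    trace = []
    for v_lead in lead_speeds:
        acc = idm_acceleration(p, v, v_lead, max(gap, 0.1))
        v_new = max(0.0, v + acc * dt)  # no reversing
        gap += (v_lead - 0.5 * (v + v_new)) * dt
        v = v_new
        trace.append((v, gap))
    return trace


# Hypothetical leader profile: 5 s at 20 m/s, then 10 s at 15 m/s.
lead = [20.0] * 50 + [15.0] * 100
trace = simulate_follower(LIGHT_RAIN, v_init=20.0, gap_init=40.0, lead_speeds=lead)
```

Per-pair outputs of this kind (speed and gap at each step) are the raw material from which aggregate features such as minimum distance or minimum headway time could be derived.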

(2): Correlation Test

Pearson correlation tests are conducted on the influencing factors to mitigate the potential negative effects of inter-factor correlations on the model’s predictive performance. Following previous studies [11], two factors are regarded as highly correlated if the absolute value of the correlation coefficient, |r|, exceeds 0.8; in that case, only one of them is typically retained as the representative variable, which reduces feature redundancy prior to model construction. Although tree-based models such as LightGBM are generally robust to multicollinearity, excluding highly correlated features helps ensure a more concise input space and improves the stability and interpretability of subsequent SHAP-based analysis. This study performs the correlation tests in Python and conducts them separately for the light, moderate, and heavy rain scenarios.

Taking the light rain scenario as an example, as shown in Figure 3, the correlation coefficients of three variable pairs—average headway time and average distance, minimum headway time and minimum distance, as well as traffic volume and traffic density—exceed 0.8, indicating strong correlations among several distance- and headway-related variables. To mitigate potential multicollinearity, this study retains representative variables from each correlated group. Given that traffic volume and traffic density are inherently redundant, traffic volume is retained as the representative macroscopic flow indicator. For the distance–headway variable pairs, distance-based indicators are retained under light rain conditions to characterize spatial safety margins, while the corresponding headway-time variables are excluded. Consequently, traffic density, minimum headway time, and average headway time are excluded from the influencing factors in the light rain scenario.
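The screening rule can be illustrated with a short sketch that flags every feature pair whose absolute Pearson coefficient exceeds 0.8. The helper names and the five-sample toy columns are hypothetical; which variable of a flagged pair is retained remains a modelling decision, as described above.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def highly_correlated_pairs(columns, threshold=0.8):
    """Return (feature_a, feature_b, r) for every pair with |r| > threshold."""
    names = list(columns)
    pairs = []
    for idx, a in enumerate(names):
        for b in names[idx + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) > threshold:
                pairs.append((a, b, r))
    return pairs


# Toy data (hypothetical values, not taken from the study's dataset).
columns = {
    "traffic_volume": [1200, 1500, 1800, 2100, 2400],
    "density": [20, 26, 30, 36, 41],
    "mean_speed": [22.0, 19.5, 21.0, 18.0, 20.5],
}
pairs = highly_correlated_pairs(columns)
```

With these toy values only the traffic volume and density pair is flagged, mirroring the kind of redundancy described for the light rain scenario.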

The Pearson correlation matrices for the moderate and heavy rain scenarios are presented in Appendix A. The final input variables selected for the three rainfall scenarios are summarized in Table 8.

(3): MTTC Threshold Division

After simulation, data cleaning, and verification, the trajectory data for light, moderate, and heavy rain are obtained, with 19,657, 15,013, and 14,601 records, respectively. According to the MTTC = 4 s threshold division logic outlined in Section 2.3, the final processed data are summarized in Table 9.

All MTTC values below 4 s in the conflict dataset are arranged in ascending order, and the percentile distribution method is applied to determine the risk classification thresholds for each rainfall scenario. Consistent with Section 2.3, the P15, P50, and P85 percentiles in each scenario are defined as the upper bounds of the high-, medium-, and low-risk levels, respectively. Furthermore, to balance the number of samples across different class labels and prevent model overfitting caused by uneven sample distribution, undersampling is applied to the overrepresented categories. While undersampling discards samples from the overrepresented classes and can therefore lead to information loss, it was chosen here due to the relatively small size of the dataset and the need to ensure sufficient representation of each class. Other methods, such as SMOTE or class weighting, could also be considered to handle the class imbalance; however, undersampling was preferred in this study to avoid introducing synthetic data or excessively complicating the model. It should be noted that although this method reduces the overall sample size, it helps to improve the model’s generalizability by reducing the potential bias towards the majority class. Subsequently, each machine learning model is trained and tested on the undersampled datasets, which are randomly divided into training and testing sets at a 7:3 ratio. The results are presented in Table 10.
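The labelling, undersampling, and splitting steps can be sketched as follows. The MTTC values here are synthetic (uniformly drawn) stand-ins for a conflict dataset, the percentile routine uses linear interpolation as one common convention, and the split is a plain random 7:3 split; all function names are hypothetical, and whether the authors stratified the split is not stated.

```python
import random


def percentile(sorted_values, q):
    """Linear-interpolation percentile (q in [0, 100]) of an ascending list."""
    pos = (len(sorted_values) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (pos - lo)


def label_risk(mttc, p15, p50, p85):
    """Assign a risk level following the ranges in Table 6."""
    if mttc < p15:
        return "high"
    if mttc < p50:
        return "medium"
    if mttc < p85:
        return "low"
    return None  # at or above P85: outside the three classes


def undersample(samples_by_class, rng):
    """Randomly reduce every class to the size of the smallest class."""
    n = min(len(v) for v in samples_by_class.values())
    return {k: rng.sample(v, n) for k, v in samples_by_class.items()}


def split_70_30(items, rng):
    """Shuffle and split into training (70%) and testing (30%) sets."""
    shuffled = items[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * 0.7)
    return shuffled[:cut], shuffled[cut:]


rng = random.Random(0)
# Synthetic MTTC values below 4 s (assumption, for illustration only).
mttc_values = sorted(rng.uniform(0.0, 4.0) for _ in range(2000))
p15, p50, p85 = (percentile(mttc_values, q) for q in (15, 50, 85))

by_class = {"high": [], "medium": [], "low": []}
for m in mttc_values:
    label = label_risk(m, p15, p50, p85)
    if label is not None:
        by_class[label].append(m)

balanced = undersample(by_class, rng)
pooled = [(m, label) for label, values in balanced.items() for m in values]
train_set, test_set = split_70_30(pooled, rng)
```

Because the high-risk class covers 15% of the conflict records while the medium- and low-risk classes each cover 35%, the high-risk class sets the common size after undersampling, which matches the pattern visible in Table 10.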

4.2. Comparison and Analysis of Prediction Models

This study calculates the accuracy of the three models for each weather scenario based on the confusion matrices generated by the prediction models. The aim is to identify the most suitable prediction model for each weather condition and to present the corresponding ROC curves.

(1): Accuracy Analysis of Models under Three Weather Scenarios

Following the definitions in Section 3.1, the predictive accuracy of each model is evaluated primarily through three key metrics: precision, recall, and F1-score.
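For a three-class problem such as the low-, medium-, and high-risk levels, per-class metrics are commonly obtained one-vs-rest from the confusion matrix and then averaged. The sketch below shows this computation; the function name and the 3 × 3 matrix are hypothetical and do not reproduce any confusion matrix from this study.

```python
def per_class_report(cm, labels):
    """cm[i][j] = number of samples of true class i predicted as class j."""
    n = len(labels)
    total = sum(sum(row) for row in cm)
    report = {}
    for k, name in enumerate(labels):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(n)) - tp  # predicted k, true class differs
        fn = sum(cm[k]) - tp                       # true k, predicted otherwise
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[name] = {"precision": precision, "recall": recall,
                        "f1": f1, "support": sum(cm[k])}
    accuracy = sum(cm[k][k] for k in range(n)) / total
    macro_f1 = sum(r["f1"] for r in report.values()) / n
    return report, accuracy, macro_f1


# Hypothetical confusion matrix (rows: true class, columns: predicted class).
cm = [[90, 8, 2],
      [10, 75, 15],
      [0, 12, 88]]
report, accuracy, macro_f1 = per_class_report(cm, ["low", "medium", "high"])
```

When the classes are balanced, as after undersampling, the macro and weighted averages coincide up to rounding.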

(a)
Light Rain Scenario

In the light rain scenario, the confusion matrix results of the three prediction models are shown in Figure 4. A detailed classification report of the model predictions is presented in Table 11.

As shown in Table 11, all three models perform relatively well in predicting the low-risk category. Among them, the LightGBM model achieves the highest overall accuracy at 84%, which is 4 percentage points higher than the XGBoost model and 15 percentage points higher than the RF model. In addition, for the medium-risk category in the light rain scenario, the LightGBM model achieves the highest recall (0.78) and F1-score (0.78), while its precision (0.78) is close to that of the XGBoost model (0.80) and well above that of the RF model (0.62). Therefore, the LightGBM model demonstrates superior overall predictive performance in the light rain scenario.

(b)
Moderate Rain Scenario

The confusion matrix results for the moderate rain scenario are presented in Figure A3 in Appendix B. The prediction classification report for the three models is shown in Table 12.

As shown in Table 12, the overall prediction accuracy of the LightGBM model in the moderate rain scenario is 69%. In predicting the three risk levels, this model outperforms the other two models in both precision and recall. Specifically, for the low-risk category, its prediction precision reaches 78%, with an F1-score of 76%; for the high-risk category, the prediction precision is 74%, with an F1-score of 73%. Therefore, the LightGBM model demonstrates the strongest classification capability among the three models in the moderate rain scenario.

(c)
Heavy Rain Scenario

The confusion matrix results for the heavy rain scenario are presented in Figure A4 in Appendix B. The prediction classification report for the three models is shown in Table 13. It can be seen that the LightGBM model has the highest overall prediction accuracy at 76%, slightly higher than the 74% accuracy of the XGBoost model. In contrast, the RF model performs relatively weakly in this scenario, with a difference of nearly ten percentage points compared to the first two models. In the low-risk and high-risk category predictions, both the XGBoost and LightGBM models demonstrate relatively stable performance. However, in the medium-risk category prediction for the heavy rain scenario, the LightGBM model exhibits higher precision and recall. Overall, the LightGBM model shows a stronger ability to distinguish risk levels in the heavy rain scenario.

The detailed hyperparameter settings of the machine learning models are provided in Appendix C.

(2): ROC Curves of LightGBM under Three Weather Scenarios

Based on the comparison of model accuracies across the three weather scenarios, the LightGBM model consistently demonstrates the highest predictive performance. Therefore, this section presents the ROC curves of the LightGBM model under the three scenarios.

Based on the LightGBM confusion matrix results across the three scenarios, ROC curves are plotted with the true positive rate on the vertical axis and the false positive rate on the horizontal axis. The ROC curves illustrate the model’s discriminative ability under different classification thresholds, while the AUC represents the area under the ROC curve. A larger AUC indicates stronger capability in distinguishing between classes. The results are presented in Figure 5.
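For a multi-class classifier, per-class AUC values are commonly computed one-vs-rest from the predicted class scores. The sketch below uses the rank-based identity that the AUC equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The function names, probabilities, and labels are hypothetical, and this is not presented as the authors' plotting procedure.

```python
def roc_auc(scores, is_positive):
    """AUC as P(score of positive > score of negative); ties count as 0.5."""
    pos = [s for s, y in zip(scores, is_positive) if y]
    neg = [s for s, y in zip(scores, is_positive) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(prob_rows, true_labels, classes):
    """Per-class AUC, treating each class in turn as the positive class."""
    return {
        c: roc_auc([row[k] for row in prob_rows], [y == c for y in true_labels])
        for k, c in enumerate(classes)
    }


classes = ["low", "medium", "high"]
# Hypothetical predicted probabilities, one row per sample (columns follow `classes`).
prob_rows = [
    [0.80, 0.15, 0.05],
    [0.60, 0.30, 0.10],
    [0.30, 0.50, 0.20],
    [0.20, 0.45, 0.35],
    [0.10, 0.30, 0.60],
    [0.05, 0.40, 0.55],
]
true_labels = ["low", "low", "medium", "high", "high", "medium"]
auc = one_vs_rest_auc(prob_rows, true_labels, classes)
```

The pairwise comparison is quadratic in the number of samples, which is acceptable for test sets of a few hundred records; sorting-based implementations scale better.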

Across the three scenarios, the LightGBM model demonstrates strong discriminative ability in predicting both low-risk and high-risk categories, with corresponding AUC values exceeding 0.85. Notably, in the light rain scenario, the AUC for the low-risk category reaches 0.97, reflecting high model sensitivity. In comparison, the AUC values for the medium-risk category are slightly lower but remain above 0.75, indicating a reasonable overall performance and confirming that the model maintains stable predictive capability for this category as well.

Specifically, the LightGBM model exhibits the strongest predictive capability for low-risk samples, slightly weaker performance for high-risk samples, and relatively lower accuracy for medium-risk samples. This can be attributed to two main factors. First, the medium-risk category represents an intermediate zone between low- and high-risk states, where some data are susceptible to interference from both extremes, resulting in blurred boundaries. Second, from a driving behavior perspective, drivers tend to maintain high attention in high-risk situations, while in low-risk states, they may adopt relatively more aggressive driving behaviors. Medium-risk conditions fall between these extremes, exhibiting greater driving variability. These two factors partially affect the model’s ability to accurately predict medium-risk samples.

Across the three weather scenarios, the model prediction results indicate that the LightGBM model maintains high accuracy and stability in risk category identification tasks, demonstrating strong robustness and generalization capability.

4.3. Risk Factors Analysis

This study integrates the LightGBM model with the SHAP value interpretation method to identify key influencing factors and reveal the underlying mechanisms of rear-end collision risk under light, moderate, and heavy rain scenarios. SHAP values are used to quantify the marginal contribution of each input variable to the model’s predictions across different risk levels, where a larger absolute value indicates a stronger influence on risk classification.
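A global importance ranking is often obtained by averaging the absolute SHAP values of each feature over all samples. The sketch below assumes that convention; the function name and the SHAP matrix are invented for illustration and are not results from this study, and the aggregation used for the figures in this paper may differ.

```python
def rank_by_mean_abs_shap(shap_matrix, feature_names):
    """Rank features by mean |SHAP| over samples (rows) in descending order."""
    n = len(shap_matrix)
    importance = {
        name: sum(abs(row[j]) for row in shap_matrix) / n
        for j, name in enumerate(feature_names)
    }
    return sorted(importance.items(), key=lambda kv: kv[1], reverse=True)


features = ["min_distance", "mean_speed", "traffic_volume"]
# Hypothetical SHAP values for one risk class: one row per sample.
shap_matrix = [
    [0.40, -0.10, 0.05],
    [-0.30, 0.20, -0.05],
    [0.20, -0.30, 0.10],
    [-0.50, 0.10, 0.00],
]
ranking = rank_by_mean_abs_shap(shap_matrix, features)
```

The sign of an individual SHAP value indicates whether the feature pushed that sample's prediction toward or away from the class, while the mean absolute value only measures the strength of influence.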

By combining the strong predictive capability of LightGBM with the transparent interpretability provided by the SHAP evaluation framework, this study achieves robust risk prediction while maintaining clear insight into the model’s decision logic. LightGBM effectively captures complex nonlinear relationships among traffic variables under varying rainfall conditions, whereas SHAP enables an intuitive assessment of the relative importance of different factors. This integrated framework therefore provides a reliable methodological basis for comparing dominant risk drivers across rainfall scenarios and analyzing the evolution of rear-end collision risk mechanisms.

Figure 6a–c present the SHAP-based importance distributions of the key influencing factors under light, moderate, and heavy rain scenarios, respectively.

In the light rain scenario, the minimum distance and average vehicle speed exhibit relatively high SHAP values. Among them, the minimum distance accounts for the largest proportion in the high-risk category, indicating that under such conditions, the traffic environment remains relatively stable, and individual microscopic car-following behavior becomes the key factor influencing risk. Therefore, in the light rain scenario, it is essential to maintain a steady driving speed, minimize abrupt speed fluctuations, and enhance the capability of intelligent sensing systems to identify abnormally close car-following behavior, thereby guiding drivers to maintain a safe following distance.

In the moderate rain scenario, the minimum headway time emerges as the dominant influencing factor, far exceeding the impact of other variables. The average vehicle speed and traffic volume are identified as the next two key contributors. The underlying mechanism can be explained by the fact that as rainfall intensity increases, the road friction coefficient decreases further, making the interaction between the extreme fluctuations of dynamic following distance and the macroscopic traffic flow state a critical determinant of risk. Therefore, in the moderate rain scenario, it is crucial to identify “short headway–high density” zones and implement multi-level smooth speed control measures on complex segments of expressways to ensure appropriate following distances.

In the heavy rain scenario, the average vehicle speed and the minimum headway time are identified as the two dominant influencing factors, while the effects of traffic volume and standard deviation of vehicle speed increase significantly. This reflects the model’s dual sensitivity to both macroscopic traffic flow conditions and microscopic speed fluctuations. Therefore, under the heavy rain scenario, it is essential to develop a wide-area risk early-warning system based on the coordinated identification of speed–flow disturbances, enabling the early detection of potential rear-end collision zones and the real-time guidance of critical road segments.

To further illustrate the evolution of dominant risk mechanisms across rainfall intensities, the top three influencing factors identified by SHAP under light, moderate, and heavy rain scenarios are summarized in Figure 7. The results indicate a clear shift in dominant risk drivers as rainfall intensifies. Under light rain conditions, rear-end collision risk is primarily governed by microscopic car-following behaviors, with minimum distance and speed-related factors playing a leading role. As rainfall increases to a moderate level, the importance of minimum headway time becomes more pronounced, accompanied by an increasing contribution of traffic volume, reflecting the growing influence of interactions between individual driving behavior and macroscopic traffic flow states. Under heavy rain conditions, speed-related factors and traffic flow disturbances jointly dominate risk prediction, highlighting the strengthened coupling between macroscopic flow instability and microscopic speed fluctuations. Overall, this transition reveals an evolutionary shift from microscopically driven risk mechanisms toward macro–micro coupled disturbances as rainfall severity increases.

5. Conclusions

(1): Based on three machine learning models (XGBoost, LightGBM, and RF), this study proposed a rear-end collision risk prediction model for urban expressways under rainy conditions. Across the three weather scenarios, the LightGBM model consistently achieves the highest overall accuracy and generally superior precision and recall, highlighting its strong predictive capability.
(2): Using SHAP values, the study ranked the importance of factors influencing rear-end collision risk. In the light rain scenario, the minimum distance has the greatest impact; in the moderate rain scenario, the minimum headway time plays the most significant role; and in the heavy rain scenario, the average vehicle speed and minimum headway time exert the strongest influence.
(3): According to different rainfall intensities, a hierarchical and categorized driving safety management strategy was proposed. Under the light rain scenario, the focus is on guiding drivers to maintain operational stability and safe headway distances; under moderate and heavy rain scenarios, the emphasis shifts to strengthening the coordinated control of “speed–distance–traffic volume.”

This study provides theoretical insights for the prediction and prevention of rear-end collision risks in urban expressway diverging areas under rainy conditions. However, given the complexity of the scenarios and data analysis, the present research considers only two-vehicle rear-end collisions and vehicle-related influencing factors. Future research can build upon the findings of this study by conducting more in-depth analyses of the vehicle-related factors associated with rear-end collisions under different rainfall scenarios. By integrating the identified dominant risk factors with improved IDM-based car-following models and intelligent connected vehicle strategies, subsequent studies may further develop targeted prevention and control approaches, ultimately contributing to enhanced traffic safety and operational efficiency on urban expressways.

Additionally, this study has several limitations. The rainy-weather vehicle trajectory data were generated through simulation using previously calibrated IDM parameters reported in the literature, rather than being directly collected under real rainy conditions. Although this approach enables a controlled analysis of rear-end collision risk under different rainfall intensities, it may limit the external validity of the findings when applied to other road segments or traffic environments. Future research will incorporate real-world rainy-weather trajectory data and multi-site validation to further examine the generalizability of the proposed framework.

Moreover, the proposed model was specifically calibrated for urban expressway areas, which have unique traffic dynamics compared to other road types such as highways. Therefore, applying the model to highways or other road segments would require recalibration of the IDM parameters, and further validation is necessary to assess its performance and applicability in these contexts.

Author Contributions

Conceptualization, X.X. and C.Z. (Chenqiang Zhu); methodology, J.Y.; software, T.Z.; validation, T.Z., P.W. and C.Z. (Chenke Zhu); formal analysis, T.Z.; investigation, P.W. and C.Z. (Chenke Zhu); resources, J.Y.; data curation, P.W.; writing – original draft preparation, T.Z.; writing – review and editing, X.X. and C.Z. (Chenqiang Zhu); visualization, C.Z. (Chenke Zhu); supervision, X.X. and C.Z. (Chenqiang Zhu); project administration, X.X. and C.Z. (Chenqiang Zhu); funding acquisition, C.Z. (Chenqiang Zhu). All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Shanghai Pujiang Programme (23PJC075), the National Natural Science Foundation of China (72501184), the Shanghai Planning Office of Philosophy and Social Sciences (2023EGL005), and the Laboratory of Computation and Analytics of Complex Management Systems (CACMS) (Tianjin University).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data that support the findings of this study are available from the corresponding author, upon reasonable request.

Acknowledgments

The authors gratefully acknowledge the financial support from the Shanghai Pujiang Programme, the National Natural Science Foundation of China, and the Shanghai Planning Office of Philosophy and Social Sciences.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Correlation Test

Figure A1. Correlation analysis of influencing factors under moderate rain scenario.

Figure A2. Correlation analysis of influencing factors under heavy rain scenario.

Appendix B. Confusion Matrix

Figure A3. (a) Confusion matrix of XGBoost prediction model under the moderate rain scenario; (b) Confusion matrix of LightGBM prediction model under the moderate rain scenario; (c) Confusion matrix of RF prediction model under the moderate rain scenario.

Figure A4. (a) Confusion matrix of XGBoost prediction model under the heavy rain scenario; (b) Confusion matrix of LightGBM prediction model under the heavy rain scenario; (c) Confusion matrix of RF prediction model under the heavy rain scenario.

Appendix C. ML Model Hyperparameters

Table A1. XGBoost hyperparameters under three rainfall scenarios.

XGBoost Hyperparameter	Light Rain	Moderate Rain	Heavy Rain
n_estimators	300	200	200
learning_rate	0.03	0.05	0.01
max_depth	6	6	8
subsample	1.0	0.9	0.8
colsample_bytree	0.9	0.9	0.9
reg_lambda	1.2	1.2	1.0
reg_alpha	0	0	0
gamma	0	0	0.1

Table A2. LightGBM hyperparameters under three rainfall scenarios.

LightGBM Hyperparameter	Light Rain	Moderate Rain	Heavy Rain
n_estimators	200	150	300
learning_rate	0.10	0.10	0.07
num_leaves	15	31	20
max_depth	10	10	−1
min_child_samples	30	20	10
subsample	0.7	0.7	0.6
colsample_bytree	1.0	1.0	1.0

Table A3. RF hyperparameters under three rainfall scenarios.

RF Hyperparameter	Light Rain	Moderate Rain	Heavy Rain
n_estimators	400	500	400
max_depth	15	12	12
min_samples_split	10	6	10
min_samples_leaf	2	3	3
max_features	log2	log2	sqrt
criterion	entropy	entropy	entropy
bootstrap	True	True	True

References

1. Ministry of Public Security Traffic Management Bureau. Traffic Accident Statistical Yearbook of the People’s Republic of China 2020; Ministry of Public Security Traffic Management Bureau: Beijing, China, 2021.
2. Pan, H.; Wang, Y.; Li, D.; Zhang, X.; Chen, J. Risk Assessment and Influence Factors Analysis of Rear-End Collision on Curved Slope Combination Section. J. Harbin Inst. Technol. 2023, 55, 36–46.
3. Das, A.; Abdel-Aty, M.A. A Combined Frequency–Severity Approach for the Analysis of Rear-End Crashes on Urban Arterials. Saf. Sci. 2011, 49, 1156–1163.
4. Huang, X.; Huang, S. Freeway Rear-End Perception Model Based on PC-Crash Simulation and Statistical Analysis. China Saf. Sci. J. 2020, 30, 143–148.
5. Mohamed, S.A.; Mohamed, K.; Al-Harthi, H.A. Investigating Factors Affecting the Occurrence and Severity of Rear-End Crashes. Transp. Res. Procedia 2017, 25, 2103–2112.
6. Liu, B.; Yan, H. An Analysis of Influencing Factors of Multi-vehicle Rear-end Accidents Based on Accident Classification of SVM. J. Transp. Inf. Saf. 2020, 38, 43–51.
7. Zou, R.; Yu, H.; Zhang, C.G. Severity Analyses of Urban Two-Vehicle-Involved Rear-End Crashes Characterized by Different Configurations Using Mixed Logit Models. J. Transp. Eng. Part A. Syst. 2024, 150, 4023141.1–4023141.11.
8. Yuan, R.; Wang, C.; Xiang, Q. Analysis on the Heterogeneity of Factors Influencing the Rear-End Crash Severity on Highways. J. Southeast Univ. Nat. Sci. Ed. 2024, 54, 1231–1238.
9. Ahmadi, A.; Jahangiri, A.; Berardi, V.; Machiani, S.G. Crash Severity Analysis of Rear-End Crashes in California Using Statistical and Machine Learning Classification Methods. J. Transp. Saf. Secur. 2018, 1–25.
10. Qi, Y.; Srinivasan, R.; Teng, H.; Baker, R. Analysis of the Frequency and Severity of Rear-End Crashes in Work Zones. J. Crash Prev. Inj. Control. 2013, 14, 61–72.
11. Wen, H.; Huang, K.; Zhao, S. Prediction of Rear-End Collision Risk of Freeway Trucks Based on Machine Learning. China Saf. Sci. J. 2023, 33, 173–180.
12. Wu, Y.; Abdel-Aty, M.; Cai, Q.; Lee, J.; Park, J. Developing an algorithm to assess the rear-end collision risk under fog conditions using real-time data. Transp. Res. Part C Emerg. Technol. 2018, 87, 11–25.
13. Wang, J. Study on the Mechanism of Multiple-Vehicle Rear-End Collision on Highway Under Fog Weather Condition. Master’s Thesis, Beijing Jiaotong University, Beijing, China, 2018.
14. Li, Y.; Zhang, S.; Ru, M. Car Following Rear-End Conflict Risk of Freeway Work Zone. J. Chang. Univ. Nat. Sci. Ed. 2017, 37, 81–88.
15. Gecchele, G.; Orsini, F.; Gastaldi, M.; Rossi, R. Freeway Rear-End Collision Risk Estimation with Extreme Value Theory Approach: A Case Study. Transp. Res. Procedia 2019, 37, 195–202.
16. Li, Z.; Ahn, S.; Chung, K.; Ragland, D.R.; Wang, W.; Yu, J.W. Surrogate Safety Measure for Evaluating Rear-End Collision Risk Related to Kinematic Waves near Freeway Recurrent Bottlenecks. Accid. Anal. Prev. 2014, 64, 52–61.
17. Zhu, L. Research on Following Control Strategies for Connected and Automated Vehicles Considering Rear-End Collision Risks in Continuous Rainy Traffic Flow. Master’s Thesis, Chongqing Jiaotong University, Chongqing, China, 2024.
18. GB/T 28592-2012; Grade of Precipitation. Standards Press of China: Beijing, China, 2012.
19. Yang, T.; Wang, W.; Li, Y.; Li, J. Classification of Driver Car-Following Behavior Style Based on Vehicle Trajectory Data. J. Jilin Univ. Eng. Technol. Ed. 2025, 1–18.
20. Shan, H.; Yu, C.; Xu, J.; Gao, C.; Yao, Y. Calibration on Expressway Car-following Model with Various Rainfall Intensities. J. Highw. Transp. Res. Dev. 2024, 41, 190–198.
21. Ozbay, K.; Yang, H.; Bartin, B.; Mudigonda, S. Derivation and Validation of New Simulation-Based Surrogate Safety Measure. Transp. Res. Rec. 2008, 2083, 105–113.
22. Zhu, S.; Jiang, R.; Wang, H.; Zou, H.; Wang, P.; Qiu, J. Review of Research on Traffic Conflict Techniques. China J. Highw. Transp. 2020, 33, 15–33.
23. TOD_VT Introductions. ITS Research Centre, Wuhan University of Technology [EB/OL]. Available online: http://www.whuttis.cn/About.aspx?ClassID=74 (accessed on 30 May 2023).

Figure 1. SHAP-based interpretation of machine learning models.

Figure 2. Satellite imagery of the road segment.

Figure 3. Correlation analysis of influencing factors under light rain scenario.

Figure 4. (a) Confusion matrix of XGBoost prediction model under the light rain scenario; (b) Confusion matrix of LightGBM prediction model under the light rain scenario; (c) Confusion matrix of RF prediction model under the light rain scenario.

Figure 5. (a) ROC curves of the LightGBM model under the light rain scenario; (b) ROC curves of the LightGBM model under the moderate rain scenario; (c) ROC curves of the LightGBM model under the heavy rain scenario.

Figure 6. (a) SHAP value analysis under the light rain scenario; (b) SHAP value analysis under the moderate rain scenario; (c) SHAP value analysis under the heavy rain scenario.

Figure 7. Top-ranked influential factors under three rainfall scenarios.

Table 1. Classification of rainfall intensity levels.

Category	Level	24-h Total Precipitation (mm)
Rainy weather	Light rain	0.1–9.9
	Moderate rain	10.0–24.9
	Heavy rain	25.0–49.9
	Storm	50.0–99.9
	Severe storm	100.0–249.9
	Extra severe storm	≥250.0

Table 2. Calibration of IDM parameters under rainy weather conditions.

Parameter	Clear Weather	Light Rain	Moderate Rain	Heavy Rain
$v_{0} / (km \cdot h^{-1})$	112.50	108.00	95.29	90.00
$a / (m \cdot s^{-2})$	0.6277	0.6331	0.3066	0.2646
$δ$	4.8623	4.9999	4.6809	4.0170
$s_{0} / m$	2.1120	1.8213	1.2300	1.8920
$T / s$	1.6700	2.5639	1.8542	1.9316
$b_{comf} / (m \cdot s^{-2})$	0.8053	0.4142	0.7385	1.3121

Table 3. Driver-related factors.

Human Factors	Primary Causes	Consequences
Reaction delay	Reduced visibility	Increased safe following distance requirement
Distracted driving	Personal factors	Increased probability of accidents
Lane-changing decision	Increased blind spots	Increased probability of lane-change rear-end collisions
Car-following decision	Underestimation of braking distance	Headway time below the safety threshold

Table 4. Influencing factors of rear-end collision risk in diverging areas of urban expressways.

Serial Number	Factor Category	Factor Symbol	Factor Name	Unit or Value
1	Road	L	Diverging area length	m
2		N	Number of lanes in diverging area	Lane count
3		P	Widening section gradient rate	[0, 1]
4	Environment	R1	Light rain	0.1–9.9 (mm)
5		R2	Moderate rain	10.0–24.9 (mm)
6		R3	Heavy rain	25.0–49.9 (mm)
7	Vehicle	traffic_volume	traffic volume	veh/h
8		mean_distance	average distance	m
9		min_distance	minimum distance	m
10		mean_speed	average vehicle speed	m/s
11		std_speed	standard deviation of vehicle speed	m/s
12		mean_time_headway	average headway time	s
13		min_time_headway	minimum headway time	s
14		mean_acceleration	average acceleration	m/s²
15		std_acceleration	standard deviation of acceleration	m/s²
16		density	traffic density	veh/km

Table 5. Traffic conflict indicators.

Indicator Category	Typical Indicator	Advantage	Limitation
Deceleration-based indicator	DRAC, CPI	Closely linked to vehicle performance	Dependent on driving behavior assumptions
Time-based indicator	TTC, PET	Strong dynamic warning capability, high real-time performance	High data requirements
Space-based indicator	DSS	Reflects intuitive safety distance, strong practicality in static scenarios	Not applicable to dynamic traffic flow

Table 6. Risk level classification.

Risk Level	Range of MTTC Values
Low risk	$P_{50} \leq MTTC < P_{85}$
Medium risk	$P_{15} \leq MTTC < P_{50}$
High risk	$MTTC < P_{15}$

Table 7. Model Applicability Analysis.

Model	Reason for Model Applicability Selection
XGBoost	The vehicle data in rainy weather conditions is discrete. XGBoost’s regularization effectively prevents overfitting and enhances its noise resistance. It supports dynamic threshold adjustment, making it adaptable for multi-level risk prediction.
LightGBM	Rear-end collision data for vehicles in rainy weather requires large-scale filtering. LightGBM has fast training speed, making it suitable for large-scale data; it can perform efficient rear-end collision risk prediction based on real-time trajectory data.
RF	The factors influencing rear-end collision risk are highly interpretable, robust to outliers, and the decision trees are independent, allowing for in-depth analysis of individual influencing factors.

Table 8. Selection of influencing factors under three rainfall scenarios.

Factor Symbol	Light Rain	Moderate Rain	Heavy Rain
traffic_volume	☑	☑	☑
mean_distance	☑	⊠	☑
min_distance	☑	⊠	⊠
mean_speed	☑	☑	☑
std_speed	☑	☑	⊠
mean_time_headway	⊠	☑	☑
min_time_headway	⊠	☑	☑
mean_acceleration	☑	☑	☑
std_acceleration	☑	☑	☑
density	⊠	⊠	⊠

where ☑ denotes that the influencing factor is selected under the given rainfall scenario; ⊠ denotes that the influencing factor is not selected under the given rainfall scenario.

Table 9. Conflict dataset partitioning.

Scenario	Conflict Dataset	Non-Conflict Dataset	Total Dataset
Light Rain	2366	17,291	19,657
Moderate Rain	2613	12,400	15,013
Heavy Rain	2673	11,928	14,601

Table 10. MTTC threshold segmentation and undersampling under various rainy weather conditions.

Weather	Risk Level	MTTC Range (s)	Samples Before Undersampling	Samples After Undersampling	Number of Samples in Test Set
Light rain	High	(0, 0.76)	355	355	106
	Medium	[0.76, 2.16)	828	355	107
	Low	[2.16, 3.44)	828	355	107
Moderate rain	High	(0, 0.56)	392	392	117
	Medium	[0.56, 1.87)	915	392	118
	Low	[1.87, 3.29)	915	392	118
Heavy rain	High	(0, 0.43)	401	401	120
	Medium	[0.43, 1.50)	935	401	120
	Low	[1.50, 2.89)	935	401	121
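
The balancing step reflected in Table 10 can be illustrated as follows. This is a hedged sketch that assumes plain random undersampling to the size of the smallest class; the source reports only the resulting counts, so the sampling method, the helper name `undersample_to_minority`, and the seed are assumptions. The class sizes are the light rain counts from Table 10.

```python
import numpy as np

def undersample_to_minority(labels, seed=0):
    """Return sorted row indices that keep an equal number of samples per class.

    Every class is randomly reduced, without replacement, to the size of the
    smallest class (random undersampling; assumed here, not stated in the source).
    """
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(labels, return_counts=True)
    n_keep = counts.min()
    kept = [
        rng.choice(np.flatnonzero(labels == c), size=n_keep, replace=False)
        for c in classes
    ]
    return np.sort(np.concatenate(kept))

# Light rain class sizes before undersampling (Table 10).
labels = np.array(["high"] * 355 + ["medium"] * 828 + ["low"] * 828)
idx = undersample_to_minority(labels)

# Count the samples that remain in each class after balancing.
names, sizes = np.unique(labels[idx], return_counts=True)
balanced_counts = {str(n): int(s) for n, s in zip(names, sizes)}
```

Under these assumptions each light rain class is reduced to 355 samples, matching the undersampled counts in Table 10.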

Table 11. Prediction performance report of the model under light rain scenario.

Model Report	XGBoost Precision	XGBoost Recall	XGBoost F1-Score	LightGBM Precision	LightGBM Recall	LightGBM F1-Score	RF Precision	RF Recall	RF F1-Score	Number of Test Samples
Low risk	0.83	0.90	0.86	0.88	0.87	0.87	0.77	0.78	0.78	106
Medium risk	0.80	0.64	0.72	0.78	0.78	0.78	0.62	0.47	0.53	107
High risk	0.78	0.87	0.82	0.86	0.88	0.87	0.67	0.82	0.74	107
Accuracy	0.80			0.84			0.69			320
Macro average	0.80	0.80	0.80	0.84	0.84	0.84	0.69	0.69	0.68	320
Weighted average	0.80	0.80	0.80	0.84	0.84	0.84	0.69	0.69	0.68	320
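
As a consistency check on Table 11, the macro and weighted averages can be recomputed from the per-class rows. The sketch below copies the reported light rain scores; the helper names `macro_average` and `weighted_average` are illustrative, and the weights are the per-class test sample counts from the last column.

```python
# Per-class scores from Table 11 (light rain), ordered low, medium, high risk.
TABLE_11 = {
    "XGBoost": {
        "precision": [0.83, 0.80, 0.78],
        "recall": [0.90, 0.64, 0.87],
        "f1": [0.86, 0.72, 0.82],
    },
    "LightGBM": {
        "precision": [0.88, 0.78, 0.86],
        "recall": [0.87, 0.78, 0.88],
        "f1": [0.87, 0.78, 0.87],
    },
    "RF": {
        "precision": [0.77, 0.62, 0.67],
        "recall": [0.78, 0.47, 0.82],
        "f1": [0.78, 0.53, 0.74],
    },
}
SUPPORT = [106, 107, 107]  # test samples per class, same row order

def macro_average(scores):
    """Unweighted mean over classes."""
    return sum(scores) / len(scores)

def weighted_average(scores, support):
    """Mean over classes weighted by the number of test samples per class."""
    return sum(s * n for s, n in zip(scores, support)) / sum(support)

for model, metrics in TABLE_11.items():
    for name, scores in metrics.items():
        macro = round(macro_average(scores), 2)
        weighted = round(weighted_average(scores, SUPPORT), 2)
        print(f"{model:8s} {name:9s} macro={macro:.2f} weighted={weighted:.2f}")
```

Because the three classes have nearly equal test counts after undersampling, the macro and weighted averages coincide at two decimal places, as the table reports.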

Table 12. Prediction performance report of the model under moderate rain scenario.

Model Report	XGBoost Precision	XGBoost Recall	XGBoost F1-Score	LightGBM Precision	LightGBM Recall	LightGBM F1-Score	RF Precision	RF Recall	RF F1-Score	Number of Test Samples
Low risk	0.77	0.71	0.74	0.78	0.74	0.76	0.69	0.72	0.71	117
Medium risk	0.50	0.58	0.53	0.58	0.61	0.60	0.41	0.42	0.42	118
High risk	0.69	0.63	0.65	0.74	0.73	0.73	0.61	0.58	0.59	118
Accuracy	0.64			0.69			0.57			353
Macro average	0.65	0.64	0.64	0.70	0.69	0.70	0.57	0.57	0.57	353
Weighted average	0.65	0.64	0.64	0.70	0.69	0.70	0.57	0.57	0.57	353

Table 13. Prediction performance report of the model under heavy rain scenario.

Model Report	XGBoost Precision	XGBoost Recall	XGBoost F1-Score	LightGBM Precision	LightGBM Recall	LightGBM F1-Score	RF Precision	RF Recall	RF F1-Score	Number of Test Samples
Low risk	0.84	0.86	0.85	0.87	0.86	0.87	0.73	0.83	0.78	120
Medium risk	0.64	0.62	0.63	0.68	0.66	0.67	0.51	0.44	0.48	120
High risk	0.74	0.74	0.74	0.74	0.77	0.75	0.64	0.64	0.64	121
Accuracy	0.74			0.76			0.64			361
Macro average	0.74	0.74	0.74	0.76	0.76	0.76	0.63	0.64	0.63	361
Weighted average	0.74	0.74	0.74	0.76	0.76	0.76	0.63	0.64	0.63	361

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
