1. Introduction
With the continuous advancement of urbanization, expressways, as fast corridors connecting urban areas, have become increasingly significant. The merging and diverging zones along these expressways have emerged as critical areas for traffic flow transition between regions, directly affecting the operational efficiency and safety levels of urban transportation.
According to the 2020 traffic accident statistics in China [1], the number of traffic accidents on urban expressways was approximately 244,670, with rear-end collisions accounting for nearly 40%. The weaving zones of urban expressways are prone to rear-end collisions due to high traffic density, frequent lane changing, and uneven vehicle speeds. In merging areas, the probability of rear-end collisions is relatively low due to smaller speed variations and higher average speeds. In contrast, diverging areas are more prone to rear-end collisions, as drivers are more susceptible to route selection interference, leading to significant speed fluctuations.
In addition, rainy weather brings slippery road surfaces, lower road friction coefficients, reduced visibility, water film effects, and changes in driver behavior, all of which further increase the probability of rear-end collisions. Therefore, it is particularly necessary to conduct risk prediction for rear-end collisions in diverging areas of expressways under rainy weather conditions.
Currently, research on rear-end collisions in diverging areas of expressways, both domestically and internationally, primarily focuses on rear-end collision risk identification and prediction. Rear-end collision risk identification aims to extract dynamic or static key indicators that influence rear-end collisions from complex traffic systems through data mining and feature analysis methods, revealing the underlying mechanisms of accidents. In contrast, rear-end collision risk prediction focuses on developing statistical models and algorithms to assess the probability and severity of accidents based on historical data and real-time information, enabling dynamic early warning. The following provides a detailed explanation from these two aspects.
Regarding research on rear-end collision risk identification, Chinese scholars have primarily utilized relevant theories such as the CHAID tree model [2], genetic algorithms [3], PC-Crash simulation [4], and the binary Logit model [5] to assess the factors influencing rear-end collisions, aiming to improve traffic safety on various road segments. Liu Benmin et al. [6], based on rear-end collision data from the United States, utilized the SVM model to analyze chain-reaction rear-end collisions. They identified key factors such as the leading vehicle’s motion state, road speed limits, season, and the number of lanes, with higher speed limits (above 80 km/h), summer conditions, and multi-lane roads being more likely to lead to chain-reaction collisions. Zou, R. et al. [7] employed a mixed Logit model to examine the determinants of driver injury severity in rear-end collisions between passenger cars and pickup trucks on urban roadways. The analysis indicated that, across different collision configurations, alcohol involvement, roadway curvature, gender, and failure to wear seat belts were significantly associated with injury severity. Yuan Renteng et al. [8], based on the CROPHM model, analyzed the causal differences in the severity of rear-end collisions on highways between day and night. They found that accidents occurring during the early morning hours, or involving multiple vehicles or large vehicles, were significantly more likely to be fatal, and proposed targeted improvement recommendations. Ahmadi, A. et al. [9] integrated 5 years of data from California and employed Multinomial Logit, Mixed Multinomial Logit, and SVM models to analyze the severity and influencing factors of rear-end collisions. They found that the SVM model slightly outperformed the others in terms of predictive performance, providing important insights for improving driver safety education, as well as vehicle and road design. Qi, Y. et al. [10] analyzed the frequency and severity of rear-end collisions in construction zones using truncated count models and ordered Probit models. They found that factors such as construction type, traffic control methods, driving under the influence, and truck involvement significantly affected accident frequency and severity, providing important insights for safety management in construction areas.
Regarding research on rear-end collision risk prediction, scholars have primarily focused on rear-end collision risks in foggy conditions, the selection of rear-end collision indicators, and the use of various machine learning techniques for model validation. Wen Huiying et al. [11] selected the TTC indicator and utilized the German HighD dataset to assess rear-end collision risks involving large trucks on highways using machine learning models. They found that the RF model achieved the highest prediction accuracy, with minimum headway distance, standard deviation of speed, and standard deviation of acceleration having the greatest impact on rear-end collision risk. Wu, Y. et al. [12] developed a novel algorithm to assess rear-end collision risk in foggy conditions by comparing the safe braking distances between leading and following vehicles. The findings demonstrate the algorithm’s effectiveness in identifying collision risk disparities across different lanes and vehicle types. Wang Jiali [13] used a driving simulator to study the risk of chain-reaction rear-end collisions on highways in foggy conditions. She found that the primary cause was reduced visibility leading to insufficient following distance, and developed a TTC-based risk propagation model with high prediction accuracy. Li Yun et al. [14] developed a car-following rear-end collision risk threshold model for highway construction zones based on driver characteristics. They quantitatively assessed the risk levels of different sections and found that the conflict risk was highest in the warning zone, with increases in traffic volume significantly elevating the risk. Gecchele, G. et al. [15] applied extreme value theory combined with Time-to-Collision as a surrogate safety indicator to assess rear-end collision risks on highways. This approach addressed the limitations of analysis based on accident data and was validated for feasibility on Italian toll roads. Li, Z. et al. [16] proposed a Rear-End Collision Risk Index based on loop detector data and used a logistic regression model to assess rear-end collision risks in real time at highway bottleneck areas. The results indicated that the highest rear-end collision risks occurred when the upstream section approached saturation and the downstream section was severely congested.
In summary, both domestic and international scholars have made significant progress in the fields of rear-end collision risk identification and prediction. Existing studies primarily use multi-source data fusion and diversified models to reveal the influencing factors of rear-end collisions on specific road segments, and have established dynamic early-warning frameworks for certain environments. However, existing studies still have the following limitations:
- (1)
Scenario limitations: Most studies focus on specific environments such as foggy conditions, mountainous areas, or construction zones. In contrast, the complex scenario of rain-affected urban expressway diverging areas, characterized by reduced visibility, a sharp decrease in road surface friction, and frequent lane changes, presents risk mechanisms that differ significantly from those of conventional road sections. Meanwhile, accidents in rainy weather account for as much as 10.03% of all traffic accidents, with the associated direct property damage constituting 14.53% [17], highlighting the need for further attention and in-depth research.
- (2)
Model limitations: Existing models often perform well under single-factor conditions, but their ability to represent multi-factor coupling scenarios, such as those in rain-affected urban expressway diverging areas, remains unclear and requires further investigation.
In light of this, this study considers the characteristics of high traffic density, frequent lane changes, uneven speeds, and driver susceptibility to route choice interference in urban expressway diverging areas under rainy conditions. Firstly, a scenario analysis of rear-end collisions in rainy weather is conducted, listing the relevant influencing factors and classifying rear-end collision risk levels based on MTTC indicator thresholds. Secondly, three machine learning models are selected to build a dynamic prediction model for rear-end collision risk in urban expressway diverging areas under rainy conditions, with SHAP values used to analyze the related influencing factors. Finally, a simulation platform is developed for data collection and processing. The aim is to provide a theoretical foundation for effectively predicting the probability of rear-end collisions in urban expressway diverging areas under rainy weather and for enhancing road capacity and safety.
The contributions of this paper are summarized as follows:
- (1)
A rainfall-specific rear-end collision risk classification framework is developed based on the Modified Time-to-Collision metric, with distinct risk thresholds established for light, moderate, and heavy rain conditions.
- (2)
Given that existing studies on rear-end collision scenarios in urban expressway diverging areas remain limited, this study systematically analyzes the rear-end collision risk mechanisms specific to this traffic environment.
- (3)
By integrating machine learning models with SHAP-based interpretability analysis, this study reveals the evolution of dominant risk factors under different rainfall intensities, providing insights for differentiated traffic safety management in rainy conditions.
2. Factors and Risk Level Classification
2.1. Rear-End Collision Scenarios on Expressways in Rainy Weather
- (1)
Classification of Rainfall Levels
Rain affects every component of the transportation system: people, vehicles, roads, and the environment. According to the “Grade of Precipitation” (GB/T 28592-2012) [18] issued by the China Meteorological Administration, precipitation intensity is classified into six levels based on the total precipitation over 24 h, as shown in Table 1.
According to the “Technical Guidelines for Highway Traffic Meteorological Disaster Risk Assessment” by the Ministry of Transport, under light rain, the road surface becomes slightly wet with little to no water accumulation, the friction coefficient decreases by approximately 10–15%, and visibility is greater than 1 km. Under moderate rain, a thin water film tends to form on the road surface, the friction coefficient decreases by approximately 20–30%, and visibility ranges from 500 m to 1 km. Under heavy rain, the road surface water accumulation depth is approximately 2–5 mm, the friction coefficient decreases by approximately 40–50%, and visibility is reduced to between 200 m and 500 m. Because these three scenarios occur most frequently in daily life, this paper focuses on light, moderate, and heavy rain.
- (2)
IDM Car-Following Model under Rainy Weather Conditions
In both domestic and international research on traffic flow and vehicle behavior, car-following models, as an important tool for characterizing the longitudinal interaction between leading and following vehicles, are widely used in traffic simulation and safety analysis. Distance-based models, speed-difference-based models, and optimization-based models are commonly applied. Among them, traditional models include the GM model and the OV model, which emphasize the relationship between inter-vehicle spacing and desired speed, respectively. The FVD model, on the other hand, excels in improving the speed response mechanism.
Yang et al. [19] pointed out that the Intelligent Driver Model (IDM) performs best in real-world freeway simulations and is highly consistent with the car-following behavior of drivers. Therefore, this paper selects this model for simulation research.
The IDM is a widely used microscopic car-following model designed to characterize the dynamic behavior of vehicles in traffic flow. By describing the drivers’ response to the distance and speed difference between the leading and following vehicles, and incorporating the safe distance and desired speed, the model calculates the real-time acceleration of the vehicle. The core of its equation primarily includes the free-flow acceleration term and the car-following deceleration term, as shown in Equation (1):
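$$a = a_{\max}\left[1 - \left(\frac{v}{v_0}\right)^{\delta} - \left(\frac{s^{*}(v,\Delta v)}{s}\right)^{2}\right], \qquad s^{*}(v,\Delta v) = s_0 + vT + \frac{v\,\Delta v}{2\sqrt{a_{\max}\,b}} \qquad (1)$$
where
$a_{\max}$ denotes the maximum acceleration;
$s^{*}$ denotes the desired following distance;
$b$ denotes the comfortable deceleration;
$v_0$ denotes the desired speed;
$s_0$ denotes the standstill safety distance;
$T$ denotes the desired headway time;
$\Delta v$ denotes the relative speed between leading and following vehicles and
$\delta$ denotes the acceleration index.
Here $v$ is the speed of the following vehicle and $s$ is its current gap to the leading vehicle.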
Shan et al. [20], based on driving data collected under different weather conditions, calibrated the IDM parameters for rainy weather by incorporating changes in friction coefficient and visibility distance. The specific calibration results are shown in Table 2, and this paper adopts these calibration results.
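The IDM update referred to in Equation (1) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the simulation code used in this study: the function names `idm_acceleration` and `step`, the explicit Euler update with a constant-speed leader, and the numbers in `PLACEHOLDER_PARAMS` are all hypothetical. The rainy-weather values calibrated in [20] (Table 2) would replace the placeholders for each rainfall scenario.

```python
import math


def idm_acceleration(v, gap, dv, *, a_max, b, v0, s0, T, delta):
    """IDM acceleration of the following vehicle (m/s^2).

    v   : speed of the following vehicle (m/s)
    gap : current distance to the leading vehicle (m), must be > 0
    dv  : approach rate, follower speed minus leader speed (m/s)
    """
    # Desired following distance: standstill gap + headway term + braking term.
    s_star = s0 + v * T + v * dv / (2.0 * math.sqrt(a_max * b))
    free_flow_term = (v / v0) ** delta      # pull towards the desired speed
    following_term = (s_star / gap) ** 2    # push away from the leader
    return a_max * (1.0 - free_flow_term - following_term)


def step(v, gap, v_lead, dt=0.1, **params):
    """Advance the follower by one time step (explicit Euler, assumed scheme).

    The leader is assumed to keep a constant speed during the step.
    Returns the new follower speed, the new gap, and the acceleration used.
    """
    a = idm_acceleration(v, gap, v - v_lead, **params)
    v_new = max(0.0, v + a * dt)            # vehicles do not reverse
    gap_new = gap + (v_lead - 0.5 * (v + v_new)) * dt
    return v_new, gap_new, a


# Hypothetical placeholder values, NOT the calibrated parameters of Table 2.
PLACEHOLDER_PARAMS = dict(a_max=1.0, b=2.0, v0=20.0, s0=2.0, T=1.5, delta=4.0)

if __name__ == "__main__":
    v, gap = 15.0, 30.0
    for _ in range(5):
        v, gap, a = step(v, gap, v_lead=12.0, **PLACEHOLDER_PARAMS)
        print(f"v={v:.2f} m/s  gap={gap:.2f} m  a={a:.2f} m/s^2")
```

The default `dt=0.1` matches the 0.1 s time step mentioned later in the paper; everything else is illustrative.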
2.2. Factors Contributing to Rear-End Collisions
Rear-end collisions in the diverging areas of urban expressways under rainy weather conditions result from the interaction of multiple factors, including human, vehicle, road, and environmental elements. This section reviews the key factors influencing rear-end collisions, based on the micro-level analysis conducted by relevant scholars, focusing on these four dimensions.
The driver-related factors influencing rear-end collisions in rainy weather can be broadly categorized into two main aspects: emotional and psychological characteristics, and driving behavior decisions. The specific human factors and their primary causes are shown in Table 3.
Vehicle-related influencing factors can be categorized into two main aspects: vehicle performance and measured vehicle trajectory data. The former, in the context of rainy weather, is reflected in the effectiveness of the braking system and the skid resistance of the tires. The latter plays a crucial role in relevant studies, which indicate that rear-end collisions are closely related to vehicle speed, acceleration, and traffic flow. Ultimately, this study identifies the following 10 predictive factors: traffic volume, average distance, minimum distance, average vehicle speed, standard deviation of vehicle speed, average headway time, minimum headway time, average acceleration, standard deviation of acceleration, and traffic density.
Road-related influencing factors can be divided into two dimensions: the geometric design of the diverging areas and the functional condition of the road surface. The former primarily includes aspects such as the length of the diverging area, the number of lanes in the diverging area, and the gradient of the widening section, while the latter focuses on drainage performance and skid resistance.
The “Design Code for Urban Expressways” (CJJ 129-2009) states that the length of auxiliary lanes in diverging areas of expressways should be greater than 1000 m, and the number of lanes at the diverging point is given by Equation (2).
$$N_C \ge N_F + N_E - 1 \qquad (2)$$
where
$N_C$ denotes the number of mainline lanes before divergence;
$N_F$ denotes the number of mainline lanes after divergence and
$N_E$ denotes the number of ramp lanes.
Environment-related influencing factors include rainfall intensity, visibility, and spatiotemporal distribution. Among them, rainfall intensity is based on the three scenarios mentioned in Section 2.1, while spatiotemporal distribution refers to special scenarios such as morning and evening traffic peaks, and large-scale events, which are prone to traffic congestion and accidents.
In rear-end collision risk identification in expressway diverging areas, selecting appropriate accident influencing factors is crucial. On one hand, the factors should be closely related to the occurrence of accidents; on the other hand, they need to be easily accessible and measurable. Based on this, this study conducts a comprehensive analysis of the characteristic variables contributing to rear-end collision risk in expressway diverging areas and identifies the following 16 influencing factors, as shown in Table 4.
Considering the specific traffic environment of expressway diverging areas and the characteristics of rainy weather conditions, this study focuses on vehicle-related factors for subsequent modeling and analysis. Vehicle-level factors directly reflect drivers’ instantaneous behaviors and interaction dynamics, such as speed, spacing, and acceleration, which play a dominant role in rear-end collision formation, particularly under reduced visibility and degraded pavement conditions in rainy weather. Moreover, vehicle-related variables can be directly derived from trajectory data with higher temporal resolution and reliability, making them more suitable for fine-grained risk identification and machine-learning-based modeling in this study.
2.3. Rear-End Collision Risk Level Classification
In existing research, traffic conflict evaluation indicators are mainly divided into three categories: deceleration-based indicators, time-based indicators, and space-based indicators, as shown in Table 5.
In rear-end collision risk research, the most widely used indicator is the Time-to-Collision (TTC) metric. TTC is computed from the relative speed and the headway between the leading and following vehicles. Under the condition that both vehicles travel in the same lane with stable trajectories, TTC represents the time remaining before a collision if both vehicles maintain their current speeds. This indicator is applicable to safety analysis in various scenarios. However, the TTC metric only considers longitudinal motion at constant speed and is therefore best suited to deterministic models.
In the study of rear-end collisions on urban expressways during rainy weather, the applicability of the traditional TTC metric is constrained due to the complex and variable external conditions. On one hand, the rainy environment reduces visibility and decreases the road surface friction coefficient, which leads to an increase in vehicle braking distance. On the other hand, the variability in drivers’ reaction times increases, and the uncertainty in driving behavior intensifies. As a result, vehicles often exhibit nonlinear deceleration patterns, weakening the ability of TTC to accurately reflect the actual collision risk.
To accurately reflect rear-end collision risk under rainy weather conditions, this study introduces the Modified Time-to-Collision (MTTC) metric. Built upon the TTC metric, MTTC incorporates acceleration as a reference, aligning with the changes in vehicle braking characteristics and evasive capabilities in rainy scenarios, thereby enhancing the dynamic adaptability of the metric. Previous studies have demonstrated [21] that, compared with the traditional TTC, the MTTC metric is more suitable for capturing rear-end collision risk under complex and dynamic traffic conditions. By explicitly incorporating relative acceleration into the collision assessment, MTTC is able to reflect nonlinear braking behavior and dynamic interaction processes between consecutive vehicles, which are particularly pronounced under adverse weather conditions. Therefore, MTTC provides a more realistic and dynamically adaptive surrogate safety indicator for rear-end collision risk analysis in rainy urban expressway environments. The criteria for determining rear-end collision risk are presented in Equation (3), and further organized in Equation (4):
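$$V_F t + \frac{1}{2} a_F t^{2} \ge D + V_L t + \frac{1}{2} a_L t^{2} \qquad (3)$$
$$\frac{1}{2}\,\Delta a\, t^{2} + \Delta V\, t - D \ge 0 \qquad (4)$$
Among them:
$$\Delta V = V_F - V_L, \qquad \Delta a = a_F - a_L$$
where
$V_F$, $V_L$ denote the speeds of the following and leading vehicles, respectively;
$a_F$, $a_L$ denote the accelerations of the following and leading vehicles, respectively;
$D$ denotes the initial headway between the following and leading vehicles.
Let the inequality in Equation (4) become an equality. By solving the quadratic equation, the two solutions for MTTC, $t_1$ and $t_2$, are as follows:
$$t_1 = \frac{-\Delta V - \sqrt{\Delta V^{2} + 2\,\Delta a\, D}}{\Delta a}, \qquad t_2 = \frac{-\Delta V + \sqrt{\Delta V^{2} + 2\,\Delta a\, D}}{\Delta a}$$
Thus, the value of MTTC can be expressed as:
$$MTTC = \begin{cases} \min(t_1, t_2), & t_1 > 0,\ t_2 > 0 \\ t_1, & t_1 > 0,\ t_2 \le 0 \\ t_2, & t_1 \le 0,\ t_2 > 0 \end{cases}$$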
From a physical perspective, MTTC represents the estimated time to collision under a short-term uniform-acceleration assumption, where both relative speed and relative acceleration jointly describe the longitudinal interaction between the leading and following vehicles. Compared with the traditional TTC metric, which assumes constant relative speed, the inclusion of acceleration allows MTTC to capture transient braking and car-following adjustments that commonly occur in complex traffic environments.
In this study, the acceleration term used in MTTC is consistent with the longitudinal motion description adopted in the car-following model. By accounting for acceleration-induced changes in the relative motion state, MTTC provides a more dynamically responsive surrogate safety indicator for rear-end collision risk analysis under time-varying traffic conditions.
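The MTTC calculation described above (smallest positive root of the uniform-acceleration closing equation) can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the function name `mttc`, the tolerance `eps`, the fallback to the constant-speed TTC when the relative acceleration is zero, and the use of `math.inf` for "no collision predicted" are assumptions made here.

```python
import math


def mttc(v_f, v_l, a_f, a_l, d, eps=1e-9):
    """Modified Time-to-Collision in seconds.

    v_f, v_l : speeds of the following and leading vehicles (m/s)
    a_f, a_l : accelerations of the following and leading vehicles (m/s^2)
    d        : initial headway between the two vehicles (m)
    Returns math.inf when the gap is never closed (assumed convention).
    """
    dv = v_f - v_l   # relative speed
    da = a_f - a_l   # relative acceleration

    if abs(da) < eps:
        # No relative acceleration: reduces to the classical TTC.
        return d / dv if dv > eps else math.inf

    # Roots of 0.5 * da * t^2 + dv * t - d = 0.
    disc = dv * dv + 2.0 * da * d
    if disc < 0.0:
        return math.inf  # follower decelerates enough to avoid contact

    root = math.sqrt(disc)
    t1 = (-dv - root) / da
    t2 = (-dv + root) / da
    positive = [t for t in (t1, t2) if t > 0.0]
    return min(positive) if positive else math.inf


if __name__ == "__main__":
    # Leader brakes at 2 m/s^2 while the follower holds its speed.
    print(mttc(v_f=20.0, v_l=15.0, a_f=0.0, a_l=-2.0, d=20.0))
```

With equal accelerations the function returns the ordinary TTC (20 m closed at 5 m/s gives 4 s); once the leader brakes, the returned value drops below that, which is the behaviour the text attributes to MTTC.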
A decrease in the MTTC metric indicates an increase in the probability of a rear-end collision between the two vehicles. Ozbay, K. et al. [21] noted that the threshold for MTTC is typically set at 4 s: when the MTTC value exceeds 4 s, drivers usually have sufficient time to make appropriate decisions. Therefore, MTTC = 4 s is set as the threshold for determining potential conflicts in this study. The percentile method based on the MTTC value is then used to classify rear-end collision risk levels. The total sample size is denoted as $N$, and the dataset $\{MTTC_i\}_{i=1}^{N}$ is sorted in ascending order. The cumulative distribution is calculated, with the MTTC values corresponding to the 15th, 50th, and 85th percentiles denoted as $P_{15}$, $P_{50}$, and $P_{85}$, respectively [22]. The specific risk levels are then derived, as shown in Table 6.
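The percentile-based classification can be illustrated with a short sketch. The helper names, the linear-interpolation percentile, and in particular the mapping of intervals to labels (values up to the 15th percentile as high risk, up to the 50th as medium, up to the 85th as low, and anything above as "negligible") are assumptions for illustration only; the authoritative interval definitions are those of Table 6.

```python
def percentile(sorted_vals, p):
    """Linear-interpolation percentile (p in [0, 100]) of an ascending list."""
    k = (len(sorted_vals) - 1) * p / 100.0
    lo = int(k)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (k - lo)


def risk_thresholds(mttc_values, conflict_limit=4.0):
    """Return (P15, P50, P85) of the MTTC values below the conflict limit."""
    conflicts = sorted(v for v in mttc_values if v < conflict_limit)
    return tuple(percentile(conflicts, p) for p in (15, 50, 85))


def risk_level(value, thresholds):
    """Map one MTTC value to a label (interval-to-label mapping is assumed)."""
    p15, p50, p85 = thresholds
    if value <= p15:
        return "high"
    if value <= p50:
        return "medium"
    if value <= p85:
        return "low"
    return "negligible"


if __name__ == "__main__":
    sample = [i / 10 for i in range(1, 60)]   # synthetic MTTC values, 0.1 .. 5.9 s
    th = risk_thresholds(sample)
    print(th, risk_level(0.3, th), risk_level(3.0, th))
```

Because the thresholds are computed per dataset, the same function can be applied separately to the light, moderate, and heavy rain samples to obtain scenario-specific thresholds.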
4. Simulation Analysis and Results Validation
4.1. Data Sources and Processing
This study utilizes a publicly available trajectory dataset from urban expressways in Wuhan [23] as the data source for model construction and validation. The selected study segment and prevailing rainy-weather traffic conditions are consistent with the urban expressway scenarios considered in the previous study [20] in which the rainy-weather IDM parameters were calibrated, supporting the applicability of the adopted parameter set in this context. The dataset includes operational data from multiple expressway networks, such as vehicle ID, time frames, vehicle pixel coordinates, speed, vehicle length, and travel direction. This paper focuses on a representative segment of a diverging area, as shown in Figure 2, which spans approximately 414 m and contains data from about 2400 vehicles.
In the data preprocessing stage, basic cleaning procedures were conducted prior to simulation, including the removal of abnormal records and entries with missing key variables, as well as temporal organization of the remaining data, among other operations, to ensure trajectory consistency and continuity. The dataset is combined with the features of the expressway diverging area to extract the initial car-following pairs. Using Python (version 3.8) for simulation experiments with a time step of 0.1 s, the IDM car-following model’s rainy weather parameters are applied to extract trajectory data for three weather scenarios on this segment. The extracted data includes traffic volume, average distance, minimum distance, average vehicle speed, standard deviation of vehicle speed, average headway time, minimum headway time, average acceleration, standard deviation of acceleration, and traffic density for each scenario.
Pearson correlation tests are conducted on the influencing factors to mitigate the potential negative effects of correlations between factors on the model’s predictive performance. Following previous studies [11], an absolute correlation coefficient |r| > 0.8 is taken to indicate a high correlation between two factors, in which case only one of them is retained as the representative variable. Although tree-based models such as LightGBM are generally robust to multicollinearity, excluding highly correlated features helps ensure a more concise input space and improves the stability and interpretability of subsequent SHAP-based analysis. This study performs the correlation tests in Python and conducts experiments for the light, moderate, and heavy rain scenarios separately.
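A correlation-based feature filter of this kind can be sketched as follows. The function names, the `priority` argument (the order in which variables are preferred when a pair is redundant), and the toy numbers are assumptions for illustration; they are not taken from the study's code or data.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def drop_correlated(features, priority, threshold=0.8):
    """Keep a feature only if |r| <= threshold with every feature already kept.

    features : dict mapping feature name -> list of values
    priority : feature names, most preferred first (assumed selection rule)
    """
    kept = []
    for name in priority:
        if all(abs(pearson(features[name], features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


if __name__ == "__main__":
    toy = {  # synthetic values, for illustration only
        "traffic_volume": [10, 20, 30, 40, 50],
        "traffic_density": [1.1, 2.0, 3.2, 3.9, 5.1],
        "speed_std": [3, 1, 4, 1, 5],
    }
    print(drop_correlated(toy, ["traffic_volume", "traffic_density", "speed_std"]))
```

Listing traffic volume ahead of traffic density in `priority` mirrors the choice of keeping volume as the representative flow indicator when the two are redundant.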
Taking the light rain scenario as an example, as shown in Figure 3, the correlation coefficients of three variable pairs (average headway time and average distance, minimum headway time and minimum distance, and traffic volume and traffic density) exceed 0.8, indicating strong correlations among several distance- and headway-related variables. To mitigate potential multicollinearity, this study retains one representative variable from each correlated pair. Given that traffic volume and traffic density are inherently redundant, traffic volume is retained as the representative macroscopic flow indicator. For the distance–headway variable pairs, distance-based indicators are retained under light rain conditions to characterize spatial safety margins, while the corresponding headway-time variables are excluded. Consequently, traffic density, minimum headway time, and average headway time are excluded from the influencing factors in the light rain scenario.
The Pearson correlation matrices for the moderate and heavy rain scenarios are presented in Appendix A. The final input variables selected for the three rainfall scenarios are summarized in Table 8.
- (3)
MTTC Threshold Division
After simulation, data cleaning, and verification, the trajectory data for light, moderate, and heavy rain are obtained, with 19,657, 15,013, and 14,601 records, respectively. According to the MTTC = 4 s threshold division logic outlined in Section 2.3, the final processed data are summarized in Table 9.
All MTTC values below 4 s in the conflict dataset are arranged in ascending order, and the percentile distribution method is applied to determine the risk classification thresholds for each rainfall scenario. Consistent with Section 2.3, the P15, P50, and P85 percentiles in each scenario are defined as the thresholds for high-, medium-, and low-risk levels, respectively. Furthermore, to balance the number of samples across class labels and prevent the model from overfitting to the overrepresented classes, undersampling is applied to the overrepresented categories. Undersampling can lead to information loss, since samples from the majority classes are discarded, but it was chosen here given the relatively small size of the dataset and the need to ensure sufficient representation of each class. Other methods, such as SMOTE or class weighting, could also be considered to handle the class imbalance; however, undersampling was preferred in this study to avoid introducing synthetic data or excessively complicating the model. Although this method reduces the overall sample size, it helps to improve the model’s generalizability by reducing the potential bias towards the majority class. Subsequently, each machine learning model is trained and tested on the undersampled datasets, which are randomly divided into training and testing sets at a 7:3 ratio. The results are presented in Table 10.
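The balancing and splitting steps can be sketched as below. The function names, the choice of downsampling every class to the size of the smallest class, and the fixed random seed are assumptions for illustration; the study does not state these details.

```python
import random
from collections import defaultdict


def undersample(samples, labels, seed=0):
    """Randomly downsample every class to the size of the smallest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    n_min = min(len(xs) for xs in by_class.values())
    balanced = [(x, y) for y, xs in by_class.items() for x in rng.sample(xs, n_min)]
    rng.shuffle(balanced)
    return balanced


def split_train_test(pairs, train_ratio=0.7, seed=0):
    """Randomly split (sample, label) pairs into training and testing sets."""
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    labels = ["low"] * 50 + ["medium"] * 30 + ["high"] * 20   # synthetic imbalance
    samples = list(range(len(labels)))
    balanced = undersample(samples, labels)
    train, test = split_train_test(balanced)
    print(len(balanced), len(train), len(test))
```

A stratified split would keep the class proportions identical in both subsets; the plain random split shown here follows the wording of the text.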
4.2. Comparison and Analysis of Prediction Models
This study calculates the accuracy of the three models for each weather scenario based on the risk confusion matrices generated by the prediction models. The aim is to identify the most suitable prediction model for each weather condition and to present the corresponding ROC curves.
- (1)
Accuracy Analysis of Models under Three Weather Scenarios
Following Section 3.1, the predictive performance of each model is evaluated mainly through three metrics: precision, recall, and F1-score.
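For reference, per-class precision, recall, and F1-score, together with overall accuracy, can be derived from a multi-class confusion matrix as sketched below. The function name and the example matrix are illustrative assumptions and do not reproduce any figures from this study.

```python
def classification_metrics(cm):
    """Per-class (precision, recall, F1) and overall accuracy.

    cm[i][j] is the number of samples whose true class is i and whose
    predicted class is j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    per_class = {}
    for k in range(n):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(n)) - tp   # predicted k, truly another class
        fn = sum(cm[k]) - tp                        # truly k, predicted another class
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        per_class[k] = (precision, recall, f1)
    accuracy = sum(cm[k][k] for k in range(n)) / total
    return per_class, accuracy


if __name__ == "__main__":
    # Synthetic 3-class matrix (rows: true low/medium/high; columns: predicted).
    example = [[8, 2, 0],
               [1, 6, 3],
               [0, 2, 8]]
    metrics, acc = classification_metrics(example)
    for k, (p, r, f) in metrics.items():
        print(f"class {k}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
    print(f"accuracy={acc:.2f}")
```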
- (a)
Light Rain Scenario
In the light rain scenario, the confusion matrix results of the three prediction models are shown in Figure 4. A detailed classification report of the model predictions is presented in Table 11.
As shown in Table 11, all three models perform relatively well in predicting the low-risk category. Among them, the LightGBM model achieves the highest overall accuracy at 84%, which is 4 percentage points higher than the XGBoost model and 15 percentage points higher than the RF model. In addition, for the prediction of the medium-risk category in the light rain scenario, the LightGBM model significantly outperforms the other models in both precision and recall. Therefore, the LightGBM model demonstrates superior overall predictive performance in the light rain scenario.
- (b)
Moderate Rain Scenario
The confusion matrix results for the moderate rain scenario are presented in Figure A3 in Appendix B. The prediction classification report for the three models is shown in Table 12.
As shown in Table 12, the overall prediction accuracy of the LightGBM model in the moderate rain scenario is 69%. In predicting the three risk levels, this model outperforms the other two models in both precision and recall. Specifically, for the low-risk category, its prediction precision reaches 78%, with an F1 score of 76%; for the high-risk category, the prediction precision is 74%, with an F1 score of 73%. Therefore, the LightGBM model demonstrates strong classification prediction capability in the moderate rain scenario.
- (c)
Heavy Rain Scenario
The confusion matrix results for the heavy rain scenario are presented in Figure A4 in Appendix B. The prediction classification report for the three models is shown in Table 13. It can be seen that the LightGBM model has the highest overall prediction accuracy at 76%, slightly higher than the 74% accuracy of the XGBoost model. In contrast, the RF model performs worse in this scenario, trailing the other two models by nearly ten percentage points. In the low-risk and high-risk category predictions, both the XGBoost and LightGBM models demonstrate relatively stable performance. However, in the medium-risk category prediction for the heavy rain scenario, the LightGBM model exhibits higher precision and recall. Overall, the LightGBM model shows a stronger ability to distinguish risk levels in the heavy rain scenario.
The detailed hyperparameter settings of the machine learning models are provided in Appendix C.
- (2)
ROC Curves of LightGBM under Three Weather Scenarios
Based on the comparison of model accuracies across the three weather scenarios, the LightGBM model consistently demonstrates the highest predictive performance. Therefore, this section presents the ROC curves of the LightGBM model under the three scenarios.
Based on the LightGBM prediction results for each risk category across the three scenarios, ROC curves are plotted with the true positive rate on the vertical axis and the false positive rate on the horizontal axis. The ROC curves illustrate the model’s discriminative ability under different classification thresholds, while the AUC represents the area under the ROC curve. A larger AUC indicates a stronger capability to distinguish between classes. The results are presented in Figure 5.
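Per-category ROC curves for a three-class model are commonly obtained with a one-vs-rest scheme, in which the predicted probability of one class is used as the score and all other classes are treated as negatives. The sketch below illustrates that idea under this assumption; the function names and toy probabilities are hypothetical, and the source does not state which multi-class ROC scheme was used.

```python
def roc_points(scores, is_positive):
    """(FPR, TPR) points obtained by sweeping the threshold from high to low.

    Assumes at least one positive and one negative sample.
    """
    pairs = sorted(zip(scores, is_positive), key=lambda p: -p[0])
    n_pos = sum(1 for _, pos in pairs if pos)
    n_neg = len(pairs) - n_pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Samples with tied scores cross the threshold together.
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def one_vs_rest_auc(probabilities, labels, k):
    """AUC for class k: class-k probability as score, other classes as negatives."""
    scores = [row[k] for row in probabilities]
    return auc(roc_points(scores, [y == k for y in labels]))


if __name__ == "__main__":
    # Synthetic class probabilities for classes 0 (low), 1 (medium), 2 (high).
    probs = [[0.8, 0.1, 0.1], [0.6, 0.3, 0.1], [0.2, 0.5, 0.3],
             [0.1, 0.2, 0.7], [0.3, 0.4, 0.3]]
    labels = [0, 0, 1, 2, 1]
    for k in range(3):
        print(k, round(one_vs_rest_auc(probs, labels, k), 3))
```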
Across the three scenarios, the LightGBM model demonstrates strong discriminative ability in predicting both low-risk and high-risk categories, with corresponding AUC values exceeding 0.85. Notably, in the light rain scenario, the AUC for the low-risk category reaches 0.97, reflecting a very strong ability to separate low-risk samples from the other categories. In comparison, the AUC values for the medium-risk category are slightly lower but remain above 0.75, indicating that the model maintains reasonably stable predictive capability for this category as well.
Specifically, the LightGBM model exhibits the strongest predictive capability for low-risk samples, slightly weaker performance for high-risk samples, and relatively lower accuracy for medium-risk samples. This can be attributed to two main factors. First, the medium-risk category represents an intermediate zone between low- and high-risk states, where some data are susceptible to interference from both extremes, resulting in blurred boundaries. Second, from a driving behavior perspective, drivers tend to maintain high attention in high-risk situations, while in low-risk states, they may adopt relatively more aggressive driving behaviors. Medium-risk conditions fall between these extremes, exhibiting greater driving variability. These two factors partially affect the model’s ability to accurately predict medium-risk samples.
Across the three weather scenarios, the model prediction results indicate that the LightGBM model maintains high accuracy and stability in risk category identification tasks, demonstrating strong robustness and generalization capability.
4.3. Risk Factors Analysis
This study integrates the LightGBM model with the SHAP value interpretation method to identify key influencing factors and reveal the underlying mechanisms of rear-end collision risk under light, moderate, and heavy rain scenarios. SHAP values are used to quantify the marginal contribution of each input variable to the model’s predictions across different risk levels, where a larger absolute value indicates a stronger influence on risk classification.
By combining the strong predictive capability of LightGBM with the transparent interpretability provided by the SHAP evaluation framework, this study achieves robust risk prediction while maintaining clear insight into the model’s decision logic. LightGBM effectively captures complex nonlinear relationships among traffic variables under varying rainfall conditions, whereas SHAP enables an intuitive assessment of the relative importance of different factors. This integrated framework therefore provides a reliable methodological basis for comparing dominant risk drivers across rainfall scenarios and analyzing the evolution of rear-end collision risk mechanisms.
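To make the notion of a marginal contribution concrete, the sketch below computes exact Shapley values for a single sample by enumerating all feature subsets, with absent features replaced by a baseline value. This brute-force enumeration with a single baseline is a simplification for illustration only; it is not the tree-based algorithm that SHAP tooling typically applies to LightGBM models, and the scoring function `toy_risk_score`, the feature ordering, and all numeric values are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict at x relative to a baseline input.

    A feature "present" in a coalition takes its value from x; an absent
    feature takes its value from baseline. Cost grows as 2**len(x).
    """
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Weight of a coalition of this size in the Shapley formula.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                with_i = [x[j] if (j in subset or j == i) else baseline[j]
                          for j in range(n)]
                without_i = [x[j] if j in subset else baseline[j]
                             for j in range(n)]
                phi[i] += weight * (predict(with_i) - predict(without_i))
    return phi


def toy_risk_score(features):
    """Hypothetical score: rises with short headway, speed, and volume."""
    headway, speed, volume = features
    return 2.0 / headway + 0.01 * speed + 0.001 * volume + 0.02 * speed / headway


# Hypothetical sample: [min headway time (s), avg speed (km/h), volume (veh/h)].
x = [1.0, 80.0, 1500.0]
baseline = [2.5, 60.0, 1000.0]
phi = shapley_values(toy_risk_score, x, baseline)
for name, value in zip(["min_headway_time", "avg_speed", "traffic_volume"], phi):
    print(f"{name}: {value:+.3f}")
```

The contributions sum to the difference between the score at `x` and the score at `baseline`, which is the additivity property that lets SHAP values be read as a decomposition of an individual prediction.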
Figure 6a–c present the SHAP-based importance distributions of the key influencing factors under light, moderate, and heavy rain scenarios, respectively.
In the light rain scenario, the headway time and average vehicle speed exhibit relatively high SHAP values. Among them, the minimum headway time accounts for the largest proportion in the high-risk category, indicating that under such conditions, the traffic environment remains relatively stable, and individual microscopic car-following behavior becomes the key factor influencing risk. Therefore, in the light rain scenario, it is essential to maintain a steady driving speed, minimize abrupt speed fluctuations, and enhance the capability of intelligent sensing systems to identify abnormally close car-following behavior, thereby guiding drivers to maintain a safe following distance.
In the moderate rain scenario, the minimum headway time emerges as the dominant influencing factor, far exceeding the impact of other variables. The average vehicle speed and traffic volume are identified as the next two key contributors. The underlying mechanism is that as rainfall intensity increases, the road friction coefficient decreases further, making the interaction between the extreme fluctuations of dynamic following distance and the macroscopic traffic flow state a critical determinant of risk. Therefore, in the moderate rain scenario, it is crucial to identify “short headway–high density” zones and implement multi-level smooth speed control measures on complex segments of expressways to ensure appropriate following distances.
In the heavy rain scenario, the average vehicle speed and the minimum headway time are identified as the two dominant influencing factors, while the effects of traffic volume and standard deviation of vehicle speed increase significantly. This reflects the model’s dual sensitivity to both macroscopic traffic flow conditions and microscopic speed fluctuations. Therefore, in the heavy rain scenario, it is essential to develop a wide-area risk early-warning system based on the coordinated identification of speed–flow disturbances, enabling the early detection of potential rear-end collision zones and the real-time guidance of critical road segments.
To further illustrate the evolution of dominant risk mechanisms across rainfall intensities, the top three influencing factors identified by SHAP under light, moderate, and heavy rain scenarios are summarized in Figure 7. The results indicate a clear shift in dominant risk drivers as rainfall intensifies. Under light rain conditions, rear-end collision risk is primarily governed by microscopic car-following behaviors, with minimum distance and speed-related factors playing a leading role. As rainfall increases to a moderate level, the importance of minimum headway time becomes more pronounced, accompanied by an increasing contribution of traffic volume, reflecting the growing influence of interactions between individual driving behavior and macroscopic traffic flow states. Under heavy rain conditions, speed-related factors and traffic flow disturbances jointly dominate risk prediction, highlighting the strengthened coupling between macroscopic flow instability and microscopic speed fluctuations. Overall, this transition reveals an evolutionary shift from microscopically driven risk mechanisms toward macro–micro coupled disturbances as rainfall severity increases.
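A top-three summary of this kind is commonly derived by ranking features on their mean absolute SHAP value over all samples in a scenario. The sketch below illustrates that ranking step under this assumption; the helper `top_k_features`, the feature names, and the SHAP values are hypothetical and do not reproduce the numbers behind Figure 7.

```python
def top_k_features(shap_rows, feature_names, k=3):
    """Rank features by mean |SHAP| across samples and return the top k.

    shap_rows[s][j] is the SHAP value of feature j for sample s.
    Returns a list of (feature_name, mean_abs_shap) in descending order.
    """
    n_samples = len(shap_rows)
    mean_abs = [
        sum(abs(row[j]) for row in shap_rows) / n_samples
        for j in range(len(feature_names))
    ]
    ranked = sorted(zip(feature_names, mean_abs),
                    key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Hypothetical SHAP values for three samples in one rainfall scenario.
feature_names = ["min_headway_time", "avg_speed", "traffic_volume", "speed_std"]
shap_rows = [
    [0.30, -0.10, 0.05, 0.01],
    [-0.25, 0.20, -0.02, 0.03],
    [0.35, -0.15, 0.04, -0.02],
]
for name, importance in top_k_features(shap_rows, feature_names):
    print(f"{name}: {importance:.3f}")
```

Taking the absolute value before averaging matters: a factor that pushes some samples toward high risk and others toward low risk would otherwise cancel out and appear unimportant.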
5. Conclusions
- (1) Based on three machine learning models (XGBoost, LightGBM, and RF), this study proposed a rear-end collision risk prediction model for urban expressways under rainy conditions. Across the three weather scenarios, the LightGBM model consistently achieved the highest prediction accuracy and demonstrated superior performance in terms of precision, recall, and overall accuracy, highlighting its strong predictive capability.
- (2) Using SHAP values, the study ranked the importance of factors influencing rear-end collision risk. In the light rain scenario, the minimum distance has the greatest impact; in the moderate rain scenario, the minimum headway time plays the most significant role; and in the heavy rain scenario, the average vehicle speed and minimum headway time exert the strongest influence.
- (3) According to different rainfall intensities, a hierarchical and categorized driving safety management strategy was proposed. Under the light rain scenario, the focus is on guiding drivers to maintain operational stability and safe headway distances; under moderate and heavy rain scenarios, the emphasis shifts to strengthening the coordinated control of “speed–distance–traffic volume.”
This study provides theoretical insights for the prediction and prevention of rear-end collision risks in urban expressway diverging areas under rainy conditions. However, given the complexity of the scenarios and data analysis, the present research considers only two-vehicle rear-end collisions and vehicle-related influencing factors. Future research can build upon the findings of this study by conducting more in-depth analyses of the vehicle-related factors associated with rear-end collisions under different rainfall scenarios. By integrating the identified dominant risk factors with improved IDM-based car-following models and intelligent connected vehicle strategies, subsequent studies may further develop targeted prevention and control approaches, ultimately contributing to enhanced traffic safety and operational efficiency on urban expressways.
Additionally, this study has several limitations. The rainy-weather vehicle trajectory data were generated through simulation using previously calibrated IDM parameters reported in the literature, rather than being directly collected under real rainy conditions. Although this approach enables a controlled analysis of rear-end collision risk under different rainfall intensities, it may limit the external validity of the findings when applied to other road segments or traffic environments. Future research will incorporate real-world rainy-weather trajectory data and multi-site validation to further examine the generalizability of the proposed framework.
Moreover, the proposed model was specifically calibrated for urban expressway areas, which have unique traffic dynamics compared to other road types such as highways. Therefore, applying the model to highways or other road segments would require recalibration of the IDM parameters, and further validation is necessary to assess its performance and applicability in these contexts.