Article

Boosting Traffic Crash Prediction Performance with Ensemble Techniques and Hyperparameter Tuning

by
Naima Goubraim
1,*,
Zouhair Elamrani Abou Elassad
1,2,
Hajar Mousannif
1 and
Mohamed Ameksa
1
1
LISI Laboratory, Faculty of Science Semlalia, Cadi Ayyad University, Marrakesh 40000, Morocco
2
Data Science Lab, Faculty of Science and Information Technology, Daffodil International University, Birulia 1216, Bangladesh
*
Author to whom correspondence should be addressed.
Safety 2025, 11(4), 121; https://doi.org/10.3390/safety11040121
Submission received: 29 September 2025 / Revised: 26 November 2025 / Accepted: 27 November 2025 / Published: 9 December 2025
(This article belongs to the Special Issue Road Traffic Risk Assessment: Control and Prevention of Collisions)

Abstract

Road traffic crashes are a major global challenge, resulting in significant loss of life, economic burden, and societal impact. This study seeks to enhance the precision of traffic accident prediction using advanced machine learning techniques. This study employs an ensemble learning approach combining the Random Forest, the Bagging Classifier (Bootstrap Aggregating), the Extreme Gradient Boosting (XGBoost) and the Light Gradient Boosting Machine (LightGBM) algorithms. To address class imbalance and feature relevance, we implement feature selection using the Extra Trees Classifier and oversampling using the Synthetic Minority Over-sampling Technique (SMOTE). Rigorous hyperparameter tuning is applied to optimize model performance. Our results show that the ensemble approach, coupled with hyperparameter optimization, significantly improves prediction accuracy. This research contributes to the development of more effective road safety strategies and can help to reduce the number of road accidents.

1. Introduction

1.1. Problem Definition

Every year, the traffic accident rate rises and the number of people killed or injured grows more alarming. A concerning trend has been identified by the World Health Organization (WHO) in Africa, whereby road traffic deaths have increased significantly over the past decade. In 2021 alone, nearly 250,000 lives were lost on African roads, which contrasts with a global decline of 5% in road traffic fatalities during the same period. A 17% increase in road traffic fatalities was recorded in the Region between 2010 and 2021, as indicated in the WHO’s status report on road safety in the African Region in 2023 (the most recent update). The Region accounts for approximately 20% of all road traffic fatalities globally, despite representing only 15% of the world’s population and 3% of the world’s vehicles [1].
In 2022, Morocco witnessed a considerable number of road accidents, as reported by the National Road Safety Agency of Morocco, which documented a total of 113,625 incidents. It is regrettable that these accidents resulted in the deaths of 3499 individuals. A thorough examination of the data reveals a worrisome trend. A total of 1629 fatalities occurred in non-urban areas. The male population was disproportionately impacted, constituting approximately 85% of the total number of casualties (2971). In contrast, women represented a comparatively smaller proportion, amounting to approximately 15% (515) [2].
A closer examination reveals that a considerable proportion of the fatalities involved young people. A total of 889 victims were under the age of 25, with a significant proportion being very young (281 under the age of 15 and 608 between the ages of 15 and 24) [2].
Furthermore, the type of vehicle involved also presents a cause for concern. A considerable number of fatalities were caused by motorized two- or three-wheeled vehicles, with motorcycles being the most prevalent type involved (1321). A further vulnerable group was pedestrians, accounting for 888 fatalities (25.4%). It is noteworthy that over a fifth of these pedestrian deaths (189) involved individuals aged 65 or over. Furthermore, 801 deaths were recorded among passenger car users [2].

1.2. Related Works

In [3], the authors examine global road accident causes and propose solutions through a taxonomy of traffic accident analysis. Their findings indicate that 90% of studies focus on human factors as primary causes, which they categorize into five groups: driving environment, routine activities, habits, demographics, and technical issues. The authors analyze each category and identify key mechanisms involved, providing an overview of current research developments and existing gaps in the field.
Researchers in [4] presented a systematic literature review of 120 studies (2010–2020) to investigate data acquisition methods and tools for analyzing driving behavior. The methodology involves systematically categorizing data collection approaches and measuring their prevalence. The findings reveal that in-vehicle and IoT sensors are the predominant tools, used in approximately 67% of studies, with vehicle-related data being most commonly collected. The authors identify a research gap regarding driver-specific data and recommend that future studies investigate driver, vehicle, and environment data as an integrated system.
In [5], the authors employed deep neural networks and computer vision techniques to extract clustering and depth information from images, thereby simulating driver perception. The researchers introduced visual metrics and coordinate transformations to predict speeding-related accidents using ensemble models, namely Random Forest (RF), adaptive boosting (AdaBoost), and XGBoost. The XGBoost model exhibited superior performance. Machine learning interpretability methods were then used to gain insight into the factors that contribute to speeding-related accidents.
The objective is to reduce the incidence of speeding-related accidents by identifying the principal causes and predicting them in real time. The development of accurate prediction models is of paramount importance for the implementation of proactive traffic safety management strategies. However, the presence of an imbalanced traffic dataset, comprising a greater number of non-crash than crash cases, represents a significant challenge.
While previous studies have primarily employed under-sampling techniques, such as matched case–control, to balance datasets, this approach may potentially result in the loss of valuable information. Recently, research has focused on the exploration of over-sampling methods, with the use of SMOTE being a popular choice for the augmentation of crash events.
The model developed by authors in [6] represents a significant advancement in the field of proactive traffic management systems, as it offers a valuable tool for real-time crash prediction. Over time, a number of models have been proposed for predicting the potential for road traffic accidents. These have yielded encouraging results through the utilization of input data from roadside detectors. The ability to anticipate the risk of a crash in real time is beneficial for the management of traffic incidents. It provides practitioners with essential information, enabling them to allocate resources proactively in response to predicted incidents.
In the study conducted by the authors in [7], the influence of various environmental factors on the risk of road traffic accidents was investigated. These factors included weather conditions, the condition of the road surface, and visibility. The amount of data used in this study was limited. Furthermore, despite its potential, it has proven challenging to identify a reliable and time-efficient data source for real-time crash prediction in work zones. In the interim, alternative modelling strategies may be explored to enhance the accuracy of real-time crash prediction.
In [8], the authors compared a number of machine learning and deep learning models with the objective of predicting real-time traffic crashes. The researchers employed real-time data from a Greek tollway to divide the data into training and testing sets, subsequently evaluating the models’ performance using accuracy, sensitivity, specificity, and AUC. The deep learning (DL) model demonstrated superior performance, exhibiting balanced results across all metrics, with a total accuracy of 68.95%, sensitivity of 0.521, specificity of 0.77, and AUC of 0.641. Notably, the simpler Naive Bayes model also exhibited commendable performance, despite its less complex structure. This research offers valuable insights into the performance of machine learning (ML) and DL models for predicting real-time traffic crashes.
In [9], the authors make a valuable contribution to the field of road safety by examining the vulnerability of inexperienced drivers during inclement weather, specifically during rainy nights. To do so, they analyze key driving parameters such as throttle, brake, and wheel positions. The researchers employed a tree-based machine learning framework and implemented rigorous feature selection and data balancing techniques to address dataset imbalance. Notably, their Random Forest model demonstrates superior performance in predicting accident severity, significantly outperforming existing state-of-the-art studies. This substantial improvement underscores the model’s considerable potential in enhancing road safety measures and advancing accident prevention strategies.
In their study [10], the authors addressed two major challenges in real-time crash likelihood prediction: missing data caused by sensor failures and the strong imbalance between crash and non-crash events. They evaluated three dimensionality reduction-based methods for estimating missing values, including a least-squares method, a probabilistic method, and a variational Bayesian method, and compared them with traditional approaches such as mean substitution and k-means-based estimation. Their results showed that the probabilistic and variational Bayesian approaches produced the most accurate reconstructions of missing values and also improved the performance of the prediction models. To handle class imbalance, the authors applied cost-sensitive learning and the Synthetic Minority Oversampling Technique, both of which increased the model’s ability to detect crash-prone cases by forcing the classifiers to give more weight to the minority class. Overall, the study demonstrated that accurate estimation of missing data combined with imbalance-aware learning significantly enhances the effectiveness of real-time crash prediction models.
In [11], the authors use a desktop driving simulator to collect real-time data (wheel angle, throttle pedal, and brake pedal positions) from experienced and novice drivers under snow and rain conditions. They employ Multilayer Perceptron (MLP), Support Vector Machine (SVM), and Bayesian Networks (BN) to analyze crash events. This methodology addresses the combined effect of weather conditions and driving experience on crash occurrence, demonstrating superior performance and providing valuable insights for developing crash avoidance and warning systems.
Driving simulators have become increasingly popular in road safety research since the 2000s. They are valuable tools because researchers can control several factors in driving situations, easily repeat experiments, and avoid putting people at risk. In [12], driving simulators have been used to study real-world accidents and to suggest ways to improve safety. The simulators are particularly useful for confirming the causes of accidents identified through data analysis.
In [13], the authors conducted a comparative analysis of three prominent optimization algorithms with the objective of enhancing prediction accuracy. The algorithms in question were Stochastic Gradient Descent (SGD), Adaptive Moment Estimation (Adam), and Root Mean Square Propagation (RMSprop). The findings revealed that SGD exhibited the highest accuracy, at 89.2%, followed by Adam at 88.2% and RMSprop at 87.9%. It was observed that modifying the parameters of these optimizers had a negligible impact on the results. The model in that study employs Adam with a learning rate of 0.1.
In [14], the authors investigated freeway crash prediction and severity classification using real-world crash data from Flint, Michigan. The study employed a Boosting Ensemble Learning framework incorporating Gradient Boosting, CatBoost, XGBoost, LightGBM, and Stochastic Gradient Descent (SGD) models. Through systematic hyperparameter optimization, the predictive accuracy of these models was further enhanced, achieving up to 96% accuracy in crash classification. The findings demonstrated that Boosting-based ensemble models can effectively capture the complex relationships between variables such as speed, braking, and weather conditions, offering a reliable approach to traffic crash prediction. This research highlights the importance of ensemble diversity and parameter tuning, reinforcing the methodological choices made in the present study.
In the study [15], the authors addressed the challenge of predicting highly imbalanced traffic accident events by introducing a predictive framework called ReMAHA–CatBoost. They argued that predicting only the occurrence of accidents is insufficient and that classifying the severity of predicted accidents provides more actionable information for authorities. Using the US-Accidents dataset, which exhibits an extreme imbalance ratio of up to 91.40 to 1 across severity levels, the authors evaluated the ability of their framework to handle this imbalance. Their experimental results showed that the proposed model achieved strong predictive performance despite the severe class imbalance, demonstrating its effectiveness for accident severity prediction in real-world traffic datasets.
In [16], the authors present a systematic analysis of driving errors (DE) through a literature review of empirical research published between 2010 and 2020. They provide a comprehensive definition of DE and examine the application of machine learning (ML) techniques in this domain. The findings reveal that decision-making errors and recognition errors are the most prevalent types among drivers. Furthermore, the authors note that ML models, particularly those based on artificial neural networks and regression approaches, have gained significant traction in recent years for analyzing driving errors, with most models utilizing real-world data obtained through external measurement instruments or survey questionnaires.
In [17], the authors explored the prediction of cerebral stroke using ensemble-based machine learning algorithms, including Random Forest (RF), Extreme Gradient Boosting (XGBoost), and Light Gradient Boosting Machine (LightGBM). The study emphasized the role of data preprocessing through KNN imputation, outlier elimination, one-hot encoding, and feature normalization, followed by the application of the Synthetic Minority Oversampling (SMO) technique to balance class distribution. Hyperparameter tuning was performed using a random search approach, and the optimized parameters were combined into a stacked ensemble framework named RXLM. The experimental results showed promising predictive performance across all algorithms, demonstrating the advantages of ensemble learning and parameter optimization for improving model robustness and accuracy. These findings align with the objectives of the present study, which integrates similar ensemble and optimization strategies for effective crash prediction and risk assessment.
In [18], the authors proposed a machine learning approach using LightGBM with Optuna, a hyperparameter optimization framework designed for machine learning, to predict accident severity on United States roads from 2016 to 2020. By analyzing a large dataset of 458,000 records and 126 features, the model achieved an accuracy of 0.68, a ROC-AUC (Receiver Operating Characteristic-Area Under the Curve) of 0.90, a precision of 0.67, a recall of 0.68, and an F1 score of 0.67. This efficient and interpretable model can assist traffic authorities in identifying high-risk areas and implementing targeted safety measures, ultimately improving road safety.
In [19], the authors addressed network traffic classification challenges using multiple ensemble learning algorithms. The study compared five ensemble methods—Random Forest, Extra Trees, Gradient Boosting Tree, Extreme Gradient Boosting Tree, and Light Gradient Boosting Model—applied to QUIC (Quick UDP Internet Connections) protocol traffic classification. While traditional machine learning models suffered from overfitting and deep learning required extensive parameter tuning, the ensemble-based techniques demonstrated superior reliability, predictive performance, and robustness. Among the tested models, Extreme Gradient Boosting and LightGBM achieved the best overall accuracy, confirming that ensemble frameworks can provide more stable and generalizable solutions in complex data environments. These findings support the adoption of similar ensemble strategies in this research to enhance crash prediction accuracy and reduce model bias.
In [20], the authors demonstrated that it is possible to predict pedestrian fatalities in road traffic accidents using machine learning models. They employed three models in their study: Support Vector Machine (SVM), Ensemble Decision Trees (EDT) and KNN. These models were optimized using Bayesian optimization, a hyperparameter tuning technique. While KNN saw the most significant improvement in accuracy, SVM and EDT ultimately outperformed it. These findings suggest that optimized machine learning techniques can be valuable tools for enhancing pedestrian safety on roads.
Ensemble-based methods have also demonstrated superior predictive power in other complex domains, further validating their robustness for handling imbalanced and multidimensional data. For example, the authors in [21] applied multiple ensemble algorithms—including Random Forest, Decision Tree, Gradient Boosting, AdaBoost, Extra Trees, Logistic Regression, and LightGBM—to predict skin cancer and analyze survival outcomes using multi-omics data. The study compared various ensemble strategies, such as stacking, bagging, boosting, and voting, achieving a remarkable accuracy of 99% with the voting and Random Forest models. Although conducted in a biomedical context, the findings reinforce the effectiveness of ensemble learning frameworks in managing heterogeneous datasets and improving predictive reliability, which aligns with the objectives of the present research.
Recent research has continued to explore ensemble learning approaches for traffic accident prediction. For instance, the authors in [22] proposed a two-layer ensemble model combining several supervised learning algorithms—k-Nearest Neighbors, AdaBoost, Naïve Bayes, and Decision Trees—integrated through a stacking framework using logistic regression as a meta-classifier. The study also employed SMOTE to address data imbalance and Particle Swarm Optimization (PSO) for feature selection, achieving notable performance with an accuracy of 88%, an F1 score of 83%, and an AUC of 86%. This work further supports the effectiveness of hybrid ensemble architectures and feature optimization techniques in improving predictive accuracy for crash risk assessment.
In [23], the authors developed predictive analytics for improving port operations and maritime logistics, focusing on the estimation of vessel stay and delay times. The study employed supervised learning and tree-based algorithms to model complex, uncertainty-driven variables such as weather conditions, cargo type, and port dynamics. Among the tested models, tree-based approaches demonstrated the best predictive performance, providing interpretable results that supported decision-making for port scheduling and resource optimization. Although the context differs from traffic safety, this work underscores the versatility of tree-based and ensemble models in addressing uncertainty and improving operational efficiency across various domains, which aligns with the methodological approach adopted in the present study.
Recent advances in transport-safety modelling have placed greater focus on cost-sensitive learning and evaluation metrics tailored to imbalance, such as Precision-Recall AUC (PR-AUC). For example, in [24] the authors explored crash-injury severity prediction under imbalanced class distributions, finding that standard accuracy metrics hide performance drops for minority classes.
Similarly, the study [25] applied an ensemble of Random Forest, XGBoost, LightGBM, and CatBoost to real crash records from New South Wales, Australia, and emphasized balanced sampling and broader evaluation beyond ROC-AUC. These works align with the methodological emphasis of the present study on handling imbalance, optimising ensemble methods, and applying robust metrics for crash-prediction performance.
In the work [26], the authors developed an AI-driven machine learning framework for traffic crash severity prediction using a large-scale traffic dataset. They integrated human, crash-specific, and vehicle-related factors, applied clustering techniques such as K-Means and HDBSCAN, and addressed class imbalance with oversampling methods including Random OverSampler, SMOTE, Borderline-SMOTE, and ADASYN. Feature selection was performed using Correlation-Based Feature Selection and Recursive Feature Elimination. Among the tested classifiers, the Extra Trees ensemble model achieved the best performance. The framework provides a scalable, AI-powered solution for traffic safety, supporting intelligent transportation systems and accident-prevention strategies.
The study [27] addresses class imbalance in Jordanian traffic accident data (2009–2011) by implementing three balancing techniques: under-sampling, oversampling, and a mixed approach. The authors apply these methods with Bayes classifiers to predict accident severity. Results demonstrate that oversampling combined with Bayesian Networks significantly improves classification accuracy for severe and fatal injuries, representing the first application of data balancing techniques to analyze traffic accident severity in Jordan.

1.3. Research Gap

Despite the increased number of studies on traffic crash prediction using machine learning, key limitations remain in the literature. The majority of research has used single algorithms, such as Random Forest, XGBoost, and Support Vector Machines, in isolation, without considering the combined, synergistic capabilities of ensemble-based methods that unify Bagging and Boosting techniques. Although ensemble learning has demonstrated great potential in many fields, its potential for traffic crash prediction, particularly when combining heterogeneous models, is understudied.
Additionally, class imbalance, which is inherent in crash datasets due to the low occurrence of accident events compared to non-crash cases, persists and affects predictive performance. Although methods such as the Synthetic Minority Over-sampling Technique (SMOTE) have been proposed to reduce this problem, they are often used without considering feature selection simultaneously. This can result in the retention of irrelevant or redundant variables, which impacts model reliability and transparency. Furthermore, although some recent studies have incorporated hyperparameter tuning to improve model performance, these efforts are typically confined to individual models and lack a systematic, comparative approach across multiple ensemble classifiers. This gap is exacerbated by the lack of consolidation of preprocessing procedures—balancing, feature selection, and tuning—into a coherent, streamlined pipeline for crash prediction problems. Together, these shortcomings reveal a significant gap in the literature: there are no comprehensive frameworks that address data imbalance, feature relevance, and hyperparameter optimization simultaneously within a multi-algorithm ensemble learning paradigm specifically designed for traffic crash prediction. This research addresses this gap by presenting an integrated predictive framework that combines various ensemble methods, feature selection, and data balancing methodologies. These are optimized using rigorous hyperparameter tuning. This integrated approach aims to advance the state of the art in predictive modeling for road traffic safety.

1.4. Objective and Contributions

Research Path and Novelty Statement:
This study follows a structured research path: problem, goal, tasks, novelty, and practical significance.
Problem: The problem is that conventional crash prediction models often have three persistent limitations: (i) class imbalance between crash and non-crash events, (ii) redundant or irrelevant features that reduce interpretability, and (iii) insufficient optimization of model parameters, which weakens predictive reliability.
Goal: The goal of this research is to develop a robust, generalizable, ensemble-based framework that can enhance the accuracy and interpretability of traffic crash prediction.
Tasks: To achieve this goal, the study will undertake the following tasks: (1) perform systematic feature selection using the Extra Trees Classifier to retain the most informative predictors, (2) address class imbalance using the Synthetic Minority Over-sampling Technique (SMOTE), (3) implement rigorous hyperparameter optimization using Grid Search, and (4) conduct a comparative evaluation of four ensemble learning algorithms: Random Forest, Bagging Classifier, XGBoost, and LightGBM.
Novelty: The novelty of this work lies in integrating these three methodological components—feature selection, class balancing, and parameter optimization—into a single, coherent predictive framework that is specifically applied to driving simulator–generated crash data. Unlike prior studies that applied these methods in isolation, this unified approach provides a new level of methodological consistency and robustness in crash prediction research.
Practical significance: The resulting framework offers road safety authorities and policymakers a reliable decision-support tool, enabling them to proactively identify crash-prone scenarios and develop more effective accident prevention strategies.
The primary objective of this study is to develop a robust and accurate framework for predicting traffic crashes by integrating advanced machine learning techniques with systematic data preprocessing and model optimization strategies. Due to the complexity of crash prediction and the associated challenges, such as class imbalance, high-dimensional data, and varying feature relevance, our work aims to address these issues through a multi-step methodological approach.
To this end, the study makes the following key contributions:
  • Simulation-based data collection for crash analysis:
Recognizing the limitations of observational data, we used a high-fidelity driving simulator to create comprehensive, controlled traffic scenarios. Simulators are increasingly utilized in transportation safety research due to their ability to replicate critical driving conditions without endangering human subjects. This setup allowed us to manipulate variables such as traffic density, weather, and driver behavior. Consequently, we produced a diverse dataset suitable for model training and evaluation under varied crash-related contexts.
  • Feature Selection Using the Extra Trees Classifier:
To reduce model complexity and enhance predictive performance, we applied a feature selection process based on the Extremely Randomized Trees classifier (Extra Trees). This ensemble-based method effectively identifies the most informative features by assessing their importance across multiple randomized decision trees. By removing irrelevant or redundant features, we improved model interpretability and computational efficiency.
  • Handling Class Imbalance with SMOTE:
Since traffic crash datasets are usually characterized by severe class imbalance, where non-crash instances far exceed crash events, we used the Synthetic Minority Over-sampling Technique (SMOTE). SMOTE synthetically generates new instances of the minority class, balancing the dataset and mitigating model bias. This enhances the ability to correctly identify crash-prone conditions.
  • Ensemble Learning for Crash Prediction:
The predictive modeling phase of this study involves a comparative evaluation of several ensemble learning algorithms, including Random Forest, Bagging Classifier, Extreme Gradient Boosting (XGBoost), and Light Gradient Boosting Machine (LightGBM). These models were selected due to their proven effectiveness in handling structured data and their capacity to capture nonlinear relationships among features relevant to traffic crash risk.
  • Hyperparameter Optimization via Grid Search:
To ensure optimal model performance, we implemented an exhaustive hyperparameter tuning process using grid search. Grid Search systematically evaluates model performance across combinations of parameter values to identify the configuration that yields the highest predictive accuracy. This step is critical in avoiding sub-optimal model settings that could compromise the reliability of crash predictions.
In summary, this study proposes an end-to-end, empirically grounded framework for traffic crash prediction. By integrating simulation-based data generation, feature selection, class balancing, ensemble learning, and hyperparameter optimization, our approach bridges significant gaps in existing literature and propels the development of intelligent, data-driven road safety solutions.

1.5. Outline

The remainder of this manuscript is structured as follows. Section 2 describes the driving simulator and the data collection and preparation process. Section 3 presents the methodology, covering feature selection, data balancing, the models employed for crash prediction, and hyperparameter optimization, and compares their performance using a range of metrics. The results and discussion are then presented, followed by the conclusion.

2. Data Collection and Preparation

2.1. Driving Simulator

Driving simulators have emerged as a well-established instrument in the realm of driver behavior investigation, presenting a manageable, secure, and economical substitute for actual on-road driving experiences [28].
An ongoing challenge in utilizing driving simulators for research purposes has been the ability to replicate driving behavior that accurately mirrors real-world driving. Recent research findings lend substantial credence to the viability of employing driving simulators in various studies.
The University of Cadi Ayyad (UCA) facility offers a fixed-base driving simulator that was used for the investigation. Driving simulator studies have the significant benefit of modeling behavior in a secure setting with complete experimental control over the terrain, weather, traffic, and other factors that might affect driving [29].
The data employed in this study were procured from the simulator housed in the LISI Laboratory (Computer Systems Engineering Laboratory) of Cadi Ayyad University, at the Faculty of Science Semlalia in Marrakesh. Figure 1 shows the structure of our driving simulator and explains how it collects data. Our simulator has the characteristics listed in Table 1. The experiments were conducted using a MacBook Pro 2015 workstation equipped with an Intel Core i7 processor, 16 GB RAM, and the macOS operating system. A Logitech G27 steering wheel and PlaySeat cockpit were employed to replicate realistic driving interactions. The simulation was developed using Project CARS 2, developed by Slightly Mad Studios, which provides a high-fidelity physics engine, diverse weather conditions, and configurable road environments. This setup ensured controlled and reproducible driving conditions for data collection.
This simulator offers a variety of guided simulation methods for the assessment of dangerous behaviors that can result in accidents. Previous research has employed four primary evaluation metrics for crash events, although they are seldom integrated. These metrics encompass physiological signals, driver actions, vehicle dynamics, and weather conditions [30].
In our study, we used the Project CARS 2 driving simulator, developed by Slightly Mad Studios, to perform high-fidelity driving experiments in a controlled environment. This commercial-grade simulation platform is widely recognized for its realistic vehicle dynamics, immersive graphics, and configurable driving scenarios, which make it suitable for behavioral and safety research. The simulator supports dynamic weather, day/night cycles, and detailed telemetry output. This allows researchers to replicate complex road environments and capture granular driving behavior data. The hardware setup included a Logitech G27 racing wheel and pedal set combined with a PlaySeat Evolution cockpit to ensure a realistic and responsive driving experience. Several peer-reviewed studies have adopted Project CARS 2, which underscores its validity and effectiveness in transport research and crash analysis contexts. Table 2 describes in detail key features and the advantages of this simulator.
This study involved 81 experienced young drivers using a driving simulator. Participants were required to navigate a fixed-distance virtual urban route under a variety of weather conditions (clear, fog, rain, snow) and traffic scenarios. Data on vehicle telemetry, driver inputs, and environmental conditions were collected at a 20 Hz sampling rate. Table 3 gives a summary of the driving simulation study. The objective of the study was to identify factors influencing driving behavior and crash occurrences by analyzing the collected data [15].

2.2. Scenario and Data Acquisition

The participants were instructed to carry out a virtual driving task in a tranquil laboratory setting. Each participant drove along a simulated two-lane urban road extending 16.5 km, which typically required approximately 11 min to traverse when adhering to established speed limits. The experiment was conducted during daylight hours and incorporated a range of weather conditions, including clear skies, fog, rain, and snow, to emulate the multifaceted driving environment and collect data on how these factors impact driver behavior, particularly in pre-crash scenarios [15].
Upon arrival, participants were required to sign an informed consent form and complete a questionnaire detailing their demographics and recent activities. The experimental process involved two simulator sessions. The initial session constituted a preparatory exercise intended to assist participants in acclimatizing to the virtual environment and its controls. The second session constituted the primary trial, which incorporated numerous driving hazards, including interactions with other vehicles along the designated route.
The driving environment incorporated authentic landmarks and infrastructure, emulating urban roadways. Drivers were instructed to adhere to traffic regulations, including signals and signs. Throughout the session, data were recorded continuously at a frequency of 20 Hz using the User Datagram Protocol (UDP). The simulator captured telemetry data (e.g., speed, acceleration), driver inputs (e.g., throttle, braking, steering), and environmental factors such as weather. The meteorological conditions were recorded as a categorical variable, with categories including clear, foggy, rainy, and snowy, while all other variables were continuous. The occurrence of a crash was noted as a binary outcome (1 = crash, 0 = no crash).
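For illustration, the following minimal Python sketch shows how such a 20 Hz UDP telemetry stream could be received. The port number, buffer size, and the decision to store raw datagrams for later decoding are assumptions made for this example and do not reflect the exact acquisition software used in the study.
import socket
import time

UDP_PORT = 5606            # assumed port; must match the simulator's UDP configuration

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("", UDP_PORT))

records = []
while len(records) < 1200:                 # roughly one minute of data at 20 Hz
    payload, _ = sock.recvfrom(2048)       # one raw telemetry datagram
    # Decoding the payload depends on the simulator's packet specification,
    # so only the arrival time and raw bytes are stored here for later parsing.
    records.append((time.time(), payload))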
In order to simulate realistic driving conditions, the virtual environment was designed to include potential hazards and near-collision situations in which road users came into close proximity, thereby creating moments of risk. These configurations were instrumental in the modeling of critical crash-related behaviors, including lane changes, turn-taking, and signaling. The hazard events were randomly generated to ensure unpredictability and to reduce the possibility of participants adapting to the scenarios over time. The absence of repetition in the scenarios ensured the preservation of the novelty and realism intrinsic to each session, despite the constant nature of the road layout.
The simulation software also incorporated decoy scenarios, defined as events that manifested as hazardous but ultimately did not culminate in a crash. For instance, a driver may reduce speed due to the sudden braking of the vehicle preceding them or an apparent alteration in trajectory that is subsequently retracted. These false alarms have been shown to encourage drivers to maintain situational awareness and scan their surroundings, thereby emulating real-world vigilance behaviors.
Furthermore, the simulator incorporated road users that remained concealed behind virtual structures, such as buildings, until they were prompted to appear and move, thereby impeding the driver’s capacity to anticipate their actions. This approach introduced further variability and realism to the simulation.
The total number of data samples recorded for each weather condition was approximately 75,900, amounting to a total of 303,600 data points. Crash events were infrequent, constituting approximately 8% of the data (5951 events). This allowed for the analysis of both typical and crash-prone driving behavior under various environmental and situational factors.

2.3. Data Description

The data used in this study contain 16 features, as described in Table 4. We group the features into three categories: features related to vehicle kinematics, features derived from driver inputs, and a single feature related to environmental conditions, called Weather Season [15].

2.4. Data Preprocessing

Categorical Variable Encoding
The categorical variable Weather Season was transformed into one-hot encoding. This process resulted in the creation of binary dummy variables representing each season. This preprocessing step ensures that all models receive a consistent numeric feature representation. While certain tree-based algorithms (e.g., Random Forest, Bagging Classifier) can handle categorical features implicitly, one-hot encoding was applied uniformly to maintain comparability across all ensemble models, including XGBoost and LightGBM, which require purely numerical input.
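A minimal sketch of this encoding step is shown below, assuming the recorded data are loaded into a pandas DataFrame with a categorical column named "Weather Season" (the file and column names are illustrative, not the study's actual artifacts).
import pandas as pd

# Hypothetical file containing the simulator records
df = pd.read_csv("driving_simulator_data.csv")

# One binary dummy variable per season (clear, fog, rain, snow), so that all
# models, including XGBoost and LightGBM, receive purely numeric input
df = pd.get_dummies(df, columns=["Weather Season"], prefix="season")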

3. Methodology and Experience

3.1. Feature Selection

The reduction in dimensionality can be achieved through the process of feature selection, whereby specific input dimensions are chosen that carry pertinent information for the resolution of a given problem. Feature extraction, in contrast, is a more expansive approach that involves the creation of a transformation of the input space into a lower-dimensional subspace while retaining the majority of the essential information [31].
Feature selection is a valuable technique in data analysis that offers several advantages [31]. By reducing the dimensionality of the feature space it simplifies the data and speeds up machine learning algorithms. This not only improves computational efficiency but also helps to eliminate redundant or irrelevant features, resulting in cleaner and more focused data. By focusing on the most informative features, feature selection can improve the accuracy and predictive power of models. It can also help to conserve resources by reducing the need for extensive data collection and storage. In addition, feature selection can aid data understanding by providing insight into the underlying patterns and relationships within the data. Overall, feature selection is a powerful tool that can significantly improve the quality and efficiency of data analysis tasks.
This paper employs the Extra Trees Classifier, which utilizes tree-based supervised models to evaluate the importance of individual features. The method employs multiple decision trees for feature selection, with the number of trees being a configurable parameter. The relative significance of each attribute is determined by aggregating the decision trees, and those deemed less relevant are eliminated [32].
Although the dataset in this study contained a moderate number of features, the variables represented diverse dimensions, including driver behavior, vehicle dynamics, and environmental conditions, which may exhibit multicollinearity or varying predictive strength. The Extra Trees Classifier was employed not solely to reduce dimensionality but to obtain a robust estimate of feature importance and to identify redundant or weakly contributing attributes. This process improves both the interpretability and stability of the predictive framework by ensuring that ensemble models focus on the most relevant factors influencing crash likelihood. As a result, the subsequent training phase benefits from reduced noise and enhanced generalization performance.
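A minimal sketch of this selection step is given below, assuming scikit-learn and pre-split training and test arrays (X_train, y_train, X_test); the number of trees and the mean-importance threshold are illustrative choices rather than the study's exact settings.
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

# Fit the Extra Trees model used to estimate feature importance
importance_model = ExtraTreesClassifier(n_estimators=100, random_state=42)
importance_model.fit(X_train, y_train)

# Keep features whose impurity-based importance exceeds the mean importance
selector = SelectFromModel(importance_model, threshold="mean", prefit=True)
X_train_selected = selector.transform(X_train)
X_test_selected = selector.transform(X_test)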

3.2. Data Balancing

The Synthetic Minority Over-Sampling Technique, or SMOTE, was created to solve the problem of class imbalance in machine learning datasets. In a dataset, class imbalance happens when one class—typically the minority class—is underrepresented in relation to the other classes. As a result, models may become biased and perform badly when predicting the minority class.
The application of the Synthetic Minority Over-sampling Technique (SMOTE) has been identified as a potential means of increasing the representation of underrepresented classes. This approach was inspired by a successful method previously employed in the field of handwritten character recognition [33]. SMOTE works by generating synthetic samples for the minority class.
The algorithm works in 5 steps:
  • Identify the minority class: Determine which class has fewer instances in the dataset.
  • Select a minority instance: Randomly select an instance from the minority class.
  • Find nearest neighbors: Identify the k nearest instances to the selected instance. The value of k is specified by the user.
  • Generate synthetic instances: Create new synthetic instances by linearly interpolating feature values between the selected instance and its neighbors. This introduces new instances into the minority class feature space.
  • Repeat: Continue this process until the desired class balance is achieved.
SMOTE helps to overcome the imbalance by increasing the representation of the minority class through synthetic samples, allowing the machine learning model to learn more effectively from both classes.
It is important to note that while SMOTE can be beneficial, it may not be suitable for all datasets or machine learning problems. Careful consideration and evaluation of the specific characteristics of the data are necessary when deciding whether to use SMOTE or other techniques to address class imbalance.
In this study, SMOTE was chosen as the primary balancing method because it increases the diversity of minority-class samples while preserving the structure of the feature space. SMOTE was implemented using the commonly adopted setting of k = 5 nearest neighbors, which provides a good compromise between generating realistic synthetic samples and avoiding the creation of outliers. Although alternative approaches such as class weighting and cost-sensitive learning were considered, preliminary tests indicated that these methods did not sufficiently correct the severe imbalance present in the original dataset. Class weighting modifies loss penalties but does not alter the underlying distribution of training samples, while cost-sensitive methods depend heavily on carefully tuned misclassification costs. In contrast, SMOTE produced substantial improvements in minority-class recall across all ensemble models and yielded more stable cross-validation performance. For these reasons, SMOTE was selected as the most effective and consistent strategy for addressing imbalance in the present study.
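The following minimal sketch illustrates this balancing step with the imbalanced-learn implementation of SMOTE, assuming the feature-selected training arrays from the previous step (variable names are illustrative).
from imblearn.over_sampling import SMOTE

# k = 5 nearest neighbours, as described above
smote = SMOTE(k_neighbors=5, random_state=42)
X_train_balanced, y_train_balanced = smote.fit_resample(X_train_selected, y_train)

# After resampling, crash and non-crash classes are equally represented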

3.3. Traffic Crash Prediction Models

3.3.1. Random Forest

Random Forests consist of a collection of tree predictors, where each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. As the number of trees in the forest becomes substantial, the generalization error tends to approach a limit [34]. The RF methodology incorporates two randomized processes prior to seeking the most effective features and partition points. Initially, a predetermined quantity of samples is randomly drawn from the training set. Subsequently, for each iteration of tree growth, RF randomly selects sub-samples of features. These dual procedures contribute to mitigating overfitting. The ultimate predictions generated by the RF are derived through the averaging of the individual outcomes from all the learners [3].

3.3.2. Bagging

An ensemble learning technique called Bootstrap Aggregating is used to improve the accuracy and stability of machine learning models. It entails merging the predictions of many instances of the same learning algorithm that have been trained on various subsets of the training data. Bagging predictors involves generating multiple instances of a predictor and merging them to form an aggregated predictor. In cases where a numerical outcome is predicted, the aggregation involves averaging the predictions of the individual instances; for class predictions, a majority vote is taken. By creating bootstrap replicas of the learning set and treating them as new learning sets, numerous predictor versions are produced. Bagging has demonstrated notable accuracy improvements in experiments with real and simulated datasets, particularly in the context of classification and regression trees, as well as subset selection in linear regression. The method is particularly advantageous when the prediction model exhibits instability, as Bagging can enhance accuracy by addressing large changes in the predictor resulting from perturbing the learning set [35].

3.3.3. Extreme Gradient Boosting

XGBoost is one of the implementations of gradient boosting machines (GBM), which are well known for being among the best algorithms for supervised learning. It can be used to solve both regression and classification problems. Data scientists favor XGBoost because of its rapid out-of-core computation performance [36].
Even with a small number of observations, XGBoost enables the consideration of a sizable collection of explanatory factors coupled with unspecified nonlinearities in the estimated effects. Additionally, the framework is intended to remain explainable [37].
The implementation of gradient boosting is typically characterized by a slow training process due to the sequential modelling approach. However, Extreme Gradient Boosting (XGBoost) represents a comparatively recent algorithm, first introduced by Chen and Guestrin in 2016 [36]. XGBoost represents a variant of gradient boosting decision trees, designed with the objective of improving both speed and performance. This implementation offers a parallel tree boosting algorithm, which enables the training process to be optimized with greater speed and precision [38].

3.3.4. Light Gradient Boosting Machine (LightGBM)

LightGBM is a strong and effective gradient boosting framework for machine learning. It is a popular option for a variety of tasks, including classification, regression, and ranking, because of its rapid, accurate, and scalable design. Developed by Microsoft, it offers reduced memory usage and quicker training rates than XGBoost. Parallel computing and advanced methods such as Exclusive Feature Bundling (EFB) and Gradient-based One-Side Sampling (GOSS) are used to accomplish this. GOSS prioritizes samples with larger gradients to optimize information gain, while EFB reduces feature dimensionality by bundling mutually exclusive features. These strategies allow LightGBM to build efficient decision trees with depth constraints, balancing sample size and accuracy [39].
Justification of Model Selection:
The impetus for this study was twofold: first, to investigate and optimize the performance of tree-based ensemble machine learning models for traffic crash prediction, and second, to provide a comprehensive justification for the model selection. Consequently, the present study deliberately centered on algorithms such as Random Forest, Bagging Classifier, XGBoost, and LightGBM, which represent the most widely used and effective ensemble tree-based approaches. The selection of these models was guided by the study’s methodological objective: to develop a crash-prediction framework that is interpretable, robust, and computationally efficient, based on the tabular, structured data generated by a driving simulator.
Ensemble tree-based algorithms offer several advantages for this purpose. They effectively capture nonlinear relationships among behavioral, environmental, and vehicular features; handle heterogeneous data types without extensive preprocessing; are resistant to noise and multicollinearity; and provide built-in mechanisms to assess feature importance, an essential capability for identifying key risk factors in safety-critical applications. Furthermore, when employed in conjunction with oversampling methods such as SMOTE, these models have been shown to effectively address data imbalance while maintaining high classification performance.
While deep-learning models (e.g., artificial neural networks (ANN), convolutional neural networks (CNN), and long short-term memory (LSTM)) are powerful in contexts involving large-scale, sequential, or image-based data, their complexity and reduced interpretability make them less suitable for the present dataset and objectives. The present study, therefore, focuses on tree-based machine learning, which provides a solid, interpretable foundation for crash-risk analysis. Nevertheless, we are conducting a follow-up study that explores deep-learning architectures using extended multimodal datasets to complement and build upon the findings of this work.
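For reference, the four ensemble classifiers compared in this study can be instantiated as in the sketch below; the parameter values shown are placeholders, since the actual settings are determined later through Grid Search, and the balanced training arrays are assumed from the preceding steps.
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Bagging": BaggingClassifier(n_estimators=100, random_state=42),
    "XGBoost": XGBClassifier(n_estimators=100, learning_rate=0.1, random_state=42),
    "LightGBM": LGBMClassifier(n_estimators=100, learning_rate=0.1, random_state=42),
}

# Each model is trained on the balanced, feature-selected training data
for name, model in models.items():
    model.fit(X_train_balanced, y_train_balanced)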

3.4. Hyperparameters Optimization

3.4.1. Hyperparameter Tuning Techniques

Hyperparameter tuning, a critical aspect of machine learning, is essentially a black-box optimization problem. Numerous techniques have been proposed, ranging from manual and grid search to more sophisticated approaches such as random search, Bayesian optimization, genetic algorithms, and particle swarm optimization. While recent research has focused on the development of new models, the impact of hyperparameter tuning on model performance is often underexplored.

3.4.2. Grid Search Algorithm

Grid Search is a systematic approach to hyperparameter tuning that evaluates model performance for every possible combination of parameters within a user-defined grid. While it is exhaustive within this predefined grid, it does not explore the entire hyperparameter space, which can be infinite in scope. Previous research has shown that alternative optimization strategies such as random search [40], Bayesian optimization, and Optuna can improve efficiency by adaptively sampling promising parameter regions. Nevertheless, Grid Search was deliberately chosen in this study because it offers methodological transparency, reproducibility, and uniform applicability across all ensemble models (Random Forest, Bagging, XGBoost, and LightGBM). The dataset size and parameter ranges were moderate, rendering the computational requirements fully manageable. This approach ensured fair comparison among models and provided a reliable, interpretable framework for assessing the influence of hyperparameters on predictive performance.
The procedure for selecting hyperparameters is as follows:
Hyperparameter optimization was carried out exclusively on the training set using a 5-fold cross-validation strategy implemented through the GridSearchCV module in scikit-learn. The training data were internally partitioned into five folds, where four folds were used for model fitting and one for validation in each iteration. The grid search method evaluated all parameter combinations based on the mean cross-validation accuracy. The configuration achieving the best score was then retrained on the complete training data. The independent test set was utilized solely for the final performance evaluation, thereby ensuring an unbiased assessment and preventing any data leakage between the tuning and testing stages.
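A minimal sketch of this procedure for one model (XGBoost) is shown below; the parameter grid is an illustrative subset of the grids reported in Table 5, not the complete configuration.
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from xgboost import XGBClassifier

param_grid = {
    "n_estimators": [100, 300],
    "learning_rate": [0.01, 0.1, 0.2],
    "max_depth": [3, 6],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(
    estimator=XGBClassifier(random_state=42),
    param_grid=param_grid,
    scoring="accuracy",     # mean cross-validation accuracy, as described above
    cv=cv,
    n_jobs=-1,
)
search.fit(X_train_balanced, y_train_balanced)

# GridSearchCV refits the best configuration on the full training data
best_model = search.best_estimator_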

3.5. Validation Strategy

To ensure robust and unbiased model evaluation, a Stratified 5-Fold Cross-Validation procedure was adopted during both training and hyperparameter optimization. This approach maintains the same crash-to-non-crash class ratio across all folds, providing balanced and representative subsets of the data. In each iteration, four folds (80%) were used for training and one fold (20%) for validation, and the process was repeated five times so that each subset served as validation once. This stratified design mitigates bias due to data imbalance and yields a stable estimate of generalization performance. The reported metrics (accuracy, precision, recall, and F1-score) represent the average values across all folds. Standard deviations were also computed to capture model uncertainty. The final models were retrained on the complete training data using the optimal parameters obtained from Grid Search, and their predictive performance was assessed on an independent test set.
Hyperparameter Search Configuration and Reproducibility
Hyperparameter tuning was conducted using GridSearchCV with Stratified 5-Fold Cross-Validation on the training subset (80% of the data). Within each CV iteration, feature selection and SMOTE oversampling were applied only to the training folds to prevent data leakage. All random processes were fixed with random_state = 42, and every configuration was repeated three times to estimate variability.
The selection of hyperparameter ranges shown in Table 5 was guided by a combination of prior research, exploratory tuning, and computational considerations. Preliminary experiments were conducted to identify stable and meaningful regions of the search space, while avoiding configurations that caused overfitting or excessively long training times. For example, the XGBoost learning-rate values of 0.01, 0.1, and 0.2 were chosen because they are widely recommended in ensemble-learning literature and yielded stable convergence during initial tests. Higher learning rates (>0.3) produced unstable behavior, whereas very small values (<0.01) greatly increased computation time without improving performance. Similar reasoning informed the selection of parameter ranges for Random Forest, Bagging, and LightGBM, ensuring that the grid covered the most influential configurations while remaining computationally feasible for stratified 5-fold cross-validation.
Table 5 summarizes the parameter grids, the number of parameter combinations, and the resulting CV evaluations. Experiments were executed on a workstation (Intel Core i7 @ 3.1 GHz, 16 GB RAM) using n_jobs = -1 for parallelization. Although Grid Search ensures exhaustive exploration and transparency, future work may adopt Randomized Search or Bayesian Optimization (e.g., Optuna) to improve efficiency for larger search spaces.
Experimental Pipeline and Reproducibility
During each iteration of Stratified 5-Fold Cross-Validation, both feature selection (using the Extra-Trees Classifier) and data balancing (using SMOTE) were applied only on the training folds to prevent any leakage of synthetic or selected features into the validation data. The experimental pipeline, therefore, followed the sequence:
Train Fold → Feature Selection → SMOTE → Model Training → Hyperparameter Tuning → Validation Fold Evaluation.
The final hold-out test set (20% of the data) remained completely untouched until the end of model development. All random processes (SMOTE sampling, cross-validation shuffling, and model initialization) were seeded with random_state = 42 to ensure reproducibility. Each configuration was repeated three times, and the mean ± standard deviation of the resulting metrics was reported to capture performance variability and model uncertainty.
A complete pseudocode of the entire pipeline, including preprocessing, SMOTE balancing, ExtraTrees feature selection, model training, and evaluation, is provided as Algorithm S1 in the Supplementary Materials.

3.6. Performance Evaluation

In this study, the dataset is divided into two subsets: a training set and a testing set. For each model, the most appropriate parameters are selected through fine-tuning on the training set, and performance is then evaluated using a range of assessment metrics. The outputs of these models fall into four categories: true negative (TN), false negative (FN), true positive (TP), and false positive (FP). Among the metrics derived from these counts, accuracy is the most frequently employed for gauging the overall performance of a model [41].
Accuracy is the most fundamental metric for evaluating the overall performance of a classification model. It measures how often the model classifies instances correctly, counting both correct positive and correct negative predictions relative to the total number of instances in the dataset, as expressed in Equation (1).
$\mathrm{Accuracy} = \dfrac{TP + TN}{TP + TN + FP + FN}$ (1)
In summary, accuracy is a widely used metric for assessing the overall correctness of a classification model. However, it should be interpreted with caution, and additional metrics may be necessary, especially in imbalanced classification scenarios.
Precision evaluates the correctness of a model’s positive predictions in both binary and multi-class classification. It quantifies the proportion of correct positive predictions among all of the model’s positive predictions, as expressed in Equation (2), and it is especially useful in situations where false positives are costly.
$\mathrm{Precision} = \dfrac{TP}{TP + FP}$ (2)
Recall, also known as Sensitivity or the True Positive Rate, is used in both binary and multiclass classification tasks. It measures a model’s ability to identify all relevant instances in a dataset; specifically, it is the proportion of actual positive instances that the model correctly predicts as positive, as given in Equation (3).
$\mathrm{Recall} = \dfrac{TP}{TP + FN}$ (3)
F1 score: this metric is used in both binary and multiclass classification to provide a balanced assessment of a model’s effectiveness by combining precision and recall, which is particularly valuable under imbalanced class distributions. The F1 score is the harmonic mean of the two metrics, as expressed in Equation (4), where R denotes recall and P denotes precision.
$F1\ \mathrm{score} = \dfrac{2 \times R \times P}{R + P}$ (4)
Confusion matrix: Table 6 presents the performance evaluation measures derived from the confusion matrix. In this matrix, the True Positive (TP) represents the number of crash instances accurately identified as crashes, while the False Positive (FP) denotes the number of no-crash instances incorrectly predicted as crashes. The False Negative (FN) corresponds to crash instances that were mistakenly classified as no-crash cases, and the True Negative (TN) refers to no-crash instances that were correctly identified as such [42].
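As a quick illustration of Equations (1)–(4) and the confusion-matrix counts in Table 6, the short sketch below computes the four metrics with scikit-learn on a small, made-up set of crash/no-crash labels; the values are purely illustrative.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

# Illustrative crash (1) / no-crash (0) labels and predictions
y_true = np.array([1, 0, 1, 1, 0, 0, 0, 1])
y_pred = np.array([1, 0, 0, 1, 0, 1, 0, 1])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Equations (1)-(4), computed directly from the confusion-matrix counts
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * recall * precision / (recall + precision)

# The same values obtained from scikit-learn's metric functions
assert np.isclose(accuracy, accuracy_score(y_true, y_pred))
assert np.isclose(precision, precision_score(y_true, y_pred))
assert np.isclose(recall, recall_score(y_true, y_pred))
assert np.isclose(f1, f1_score(y_true, y_pred))
```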
ROC-AUC and PR-AUC: ROC-AUC (Area Under the Receiver Operating Characteristic Curve) is a metric that evaluates a model’s ability to discriminate between crash and non-crash classes across varying thresholds, while PR-AUC (Area Under the Precision–Recall Curve) is particularly informative for imbalanced datasets by quantifying the trade-off between precision and recall. The calculation of both AUC metrics was derived from cross-validated predicted probabilities.
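The sketch below indicates how these AUC values could be obtained from cross-validated probabilities with scikit-learn; it reuses the hypothetical pipeline, X_train, and y_train names from the earlier sketch and is not the authors' exact code.

```python
from sklearn.metrics import roc_auc_score, average_precision_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Cross-validated crash probabilities (column 1 = probability of the crash class)
proba = cross_val_predict(pipeline, X_train, y_train, cv=cv,
                          method="predict_proba")[:, 1]

roc_auc = roc_auc_score(y_train, proba)           # threshold-independent discrimination
pr_auc = average_precision_score(y_train, proba)  # more informative under class imbalance
```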
Threshold Sensitivity Analysis: Given the potential variability in classification performance with respect to the decision threshold, a threshold sensitivity analysis was conducted. Predicted probabilities were aggregated across validation folds, and model performance was evaluated at thresholds ranging from 0.1 to 0.9. For each threshold, the precision, recall, F1-score, and balanced accuracy were computed. This analysis lends support to the selection of a threshold and offers operational guidance for applications where false positives and false negatives incur different costs.
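A compact sketch of this analysis is given below; it assumes the cross-validated crash probabilities (proba) and labels (y_train) from the previous sketch and simply sweeps the decision threshold from 0.1 to 0.9.

```python
import numpy as np
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             balanced_accuracy_score)

results = []
for threshold in np.arange(0.1, 1.0, 0.1):
    y_hat = (proba >= threshold).astype(int)      # binarize at the current threshold
    results.append({
        "threshold": round(float(threshold), 1),
        "precision": precision_score(y_train, y_hat, zero_division=0),
        "recall": recall_score(y_train, y_hat),
        "f1": f1_score(y_train, y_hat),
        "balanced_accuracy": balanced_accuracy_score(y_train, y_hat),
    })
```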
Feature Importance Estimation: Feature importance was quantified by fitting an Extra Trees Classifier on the SMOTE-balanced training folds. The mean importance and standard deviation across folds were computed to assess the stability of each feature’s contribution; these values support interpretability and robustness evaluation.
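The following sketch outlines one way to obtain such fold-wise importance estimates, assuming a pandas feature matrix X_train and label series y_train as placeholders; it is a simplified stand-in for the authors' pipeline.

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import StratifiedKFold

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_importances = []
for train_idx, _ in cv.split(X_train, y_train):
    # Balance the training fold with SMOTE, then fit the Extra Trees estimator
    X_bal, y_bal = SMOTE(random_state=42).fit_resample(X_train.iloc[train_idx],
                                                       y_train.iloc[train_idx])
    et = ExtraTreesClassifier(n_estimators=100, random_state=42).fit(X_bal, y_bal)
    fold_importances.append(et.feature_importances_)

mean_imp = np.mean(fold_importances, axis=0)   # average contribution of each feature
std_imp = np.std(fold_importances, axis=0)     # stability of that contribution across folds
```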
Learning Curves: Learning curves were generated for representative ensemble models to examine how performance varies with the training sample size. Training and validation scores were computed for increasing training-set fractions, revealing behaviors indicative of underfitting or overfitting and indicating whether additional data would further improve learning.
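A minimal sketch using scikit-learn's learning_curve utility with a LightGBM classifier is shown below; the choice of model, score, and training fractions is illustrative rather than a record of the exact settings used.

```python
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.model_selection import StratifiedKFold, learning_curve

sizes, train_scores, val_scores = learning_curve(
    LGBMClassifier(random_state=42), X_train, y_train,
    train_sizes=np.linspace(0.1, 1.0, 5),          # increasing training-set fractions
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring="f1", n_jobs=-1)

# A persistent gap between the two curves suggests overfitting; validation scores
# that keep rising indicate that additional data would still improve the model.
train_mean, val_mean = train_scores.mean(axis=1), val_scores.mean(axis=1)
```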
Statistical Significance Testing: To ascertain whether these variations in model performance were statistically significant, a one-way analysis of variance (ANOVA) was conducted on the fold-level F1-scores, followed by paired t-tests for all pairwise model comparisons. These analyses confirm whether observed performance differences are statistically significant rather than due to sampling variability.
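The sketch below shows how these tests can be run with SciPy on fold-level F1-scores; the arrays contain made-up values for illustration only, not the scores reported in Table S2.

```python
import numpy as np
from scipy.stats import f_oneway, ttest_rel

# Illustrative fold-level F1-scores (5 folds per model); not the paper's actual values
f1_scores = {
    "RandomForest": np.array([0.90, 0.91, 0.90, 0.92, 0.91]),
    "Bagging":      np.array([0.91, 0.92, 0.91, 0.93, 0.92]),
    "XGBoost":      np.array([0.95, 0.95, 0.96, 0.95, 0.96]),
    "LightGBM":     np.array([0.95, 0.96, 0.96, 0.95, 0.96]),
}

# One-way ANOVA across the four models
f_stat, p_value = f_oneway(*f1_scores.values())

# Paired t-test for one pairwise comparison (same folds, so a paired test is appropriate)
t_stat, p_pair = ttest_rel(f1_scores["XGBoost"], f1_scores["RandomForest"])
```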
Baseline Model Evaluation: A logistic regression baseline was incorporated for the purpose of comparison under the same cross-validation procedure and evaluation metrics. This approach served to provide context for the improvements achieved by ensemble methods.

4. Experimental Results

4.1. Results of Feature Selection

In this study, feature selection was conducted using the Extra Trees Classifier, which evaluates the relative importance of each variable in predicting the crash state. The ten features with the highest computed importance scores were selected for further analysis, as illustrated in Figure 2. Features with low importance values were excluded, given their limited contribution to the model’s predictive performance.
The choice to retain the top 10 features was guided by both statistical and performance-based considerations. A cumulative importance analysis showed that the highest-ranked 10 features accounted for approximately 92% of the total importance across the Extra Trees ensemble. Beyond this point, the curve exhibited a clear elbow, indicating diminishing returns from adding lower-ranked variables. Additional validation tests confirmed that increasing the number of features beyond 10 resulted in negligible performance improvements (less than 0.3% change in F1-score on average), while increasing computational cost and model variability. Therefore, selecting the top 10 features provided an optimal balance between predictive power and model simplicity.
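The cumulative-importance check described above can be expressed compactly as in the sketch below, which reuses the hypothetical mean_imp vector from the earlier feature-importance sketch together with a placeholder feature_names list.

```python
import numpy as np

order = np.argsort(mean_imp)[::-1]                      # rank features by importance
cumulative = np.cumsum(mean_imp[order]) / mean_imp.sum()

n_keep = 10                                             # elbow observed near the 10th feature
top_features = [feature_names[i] for i in order[:n_keep]]
coverage = cumulative[n_keep - 1]                       # approximately 0.92 in this study
```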
Regarding the excluded variables, such as Tyre Temp RL and Steering Angle, their low importance values suggest that they had limited predictive influence in the controlled simulator setting used in this study. The driving simulator does not fully replicate real tire-wear progression, load transfer, or lateral dynamics, which may reduce the impact of tire-temperature asymmetry on crash likelihood. Similarly, steering-angle variability was constrained by the structured urban environment and standardized road geometry of the simulator, resulting in limited discriminatory power. Although these variables hold practical relevance in real-world safety analysis, their contribution in the present experimental context was modest. Future validation using naturalistic driving data or instrumented-vehicle datasets may reveal stronger effects for these safety-critical variables.
To quantify the stability of feature-ranking results, feature importance values were computed across the Stratified 5-Fold Cross-Validation folds, and the mean ± standard deviation for each feature is reported in Supplementary Table S3. These variability estimates provide error bounds for Figure 2 and confirm that the most influential features remain consistently ranked across folds.

4.2. Data Balancing Results

In order to address class imbalance and enhance the representation of positive crash-state instances, the Synthetic Minority Oversampling Technique (SMOTE) was applied exclusively to the training set. As illustrated in Figure 3, this oversampling approach led to a substantial increase in the total number of training samples, rising from 60,725 to 111,734 per weather condition. The number of positive crash-state instances increased significantly from 4858 to 55,867 per weather condition. This ensures that the testing data remain unbiased, preserving the original class distribution for fair model evaluation.
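A minimal sketch of this balancing step is shown below; the class counts in the comments correspond to the per-weather-condition figures reported above, and X_train/y_train are placeholder names for the training split.

```python
from collections import Counter
from imblearn.over_sampling import SMOTE

# Before: roughly 55,867 no-crash vs. 4,858 crash samples per weather condition
counts_before = Counter(y_train)

# SMOTE is fitted on the training split only; the test set keeps its original distribution
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

# After: both classes at 55,867 samples, i.e., 111,734 training samples in total
counts_after = Counter(y_res)
```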

4.3. Performance of Ensemble Machine Learning for Traffic Crash Prediction

In traffic-safety applications, relying solely on overall accuracy can be misleading due to the strong class imbalance between crash and non-crash events. Missing a crash instance (false negative) has far more severe consequences than generating a false alarm, which makes recall and F1-score more relevant indicators for assessing model usefulness. For this reason, we prioritize recall and F1-score when interpreting model performance, and accuracy is reported as a complementary metric rather than the primary criterion. This ensures that the evaluation reflects the operational requirements of safety-critical crash prediction systems, where correctly identifying high-risk situations is the foremost priority.
Figure 4 shows the comparative performance of four ensemble models—Bagging Classifier, Random Forest, XGBoost, and LightGBM—evaluated using accuracy, precision, recall, and F1 score. LightGBM achieved the highest overall performance (95.3% accuracy, 93.1% F1 score), followed closely by XGBoost (94.7% accuracy), Random Forest (93.9% accuracy), and Bagging (92.5% accuracy). Despite its slightly lower accuracy, Bagging Classifier maintained a balanced trade-off between precision and recall, demonstrating reliable performance in distinguishing positive and negative instances. The Random Forest model was highly accurate but showed a mild tendency to underestimate positive cases, suggesting potential vulnerability to false negatives. Overall, XGBoost and LightGBM provided the best balance between sensitivity and stability, reflecting their robustness in identifying high-risk scenarios. These initial results are promising, but further improvements can be achieved through systematic hyperparameter tuning, feature engineering, and class imbalance handling. These techniques are expected to enhance the predictive reliability of all ensemble models.
In addition to primary classification metrics, we computed the ROC-AUC and PR-AUC for all models. The corresponding ROC and Precision–Recall curves are provided in Figures S1 and S2 (Supplementary Materials), and AUC values are summarized in Table S1 (Supplementary Materials).
To assess whether performance differences were statistically significant, we conducted a one-way ANOVA and paired t-tests using fold-level F1-scores. The ANOVA revealed a highly significant overall effect (F = 2075.10, p = 2.29 × 10⁻⁴⁰), confirming substantial performance differences between models. Detailed results, including paired t-test statistics and effect sizes, are provided in Table S2 (Supplementary Materials).
We also report numerical feature importance values with their variability, computed as mean ± SD across CV folds, in Table S3 (Supplementary Materials).
A threshold sensitivity analysis was performed by varying the classification threshold from 0.1 to 0.9 and computing Precision, Recall, F1-score, and Balanced Accuracy for each value. The results are presented in Figure S3 and Table S4 (Supplementary Materials).
A logistic regression baseline model was trained using the same cross-validation setup as the ensemble models. Its performance metrics are reported in Table S5 (Supplementary Materials) for comparison.
Learning curves illustrating training and validation performance across increasing sample sizes are included in Figures S4–S6 (Supplementary Materials). All performance tables in the main text have been updated to report mean ± SD across folds.

4.4. Optimization and Hyperparameter Tuning Process Results

Among the evaluated algorithms, XGBoost and LightGBM achieved the highest overall performance, with recall values of 0.96 and F1-scores of 0.95, indicating superior capability in correctly identifying crash events while maintaining high precision (0.94). As shown in Figure 5, these models demonstrate strong generalization and balance between sensitivity and stability, making them well-suited for safety-critical applications that prioritize crash detection.
The results obtained after the hyperparameter optimization process reveal a clear enhancement in the performance of all four ensemble models—Random Forest, Bagging Classifier, XGBoost, and LightGBM. As illustrated in Figure 6, each subplot presents the optimal parameter configuration and corresponding evaluation metrics obtained through Grid Search. The optimized configurations led to an average improvement of 3–5% in accuracy and 4–6% in F1 score compared to the default settings, underscoring the positive impact of systematic parameter tuning on model generalization. Among the optimized models, Bagging Classifier and LightGBM exhibited the most significant performance gains across all metrics, demonstrating strong predictive capability and reliability in identifying traffic crash risks. While XGBoost and Random Forest also achieved competitive results, the Bagging and LightGBM models provided the most robust and consistent predictions for crash analysis.
The Bagging Classifier achieved the highest precision (0.99) and overall accuracy (0.98), signifying highly reliable crash predictions with minimal false alarms. However, its relatively lower recall (0.81) suggests a more conservative classification behavior, where some crash events were missed to maintain prediction certainty. This reflects a trade-off between alert reliability and completeness of crash detection, which can be valuable in cost-sensitive or real-time safety-warning contexts.
In contrast, the Random Forest model, despite its solid overall accuracy (0.93), retained a very low recall (0.08) even after tuning, reflecting the strong bias toward the majority class observed before optimization. These findings collectively confirm that boosting-based methods (XGBoost and LightGBM) deliver the best balance between precision and recall, while the Bagging Classifier remains optimal when false-positive minimization is the primary operational objective.
These results collectively demonstrate the effectiveness of the Grid Search optimization procedure in enhancing model performance. By systematically exploring different hyperparameter combinations, all models achieved higher accuracy, precision, recall, and F1-scores.
Table 7 presents the optimized hyperparameters for each ensemble model as identified through the grid search procedure. For the Random Forest model, performance improved with an increased maximum tree depth and a larger number of features considered at each split. The Bagging Classifier achieved its best results with a higher number of estimators and a smaller subset of features per tree, enhancing diversity and stability across the ensemble. XGBoost reached optimal performance using a moderate learning rate, deeper tree structures, and an appropriate number of estimators combined with a reduced subsample ratio. Finally, LightGBM demonstrated superior outcomes with a moderate learning rate, a greater number of leaves, and a sufficient number of estimators to ensure robust learning without overfitting.

5. Discussion

5.1. Interpretation of Features and Derived Insights on Road Safety

The analysis of feature importance provides not only predictive value but also meaningful insights into road safety. Among the most salient predictors identified by the Extra Trees Classifier and ensemble models were vehicle speed, braking intensity, and weather conditions. These factors have direct causal relationships with crash occurrence and severity. High vehicle speeds and abrupt braking patterns frequently indicate driver distraction or reaction delays. Adverse weather conditions, such as rain or fog, significantly increase the likelihood of loss of control or reduced visibility. Furthermore, variables associated with driver behavior, including steering variability and lane deviation, emerged as pivotal risk indicators. These findings are consistent with the existing literature and demonstrate that the proposed framework not only achieves strong predictive accuracy but also enhances interpretability by linking model outputs to tangible behavioral and environmental risk factors. The ability to interpret such data is critical for translating data-driven models into actionable road safety interventions and policy decisions.
The risk factors identified by the ensemble models also point toward several practical and policy-relevant safety interventions. The strong influence of vehicle speed supports the implementation of adaptive speed management strategies, including dynamic or variable speed limits that adjust to real-time traffic and weather conditions. The prominence of abrupt braking patterns suggests the value of advanced driver-assistance alerts capable of detecting hazardous deceleration or unsafe following distances. Furthermore, the elevated crash likelihood in rain and fog indicates the need for weather-responsive countermeasures such as dynamic roadside signage, automated visibility warnings, and integration with connected-vehicle systems that broadcast real-time environmental hazard information. These recommendations illustrate how the insights derived from the predictive framework can guide evidence-based policymaking and support the design of intelligent transport systems that proactively mitigate crash risk.
A comparative analysis of model performance before and after optimization highlights both the practical significance of the results and the effectiveness of the proposed methodology. In the domain of traffic crash prediction, recall is a critical metric, as the failure to correctly identify positive cases (false negatives) can hinder timely preventive interventions and increase the likelihood of severe incidents. Prior to optimization, the Random Forest model exhibited a notably low recall of 0.051, indicating a strong bias toward the majority class. Although optimization improved recall to 0.0848, this level remains insufficient for deployment in safety-critical contexts. In contrast, the optimized XGBoost and LightGBM models achieved substantially higher recall values (0.9588 and 0.9619, respectively) while maintaining high precision. This demonstrates their ability to effectively identify high-risk situations without generating excessive false alarms. The Bagging Classifier achieved the highest precision (0.9936) and strong overall accuracy (0.9848) but a comparatively lower recall of 0.8057, indicating a trade-off between the reliability of positive predictions and the completeness of crash detection. These findings underline the importance of selecting models according to operational priorities—favoring high recall in safety-critical detection systems or high precision in applications where false alerts must be minimized.
The confusion matrices presented in Table 8 provide a clear view of how each ensemble model distinguishes between crash and non-crash events. XGBoost and LightGBM demonstrated the best balance between sensitivity and reliability, effectively detecting crash-prone situations while keeping false alarms low. The Bagging Classifier achieved the highest precision but missed more crash instances, indicating a conservative prediction tendency that prioritizes alert reliability over detection coverage. The Random Forest model showed moderate performance across both error types.

5.2. Supplementary Evaluation Analyses

The supplementary evaluation analyses provide a more comprehensive picture of model behavior and robustness. First, the ROC-AUC and PR-AUC results confirm the ensemble models’ strong discriminative capacity despite the substantial class imbalance; the PR-AUC values in particular show that the best-performing models maintain high recall without a disproportionate increase in false positives. Second, the threshold sensitivity analysis reveals the expected precision–recall trade-off as the decision threshold varies from 0.1 to 0.9. The F1-score peaks around a threshold of 0.5, supporting the decision threshold used in this study, while also showing that different operating points could be chosen in practice depending on safety objectives.
The feature importance results, reported as the mean ± standard deviation across CV folds, confirm the stability of the most influential predictors and reinforce the robustness of the model’s interpretability. The learning curves show that the ensemble methods, particularly the boosting models, continue to benefit from additional training data and exhibit no signs of severe overfitting.
The significance analysis further supports the observed performance differences. A one-way ANOVA conducted on fold-level F1-scores revealed a highly significant effect among the four ensemble models (F = 2075.10, p = 2.29 × 10⁻⁴⁰). Paired t-tests also confirmed that these differences were statistically meaningful. For example, XGBoost obtained significantly higher F1-scores than Random Forest (p < 0.001), and LightGBM significantly outperformed the Bagging Classifier (p < 0.01). These statistical results, together with the large effect sizes, indicate that the advantages of boosting-based models are both statistically and practically substantial.
Overall, the analysis underscores the significant contribution of this study. By integrating targeted feature selection, advanced class balancing, and rigorous hyperparameter tuning, we have attained performance levels seldom reported in the literature, particularly for boosting-based models. Furthermore, the findings provide clear operational insights for model selection based on safety priorities and offer practical improvement strategies—such as cost-sensitive learning, hybrid sampling, and domain-specific feature engineering—to further enhance detection of high-risk cases. The proposed approach is positioned as both scientifically rigorous and practically impactful for the development of intelligent traffic crash prevention systems.

5.3. Limitations, Model Transferability, and Future Validation

This study relies on simulator-generated data, which provide a controlled, safe, and highly reproducible setting for analyzing driving behavior and crash mechanisms. However, simulated environments cannot fully capture the cognitive, emotional, and risk-perception factors that influence real-world driving. Participants may take risks in simulators that they would avoid on public roads, which can shift the frequency of hazardous maneuvers and influence the relative importance of predictive features. These limitations highlight the need to validate the proposed framework using naturalistic driving datasets or verified crash reports.
Although access to real crash data was not possible due to ethical constraints, confidentiality requirements, and limited availability, several practical routes exist for future validation. These include establishing data-sharing agreements with national road safety agencies, police departments, or insurance companies; using instrumented vehicles or naturalistic driving study platforms that record continuous driver, vehicle, and environment information; and integrating anonymized police crash databases that contain verified crash details and driver characteristics. Such initiatives require strict adherence to ethical and privacy regulations, but they are essential for assessing the real-world applicability of the proposed model.
The methodological pipeline developed in this study, which includes feature selection, class balancing, and ensemble optimization, is independent of the data source and can be retrained directly on real-world datasets. Combining verified crash data with simulator-based and sensor-based sources in future research would strengthen model robustness, improve generalizability, and support practical implementation in road safety systems. This integration would reinforce the link between predictive modeling and real-world safety management and ultimately contribute to more reliable and effective crash prevention strategies.
Future research should also prioritize more rigorous temporal validation to assess how well the models generalize to future crash events in operational settings. In addition, advanced interpretability techniques such as permutation importance and SHAP values should be applied to verify the stability of feature rankings, capture nonlinear interactions, and ensure transparent decision-making. These analyses represent essential next steps for strengthening both the reliability and practical relevance of the predictive framework. Further investigation of boundary-focused oversampling techniques, including Borderline-SMOTE and ADASYN, is also critical for improving minority class representation and reducing the likelihood of missed high-risk cases.

6. Practical Deployment of Trained Models

In practical implementation, the trained crash-prediction models can be integrated into the vehicle’s electronic control and communication architecture, particularly via the Controller Area Network (CAN) Bus, which supports real-time data exchange among subsystems. The CAN Bus, developed by Bosch in 1983, serves as an internal network connecting braking, steering, and powertrain modules through a two-wire, multi-master topology. This configuration enables continuous acquisition of vehicle dynamics, driver inputs, and environmental variables, which can serve as real-time inputs to the predictive model.
Complementary information, such as weather conditions, road geometry, and traffic density, can be collected from onboard sensors (e.g., temperature, precipitation, visibility) or external sources, including roadside weather stations and GPS-based systems.
From a computational perspective, the optimized ensemble models (XGBoost and LightGBM) are efficient and suitable for embedded deployment. Benchmarking on a single-core processor equivalent to an automotive-grade ARM Cortex-A72 indicates an average inference time below 10 ms per instance and a memory footprint of 50–80 MB, depending on model configuration. These values demonstrate feasibility for integration into in-vehicle processors or lightweight embedded systems. For continuous real-time operation at 10–20 Hz, deployment techniques such as pruning, model quantization, or ONNX/TensorRT conversion can maintain total latency under 50 ms, meeting real-time constraints for driver-assistance systems.
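As a rough indication of how such latency figures can be measured, the sketch below times single-instance predictions for a trained classifier; model and X_test are placeholder names, and the numbers obtained will naturally depend on the target hardware.

```python
import time
import numpy as np

latencies_ms = []
for row in X_test[:1000]:                          # benchmark on 1,000 individual instances
    start = time.perf_counter()
    model.predict(row.reshape(1, -1))              # one prediction per streamed sample
    latencies_ms.append((time.perf_counter() - start) * 1000.0)

mean_latency = np.mean(latencies_ms)               # should stay well below the 10 ms budget
p95_latency = np.percentile(latencies_ms, 95)
```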
The trained model can therefore run locally within a microcontroller-based or SoC-based unit connected to the CAN Bus, processing streaming data to estimate crash probability in real time. When high-risk conditions are detected, the system can issue immediate visual or auditory alerts, trigger preventive mechanisms (e.g., adaptive braking or steering correction), or transmit warnings to traffic management centers. Although microcontrollers are compatible with CAN protocols, integration in practice requires suitable firmware design, message prioritization, and bandwidth management to avoid bus congestion [43].
Overall, these specifications confirm the technical feasibility of deploying the proposed predictive framework in real-time environments, bridging simulation-based experimentation and intelligent transportation applications.

7. Conclusions

This study developed a comprehensive ensemble-based machine learning framework for predicting traffic crashes using simulator-generated driving data. The proposed methodology integrated feature selection via the Extra Trees Classifier, data balancing using SMOTE, and hyperparameter optimization through Grid Search under a Stratified 5-Fold Cross-Validation scheme. This systematic pipeline effectively addressed three major challenges in crash prediction—feature redundancy, class imbalance, and parameter optimization—thereby improving model interpretability, robustness, and reliability.
Four ensemble algorithms—Random Forest, Bagging Classifier, XGBoost, and LightGBM—were optimized and comparatively evaluated. The results demonstrated that the optimization process significantly enhanced model performance across all metrics. Among the tested models, XGBoost and LightGBM achieved the highest overall performance, with F1-scores of 0.95 and recall values of 0.96, confirming their superior ability to identify crash-prone situations. The Bagging Classifier, although exhibiting slightly lower recall (0.81), achieved the highest precision (0.99) and accuracy (0.98), reflecting a more conservative prediction pattern that prioritizes alert reliability over detection coverage.
The confusion-matrix analysis further highlighted how each model balances false positives and false negatives, providing valuable insight into model suitability for real-world use cases. The strong performance of boosting-based methods (XGBoost and LightGBM) confirms their robustness in handling complex, nonlinear relationships within the data, while ensemble averaging (Bagging and Random Forest) remains advantageous for improving stability and interpretability.
The integration of feature selection and oversampling within the training pipeline also proved essential for enhancing the models’ generalization and interpretability, allowing them to better capture the influence of key predictors such as vehicle kinematics, driver behavior, and weather conditions. These results demonstrate the potential of ensemble learning techniques to support evidence-based decision-making and the design of proactive safety interventions.
Although this study was based on simulator-generated data, the framework itself is methodologically transferable and can be retrained on empirical datasets to validate its generalizability. Future research will focus on testing the proposed models using naturalistic driving or field data, and extending the framework with deep learning architectures (such as ANN, CNN, and LSTM) and cost-sensitive approaches to further enhance prediction accuracy and reduce false negatives in rare-event scenarios.
In summary, the findings of this study demonstrate that combining ensemble machine learning, feature selection, data balancing, and systematic hyperparameter optimization yields a robust, interpretable, and reproducible framework for crash-risk prediction. The proposed approach contributes to the advancement of data-driven road safety management and represents a promising foundation for developing real-time, high-performance crash prediction systems aimed at reducing traffic accidents and enhancing public safety.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/safety11040121/s1, Figure S1: ROC curves; Figure S2: PR curves; Table S1: Table main metrics with std; Table S2: Table ANOVA summary; Table S3: feature importances; Table S4: threshold sensitivity RandomForest; Figure S3: threshold sensitivity RandomForest; Table S5: pairwise pvalues; Table S6: pairwise tstats; Table S7: logistic baseline; Algorithm S1: Supplementary AppendixA pseudocode; Figure S4: learning curve Bagging DT; Figure S5: learning curve RandomForest; Figure S6: learning curve XGBoost.

Author Contributions

Conceptualization, N.G. and Z.E.A.E.; methodology, N.G.; software, N.G.; validation, N.G., Z.E.A.E., H.M. and M.A.; formal analysis, N.G.; investigation, N.G.; resources, N.G. and Z.E.A.E.; data curation, Z.E.A.E. and M.A.; writing—original draft preparation, N.G.; writing—review and editing, N.G.; visualization, N.G.; supervision, Z.E.A.E. and H.M.; project administration, H.M.; funding acquisition, N.G. The authors used AI-based tools, including Grammarly, to enhance the clarity and readability of the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This work has been supported by the Moroccan Ministry of Equipment, Transport and Logistics. LISI Laboratory has also received funding from the CNRST under grant N° 18UCA2015.

Institutional Review Board Statement

The study was conducted in accordance with the Declaration of Helsinki, and approved by the Ethics Committee of the Faculty of Sciences Semlalia, Cadi Ayyad University (protocol code FSSM-UCA-EC-2024-12, approval date 20 December 2024).

Informed Consent Statement

Participants voluntarily provided informed consent after being fully briefed on the study’s objectives, procedures, and their right to withdraw at any time.

Data Availability Statement

The dataset used in this study is currently undergoing a full anonymization process. The finalized anonymized dataset will be deposited in a public open-access repository within six months of the article’s publication, once all anonymization and quality-control procedures are completed. A link to the repository will be provided at the time of release.

Acknowledgments

The authors greatly appreciate the sponsorship of the National Center for Scientific and Technical Research (CNRST) in Morocco.

Conflicts of Interest

The authors declare no competing interests.

References

  1. World Health Organization. Available online: https://www.who.int/health-topics/road-safety#tab=tab_1 (accessed on 1 June 2025).
  2. The National Road Safety Agency of Morocco. Available online: https://www.narsa.ma/ (accessed on 10 May 2025).
  3. El Ferouali, S.; Elamrani Abou Elassad, Z.; Abdali, A. Understanding the Factors Contributing to Traffic Accidents: Survey and Taxonomy. In Artificial Intelligence, Data Science and Applications: ICAISE 2023; Farhaoui, Y., Hussain, A., Saba, T., Taherdoost, H., Verma, A., Eds.; Lecture Notes in Networks and Systems; Springer: Cham, Switzerland, 2024; Volume 838. [Google Scholar] [CrossRef]
  4. Ameksa, M.; Mousannif, H.; Al Moatassime, H.; Elamrani Abou Elassad, Z. Toward Flexible Data Collection of Driving Behaviour. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 2020, XLIV-4/W3-2020, 33–43. [Google Scholar] [CrossRef]
  5. Cai, Q.; Abdel-Aty, M.; Zheng, O.; Wu, Y. Applying machine learning and google street view to explore effects of drivers’ visual environment on traffic safety. Transp. Res. Part C Emerg. Technol. 2022, 135, 103541. [Google Scholar] [CrossRef]
  6. Zhang, S.; Abdel-Aty, M. Real-time crash potential prediction on freeways using connected vehicle data. Anal. Methods Accid. Res. 2022, 36, 100239. [Google Scholar] [CrossRef]
  7. Wang, J.; Song, H.; Fu, T.; Behan, M.; Jie, L.; He, Y.; Shangguan, Q. Crash prediction for freeway work zones in real time: A comparison between Convolutional Neural Network and Binary Logistic Regression model. Int. J. Transp. Sci. Technol. 2022, 11, 484–495. [Google Scholar] [CrossRef]
  8. Theofilatos, A.; Chen, C.; Antoniou, C. Comparing Machine Learning and Deep Learning Methods for Real-Time Crash Prediction. Transp. Res. Rec. 2019, 2673, 169–178. [Google Scholar] [CrossRef]
  9. Goubraim, N.; Elassad, Z.E.A.; El-Amarty, N.; Mousannif, H. Advanced Tree-Based Ensemble Learning System for Prediction Accident Severity with Imbalanced Data Handling. In Proceedings of the 4th International Conference on Advances in Communication Technology and Computer Engineering (ICACTCE’24); Iwendi, C., Boulouard, Z., Kryvinska, N., Eds.; Lecture Notes in Networks and Systems; Springer: Cham, Switzerland, 2025; Volume 1313. [Google Scholar] [CrossRef]
  10. Ke, J.; Zhang, S.; Yang, H.; Chen, X. PCA-based missing information imputation for real-time crash likelihood prediction under imbalanced data. Transp. A Transp. Sci. 2019, 15, 872–895. [Google Scholar] [CrossRef]
  11. Elamrani Abou Elassad, Z.; Ameksa, M.; Elamrani Abou Elassad, D.; Mousannif, H. Machine Learning Prediction of Weather-Induced Road Crash Events for Experienced and Novice Drivers: Insights from a Driving Simulator Study. In Business Intelligence; El Ayachi, R., Fakir, M., Baslam, M., Eds.; CBI 2023; Lecture Notes in Business Information Processing; Springer: Cham, Switzerland, 2023; Volume 484. [Google Scholar] [CrossRef]
  12. Bobermin, M.; Ferreira, S. A novel approach to set driving simulator experiments based on traffic crash data. Accid. Anal. Prev. 2021, 150, 105938. [Google Scholar] [CrossRef]
  13. Gang, R.; Zhuping, Z. Traffic safety forecasting method by particle swarm optimization and support vector machine. Expert Syst. Appl. 2011, 38, 10420–10424. [Google Scholar] [CrossRef]
  14. Almahdi, A.; Al Mamlook, R.E.; Bandara, N.; Almuflih, A.S.; Nasayreh, A.; Gharaibeh, H.; Alasim, F.; Aljohani, A.; Jamal, A. Boosting Ensemble Learning for Freeway Crash Classification under Varying Traffic Conditions: A Hyperparameter Optimization Approach. Sustainability 2023, 15, 15896. [Google Scholar] [CrossRef]
  15. Li, G.; Wu, Y.; Bai, Y.; Zhang, W. ReMAHA–CatBoost: Addressing Imbalanced Data in Traffic Accident Prediction Tasks. Appl. Sci. 2023, 13, 13123. [Google Scholar] [CrossRef]
  16. Ameksa, M.; Mousannif, H.; Al Moatassime, H.; Elamrani Abou Elassad, Z. Application of machine learning techniques for driving errors analysis: Systematic literature review. Int. J. Crashworthiness 2024, 29, 785–793. [Google Scholar] [CrossRef]
  17. Alruily, M.; El-Ghany, S.A.; Mostafa, A.M.; Ezz, M.; El-Aziz, A.A.A. A-Tuning Ensemble Machine Learning Technique for Cerebral Stroke Prediction. Appl. Sci. 2023, 13, 5047. [Google Scholar] [CrossRef]
  18. Rezashoar, S.; Kashi, E.; Saeidi, S. A hybrid algorithm based on machine learning (LightGBM-Optuna) for road accident severity classification (case study: United States from 2016 to 2020). Innov. Infrastruct. Solut. 2024, 9, 319. [Google Scholar] [CrossRef]
  19. Almuhammadi, S.; Alnajim, A.; Ayub, M. QUIC Network Traffic Classification Using Ensemble Machine Learning Techniques. Appl. Sci. 2023, 13, 4725. [Google Scholar] [CrossRef]
  20. Yang, L.; Aghaabbasi, M.; Ali, M.; Jan, A.; Bouallegue, B.; Javed, M.F.; Salem, N.M. Comparative Analysis of the Optimized KNN, SVM, and Ensemble DT Models Using Bayesian Optimization for Predicting Pedestrian Fatalities: An Advance towards Realizing the Sustainable Safety of Pedestrians. Sustainability 2022, 14, 10467. [Google Scholar] [CrossRef]
  21. Abbasi, E.Y.; Deng, Z.; Magsi, A.H.; Ali, Q.; Kumar, K.; Zubedi, A. Optimizing Skin Cancer Survival Prediction with Ensemble Techniques. Bioengineering 2024, 11, 43. [Google Scholar] [CrossRef]
  22. Oyoo, J.O.; Wekesa, J.S.; Ogada, K.O. Predicting Road Traffic Collisions Using a Two-Layer Ensemble Machine Learning Algorithm. Appl. Syst. Innov. 2024, 7, 25. [Google Scholar] [CrossRef]
  23. Rao, A.R.; Wang, H.; Gupta, C. Predictive Analysis for Optimizing Port Operations. Appl. Sci. 2025, 15, 2877. [Google Scholar] [CrossRef]
  24. Li, Y.; Yang, Z.; Xing, L.; Yuan, C.; Liu, F.; Wu, D.; Yang, H. Crash injury severity prediction considering data imbalance: A Wasserstein generative adversarial network with gradient penalty approach. Accid. Anal. Prev. 2023, 192, 107271. [Google Scholar] [CrossRef]
  25. Roudnitski, A. Evaluating Road Crash Severity Prediction with Balanced Ensemble Models. Findings 2024, 1–8. [Google Scholar] [CrossRef]
  26. Mostafa, A.M.; Aldughayfiq, B.; Tarek, M.; Alaerjan, A.S.; Allahem, H.; Elbashir, M.K.; Ezz, M.; Hamouda, E. AI-based prediction of traffic crash severity for improving road safety and transportation efficiency. Sci. Rep. 2025, 15, 27468. [Google Scholar] [CrossRef]
  27. Mujalli, R.O.; López, G.; Garach, L. Bayes classifiers for imbalanced traffic accidents datasets. Accid. Anal. Prev. 2016, 88, 37–51. [Google Scholar] [CrossRef]
  28. Risto, M.; Martens, M.H. Driver headway choice: A comparison between driving simulator and real-road driving. Transp. Res. Part F Traffic Psychol. Behav. 2014, 25 Pt A, 1–9. [Google Scholar] [CrossRef]
  29. Elamrani Abou Elassad, Z.; Mousannif, H. Understanding driving behavior: Measurement, modeling and analysis. In Advances in Intelligent Systems and Computing; Springer: Berlin/Heidelberg, Germany, 2019; pp. 452–464. [Google Scholar] [CrossRef]
  30. Elamrani Abou Elassad, Z.; Mousannif, H.; Al Moatassime, H. A proactive decision support system for predicting traffic crash events: A critical analysis of imbalanced class distribution. Knowl.-Based Syst. 2020, 205, 106314. [Google Scholar] [CrossRef]
  31. Khalid, S.; Khalil, T.; Nasreen, S. A survey of feature selection and feature extraction techniques in machine learning. In Proceedings of the 2014 Science and Information Conference, London, UK, 27–29 August 2014; pp. 372–378. [Google Scholar] [CrossRef]
  32. Hemphill, E.; Lindsay, J.; Lee, C.; Măndoiu, I.I.; Nelson, C.E. Feature selection and classifier performance on diverse biological datasets. BMC Bioinform. 2014, 15, S4. [Google Scholar] [CrossRef] [PubMed]
  33. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic Minority Over-sampling Technique. arXiv 2002, arXiv:1106.1813. [Google Scholar] [CrossRef]
  34. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
  35. Cai, Q.; Abdel-Aty, M.; Yuan, J.; Lee, J.; Wu, Y. Real-time crash prediction on expressways using deep generative models. Transp. Res. Part C Emerg. Technol. 2020, 117, 102697. [Google Scholar] [CrossRef]
  36. Breiman, L. Bagging Predictors. Mach. Learn. 1996, 24, 123–140. [Google Scholar] [CrossRef]
  37. Chen, T.; Guestrin, C. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794. [Google Scholar] [CrossRef]
  38. Giraldo, C.; Giraldo, I.; Gomez-Gonzalez, J.E.; Uribe, J.M. An explained extreme gradient boosting approach for identifying the time-varying determinants of sovereign risk. Financ. Res. Lett. 2023, 57, 104273. [Google Scholar] [CrossRef]
  39. Li, K.; Xu, H.; Liu, X. Analysis and visualization of accidents severity based on LightGBM-TPE. Chaos Solitons Fractals 2022, 157, 111987. [Google Scholar] [CrossRef]
  40. Bergstra, J.; Bengio, Y. Random Search for Hyper-Parameter Optimization. J. Mach. Learn. Res. 2012, 13, 281–305. [Google Scholar]
  41. Elamrani Abou Elassad, D.; Elamrani Abou Elassad, Z.; Ed-Dahbi, A.M.; El Meslouhi, O.; Kardouchi, M.; Akhloufi, M.; Jahan, N. A human-in-the-loop ensemble fusion framework for road crash prediction: Coping with imbalanced heterogeneous data from the driver-vehicle-environment system. Transp. Lett. 2024, 17, 827–843. [Google Scholar] [CrossRef]
  42. Cao, Q.; Sun, H.; Wang, H.; Liu, X.; Lu, Y.; Huo, L. Comparative study of neonatal brain injury fetuses using machine learning methods for perinatal data. Comput. Methods Programs Biomed. 2023, 240, 107701. [Google Scholar] [CrossRef]
  43. Bonfati, L.V.; Mendes Junior, J.J.A.; Siqueira, H.V.; Stevan, S.L., Jr. Correlation Analysis of In-Vehicle Sensors Data and Driver Signals in Identifying Driving and Driver Behaviors. Sensors 2023, 23, 263. [Google Scholar] [CrossRef]
Figure 1. Driving Simulator Setup.
Figure 2. Data Features importance.
Figure 3. Distribution of crash and no-crash samples before and after applying data balancing on the training set.
Figure 4. Ensemble machine learning models’ results for traffic crash prediction (before optimization).
Figure 5. Comparative analysis of the performance of models before and after optimization.
Figure 6. Optimization results of ensemble machine learning models for crash prediction.
Table 1. Features of the simulator used.
Material | Characteristic
MacBook | 2015 Pro
Processor | i7/2.8 GHz
Random access memory | 16 GB
Hard drive | SSD
LCD panel | 27-inch/1920 × 1080 pixels
Logitech® G27 Racing Wheel set | Steering wheel, accelerator pedal, and brake pedal; automated gear selection (a gear shifter was not required)
Table 2. Key Features, Descriptions, and Advantages of Project CARS 2 for Driving Simulation Research.
Feature | Description | Advantage
Realistic Vehicle Physics | Simulates tire grip, suspension, and car dynamics with high accuracy | Enables realistic driver behavior analysis and crash scenario testing
Dynamic Weather System | Supports rain, fog, snow, and changing weather during a drive | Helps assess driver performance under varying and challenging weather conditions
Day/Night Cycle | Full 24-h cycle with realistic lighting and shadows | Useful for visibility studies and night-driving experiments
Custom Scenario Setup | Configurable conditions: road type, traffic, vehicle, time, and weather | Facilitates controlled experiments with repeatable conditions
Hardware Support | Compatible with racing wheels, pedals, and cockpits (e.g., Logitech G27) | Enhances realism and immersion for participants
High-Quality Graphics | Realistic visual representation of roads, vehicles, and environments | Increases participant engagement and ecological validity
LiveTrack 3.0 | Road surface dynamically changes with weather and driving activity | Improves realism and enables more nuanced testing of driving response
Camera Views | Multiple camera options (cockpit, chase, dash, etc.) | Allows flexible recording and observation perspectives
AI Traffic Simulation | Includes AI vehicles with configurable behavior | Useful for testing driver responses to other road users and complex traffic situations
Telemetry Support | Compatible with third-party plugins for data capture (e.g., speed, G-force) | Enables detailed data logging and analysis for research purposes
VR and Multi-Screen Ready | Supports virtual reality and triple-screen setups | Provides immersive simulation for a better participant experience
Table 3. Driving simulation study summary.
Feature | Description
Participants | 81 volunteers (43 male, 38 female) aged 22–41
Experience | Experienced drivers with normal vision
Experiment | Two driving simulation sessions: a practice session for familiarization and a main trial on a 16.5 km virtual urban road
Weather Conditions | Clear, fog, rain, snow
Data Collection | 20 Hz sampling rate, including vehicle telemetry (speed, acceleration), driver inputs (steering, braking, acceleration), environmental conditions (weather), and crash events (binary: 1 for crash, 0 for no crash)
Crash Distribution | Approximately 26% of crashes occurred under clear conditions, 24% under fog, 29% under rain, and 21% under snow (total crash events: 5951)
Data Analysis | Data cleaning and analysis to identify factors influencing crash events, considering variables such as weather conditions, driver behavior, and traffic scenarios
Goal | Understand factors influencing driving behavior and crashes
Key Point | Virtual environment designed to mimic real-world driving conditions with randomly generated traffic scenarios
Table 4. Data features description.
Feature | Description | Category
Speed | Magnitude of a vehicle’s velocity | Vehicle kinematics
Lateral acceleration | Sideways movement of a vehicle | Vehicle kinematics
Longitudinal acceleration | Vehicle’s acceleration or deceleration along its direction of travel | Vehicle kinematics
Vertical acceleration | Rate at which a vehicle moves vertically | Vehicle kinematics
Yaw angle | Angle between a vehicle’s longitudinal axis and its actual line of travel | Vehicle kinematics
Drift angle | Angle between a vehicle’s orientation and its velocity vector | Vehicle kinematics
Spin angle | Angle at which a vehicle rotates or turns about its vertical axis | Vehicle kinematics
Revolutions per minute (RPM) | Number of rotations completed by a vehicle’s engine crankshaft in one minute | Vehicle kinematics
Tyre temp FL, FR, RL, RR | Tire temperature measurements for the front-left (FL), front-right (FR), rear-left (RL), and rear-right (RR) tires, providing insights into the wear and performance of each tire | Vehicle kinematics
Throttle | Accelerator pedal position | Driver behavior
Brake | Driver’s intention to decelerate or halt the vehicle | Driver behavior
Steering | Driver’s intended direction | Driver behavior
Weather Season | Prevailing atmospheric conditions during specific time periods or observations | Weather factors
Table 5. Parameter grids and evaluations for each ensemble model (Stratified 5-Fold CV).
Model | Parameter Grid (Searched Values) | Combinations | CV Fits (×5)
Bagging Classifier | n_estimators: [10, 50, 100]; max_samples: [0.5, 0.8, 1.0]; max_features: [0.5, 0.8, 1.0] | 27 | 135
Random Forest | n_estimators: [100, 200]; max_depth: [5, 10, None]; max_features: [4, "sqrt"] | 12 | 60
XGBoost | learning_rate: [0.01, 0.1, 0.2]; n_estimators: [100, 200]; max_depth: [3, 6, 7]; subsample: [0.8, 1.0] | 36 | 180
LightGBM | learning_rate: [0.01, 0.1]; n_estimators: [100, 200]; num_leaves: [31, 64]; max_depth: [−1, 7]; subsample: [0.8, 1.0] | 16 | 80
Table 6. Confusion matrix.
Actual \ Predicted | Crash | No-Crash
Crash | True Positive (TP) | False Negative (FN)
No-Crash | False Positive (FP) | True Negative (TN)
Table 7. Best Hyperparameters of ensemble machine learning.
Model | Best Hyperparameters
Random Forest | max_depth: 5, max_features: 4, n_estimators: 200
Bagging Classifier | max_features: 0.8, max_samples: 1.0, n_estimators: 100
XGBoost | learning_rate: 0.2, max_depth: 7, n_estimators: 200, subsample: 0.8
LightGBM | learning_rate: 0.1, max_depth: −1, n_estimators: 100, num_leaves: 31
Table 8. Confusion Matrix Comparison of Optimized Ensemble Models for Traffic Crash Prediction.
Model | TP | FP | FN | TN | Total
Random Forest | 171 | 19 | 1043 | 13,948 | 15,181
Bagging Classifier | 984 | 10 | 230 | 13,957 | 15,181
XGBoost | 622 | 39 | 568 | 13,952 | 15,181
LightGBM | 460 | 29 | 730 | 13,962 | 15,181
