Predicting Road Traffic Collisions Using a Two-Layer Ensemble Machine Learning Algorithm

Abstract: Road traffic collisions are among the world's critical issues, causing many casualties, deaths, and economic losses, with a disproportionate burden falling on developing countries. Existing research has analyzed this situation using different approaches and techniques at various road stretches and intersections. In this paper, we propose a two-layer ensemble machine learning (ML) technique to assess and predict road traffic collisions using data from a driving simulator. The first (base) layer integrates supervised learning techniques, namely k-Nearest Neighbors (k-NN), AdaBoost, Naive Bayes (NB), and Decision Trees (DT), while the second layer combines their outputs using logistic regression as a meta-classifier.


Introduction
Globally, road traffic crashes take the lives of nearly 1.35 million people every year, more than two every minute, with more than nine in ten of all deaths occurring in low- and middle-income countries. Road traffic collisions have become the leading cause of death for people aged 15-29 years, and the World Health Organization (WHO) estimates that crashes will cause another 13 million deaths and 500 million injuries around the world by 2030 if urgent action is not taken [1]. A 2018 WHO research report revealed that Kenya has one of the world's worst collision records, with a fatality rate of 27.8 per 100,000 of the population [2], and the city of Nairobi records the highest share of the total road crashes in Kenya. In addition, road traffic collisions in Nairobi cause significant losses of human life and economic resources. According to a National Transport and Safety Authority (NTSA) report, 4690 people lost their lives to road collisions between 1 January and 13 December 2022 [3]. The report also notes that pedestrians and riders die at much higher rates in vehicle collisions in Kenya. The WHO recently announced a "Decade of Action for Road Safety 2021-2030", setting the target of preventing at least 50% of road traffic deaths and injuries by 2030 [4]. Significant attention is required to minimize road collisions; as a result, research into building prediction models (PMs) and traffic collision prevention is critical to improving road safety policies and reducing fatalities on roads [5].
Since road traffic collisions are random, traditional techniques, such as logit and probit models, have been widely used to predict them [6]. Although statistical models have good mathematical interpretation and provide a better understanding of the role of individual predictor variables, they have some limitations [7]. These traditional approaches rest on assumptions, such as a predefined mathematical form, and are sensitive to outliers and missing values in the dataset. Such assumptions may not hold and can negatively affect the outcome of the prediction model [8]. With advancements in soft computing, machine learning techniques have emerged as promising road safety collision research tools to overcome the limitations of statistical methods. In contrast to traditional techniques, machine learning (ML) techniques can manage outliers and missing values in the dataset. To predict road collisions, ML techniques have been applied to primary and secondary road collision datasets for different road networks [9,10]. Data unavailability in low- and middle-income countries impedes road safety improvements; access to data is crucial for scientific research on identifying the factors that cause high road risk and assessing the effectiveness of interventions [11].
Our main objective in this study is to develop and evaluate a crash prediction model that can predict road traffic collisions and their patterns. We perform accident analysis by applying a two-layer ensemble stacking method using logistic regression as a meta-classifier and the four most popular supervised machine learning algorithms (NB, k-NN, DT, and AdaBoost), chosen for their proven accuracy in this field [12-14]. Datasets for this study were acquired from a fixed-base driving simulator [15]. The prediction accuracy, precision, recall, and F1 score of each ML technique were compared and measured to highlight the best fit. Our contribution through this paper is the development of a crash prediction model that can predict the outcome of a collision; this can help emergency centers estimate the possible impacts and provide more appropriate medical treatment, enable policymakers to formulate better evidence-based road safety policies, and enable better road traffic safety management.
The article is structured as follows. Section 2 focuses on the research methodology and explains data preprocessing, feature selection, and building the ensemble model. Section 3 gives the analysis outcomes. Section 4 discusses the key findings of this research. Lastly, in Section 5, we conclude the paper and address future work.

Materials and Methods
In this study, we developed an ensemble model with two layers, using four base classifiers and a meta-classifier that integrates the base-layer models to improve performance. The four supervised ML algorithms employed to predict road collisions and their patterns are k-NN, DT, AdaBoost, and Naïve Bayes. Logistic regression was then integrated as a meta-classifier in the second layer of the model, combining the outputs of the four first-layer models. Figure 1 presents the flowchart adopted in this study. The research methodology is structured into the following steps: data collection, data preprocessing, building the ensemble model, and performance evaluation of the model.

Study Population and Data Description
A driving simulator was used to collect data for this study. It is very dangerous to conduct trials in a real-world environment, and a driving simulator provides an excellent tool for collecting data in a safe environment [16,17]. The 3.5 km Mbagathi Way in Nairobi, Kenya, was modeled in the driving simulator at the Strathmore University Business School's Institute of Healthcare Management. The simulations included 80 participants who were selected using the snowball approach. The participants were required to hold a valid driver's license and to have more than two years of driving experience. An informed consent form was administered to each participant, and they were briefed on why they were selected and informed of the importance of participating in the study. Weather, speed limit, lane width, and road layout served as the primary determinants of the scenarios. The driving simulator has a driving seat, a powerful simulation computer, three screens that display the driving scenarios, an observer screen, a 7" tablet that displays the speedometer, a steering wheel, a clutch, a gear stick, an accelerator, and brakes. Figure 2 shows a participant driving along the simulated road during the experiment. The simulations were based on two scenarios that included before and after treatments.


Data Preprocessing
The data, with 15 features, were loaded into a pandas DataFrame object to facilitate various preprocessing procedures. First, the dataset was normalized using the 15 features, after which missing values were discovered in some of the fields. Since missing values would affect the performance of the model, we replaced blank and null feature values with the mean value of the relevant feature column [18,19]. The feature columns used for mean imputation contained no extreme values that could have skewed the mean.
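A minimal sketch of this mean-imputation step, assuming the data sit in a pandas DataFrame; the column names and values here are hypothetical, not the study's actual features:

```python
import numpy as np
import pandas as pd

# Hypothetical mini-dataset; column names are illustrative only.
df = pd.DataFrame({
    "speed": [60.0, np.nan, 80.0, 70.0],
    "lane_width": [3.5, 3.5, np.nan, 3.25],
})

# Replace blank/null values with the mean of the relevant feature column,
# as done in the preprocessing step described above.
df_filled = df.fillna(df.mean(numeric_only=True))

print(df_filled["speed"].tolist())  # the NaN becomes the column mean (70.0)
```

The same effect can be obtained with scikit-learn's `SimpleImputer(strategy="mean")` when the preprocessing is part of a pipeline.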

Feature Selection
Feature selection is a critical factor in obtaining an accurate prediction. Using all the features leads to an inefficient model because, as the number of features increases, models struggle for accuracy and model performance is reduced [20]. In this study, we used scikit-learn (Sklearn), a Python library, to select the features. To obtain the most important features for this study, we employed four algorithms: particle swarm optimization (PSO), univariate feature selection, recursive feature elimination, and feature importance.



1. Particle swarm optimization (PSO) algorithm: This technique searches for the optimal subset of features. It locates the minimum of a function by creating several 'particles'. These particles store their best position as well as the global best position, and it is this combination of local and global information that gives rise to 'swarm intelligence' [21]. In our study, we implemented XGBoost and linear regression algorithms to select the best features.

2. Recursive feature elimination: This technique selects the optimal subset of features by iteratively eliminating features from the full set of N down to the desired number [22]. The best subset is then chosen based on the model's accuracy, cross-validation score, or ROC-AUC curve.

3. Univariate feature selection: This approach selects the optimal features using univariate statistical tests. It can be considered a preprocessing stage for the estimator [23]. In our study, we implemented the chi-squared statistical test using the SelectKBest method.

4. Feature importance: This works by classifying and evaluating each attribute used to create splits. Ensemble decision tree models, such as extra trees and random forests, can be used to rank the relevance of individual features [24]. In our study, we employed the extra trees classifier for feature selection.
After running the feature selection algorithms, we selected the top six features, as shown in Table 1.
Three techniques (univariate feature selection, recursive feature elimination, and feature importance) shared the same top six features, while the PSO algorithm had four features in common with the other three techniques. For this study, we employed the features selected by the PSO method because the performance of the model was not affected when it was evaluated using the features selected by the other three techniques.
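The three scikit-learn-based selection techniques above can be sketched on synthetic data as follows (PSO is omitted, as it is not part of scikit-learn); the dataset, estimators, and feature counts are illustrative assumptions, not the study's configuration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import RFE, SelectKBest, chi2
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the simulator data: 15 features, binary collision label.
X, y = make_classification(n_samples=200, n_features=15, n_informative=6,
                           random_state=42)
X = np.abs(X)  # chi2 requires non-negative feature values

# Univariate selection: chi-squared test via SelectKBest, keeping the top six.
kbest = SelectKBest(chi2, k=6).fit(X, y)

# Recursive feature elimination down to six features.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=6).fit(X, y)

# Feature importance from an extra-trees classifier: rank and keep the top six.
et = ExtraTreesClassifier(random_state=42).fit(X, y)
top6_et = np.argsort(et.feature_importances_)[-6:]

print(sorted(np.where(kbest.get_support())[0]))
print(sorted(np.where(rfe.support_)[0]))
print(sorted(top6_et))
```

Comparing the three printed index lists mirrors the overlap check described above, where three techniques agreed on the same top six features.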

Building the Two-Layer Ensemble Model
We evaluated the performance of the machine learning approaches by splitting the dataset into a 70% training set and a 30% testing set. In our research, we employed four well-known classification algorithms (previously used to predict road traffic collisions) and the stacking ensemble method to predict road traffic collisions. Stacking is an ensemble method for integrating numerous models through a meta-classifier. Following the development of the base models, the four level-0 base models (k-NN, AdaBoost, DT, and Naïve Bayes) were integrated using a stacking framework for road collision prediction. We selected these four base models because of their proven diversity in predicting road collisions. In the second layer, logistic regression was employed as a meta-classifier to classify road collisions from the outputs of the base models. A 10-fold cross-validation technique was used to evaluate how well the models predicted traffic collisions [25]. The proposed two-layer ensemble model is shown in Figure 3. The following section expounds on the four supervised machine learning techniques and the stacking method employed in our study.
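As a rough sketch of this two-layer design, the following uses scikit-learn's `StackingClassifier` on synthetic data; the dataset, feature count, and hyperparameters are illustrative assumptions, not the study's actual configuration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the six selected simulator features.
X, y = make_classification(n_samples=400, n_features=6, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)  # 70/30 split as in the study

# Level-0 base models and a logistic-regression meta-classifier (level-1).
stack = StackingClassifier(
    estimators=[
        ("knn", KNeighborsClassifier(n_neighbors=42)),
        ("dt", DecisionTreeClassifier(random_state=42)),
        ("ada", AdaBoostClassifier(random_state=42)),
        ("nb", GaussianNB()),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=10,  # 10-fold CV to generate the base-model predictions for level-1
)
stack.fit(X_train, y_train)
print(round(stack.score(X_test, y_test), 3))
```

`StackingClassifier` trains the meta-classifier on out-of-fold base-model predictions, which avoids leaking the base models' training fit into the second layer.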
(i) Naïve Bayesian Classifier (NBC): This algorithm employs Bayes' theorem. It estimates the probability of each class given a set of features and assigns a new observation to the class with the highest probability [26]. In our study, Gaussian NB was chosen because the feature set contained continuous variables. NB is represented by the following formula:

P(H|E) = P(E|H) P(H) / P(E)

where P(H|E) is the posterior probability of the hypothesis given that the evidence is true, P(E|H) is the likelihood of the evidence given that the hypothesis is true, P(H) is the prior probability of the hypothesis, and P(E) is the prior probability that the evidence is true. The posterior probability is thus the probability of H being true given that E is true.
(ii) k-Nearest Neighbors (k-NN): This method can be considered a voting system in which the majority class among a new data point's nearest neighbors determines its class label [27]. It analyzes the dataset, computes the distances and similarities between points, and groups them based on the k value. In our study, the k value was obtained by performing several tests with values ranging from 1 to 50 and comparing the prediction performance at each k. We plotted the accuracies for both the training and test datasets, as shown in Figure 4. The performance of k-NN dropped on both the training and test datasets as neighbors were added; the test accuracy then improved as the number of neighbors increased from 33 until the two curves converged at 42 neighbors. In the proposed model, we set the k value at 42 because this yielded the best results, and Euclidean distance was selected as the distance function [28]. The distance between clusters is used to classify new input data, which is allocated to the closest cluster. The following formula illustrates the k-NN approach:

d(x, y) = sqrt( Σ_{i=1..n} (x_i − y_i)² )

where x and y are two points in n-dimensional feature space and x_i, y_i are their ith coordinates.
(iii) Decision Trees (DT): This methodology is a nonparametric supervised learning method for classification and regression. The goal is to build a model that predicts the target variable's value by learning simple decision rules based on the data attributes [29]; the splits are typically chosen using an impurity measure such as entropy or the Gini index.
(iv) Adaptive Boosting (AdaBoost): AdaBoost is a classification method that repeatedly calls a given weak learner algorithm over a number of rounds. Each instance in the training dataset is weighted, and the overall errors are calculated; more weight is given to instances that are difficult to predict, and less weight to those that are simple to predict [31,32]. The AdaBoost approach maintains a weight vector over the input samples, initialized as

w_i = 1/n, i = 1, …, n,

where w_i is the weight of the ith training instance and n is the number of training instances.
(v) Stacking ensemble method: Stacking is a method of integrating predictions from various machine learning models trained on the same dataset, in the same family as bagging and boosting [33]. The stacking architecture consists of two or more base models, known as level-0 models, and a meta-model that combines the predictions of the base models, known as the level-1 model [34]. For our study, stacking was selected because the employed models are distinct yet fit the same dataset; a single model was then trained to integrate the outputs of the base models as well as possible [35]. In our study, we implemented logistic regression as the meta-model to provide a seamless interpretation of the base models' predictions.
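The k-selection procedure described for k-NN above (sweeping k from 1 to 50 and comparing training and test accuracy) can be sketched as follows; this is a minimal illustration on synthetic data, not the study's actual dataset, so the best k found here will differ from the study's value of 42:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=400, n_features=6, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Sweep k from 1 to 50 and record train/test accuracy; the best k is
# read off the test-accuracy curve (cf. Figure 4 in the text).
train_acc, test_acc = [], []
for k in range(1, 51):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    train_acc.append(knn.score(X_train, y_train))
    test_acc.append(knn.score(X_test, y_test))

best_k = int(np.argmax(test_acc)) + 1  # +1 because k starts at 1
print(best_k)
```

Plotting `train_acc` and `test_acc` against k reproduces the convergence behavior described in the text.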

Validation and Performance Measurement
We performed several steps in our experiment to develop the accident prediction model. The first step was to partition the dataset in the ratio of 70% training and 30% testing data. In the second stage, accuracy was assessed using a 10-fold cross-validation technique: the entire dataset was divided into 10 random subsets, with each subset used once as testing data while the remaining nine served as training data.
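A minimal sketch of this 10-fold cross-validation step, using synthetic data and one of the study's base learners as an example:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=300, n_features=6, random_state=42)

# 10-fold cross-validation: the data are split into 10 random subsets, and
# each subset is used once for testing while the other nine train the model.
cv = KFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(GaussianNB(), X, y, cv=cv)
print(len(scores), round(scores.mean(), 3))
```

The mean of the 10 fold scores is the cross-validated accuracy estimate used for model comparison.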

Data Oversampling
There are limitations associated with binary classification when dealing with imbalanced datasets [36]. Oversampling was chosen to mitigate the effect of underrepresented samples. Across most datasets considered imbalanced, sampling strategies have been implemented to improve the overall model's accuracy [37,38]. It is important to note that simple oversampling does not create genuinely new data instances but duplicates existing ones, which can result in overfitting; conversely, undersampling may exclude important samples from the learning process, meaning that the most useful data instances may be overlooked by the model [39].
In this study, our dataset was imbalanced, and we therefore applied the synthetic minority oversampling technique (SMOTE) resampling strategy to handle the imbalance [40]. The SMOTE algorithm creates synthetic positive cases to increase the proportion of the minority class [41]. In our scenario, the data contained 76% instances of no collision and 24% instances of collision, as shown in Figure 6. The dataset before SMOTE is illustrated in Figure 7 as a scatter plot, with many points spread for the majority class and a small number of points scattered for the minority class. Majority class 0 represents no collisions, and 1 represents collisions.
The transformed dataset was balanced after SMOTE, as shown in the scatter plot in Figure 8, in the ratio of 1:1.
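SMOTE's core idea, interpolating between a minority-class sample and one of its nearest minority-class neighbors, can be sketched in a few lines. This is a simplified illustration on toy 2-D data with the 76%/24% class ratio reported above, not the full imbalanced-learn implementation:

```python
import numpy as np

rng = np.random.default_rng(42)

def smote_sketch(X_min, n_new, k=5):
    """Generate n_new synthetic minority samples by interpolating each chosen
    sample toward one of its k nearest minority-class neighbors."""
    n = len(X_min)
    # Pairwise distances within the minority class.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)           # a point is not its own neighbor
    neighbors = np.argsort(d, axis=1)[:, :k]
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(n)               # pick a minority sample
        j = rng.choice(neighbors[i])      # pick one of its k neighbors
        gap = rng.random()                # random point on the segment between them
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)

# Imbalanced toy data: 76% majority (no collision), 24% minority (collision).
X_maj = rng.normal(0, 1, size=(76, 2))
X_min = rng.normal(3, 1, size=(24, 2))
X_new = smote_sketch(X_min, n_new=76 - 24)
print(len(X_min) + len(X_new), len(X_maj))  # balanced 76:76
```

In practice, the study's resampling would use a tested library implementation such as `imblearn.over_sampling.SMOTE`, applied to the training data only.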
The crash prediction model's performance was evaluated using a classification report that included the computed accuracy, precision, recall, and F1 score of each algorithm. Our model initially suffered from underfitting because only the outputs of the base-layer models were used in the second layer; to overcome this, some of the input features from Table 1 that were used in the base-layer models were reduced and used together with the outputs of the base-layer models, thereby improving the model. Logistic regression was used as the meta-classifier to train on these level-1 input features, and the test dataset was then used to evaluate the two-layer ensemble model. The model with the highest metric values was considered the best prediction model. The data generated by the confusion matrix were used to test each model's performance. The confusion matrix (CM) comprises the original and predicted classifications generated by a classification model [42]. Table 2 shows a representation of a confusion matrix, with the actual classes in the rows and the predicted class observations in the columns.
The entities of the CM are defined as follows:
TN: instances that are actually negative and are correctly classified as negative.
FN: instances that are actually positive but are wrongly classified as negative.
TP: instances that are actually positive and are correctly classified as positive.
FP: instances that are actually negative but are incorrectly classified as positive.
The observations of the confusion matrix for every model were used to calculate the following performance metrics:

Accuracy represents the percentage of the total number of instances that were correctly classified:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Recall represents the percentage of positive events that were correctly classified:

Recall = TP / (TP + FN)

Precision represents the percentage of correctly predicted positive instances:

Precision = TP / (TP + FP)

F1 measure: the performance of the model is also measured using the F1 measure, the harmonic mean of recall and precision. Its value ranges from 0 to 1, with 1 denoting the best model and 0 the poorest:

F1 = 2 × (Precision × Recall) / (Precision + Recall)

Error rate represents the frequency of misclassified predictions:

Error rate = (FP + FN) / (TP + TN + FP + FN)
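The five metric definitions above can be checked directly from hypothetical confusion-matrix counts:

```python
# Hypothetical confusion-matrix counts for a binary collision classifier.
TP, TN, FP, FN = 40, 30, 10, 20
total = TP + TN + FP + FN

accuracy = (TP + TN) / total            # share of all instances classified correctly
recall = TP / (TP + FN)                 # share of actual positives found
precision = TP / (TP + FP)              # share of predicted positives that are correct
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of P and R
error_rate = (FP + FN) / total          # share of instances misclassified

print(accuracy, recall, precision, round(f1, 3), error_rate)
```

Note that error rate is simply 1 − accuracy, so only four of the five metrics are independent.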

Results of the Classification before SMOTE
Since we wanted to predict the occurrence or absence of a road traffic collision, our problem was one of binary classification [43]. In this study, the data sample from the driving simulator was split into 70% training data and 30% test data. The model's predictive performance on the test dataset was evaluated by comparing accuracy, precision, recall, and F1 scores. The effectiveness of each algorithm was determined from the driving simulator data by employing AdaBoost, DT, NB, and k-NN as base models with the same selected feature set, and then employing the stacking ensemble method with logistic regression as a meta-classifier to improve accuracy. We performed two scenarios: the first without SMOTE and the second with SMOTE. Before pruning DT and setting the k value for k-NN, DT achieved the highest accuracy of 87%, followed by the two-layer ensemble with 85% and Naïve Bayes with 83%. AdaBoost and k-NN achieved a similar score of 79% before the SMOTE technique, as illustrated in Table 3. The robustness of an ML model is largely assessed and validated using the area under the receiver operating characteristic curve (AUC); when the AUC is higher than 0.7, the developed model is said to have good predictive power. Before SMOTE, the two-layer ensemble had an AUC of about 0.87, followed by the NB algorithm at 0.83, AdaBoost at 0.82, k-NN at 0.80, and DT at 0.77, as shown in Figure 9. The experiment was conducted before implementing pruning on DT and setting the k value on k-NN.
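A minimal sketch of computing AUC from predicted class probabilities, as used for the robustness assessment above; the dataset and classifier here are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=400, n_features=6, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# AUC is computed from the predicted probability of the positive (collision)
# class; an AUC above 0.7 is read as good predictive power in the text above.
nb = GaussianNB().fit(X_train, y_train)
auc = roc_auc_score(y_test, nb.predict_proba(X_test)[:, 1])
print(round(auc, 3))
```

Unlike accuracy, AUC is threshold-independent, which is why it is often preferred for comparing classifiers on imbalanced data.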


Results of the Classification after SMOTE
The AUC scenarios were also compared: one without any resampling technique and one with the resampling strategy applied. After the SMOTE resampling strategy, pruning DT due to overfitting, and setting the k value for k-NN, NB had an improved AUC of about 0.86; AdaBoost remained unchanged, while a decrease was noted for the two-layer ensemble, DT, and k-NN, as shown in Figure 10. Overall, among the base models, the recall value improved for AdaBoost and the proposed two-layer ensemble when applying SMOTE, while a decrease was noted for DT, k-NN, and NB, as shown in Table 4. The precision after applying SMOTE was reduced for DT, k-NN, NB, AdaBoost, and the two-layer ensemble model. Based on the F1 score, a noticeable increase was noted for the two-layer ensemble model and AdaBoost, while it was reduced for NB, DT, and k-NN. NB and AdaBoost achieved the highest accuracies of 81% and 79%, respectively, followed by DT at 77%, while k-NN achieved the lowest accuracy of 72% among the base models after SMOTE, as shown in Table 4. Looking at overall accuracy performance, the two-layer ensemble model achieved 85%.

Results of the Proposed Ensemble Model
Accuracy is a measure of the effectiveness of a single algorithm, but relying solely on accuracy as a performance index can lead to erroneous conclusions, as the model may be biased toward specific collision classes [44]. To address this limitation in our study, other performance metrics, such as recall, F1 score, and precision, were evaluated. These indicators capture performance on individual collision classes and allow better insight into the model. The outcomes of the "no collisions" and "with collisions" performance measurements are shown in Tables 5 and 6, respectively. By the definitions of precision and recall, the optimum model is one that optimizes both measurements. The F1 score is also a good performance indicator because it interprets model performance using both precision and recall. In our study, all the models performed well for no collision, while k-NN, DT, and AdaBoost performed poorly for collisions. The two-layer ensemble and NB performed well for collisions, as shown in Tables 5 and 6.
After evaluating the model using the stacking ensemble method with reduced features, there was a significant improvement in the predictive performance of the models. Table 7 shows the classification accuracy of each model. The two-layer ensemble achieved the highest accuracy of 88%, while NB had 81%, DT 81%, and AdaBoost 79%; k-NN achieved the lowest score of 65%. Among the base models, NB had the highest F1 score, while k-NN had the lowest; overall, the best F1 score was achieved by the two-layer ensemble model. Similarly, the proposed two-layer ensemble model had the best recall; among the base models, NB had the best recall, AdaBoost and DT had similar scores, and k-NN had the lowest. The two-layer ensemble model also had superior precision compared with the other models, as shown in Table 7. The objective of the ensemble method is to predict road collisions using a minimal feature set that can be acquired within a short period from the collision scene. Based on such predictions, policymakers, road constructors, and health facilities would be able to anticipate road traffic collisions at a given site and take the measures required to avert collisions and save lives. The improved two-layer ensemble model demonstrates that it is the most effective method for predicting road collisions.

Discussion
The increase in road traffic collisions necessitates effective analysis and control of these collisions. The study adopted a unique methodological approach to propose a model that predicts road traffic collisions based on a dataset from a driving simulator. Given that it is very dangerous to conduct trials in a real-world environment, a driving simulator provides an excellent tool for collecting data in a safe environment devoid of life-threatening risks and damage to property. The dataset from the simulator was downloaded and normalized using 15 features. We then performed feature selection to retain the best features, thus reducing the likelihood of overfitting in our model. The best parameters of each model were determined by 10-fold cross-validation. The training set was partitioned into 10 equal subsets, with one subset serving as testing data and the remaining nine serving as training data. The process was repeated across all 10 subsets, so that the whole dataset was used for validation. Our problem was one of binary classification, since our study focused on predicting the occurrence or non-occurrence of a collision [45]. Given the stochastic nature of collisions, which tend to be underrepresented in the dataset, the synthetic minority oversampling technique (SMOTE) was used to balance the classes in the training dataset. Crash prediction offers a proactive approach to increasing road safety adherence and saving lives. Research into road safety has been of great interest to researchers, industry, and policy makers. Crash prediction remains complex and requires high-dimensional, large datasets to develop models that can effectively predict road traffic collisions [46].
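The 10-fold partitioning scheme described above can be sketched as follows. This is a minimal index-splitting sketch (sample count and fold count are illustrative); each fold serves once as the test set while the other nine form the training set, so every sample is validated exactly once.

```python
# Generate (train, test) index pairs for k-fold cross-validation.
def k_fold_indices(n_samples: int, k: int = 10):
    indices = list(range(n_samples))
    fold_size, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # Distribute any remainder across the first `extra` folds.
        size = fold_size + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

# Every sample appears in exactly one test fold:
all_test = [i for _, test in k_fold_indices(100, 10) for i in test]
print(sorted(all_test) == list(range(100)))  # True
```

In practice the data would be shuffled (and, for imbalanced classes, stratified) before splitting; the sketch omits that to keep the fold logic visible.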
Depending on accuracy alone as a measure of a model's performance can be misleading, because the model might be biased toward one class. In the present study, to overcome this limitation, we also determined other performance measures, namely precision, recall, and F1 score. To demonstrate the effectiveness of the proposed model, we compared it with existing works in the literature. Notably, the authors are aware of only a few works that have focused on crash prediction models based on a dataset from a driving simulator [47]. A comparison between the proposed two-layer ensemble approach and other works in the literature is presented in Table 8. The strategy was to include similar, closely related works that deployed the same methodologies. Our findings align with the existing literature; however, if a standard data collection format and a standard feature selection approach were adopted across the globe, the transferability, comparison, and usability of these models would be much easier.

Conclusions
In this paper, we propose a two-layer ensemble model for predicting road traffic collisions. The two-layer ensemble method was created by combining the outputs of k-NN, DT, AdaBoost, and NB in the first layer with logistic regression as the meta-classifier in the second layer. The models were compared in terms of accuracy, precision, recall, and F1 score. With this unique combination of ML classifiers, the two-layer ensemble method achieved a remarkable accuracy of 88% under 10-fold cross-validation, with precision at 86%, recall at 83%, and an F1 score of 84%. Since traffic collisions are random, a model that can predict road traffic collisions in a timely manner using a few input features is required. In practice, crash prediction is an important aspect for emergency services and trauma centers in estimating the potential risks resulting from collisions and accordingly equipping the centers and other units with appropriate post-crash care equipment. For policy makers, the findings of this research can be used to formulate evidence-based policies, as opposed to the cause-and-effect approach that is common in most low- and middle-income countries. The two-layer ensemble model can then be used to predict road collisions and therefore save lives and prevent socioeconomic losses. Through validation, the proposed two-layer ensemble had the highest accuracy. One limitation of the proposed approach is the time it takes to run the model, which can be longer than for the individual models. Additionally, the dataset in this study was imbalanced; we therefore applied the SMOTE resampling strategy, although other, more advanced approaches could have been used to address the imbalance. Moreover, the dataset in this study was based on simulated crash data. We highly advocate for a common road collision data collection format to be used by traffic and policy enforcers worldwide.
The results of this study further show that the two-layer ensemble method not only provides practical solutions for improving predictive accuracy but also contributes to the theoretical understanding of machine learning concepts such as the bias-variance trade-off, model diversity, and statistical consistency. For future work, in order to improve prediction accuracy and road safety, we propose performing sensitivity analysis to select the best features; developing ensemble methods that can effectively integrate diverse sources of data; developing ensemble methods that can make real-time predictions and support decision making for drivers, traffic management systems, and emergency centers; and developing ensemble methods for anonymizing and securing sensitive road safety data.

Figure 2 .
Figure 2. A participant driving on the simulated road scenario at Strathmore University.

Figure 3 .
Figure 3. The proposed two-layer ensemble model. (i) Naïve Bayesian Classifier (NBC): This algorithm employs Bayes' theorem. It works by estimating the probability of each class based on a variety of features and allocates the new sample to the class with the highest probability [26]. In our study, Gaussian NB was chosen because the feature set contained continuous variables. The NB classifier is represented by the following formula:

P(C|X) = P(X|C) * P(C) / P(X)  (1)
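A minimal Gaussian NB sketch of Equation (1) follows: the posterior of each class is proportional to its prior times the product of per-feature Gaussian likelihoods. The class statistics and feature values here are invented for illustration, not taken from the study's dataset.

```python
import math

def gaussian_pdf(x: float, mean: float, var: float) -> float:
    """Gaussian likelihood of one continuous feature value."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def nb_predict(x: list, classes: dict) -> str:
    """classes maps label -> (prior, [(mean, var) per feature])."""
    best_label, best_score = None, -1.0
    for label, (prior, stats) in classes.items():
        score = prior  # P(C), multiplied below by P(X|C) per feature
        for xi, (mean, var) in zip(x, stats):
            score *= gaussian_pdf(xi, mean, var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Illustrative class statistics for two features (e.g. speed, reaction time):
classes = {
    "no_collision": (0.7, [(60.0, 100.0), (0.8, 0.04)]),
    "collision": (0.3, [(95.0, 64.0), (1.4, 0.09)]),
}
print(nb_predict([92.0, 1.3], classes))  # collision
```

P(X) is omitted because it is the same for every class and does not change the argmax.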

Figure 4 .
Figure 4. Line plot illustrating k-NN accuracy on training and test datasets for different numbers of neighbors. The distance between the new input and the existing clusters is used for classification, and the sample is allocated to the closest cluster. The following formula illustrates the k-NN approach:

d(x, y) = sqrt(Σ_{i=1}^{n} (x_i − y_i)^2)  (2)
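The k-NN rule described above can be sketched as follows: a new sample receives the majority label among its k nearest neighbors by Euclidean distance. The training points and labels are illustrative, not the study's data.

```python
import math
from collections import Counter

def euclidean(a, b) -> float:
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_predict(train, query, k: int = 3):
    """train is a list of (features, label) pairs."""
    nearest = sorted(train, key=lambda pair: euclidean(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Illustrative points: (speed, reaction time) -> label
train = [
    ((60.0, 0.8), "no_collision"), ((62.0, 0.9), "no_collision"),
    ((58.0, 0.7), "no_collision"), ((95.0, 1.4), "collision"),
    ((98.0, 1.5), "collision"), ((92.0, 1.3), "collision"),
]
print(knn_predict(train, (94.0, 1.4), k=3))  # collision
```

The choice of k trades off noise sensitivity (small k) against over-smoothing (large k), which is what the accuracy curves in Figure 4 explore.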

Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) Entropy(S_v)  (3)

Entropy(S) = −p+ log2 p+ − p− log2 p−  (4)

where S is the sample of training examples, p+ is the proportion of positive training examples, and p− is the proportion of negative training examples. DT has an overfitting problem; to overcome it, we used a pruning technique that removes splits with little information gain. This simplifies the DT, reduces the time cost of training and testing, and mitigates overfitting [30]. In our study, increasing the tree depth in the early stages improved performance on the training dataset while reducing performance on the test dataset. As the tree depth grows, a corresponding improvement is noted on both the training and test datasets up to a depth of 4. At depth 5, the model overfits the training dataset at the expense of the test dataset, as shown in Figure 5. We therefore set the maximum tree depth at 4.
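The entropy and information-gain measures above can be sketched directly. The split fractions passed to `information_gain` are illustrative; a perfect split recovers the parent's full entropy.

```python
import math

def entropy(p_plus: float) -> float:
    """Binary entropy of a sample with positive proportion p_plus."""
    total = 0.0
    for p in (p_plus, 1.0 - p_plus):
        if p > 0:  # 0 * log2(0) is taken as 0
            total -= p * math.log2(p)
    return total

def information_gain(parent_p: float, splits: list) -> float:
    """splits: (fraction of samples, positive proportion) per child node."""
    return entropy(parent_p) - sum(w * entropy(p) for w, p in splits)

print(entropy(0.5))  # 1.0 (maximally impure sample)
print(entropy(1.0))  # 0.0 (pure sample)
# A split separating the classes perfectly recovers the full entropy:
print(round(information_gain(0.5, [(0.5, 1.0), (0.5, 0.0)]), 4))  # 1.0
```

Pruning, as described above, removes splits whose information gain is too small to justify the added depth.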

Figure 5 .
Figure 5. Line plot illustrating DT accuracy on training and test datasets at different tree depths.

Figure 8 .
Figure 8. Scatter plot of the balanced dataset after SMOTE.
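The SMOTE balancing shown in Figure 8 rests on interpolation: new minority samples are created along the segment between a minority point and one of its minority-class neighbors. The following toy sketch picks a random partner from the whole minority set rather than the k nearest neighbors of the standard algorithm, which is enough to show the interpolation step; the points are illustrative.

```python
import random

def smote_oversample(minority, n_new: int, seed: int = 42):
    """Create n_new synthetic minority samples by linear interpolation."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a = rng.choice(minority)
        b = rng.choice(minority)
        gap = rng.random()  # position along the segment between a and b
        synthetic.append(tuple(ai + gap * (bi - ai) for ai, bi in zip(a, b)))
    return synthetic

# Illustrative minority-class points (speed, reaction time):
minority = [(95.0, 1.4), (98.0, 1.5), (92.0, 1.3)]
new_points = smote_oversample(minority, n_new=3)
print(len(new_points))  # 3
```

Because every synthetic point lies between two real minority points, the oversampled region stays inside the minority class's feature range rather than duplicating existing samples verbatim.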

Figure 9 .
Figure 9. Comparison of the Area Under the ROC Curve (AUC) for the models before SMOTE.

Figure 10 .
Figure 10. Comparison of the Area Under the ROC Curve (AUC) for the models after SMOTE.

Table 1 .
Features having a strong relationship with road collisions.

Table 2 .
The architecture of the confusion matrix.

Table 3 .
Results before performing SMOTE analysis.

Table 4 .
Results after SMOTE analysis for each model.

Table 5 .
Outcomes of the models for no collisions.

Table 6 .
Outcomes of the models for collisions.

Table 7 .
Outcomes of the models.

Table 8 .
Comparison of the proposed two-layer ensemble model with works in the literature.