1. Introduction
Autonomous driving is a crucial aspect of automation and an integral part of future daily life. It must handle various scenarios, such as potential collisions between vehicles and motorcycles, pedestrians, cyclists, and other objects. Safe and reliable communication, as well as Artificial Intelligence (AI) [1], are two pillars of autonomous driving. Vehicle-to-Everything (V2X) is part of the communication strategy between vehicles and other objects on the road. Vulnerable Road Users (VRUs) should also be considered when designing traffic automation systems [
2]. Road traffic crashes remain a major global public health concern, causing approximately 1.19 million deaths annually. VRUs, including pedestrians, cyclists, and motorcyclists, account for more than half of these fatalities, highlighting their disproportionate risk exposure [
3].
Several message types support this communication, such as the Basic Safety Message (BSM) [4]. The resulting data can be processed either at the edge or in the cloud. As the number of vehicles and VRUs on the road grows, the required computational capability rises sharply. Likewise, as the AI model becomes heavier and deeper, its computational complexity increases [5]. Two widely used families of AI methods are Deep Learning (DL) and classical Machine Learning (ML). DL uses layered neural networks to learn directly from raw data, while classical ML often relies on manual feature extraction and simpler models [
6]. Data produced by V2X systems is episodic in nature, resulting in a time-series format that makes DL models especially well-suited for analyzing such datasets. Gated Recurrent Unit (GRU) [
7] and attention [
8] are well suited to extracting features from time-series data, and both are lightweight DL layers. In this research, we keep the model lightweight and suitable for edge computing. The evaluated models in this research are bidirectional Long Short-Term Memory (LSTM), Convolutional Neural Network (CNN) with bidirectional LSTM [
6], and the Transformer [
9]. The proposed architecture is trained using BSM data, which includes geolocation, velocity, and acceleration information. In addition, the BSM signal encodes the selected trajectory for each vehicle and motorcycle.
The study employs the Simulation of Urban Mobility (SUMO) [
10] together with Vehicles in Network Simulation (VEINS) [
10] to generate the required dataset. The first scenario involves a possible collision between a pedestrian and a moving object. The second scenario involves a possible collision between a motorcyclist and vehicles.
Recent studies have primarily concentrated on simulating collision scenarios involving vehicles and VRUs, with a predominant emphasis on DL models for collision detection [
11,
12]. While both DL and ML approaches have been explored, previous researchers have largely overlooked critical aspects such as hyperparameter optimization and the discriminative contribution of individual features in distinguishing collision events from normal instances [
13]. Furthermore, the impact of class imbalance, specifically the disproportionate representation of normal versus collision samples, has not been adequately addressed during model training, potentially affecting detection performance and generalizability [
14]. This research aims to address the identified limitations and enhance the accuracy of collision detection.
The proposed methodology processes the input data and classifies samples into collision and non-collision scenarios. The model is further tuned to improve the true positive rate while reducing false positives. The remainder of the paper is organized as follows: Related work on collision detection is examined, with particular attention to research gaps and how they are addressed. The materials and methods describe the simulation process used to generate the dataset and present the architecture of the proposed collision detection system, including hyperparameter tuning. Experimental results report the outcomes obtained for each scenario and highlight key performance metrics such as accuracy and recall. The discussion provides a comparative analysis of these results against similar studies, emphasizing the advantages of the proposed approach, its potential for real-time applications, and the limitations that remain. The conclusion summarizes the methods and findings and outlines directions for future work.
The proposed method can be integrated within the smart city framework by linking traffic signals to urban traffic management instruments, thereby enabling a data-driven decision-support system for autonomous vehicles. Beyond addressing the technical challenges associated with collision detection, the approach is consistent with the broader objectives of smart cities, including safeguarding vulnerable populations, enhancing traffic efficiency, and promoting resilient urban environments.
2. Related Work
The safety of both the vehicle and pedestrians is a crucial aspect of autonomous driving. Different objects on the road need to be considered for road safety. Vehicles, pedestrians, buses, bicycles, and motorcycles are examples of objects on the road. Communication between these objects can be facilitated through safety BSM, Wi-Fi, or radio signals. Mobile Edge Computing (MEC) devices enable rapid response times for analyzing the aforementioned data and deploying AI models on it [
15]. This study reviews prior work that has employed comparable combinations of traffic signals, vehicle trajectories, and physical attributes such as velocity and acceleration for collision detection. A summary of the reviewed research is provided below.
Parada et al. [
11] applied the VEINS framework, which combines the SUMO traffic mobility model with the ns-3 network simulator. This setup generated controlled datasets of motorcyclist and vehicle trajectories under realistic V2X communication conditions. The inputs were time-series data, including position, velocity, and heading, sampled at regular intervals. These features were processed using a stacked unidirectional LSTM network [
16]. The model was tuned to capture temporal dependencies in vehicle–VRU interactions. A sliding-window approach was used to predict collision risks. Performance was evaluated in two collision scenarios. Metrics included detection rate, Average Prediction Time (APT), Correct Decision Percentage (CDP), and false-positive count. Results showed detection accuracies of 96% (APT = 4.53 s, CDP = 41%, 78 false positives) and 95% (APT = 4.44 s, CDP = 43%, 68 false positives) for Scenarios A and B. These findings demonstrate that the LSTM can provide early warnings while meeting V2X latency requirements. However, the study has limitations. It relies only on simulated motorcycle VRU data and lacks real-world validation. False-positive rates remain high. The evaluation is restricted to two specific collision patterns, which limits confidence in the model’s generalizability to broader urban environments and diverse VRU types.
Zhang et al. [
17] utilized a dataset of real-world vehicular accidents combined with V2X communication logs from a beyond-5G experimental environment. A Random Forest (RF) classifier [
18] was developed to identify key contributory factors and classify accident severity levels. The model achieved a classification accuracy of about 80%. Despite this result, the study does not specify the dataset’s scale or diversity, lacks comparisons with other AI approaches such as deep neural networks, and does not assess latency or resilience under varying network conditions. These omissions highlight important directions for future research.
Sharma et al. [
19] examined urban vehicular trajectories collected in a 6G network communication setting. The approach employs a deep deterministic policy gradient agent [
20] that simultaneously analyzes speed, distance, direction, and time to detect anomalous motion patterns. Model performance was assessed using classification accuracy, achieving about 97% on the test set. Despite this strong result, the study does not clarify the dataset’s scale or representativeness, lacks comparisons with other anomaly detection or reinforcement learning methods, and does not evaluate latency or robustness under different network and traffic conditions. These gaps highlight areas for further investigation.
Oliveira et al. [
12] worked on the publicly available highD dataset, which comprises 10 Hz vehicle trajectories recorded from a drone over multiple German highway sections, to train and evaluate their proposed Temporal Convolutional Network Attention (TCN-Attn) [21] model. Their model first applies a stack of TCN blocks to encode short-term motion patterns, then uses a multi-head self-attention layer to reweight those temporal features, and finally decodes future positions via fully connected layers. The model’s performance is measured using mean displacement error and final displacement error. The TCN-Attn model lowers both errors by about 12% compared to LSTM and Social-LSTM. The authors also tested how well the model detects driving maneuvers, reporting 89% accuracy in distinguishing lane changes from lane keeping. However, the paper did not evaluate generalization to other road types or mixed traffic (e.g., urban streets, pedestrians), omits uncertainty quantification under realistic V2X latency and packet-loss conditions, and lacks ablation studies on the attention module’s hyperparameters. Future work should address these gaps.
Prathiba et al. [
22] worked with a custom simulation dataset. The dataset included expert maneuvers, overtaking, and lane changes. It was generated within a 6G-V2X testbed. This dataset was used to train and validate their cooperative collision avoidance scheme for autonomous vehicles. Their model integrates inverse RL [
23] augmented with Gaussian process regression to infer reward functions from limited expert demonstrations and to mimic human decision-making in overtaking and lane-change scenarios. Performance is reported in terms of classification accuracy, collision avoidance rate, and decision latency, with the proposed model achieving a classification accuracy of 92.5%. However, the study’s dependence on simulated data precludes assessment under real-world traffic heterogeneity; the dataset’s scope and statistical properties are not fully detailed, no comparisons with alternative inverse RL or DL architectures are provided, and the impact of varying V2X network latency on system robustness remains unexplored.
Fu et al. [
24] conducted a survey rather than collecting new data. They reviewed major public trajectory datasets, including Next-Generation Simulation (NGSIM) [
25], highD [
26], and V2X communication logs from experimental testbeds, to benchmark collision-avoidance research in intelligent transportation systems (ITSs). Instead of proposing a single architecture, the study systematically compared a range of AI approaches. These included convolutional encoders, LSTM/recurrent neural network predictors [
27], graph-neural models, and reinforcement-learning controllers, all evaluated against established safety scenarios. Performance was measured using standard collision-detection and motion-forecasting metrics. The highest accuracy reported in the surveyed literature was 96.7% on the NGSIM dataset using a hybrid CNN-LSTM model. Despite these findings, the review highlights several gaps: the absence of a unified benchmarking framework or standardized metrics, limited real-world validation under V2X latency and packet-loss conditions, and insufficient exploration of multi-agent coordination in complex urban environments.
Kandhro et al. [
28] employed a customized 6G-V2X testbed that integrated IoT sensors and networking logs from autonomous vehicle trials as the source dataset. They introduced an anomaly detection framework that combines multi-agent reinforcement learning [
29] with maximum entropy inverse reinforcement learning to identify and isolate rogue vehicles in real time. Model performance was benchmarked against an IoT-V2X baseline, with the proposed approach achieving an 8.01% improvement in classification accuracy over existing methods. Despite this gain, the study does not specify the dataset’s scale or diversity, lacks comparisons with alternative anomaly-detection architectures, and does not evaluate robustness under varying 6G-V2X latency or packet-loss conditions. These omissions highlight important directions for future research.
Ribeiro et al. [
30] drew upon a V2X communications dataset generated with the VEINS co-simulation framework, which integrates SUMO for urban mobility and ns-3 for network behavior, to capture time-series streams of motorcycle and vehicle states at intersections. Their proposed system employed stacked unidirectional LSTM networks that ingest sequences of positional, velocity, and heading data to forecast imminent VRU (motorcyclist) collisions several seconds before impact. Evaluation metrics include collision prediction accuracy, APT, CDP, and false-positive count; the model achieves 96% classification accuracy (Scenario A: APT = 4.53 s, CDP = 41%). Nonetheless, the study is limited by its exclusive reliance on simulated data without real-world validation, its focus on a single VRU category and intersection topology, and a relatively high false-positive rate that currently precludes automated safety interventions.
Zeng et al. [
31] utilized both public traffic-flow benchmarks and proprietary V2V communication logs containing synchronized trajectories, speeds, accelerations, and relative positions to train and validate a collision-risk prediction framework. Their approach involved constructing a dynamic interaction graph of vehicles, applying a graph attention network [
32] to capture spatiotemporal inter-vehicular features, and integrating deep reinforcement learning to optimize driving strategies. Model performance was assessed using true warning and false-positive rates, with the system achieving 80% true warning accuracy. Despite this result, the study does not specify which public datasets were employed or describe their statistical properties, lacks comparisons with other graph-based or sequence-model architectures, and does not evaluate robustness under varying V2V latencies, packet-loss conditions, or heterogeneous traffic densities. These limitations highlight important directions for future research.
Based on the evaluated research, the most frequently applied models for collision prediction are LSTM and CNN-LSTM. The dominant scenarios considered involve either vehicle-to-vehicle or vehicle-to-pedestrian interactions. A key limitation of the reviewed studies is that the proposed models are often evaluated in only a single scenario, which prevents validation of their robustness across diverse traffic conditions. Another gap is the reliance on the simplest form of time-dependent models, such as basic LSTM architectures, without exploring more advanced variants. In addition, many studies fail to address the imbalance in data distribution by employing proper sampling methods, which undermines the reliability of the reported results. Furthermore, the parameters and architectures of the models are not systematically optimized through hyperparameter tuning frameworks, leaving performance improvements unexplored. To address these issues, the following solutions are proposed:
Evaluating the proposed model for different VRUs, such as pedestrians and motorcyclists. Different collision scenarios in conjunction with three ways are investigated in this research.
Evaluating different DL models, such as bidirectional LSTM, CNN-LSTM, and the Transformer model, for detecting the collision scenarios.
Proposing a hyperparameter tuning strategy to tune the previous models based on decreasing the false-positive responses and increasing true positive responses.
Proposing a new collision detection system with the ability to store information and update the weights of the models as an online learning strategy.
4. Experimental Results
This section examines the training process and the hyperparameters employed in model development. The training procedure includes selecting the batch size, optimizer, and loss function, and applying early stopping. The specific model configurations are detailed in the subsequent sections.
4.1. Training Settings
The batch size used for training varied between scenarios. For Scenario A, the batch size was set to 256, while Scenario B, having a larger number of samples, used a batch size of 2048. All models were trained for up to 100 epochs. To prevent unnecessary training, early stopping was applied: the model’s accuracy was monitored, and training was terminated if no improvement was observed within 20 consecutive epochs. This approach also helped streamline hyperparameter tuning by reducing training time.
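The early-stopping rule described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `train_with_early_stopping` and the toy accuracy curve are hypothetical, and the per-epoch accuracies stand in for a real train-and-evaluate step.

```python
def train_with_early_stopping(accuracy_per_epoch, max_epochs=100, patience=20):
    """Return (stop_epoch, best_accuracy) for a sequence of per-epoch accuracies.

    accuracy_per_epoch is a stand-in (assumption) for evaluating the model
    after each training epoch.
    """
    best_acc = float("-inf")
    epochs_without_improvement = 0
    last_epoch = 0
    for epoch, acc in enumerate(accuracy_per_epoch[:max_epochs], start=1):
        last_epoch = epoch
        if acc > best_acc:
            # New best accuracy: reset the patience counter.
            best_acc = acc
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # no improvement for `patience` consecutive epochs
    return last_epoch, best_acc


# Toy curve (made up): accuracy improves for 10 epochs, then plateaus.
curve = [0.90 + 0.005 * min(e, 10) for e in range(1, 101)]
stop_epoch, best = train_with_early_stopping(curve)
print(stop_epoch, round(best, 3))
```

With a plateau starting after epoch 10, the loop stops well before the 100-epoch budget, which is the source of the reduced tuning time mentioned above.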
The chosen optimizer is Adaptive Moment Estimation with decoupled Weight decay (AdamW). AdamW extends the Adam optimizer by decoupling weight decay from the moment-based gradient update: regularization is applied directly to the parameters rather than folded into the adaptive learning-rate term, while per-parameter first- and second-moment estimates of the gradients are maintained as in Adam [47]. The chosen loss function is binary cross-entropy with logits, which combines a sigmoid activation and the standard binary cross-entropy into one numerically stable operation. We allocate 80% of the data for training, with the remaining 20% split equally between validation and testing (10% each). Applying the sampling method to the training set resulted in a distribution of 71% normal and 29% collision samples for Scenario A, and 83% normal to 17% collision samples for both Scenario B1 and Scenario B2. The distribution in both the validation and test sets mirrored the ratios observed in the original dataset.
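To illustrate why fusing the sigmoid and the cross-entropy is numerically safer, the sketch below compares a stable single-sample formulation with a naive two-step version. The function names are hypothetical and the code is written in plain Python for clarity; it is not the training code used in this study.

```python
import math


def bce_with_logits(logit, target):
    """Stable binary cross-entropy on a raw logit; target is 0.0 or 1.0."""
    # max(x, 0) - x*y + log(1 + exp(-|x|)) never exponentiates a large
    # positive number, so it cannot overflow.
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def naive_bce(logit, target):
    """Sigmoid followed by cross-entropy; breaks down for large |logit|."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


# Both versions agree for moderate logits.
print(bce_with_logits(0.3, 1.0), naive_bce(0.3, 1.0))

# For an extreme logit the fused form stays finite, whereas the naive
# version would evaluate log(0) and raise an error.
print(bce_with_logits(1000.0, 0.0))
```

The fused form is what makes the loss safe to use with mixed-precision training, where saturated sigmoid outputs are more likely.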
To train the model, we used an NVIDIA A40 GPU, which provides high throughput and efficient memory handling, especially with large batch sizes like 1024. The A40’s ample video random-access memory (VRAM) and tensor core acceleration allow for mixed-precision training to speed up computation and reduce memory usage. The allocated VRAM for the training is 124 GB. The designated CPU for training is Intel’s 11th-Gen Core i7 with 20 cores.
4.2. Evaluation Parameters
The aim of the proposed model is classification; thus, the evaluation criteria are derived from the confusion matrix. Metrics such as accuracy, recall, precision, F1-score, true-positive, and false-positive predictions are reported in this article. These criteria are calculated as follows:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\text{Precision} = \frac{TP}{TP + FP}$$
$$\text{Recall} = \frac{TP}{TP + FN}$$
$$\text{F1-score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
where $TP$, $TN$, $FP$, and $FN$ refer to the true-positive, true-negative, false-positive, and false-negative predictions. We also compare the number of correctly predicted collisions with the total number of collision events.
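As a small worked example, the four metrics named above can be computed directly from confusion-matrix counts. The helper name `classification_metrics` and the counts passed to it are made up for illustration and are not results from this study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 from confusion counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Illustrative counts only: many normal samples, few collisions.
m = classification_metrics(tp=40, tn=9900, fp=10, fn=50)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

In this made-up case the accuracy is high even though fewer than half of the collisions are detected, which is why recall and precision are reported alongside accuracy.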
4.3. Scenario A
Three different models are evaluated in this research. The CR differs between scenarios: for the jaywalking scenario, the CR after sampling is 66.7%, while the CR for Scenario B is 44%. The results achieved for all the metrics are shown in
Table 3.
The prediction horizon is two future time steps.
Table 3 displays the performance outcomes across all scenarios. Although results vary, the CNN–LSTM hybrid consistently outperforms the other models. The bidirectional LSTM ranks second, and the Transformer model comes in last. Transformer models struggle with time series forecasting primarily because their self-attention mechanism treats inputs more like sets than ordered sequences. At the same time, full self-attention scales quadratically with sequence length, forcing practitioners either to trim look-back windows, losing vital temporal context, or to down-sample aggressively, which blurs out fine-grained fluctuations that models like CNN–LSTM naturally capture [
48]. On top of these architectural gaps, vanilla Transformer models are massively overparameterized for most real-world time series datasets [
9,
46]. They require enormous amounts of training data to avoid overfitting, while typical forecasting tasks have modest history and noisy measurements. In this research, the simplest form of the Transformer model was used. Convolutional components in the CNN–LSTM framework extract varied feature channels from the time series, which the LSTM units then employ to model and retain temporal dynamics. The bidirectional LSTM acts as a memory mechanism that evaluates time-dependent relationships in the data from both past and future contexts. The results of the confusion matrix for the investigated scenario are shown in
Table 4.
4.4. Scenario B
The second scenario is based on the dataset referenced in [
30]. The samples for Scenario B are divided into two phases, as illustrated in
Figure 3.
The first phase, B1, involves a vehicle moving toward the top right and crossing the path of a motorcyclist. The second phase, B2, consists of a car moving in a straight line while the motorcyclist moves to the left. The results for both phases are summarized in
Table 5.
Table 5 presents the performance of the proposed methods. Similarly to
Table 3, the bidirectional LSTM and CNN-LSTM outperform the Transformer model. For Scenario B1, the best model is the CNN-LSTM, and for Scenario B2, the best model is the bidirectional LSTM. The FP and CDP values indicate that, even with a lower CR, the proposed model can detect accidents.
To understand why the Transformer underperforms the other models, we examined its architecture more closely and visualized the attention-head feature maps. The results are shown in
Figure 9.
Figure 9 demonstrates that Head 8 provides the most concentrated attention, characterized by sharper and more localized activation patterns. In comparison, Head 0 exhibits a broadly distributed focus, capturing general contextual information across the dataset. A similar diffuse distribution is observed in Heads 3 and 5, indicating that the model does not strongly emphasize critical temporal or spatial features, but instead allocates attention uniformly. By contrast, CNN-LSTM and bidirectional LSTM architectures are able to capture localized spatial features, such as variations in acceleration and velocity, which are essential for effective collision detection. Additionally, the Transformer model suffers from limited feature diversity. With only 11 input features, the use of a 128-dimensional embedding space may be excessive, potentially resulting in underfitting or over-smoothing.
An evaluation of all models indicates that, while their reported accuracies appear high, the corresponding precision and recall values are considerably lower. This inflated accuracy stems from the dominance of normal instances within the validation and test sets. In certain scenarios, the proportion of positive cases is extremely limited, with ratios of 0.0057 in Scenario A and 0.0010 in Scenarios B1 and B2. Consequently, even a model that classifies all samples as normal achieves an accuracy exceeding 99%. To address this imbalance, our study emphasizes TP and TN, aiming not only to improve overall accuracy but also to reduce FP while increasing TP and TN.
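The effect of this imbalance on accuracy can be checked with a short calculation. The sketch below evaluates a trivial baseline that labels every sample as normal, using the positive ratios quoted above; the sample count of one million and the function name are arbitrary choices for illustration.

```python
def all_normal_baseline(positive_ratio, n_samples=1_000_000):
    """Accuracy and recall of a classifier that always predicts 'normal'."""
    positives = round(positive_ratio * n_samples)
    negatives = n_samples - positives
    tp, fp = 0, 0                    # nothing is ever flagged as a collision
    tn, fn = negatives, positives
    accuracy = (tp + tn) / n_samples
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, recall


for name, ratio in [("Scenario A", 0.0057), ("Scenarios B1/B2", 0.0010)]:
    acc, rec = all_normal_baseline(ratio)
    print(f"{name}: accuracy={acc:.4f}, recall={rec:.1f}")
```

The baseline exceeds 99% accuracy in both cases while detecting no collisions at all, so accuracy alone cannot rank the models.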
The best model is selected based on consistent performance across all metrics. However, high FP rates lead to frequent false alarms and reduce the suitability of the proposed framework for autonomous driving systems. The results based on the confusion matrix are shown in
Table 6 and
Table 7.
As the aforementioned confusion matrices show, the main focus of the proposed model is to increase TP while decreasing FP. Model performance depends on the decision threshold chosen during hyperparameter tuning, where we focused on increasing TP and decreasing FN.
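One way to make the TP/FP trade-off concrete is a threshold sweep over predicted collision probabilities. The sketch below is an assumption about how such a selection could be done, not the tuning procedure used in this study: the selection rule (maximize TP subject to a cap on FP, breaking ties toward fewer FP), the function names, and the toy scores are all hypothetical.

```python
def confusion_at(scores, labels, threshold):
    """Count TP, FP, FN when predicting 'collision' for scores >= threshold."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y == 1)
    fp = sum(1 for s, y in pairs if s >= threshold and y == 0)
    fn = sum(1 for s, y in pairs if s < threshold and y == 1)
    return tp, fp, fn


def pick_threshold(scores, labels, max_fp):
    """Pick the threshold with the most TP among those with FP <= max_fp."""
    best = None
    for t in sorted(set(scores)):
        tp, fp, fn = confusion_at(scores, labels, t)
        if fp > max_fp:
            continue
        # Prefer more TP; on ties, prefer fewer FP.
        if best is None or (tp, -fp) > (best[1], -best[2]):
            best = (t, tp, fp, fn)
    return best


# Toy predicted probabilities and ground-truth labels (1 = collision).
scores = [0.05, 0.20, 0.35, 0.40, 0.55, 0.70, 0.80, 0.95]
labels = [0, 0, 1, 0, 0, 1, 1, 1]
best = pick_threshold(scores, labels, max_fp=1)
print(best)  # (threshold, TP, FP, FN)
```

Lowering the threshold further would recover the remaining collision in this toy data, but only at the cost of additional false alarms.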
The proposed model for the second scenario forecasts collisions two time steps ahead of the current time, which also supports collision avoidance. As the number of predicted future time steps increases, the performance of the model decreases and the FP rate increases.
5. Discussion
Autonomous driving is part of a new area of automation. An autonomous driving system must provide safety for all objects on the road. In this research, we have focused on identifying scenarios with collision possibilities so that they can be avoided sooner. The scenarios cover possible collisions between pedestrians and vehicles, as well as between vehicles and motorcyclists. In both cases, moving cars are used to assess the collision probability. Collisions between vehicles and motorcyclists are considered along two different paths. This study examined various collision scenarios involving vulnerable road users to enable the provision of effective early warnings.
The number of time steps for the future prediction is two. DL models, namely the Transformer model, the bidirectional LSTM, and CNN LSTM, are used for future predictions.
5.1. Comparison
This section presents a performance analysis of related studies and compares their results with the findings of the proposed approach. A comprehensive comparison between the proposed methods and existing research is presented in
Table 8.
The proposed model comprises 34,945 trainable parameters and is structured with one convolutional layer, three LSTM layers, and a single dense layer. In comparison with related studies, it remains relatively lightweight and does not require extensive parameterization for training and evaluation. The studies used for comparison similarly relied on datasets organized according to the
format for model assessment. The main article for comparison is the work by Ribeiro et al. [
30]. Ribeiro et al. [
30] used LSTM for collision detection and reported 95% and 96% accuracies for the B1 and B2 scenarios, respectively. The proposed methods in this research outperform them using CNN-LSTM and bidirectional LSTM for Scenarios B1 and B2, respectively. Another improvement over prior research lies in the reduction in false positives (FPs): the proposed method lowers the number of FPs to 7 (from 39) in Scenario B1 and to 6 (from 33) in Scenario B2.
Compared to similar research, DL models based on time series architectures, namely LSTM and CNN-LSTM, have shown promising results. In the jaywalking scenario, the proposed model has outperformed similar works, such as Parada et al. [
11]. Parada et al. [
11] worked on collision detection between vehicles and VRUs. Compared to [
11], the proposed methods improved the accuracy by 3.73 percentage points (from 96% to 99.73%). The improvement was achieved using the CNN-bidirectional LSTM rather than a plain LSTM. The proposed model outperformed similar projects through the combination of T-SMOTE and the CNN-bidirectional LSTM. Increasing the number of collision samples with T-SMOTE helped the model reduce FPs and increase TPs. Changing the architecture to a combination of a CNN with a bidirectional LSTM also improved long-term feature extraction.
While the proposed model employs T-SMOTE as the primary sampling method, additional imbalance handling strategies such as focal loss and weighted sampling were also considered [
49]. Focal loss was configured with gamma = 2 and alpha = 0.0057 for the jaywalking scenario, and alpha = 0.0010 for scenarios B1 and B2, corresponding to the respective sampling ratios and reflecting the substantial imbalance between collision and normal instances. The CNN-LSTM model was selected as the benchmark architecture for evaluating these strategies. The resulting performance metrics are presented in
Table 9.
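For reference, a single-sample focal loss with the gamma and alpha values quoted above can be sketched as follows. The code assumes the common convention in which alpha weights the positive (collision) class and 1 - alpha weights the normal class; the source does not state which convention was used, and the function names are hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def focal_loss(logit, target, gamma=2.0, alpha=0.0057):
    """Focal loss for one sample; target is 1 (collision) or 0 (normal).

    Assumed convention: alpha weights the positive class and (1 - alpha)
    the negative class.
    """
    p = sigmoid(logit)
    p_t = p if target == 1 else 1.0 - p
    alpha_t = alpha if target == 1 else 1.0 - alpha
    p_t = min(max(p_t, 1e-12), 1.0)  # guard against log(0)
    # (1 - p_t)^gamma down-weights samples the model already gets right.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# A confidently correct collision contributes far less than a missed one.
print(focal_loss(4.0, 1), focal_loss(-4.0, 1))
```

Under this assumed convention, setting alpha to the positive-class ratio assigns collision samples the smaller class weight, so the behaviour of the loss depends strongly on which class alpha is attached to.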
As shown in
Table 9, the severity of the imbalance limited the effectiveness of both focal loss and weighted sampling strategies. Although the models achieved high accuracy—primarily due to the dominance of normal samples—their performance in terms of recall and precision remained weak, indicating poor sensitivity to collision events.
5.2. Ablation Study
Up to this point, our discussion has focused on simulation results. In this section, we turn to the real-world NGSIM dataset, where we introduce realistic conditions such as added noise, communication latency, and occlusion, and then re-evaluate the model under these scenarios. NGSIM was collected between 2005 and 2006 across locations in Los Angeles, Emeryville, and Atlanta. It provides detailed vehicle trajectory information, including records of cars, motorcycles, and other vulnerable road users. Using this real-world dataset, we evaluated our proposed model, and the corresponding results are presented in
Table 10.
As shown, the CNN–LSTM model optimized with grid search performed better than comparable studies. The proposed model was evaluated by introducing Gaussian noise, shifting frames, and applying an occlusion length of 10, with a maximum frame shift of two frames at a time. Noise was added to the velocity, acceleration, longitude, and latitude features. To assess performance in a real-world scenario, we further evaluated the model on a new dataset, and the results are presented in
Table 11.
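The perturbations described above (Gaussian noise, frame shifts of at most two frames, and an occlusion of length 10) can be sketched as below. The noise standard deviation, the choice to model latency as delivery of a slightly older frame, the choice to model occlusion by repeating the last visible frame, and the function name are all assumptions made for illustration; the source does not specify these details.

```python
import random


def perturb_sequence(frames, noise_std=0.05, max_shift=2,
                     occlusion_len=10, seed=0):
    """Return a perturbed copy of a list of per-frame feature dicts."""
    rng = random.Random(seed)
    noisy_keys = ("velocity", "acceleration", "longitude", "latitude")
    n = len(frames)

    # 1) Gaussian noise on the four features named in the text.
    noisy = [
        {k: (v + rng.gauss(0.0, noise_std)) if k in noisy_keys else v
         for k, v in f.items()}
        for f in frames
    ]

    # 2) Latency (assumed model): each step receives a frame that is
    #    up to max_shift frames old.
    shifted = []
    for i in range(n):
        delay = rng.randint(0, max_shift)
        shifted.append(dict(noisy[max(0, i - delay)]))

    # 3) Occlusion (assumed model): the last visible frame is repeated
    #    for occlusion_len consecutive steps.
    if n > occlusion_len:
        start = rng.randint(1, n - occlusion_len)
        for i in range(start, start + occlusion_len):
            shifted[i] = dict(shifted[start - 1])
    return shifted


# Toy trajectory of 30 frames with made-up values.
frames = [{"velocity": float(i), "acceleration": 0.1,
           "longitude": 10.0 + i, "latitude": 20.0, "heading": 90.0}
          for i in range(30)]
perturbed = perturb_sequence(frames)
print(len(perturbed), perturbed[0])
```

Fixing the seed keeps the perturbed test set identical across models, so differences in recall and false-positive rate reflect the architectures rather than the random draw.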
As presented in
Table 11, introducing noise and modifying the frame index in the dataset leads to a decline in model performance, particularly in terms of recall, while also increasing the false-positive rate. Among the tested approaches, the CNN–LSTM architecture demonstrates the strongest resilience to these perturbations and frame shifts. Consequently, the CNN–LSTM optimized through grid search emerges as the most effective model for real-time collision detection, even under sensor noise conditions.
To further examine the effectiveness of the proposed model, we extended the evaluation on the real-world NGSIM dataset. The proposed approach was benchmarked against more recently introduced models, including the Graph Neural Network (GNN) [
50] and the Temporal Fusion Transformer (TFT) [
51]. We implemented a Graph Neural Network (GNN) comprising four graph convolutional layers, with a hidden dimension of 64 and a final output dimension of 2 to distinguish between normal and collision detection. To mitigate overfitting, we incorporated dropout layers and employed ReLU as the activation function following each graph convolutional layer.
In addition, a TFT was implemented with a hidden size of 64 and four attention heads. The architecture comprised four Gated Residual Networks and two LSTM layers within the TFT, supplemented by two gated residual layers and a gated normalization layer. The total number of trainable parameters for the TFT was 207,000. The performance of both models, trained on Scenario B1, Scenario B2, jaywalking, and NGSIM datasets, is presented in
Table 12.
Table 12 indicates that the GNN outperforms the TFT. The relatively weaker performance of the TFT can be attributed to challenges similar to those faced by multi-head attention and gated recurrent networks, namely their limited ability to capture the diversity among input features. In contrast, the GNN generates embedding vectors from the input features and leverages vehicle IDs as edges to construct the network, enabling it to more effectively represent variations in acceleration and velocity compared to the TFT. Nevertheless, both evaluated models performed less favorably than the CNN LSTM optimized via random search, which achieved superior results for collision detection across all scenarios.
5.3. Early Warning System
The proposed model has demonstrated strong performance in real-time collision detection. However, developing an effective early warning system requires predicting collisions in advance. To evaluate the models’ capability for future collision prediction, the next five and ten time steps were used as training targets. The corresponding results are presented in
Table 13.
Table 13 shows that as the prediction horizon increases, the accuracy of the forecasts gradually declines. Among the evaluated models, the CNN-LSTM and the Transformer achieve the best performance for future prediction. In contrast, the bidirectional LSTM performs the worst, as its memory units are less effective for long-term forecasting than those of the CNN-LSTM and the Transformer.
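One way to build training targets for the five- and ten-step horizons is to pair each window of past observations with the label that occurs a fixed number of steps after the window ends. The sketch below assumes this interpretation; the function name, the window length, and the toy sequence are hypothetical and are not taken from the experiments.

```python
def make_horizon_samples(features, labels, window, horizon):
    """Pair each window of past features with the label `horizon` steps
    after the last time step of that window."""
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same length")
    samples = []
    last_start = len(features) - window - horizon
    for start in range(last_start + 1):
        end = start + window                  # exclusive end of the input window
        target_index = end - 1 + horizon      # label `horizon` steps ahead
        samples.append((features[start:end], labels[target_index]))
    return samples

# Hypothetical sequence of 20 time steps; a collision is labelled from step 15 on.
toy_features = [[float(t)] for t in range(20)]
toy_labels = [0] * 15 + [1] * 5

five_ahead = make_horizon_samples(toy_features, toy_labels, window=5, horizon=5)
ten_ahead = make_horizon_samples(toy_features, toy_labels, window=5, horizon=10)
print(len(five_ahead), len(ten_ahead))  # 11 6
```

A longer horizon leaves fewer usable samples from the same sequence and asks the model to extrapolate further from the last observation, which is consistent with the accuracy decline reported for the ten-step horizon.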
5.4. Real-World Application
The proposed model in this research is part of a collision detection and avoidance system. The model uses geolocation, acceleration, and velocity as input features. The CNN-Bidirectional LSTM has 236,945 parameters, and the bidirectional LSTM has 34,945. For 1000 instances, the response time is 2 ms for the CNN-LSTM and 1 ms for the bidirectional LSTM. Thus, the proposed models can run on a local device that receives the features and predicts collisions. All instances involving vehicles and VRUs are recorded and sent to the cloud server for updating the models. The proposed system is shown in
Figure 10. The proposed system supports both collecting a suitable dataset and avoiding possible collisions. The model can also be further improved using the data collected by the system. Deploying the proposed model in real-time scenarios is a gradual process that requires establishing infrastructure across edge, fog, and cloud computing environments. To ensure responsiveness and robustness, the model must be continuously retrained with updated datasets. Financial constraints represent an additional challenge that must be acknowledged in the development process. To mitigate costs, the model should initially be deployed within edge and fog systems, with cloud integration added as needed. Once it has demonstrated success in predicting collisions under diverse scenarios, the approach can be scaled for deployment across all vehicles in the traffic network.
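To make the reported model sizes concrete for edge deployment, the following sketch counts the parameters of a bidirectional LSTM classifier and estimates its weight storage. The layer sizes are assumptions rather than values stated in this section: a single bidirectional layer with 64 units per direction, 3 input features, and a one-unit dense output, using the standard formulation with one bias vector per gate. This configuration happens to reproduce the reported 34,945 parameters, but other architectures could yield the same total. The memory estimate assumes 32-bit floating-point weights.

```python
def lstm_params(input_dim, units):
    """Parameters of one LSTM direction: four gates, each with input weights,
    recurrent weights, and a single bias vector."""
    return 4 * (units * (input_dim + units) + units)

def bilstm_classifier_params(input_dim, units, n_outputs=1):
    """Bidirectional LSTM followed by a dense output layer."""
    recurrent = 2 * lstm_params(input_dim, units)    # forward + backward pass
    dense = (2 * units) * n_outputs + n_outputs      # weights + bias on concatenated states
    return recurrent + dense

def float32_kib(n_params):
    """Approximate weight storage in KiB, assuming 4 bytes per parameter."""
    return n_params * 4 / 1024

# Assumed configuration: 3 input features, 64 units per direction.
total = bilstm_classifier_params(input_dim=3, units=64)
print(total)                          # 34945
print(round(float32_kib(total), 1))   # 136.5
print(round(float32_kib(236945), 1))  # 925.6 (CNN-Bidirectional LSTM count from the text)
```

Under these assumptions, both models need well under 1 MiB for their weights, which supports the claim that they fit on a local edge device. Frameworks that store two bias vectors per gate would report a slightly higher count for the same architecture.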
Beyond the technical evaluation, the system has clear pathways to real-world application. In autonomous driving, collision avoidance is a critical safety requirement. Integrating the proposed model into advanced driver assistance systems can enhance responsiveness in urban environments where VRUs such as pedestrians and cyclists are at high risk [
52]. Similarly, in fleet management, logistics companies can deploy the model across connected vehicles to reduce accident rates, lower insurance costs, and improve operational safety. In addition, within the context of smart cities, the proposed system can be embedded into intelligent transportation infrastructures that combine edge, fog, and cloud computing. By leveraging real-time data from traffic sensors, connected vehicles, and pedestrian monitoring systems, the model can support proactive traffic management strategies such as dynamic rerouting, adaptive traffic signals, and early collision warnings. This integration aligns with broader smart city initiatives aimed at reducing accidents, improving mobility efficiency, and enhancing public safety [
53].
5.5. Limitations and Future Work
One limitation of the proposed model is that it currently focuses only on detecting collisions between vehicles, motorcyclists, and pedestrians, excluding other types of VRUs, such as cyclists and scooter riders. Expanding the model to cover all VRUs would make it more comprehensive and enhance safety. Another limitation is response time, which could be improved by making the model smaller and more lightweight without compromising accuracy. For future work, the model could be evaluated on new datasets to ensure robust performance under varied conditions and validated in real-world scenarios. The platform can be deployed on a Raspberry Pi 5 (16 GB), enabling the trained model to communicate with other devices via BSM messages and detect collisions in real time, which would demonstrate its potential for practical on-road applications. The proposed model in this study was developed using simulated data. Incorporating datasets that closely resemble real-world samples can further enhance the model’s capability to detect collisions under actual conditions. In particular, collision detection can be evaluated across both light and heavy traffic scenarios by varying traffic light durations.
6. Conclusions
Road safety and traffic efficiency can be substantially enhanced through the application of deep learning (DL) techniques. In this study, a comprehensive review of prior research facilitated the identification of key findings and existing gaps. The evaluated models include Bidirectional Long Short-Term Memory (Bi-LSTM), CNN-LSTM, and Transformer architectures. Three scenarios representing potential collision situations involving pedestrians, vehicles, and motorcyclists were examined. Random search and grid search methods were applied to determine the optimal hyperparameters for all models. Experimental results show that, for pedestrian–vehicle collisions, Scenario A using CNN-LSTM achieved 99.76% accuracy, 99.77% precision, and 99.76% recall. For vehicle–motorcyclist collisions, Scenario B1 using Bi-LSTM attained 99.73% accuracy, 99.15% precision, and 99.15% recall, while Scenario B2 achieved 99.73% accuracy, 97.15% precision, and 97.15% recall using Bi-LSTM. The principal contributions of this research can be summarized as follows:
Framework Development: We proposed a framework that integrates sampling techniques with deep learning model tuning to identify optimal architectures for collision avoidance.
Automatic Model Development: The model architecture is selected automatically, emphasizing a lightweight design with rapid response time to reduce both false-positive and false-negative outcomes in collision detection.
Performance Improvement: The proposed framework outperformed comparable deep learning models, enhancing accuracy as well as true-positive and true-negative rate predictions for collision avoidance involving vehicles, jaywalking pedestrians, and motorcyclists.
Practical Application: The framework delivers a lightweight model suitable for real-time collision avoidance systems, achieving over 99% accuracy, along with a reliable model for early warning applications.
In conclusion, the comparative analysis of the three models indicates that CNN-LSTM achieved superior performance in Scenario A (jaywalking) and Scenario B1, while Bidirectional LSTM yielded better results in Scenario B2. Future research will aim to enhance the proposed models by exploring moderate architectural designs and employing a genetic algorithm optimizer, with performance compared against random and grid search hyperparameter tuning strategies.