Open Access Article

Peer-Review Record

Intelligence Collision Detection Using a Combination of Tuning Base Methods and Convolutional Long Short Term Memory Models

Smart Cities 2026, 9(4), 61; https://doi.org/10.3390/smartcities9040061

by Mohammed Hilfi * and Lubna Alazzawi *

Reviewer 1: Anonymous

Reviewer 2: Anonymous

Reviewer 3: Anonymous

Reviewer 4: Kristián Čulík

Reviewer 5: Uneb Gazder

Submission received: 1 November 2025 / Revised: 8 February 2026 / Accepted: 23 March 2026 / Published: 31 March 2026

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This paper proposes a deep learning-based collision detection system for V2X environments, focusing on vehicle–pedestrian and vehicle–cyclist interactions. The authors employ and tune three models—Bidirectional LSTM, CNN–LSTM, and Transformer—using simulated data from SUMO/VEINS. Through extensive hyperparameter optimization and class-imbalance handling via T-SMOTE, the models achieve high accuracy (e.g., 99.76% for CNN–LSTM in jaywalking scenarios) and reduced false positives, demonstrating potential for real-time edge deployment. Some concerns and suggestions are given as follows.

The study relies entirely on simulated data (SUMO/VEINS), with no validation on real-world V2X or trajectory datasets. This limits the credibility of the claimed performance in real autonomous systems. Could the authors incorporate public real-world datasets (e.g., nuScenes, Waymo Open Dataset) or conduct a sim-to-real analysis to evaluate model generalization? Ablation studies on sensor noise, communication latency, and occlusion would strengthen practical relevance.
The paper uses standard Bidirectional LSTM, CNN–LSTM, and a "simplest form" of Transformer without justifying their suitability relative to more recent architectures (e.g., Temporal Fusion Transformers, Graph Neural Networks for V2X).
The paper only evaluates a 2-time-step prediction horizon without exploring how performance degrades with longer horizons—critical for practical early-warning systems.
While T-SMOTE is used, no comparison with other imbalance-handling techniques (e.g., Focal Loss, weighted sampling, or generative models) is provided. The impact of synthetic samples on temporal integrity is not discussed.
The Transformer model performs poorly, but no in-depth analysis is given (e.g., attention map visualization, positional encoding impact, or data-hungry nature). Why?

Overall, a round of major revision is recommended.

Comments on the Quality of English Language

Good.

Author Response

Reviewer #1:

Comment 1: The study relies entirely on simulated data (SUMO/VEINS), with no validation on real-world V2X or trajectory datasets. This limits the credibility of the claimed performance in real autonomous systems. Could the authors incorporate public real-world datasets (e.g., nuScenes, Waymo Open Dataset) or conduct a sim-to-real analysis to evaluate model generalization? Ablation studies on sensor noise, communication latency, and occlusion would strengthen practical relevance.

Author’s Response: We appreciate your positive feedback on our work. To ensure reliable performance on real-world data, we utilized the Next Generation Simulation (NGSIM) dataset. This dataset was collected between 2005 and 2006 across locations in Los Angeles, Emeryville, and Atlanta. It provides detailed vehicle trajectory information, including records of cars, motorcycles, and other vulnerable road users. Using this real-world dataset, we evaluated our proposed model, and the corresponding results are presented in Table 1.

Table 1. Evaluating the proposed model on the new NGSIM datasets.

Scenario	Model	Accuracy (%)	Precision (%)	Recall (%)	F1-Score (%)
NGSIM	CNN-LSTM optimized by Grid search.	99.39	98.57	98.57	99.19
NGSIM	Bidirectional LSTM optimized by Grid search.	98.44	76.05	69.82	72.81
NGSIM	Transformer optimized by Grid search.	98.53	98.35	98.35	98.35

As shown, the CNN–LSTM model optimized with grid search performed better than comparable studies. The proposed model was evaluated by introducing Gaussian noise, shifting frames, and applying an occlusion length of 10, with a maximum frame shift of 2 frames at a time. Noise was added to the velocity, acceleration, longitude, and latitude features. To assess performance in a real-world scenario, we further evaluated the model on a new dataset, and the results are presented in Table 2.

Table 2. Evaluation of the proposed models on the NGSIM dataset with added noise, approximating real-world conditions.

Scenario	Model	Accuracy (%)	Precision (%)	Recall (%)	F1-Score (%)
NGSIM + Noise	CNN-LSTM optimized by Grid search.	97.39	89.10	82.56	86.25
NGSIM + Noise	Bidirectional LSTM optimized by Grid search.	96.65	75.49	63.80	69.15
NGSIM + Noise	Transformer optimized by Grid search.	94.23	71.23	58.56	65.89

As presented in Table 2, introducing noise and modifying the frame index in the dataset leads to a decline in model performance, particularly in terms of recall, while also increasing the false positive rate. Among the tested approaches, the CNN–LSTM architecture demonstrates the strongest resilience to these perturbations and frame shifts. Consequently, the CNN–LSTM optimized through grid search emerges as the most effective model for real-time collision detection, even under sensor noise conditions. An ablation study detailing these findings has been included in the discussion section (Section 5.2) to inform readers of these additional insights.
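The perturbations described for Table 2 (Gaussian noise on velocity, acceleration, longitude, and latitude; a frame shift of at most 2 frames; an occlusion of length 10) can be sketched as follows. This is only an illustration under stated assumptions: the function name `perturb_sequence`, the noise standard deviation, the dictionary-per-frame layout, and the choice to model occlusion by holding the last visible frame are all hypothetical and are not taken from the authors' implementation.

```python
import random

def perturb_sequence(frames, noise_std=0.05, max_shift=2, occlusion_len=10, seed=0):
    """Return a perturbed copy of a trajectory given as a list of per-frame dicts."""
    rng = random.Random(seed)
    noisy_keys = ("velocity", "acceleration", "longitude", "latitude")

    # 1) Additive Gaussian noise on the four kinematic features (noise_std is assumed).
    noisy = [dict(f) for f in frames]
    for f in noisy:
        for k in noisy_keys:
            f[k] += rng.gauss(0.0, noise_std)

    # 2) Frame shift: each frame is replaced by one at most max_shift frames away,
    #    clamped to the valid index range.
    n = len(noisy)
    shifted = []
    for i in range(n):
        j = min(max(i + rng.randint(-max_shift, max_shift), 0), n - 1)
        shifted.append(dict(noisy[j]))

    # 3) Occlusion (assumed model): hold the last visible frame for occlusion_len frames.
    if n > occlusion_len:
        start = rng.randint(1, n - occlusion_len)
        for i in range(start, start + occlusion_len):
            shifted[i] = dict(shifted[start - 1])
    return shifted
```

Fixing the seed keeps the perturbed test set reproducible, so that every model is evaluated on the same degraded sequences.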

Changes to the manuscript: Page 21, Paragraph 3, has been added and highlighted. Page 21, Table 10, has been added and highlighted. Page 22, Paragraphs 2, 3 and 4, have been added and highlighted. Page 22, Table 12, has been added and highlighted.

Comment 2: The paper uses standard Bidirectional LSTM, CNN–LSTM, and a "simplest form" of Transformer without justifying their suitability relative to more recent architectures (e.g., Temporal Fusion Transformers, Graph Neural Networks for V2X).

Author’s Response: Thank you for your constructive feedback. In response, we have evaluated our approach against more recent models as requested. Specifically, we implemented a Graph Neural Network (GNN) comprising four graph convolutional layers, with a hidden dimension of 64 and a final output dimension of 2 to distinguish between the normal and collision classes. To mitigate overfitting, we incorporated dropout layers and applied ReLU as the activation function after each graph convolutional layer.

Additionally, we utilized the Temporal Fusion Transformer (TFT) with a hidden size of 64 and four attention heads. The architecture includes four Gated Residual Networks and two LSTM layers within the TFT, along with two gated residual layers and a gated normalization layer. The total number of trainable parameters for the TFT amounts to 207,000. The performance of these models, trained on Scenario B1, Scenario B2, jaywalking, and NGSIM datasets, is summarized in Table 3.

Table 3. The results of evaluating the new models on the old and new datasets.

Scenario	Model	Accuracy (%)	Precision (%)	Recall (%)	F1-Score (%)
Jaywalking	Graph Neural Networks	99.41	96.83	96.41	96.12
Baseline Scenario B1	Graph Neural Networks	90.74	82.34	90.74	86.33
Baseline Scenario B2	Graph Neural Networks	90.71	82.14	90.71	85.14
NGSIM	Graph Neural Networks	97.78	95.61	97.78	96.68
Jaywalking	Temporal Fusion Transformers	33.45	33.46	33.45	33.45
Baseline Scenario B1	Temporal Fusion Transformers	49.20	47.10	49.20	47.66
Baseline Scenario B2	Temporal Fusion Transformers	49.87	48.35	49.87	47.52
NGSIM	Temporal Fusion Transformers	50.09	50.04	50.09	49.72

Table 3 indicates that the GNN outperforms the TFT. The relatively weaker performance of the TFT can be attributed to challenges similar to those faced by multi‑head attention and gated recurrent networks, namely their limited ability to capture the diversity among input features. In contrast, the GNN generates embedding vectors from the input features and leverages vehicle IDs as edges to construct the network, enabling it to more effectively represent variations in acceleration and velocity compared to the TFT. Nevertheless, both evaluated models performed less favorably than the CNN‑LSTM optimized via random search, which achieved superior results for collision detection across all scenarios. The evaluation results of the new model have been included in the ablation study section.
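For readers unfamiliar with graph convolution, the sketch below shows a single mean-aggregation layer followed by ReLU, which is the basic operation a stack of graph convolutional layers repeats. It is an assumption-laden illustration: the function `gcn_layer`, the mean aggregation with self-loops, and the explicit edge list are hypothetical, and the response does not specify how edges are derived from vehicle IDs or which GNN variant was used.

```python
def gcn_layer(features, edges, weight):
    """One mean-aggregation graph convolution followed by ReLU.

    features: list of node feature vectors
    edges:    list of undirected (i, j) node-index pairs
    weight:   projection matrix (in_dim x out_dim) as a list of rows
    """
    n = len(features)
    neighbors = [{i} for i in range(n)]  # every node also aggregates itself (self-loop)
    for i, j in edges:
        neighbors[i].add(j)
        neighbors[j].add(i)

    out = []
    for i in range(n):
        # Average the features of node i and its neighbors.
        agg = [sum(features[j][d] for j in neighbors[i]) / len(neighbors[i])
               for d in range(len(features[i]))]
        # Linear projection, then ReLU.
        out.append([max(0.0, sum(agg[d] * weight[d][k] for d in range(len(agg))))
                    for k in range(len(weight[0]))])
    return out
```

In this sketch, connected nodes end up with similar representations after each layer, while isolated nodes keep only their own projected features.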

Changes to the manuscript: Page 22, Paragraphs 1 and 2, have been added and highlighted. Page 22, Table 11, has been added and highlighted. Page 23, Paragraph 1 Line 672, has been revised and highlighted.

Comment 3: The paper only evaluates a 2-time-step prediction horizon without exploring how performance degrades with longer horizons—critical for practical early-warning systems.

Author’s Response: Thank you for your constructive feedback. We retrained and re‑evaluated the models to predict collisions five and ten time steps ahead, and the results are reported below. Each time step corresponds to 0.1 seconds.

Table 4. Evaluation results of the proposed models for prediction horizons of 5 and 10 time steps ahead of the current time frame.

Scenario	Model	Accuracy (%)	Precision (%)	Recall (%)	F1-Score (%)
Time step = 5
Jaywalking	CNN + LSTM	91.93	91.90	91.90	91.90
Jaywalking	Bidirectional LSTM	90.57	90.59	90.50	91.43
Jaywalking	Transformer	85.58	85.58	74.86	79.86
Baseline Scenario B1	CNN + LSTM	94.20	94.25	92.15	92.02
Baseline Scenario B1	Bidirectional LSTM	92.75	92.78	92.78	92.78
Baseline Scenario B1	Transformer	89.64	89.62	89.62	89.62
Baseline Scenario B2	CNN + LSTM	95.10	96.10	94.28	94.12
Baseline Scenario B2	Bidirectional LSTM	95.12	95.15	95.18	95.05
Baseline Scenario B2	Transformer	89.69	89.70	89.89	89.40
Time step = 10
Scenario	Model	Accuracy (%)	Precision (%)	Recall (%)	F1-Score (%)
Jaywalking	CNN + LSTM	80.20	79.10	79.02	78.01
Jaywalking	Bidirectional LSTM	70.12	70.56	70.12	70.12
Jaywalking	Transformer	76.28	76.35	76.36	76.38
Baseline Scenario B1	CNN + LSTM	83.42	84.29	84.85	84.32
Baseline Scenario B1	Bidirectional LSTM	82.93	82.96	82.96	82.96
Baseline Scenario B1	Transformer	83.91	83.91	83.30	83.60
Baseline Scenario B2	CNN + LSTM	84.01	84.07	84.02	84.06
Baseline Scenario B2	Bidirectional LSTM	81.64	81.55	81.02	80.34
Baseline Scenario B2	Transformer	83.25	83.11	83.12	83.10

Table 4 shows that as the prediction horizon increases, the accuracy of the forecasts gradually declines. Among the evaluated models, CNN‑LSTM and the Transformer achieve the best performance for future prediction. In contrast, the bidirectional LSTM performs the worst, as its memory units are less effective for long‑term forecasting compared to CNN‑LSTM and the Transformer. To report the achieved results, we included a section in the discussion chapter that elaborates on the impact of future collision prediction, particularly as the number of time steps increases. In this section, we highlight that models such as CNN‑LSTM can be effectively applied in early warning systems, especially for collision detection up to 0.5 seconds in advance.
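The relationship between the prediction horizon and the warning lead time can be made concrete with a small sketch. It pairs each input window with the label a given number of steps after the window ends, using the 0.1-second step stated in the response. The function names, the window length, and the labeling convention are hypothetical and are not taken from the manuscript.

```python
STEP_SECONDS = 0.1  # one time step, as stated in the response

def make_horizon_samples(labels, window, horizon):
    """Pair each window of frame indices with the label `horizon` steps after its last frame."""
    samples = []
    for end in range(window, len(labels) - horizon + 1):
        inputs = list(range(end - window, end))  # frame indices fed to the model
        target = labels[end - 1 + horizon]       # label `horizon` steps after the last input frame
        samples.append((inputs, target))
    return samples

def lead_time(horizon):
    """Warning lead time in seconds for a horizon given in time steps."""
    return horizon * STEP_SECONDS
```

Under this convention, a horizon of 5 steps corresponds to 0.5 seconds of lead time and a horizon of 10 steps to 1.0 second, and longer horizons leave fewer usable training windows per sequence.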

Changes to the manuscript: Page 23, Table 13, has been added and highlighted. Page 23, Paragraphs 2 and 3, have been added and highlighted.

Comment 4: While T-SMOTE is used, no comparison with other imbalance-handling techniques (e.g., Focal Loss, weighted sampling, or generative models) is provided. The impact of synthetic samples on temporal integrity is not discussed.

Author’s response: Thank you for your constructive feedback. We applied focal loss and weighted sampling during training. The results of evaluating these strategies with the best candidate model (CNN-LSTM) are shown in Table 5.

Table 5. Results of alternative techniques for addressing the class imbalance issue.

Scenario	Model	Accuracy (%)	Precision (%)	Recall (%)	Imbalance Handling Technique
Jaywalking	CNN-LSTM	98.21	14.25	13.29	Focal Loss
Baseline Scenario B1	CNN-LSTM	98.78	23.48	10.69	Focal Loss
Baseline Scenario B2	CNN-LSTM	98.89	22.49	9.32	Focal Loss
Jaywalking	CNN-LSTM	97.23	27.26	26.45	Weighted Sampling
Baseline Scenario B1	CNN-LSTM	97.24	27.32	27.32	Weighted Sampling
Baseline Scenario B2	CNN-LSTM	97.48	25.42	25.42	Weighted Sampling

Focal loss was applied with the following configurations: Focal Loss (gamma=2, alpha=0.0057, task type='binary') for the jaywalking scenario, and Focal Loss (gamma=2, alpha=0.0010, task type='binary') for scenarios B1 and B2. These alpha values align with the respective sampling ratios, reflecting the pronounced imbalance between collision and normal instances in each case. As shown in Table 5, the severity of the imbalance limited the effectiveness of both focal loss and weighted sampling strategies. Although the models achieved high accuracy, primarily due to the dominance of normal samples, their precision and recall remained weak, indicating poor sensitivity to collision events.
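For reference, the binary focal loss with the gamma and alpha parameters named above can be sketched per sample as shown below. This follows the commonly used formulation in which alpha weights the positive (collision) class and 1 - alpha the negative class. Whether the authors' implementation uses the same convention is not stated in the response, so the function `binary_focal_loss` and its default alpha should be read as assumptions.

```python
import math

def binary_focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Focal loss for one sample.

    p: predicted probability of the positive (collision) class
    y: true label, 1 for collision and 0 for normal
    """
    p = min(max(p, 1e-7), 1.0 - 1e-7)            # clamp for numerical safety
    p_t = p if y == 1 else 1.0 - p               # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha   # class weight (convention assumed)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

The factor (1 - p_t) ** gamma shrinks the loss of well-classified samples, so with gamma=0 and alpha=0.5 the expression reduces to half of the ordinary cross-entropy.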

We did not consider models such as Generative Adversarial Networks (GANs), as the synthetic samples they produce could compromise the temporal integrity of the dataset. Instead, we adopted T‑SMOTE in place of conventional SMOTE or random under‑/over‑sampling methods to preserve process integrity, since the introduction of randomness may lead to biased and unreliable outcomes across all evaluated models. T‑SMOTE generates collision samples within normal sequences by exploiting vehicle movements and pattern similarities to collision events. This approach preserves collision‑related patterns while altering long‑term patterns among normal samples. To ensure unbiased evaluation, the validation and test sets remained unchanged. To clarify these methodological choices, subsection 3.4 (Preprocessing) was revised to describe how T‑SMOTE synthesizes new samples without significantly affecting temporal integrity. Furthermore, subsection 4.1 (Training Settings) was updated to include the impact of the sampling method on the final distribution of the training data, resulting in 71% normal and 29% collision samples for Scenario A, and 83% normal and 17% collision samples for both Scenario B1 and Scenario B2.

Finally, in subsection 5.1 (Comparison), we examined alternative preprocessing methods to highlight the extent to which focal loss and weighted sampling contribute to addressing the class imbalance problem.
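Weighted sampling, one of the alternatives compared above, can be illustrated with inverse-frequency weights: each sample is drawn with a probability inversely proportional to the size of its class, so both classes appear about equally often in training batches. This is a generic sketch using the Python standard library. The function names are hypothetical, and the exact weighting scheme used by the authors is not given in the response.

```python
import random
from collections import Counter

def inverse_frequency_weights(labels):
    """Per-sample weights such that every class carries the same total weight."""
    counts = Counter(labels)
    return [1.0 / (len(counts) * counts[y]) for y in labels]

def weighted_sample(labels, k, seed=0):
    """Draw k sample indices with replacement according to the inverse-frequency weights."""
    rng = random.Random(seed)
    weights = inverse_frequency_weights(labels)
    return rng.choices(range(len(labels)), weights=weights, k=k)
```

Because sampling is done with replacement, the rare collision samples are repeated many times, which balances the classes but adds no new collision patterns.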

Changes to the manuscript: Page 21, Table 9, has been added and highlighted. Page 9, Paragraph 4, has been added and highlighted. Page 21, Paragraph 2, has been added and highlighted.

Comment 5: The Transformer model performs poorly, but no in-depth analysis is given (e.g., attention map visualization, positional encoding impact, or data-hungry nature). Why?

Author’s response: Thank you for your valuable comment. We have extracted the attention map for head visualization and the position encoding mapping for jaywalking with vehicle collision detection as well. The results are shown in Figures 1 and 2:

Figure 1. The attention layer produced feature maps corresponding to (a) the first head, (b) the fourth head, (c) the sixth head, and (d) the ninth head.

Figure 1 demonstrates that Head 8 (the ninth head, panel d) provides the most concentrated attention, characterized by sharper and more localized activation patterns. In comparison, Head 0 (the first head, panel a) exhibits a broadly distributed focus, capturing general contextual information across the dataset. A similar diffuse distribution is observed in Heads 3 and 5 (the fourth and sixth heads, panels b and c), indicating that the model does not strongly emphasize critical temporal or spatial features, but instead allocates attention uniformly. By contrast, CNN-LSTM and bidirectional LSTM architectures are able to capture localized spatial features, such as variations in acceleration and velocity, which are essential for effective collision detection.
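One way to quantify the difference between concentrated and diffuse attention described above is the Shannon entropy of each attention row: a uniform row has the maximum entropy, while a sharply peaked row has an entropy near zero. The sketch below is an illustrative diagnostic only. The helper names are hypothetical, and the response does not state that the authors computed such a measure.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention row; lower means more concentrated."""
    return -sum(w * math.log(w) for w in weights if w > 0.0)
```

For a row of 8 equal scores the entropy equals ln(8), and any row that favors one position scores lower, so averaging this value per head gives a simple ranking from diffuse to focused heads.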

Figure 2. Positional embedding map derived from the 128-dimensional embedding vectors.

Figure 2 illustrates how positional information is embedded across the 128-dimensional space to enable the transformer model to infer sequence order. The visualization reveals that dimensions above 70 exhibit lower-frequency patterns, which support the modeling of long-range dependencies. In contrast, dimensions below 60 display high-frequency oscillations, aiding the distinction of closely positioned elements. This indicates that the model relies on positional encoding to interpret feature order; however, it lacks the capacity to capture temporal causality, such as abrupt deceleration or velocity shifts preceding a collision. Additionally, the Transformer model suffers from limited feature diversity. With only 11 input features, the use of a 128-dimensional embedding space may be excessive, potentially resulting in underfitting or over-smoothing.
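The frequency pattern described above (fast oscillation in low dimensions, slow variation in high dimensions) is what the standard sinusoidal positional encoding produces. The sketch below assumes that encoding with a 128-dimensional model size. The response does not say whether the authors used sinusoidal or learned positional embeddings, so this is an assumption made for illustration.

```python
import math

def sinusoidal_encoding(num_positions, d_model=128):
    """Standard sinusoidal positional encoding as a list of rows (one row per position)."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            # Each pair of dimensions (2k, 2k+1) shares one wavelength that grows with k.
            angle = pos / (10000 ** (2 * (i // 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table
```

Moving from position 0 to position 1 changes dimension 0 by sin(1), roughly 0.84, but changes dimension 126 by only about 0.0001, which matches the high-frequency and low-frequency bands visible in such a map.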

To present the findings, we revised the experimental results and, following their reporting, included the attention feature maps to highlight the transformer model’s poor performance in comparison with other models.

Changes to the manuscript: Page 19, Figure 9, has been added and highlighted. Page 19, Paragraphs 4 and 5 have been revised and highlighted.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

1. The abstract mentions two types of conflicts, but when describing them specifically, it refers to evaluations while seemingly presenting the accuracy of classification models.
2. The use of subjects and verbs needs optimization; for example, "this research uses" is inappropriate.
3. There is confusion in verb usage in multiple places. For instance, in lines 45-46, incorrect verb usage makes the meaning unclear—what does "method receives data" mean? In lines 48-50, what does "section two reviews" or "section three describes" mean? There are many similar issues.
4. Punctuation is misused in multiple instances.
5. Excessive grammatical errors and fragmented sentences make the text difficult to read. For example, line 67 first mentions the author's name, followed by "In their research." Such issues are frequent.
6. The literature review is written as a list of who did what, lacking summary and critical commentary.
7. Based on the summary, the author emphasizes that many have used LSTM and CNN-LSTM for conflict prediction, but there is a lack of scenarios simultaneously addressing both pedestrian-vehicle and vehicle-vehicle situations. Why is this the case? It seems illogical. The author merely uses the same methods as others for conflict identification, so what is the innovation or core contribution?
8. There are instances of incomplete sentences.
9. Dataset descriptions need to include example demonstrations.
10. The text in the correlation analysis plot is unclear. Are these indicators suitable for such analysis? The purpose of this analysis is not clear.
11. The conclusions need to be broken down, listing the paper's contributions as bullet points.
12. The title of section 5.3 also appears to be incorrect.

Author Response

Reviewer #2:

Comment 1: The abstract mentions two types of conflicts, but when describing them specifically, it refers to evaluations while seemingly presenting the accuracy of classification models.

Author’s response: Thank you for your constructive feedback. We revised the manuscript to remove the abrupt shift from the collision scenarios (vehicle–pedestrian, vehicle–cyclist) to model evaluation metrics. Specifically, we changed “evaluations of pedestrian–vehicle collision” to “For the pedestrian–vehicle scenario”, which keeps the focus on the scenario rather than the evaluation process.

Changes to the manuscript: Page 1, Paragraph 1 Line 7, has been revised and highlighted.

Comment 2: The use of subjects and verbs needs optimization; for example, "this research uses" is inappropriate.

Author’s response: Thank you for your insightful comment. We have thoroughly revised the manuscript to eliminate any inappropriate use of the phrase “this research.” In addition, we refined the use of verbs and subjects to enhance clarity throughout the text.

Changes to the manuscript: Page 2, Paragraph 1 Line 35, has been revised and highlighted.

Page 13, Paragraph 5 Line 410, has been revised and highlighted. Page 20, Paragraph 2 Line 583, has been revised and highlighted.

Comment 3: There is confusion in verb usage in multiple places. For instance, in lines 45-46, incorrect verb usage makes the meaning unclear—what does "method receives data" mean? In lines 48-50, what does "section two reviews" or "section three describes" mean? There are many similar issues.

Author’s response: Thank you for your valuable comment. Our aim was to use both the section titles and their accompanying descriptions to provide further clarification. We have revised the manuscript to ensure that the remaining content is clearly presented by explicitly outlining the focus of each section. In addition, the introduction has been refined to remove phrasing such as “method receives data” and similar expressions.

Changes to the manuscript: Page 2, Paragraph 1 Line 43, has been revised and highlighted. Page 2, Paragraph 2 Line 53, has been revised and highlighted.

Comment 4: Punctuation is misused in multiple instances.

Author’s response: Thank you for your valuable feedback. We have revised the manuscript, particularly the introduction and related work sections, which contained the most errors. All instances of missing punctuation have been corrected, and in several cases, the sentences were entirely rewritten to align with your recommendations. Furthermore, the paraphrasing process has been extended to the entire manuscript to address all identified issues.

Changes to the manuscript: Page 2, Paragraph 1 Line 33, has been revised and highlighted.

Page 2, Paragraph 1 Line 35, has been revised and highlighted. Page 2, Paragraph 1 Line 43, has been revised and highlighted.

Page 3, Paragraph 1 Line 83, has been revised and highlighted. Page 3, Paragraph 2 Line 88, has been revised and highlighted. Page 3, Paragraph 3 Line 107, has been revised and highlighted. Page 3, Paragraph 4 Line 116, has been revised and highlighted. Page 4, Paragraph 3 Line 149, has been revised and highlighted.

Page 4, Paragraph 3 Line 151, has been revised and highlighted. Page 4, Paragraph 4 Line 166, has been revised and highlighted. Page 5, Paragraph 2 Line 188, has been revised and highlighted. Page 5, Paragraph 3 Line 200, has been revised and highlighted. Page 5, Paragraph 5 Line 223, has been revised and highlighted.

Page 9, Paragraph 2 Line 329, has been revised and highlighted.

Page 13, Paragraph 5 Line 410, has been revised and highlighted. Page 13, Paragraph 6 Line 414, has been revised and highlighted.

Page 14, Paragraph 1 Line 417, has been revised and highlighted. Page 14, Paragraph 1 Line 424, has been revised and highlighted.

Page 15, Paragraph 1 Line 445, has been revised and highlighted. Page 16, Paragraph 1 Line 455, has been revised and highlighted. Page 16, Paragraph 4 Line 481, has been revised and highlighted.

Comment 5: Excessive grammatical errors and fragmented sentences make the text difficult to read. For example, line 67 first mentions the author's name, followed by "In their research." Such issues are frequent.

Author’s response: Thank you for your valuable comment. We have revised the entire related work section and corrected instances where author names were followed by the phrase “In their research.” These have been updated to use only the author names along with a direct description of their contributions.

Changes to the manuscript: Page 2, Paragraph 1 Line 33, has been revised and highlighted.

Page 2, Paragraph 1 Line 35, has been revised and highlighted. Page 2, Paragraph 1 Line 43, has been revised and highlighted.

Comment 6: The literature review is written as a list of who did what, lacking summary and critical commentary.

Author’s response: Thank you for your precise comment. We have revised the manuscript so that, at the end of each reviewed article, the main issue is explicitly identified, ensuring that critical commentary is incorporated throughout. Following each review, we also provide a summary of the evaluated issues along with an explanation of how we intend to address them. In this way, the entire “Related Work” section has been refined to align with your feedback.

Changes to the manuscript: Page 2, Paragraph 1 Line 33, has been revised and highlighted.

Page 2, Paragraph 1 Line 35, has been revised and highlighted. Page 2, Paragraph 1 Line 43, has been revised and highlighted.

Page 9, Paragraph 2 Line 329, has been revised and highlighted.

Comment 7: Based on the summary, the author emphasizes that many have used LSTM and CNN-LSTM for conflict prediction, but there is a lack of scenarios simultaneously addressing both pedestrian-vehicle and vehicle-vehicle situations. Why is this the case? It seems illogical. The author merely uses the same methods as others for conflict identification, so what is the innovation or core contribution?

Author’s response: Thank you for your valuable comment. We reviewed prior models and summarized existing approaches, noting that earlier work often struggled with rare scenarios such as jaywalking and failed to adopt a suitable framework for addressing the imbalance between normal and collision samples. Moreover, previous studies did not adequately tune their models to improve true positive and true negative outcomes while reducing false positives. In this research, we provide a framework to evaluate established models such as CNN-LSTM and LSTM, while also assessing a relatively newer architecture, the Transformer. To address the imbalance issue, we introduced a novel preprocessing pipeline designed to balance normal and collision samples. Additionally, we incorporated a hyperparameter optimization strategy to select the most effective architecture (CNN-LSTM, bidirectional LSTM, or Transformer) based on performance across accuracy, precision, and recall. During fine-tuning, we deliberately minimized the number of layers to produce a lightweight model capable of real-time application with rapid response. The manuscript has been revised to highlight the novelty of the proposed framework, with contributions emphasized in the conclusion and an expanded discussion of its performance in long-term collision detection. Our approach prioritizes automation in model development, focusing on maximizing true positives and true negatives while reducing false positives and false negatives. Minimizing false positives reduces unnecessary warnings, while minimizing false negatives is critical to preventing potential accidents.
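The selection step described above, in which candidate configurations are compared on accuracy, precision, and recall, can be sketched as an exhaustive grid search over confusion-matrix counts. Everything in this sketch is hypothetical: the hyperparameter names, the `toy_evaluate` stand-in (which returns made-up counts instead of training a model), and the choice of F1 as the selection key are assumptions made for illustration, not the authors' pipeline.

```python
from itertools import product

def metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

def grid_search(grid, evaluate, key="f1"):
    """Try every combination in `grid`; `evaluate(config)` must return (tp, fp, tn, fn)."""
    best_config, best_scores = None, None
    names = sorted(grid)
    for values in product(*(grid[n] for n in names)):
        config = dict(zip(names, values))
        scores = metrics(*evaluate(config))
        if best_scores is None or scores[key] > best_scores[key]:
            best_config, best_scores = config, scores
    return best_config, best_scores

def toy_evaluate(config):
    """Placeholder for training and validating a model; the counts are made up."""
    tp = 50 + (20 if config["layers"] == 2 else 0) + (10 if config["lr"] == 0.001 else 0)
    return tp, 10, 890, 100 - tp  # (tp, fp, tn, fn)
```

Selecting on F1 or recall rather than accuracy matters under class imbalance: a model that predicts "normal" for 990 normal and 10 collision samples reaches 99% accuracy with zero recall.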

Changes to the manuscript: Page 5, Paragraph 3, has been revised and highlighted. Page 16, Paragraph 1, has been revised and highlighted. Page 20, Paragraph 5, has been revised and highlighted. Page 25, Paragraph 3, has been revised and highlighted.

Comment 8: There are instances of incomplete sentences.

Author’s response: Thank you for your valuable feedback. We have thoroughly reviewed the entire manuscript and revised sentences with unclear meaning or inconsistencies to correct mistakes and incomplete expressions. In addition, we have carefully checked and corrected all instances of missing punctuation.

Changes to the manuscript: Page 2, Paragraph 1 Line 33, has been revised and highlighted.

Page 2, Paragraph 1 Line 35, has been revised and highlighted. Page 2, Paragraph 1 Line 43, has been revised and highlighted.

Page 3, Paragraph 1 Line 83, has been revised and highlighted. Page 3, Paragraph 2 Line 88, has been revised and highlighted. Page 3, Paragraph 3 Line 104, has been revised and highlighted. Page 3, Paragraph 4 Line 113, has been revised and highlighted. Page 4, Paragraph 3 Line 149, has been revised and highlighted.

Page 4, Paragraph 3 Line 139, has been revised and highlighted. Page 4, Paragraph 4 Line 164, has been revised and highlighted. Page 4, Paragraph 5 Line 176, has been revised and highlighted. Page 5, Paragraph 1 Line 182, has been revised and highlighted. Page 5, Paragraph 2 Line 198, has been revised and highlighted. Page 5, Paragraph 5 Line 224, has been revised and highlighted.

Page 8, Paragraph 2 Line 339, has been revised and highlighted.

Page 12, Paragraph 2 Line 397, has been revised and highlighted. Page 12, Paragraph 3 Line 401, has been revised and highlighted. Page 12, Paragraph 4 Line 404, has been revised and highlighted. Page 12, Paragraph 4 Line 411, has been revised and highlighted.

Page 13, Paragraph 2 Line 432, has been revised and highlighted. Page 13, Paragraph 2 Line 439, has been revised and highlighted.

Page 17, Paragraph 3 Line 543, has been revised and highlighted.

Comment 9: Dataset descriptions need to include example demonstrations.

Author’s response: Thank you for your valuable comment. We have added figures showing the behavior of the objects based on their velocity, acceleration, longitude, and latitude. The added figures are shown in Figures 3 and 4.

Figure 3 presents illustrative examples for Scenario A: (a) Speed versus Longitude, (b) Acceleration versus Longitude, and (c) Acceleration versus Speed.

Figure 4 presents illustrative examples for Scenario B: (a) Latitude versus Longitude in Scenario B1, (b) Acceleration versus Speed in Scenario B1, (c) Latitude versus Longitude in Scenario B2, and (d) Acceleration versus Speed in Scenario B2.

Figure 3 illustrates the range of velocity and acceleration, highlighting their influence on collision occurrence. The results show that most collisions arise under conditions of negative acceleration or near-zero speed. Figure 4 presents similar patterns for the second scenario, where collisions are concentrated around low velocity and negative acceleration. Additionally, Figure 4 indicates that the majority of accidents occur in the middle of the road, corresponding to junction locations. We have revised the manuscript and added the aforementioned figures to the description of dataset section.

Changes to the manuscript: Page 7, paragraph 3, has been revised and highlighted. Page 8, Figure 2, has been revised and highlighted. Page 10, Figure 4, has been revised and highlighted. Page 8, paragraph 2, has been revised and highlighted.

Comment 10: The text in the correlation analysis plot is unclear. Are these indicators suitable for such analysis? The purpose of this analysis is not clear.

Author’s response: Thank you for your valuable feedback. Figure 3 has been revised by enlarging the fonts of the numerical values and the descriptions of each row and column. The figure illustrates both linear and non‑linear relationships among the dataset features, demonstrating that there is no strong correlation between the features and the target collision variable; therefore, feature elimination based on high correlation with the target is unnecessary. In addition, Figure 3 provides insight into the characteristics of each feature, showing how they correlate with one another and how changes in one feature may influence the others, thereby offering readers a clearer understanding of the dataset. Moreover, examining linear correlations among features can support future studies in reducing highly correlated variables, enabling researchers to focus on the most essential features.
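The distinction between linear and non-linear (monotonic) relationships mentioned above is commonly measured with the Pearson and Spearman coefficients. The sketch below implements both with the standard library. The response does not name the coefficients used in the figure, so treating them as Pearson and Spearman is an assumption.

```python
import math

def pearson(x, y):
    """Pearson correlation: strength of the linear relationship between x and y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def ranks(values):
    """Average 1-based ranks, so tied values share the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks (monotonic relationship)."""
    return pearson(ranks(x), ranks(y))
```

A feature that grows with another one non-linearly, such as a cubic relation, gets a Spearman value of 1 but a Pearson value below 1, which is why reporting both can reveal relationships that a purely linear measure understates.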

Changes to the manuscript: Page 11, Figure 5, has been revised and highlighted.

Comment 11: The conclusions need to be broken down, listing the paper's contributions as bullet points.

Author’s response: We appreciate your valuable feedback. The conclusion has been revised to clearly highlight the primary contributions of this research, which are as follows:

- Framework Development: A framework was introduced that integrates sampling strategies with deep learning model tuning to identify optimal architectures for collision avoidance.
- Performance Enhancement: The proposed framework demonstrated superior performance compared to existing deep learning models, improving accuracy as well as true positive and true negative rate predictions for collision avoidance involving vehicles, jaywalking pedestrians, and motorcyclists.
- Practical Application: A lightweight model was delivered that is suitable for real-time collision avoidance systems, achieving over 99% accuracy, together with a reliable model for early warning applications.

Changes to the manuscript: Page 25, paragraphs 3 and 4, have been revised and highlighted.

Comment 12: The title of section 5.3 also appears to be incorrect.

Author’s response: Thanks for noticing this issue. In response to the reviewers’ comments, the numerical reference has been updated to Section 5.5, and the title of Section 5.5 has been revised to “Limitations and Future Work.”

Changes to the manuscript: Page 24, Section 5.5 title, has been revised and highlighted.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

In the introduction, please better highlight the liaison between this study and the concept of a smart city, to better pinpoint the suitability of this paper for this journal. Please elaborate further.

What were the pillars for identifying the architecture components in the elevated transformer model? Please elaborate further.

Accuracy and precision of nearby 100% (i.e., 99.7%, in table 3) might appear as “pseudo”, in that an absolutely ideal model might be of limited predictability and usability. Please comment.

Are there any limitations in the scenarios assumed that could hinder the applicability and the generalizability of the models? Please elaborate.

Section 5.2. Please expand the discussion provided to support cases from real world applications so far.

Please fix the style of the references list.

Comments on the Quality of English Language

moderate changes are needed

Author Response

Reviewer #3:

Comments and Suggestions for Authors

Comment 1: In the introduction, please better highlight the liaison between this study and the concept of a smart city, to better pinpoint the suitability of this paper for this journal. Please elaborate further.

Author’s response: Thank you for your constructive feedback. We have updated the introduction section by adding a discussion that emphasizes the application of the proposed approach to smart city frameworks.

Changes to the manuscript: Page 2, Paragraph 4 Line 74, has been revised and highlighted.

Comment 2: What were the pillars for identifying the architecture components in the elevated transformer model? Please elaborate further.

Author’s response: We appreciate your valuable comment. We used the Transformer because it can handle sequential data. It preserves temporal dependencies that are important for collision prediction. Self-attention also gives direct access to all tokens, without compressing history into a hidden state.

We applied a lightweight Transformer with 128 embedding dimensions and only one layer. To explain its poor performance, we revised the manuscript and included feature maps extracted from the multi-head attention layer. We attempted to use the diversity of multi-head attention to capture different feature maps. However, the model could not capture temporal causality, such as sudden deceleration or velocity changes before a collision.

Another limitation is the small number of input features. With only 11 features, using a 128-dimensional embedding space may be excessive. This can lead to underfitting or over-smoothing. We revised the manuscript to emphasize the findings, explaining both the rationale for using this architecture and the reasons for its limited performance.

Changes to the manuscript: Page 14, Paragraph 1, has been added and highlighted.

Page 18, Paragraph 5 has been revised and highlighted.

Page 19, Paragraph 1 has been revised and highlighted. Page 19, Figure 9, has been added and highlighted.

Comment 3: Accuracy and precision of nearby 100% (i.e., 99.7%, in table 3) might appear as “pseudo”, in that an absolutely ideal model might be of limited predictability and usability. Please comment.

Author’s response: Thanks for your precise comment. The observed high accuracy is largely a result of the overwhelming number of normal instances in the validation and test sets. In certain scenarios, the proportion of positive cases is extremely small, with ratios of 0.0057 in Scenario A and 0.0010 in Scenarios B1 and B2. Consequently, even if the model predicts all samples as normal, the accuracy remains above 99%. This underscores the need to evaluate performance using precision and recall to ensure that both positive and negative cases are correctly identified. For this reason, the model cannot be considered an absolute solution, as its predictability and long-term usability are limited. We have revised the manuscript to explain the source of this inflated accuracy.

Changes to the manuscript: Page 19, Paragraph 1, has been added and highlighted. Page 20, Paragraph 5, has been added and highlighted.
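
The arithmetic behind the inflated accuracy can be checked with a short sketch. It assumes the reported ratios (0.0057 and 0.0010) are the fraction of collision samples in the evaluation set; the function name is illustrative and not taken from the manuscript.

```python
def majority_baseline(positive_ratio):
    """Metrics of a trivial classifier that labels every sample as normal."""
    accuracy = 1.0 - positive_ratio  # every normal sample is correct, every collision is missed
    recall = 0.0                     # no collision is ever detected
    return accuracy, recall

for name, ratio in [("Scenario A", 0.0057), ("Scenarios B1/B2", 0.0010)]:
    accuracy, recall = majority_baseline(ratio)
    print(f"{name}: accuracy={accuracy:.2%}, recall={recall:.0%}")
# Scenario A: accuracy=99.43%, recall=0%
# Scenarios B1/B2: accuracy=99.90%, recall=0%
```

A model that never raises a collision warning therefore already exceeds 99% accuracy in both settings, while its recall is zero.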

Comment 4: Are there any limitations in the scenarios assumed that could hinder the applicability and the generalizability of the models? Please elaborate.

Author’s response: We appreciate your detailed comments. In the manuscript, we have clarified that the evaluated data were generated through simulation. Such simulations must closely approximate real-world scenarios and incorporate variations in traffic flow so that the resulting model is robust across diverse urban traffic conditions. The challenges of deploying the system in real time are also discussed in Section 5.5 Real World Application. To illustrate the implications of using a real-world dataset, we introduced noise into the original data and conducted an ablation study, presented in Section 5.2. This analysis informs readers of the extent to which model accuracy declines when evaluated under simulated real-world conditions.

Changes to the manuscript: Page 24, Paragraph 2 Line 727, has been revised and highlighted. Page 25, Paragraph 1 Line 729, has been revised and highlighted.

Comment 5: Section 5.2. Please expand the discussion provided to support cases from real world applications so far.

Author’s response: Thank you for your valuable comment. We have revised the manuscript and included sample applications of the proposed model, particularly highlighting its role in reducing road accidents and supporting automation in logistics as part of smart city development. In addition, the monitoring system has been examined in real‑world applications, with specific emphasis on controlling traffic flow within urban environments. We have also added references related to the applications of traffic avoidance systems in smart cities to strengthen the discussion.

Changes to the manuscript: Page 23, Paragraph 2 Line 656, has been revised and highlighted. Page 28, References 49 and 50 have been added and highlighted.

Comment 6: Please fix the style of the references list.

Author’s response: Thank you for your valuable comment. We have updated the reference style to align with the MDPI system by configuring BibLaTeX with the numeric style, using biber as the backend, disabling sorting to ensure references appear in citation order, enabling citation sorting for multiple references, and activating hyperlink support while disabling back references. This setup ensures that all references are reported in numerical order and include hyperlinks, thereby providing readers with a clear, consistent, and accessible reference system that matches MDPI requirements.

Changes to the manuscript: Pages 26, 27, 28, and 29 have been revised and highlighted.

Author Response File: Author Response.pdf

Reviewer 4 Report

Comments and Suggestions for Authors

Dear Authors,

I read your article on training artificial intelligence to detect potential collisions between pedestrians, cyclists, and motor vehicles with great interest.

Your paper is of a high standard, and I particularly appreciate the well-prepared literature review, which is very clear, concise, and detailed.

From a formal perspective:

- It would be advisable to reconsider Figure 3, as it may be difficult to read in its current form.
- For your own benefit, I recommend carefully checking all formulas.

From a content perspective:

- Please provide a more detailed description of the dataset used, including example images if this is possible from a copyright perspective.
- From a transport engineering perspective, I also appreciate the proposed practical application of the system; however, in Section 5.2, I recommend discussing potential issues that such a proposal may entail (e.g., economic burden, system reliability, gradual implementation, etc.).
- Table 8 clearly shows that your results are the most accurate; however, does your model require significantly more computational power than other algorithms? It is also unclear what datasets (input data) were used in the other referenced studies.

Thank you.

Author Response

Reviewer #4:

I read your article on training artificial intelligence to detect potential collisions between pedestrians, cyclists, and motor vehicles with great interest.

Your paper is of a high standard, and I particularly appreciate the well-prepared literature review, which is very clear, concise, and detailed.

Comment 1: From a formal perspective: It would be advisable to reconsider Figure 3, as it may be difficult to read in its current form.

Author’s response: We sincerely appreciate your thoughtful feedback and the positive perspective you brought to our article. Following the revisions made in response to other reviewers, Figure 3 is now Figure 5. We revised Figure 5 by adjusting its scale to enhance the readability of both the descriptions and the correlation matrix values.

Changes to the manuscript: Page 11, Figure 5, has been revised and highlighted.

Comment 2: For your own benefit, I recommend carefully checking all formulas.

Author’s response: Thanks for your valuable comment. We have reviewed the formulas and revised Equations (1) and (2) by changing the input notation from ? to ? to ensure consistency with the manuscript’s definition of the input. In addition, Equation (4) has been updated to align with the notation used in other equations for inputs, weights, and biases. The remaining equations were carefully re‑examined to confirm their accuracy and consistency throughout the manuscript.

Changes to the manuscript: Page 12, Equations (1) and (2), have been revised and highlighted.

Comment 3: Please provide a more detailed description of the dataset used, including example images if this is possible from a copyright perspective.

Figure 3 presents illustrative examples for Scenario A: (a) Speed versus Longitude, (b) Acceleration versus Longitude, and (c) Acceleration versus Speed.

Comment 4: From a transport engineering perspective, I also appreciate the proposed practical application of the system; however, in Section 5.2, I recommend discussing potential issues that such a proposal may entail (e.g., economic burden, system reliability, gradual implementation, etc.).

Author’s response: Thanks for your valuable comment. The manuscript has been revised to incorporate the noted issues concerning economic burden, implementation, and system reliability, which have been added to the conclusion of Section 5.4 Real World Application.

Changes to the manuscript: Page 23, Paragraph 4 Line 695, has been revised and highlighted.

Comment 5: Table 8 clearly shows that your results are the most accurate; however, does your model require significantly more computational power than other algorithms? It is also unclear what datasets (input data) were used in the other referenced studies.

Author’s response: We appreciate your valuable comment. During the development process, we prioritized lightweight models to minimize the computational resources required for training and testing. The proposed model consists of 34,945 trainable parameters and employs one convolutional layer, three LSTM layers, and one dense layer. As such, it remains computationally efficient and does not demand additional parameters compared to other approaches. The study has been benchmarked against related research utilizing trajectory data, simulated datasets generated by SUMO, VEINS, or both, as well as urban traffic datasets. These datasets share common characteristics, including traffic flow, geolocation information, and crash likelihood among road users. To clarify these aspects, the discussion section has been revised to highlight the dataset types and the parameter requirements for training the model.

Changes to the manuscript: Page 20, Table 8, has been revised, and the dataset description has been added and highlighted. Page 20, Paragraph 5 Line 592, has been revised and highlighted.
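
To make the parameter-count argument concrete, the sketch below shows how trainable parameters are typically counted for a stack of one convolutional layer, three LSTM layers, and one dense layer. The layer widths, kernel size, and input width used here are hypothetical placeholders, since the response does not list them; the example is not expected to reproduce the reported total of 34,945.

```python
def conv1d_params(kernel_size, in_channels, filters):
    """Weights per filter (kernel_size * in_channels) plus one bias, times the number of filters."""
    return (kernel_size * in_channels + 1) * filters

def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * ((input_dim + units + 1) * units)

def dense_params(input_dim, units):
    """Fully connected weights plus one bias per output unit."""
    return (input_dim + 1) * units

# Hypothetical configuration (assumed for illustration only).
n_features = 11                 # assumed input width
conv_filters, kernel = 32, 3
lstm_units = [32, 32, 16]

total = conv1d_params(kernel, n_features, conv_filters)
previous = conv_filters
for units in lstm_units:
    total += lstm_params(previous, units)
    previous = units
total += dense_params(previous, 1)  # single output unit for collision / no collision
print("Trainable parameters for this hypothetical stack:", total)
```

The count grows roughly with the square of the LSTM width, so keeping the recurrent layers narrow is the main lever for a lightweight model of this kind.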

Author Response File: Author Response.pdf

Reviewer 5 Report

Comments and Suggestions for Authors

Line 12: Which one is the "proposed method"?

Line 31: As per my knowledge, DL is an advanced form of ML? If I am wrong, please add more explanation to differentiate between them, as a lot of readers may have the same perception.

Identify the research gap and justify the need for this study at the end of the introduction. This section should also have some background related to traffic crashes, especially those with vulnerable road users, which are also considered in this research (pedestrians and bicyclists).

What "new hyperparameter tuning" strategy is used in this research? It is mentioned at the end of the literature, but the relevant section (3.6) fails to elaborate on this claim. Then the authors mention hyperparameter optimization as their future work. So it is really confusing.

Table 3: Check the results for the transformer model. How come it has 0 recall and precision with 99% accuracy?

How was the data selected for training, testing, and validation? Did the authors ensure that each set was also balanced with respect to the outcome, or was it done randomly?

Section 5.3: Rephrase the title.

The instances are taken from a simulation. So an important limitation is adjusting the detection mechanism to real-world conditions with changing light patterns and other site features. This should be mentioned.

Author Response

Reviewer #5:

Comment 1: Line 12: Which one is the "proposed method"?

Author’s response: Thanks for your valuable comment. We have revised the manuscript to inform the reader that the proposed model is a combination of CNN and LSTM, further tuned with random search.

Changes to the manuscript: Page 1, Paragraph 1 Line 11, has been revised and highlighted.

Comment 2: Line 31: As per my knowledge, DL is an advanced form of ML? If I am wrong, please add more explanation to differentiate between them, as a lot of readers may have the same perception.

Author’s response: Thanks for your valuable comment. DL uses layered neural networks to learn directly from raw data, while ML often relies on manual feature extraction and simpler models. We have added this sentence to indicate the difference between them more clearly.

Changes to the manuscript: Page 2, Paragraph 1 Line 35, has been revised and highlighted.

Comment 3: Identify the research gap and justify the need for this study at the end of the introduction. This section should also have some background related to traffic crashes, especially those with vulnerable road users, which are also considered in this research (pedestrians and bicyclists).

Author’s response: We appreciate your insightful feedback. The manuscript has been revised to highlight the gaps identified in prior research across two domains: data and model architectures. In particular, we emphasize that earlier studies did not address the issue of class imbalance between collision and non‑collision samples, nor did they propose an effective strategy for identifying optimal parameters for collision detection. We have incorporated background information on traffic crashes, their associated fatalities, and the proportion attributable to the VRUs examined in this study.

Changes to the manuscript: Page 1, Paragraph 1 Line 25, has been revised and highlighted. Page 2, Paragraph 2 Line 52, has been revised and highlighted.

Comment 4: What "new hyperparameter tuning" strategy is used in this research? It is mentioned at the end of the literature, but the relevant section (3.6) fails to elaborate on this claim. Then the authors mention hyperparameter optimization as their future work. So, it is really confusing.

Author’s response: We appreciate your detailed comment. In our study, we employed random and grid search techniques for hyperparameter tuning, which are established methods rather than novel approaches. Our contribution lies in introducing a new objective function that simultaneously considers accuracy, recall, and precision. Accordingly, we have removed any indication that a new hyperparameter tuning method was proposed and have clarified the key parameters used in the tuning process. For future work, alternative optimization strategies such as Particle Swarm Optimization, Gray Wolf Optimizer, and Genetic Algorithms could be explored. To address the issue raised, we noted that Genetic Algorithms were employed as a global optimization approach, enabling comparison with the results obtained from the existing random and grid search hyperparameter tuning methods.

Changes to the manuscript: Page 16, Paragraph 2 Line 462, has been revised and highlighted. Page 26, Paragraph 1 Line 729, has been revised and highlighted.
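
A minimal sketch of random search driven by an objective that combines accuracy, precision, and recall is given below. The equal weighting, the search space, and the `toy_evaluate` stand-in are all assumptions made for illustration; the response does not specify the exact form of the objective function or the tuned parameters.

```python
import random

def composite_score(accuracy, precision, recall, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted sum of the three metrics (equal weights are an assumption)."""
    wa, wp, wr = weights
    return wa * accuracy + wp * precision + wr * recall

def random_search(evaluate, space, n_trials=20, seed=0):
    """Sample configurations at random and keep the one with the best composite score."""
    rng = random.Random(seed)
    best_cfg, best_score = None, float("-inf")
    for _ in range(n_trials):
        cfg = {name: rng.choice(values) for name, values in space.items()}
        score = composite_score(*evaluate(cfg))
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

def toy_evaluate(cfg):
    """Hypothetical stand-in for training a model and returning (accuracy, precision, recall)."""
    quality = 1.0 - abs(cfg["units"] - 32) / 64 - abs(cfg["lr"] - 1e-3) * 50
    return 0.99, max(0.0, quality), max(0.0, quality - 0.05)

space = {"units": [16, 32, 64], "lr": [1e-2, 1e-3, 1e-4]}  # hypothetical search space
best_cfg, best_score = random_search(toy_evaluate, space)
print(best_cfg, round(best_score, 4))
```

With such an objective, a configuration that reaches high accuracy only by predicting every sample as normal (precision and recall of zero) scores about 0.33, so it cannot win the search on accuracy alone.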

Comment 5: How was the data selected for training, testing, and validation? Did the authors ensure that each set was also balanced with respect to the outcome, or was it done randomly?

Author’s response: Thanks for your valuable comment. The original dataset was partitioned into training, validation, and test sets with ratios of 80%, 10%, and 10%, respectively. Following data balancing, the generated samples were distributed across the entire training set. The proportions of collision and non‑collision samples in the validation and test sets remained consistent with the original dataset, ensuring that the proposed model was evaluated under representative conditions rather than overfitted distributions. For clarity, the ratios between collision and normal samples are explicitly reported in Section 4.1 (Training Settings): 71% normal and 29% collision samples for Scenario A, and 83% normal to 17% collision samples for both Scenario B1 and Scenario B2.

Changes to the manuscript: Page 16, Paragraph 4 Line 481, has been revised and highlighted.
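
The splitting procedure described above can be sketched as follows. The sketch assumes a stratified 80/10/10 split and uses plain random duplication of minority samples as a simple stand-in for the synthetic oversampling step used in the paper; the helper names and the toy label vector are hypothetical. The point it illustrates is that only the training indices are rebalanced, while validation and test keep the original class proportions.

```python
import random

def stratified_split(labels, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split indices per class so each subset keeps the original class proportions."""
    rng = random.Random(seed)
    train, val, test = [], [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_train = round(ratios[0] * len(idx))
        n_val = round(ratios[1] * len(idx))
        train += idx[:n_train]
        val += idx[n_train:n_train + n_val]
        test += idx[n_train + n_val:]
    return train, val, test

def oversample_minority(indices, labels, target_ratio, seed=0):
    """Duplicate collision (label 1) indices until they make up roughly target_ratio of the set."""
    rng = random.Random(seed)
    pos = [i for i in indices if labels[i] == 1]
    neg = [i for i in indices if labels[i] == 0]
    needed = int(target_ratio * len(neg) / (1 - target_ratio))
    extra = [rng.choice(pos) for _ in range(max(0, needed - len(pos)))]
    return indices + extra

# Toy labels (hypothetical): 10 collisions among 1000 samples.
labels = [1] * 10 + [0] * 990
train, val, test = stratified_split(labels)
balanced_train = oversample_minority(train, labels, target_ratio=0.29)
share = sum(labels[i] for i in balanced_train) / len(balanced_train)
print(len(train), len(val), len(test), round(share, 3))
```

Because the duplicated indices are drawn only from the training subset, no generated sample can leak into the validation or test sets.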

Comment 6: Section 5.3: Rephrase the title.

Changes to the manuscript: Page 24, Section 5.5 title, has been revised and highlighted.

Comment 7: The instances are taken from a simulation. So, an important limitation is adjusting the detection mechanism to real-world conditions with changing light patterns and other site features. This should be mentioned.

Author’s response: Thanks for your precise comments. We have added this point as a limitation of the present work, to be addressed in future research, and have revised Section 5.3, Limitations and Future Work, accordingly. To narrow the gap between the simulation and real-world scenarios, we have evaluated the model using a real-world dataset called NGSIM. We have also added noise to the dataset and evaluated the model again. The results of evaluating the model on NGSIM are shown in Table 6.

Table 6. Evaluating the proposed model on the new NGSIM datasets.

Scenario	Model	Accuracy (%)	Precision (%)	Recall (%)	F1-Score (%)
NGSIM	CNN-LSTM optimized by Grid search.	99.39	98.57	98.57	99.19
NGSIM	Bidirectional LSTM optimized by Grid search.	98.44	76.05	69.82	72.81
NGSIM	Transformer optimized by Grid search.	98.53	98.35	98.35	98.35

Table 7. Evaluating the proposed model on the new NGSIM datasets with added noise, under conditions close to a real-world scenario.

Scenario	Model	Accuracy (%)	Precision (%)	Recall (%)	F1-Score (%)
NGSIM + Noise	CNN-LSTM optimized by Grid search.	97.39	89.10	82.56	86.25
NGSIM + Noise	Bidirectional LSTM optimized by Grid search.	96.65	75.49	63.80	69.15
NGSIM + Noise	Transformer optimized by Grid search.	94.23	71.23	58.56	65.89

As presented in Table 7, introducing noise and modifying the frame index in the dataset leads to a decline in model performance, particularly in terms of recall, while also increasing the false positive rate. Among the tested approaches, the CNN–LSTM architecture demonstrates the strongest resilience to these perturbations and frame shifts. Consequently, the CNN–LSTM optimized through grid search emerges as the most effective model for real-time collision detection, even under sensor noise conditions. An ablation study detailing these findings has been included in the discussion section (Section 5.2) to inform readers of these additional insights.

Page 25, Paragraph 2 Line 727, has been revised and highlighted.

Page 26, Paragraph 1 Line 729, has been revised and highlighted.
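
The kind of noise ablation described above can be sketched in a few lines. The sketch assumes zero-mean Gaussian noise added independently to every feature value and recomputes precision, recall, and F1 from predictions; the noise model, its scale, and the helper names are assumptions, since the response does not describe how the noise was generated.

```python
import random

def add_gaussian_noise(features, sigma, seed=0):
    """Return a copy of the feature rows with zero-mean Gaussian noise added to each value."""
    rng = random.Random(seed)
    return [[value + rng.gauss(0.0, sigma) for value in row] for row in features]

def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for the positive (collision) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy rows (hypothetical): [speed, acceleration] for two time steps.
clean = [[12.0, -3.5], [0.4, -0.1]]
noisy = add_gaussian_noise(clean, sigma=0.5)
print(noisy)
print(precision_recall_f1([1, 1, 0, 0], [1, 0, 1, 0]))
```

In an ablation of this kind, the same trained model would be evaluated once on the clean rows and once on the perturbed rows, and the drop in recall between the two runs indicates how sensitive it is to sensor noise.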

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

Thank you for your revision. I don't have further comments.

Author Response

Thanks for reviewing our work.

Reviewer 2 Report

Comments and Suggestions for Authors

I think the author has revised the suggestions, and the revisions are basically acceptable.

Comments on the Quality of English Language

Author Response

Thanks for reviewing our work.

Reviewer 3 Report

Comments and Suggestions for Authors

The manuscript was improved. Please fix the caption of figure 4, which is like a sentence. Please improve the fonts of figures 7 and 8.

Comments on the Quality of English Language

moderate changes are needed

Author Response

Reviewer 3:

Comment 1: The manuscript was improved. Please fix the caption of figure 4, which is like a sentence.

Author’s response:

Thank you for your valuable feedback. The caption of Figure 4 has been revised to enhance the clarity of the figure descriptions and improve readability for the readers.

Changes to the manuscript:

Page 10, the caption of Figure 4, has been revised and highlighted.

Comment 2: Please improve the fonts of figures 7 and 8.

Author’s response:

Thank you for your valuable feedback. I have improved the fonts and descriptive elements in Figures 7 and 8.

Changes to the manuscript:

Page 14, Figure 7, has been revised and highlighted. Page 15, Figure 8, has been revised and highlighted.
