Review Reports
Bing He 1, Wei He 1,*, Qing Chang 2, et al.
Reviewer 1: Anonymous
Reviewer 2: Artur Budzyński
Reviewer 3: Ali Younes
Reviewer 4: Gen Li
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
Comments for the Authors
Keywords, Title, and Abstract
- Clear and succinct, appropriately capturing the contributions of the study.
- The keywords have been carefully selected and are pertinent.
Introduction
- A well-defined problem statement and strong motivation.
- Excellent defense of the use of CoT and MLLMs in traffic analysis.
- Consider adding a succinct sentence outlining the novelty of this work compared with other MLLM applications in transportation.
Related Works
- Thorough discussion of LLMs and multimodal fusion in transportation.
- Could more effectively draw attention to the research gap that this paper addresses, such as the dearth of effective fine-tuning techniques for MLLMs used in remote sensing.
Materials and Methods
- The dataset is meticulously constructed and extensively documented.
- The multi-stage training approach is eloquently described and supported.
- Think about elaborating on the reasons behind the selection of just three MLLMs as well as their unique architectural distinctions.
- Indicate the time frame for which the Google and OSM data were used in lines 189–195 on page 5.
Experimental Setup
- Training details are clearly reported and reproducible.
- The metrics are carefully selected and suitable for every task.
- For improved reproducibility, mention the GPU model and training time in lines 379–385 on page 10.
Results
- With adequate quantitative and qualitative analysis, the results are presented in an understandable manner.
- It is convincingly demonstrated that strong improvements occur after fine-tuning.
- Include standard deviations for accuracy/F1 in Table 1 on page 12 to illustrate variability.
- Add a succinct explanation of the confusion matrices to the caption of Figure 7 on page 15.
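As an illustration of the standard-deviation request above, the following is a minimal Python sketch of how a "mean ± standard deviation" table cell could be produced from repeated runs. It assumes each configuration is trained several times with different random seeds; the function name `format_cell` and the scores shown are hypothetical placeholders, not values from the manuscript.

```python
from statistics import mean, stdev


def format_cell(scores, digits=3):
    """Format per-seed scores as 'mean ± sample standard deviation'.

    `scores` holds one metric value (e.g. accuracy or F1) per training run.
    At least two runs are needed for a sample standard deviation.
    """
    if len(scores) < 2:
        raise ValueError("need at least two runs to report variability")
    return f"{mean(scores):.{digits}f} ± {stdev(scores):.{digits}f}"


# Placeholder scores for three hypothetical seeds (not results from the paper).
f1_per_seed = [0.90, 0.92, 0.91]
print(format_cell(f1_per_seed))
```

Reporting the number of runs alongside each cell would let readers judge how reliable the spread estimate is.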
Discussion
- Insightful comparison of parameter scale and model architecture.
- The importance of CoT in interpretability is well-emphasized.
- Discuss the reasons behind Deepseek-VL*(1.3B)'s decline in METEOR on page 22, lines 692–701.
- Explain why predicting duration is more difficult than predicting severity in lines 802–808 of page 24.
Limitations and Future Work
- Limitations are properly recognized.
- Prospects for the future are encouraging and pertinent.
- Consider noting, as a limitation, that the approach may not transfer directly to non-US regions or to data from different sensors.
Kind regards,
Reviewer
Author Response
We sincerely thank you for taking the time and effort to review our manuscript. We believe that your feedback is very constructive and will help improve the quality and clarity of our work. We have carefully considered each of your suggestions and made corresponding revisions to the manuscript. We have attached the specific modification details in the attachment.
Author Response File:
Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
The manuscript presents an innovative and timely approach that combines satellite imagery, CTADS crash data and domain-adapted multimodal language–vision models. The idea of integrating remote sensing with MLLM-based reasoning is genuinely original and has strong potential for advancing data-driven traffic safety analysis. The paper addresses a relevant and growing research direction, and with further methodological refinement it could become a valuable contribution to the field.
- Major
- From the manuscript I understand that both the scene descriptions and the CoT reasoning chains are generated synthetically by language models, and that the CoT is constructed based on the known severity and duration labels. This raises the question of how closely such a procedure reflects real crash narratives and the actual reasoning processes used in traffic accident analysis.
- In the experimental section, the authors only compare different variants of multimodal vision–language models (before and after fine-tuning), but there is no reference to simpler, classical methods that could serve as an important baseline. This raises the question of what the actual gain from using large multimodal models is, and whether it is really worth the added complexity.
- I could not find a detailed description of how the data were split into training, validation and test sets. For spatial data, where satellite images and crash locations are likely to be strongly spatially correlated, the splitting strategy is crucial for a realistic assessment of generalization. This raises the question of whether the split was purely random or spatial/temporal, and how the authors ensured that very similar scenes from the same road segment or intersection do not appear simultaneously in the training and test sets. Without this information, it is difficult to judge to what extent the reported very high accuracy and F1 scores reflect true generalization ability, and to what extent they may be influenced by unintended information leakage between the splits.
- The authors repeatedly claim that the proposed method improves interpretability and can support decision-making in traffic accident analysis, but, in my view, the manuscript does not present concrete empirical evidence to substantiate these claims.
- The manuscript mentions rule-based validation and manual inspection of approximately 1% of the synthetic descriptions and CoT samples, but does not provide details on sampling strategy, annotation criteria or error typology. A brief description of this process would help readers assess the reliability of the synthetic supervision signals.
- Minor
- The current 3D stacked bar design in Figure 14 makes the results difficult to interpret. I would suggest replacing it with a simpler 2D bar chart, using separate (non-stacked) bars for each variable and adding numerical values above the bars. This would substantially improve the clarity and readability of the figure.
- The description of the LoRA configuration could benefit from a brief clarification regarding which layers were adapted and approximately how many parameters were trainable in each model variation.
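To make the data-splitting concern above concrete, here is a minimal Python sketch of one possible spatially grouped split. It is not the authors' procedure: the record fields `lat` and `lon`, the grid size `cell_deg`, and the split fractions are all assumptions chosen for illustration. Every crash is assigned to a grid cell, and whole cells (not individual samples) are assigned to train, validation or test, so two samples from the same cell cannot end up on opposite sides of the split.

```python
import hashlib
import math


def grid_cell(lat, lon, cell_deg=0.01):
    """Map a coordinate to a grid-cell id (0.01 deg is roughly 1 km in latitude)."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))


def assign_split(cell, test_frac=0.2, val_frac=0.1):
    """Deterministically map a grid cell to 'train', 'val' or 'test' via a hash."""
    digest = hashlib.sha256(f"{cell[0]}_{cell[1]}".encode()).hexdigest()
    u = int(digest[:8], 16) / 0x100000000  # pseudo-uniform value in [0, 1)
    if u < test_frac:
        return "test"
    if u < test_frac + val_frac:
        return "val"
    return "train"


def spatial_split(records, cell_deg=0.01, test_frac=0.2, val_frac=0.1):
    """Split records so that all samples in one grid cell share a split."""
    splits = {"train": [], "val": [], "test": []}
    for rec in records:
        cell = grid_cell(rec["lat"], rec["lon"], cell_deg)
        splits[assign_split(cell, test_frac, val_frac)].append(rec)
    return splits


# Hypothetical records; the first two fall in the same grid cell.
records = [
    {"id": 1, "lat": 34.0512, "lon": -118.2431},
    {"id": 2, "lat": 34.0518, "lon": -118.2439},
    {"id": 3, "lat": 40.7128, "lon": -74.0060},
]
splits = spatial_split(records)
```

A grid of this kind only limits leakage: scenes just across a cell boundary can still overlap, so a buffer between cells or a split by road segment or by time period may be preferable in practice.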
Comments for author File:
Comments.pdf
Author Response
We sincerely thank you for taking the time and effort to review our manuscript. We believe that your feedback is very constructive and will help improve the quality and clarity of our work. We have carefully considered each of your suggestions and made corresponding revisions to the manuscript. We have attached the specific modification details in the attachment.
Author Response File:
Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
This research addresses an important and timely topic, as traditional traffic accident analysis struggles to incorporate high-dimensional and heterogeneous data such as remote sensing imagery. The study’s proposed multimodal framework represents a valuable contribution toward enhancing the interpretability and intelligence of traffic accident analysis. I would like to acknowledge and thank the authors for their significant effort in developing and evaluating this comprehensive approach.
However, the manuscript still requires further improvements to strengthen its overall quality and scientific rigor.
- The abstract lacks detailed methodology. When I read it, I am not able to fully understand your methodological process. Adding one or two clarifying sentences would greatly improve the abstract’s clarity.
- The introduction is well written; however, it lacks real-world examples that illustrate the nature of the damage and the consequences of traffic accidents in terms of loss of life, property damage, and economic impact.
- Although the objective of the study is clearly stated, the research gap is not well focused. A more explicit and structured paragraph outlining the research gap is needed.
- In the Methodology section, the writing is unclear in relation to Figure 1, titled 'Schematic diagram of the proposed framework'.
- The authors should describe the data used by including it clearly in the methodology section.
- The discussion section is weak and needs to compare your results with previous studies, provide clear interpretations of the findings, and explain why the chosen methodology was used.
- Relevant references should be added, such as 'Integrating InSAR coherence and air pollution detection satellites to study the impact of war on air quality'.
Author Response
We sincerely thank you for taking the time and effort to review our manuscript. We believe that your feedback is very constructive and will help improve the quality and clarity of our work. We have carefully considered each of your suggestions and made corresponding revisions to the manuscript. We have attached the specific modification details in the attachment.
Author Response File:
Author Response.pdf
Reviewer 4 Report
Comments and Suggestions for Authors
This is an excellent and timely paper that addresses a significant gap in road traffic accident analysis. The suggested revisions below are intended to further enhance its clarity, impact, and rigor.
- The creation of the multimodal dataset, especially the CoT part via “reverse causal reasoning,” is a standout contribution. In the Introduction, the authors could more explicitly frame this as a key contribution. Furthermore, providing more detail in Section 3.1 or an Appendix on the data construction process would be beneficial.
- The authors are encouraged to add a short, dedicated subsection to the Discussion. Here, they could explicitly outline how a traffic safety engineer or emergency response manager would use the model’s output. For details, refer to 10.1016/j.aap.2025.108282.
- In Section 3.1.2 (lines 229-232), it is mentioned that the severity levels 1-4 emphasize the “effect of accidents on traffic operations”. This is an interesting and non-standard definition. A brief clarification on how the CTADS dataset defines this would be helpful for context.
- The context for Table 5 is slightly unclear upon first reading. Please clarify this in the table’s caption and the preceding paragraph to make the trade-off between pure classification and joint reasoning-classification more explicit.
Author Response
We sincerely thank you for taking the time and effort to review our manuscript. We believe that your feedback is very constructive and will help improve the quality and clarity of our work. We have carefully considered each of your suggestions and made corresponding revisions to the manuscript. We have attached the specific modification details in the attachment.
Author Response File:
Author Response.pdf
Round 2
Reviewer 2 Report
Comments and Suggestions for Authors
I would like to thank the authors for carefully addressing all of my comments and suggestions. The revisions substantially improve the methodological clarity and overall quality of the manuscript.
Reviewer 3 Report
Comments and Suggestions for Authors
The authors have thoroughly addressed all the comments and revised the manuscript accordingly. I appreciate their efforts and recommend the paper for publication.
Reviewer 4 Report
Comments and Suggestions for Authors
All my comments have been addressed.