Structured Prompting and Collaborative Multi-Agent Knowledge Distillation for Traffic Video Interpretation and Risk Inference
Abstract
1. Introduction
2. Data Description
3. Methodology
3.1. Overview
3.2. CoT Prompt Design for Role-Aware Video Analysis
3.2.1. Road Scene Understanding
- Time of Day: Distinguishing between daytime and nighttime based on ambient light and sky conditions.
- Road Weather Conditions: Classifying the environment as clear, foggy, rainy, or snowy.
- Pavement Wetness Condition: Assessing the road surface as dry, wet (shiny or moist), flooded (pooled water), or snowy (slush/coverage).
- Vehicle Behavior: Identifying maneuvers such as lane changes, braking, acceleration, turns, or unusual/hazardous actions.
- Traffic Flow and Speed: Estimating traffic smoothness and the general speed level (high, medium, or low).
- Congestion Level: Categorizing congestion as light, moderate, or heavy.
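To make the structure of these scene attributes concrete, the following minimal Python sketch collects them into a single record; the field names and category labels are illustrative assumptions rather than the exact schema used in our prompts.

```python
# Illustrative sketch only: field names and category labels are assumptions,
# not the exact schema produced by the CoT prompt.
from dataclasses import dataclass
from typing import Literal

@dataclass
class SceneUnderstanding:
    time_of_day: Literal["daytime", "nighttime"]
    weather: Literal["clear", "foggy", "rainy", "snowy"]
    pavement_wetness: Literal["dry", "wet", "flooded", "snowy"]
    vehicle_behavior: str  # e.g., "frequent lane changes, sudden braking"
    traffic_flow_speed: Literal["high", "medium", "low"]
    congestion_level: Literal["light", "moderate", "heavy"]

# Example record for a rainy nighttime clip (illustrative values only).
example = SceneUnderstanding(
    time_of_day="nighttime",
    weather="rainy",
    pavement_wetness="wet",
    vehicle_behavior="occasional sudden braking",
    traffic_flow_speed="medium",
    congestion_level="moderate",
)
```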
3.2.2. Risk Reasoning
- Environmental Risk Factors: This component focuses on analyzing the interplay between visibility, weather, and pavement surface conditions—three critical elements in assessing traffic safety. Time of day plays a decisive role in determining visibility, with nighttime or low-light conditions significantly impairing a driver’s ability to detect road hazards, pedestrians, or surface anomalies such as standing water or debris. Weather factors such as rain, fog, or snow further compound these risks by introducing glare, reduced contrast, or obscured features. Pavement wetness, in particular, poses substantial safety concerns by affecting vehicle traction, braking distance, and hydroplaning likelihood. For example, a reflective road surface under overcast skies may suggest recent precipitation, while visible pooling indicates potential flooding. Distinguishing between partially wet and deeply saturated pavement is therefore crucial for anticipating vehicle instability and enabling downstream risk prediction. The CoT prompt guides the model to reason across these dimensions collectively, enabling a comprehensive assessment of how environmental conditions influence roadway safety.
- Vehicle Behavior Risk: This dimension evaluates whether the observed driving patterns suggest cautious or erratic behavior, offering insight into potential latent hazards. Behavioral responses such as sudden braking, abrupt acceleration, or frequent lane changes often reflect driver reactions to perceived risks—such as reduced visibility, surface irregularities, or obstructions not directly captured in the visual field. Importantly, clusters of such evasive maneuvers across multiple vehicles may signal the presence of high-risk zones ahead, including flooded pavement, debris fields, or stalled traffic. By analyzing both the frequency and distribution of these maneuvers, the model can infer emergent risk factors and support preemptive safety reasoning that extends beyond the immediate visual context.
- Traffic Flow Risk: This component assesses the stability and efficiency of traffic flow to identify dynamic risk patterns. Consistent, smooth flow typically indicates low interaction risk, whereas abrupt fluctuations in vehicle speed or spacing may signal unexpected environmental disturbances—such as roadway obstructions, water accumulation, or sudden visibility drops. Such disruptions not only degrade flow efficiency but also elevate the probability of rear-end collisions, particularly under conditions of limited traction or poor visual perception. The CoT prompt enables the model to reason temporally, detecting irregularities in flow continuity and interpreting them as early indicators of potential hazards ahead. This temporal perspective is critical for proactive traffic risk evaluation.
- Overall Safety Risk Level: Providing a low/moderate/high-risk classification with justification.
- Alerts: Warnings or advisories relevant to current road conditions.
- Suggested Safety Speed: A recommended driving speed that reflects current visibility, pavement, and flow characteristics.
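The condensed sketch below illustrates how the risk-reasoning instructions and required outputs listed above could be laid out as a single prompt. The wording, the assumption that the risk branch consumes a textual scene summary, and the helper build_risk_prompt are illustrative only, not the exact CoT prompt issued to o3-mini.

```python
# Illustrative prompt sketch in the spirit of Section 3.2.2; wording and layout
# are assumptions, not the exact CoT prompt used in the paper.
RISK_PROMPT = """You are a traffic safety analyst. Given the scene summary below,
reason step by step about:
1. Environmental risk: how time of day, weather, and pavement wetness jointly
   affect visibility, traction, and hydroplaning likelihood.
2. Vehicle behavior risk: whether maneuvers (sudden braking, abrupt lane changes)
   suggest latent hazards ahead.
3. Traffic flow risk: whether irregular speed or spacing indicates disturbances
   such as obstructions or water accumulation.
Then report:
- Overall safety risk level (low / moderate / high) with justification.
- Alerts relevant to the current road conditions.
- A suggested safety speed consistent with visibility, pavement, and flow.

Scene summary:
{scene_summary}
"""

def build_risk_prompt(scene_summary: str) -> str:
    """Fill the template with a scene description from the captioning branch (assumed input)."""
    return RISK_PROMPT.format(scene_summary=scene_summary)
```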
Prompt Structure for General-Purpose Fine-Tuning
3.3. Collaborative Role-Aware Knowledge Distillation via Caption and Risk Label Generation
3.3.1. Distillation Setup
3.3.2. Unified Label Construction
3.3.3. Distillation Objective
3.4. Training and Evaluation Pipeline
- (1) Role-Aware Knowledge-Enriched Supervision. The first path generates rich, semantically grounded annotations from a pair of expert LVLMs: GPT-4o and o3-mini. Specifically, we sample frames from short video clips and process them through two specialized prompting branches: (i) GPT-4o for fine-grained scene analysis and captioning, and (ii) o3-mini for contextual risk reasoning. Their outputs are merged and organized into structured, knowledge-enriched video annotations, which are used as pseudo-labels to supervise the fine-tuning of a 3B student model (Qwen2.5-VL-3B-Instruct), yielding our VISTA model, which is suitable for deployment in real-world highway safety monitoring.
- (2) Template-Based Evaluation. To rigorously assess the semantic fidelity of VISTA against its LVLM teachers, we construct a second, parallel supervision path. The teacher outputs are reformatted into a unified evaluation template (refer to Figure 10) that reflects the exact linguistic structure expected during inference. We then further fine-tune the same 3B model under this template supervision using identical hyperparameters. This guarantees structural alignment between model outputs and GPT-4o reference generations, enabling precise metric-based comparison.
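The sketch below outlines the role-aware annotation path in (1) at a high level, under the assumptions that frames are captioned by GPT-4o and that the resulting summary (rather than raw frames) is passed to o3-mini for risk reasoning; sample_frames, caption_with_gpt4o, reason_risk_with_o3mini, and write_sft_dataset are hypothetical stand-ins for the actual frame sampler, API calls, and dataset writer.

```python
# High-level sketch of the two-branch annotation path; all helpers are
# hypothetical stubs, only the overall flow follows the text.
import json
from typing import List

def sample_frames(video_path: str, num_frames: int = 8) -> List[str]:
    # Hypothetical stub: in practice this would decode the clip and return frame images.
    return [f"{video_path}#frame{i}" for i in range(num_frames)]

def caption_with_gpt4o(frames: List[str]) -> str:
    # Hypothetical stub for branch (i): fine-grained scene analysis and captioning.
    return "Nighttime, rainy, wet pavement; moderate congestion; occasional sudden braking."

def reason_risk_with_o3mini(scene_caption: str) -> str:
    # Hypothetical stub for branch (ii): contextual risk reasoning.
    # Assumption: the risk branch consumes the textual scene summary from branch (i).
    return "Overall risk: moderate. Alert: reduced traction on wet pavement. Suggested safety speed: 45 mph."

def build_pseudo_label(video_path: str) -> dict:
    """Merge both expert roles into one structured, knowledge-enriched annotation."""
    frames = sample_frames(video_path)
    caption = caption_with_gpt4o(frames)
    risk = reason_risk_with_o3mini(caption)
    return {"video": video_path, "caption": caption, "risk": risk}

def write_sft_dataset(video_paths: List[str], out_path: str) -> None:
    """Serialize merged annotations as pseudo-labels for fine-tuning the 3B student."""
    with open(out_path, "w") as f:
        for path in video_paths:
            f.write(json.dumps(build_pseudo_label(path)) + "\n")
```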
4. Experiments and Results
4.1. Evaluation Metrics
- BLEU-4 [22] evaluates n-gram precision, focusing on exact phrase-level overlap between the predicted caption and the reference. Specifically, BLEU-4 combines modified n-gram precisions up to 4-grams as $\text{BLEU-4} = BP \cdot \exp\left(\sum_{n=1}^{4} w_n \log p_n\right)$, where $p_n$ denotes the modified precision for n-grams (up to 4), $w_n = 1/4$, and $BP$ is the brevity penalty to discourage overly short predictions. In our case, we compute BLEU-4 between each VISTA-generated caption and the corresponding GPT-4o reference over all 200 samples and report the average.
- METEOR [23] balances unigram precision and recall, incorporating stemming, synonymy, and word order through alignment. It is defined as $\text{METEOR} = F_{\text{mean}} \cdot (1 - P)$, where $F_{\text{mean}}$ is the harmonic mean of unigram precision and recall, and $P$ is a penalty based on the fragmentation of matched chunks. This metric tends to correlate better with human judgment, especially for short descriptive text such as video captions.
- ROUGE-L [24] measures sentence-level structural similarity based on the longest common subsequence (LCS) between the predicted and reference token sequences. Let $X$ denote the prediction of length $m$ and $Y$ the reference of length $n$. Define $P_{lcs} = \frac{\text{LCS}(X,Y)}{m}$ and $R_{lcs} = \frac{\text{LCS}(X,Y)}{n}$; then $\text{ROUGE-L} = F_{lcs} = \frac{(1+\beta^{2})\, R_{lcs} P_{lcs}}{R_{lcs} + \beta^{2} P_{lcs}}$, where $\beta$ weights recall relative to precision.
- CIDEr [25] evaluates consensus across multiple references by computing TF-IDF-weighted n-gram similarity. For each caption, it is defined as $\text{CIDEr} = \sum_{n=1}^{N} w_n \cdot \text{sim}_n$, where $w_n$ is the weight for n-grams of order $n$ (typically uniform), and $\text{sim}_n$ is the cosine similarity between the TF-IDF vectors of the candidate and reference n-grams. Since we use single-reference evaluation (GPT-4o output), CIDEr provides insight into descriptive richness and term relevance.
- To facilitate unified comparison, we compute a composite score from all four metrics, defined as $\text{Score} = \frac{100}{4}\left(\text{BLEU-4} + \text{METEOR} + \text{ROUGE-L} + 0.1 \times \text{CIDEr}\right)$, where CIDEr is scaled by 0.1 to normalize its value range. This aggregate formulation captures both structural accuracy and semantic fidelity while accounting for redundancy and term frequency. A similar multi-metric strategy has been adopted in recent literature, such as CityLLaVA [26], where BLEU, METEOR, and ROUGE are combined to evaluate semantic grounding and coherence in urban scene descriptions. In our case, the composite score enables a holistic assessment of model performance, integrating exact n-gram matches (BLEU), semantic fluency and recall (METEOR), sentence-level structure (ROUGE-L), and consensus with human-like outputs (CIDEr).
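For reference, the composite score can be reproduced with the small helper below; the example values are taken from the VISTA row of the results table in Section 4.3.

```python
def composite_score(bleu4: float, meteor: float, rouge_l: float, cider: float) -> float:
    """Average of BLEU-4, METEOR, ROUGE-L, and 0.1 x CIDEr, scaled to a 0-100 range."""
    return 100.0 * (bleu4 + meteor + rouge_l + 0.1 * cider) / 4.0

# VISTA row from the results table: 0.3289, 0.5634, 0.4895, 0.7014 -> ~36.30
print(round(composite_score(0.3289, 0.5634, 0.4895, 0.7014), 2))
```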
4.2. Implementation Details
- Visual Encoder: The CLIP-style backbone is updated to specialize in low-resolution traffic footage under diverse environmental conditions.
- Language Decoder: Tuning the LLM decoder enables the model to emulate expert-style reasoning and reporting formats derived from large teacher models (GPT-4o and o3-mini).
- Cross-Modal MLP Fusion: This module aligns vision and language representations. Fine-tuning it improves grounding of visual semantics into structured CoT outputs.
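A minimal sketch of how these three component groups could be selectively unfrozen during fine-tuning is given below; the parameter-name substrings are assumptions about the underlying implementation and would need to match the actual Qwen2.5-VL module names.

```python
# Minimal sketch of per-module fine-tuning control; substring keys are assumptions
# about parameter naming, not the exact names in the model implementation.
import torch.nn as nn

def set_trainable(model: nn.Module, tune_vision: bool, tune_llm: bool, tune_mlp: bool) -> None:
    """Unfreeze only the selected component groups; all other parameters stay frozen."""
    groups = {
        "visual": tune_vision,        # CLIP-style visual encoder
        "language_model": tune_llm,   # LLM decoder
        "merger": tune_mlp,           # cross-modal MLP fusion/projection
    }
    for name, param in model.named_parameters():
        param.requires_grad = any(flag and key in name for key, flag in groups.items())

# Example: reproduce the "3B mlp + llm" variant (vision encoder kept frozen).
# set_trainable(model, tune_vision=False, tune_llm=True, tune_mlp=True)
```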
4.3. Results
| Model | BLEU-4 | METEOR | ROUGE-L | CIDEr | Score |
|---|---|---|---|---|---|
| 3B original | 0.2517 | 0.5396 | 0.3902 | 0.2984 | 30.28 |
| 3B mlp | 0.2581 | 0.5287 | 0.4040 | 0.3363 | 30.61 |
| 3B mlp + vision | 0.2722 | 0.5281 | 0.4346 | 0.2413 | 31.48 |
| 3B mlp + llm | 0.3269 | 0.5691 | 0.4862 | 0.6712 | 36.23 |
| 3B mlp + llm + vision (VISTA) | 0.3289 | 0.5634 | 0.4895 | 0.7014 | 36.30 |
5. Practical Implications
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
| ATIS | Advanced Traveler Information System |
| ATMS | Advanced Traffic Management System |
| BLEU | Bilingual Evaluation Understudy |
| CIDEr | Consensus-based Image Description Evaluation |
| CoT | Chain-of-Thought |
| FPS | Frames Per Second |
| ITS | Intelligent Transportation Systems |
| LCS | Longest Common Subsequence |
| LLM | Large Language Model |
| MLP | Multilayer Perceptron |
| RWIS | Road Weather Information System |
| SFT | Supervised Fine-Tuning |
| VISTA | Vision-Informed Safety and Transportation Assessment |
| VLM | Vision–Language Model |
| VQA | Visual Question Answering |
References
- Rivera, J.; Lin, K.; Adeli, E. Scenario Understanding of Traffic Scenes Through Large Visual Language Models. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), Tucson, AZ, USA, 26 February–6 March 2025. [Google Scholar]
- Zhang, Y.; Liu, L.; Zhang, H.; Wang, X.; Li, M. Semantic Understanding of Traffic Scenes with Large Vision-Language Models. arXiv 2024, arXiv:2406.20092. [Google Scholar]
- Zheng, O.; Abdel-Aty, M.; Wang, D.; Wang, Z.; Ding, S. ChatGPT Is on the Horizon: Could a Large Language Model Be All We Need for Intelligent Transportation? arXiv 2023, arXiv:2303.05382. [Google Scholar]
- Theofilatos, A.; Yannis, G. A review of the effect of traffic and weather characteristics on road safety. Accid. Anal. Prev. 2014, 72, 244–256. [Google Scholar] [CrossRef] [PubMed]
- Zhang, J.; Huang, J.; Jin, S.; Lu, S. Vision-Language Models for Vision Tasks: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 5625–5645. [Google Scholar] [CrossRef]
- Li, Z.; Wu, X.; Du, H.; Nghiem, H.; Shi, G. Benchmark Evaluations, Applications, and Challenges of Large Vision Language Models: A Survey. arXiv 2025, arXiv:2501.02189. [Google Scholar] [CrossRef]
- Xu, H.; Jin, L.; Wang, X.; Wang, L.; Liu, C. A Survey on Multi-Agent Foundation Models: Progress and Challenges. arXiv 2024, arXiv:2404.20061. [Google Scholar]
- Yang, C.; Zhu, Y.; Lu, W.; Wang, Y.; Chen, Q.; Gao, C.; Yan, B.; Chen, Y. Survey on Knowledge Distillation for Large Language Models: Methods, Evaluation, and Application. ACM Trans. Intell. Syst. Technol. 2024. [Google Scholar] [CrossRef]
- Chen, L.; Wei, X.; Li, J.; Dong, X.; Zhang, P.; Zang, Y.; Chen, Z.; Duan, H.; Tang, Z.; Yuan, L.; et al. ShareGPT4Video: Improving Video Understanding and Generation with Better Captions. Adv. Neural Inf. Process. Syst. 2024, 37, 19472–19495. [Google Scholar]
- Cao, X.; Zhou, T.; Ma, Y.; Ye, W.; Cui, C.; Tang, K.; Cao, Z.; Liang, K.; Wang, Z.; Rehg, J.M.; et al. MAPLM: A Real-World Large-Scale Vision-Language Benchmark for Map and Traffic Scene Understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024. [Google Scholar]
- Lohner, A.; Compagno, F.; Francis, J.; Oltramari, A. Enhancing Vision-Language Models with Scene Graphs for Traffic Accident Understanding. In Proceedings of the 2024 IEEE International Automated Vehicle Validation Conference (IAVVC), Pittsburgh, PA, USA, 22–23 October 2024. [Google Scholar]
- Ashqar, H.I.; Alhadidi, T.I.; Elhenawy, M.; Khanfar, N.O. Leveraging Multimodal Large Language Models (MLLMs) for Enhanced Object Detection and Scene Understanding in Thermal Images for Autonomous Driving Systems. Automation 2024, 5, 508–526. [Google Scholar] [CrossRef]
- Shriram, S.; Perisetla, S.; Keskar, A.; Krishnaswamy, H.; Westerhof Bossen, T.E.; Møgelmose, A.; Greer, R. Towards a Multi-Agent Vision-Language System for Zero-Shot Novel Hazardous Object Detection for Autonomous Driving Safety. In Proceedings of the 2025 IEEE 21st International Conference on Automation Science and Engineering (CASE), Los Angeles, CA, USA, 17–21 August 2025. [Google Scholar]
- Kugo, N.; Li, X.; Li, Z.; Gupta, A.; Khatua, A.; Jain, N.; Patel, C.; Kyuragi, Y.; Ishii, Y.; Tanabiki, M.; et al. VideoMultiAgents: A Multi-Agent Framework for Video Question Answering. arXiv 2025, arXiv:2504.20091. [Google Scholar]
- Jiang, B.; Zhuang, Z.; Shivakumar, S.S.; Roth, D.; Taylor, C.J. Multi-Agent VQA: Exploring Multi-Agent Foundation Models in Zero-Shot Visual Question Answering. arXiv 2024, arXiv:2403.1478. [Google Scholar]
- Bhooshan, R.S.; Suresh, K. A Multimodal Framework for Video Caption Generation. IEEE Access 2022, 10, 92166–92176. [Google Scholar] [CrossRef]
- Yang, Y. Vision-Informed Safety and Transportation Assessment (VISTA). 2025. Available online: https://github.com/winstonyang117/Vision-informed-Safety-and-Transportation-Assessment (accessed on 8 August 2025).
- OpenAI. GPT-4 Technical Report; Technical Report; OpenAI: San Francisco, CA, USA, 2023. [Google Scholar]
- Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Xia, F.; Chi, E.; Le, Q.; Zhou, D. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Adv. Neural Inf. Process. Syst. 2022, 35, 24824–24837. [Google Scholar]
- OpenAI. ChatGPT (o3-mini). 31 January 2025. Available online: https://openai.com/index/openai-o3-mini/ (accessed on 4 November 2025).
- Bai, S.; Chen, K.; Liu, X.; Wang, J.; Ge, W.; Song, S.; Dang, K.; Wang, P.; Wang, S.; Tang, J.; et al. Qwen2.5-VL Technical Report. arXiv 2025, arXiv:2502.13923. [Google Scholar] [CrossRef]
- Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics (ACL), Association for Computational Linguistics, Philadelphia, PA, USA, 6–12 July 2002; pp. 311–318. [Google Scholar]
- Banerjee, S.; Lavie, A. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Association for Computational Linguistics, Ann Arbor, MI, USA, 23 June 2005; pp. 65–72. [Google Scholar]
- Lin, C.Y. ROUGE: A Package for Automatic Evaluation of Summaries. In Proceedings of the Workshop on Text Summarization Branches Out (WAS), Barcelona, Spain, 25–26 July 2004. [Google Scholar]
- Vedantam, R.; Lawrence Zitnick, C.; Parikh, D. CIDEr: Consensus-Based Image Description Evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, Boston, MA, USA, 7–12 June 2015; pp. 4566–4575. [Google Scholar]
- Deng, C.; Li, Y.; Jiang, H.; Li, W.; Zhang, Y.; Zhao, H.; Zhou, P. CityLLaVA: Efficient Fine-Tuning for Vision-Language Models in City Scenario. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–18 June 2024. [Google Scholar]
- Wang, L.; Chen, S.; Jiang, L.; Pan, S.; Cai, R.; Yang, S.; Yang, F. Parameter-efficient fine-tuning in large language models: A survey of methodologies. Artif. Intell. Rev. 2025, 58, 227. [Google Scholar] [CrossRef]
- Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Chen, W. LoRA: Low-Rank Adaptation of Large Language Models. In Proceedings of the Tenth International Conference on Learning Representations (ICLR), Virtual Conference, 25–29 April 2022. [Google Scholar]