EAR-CCPM-Net: A Cross-Modal Collaborative Perception Network for Early Accident Risk Prediction
Abstract
1. Introduction
2. Literature Review
3. Research Methodology
3.1. Data Collection and Preprocessing
3.2. Architecture Design and Rationale of the Proposed CCPM
3.2.1. Fusion Module
3.2.2. Single Enhanced Module
3.2.3. Enhanced Cross-Hypothesis Interaction Module
3.3. Experiment Setup
3.3.1. Input Modality and Preprocessing
3.3.2. Training and Testing Settings
3.3.3. Performance Measurement
4. Results and Discussion
4.1. Training Results
4.2. Baseline Performance Comparison
4.3. Ablation Variants of EAR-CCPM-Net
4.4. EAR-CCPM-Net Performance Comparison
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
1. World Health Organization. Save Lives: A Road Safety Technical Package. 2017. Available online: https://apps.who.int/iris/bitstream/handle/10665/255199/9789241511704-eng.pdf (accessed on 6 May 2018).
2. World Health Organization. Global Status Report on Road Safety 2018; World Health Organization: Geneva, Switzerland, 2019.
3. Heydari, S.; Hickford, A.; McIlroy, R.; Turner, J.; Bachani, A.M. Road Safety in Low-Income Countries: State of Knowledge and Future Directions. Sustainability 2019, 11, 6249.
4. Anik, B.T.H.; Islam, Z.; Abdel-Aty, M. A Time-Embedded Attention-Based Transformer for Crash Likelihood Prediction at Intersections Using Connected Vehicle Data. Transp. Res. Part C Emerg. Technol. 2024, 169, 104831.
5. Siddique, I. Advanced Analytics for Predicting Traffic Collision Severity Assessment. World J. Adv. Res. Rev. 2024, 21, 30574.
6. Chen, J.; Tao, W.; Jing, Z.; Wang, P.; Jin, Y. Traffic Accident Duration Prediction Using Multi-Mode Data and Ensemble Deep Learning. Heliyon 2024, 10, e25957.
7. Elsayed, H.A.G.; Syed, L. An Automatic Early Risk Classification of Hard Coronary Heart Diseases Using Framingham Scoring Model. In Proceedings of the Second International Conference on Internet of Things, Data and Cloud Computing, Cambridge, UK, 22–23 March 2017; pp. 1–8.
8. Ma, J.; Jia, C.; Yang, X.; Cheng, X.; Li, W.; Zhang, C. A Data-Driven Approach for Collision Risk Early Warning in Vessel Encounter Situations Using Attention-BiLSTM. IEEE Access 2020, 8, 188771–188783.
9. Fang, J.; Li, L.L.; Yang, K.; Zheng, Z.; Xue, J.; Chua, T.S. Cognitive Accident Prediction in Driving Scenes: A Multimodality Benchmark. arXiv 2022, arXiv:2212.09381.
10. Fang, S.; Liu, J.; Ding, M.; Cui, Y.; Lv, C.; Hang, P.; Sun, J. Towards Interactive and Learnable Cooperative Driving Automation: A Large Language Model-Driven Decision-Making Framework. arXiv 2024, arXiv:2409.12812.
11. Jamshidi, H.; Jazani, R.K.; Khani Jeihooni, A.; Alibabaei, A.; Alamdari, S.; Kalyani, M.N. Facilitators and Barriers to Collaboration Between Pre-Hospital Emergency and Emergency Department in Traffic Accidents: A Qualitative Study. BMC Emerg. Med. 2023, 23, 58.
12. Adewopo, V.A.; Elsayed, N.; ElSayed, Z.; Ozer, M.; Abdelgawad, A.; Bayoumi, M. A Review on Action Recognition for Accident Detection in Smart City Transportation Systems. J. Electr. Syst. Inf. Technol. 2023, 10, 57.
13. Liu, C.; Xiao, Z.; Long, W.; Li, T.; Jiang, H.; Li, K. Vehicle Trajectory Data Processing, Analytics, and Applications: A Survey. ACM Comput. Surv. 2025, 57, 1–36.
14. Wang, L.; Xiao, M.; Lv, J.; Liu, J. Analysis of Influencing Factors of Traffic Accidents on Urban Ring Road Based on the SVM Model Optimized by Bayesian Method. PLoS ONE 2024, 19, e0310044.
15. Khan, S.W.; Hafeez, Q.; Khalid, M.I.; Alroobaea, R.; Hussain, S.; Iqbal, J.; Almotiri, J.; Ullah, S.S. Anomaly Detection in Traffic Surveillance Videos Using Deep Learning. Sensors 2022, 22, 6563.
16. Hasan, F.; Huang, H. MALS-Net: A Multi-Head Attention-Based LSTM Sequence-to-Sequence Network for Socio-Temporal Interaction Modelling and Trajectory Prediction. Sensors 2023, 23, 530.
17. Leon, F.; Gavrilescu, M. A Review of Tracking and Trajectory Prediction Methods for Autonomous Driving. Mathematics 2021, 9, 660.
18. Yang, D.; Huang, S.; Xu, Z.; Li, Z.; Wang, S.; Li, M.; Wang, Y.; Liu, Y.; Yang, K.; Chen, Z.; et al. AIDE: A Vision-Driven Multi-View, Multimodal, Multi-Tasking Dataset for Assistive Driving Perception. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 20459–20470.
19. Tang, Q.; Liang, J.; Zhu, F. A Comparative Review on Multimodal Sensors Fusion Based on Deep Learning. Signal Process. 2023, 213, 109165.
20. Zheng, Y.; Xu, Z.; Wang, X. The Fusion of Deep Learning and Fuzzy Systems: A State-of-the-Art Survey. IEEE Trans. Fuzzy Syst. 2021, 30, 2783–2799.
21. Marchetti, F.; Mordan, T.; Becattini, F.; Seidenari, L.; Bimbo, A.D.; Alahi, A. CrossFeat: Semantic Cross-Modal Attention for Pedestrian Behavior Forecasting. IEEE Trans. Intell. Veh. 2024, 1–10, early access.
22. He, R.; Zhang, C.; Xiao, Y.; Lu, X.; Zhang, S.; Liu, Y. Deep Spatio-Temporal 3D Dilated Dense Neural Network for Traffic Flow Prediction. Expert Syst. Appl. 2024, 237, 121394.
23. Liu, Z.; Yang, N.; Wang, Y.; Li, Y.; Zhao, X.; Wang, F.Y. Enhancing Traffic Object Detection in Variable Illumination with RGB-Event Fusion. arXiv 2023, arXiv:2311.00436.
24. Chitta, K.; Prakash, A.; Jaeger, B.; Yu, Z.; Renz, K.; Geiger, A. TransFuser: Imitation with Transformer-Based Sensor Fusion for Autonomous Driving. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 12878–12895.
25. Tan, H.; Bansal, M. LXMERT: Learning Cross-Modality Encoder Representations from Transformers. arXiv 2019, arXiv:1908.07490.
26. Lu, J.; Batra, D.; Parikh, D.; Lee, S. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. arXiv 2019, arXiv:1908.02265.
27. Chen, Y.C.; Li, L.; Yu, L.; El Kholy, A.; Ahmed, F.; Gan, Z.; Cheng, Y.; Liu, J. UNITER: Universal Image-Text Representation Learning. In European Conference on Computer Vision; Springer International Publishing: Cham, Switzerland, 2020; pp. 104–120.
28. Zadeh, A.B.; Liang, P.P.; Poria, S.; Cambria, E.; Morency, L.P. Multimodal Language Analysis in the Wild: CMU-MOSEI Dataset and Interpretable Dynamic Fusion Graph. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, 15–20 July 2018; pp. 2236–2246.
29. Tsai, Y.H.H.; Bai, S.; Liang, P.P.; Kolter, J.Z.; Morency, L.P.; Salakhutdinov, R. Multimodal Transformer for Unaligned Multimodal Language Sequences. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; p. 6558.
30. Rahman, W.; Hasan, M.K.; Lee, S.; Zadeh, A.; Mao, C.; Morency, L.P.; Hoque, E. Integrating Multimodal Information in Large Pretrained Transformers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; p. 2359.
31. Li, W.; Liu, H.; Tang, H.; Wang, P.; Van Gool, L. MHFormer: Multi-Hypothesis Transformer for 3D Human Pose Estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 13147–13156.
32. Dhanasekaran, S.; Gopal, D.; Logeshwaran, J.; Ramya, N.; Salau, A.O. Multi-Model Traffic Prediction in Smart Cities Using Graph Neural Networks and Transformer-Based Multi-Source Visual Fusion for Intelligent Transportation Management. Int. J. Intell. Transp. Syst. Res. 2024, 22, 518–541.
33. Li, L.; Dou, Y.; Zhou, J. Traffic Accident Detection Based on Multimodal Knowledge Graphs. In Proceedings of the 2023 5th International Conference on Robotics, Intelligent Control and Artificial Intelligence (RICAI), Hangzhou, China, 1–3 December 2023; pp. 644–647.
34. Zhang, Y.; Tiwari, P.; Zheng, Q.; El Saddik, A.; Hossain, M.S. A Multimodal Coupled Graph Attention Network for Joint Traffic Event Detection and Sentiment Classification. IEEE Trans. Intell. Transp. Syst. 2022, 24, 8542–8554.
35. Ektefaie, Y.; Dasoulas, G.; Noori, A.; Farhat, M.; Zitnik, M. Multimodal Learning with Graphs. Nat. Mach. Intell. 2023, 5, 340–350.
36. Kong, X.; Xing, W.; Wei, X.; Bao, P.; Zhang, J.; Lu, W. STGAT: Spatial-Temporal Graph Attention Networks for Traffic Flow Prediction. IEEE Access 2020, 8, 134363–134372.
37. Zhang, Y.; Dong, X.; Shang, L.; Zhang, D.; Wang, D. A Multimodal Graph Neural Network Approach to Traffic Risk Prediction in Smart Urban Sensing. In Proceedings of the 2020 17th Annual IEEE International Conference on Sensing, Communication, and Networking (SECON), Como, Italy, 22–25 June 2020; pp. 1–9.
38. Ulutan, O.; Iftekhar, A.S.M.; Manjunath, B.S. VSGNet: Spatial Attention Network for Detecting Human Object Interactions Using Graph Convolutions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 13617–13626.
39. Liu, A.A.; Tian, H.; Xu, N.; Nie, W.; Zhang, Y.; Kankanhalli, M. Toward Region-Aware Attention Learning for Scene Graph Generation. IEEE Trans. Neural Netw. Learn. Syst. 2021, 33, 7655–7666.
40. Sultana, S.; Ahmed, B. Robust Nighttime Road Lane Line Detection Using Bilateral Filter and SAGC Under Challenging Conditions. In Proceedings of the 2021 IEEE 13th International Conference on Computer Research and Development (ICCRD), Beijing, China, 5–7 January 2021; pp. 137–143.
41. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Volume 30.
42. Yao, H.; Wang, L.; Cai, C.; Wang, W.; Zhang, Z.; Shang, X. Language Conditioned Multi-Scale Visual Attention Networks for Visual Grounding. Image Vis. Comput. 2024, 150, 105242.
43. Kullback, S.; Leibler, R.A. On Information and Sufficiency. Ann. Math. Stat. 1951, 22, 79–86.
44. Pearson, K. VII. Note on Regression and Inheritance in the Case of Two Parents. Proc. R. Soc. Lond. 1895, 58, 240–242.
45. Judd, T.; Ehinger, K.; Durand, F.; Torralba, A. Learning to Predict Where Humans Look. In Proceedings of the 2009 IEEE 12th International Conference on Computer Vision, Kyoto, Japan, 27 September–4 October 2009; pp. 2106–2113.
46. Borji, A.; Sihite, D.N.; Itti, L. Quantitative Analysis of Human-Model Agreement in Visual Saliency Modeling: A Comparative Study. IEEE Trans. Image Process. 2012, 22, 55–69.
47. Chan, F.H.; Chen, Y.T.; Xiang, Y.; Sun, M. Anticipating Accidents in Dashcam Videos. In Proceedings of the Computer Vision–ACCV 2016: 13th Asian Conference on Computer Vision, Taipei, Taiwan, 20–24 November 2016; Revised Selected Papers, Part IV; Springer International Publishing: Cham, Switzerland, 2017; pp. 136–153.
48. Suzuki, T.; Kataoka, H.; Aoki, Y.; Satoh, Y. Anticipating Traffic Accidents with Adaptive Loss and Large-Scale Incident DB. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 3521–3529.
49. Bao, W.; Yu, Q.; Kong, Y. Uncertainty-Based Traffic Accident Anticipation with Spatio-Temporal Relational Learning. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 2682–2690.
50. Bao, W.; Yu, Q.; Kong, Y. DRIVE: Deep Reinforced Accident Anticipation with Visual Explanation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 7619–7628.
51. Xia, Y.; Zhang, D.; Kim, J.; Nakayama, K.; Zipser, K.; Whitney, D. Predicting Driver Attention in Critical Situations. In Proceedings of the Computer Vision–ACCV 2018: 14th Asian Conference on Computer Vision, Perth, Australia, 2–6 December 2018; Revised Selected Papers, Part V; Springer International Publishing: Cham, Switzerland, 2019; pp. 658–674.
52. Palazzi, A.; Abati, D.; Solera, F.; Cucchiara, R. Predicting the Driver's Focus of Attention: The DR(eye)VE Project. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 1720–1733.
53. Wang, W.; Shen, J.; Xie, J.; Cheng, M.M.; Ling, H.; Borji, A. Revisiting Video Saliency Prediction in the Deep Learning Era. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 220–237.
54. Fang, J.; Yan, D.; Qiao, J.; Xue, J.; Yu, H. DADA: Driver Attention Prediction in Driving Accident Scenarios. IEEE Trans. Intell. Transp. Syst. 2021, 23, 4959–4971.
MINI-Train-Test split (positive/negative clips obtained by sampling from the raw videos):

Split | Raw Videos | Positive | Negative
---|---|---|---
Train | 512 | 1778 | 1534
Test | 168 | 608 | 207
Baselines | AUC ↑ | AP ↑ | TTA0.5 (s) ↑ | mTTA (s) ↑ |
---|---|---|---|---
DSA-RNN [47] | 0.47 | - | 3.095 | - |
AdaLEA [48] | 0.55 | - | 3.890 | - |
UncertaintyTA [49] | 0.60 | - | 3.849 | - |
DRIVE [50] | 0.69 | 0.72 | 3.657 | 4.295 |
CAP [9] | 0.81 | 0.73 | 3.927 | 4.209 |
EAR-CCPM-Net (Proposed) | 0.85 | 0.75 | 4.225 | 4.672 |
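For readers reproducing these columns, the sketch below shows one common way the anticipation metrics are computed from per-frame risk scores: TTA0.5 is usually taken as the time-to-accident at a fixed 0.5 alarm threshold, and mTTA as the average over a threshold sweep. This is a minimal illustration under those conventions, not the authors' evaluation code; the function names, 10 fps frame rate, and threshold grid are assumptions.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def anticipation_metrics(scores, labels, accident_frame, fps=10.0, thresh=0.5):
    """Frame-level AUC/AP plus time-to-accident for one positive video.

    scores          -- per-frame accident risk in [0, 1]
    labels          -- per-frame ground truth (1 once the accident unfolds, else 0)
    accident_frame  -- index of the annotated accident onset
    """
    # AUC and AP over all frames (labels must contain both classes)
    auc = roc_auc_score(labels, scores)
    ap = average_precision_score(labels, scores)

    # TTA at a fixed threshold: seconds between the first frame whose
    # risk exceeds the threshold and the annotated accident onset.
    alarms = np.flatnonzero(np.asarray(scores) >= thresh)
    first_alarm = alarms[0] if alarms.size else accident_frame
    tta = max(accident_frame - first_alarm, 0) / fps
    return auc, ap, tta

def mean_tta(scores, accident_frame, fps=10.0, thresholds=None):
    """mTTA: time-to-accident averaged over a sweep of alarm thresholds."""
    thresholds = np.linspace(0.0, 1.0, 21) if thresholds is None else thresholds
    scores = np.asarray(scores)
    ttas = []
    for t in thresholds:
        alarms = np.flatnonzero(scores >= t)
        first_alarm = alarms[0] if alarms.size else accident_frame
        ttas.append(max(accident_frame - first_alarm, 0) / fps)
    return float(np.mean(ttas))
```

Under this reading, a higher TTA means the model raises the alarm earlier before the annotated onset, which is why both TTA columns are marked with ↑.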
Model Name | Fusion Module | T2I-SFLayers | Single Enhanced Module | Enhanced CHIM | AUC ↑ | AP ↑ | TTA0.5 (s) ↑ | mTTA (s) ↑ |
---|---|---|---|---|---|---|---|---
EAR-CCPM-Net w/o Fusion | ✕ | ✓ | ✓ | ✓ | 0.813 | 0.734 | 3.927 | 4.209
EAR-CCPM-Net w/o T2I-SFLayer | ✓ | ✕ | ✓ | ✓ | 0.826 | 0.740 | 3.997 | 4.371
EAR-CCPM-Net w/o Single Enhanced | ✓ | ✓ | ✕ | ✓ | 0.837 | 0.749 | 4.089 | 4.482
EAR-CCPM-Net w/o Enhanced CHIM | ✓ | ✓ | ✓ | ✕ | 0.842 | 0.752 | 4.121 | 4.536
EAR-CCPM-Net (Full) | ✓ | ✓ | ✓ | ✓ | 0.853 | 0.758 | 4.225 | 4.672
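The ablation rows above toggle or resize four components of the architecture. A hypothetical configuration sketch of how such switches might be wired is given below; all field and variable names are illustrative, not taken from the released code.

```python
from dataclasses import dataclass

@dataclass
class CCPMConfig:
    """Illustrative ablation switches for the components in the table above."""
    use_fusion: bool = True          # cross-modal Fusion Module
    num_t2i_sf_layers: int = 3       # text-to-image semantic fusion (T2I-SF) layers
    num_single_enhanced: int = 3     # Single Enhanced Modules
    use_enhanced_chim: bool = True   # Enhanced Cross-Hypothesis Interaction Module

# Variants corresponding to rows of the ablation table:
full_model    = CCPMConfig()
wo_fusion     = CCPMConfig(use_fusion=False)
wo_t2i_sf     = CCPMConfig(num_t2i_sf_layers=0)
wo_single_enh = CCPMConfig(num_single_enhanced=0)
wo_chim       = CCPMConfig(use_enhanced_chim=False)
```

The defaults of three T2I-SF layers and three Single Enhanced Modules reflect the best-performing settings in the two depth-sweep tables that follow.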
Variant Name | Number of T2I-SFLayers | AUC ↑ | AP ↑ | TTA0.5 (s) ↑ | mTTA (s) ↑
---|---|---|---|---|---
EAR-CCPM-Net w/o T2I-SFLayer | 0 | 0.826 | 0.740 | 3.997 | 4.371 |
EAR-CCPM-Net (1 T2I-SFLayer) | 1 | 0.833 | 0.743 | 4.032 | 4.412 |
EAR-CCPM-Net (2 T2I-SFLayer) | 2 | 0.842 | 0.748 | 4.107 | 4.545 |
EAR-CCPM-Net (3 T2I-SFLayer) | 3 | 0.853 | 0.758 | 4.225 | 4.672 |
EAR-CCPM-Net (4 T2I-SFLayer) | 4 | 0.839 | 0.745 | 4.087 | 4.491 |
Variant Name | Number of Single Enhanced Modules | AUC ↑ | AP ↑ | TTA0.5 (s) ↑ | mTTA (s) ↑
---|---|---|---|---|---
EAR-CCPM-Net w/o Single Enhanced | 0 | 0.837 | 0.749 | 4.089 | 4.482 |
EAR-CCPM-Net (1 Single Enhanced) | 1 | 0.846 | 0.752 | 4.050 | 4.500 |
EAR-CCPM-Net (2 Single Enhanced) | 2 | 0.848 | 0.754 | 4.090 | 4.540 |
EAR-CCPM-Net (3 Single Enhanced) | 3 | 0.853 | 0.758 | 4.225 | 4.672 |
EAR-CCPM-Net (4 Single Enhanced) | 4 | 0.851 | 0.756 | 4.122 | 4.635 |
Baselines | KLD ↓ | CC ↑ | SIM ↑ | AUC ↑
---|---|---|---|---
BDDA [51] | 3.32 | 0.33 | 0.25 | 0.63 |
DR (eye)VE [52] | 2.27 | 0.45 | 0.32 | 0.64 |
ACLNet [53] | 2.51 | 0.35 | 0.35 | 0.64 |
DADA [54] | 2.19 | 0.50 | 0.37 | 0.66 |
DRIVE [50] | 2.65 | 0.33 | 0.19 | 0.66 |
CAP [9] | 1.89 | 0.38 | 0.28 | 0.85 |
EAR-CCPM-Net (Proposed) | 1.72 | 0.40 | 0.29 | 0.87
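Reading the columns as the standard attention-agreement metrics cited earlier (Kullback–Leibler divergence [43], Pearson correlation coefficient [44], similarity, and AUC [45,46]), the sketch below shows how the first three are commonly computed for a pair of attention maps. It is a generic illustration, not the authors' evaluation code; AUC variants additionally require discrete fixation locations and are omitted here.

```python
import numpy as np

def saliency_metrics(pred, gt, eps=1e-8):
    """Agreement between a predicted and a ground-truth attention map.

    Both inputs are 2-D non-negative arrays; each is normalized to a
    probability distribution before comparison.
    """
    p = np.asarray(pred, dtype=np.float64)
    g = np.asarray(gt, dtype=np.float64)
    p = p / (p.sum() + eps)
    g = g / (g.sum() + eps)

    # KLD (lower is better): divergence of the prediction from the ground truth
    kld = float(np.sum(g * np.log(g / (p + eps) + eps)))

    # CC (higher is better): Pearson correlation between the flattened maps
    cc = float(np.corrcoef(p.ravel(), g.ravel())[0, 1])

    # SIM (higher is better): histogram intersection of the two distributions
    sim = float(np.minimum(p, g).sum())
    return kld, cc, sim
```

On this reading, the proposed model's lower KLD (1.72) and higher AUC (0.87) indicate attention maps that diverge least from, and rank fixated regions most consistently with, the human ground truth.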
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).