Article

Gated Fusion Networks for Multi-Modal Violence Detection

1 Digital Image Processing Laboratory, Department of Computer Science, Islamia College Peshawar, Peshawar 25120, Pakistan
2 College of Information Technology, United Arab Emirates University (UAEU), Al Ain 15551, United Arab Emirates
* Author to whom correspondence should be addressed.
AI 2025, 6(10), 259; https://doi.org/10.3390/ai6100259
Submission received: 28 August 2025 / Revised: 19 September 2025 / Accepted: 1 October 2025 / Published: 3 October 2025

Abstract

Public safety and security require effective monitoring systems that can detect violence from visual, audio, and motion data. However, current methods often fail to exploit the complementary strengths of the visual and auditory modalities, which limits their overall effectiveness. To enhance violence detection, we present a novel multimodal method that integrates motion, audio, and visual information to recognize violence. Our framework comprises two specialized components, a gated fusion module and a multi-scale transformer, which together enable efficient violence detection in multimodal data. To ensure seamless and effective feature integration, the gated fusion module dynamically adjusts the contribution of each modality, while the multi-scale transformer employs multiple instance learning (MIL) to identify violent behaviors more accurately by capturing complex temporal correlations in the input. Using these techniques, our model fully integrates multi-modal information and improves the accuracy of violence detection. Our approach outperforms state-of-the-art methods with an accuracy of 86.85% on the XD-Violence dataset, demonstrating the potential of multi-modal fusion for violence detection.
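The gated fusion described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the gate parameters (`w_v`, `w_a`, `bias`) stand in for weights that would be learned during training, and a single scalar sigmoid gate blends one visual and one audio feature vector.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(visual, audio, w_v, w_a, bias):
    """Fuse two modality feature vectors with a scalar sigmoid gate.

    The gate g = sigmoid(w_v . visual + w_a . audio + bias) weights the
    visual stream and (1 - g) weights the audio stream, so the fused
    vector is a convex combination of the two modalities whose balance
    depends on the input features themselves.
    """
    score = (sum(wv * v for wv, v in zip(w_v, visual))
             + sum(wa * a for wa, a in zip(w_a, audio))
             + bias)
    g = sigmoid(score)
    return [g * v + (1.0 - g) * a for v, a in zip(visual, audio)]
```

With a zero gate score, sigmoid(0) = 0.5 and the fused vector is the plain average of the two modalities; a trained gate would shift this balance per sample, emphasizing whichever modality is more informative for the current clip.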
Keywords: violence detection; multi-modality; multi-modal fusion; weakly supervised learning

Share and Cite

MDPI and ACS Style

Ahmad, B.; Khan, M.; Sajjad, M. Gated Fusion Networks for Multi-Modal Violence Detection. AI 2025, 6, 259. https://doi.org/10.3390/ai6100259

