Article

ViolenceNet: Dense Multi-Head Self-Attention with Bidirectional Convolutional LSTM for Detecting Violence

1 Departamento de Lenguajes y Sistemas Informáticos, Universidad de Sevilla, 41012 Sevilla, Spain
2 VISILAB, E.T.S.I.I., University of Castilla-La Mancha, 13071 Ciudad Real, Spain
* Author to whom correspondence should be addressed.
Academic Editor: Juan M. Corchado
Electronics 2021, 10(13), 1601; https://doi.org/10.3390/electronics10131601
Received: 24 May 2021 / Revised: 29 June 2021 / Accepted: 30 June 2021 / Published: 3 July 2021
(This article belongs to the Special Issue Computer Vision for Surveillance)
Introducing efficient automatic violence detection into video surveillance or audiovisual content monitoring systems would greatly facilitate the work of closed-circuit television (CCTV) operators, rating agencies, or those in charge of monitoring social network content. In this paper we present a new deep learning architecture that combines an adapted three-dimensional version of DenseNet, a multi-head self-attention layer and a bidirectional convolutional long short-term memory (LSTM) module to encode relevant spatio-temporal features and determine whether a video is violent or not. Furthermore, an ablation study of the input frames, comparing dense optical flow with adjacent-frame subtraction, and of the influence of the attention layer is carried out, showing that the combination of optical flow and the attention mechanism improves results by up to 4.4%. The experiments conducted on four of the most widely used datasets for this problem match or in some cases exceed the state of the art, while reducing the number of network parameters needed (4.5 million) and improving efficiency both in test accuracy (from 95.6% on the most complex dataset to 100% on the simplest one) and in inference time (less than 0.3 s for the longest clips). Finally, to check whether the generated model is able to generalize violence, a cross-dataset analysis is performed, which shows the complexity of this approach: training on three datasets and testing on the remaining one, the accuracy drops to 70.08% in the worst case and to 81.51% in the best case, which points to future work oriented towards anomaly detection in new datasets.
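The multi-head self-attention layer mentioned in the abstract weights each time step of the temporal feature sequence against all others. As a purely illustrative sketch of that general mechanism — not the authors' actual implementation, and with made-up toy dimensions (5 frames, 8 features, 2 heads) — the computation looks like this in NumPy:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Scaled dot-product self-attention with several heads.

    x: (seq_len, d_model) sequence of per-frame feature vectors.
    w_q, w_k, w_v, w_o: (d_model, d_model) projection matrices.
    """
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    # Project queries/keys/values and split into heads: (heads, seq, d_head).
    q = (x @ w_q).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    k = (x @ w_k).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    v = (x @ w_v).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    # Attention weights over time steps, scaled by sqrt(d_head).
    weights = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_head))
    out = weights @ v  # (heads, seq, d_head)
    # Concatenate heads back together and apply the output projection.
    return out.transpose(1, 0, 2).reshape(seq_len, d_model) @ w_o

# Toy demo with random weights (hypothetical sizes, for shape-checking only).
rng = np.random.default_rng(0)
x = rng.standard_normal((5, 8))
w_q, w_k, w_v, w_o = (rng.standard_normal((8, 8)) for _ in range(4))
out = multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads=2)
```

In the paper's pipeline the analogous layer sits between the 3D feature extractor and the bidirectional ConvLSTM, letting the model attend to the frames most indicative of violent motion.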
Keywords: violence detection; fight detection; deep learning; dense net; bidirectional ConvLSTM
MDPI and ACS Style

Rendón-Segador, F.J.; Álvarez-García, J.A.; Enríquez, F.; Deniz, O. ViolenceNet: Dense Multi-Head Self-Attention with Bidirectional Convolutional LSTM for Detecting Violence. Electronics 2021, 10, 1601. https://doi.org/10.3390/electronics10131601

