Sensors
  • Article
  • Open Access

10 October 2025

IDG-ViolenceNet: A Video Violence Detection Model Integrating Identity-Aware Graphs and 3D-CNN

School of Computer Science and Engineering, Sichuan University of Science and Engineering, Yibin 644000, China
* Author to whom correspondence should be addressed.
This article belongs to the Section Sensing and Imaging

Highlights

What are the main findings?
  • The proposed method efficiently detects violent behaviors in videos with significantly improved accuracy.
  • The model outperforms existing methods in both detection speed and precision.
What is the implication of the main finding?
  • It is applicable to public safety surveillance for rapid response to violent incidents.
  • It provides a new reference and methodology for optimizing violent behavior detection algorithms.

Abstract

Video violence detection plays a crucial role in intelligent surveillance and public safety, yet existing methods still face challenges in modeling complex multi-person interactions. To address this, we propose IDG-ViolenceNet, a dual-stream video violence detection model that integrates identity-aware spatiotemporal graphs with three-dimensional convolutional neural networks (3D-CNN). Specifically, the model utilizes YOLOv11 for high-precision person detection and cross-frame identity tracking, constructing a dynamic spatiotemporal graph that encodes spatial proximity, temporal continuity, and individual identity information. On this basis, a GINEConv branch extracts structured interaction features, while an R3D-18 branch models local spatiotemporal patterns. The two representations are fused in a dedicated module for cross-modal feature integration. Experimental results show that IDG-ViolenceNet achieves accuracies of 97.5%, 99.5%, and 89.4% on the Hockey Fight, Movies Fight, and RWF-2000 datasets, respectively, significantly outperforming state-of-the-art methods. Additionally, ablation studies validate the contributions of key components in improving detection accuracy and robustness.

1. Introduction

With the increasing frequency of public safety incidents, violence detection has gradually become one of the core tasks in intelligent video surveillance systems. In crowded public spaces such as shopping malls, railway stations, and campuses, achieving real-time recognition and early warning of violent events is of great practical significance for maintaining social order and preventing potential risks. Violence, by its nature, is a type of sudden behavior that heavily depends on the interactions between individuals. Existing studies have shown that relying solely on global appearance features often fails to effectively distinguish violent from non-violent scenarios, while neglecting spatiotemporal interaction patterns among people can substantially weaken a model’s representational capacity [1]. In recent years, Graph Neural Networks (GNNs) [2] have demonstrated outstanding performance in modeling entity interactions and have been widely applied to tasks such as action recognition and group behavior modeling [3]. Among these approaches, human-centered graph modeling methods [4] have attracted considerable attention from the research community. In such methods, detected individuals in videos are treated as graph nodes, and edges are constructed based on spatial proximity or interactive relationships, thereby effectively enhancing the modeling of local interaction patterns [5].
Although traditional video action recognition methods—primarily relying on handcrafted feature extraction or two-dimensional convolutional neural networks (2D-CNNs)—achieved certain progress in early studies, they inherently struggle to model dynamic interactions and long-range temporal dependencies. When confronted with real-world challenges such as complex backgrounds, target occlusion, and motion blur, their robustness and recognition accuracy are often severely constrained. Existing approaches generally suffer from two major limitations: (1) most interaction graphs are constructed on a single-frame basis, making it difficult to capture the behavioral continuity of the same individual across temporal frames [6], and (2) commonly adopted graph structures often exist in the form of “static snapshots,” lacking effective modeling of motion trajectories and identity information, which restricts the depth and completeness of temporal semantic representation [7].
To address the aforementioned challenges, this paper proposes a dual-stream violence detection framework that integrates a three-dimensional convolutional neural network (3D-CNN) [8] with an identity-aware graph neural network, termed IDG-ViolenceNet (Identity-aware Graph and 3D-CNN Fusion for Violence Detection). The core ideas of this method include:
(1) Identity-Aware Graph Construction: Human instances are first detected in video frames using a YOLO-based model. A cross-frame matching algorithm then assigns unique identity labels to each instance. On this basis, a dynamic graph structure is constructed that encodes not only identity and spatial proximity but also temporal continuity and motion trajectory consistency.
(2) Dual-Branch Feature Extraction with 3D-CNN and GNN: The 3D-CNN branch captures spatial features and short-term temporal dependencies from frame sequences, while the GNN branch aggregates interaction patterns among nodes within the graph. This complementary modeling approach integrates visual features with structural interaction features.
(3) Cross-Modal Fusion Mechanism: Features from the two modalities are effectively fused and passed through a classifier for final prediction, enabling accurate recognition of violent behaviors. This design is particularly robust and adaptive in complex video scenarios involving multiple participants and intense interactions.
In practical application scenarios, IDG-ViolenceNet exhibits strong scalability and is oriented toward deployable use. It can be integrated as a front-end early-warning module into existing video surveillance platforms to help security systems proactively identify violent incidents, thereby reducing the costs and risks of post-event intervention. To validate the effectiveness of the method, we conducted experiments on three public datasets—HockeyFight, MoviesFight, and RWF-2000—achieving accuracies of 97.5% and 99.5% on HockeyFight and MoviesFight, respectively, and 89.4% on the more challenging RWF-2000. In addition, ablation studies show that the identity-aware graph construction mechanism and the cross-modal fusion strategy play a key role in performance gains, supporting the soundness and effectiveness of the proposed design.

3. Methodology

To comprehensively model the dynamic interactions among multiple individuals in videos, we propose IDG-ViolenceNet, a violence detection framework that integrates identity-aware graph structures with three-dimensional convolutional neural networks (3D-CNNs). The overall architecture is illustrated in Figure 1. Specifically, the input video is first divided into continuous frame sequences, and each detected person is assigned a consistent cross-frame identity ID using an object detection module combined with a lightweight tracking algorithm. Based on spatial proximity among individuals and the motion trajectories of the same identity across consecutive frames, we construct an identity-aware spatiotemporal graph to explicitly capture inter-person interactions. Subsequently, the model employs a graph neural network (GINEConv) to perform feature aggregation over the graph structure, while a 3D-CNN branch extracts spatiotemporal representations directly from the raw frame sequences. Finally, a cross-modal fusion mechanism integrates the two types of features to achieve high-precision recognition of violent behaviors. This approach preserves local action semantics while effectively strengthening the modeling of spatiotemporal interaction relationships, making it particularly suitable for real-world surveillance scenarios involving dense crowds and complex activities.
Figure 1. Framework of the IDG-ViolenceNet Model.

3.1. Data Preprocessing and Identity-Aware Graph Construction

To enable spatiotemporal relationship modeling of human interactions in videos, we design a processing pipeline based on the raw video data, consisting of data cleaning, frame extraction, object detection, cross-frame identity tracking, and spatiotemporal graph construction. The overall workflow is illustrated in Figure 2.
Figure 2. Data Processing Pipeline for Spatiotemporal Relationship Modeling of Human Interactions in Videos.
The entire data preprocessing pipeline begins with Raw Video Clips, followed by Frame Extraction to obtain continuous frame sequences. Next, the YOLOv11x model is employed for object detection, retaining only bounding boxes corresponding to the person category. On this basis, a lightweight tracker with IoU-based matching is applied for ID Assignment & Trajectory Smoothing, ensuring identity consistency of the same individual across frames. Additionally, the tracker performs real-time updates, and a frame-level NMS threshold of 0.3 is applied to filter low-confidence detections.
Following this, person instances with stable IDs are mapped to graph nodes, and a Spatio-Temporal Graph is constructed based on spatial proximity and temporal continuity. The graph incorporates edges that represent person-person interactions within the same frame and between consecutive frames, capturing the dynamic relations. This graph is then formatted into a PyTorch Geometric (PyG) Data object (PyG v2.6.1), providing structured representations for subsequent graph neural network modeling. This structure enables the efficient modeling of interactions, leveraging both spatial and temporal features for robust identity tracking.

3.1.1. Video Preprocessing and High-Precision Person Detection

To ensure the validity of the input data and the stability of subsequent modeling, the raw video data are first preprocessed by removing clips that are irrelevant to the research task or severely degraded in quality. The remaining videos are then decomposed into continuous frame sequences at a fixed frame rate.
In the detection stage, we adopt YOLO11x.pt as the object detection model to perform inference on each video frame, retaining only the detections classified as person. YOLO11x represents the largest variant within the YOLOv11 [25] family, containing 56.9M parameters and requiring 194.9B FLOPs, with an mAP50–95 of 54.7 on the COCO dataset—achieving higher detection accuracy than other YOLOv11 versions. Although computationally more expensive, its deeper and wider network architecture combined with multi-scale feature fusion enables superior robustness in small-object detection, occlusion handling, and crowded scenes. The detection results are output in the form of pixel coordinates (x1, y1, x2, y2), which are further converted into normalized center coordinates (cx, cy) and width–height (w, h) within the [0, 1] range to ensure consistent input scales. Leveraging the high-precision detection capability of YOLO11x.pt together with a stable target assignment mechanism, the model can assign consistent ID labels to the same individuals across frames during inference, thereby achieving preliminary cross-frame identity association while effectively reducing both missed detections and false positives.
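As a minimal sketch of this per-frame detection and coordinate-normalization step (assuming the Ultralytics inference API; the confidence threshold and the helper name detect_persons are illustrative rather than the exact values used in our pipeline), the procedure can be written as:

```python
# Minimal sketch of per-frame person detection and coordinate normalization.
# Assumes the Ultralytics YOLO API; the confidence threshold is illustrative.
from ultralytics import YOLO

model = YOLO("yolo11x.pt")  # largest YOLOv11 variant, pretrained on COCO

def detect_persons(frame, conf_thres=0.25):
    """Return normalized (cx, cy, w, h, conf) boxes for the 'person' class.

    `frame` is assumed to be an H x W x 3 image array (e.g., read with OpenCV).
    """
    H, W = frame.shape[:2]
    results = model(frame, verbose=False)[0]
    boxes = []
    for xyxy, cls, conf in zip(results.boxes.xyxy.tolist(),
                               results.boxes.cls.tolist(),
                               results.boxes.conf.tolist()):
        if int(cls) != 0 or conf < conf_thres:   # class 0 == 'person' in COCO
            continue
        x1, y1, x2, y2 = xyxy
        cx, cy = (x1 + x2) / 2 / W, (y1 + y2) / 2 / H   # normalized center
        w, h = (x2 - x1) / W, (y2 - y1) / H             # normalized width/height
        boxes.append((cx, cy, w, h, conf))
    return boxes
```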
However, in crowded scenes or cases where individuals are in close proximity, the pretrained YOLO model may mistakenly merge multiple adjacent persons into a single bounding box, leading to missed detections or identity confusion. To mitigate such issues, this study further incorporates the following optimization strategies during the preprocessing stage:
(1) Scale Filtering: Remove detection boxes that are excessively small or have abnormal aspect ratios to reduce noise interference. (2) Multi-Scale/High-Resolution Inference: Enhance detection accuracy in crowded scenarios and mitigate the merging of adjacent targets. (3) NMS Threshold Adjustment: Lower the non-maximum suppression threshold appropriately to preserve independent detections of closely positioned individuals.
As shown in Figure 3, the three scenes are (a) outdoor street, (b) indoor surveillance, and (c) low-light industrial environment. Frames in each row are arranged in temporal order from left to right. The YOLO11x.pt model accurately detects persons and maintains consistent track IDs across frames, providing a reliable basis for subsequent identity-aware graph construction.
Figure 3. YOLO-based person detection and ID tracking across three scenes (a) outdoor street; (b) indoor surveillance; (c) low-light industrial setting.

3.1.2. Identity Association and Graph Construction

After completing frame-level person detection and preliminary ID assignment during the data preprocessing stage, this study further incorporates a lightweight tracking strategy based on Intersection over Union (IoU) matching to enhance the continuity and robustness of cross-frame identity association. Let the detected bounding box of a person in the current frame be denoted as $B_a$, and the most recent position of an existing trajectory be denoted as $B_b$. The IoU is defined as
$$\mathrm{IoU}(B_a, B_b) = \frac{|B_a \cap B_b|}{|B_a \cup B_b|}$$
Here, $|B_a \cap B_b|$ denotes the area of intersection between the two bounding boxes, while $|B_a \cup B_b|$ represents the area of their union. If the matching score exceeds the predefined threshold $\tau_{iou}$ (set to 0.5 in our experiments), the detection result is assigned to the corresponding trajectory, and its position and temporal information are updated. Unmatched detections will generate new trajectories with new IDs, whereas unmatched historical trajectories are removed once their consecutive missing frames exceed the upper limit max_lost (set to 10 frames in our experiments). This strategy, with low computational overhead, effectively maintains the stability of identity labels in short sequences and tolerates transient detection failures caused by occlusion or missed detections, thereby reducing identity loss.
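A minimal sketch of such an IoU-based tracker is given below, using the thresholds stated above ($\tau_{iou} = 0.5$, max_lost = 10); the class name and data structures are illustrative rather than the exact implementation:

```python
# Sketch of the lightweight IoU tracker: greedy matching of detections to the most
# recent box of each trajectory, with new-track creation and stale-track removal.

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

class IouTracker:
    def __init__(self, tau_iou=0.5, max_lost=10):
        self.tau_iou, self.max_lost = tau_iou, max_lost
        self.tracks = {}          # id -> {"box": (x1, y1, x2, y2), "lost": int}
        self.next_id = 0

    def update(self, detections):
        """detections: list of (x1, y1, x2, y2). Returns list of (track_id, box)."""
        assigned, unmatched = [], list(detections)
        for tid, tr in list(self.tracks.items()):
            best, best_iou = None, self.tau_iou
            for det in unmatched:                 # pick the best-overlapping detection
                s = iou(tr["box"], det)
                if s >= best_iou:
                    best, best_iou = det, s
            if best is not None:
                tr["box"], tr["lost"] = best, 0
                unmatched.remove(best)
                assigned.append((tid, best))
            else:
                tr["lost"] += 1
                if tr["lost"] > self.max_lost:    # drop stale trajectories
                    del self.tracks[tid]
        for det in unmatched:                     # unmatched detections start new tracks
            self.tracks[self.next_id] = {"box": det, "lost": 0}
            assigned.append((self.next_id, det))
            self.next_id += 1
        return assigned
```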
When constructing the Identity-aware Spatiotemporal Graph, each person instance with a stable ID is modeled as a node, and its node feature vector is defined as
$$x_i = \left[c_x^{(i)},\ c_y^{(i)},\ w^{(i)},\ h^{(i)},\ \frac{t_i}{T_{\max}}\right]$$
Here, $(c_x^{(i)}, c_y^{(i)})$ denote the center coordinates of node $i$, while $w^{(i)}$ and $h^{(i)}$ represent its width and height (all normalized to the range [0, 1]). $t_i$ is the frame index, and $T_{\max}$ is the maximum frame index of the video, used for temporal normalization. This design not only preserves the individual’s spatial positional information but also explicitly encodes temporal information into the node features, enabling downstream graph neural networks to jointly leverage spatial and temporal dimensions.
The rules for edge construction are divided into two categories:
1. Temporal edges
$$e_{ij}^{\mathrm{time}} = \begin{cases} 1, & \text{if } ID_i = ID_j \ \text{and}\ 0 < |t_i - t_j| \le T_\omega \\ 0, & \text{otherwise} \end{cases}$$
Here, $T_\omega$ denotes the temporal window size (set to 2 frames in our experiments), which is used to reflect the continuity of the motion trajectory of the same individual within a short time span.
2. Spatial edges
$$e_{ij}^{\mathrm{space}} = \begin{cases} 1, & \text{if } t_i = t_j \ \text{and}\ \|P_i - P_j\|_2 < d_{th} \\ 0, & \text{otherwise} \end{cases}$$
Here, $P_i = (c_x^{(i)}, c_y^{(i)})$ denotes the center coordinates of individual $i$, and $d_{th}$ represents the spatial distance threshold (set to 0.3 in our experiments after normalization), which is used to describe local interaction relationships among individuals within the same frame.
Each edge is associated with an edge feature vector, defined as
$$a_{ij} = \left[\frac{|t_i - t_j|}{\alpha_t},\ \frac{\|P_i - P_j\|_2}{\alpha_d}\right]$$
Here, $\alpha_t = 5.0$ and $\alpha_d = 300.0$ are normalization constants for temporal intervals and spatial distances, respectively. This design not only preserves the original temporal–spatial information but also maps the feature values into a numerical range suitable for graph neural network inputs. If no nodes in the video meet the specified conditions, a single-node graph with zero-valued features is generated to avoid structural deficiencies in downstream modeling.
To provide an intuitive illustration of the constructed spatiotemporal graph structure, we use NetworkX for visualizing nodes and edges. Node colors are assigned based on individual IDs, with brightness gradually deepening as time progresses, reflecting the passage of time in a static graph. Node labels are formatted as “ID@Frame,” capturing the corresponding trajectory patterns. In Figure 4, nodes of different colors represent unique individual identities, and the edges signify their spatial or temporal associations. Edges are constructed based on the spatial and temporal proximity between nodes. Specifically, if two nodes belong to the same individual (i.e., they share the same ID) and their time difference is within a specified temporal window, an edge is formed. Additionally, spatial edges are drawn when the Euclidean distance between two nodes is smaller than a spatial threshold. The color of the edges varies depending on the distance between nodes. Edges representing closer relationships (either spatial or temporal) are darker, while edges representing more distant relationships are lighter, creating a visual representation of the strength of the connections. Trajectories formed by same-colored nodes indicate the movement paths of individuals, while the temporal progression is conveyed through the darkening of node brightness. Three sets of visualizations showcase identity-aware graph structures in varying scenarios, from sparse distributions to dense interactions. These visualizations maintain the stability of individual IDs and ensure the rationality of edge connections, demonstrating the method’s adaptability across different environments.
Figure 4. Identity-Aware Graphs from Different Videos.
Finally, the identity-aware graph is stored in the form of a PyTorch Geometric Data object, which includes the node feature matrix, edge indices, edge features, ID list, and frame indices, serving as the direct input for downstream spatiotemporal graph neural networks.
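A minimal sketch of this graph construction, assuming node tuples of the form (track_id, frame_idx, cx, cy, w, h) and using the thresholds and normalization constants stated above, is shown below; the function name and the single-node fallback handling are illustrative:

```python
# Sketch of identity-aware graph construction: node features, temporal/spatial edges,
# and normalized edge features packed into a PyTorch Geometric Data object.
import math
import torch
from torch_geometric.data import Data

def build_identity_graph(nodes, t_max, t_w=2, d_th=0.3, alpha_t=5.0, alpha_d=300.0):
    if not nodes:  # fallback: single zero-feature node to avoid an empty graph
        return Data(x=torch.zeros((1, 5)),
                    edge_index=torch.zeros((2, 0), dtype=torch.long),
                    edge_attr=torch.zeros((0, 2)))
    # Node features: [cx, cy, w, h, t / T_max]
    x = torch.tensor([[cx, cy, w, h, t / max(t_max, 1)]
                      for (_, t, cx, cy, w, h) in nodes], dtype=torch.float)
    src, dst, attrs = [], [], []
    for i, (id_i, t_i, cx_i, cy_i, *_) in enumerate(nodes):
        for j, (id_j, t_j, cx_j, cy_j, *_) in enumerate(nodes):
            if i == j:
                continue
            dt = abs(t_i - t_j)
            dist = math.hypot(cx_i - cx_j, cy_i - cy_j)
            temporal = (id_i == id_j) and (0 < dt <= t_w)    # same identity, nearby frames
            spatial = (t_i == t_j) and (dist < d_th)         # same frame, close in space
            if temporal or spatial:
                src.append(i); dst.append(j)
                attrs.append([dt / alpha_t, dist / alpha_d])  # normalized edge features
    edge_index = torch.tensor([src, dst], dtype=torch.long)
    edge_attr = torch.tensor(attrs, dtype=torch.float) if attrs else torch.zeros((0, 2))
    return Data(x=x, edge_index=edge_index, edge_attr=edge_attr)
```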

3.2. Architecture Choice

The architecture of IDG-ViolenceNet combines 3D-CNN with GINEConv to effectively capture both spatial and temporal features in video data.
3D-CNN: 3D-CNNs are widely used for processing video data because they can capture both spatial and temporal features by applying convolutional filters across both the spatial dimensions (height and width of the frame) and the temporal dimension (time). In our model, the 3D-CNN layers are responsible for extracting local spatiotemporal features from video frames. This allows the model to detect motion, objects, and other important temporal dynamics within the video. 3D-CNNs are ideal for tasks like action recognition or violence detection, where understanding both the space and time in the video is crucial.
GINEConv: In addition to spatial and temporal feature extraction, understanding the interactions between individuals in the video is essential for detecting violent behavior. For this purpose, we use GINEConv, a type of Graph Neural Network. GINEConv is used to model the relationships between individuals in the video by representing each person as a node in a graph, with edges representing interactions between them. The GINEConv layer aggregates information from neighboring nodes (individuals), allowing the model to capture the dynamic interactions between people over time, which is essential for detecting violence in scenes with multiple people.
By combining 3D-CNN and GINEConv, our model benefits from the strengths of both architectures: 3D-CNN for extracting local spatiotemporal features and GINEConv for modeling complex interactions between individuals. This hybrid architecture is particularly effective for video violence detection, as it can learn both individual actions and the relationships between individuals in the video, which is crucial for detecting violent behavior in dynamic and crowded scenes.

3.3. Hyperparameter Selection

In our experiments, we focused on tuning two critical hyperparameters: the learning rate and the batch size, as these significantly affect model performance and training stability.
Learning Rate: We set the learning rate to 0.001, which is a commonly used value for the Adam optimizer. A learning rate of 0.001 was tested and found to provide stable training without causing large fluctuations in the loss function. During preliminary experiments, we tested learning rates ranging from 0.0001 to 0.1. A learning rate of 0.001 consistently provided the best validation performance, ensuring the model converged without overshooting the optimal solution. Higher learning rates (e.g., 0.01 or 0.1) led to instability in training, where the model’s accuracy fluctuated significantly, and convergence was not achieved. Lower learning rates (e.g., 0.0001) resulted in slower convergence, leading to extended training times without a noticeable improvement in validation accuracy. Based on these empirical results, we selected 0.001 as the optimal learning rate to balance training speed and model performance.
Batch Size: Due to hardware configuration limitations, we experimented with three different batch sizes: 8, 16, and 32. A batch size of 16 was found to provide the best performance, striking the right balance between training time and model accuracy. Smaller batch sizes, such as 8, led to noisier gradient updates and slower convergence, while batch size 16 resulted in faster convergence and the best validation accuracy. We tested various batch sizes during preliminary experiments, and batch size 16 consistently performed better on the validation set, providing the best trade-off between computational efficiency and model generalization.
These hyperparameters were chosen based on a combination of empirical results and prior research, with the goal of providing stable training while achieving optimal model performance. The validation results confirmed that these settings were optimal for our task, leading to faster convergence, better generalization, and improved overall performance.

3.4. Multi-Branch Spatiotemporal Modeling Network

To simultaneously capture local spatiotemporal variation patterns and global interaction structures in videos, we propose a multi-branch video modeling framework that integrates 3D-CNNs and GNNs, as illustrated in Figure 5. The framework consists of two parallel branches:
Figure 5. Multi-Branch Video Modeling Framework.
In the spatiotemporal feature extraction branch, we employ the R3D-18-based 3D Convolutional network [26], which is designed to model local motion patterns and capture spatiotemporal dependencies within short-term frame sequences. Specifically, the input sequence length is set to 8 frames with a fixed resolution. All convolutional layers adopt a kernel size of (3, 3, 3), with the channel dimension starting from 64 and progressively increasing within the residual blocks. ReLU is used as the activation function, global average pooling is applied for feature aggregation, and a Dropout layer is inserted before the fully connected layer to alleviate overfitting.
In the identity-aware graph feature modeling branch, we adopt a GINEConv-based Graph Neural Network [27], which is designed to model the spatial relationships among multiple individuals and capture cross-frame interaction dependencies within a video. Each node is represented by a 5-dimensional feature vector (center coordinates, width, height, and normalized timestamp), while each edge is described by a 2-dimensional feature vector (normalized temporal interval and spatial distance). Node and edge features are projected into a 128-dimensional hidden space through fully connected layers. One-hop neighborhood aggregation is then performed, followed by Global Attention Pooling for feature aggregation. To enhance generalization, batch normalization and Dropout are applied after each convolutional layer.
During training, the batch size is set to 16. The learning rate is selected from {0.0005, 0.001, 0.0015}, with 0.001 yielding the best performance. The Adam optimizer is used in conjunction with a ReduceLROnPlateau scheduler, and the loss function is Cross-Entropy Loss. Early stopping is employed, halting training if validation performance shows no improvement for more than 5 consecutive epochs.
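The training configuration described above can be sketched as follows; model, train_loader, val_loader, and the evaluate helper are assumed placeholders rather than the exact implementation:

```python
# Sketch of the training setup: Adam (lr = 0.001), batch size 16, ReduceLROnPlateau,
# cross-entropy loss, and early stopping after 5 epochs without validation improvement.
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="max", patience=2)
criterion = torch.nn.CrossEntropyLoss()

best_acc, patience, stale = 0.0, 5, 0
for epoch in range(100):
    model.train()
    for clips, graphs, labels in train_loader:       # paired video clips and PyG graphs
        optimizer.zero_grad()
        logits = model(clips, graphs)
        loss = criterion(logits, labels)
        loss.backward()
        optimizer.step()

    val_acc = evaluate(model, val_loader)             # assumed helper returning accuracy
    scheduler.step(val_acc)
    if val_acc > best_acc:
        best_acc, stale = val_acc, 0
    else:
        stale += 1
        if stale >= patience:                         # early stopping
            break
```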
Figure 5 illustrates the overall structure of the proposed multi-branch spatiotemporal modeling network: the inputs on the left consist of video frames and the identity-aware graph, which are fed into the R3D-18 branch and the GNN branch, respectively, for feature extraction; the fusion module integrates the features from both modalities; and finally, the classifier outputs the recognition results.

3.5. 3D-CNN Branch: Local Spatiotemporal Feature Extraction

To capture local motion patterns within video clips, we adopt R3D-18 as the backbone for convolutional feature extraction. Structurally, R3D-18 inherits the four residual stages of ResNet-18 [28] (Conv1, Conv2_x, Conv3_x, Conv4_x) while extending the convolutional kernels from two dimensions ($k_h \times k_w$) to three dimensions ($k_t \times k_h \times k_w$), enabling the model to perform feature modeling simultaneously along the temporal, vertical, and horizontal dimensions.
At the input stage, to ensure stability in temporal modeling, the video frame sequence is uniformly sampled to a fixed length of T = 8 frames, yielding local cropped images of detected persons. Each frame is cropped using the detection bounding box and resized to 112 × 112 pixels, forming the input tensor:
$$X \in \mathbb{R}^{B \times 3 \times T \times 112 \times 112}$$
Here, B denotes the batch size. The 3D convolution operation in R3D-18 is formulated as
$$Y_{t,h,w} = \sum_{t'=0}^{k_t - 1}\sum_{h'=0}^{k_s - 1}\sum_{w'=0}^{k_s - 1} W_{t',h',w'}\, X_{t+t',\, h+h',\, w+w'}$$
In our experiments, both the temporal kernel size $k_t$ and the spatial kernel size $k_s$ are set to 3. The stride is set to 1 along the temporal dimension and 2 along the spatial dimensions (for spatial downsampling). The network weights are initialized from a model pretrained on the Kinetics-400 dataset, leveraging its rich and generalizable motion features.
The global feature produced by R3D-18 has the shape $(B, 512, 1, 1, 1)$ and is projected by a fully connected layer (cnn_fc) to a $d_{cnn} = 256$-dimensional vector, yielding the local spatiotemporal representation $f_{3D}$. This representation is highly sensitive to short-term action variations (e.g., shoving, punching, kicking), making it particularly well-suited for scenarios such as violence detection that require capturing rapid movements.
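A minimal sketch of this branch, using the torchvision video-model API (Kinetics-400 pretrained R3D-18 with the classification head replaced by a 512-to-256 projection), is given below; the module name and dropout rate are illustrative assumptions:

```python
# Sketch of the 3D-CNN branch: pretrained R3D-18 backbone plus the cnn_fc projection,
# with Dropout placed before the fully connected layer as described in Section 3.4.
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18, R3D_18_Weights

class CNNBranch(nn.Module):
    def __init__(self, d_cnn=256, dropout=0.5):
        super().__init__()
        backbone = r3d_18(weights=R3D_18_Weights.KINETICS400_V1)
        backbone.fc = nn.Identity()                  # keep the 512-d pooled feature
        self.backbone = backbone
        self.cnn_fc = nn.Sequential(nn.Dropout(dropout), nn.Linear(512, d_cnn))

    def forward(self, clips):                        # clips: (B, 3, T, 112, 112)
        return self.cnn_fc(self.backbone(clips))     # -> (B, 256) local feature f_3D
```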

3.6. GNN Branch: Modeling Spatial Interaction Structures

To explicitly model the spatiotemporal interactions among multiple individuals in videos, we employ GINEConv [29] (Graph Isomorphism Network with Edge features) to perform graph convolution on the identity-aware graph. GINEConv is an effective Graph Neural Network (GNN) architecture that can handle graph data with edge features, making it especially suitable for graph structures enriched with edge features, such as relationships or interactions between individuals. In this model, each node represents a person instance, and each edge represents the spatiotemporal interactions between them. The GINEConv layer allows for the capture of spatiotemporal dependencies and dynamic interactions between individuals, providing more efficient modeling for multi-person tracking and recognition in videos. In the specific implementation, GINEConv uses two fully connected layers (MLP) to construct the convolution operation, updating both node features and edge features through this network. By applying nonlinear activation (ReLU) to the output of each layer, and aggregating the graph nodes through Global Attention pooling, the model is able to effectively capture complex spatiotemporal interaction patterns between individuals in the video.
Input Graph Structure:
Node features $x_v \in \mathbb{R}^{d_x}$: composed of the center coordinates $c_x, c_y$, the width and height $w, h$, and the normalized timestamp $t_{norm}$.
Edge features $e_{uv} \in \mathbb{R}^{d_e}$: composed of the normalized temporal interval and the normalized Euclidean distance, which, respectively, represent the temporal relationships and spatial proximity between individuals.
First, two independent linear transformations are applied to project both node and edge features into a unified hidden dimension of $d_h = 128$:
$$h_v^{(0)} = W_x x_v$$
$$z_{uv} = W_e e_{uv}$$
Subsequently, during the message-passing phase, GINEConv updates node representations by jointly leveraging both node features and edge features:
$$h_v^{(l+1)} = \mathrm{MLP}^{(l)}\!\left((1 + \varepsilon^{(l)})\, h_v^{(l)} + \sum_{u \in \mathcal{N}(v)} \sigma\!\left(h_u^{(l)} + z_{uv}\right)\right)$$
Here, $\mathcal{N}(v)$ denotes the set of neighbors of node $v$, $\sigma$ is the ReLU activation function, and $\varepsilon^{(l)}$ is a learnable parameter. After multiple layers of graph convolutions, the local information of nodes is progressively aggregated into a global graph-level representation.
Finally, Global Attention Pooling is applied to aggregate node features with attention weights into a video-level representation, producing a global interaction feature vector $f_{GNN}$ of dimension $d_{gnn} = 128$. This vector explicitly characterizes the interaction patterns and spatiotemporal structural relationships among multiple individuals, providing strong feature support for distinguishing different types of group behaviors (e.g., confrontation, cooperation).
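A minimal sketch of this branch in PyTorch Geometric is given below; AttentionalAggregation is PyG's implementation of global attention pooling, and the layer count and dropout rate are illustrative assumptions:

```python
# Sketch of the GNN branch: node/edge projections to 128-d, GINEConv layers with
# BatchNorm + ReLU + Dropout, and attention-based global pooling to f_GNN.
import torch
import torch.nn as nn
from torch_geometric.nn import GINEConv
from torch_geometric.nn.aggr import AttentionalAggregation

class GNNBranch(nn.Module):
    def __init__(self, d_node=5, d_edge=2, d_h=128, dropout=0.3):
        super().__init__()
        self.node_proj = nn.Linear(d_node, d_h)
        self.edge_proj = nn.Linear(d_edge, d_h)
        self.convs, self.norms = nn.ModuleList(), nn.ModuleList()
        for _ in range(2):
            mlp = nn.Sequential(nn.Linear(d_h, d_h), nn.ReLU(), nn.Linear(d_h, d_h))
            self.convs.append(GINEConv(mlp, train_eps=True))   # learnable epsilon
            self.norms.append(nn.BatchNorm1d(d_h))
        self.dropout = nn.Dropout(dropout)
        self.pool = AttentionalAggregation(gate_nn=nn.Linear(d_h, 1))

    def forward(self, data):                     # data: a PyG Batch of identity-aware graphs
        h = self.node_proj(data.x)
        z = self.edge_proj(data.edge_attr)
        for conv, norm in zip(self.convs, self.norms):
            h = self.dropout(torch.relu(norm(conv(h, data.edge_index, z))))
        return self.pool(h, data.batch)          # -> (B, 128) graph-level feature f_GNN
```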

3.7. Multimodal Feature Fusion Module

Before the classification stage, we introduce a multimodal feature fusion module to combine the complementary strengths of two modalities: the local spatiotemporal visual modality (extracted by R3D-18, representing appearance and short-term motion patterns in frame sequences) and the structured interaction modality (modeled by the GNN on the identity-aware graph, capturing spatial relationships and temporal dependencies among individuals). Since these modalities differ fundamentally in data representation and feature space, they can be regarded as heterogeneous modalities. By employing concatenation or attention-based weighted fusion, the model can adaptively integrate fine-grained motion cues with global interaction structures, thereby significantly enhancing the robustness and generalization ability of violence recognition. The fusion strategies are divided into two categories:
1. Concatenation Fusion
When attention-based fusion is not applied, the feature vector $f_{3D} \in \mathbb{R}^{d_{cnn}}$ from the CNN branch and the feature vector $f_{GNN} \in \mathbb{R}^{d_{gnn}}$ from the GNN branch are directly concatenated along the channel dimension:
$$f_{fused} = [f_{3D} \,\|\, f_{GNN}] \in \mathbb{R}^{d_{cnn} + d_{gnn}}$$
This approach preserves the complete information from both modalities and is suitable for scenarios where the feature dimensions are relatively small and computational resources are sufficient.
2. Attention Fusion
When attention-based fusion is enabled, the features from both modalities are first linearly projected to a common dimension $d_c = \min(d_{cnn}, d_{gnn})$:
$$\hat{f}_{3D} = W_{cnn} f_{3D}$$
$$\hat{f}_{GNN} = W_{gnn} f_{GNN}$$
Then, the aligned features are concatenated and fed into the attention weighting network to generate modality weights $w = [w_{cnn}, w_{gnn}]$:
$$w = \mathrm{Softmax}\!\left(W_2\, \sigma\!\left(W_1 [\hat{f}_{3D} \,\|\, \hat{f}_{GNN}]\right)\right)$$
The final fused feature is obtained by the weighted summation of the two modalities:
$$f_{fused} = w_{cnn} \cdot \hat{f}_{3D} + w_{gnn} \cdot \hat{f}_{GNN}$$
Attention-based fusion can adaptively adjust the contribution ratio of the two modalities according to the specific content of the input video. For example, in interaction-intensive scenarios, the weight of the GNN features may be higher, whereas in segments with pronounced rapid movements, the weight of the CNN features may dominate.
Regardless of the fusion strategy adopted, the final fused vector is fed into a fully connected classifier
$$\mathrm{FC} \rightarrow \mathrm{ReLU} \rightarrow \mathrm{Dropout}(p = 0.5) \rightarrow \mathrm{Softmax}$$
to output the class probability distribution of the video (violence/non-violence). This fusion design ensures that the local motion modeling capability of the CNN and the global structural modeling capability of the GNN complement each other, thereby enhancing the overall performance of video behavior recognition.
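A minimal sketch of the attention-based fusion and classification head is given below; the module name and any hidden sizes beyond those stated above are illustrative:

```python
# Sketch of attention fusion: project both branches to d_c = min(d_cnn, d_gnn),
# compute softmax modality weights, take the weighted sum, then classify with
# FC -> ReLU -> Dropout(0.5) -> (Softmax applied via the cross-entropy loss).
import torch
import torch.nn as nn

class AttentionFusionHead(nn.Module):
    def __init__(self, d_cnn=256, d_gnn=128, num_classes=2):
        super().__init__()
        d_c = min(d_cnn, d_gnn)
        self.proj_cnn = nn.Linear(d_cnn, d_c)
        self.proj_gnn = nn.Linear(d_gnn, d_c)
        self.attn = nn.Sequential(nn.Linear(2 * d_c, d_c), nn.ReLU(), nn.Linear(d_c, 2))
        self.classifier = nn.Sequential(
            nn.Linear(d_c, d_c), nn.ReLU(), nn.Dropout(0.5), nn.Linear(d_c, num_classes))

    def forward(self, f_3d, f_gnn):
        f_3d, f_gnn = self.proj_cnn(f_3d), self.proj_gnn(f_gnn)
        w = torch.softmax(self.attn(torch.cat([f_3d, f_gnn], dim=-1)), dim=-1)  # (B, 2)
        fused = w[:, 0:1] * f_3d + w[:, 1:2] * f_gnn                            # weighted sum
        return self.classifier(fused)    # logits; Softmax is applied in the loss/inference
```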

4. Experiments

4.1. Experimental Setup and Datasets

The experiments were conducted on a Windows 10 operating system with an Intel(R) Xeon(R) W-2275 @ 3.30GHz CPU, and accelerated using an NVIDIA RTX A4000 GPU. Python 3.8 was used as the programming language, while model construction, training, and optimization were implemented with the PyTorch deep learning framework and the PyTorch Geometric (PyG) library for graph neural networks.
To ensure the reliability and validity of our research results, we evaluate the proposed model on three widely used benchmark datasets: Hockey Fight, Movies Fight, and RWF-2000. The Hockey Fight dataset, constructed by Nievas et al. [30], consists of 500 violent and 500 non-violent short video clips, with each video containing an average of 41 frames. The Movies Fight dataset, also created by Nievas et al. [30], contains 201 videos—100 violent and 101 non-violent—each lasting approximately 1.6–2 s with around 50 frames on average. The RWF-2000 dataset, introduced by Cheng et al. [31], is a large-scale violence detection dataset built from real-world surveillance footage. It includes 1000 violent and 1000 non-violent clips, each about 5 s in length and averaging 150 frames. Compared with the other two datasets, RWF-2000 is more challenging, as it incorporates complex factors such as low illumination, blur, and occlusion, making it highly representative of real-world scenarios.
For all datasets, we adopt a 6:2:2 split into training, validation, and test sets using a video-level, class-stratified protocol to prevent data leakage. Specifically, all clips or frames originating from the same source video or camera scene are assigned exclusively to a single partition. After splitting, identity-aware graphs are constructed for each video, ensuring that tracking identities or near-duplicate frames do not cross between partitions.
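A minimal sketch of this video-level, class-stratified 6:2:2 split using scikit-learn is shown below; videos and labels are assumed per-video lists, so no clip from the same source video crosses partitions:

```python
# Sketch of a 6:2:2 video-level, class-stratified split (train / val / test).
from sklearn.model_selection import train_test_split

def split_videos(videos, labels, seed=42):
    # First carve off 60% for training, stratified by class label.
    train_v, rest_v, train_y, rest_y = train_test_split(
        videos, labels, test_size=0.4, stratify=labels, random_state=seed)
    # Split the remaining 40% evenly into validation and test (20% each overall).
    val_v, test_v, val_y, test_y = train_test_split(
        rest_v, rest_y, test_size=0.5, stratify=rest_y, random_state=seed)
    return (train_v, train_y), (val_v, val_y), (test_v, test_y)
```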

4.2. Evaluation Metrics

To comprehensively evaluate the classification performance of the proposed method on video violence detection, we adopt several commonly used metrics, including Accuracy, Precision, Recall, F1-Score, and the Area Under the Receiver Operating Characteristic Curve (AUC). In the evaluation process, violent videos (fight) are defined as the positive class, while non-violent videos (nonfight) are defined as the negative class.
Here, TP (True Positive) denotes the number of positive samples correctly classified as positive; TN (True Negative) denotes the number of negative samples correctly classified as negative; FP (False Positive) denotes the number of negative samples incorrectly classified as positive; and FN (False Negative) denotes the number of positive samples incorrectly classified as negative. The calculation formulas for each metric are as follows:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
Accuracy reflects the overall correctness of the model’s predictions; Precision measures the proportion of samples predicted as positive that are truly positive; Recall measures the proportion of positive samples that are correctly identified; and the F1-Score, as the harmonic mean of Precision and Recall, is more suitable for evaluating models under imbalanced class distributions.
In addition, to evaluate the model’s discriminative ability under different classification thresholds, this paper calculates the AUC. Specifically, in implementation, the value of the positive-class channel output from the Softmax function, $p(y = 1 \mid x)$, is taken as the discriminant score. By traversing all possible thresholds, the ROC curve is plotted, and the area under the curve is computed as the AUC value. AUC values closer to 1 indicate stronger separability of positive and negative classes by the model.
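A minimal sketch of the metric computation, using scikit-learn and the softmax positive-class probability as the ROC score, is given below; the helper name is illustrative:

```python
# Sketch of metric computation: Accuracy, Precision, Recall, F1, and AUC,
# where the softmax probability of the 'fight' class serves as the discriminant score.
import torch
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, roc_auc_score

def compute_metrics(logits, labels):
    """logits: (N, 2) tensor; labels: (N,) tensor with 1 = fight (positive class)."""
    probs = torch.softmax(logits, dim=1)[:, 1].cpu().numpy()   # p(y = 1 | x)
    preds = (probs >= 0.5).astype(int)
    y = labels.cpu().numpy()
    acc = accuracy_score(y, preds)
    prec, rec, f1, _ = precision_recall_fscore_support(y, preds, average="binary")
    auc = roc_auc_score(y, probs)                              # area under the ROC curve
    return {"accuracy": acc, "precision": prec, "recall": rec, "f1": f1, "auc": auc}
```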

4.3. Experimental Results and Analysis

To comprehensively evaluate the effectiveness and generalization capability of the proposed method, we conducted systematic experiments on three public violence recognition datasets: Hockey Fight, Movies Fight, and RWF-2000. The compared baselines include traditional optical flow-based methods (Vif), temporal modeling approaches (ECO), graph convolutional networks (DGCNN), lightweight spatiotemporal modeling (MobileNet-TSM), and Transformer-based methods (MoEViT). All methods were tested under the same hardware setup and dataset partitioning strategy to ensure fair and comparable results.
Table 1 presents the classification accuracy comparison of different methods on the three datasets. It can be observed that the proposed IDG-ViolenceNet achieves the highest accuracies of 97.5% and 99.5% on the Hockey Fight and Movies Fight datasets, respectively, and reaches 89.4% on the RWF-2000 dataset. While maintaining state-of-the-art accuracy, the model also demonstrates strong adaptability to challenging surveillance scenarios such as low resolution, viewpoint variation, and crowded scenes. These results indicate that the proposed method exhibits robust performance and strong generalization capability across diverse violence recognition tasks.
Table 1. Comparison of Accuracy (%) of Different Methods on Three Datasets.
We conducted a systematic evaluation of IDG-ViolenceNet on three benchmarks—HockeyFight, MoviesFight, and RWF-2000—using Accuracy, Precision, Recall, F1-Score, and AUC. All results are averaged over five random seeds and reported as mean ± std. Accuracy measures overall correctness; Precision and Recall characterize exactness and coverage for the positive class; F1 balances Precision and Recall; and AUC assesses discriminative ability across thresholds. As shown in Table 2, the model performs best on Movies Fight. Overall, training time increases with dataset complexity, while inference time remains largely stable. Taken together, the five-seed mean ± std results demonstrate high accuracy on constrained datasets and strong robustness and adaptability in more realistic surveillance scenarios.
Table 2. Performance Comparison of the Proposed Method on Different Datasets.
In addition to the quantitative metrics, we further examined the model’s performance through confusion matrices on the three benchmark datasets, as illustrated in Figure 6. These matrices provide a more fine-grained view of the classification results. For the Hockey Fight dataset, the model achieved an accuracy of 97.5%, with only a few samples misclassified. On the Movies Fight dataset, the model achieved 100% accuracy on the shown test split (21/21 and 20/20 correctly classified), while the average performance across multiple random runs or cross-validation folds was 99.50 ± 0.50%. In contrast, for the more challenging RWF-2000 dataset, the model achieved an accuracy of 89.4%, where most errors occurred in mislabeling “nonfight” actions as “fight.” This analysis highlights not only the robustness of the proposed approach on simpler datasets but also its limitations and generalization ability in more complex, real-world scenarios.
Figure 6. Confusion Matrices for Hockey Fight, Movies Fight, and RWF-2000 Datasets.
To systematically evaluate stability and generalization under small test folds (≈40–41), we use five random seeds (42, 1337, 2020, 2021, 3407) and perform five-fold cross-validation for each seed, yielding 25 independent test folds in total; the evaluation metrics include accuracy (ACC) and AUC. The results show that across all 25 test folds, MoviesFight achieves a mean ACC of 99.5%, with a minimum/maximum of 97.50%/100.00%; among these, 18/25 folds reach 100%, indicating high consistency and stability across seeds and folds. In terms of AUC, the five seed-specific five-fold curves remain close to 1.0 overall, with only slight dips in a few folds (see Figure 7), suggesting that local difficulty variations do not alter the overall high and robust discriminative performance.
Figure 7. Per-seed five-fold test AUCs on the MoviesFight dataset.
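A minimal sketch of this 5-seed × 5-fold protocol is given below; videos, labels, and the run_fold training/evaluation helper are assumed placeholders rather than the exact experimental scripts:

```python
# Sketch of the 5-seed x 5-fold evaluation used to report mean ± std over 25 test folds.
import numpy as np
from sklearn.model_selection import StratifiedKFold

seeds = [42, 1337, 2020, 2021, 3407]
accs, aucs = [], []
for seed in seeds:
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    for train_idx, test_idx in skf.split(videos, labels):
        acc, auc = run_fold(train_idx, test_idx, seed)   # assumed train/eval helper
        accs.append(acc); aucs.append(auc)

print(f"ACC: {np.mean(accs):.2%} ± {np.std(accs):.2%} "
      f"(min {np.min(accs):.2%}, max {np.max(accs):.2%})")
print(f"AUC: {np.mean(aucs):.3f} ± {np.std(aucs):.3f}")
```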

4.4. Ablation Study

To comprehensively evaluate the contribution of each component to the model’s performance and its applicability across different scenarios, we conducted ablation experiments on all three datasets; the results are summarized in Table 3, Table 4 and Table 5. Under identical training/validation splits, optimizer settings, and hyperparameters, we varied only the feature extraction and fusion strategies, considering the following four variants:
Table 3. Ablation Study Results on the Hockey Fight Dataset.
Table 4. Ablation Study Results on the Movies Fight Dataset.
Table 5. Ablation Study Results on the RWF-2000 Dataset.
(1) CNN_Only: prediction using only CNN features; (2) GNN_Only: prediction using only GNN structural information; (3) CNN+GNN (Concat): concatenation of CNN and GNN features followed by the classification head; and (4) CNN+GNN (Attn): cross-modal/cross-domain fusion using an attention mechanism (our full method).
All experiments are compared in terms of Accuracy, F1-Score, and AUC, in order to validate the stability and generalization ability of each component under different data distributions and task scenarios.
From the ablation results above, it can be observed that the overall performance of IDG-ViolenceNet relies on the complementary characteristics of the CNN and GNN branches. Using only the CNN branch (CNN_Only) achieves relatively high Accuracy and F1-Score across all three datasets, indicating that the 3D-CNN is highly effective in capturing short-term local motion features. However, the performance of GNN_Only is generally lower than that of CNN_Only, with particularly significant gaps on the Hockey Fight and Movies Fight datasets. This reflects the limited discriminative power of relying solely on structured interaction features when fine-grained visual information is absent. The concatenation of features from both branches (CNN+GNN Concat) yields performance improvements on certain datasets, validating the complementarity between local visual features and global interaction features. With the further introduction of attention-based weighted fusion (CNN+GNN Attn), the model achieves the best results in terms of Accuracy, F1-Score, and AUC. Notably, it attains 100% Accuracy and F1-Score on the Movies Fight dataset, demonstrating that the attention mechanism can adaptively assign importance between the two modalities depending on the scene, thereby maximizing fusion effectiveness. Overall, the ablation study results strongly validate the critical roles of identity-aware graph construction, dual-branch feature extraction, and the attention-based fusion module in enhancing both the accuracy and robustness of violence detection under complex scenarios.

4.5. Limitations and Future Work

The limitations of this study are as follows: Our current pipeline does not yet support reliable edge-level attribution, and the available datasets lack edge-level ground truth; therefore, the present visualizations are intended to illustrate graph construction and edge-weight distributions rather than single-edge causality. In addition, we have not yet built our own multi-scene violence video dataset, which constrains extrapolation to complex real-world settings. In future work, we will make these limitations explicit and systematically incorporate edge-level explanation methods (masking-based perturbation analyses, GNNExplainer/PGExplainer, and structural ablations on spatial vs. temporal edges), accompanied by randomized and stability checks and metrics such as deletion/insertion curves; construct small-scale edge-labeled or controllable synthetic benchmarks for calibration; and curate and release a diversified violence video dataset (covering viewpoint, illumination, occlusion, group size, and action intensity) with fixed splits and baseline code. All releases will adhere to compliance and privacy requirements, with the aim of improving the reliability of explanations, the robustness of results, and the reproducibility of the research.

5. Conclusions

The proposed IDG-ViolenceNet model introduces an identity-aware graph construction mechanism combined with a dual-branch 3D-CNN–GNN feature extraction framework, enabling collaborative modeling of multi-person spatial interactions and local spatiotemporal motion patterns in videos. Experimental results demonstrate that the method consistently outperforms mainstream approaches across multiple public datasets, particularly showing strong generalization and robustness in surveillance scenarios characterized by dense crowds, severe occlusions, and complex actions. Further ablation studies validate the critical role of cross-frame identity-preserving graph modeling and cross-modal feature fusion in improving detection accuracy. Future work will focus on two main directions: (1) further optimizing person detection and tracking algorithms to enhance identity association stability under ultra-dense crowds and extreme occlusion conditions and (2) exploring the integration of adaptive spatiotemporal attention mechanisms and multimodal information (e.g., audio signals and scene semantics) to improve adaptability to more complex forms of violent behavior and cross-domain datasets. Overall, this study not only provides a feasible paradigm for deep fusion of structured and visual features but also offers strong technical support for implementing real-time early-warning modules in intelligent security systems.

Author Contributions

Conceptualization, Q.J.; Data curation, H.H.; Formal analysis, Q.J.; Funding acquisition, H.H.; Investigation, H.H.; Methodology, H.H.; Software, Q.J.; Supervision, H.H.; Validation, Q.J.; Visualization, Q.J.; Writing—original draft, Q.J.; Writing—review & editing, Q.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Key Laboratory Project of Enterprise Informatization and IoT Measurement and Control Technology for Universities in Sichuan Province (NO: 2024WYJ06), Central Guidance for Local Science and Technology Development Fund Projects (NO: 2024ZYD0266), Tibet Science and Technology Program (NO: XZ202401YD0023).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The datasets used in this study are publicly available. The RWF-2000 dataset can be accessed at https://paperswithcode.com/dataset/rwf-2000 (accessed on 2 March 2024), the Hockey Fight dataset at https://academictorrents.com/details/38d9ed996a5a75a039b84cf8a137be794e7cee89 (accessed on 28 May 2024), and the Movies Fight dataset at https://www.kaggle.com/datasets/naveenk903/movies-fight-detection-dataset (accessed on 28 May 2024). All datasets were used in accordance with their respective licenses.

Acknowledgments

The authors would like to thank the A6-505 Laboratory at Sichuan University of Light Chemical Industry for their technical support provided during this study.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Pandey, B.; Sinha, U.; Nagwanshi, K.K. A multi-stream framework using spatial–temporal collaboration learning networks for violence and non-violence classification in complex video environments. Int. J. Mach. Learn. Cybern. 2025, 16, 4737–4766. [Google Scholar] [CrossRef]
  2. Kim, H.; Lee, B.S.; Shin, W.-Y.; Lim, S. Graph anomaly detection with graph neural networks: Current status and challenges. IEEE Access 2022, 10, 111820–111829. [Google Scholar] [CrossRef]
  3. Ahmad, T.; Jin, L.; Zhang, X.; Lai, S.; Tang, G.; Lin, L. Graph convolutional neural network for human action recognition: A comprehensive survey. IEEE Trans. Artif. Intell. 2021, 2, 128–145. [Google Scholar] [CrossRef]
  4. Patel, D.; Sarlati, S.; Martin-Tuite, P.; Feler, J.; Chehab, L.; Texada, M.; Marquez, R.; Orellana, F.J.; Henderson, T.L.; Nwabuo, A.; et al. Designing an information and communications technology tool with and for victims of violence and their case managers in San Francisco: Human-centered design study. JMIR mHealth uHealth 2020, 8, e15866. [Google Scholar] [CrossRef]
  5. Wang, N.; Zhu, G.; Zhang, L.; Shen, P.; Li, H.; Hua, C. Spatio-temporal interaction graph parsing networks for human-object interaction recognition. In Proceedings of the 29th ACM International Conference on Multimedia, Chengdu, China, 20–24 October 2021; pp. 4985–4993. [Google Scholar]
  6. Huang, H.; Zhou, L.; Zhang, W.; Corso, J.J.; Xu, C. Dynamic graph modules for modeling object-object interactions in activity recognition. arXiv 2018, arXiv:1812.05637. [Google Scholar]
  7. Yun, H.; Ahn, J.; Kim, M.; Kim, E.-S. Compositional video understanding with spatiotemporal structure-based transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 18751–18760. [Google Scholar]
  8. Ullah, F.U.M.; Ullah, A.; Muhammad, K.; Haq, I.U.; Baik, S.W. Violence detection using spatiotemporal features with 3D convolutional neural network. Sensors 2019, 19, 2472. [Google Scholar] [CrossRef]
  9. Abdali, A.-M.R.; Al-Tuma, R.F. Robust real-time violence detection in video using cnn and lstm. In Proceedings of the 2019 2nd Scientific Conference of Computer Sciences (SCCS), Baghdad, Iraq, 27–28 March 2019; IEEE: New York, NY, USA, 2019; pp. 104–108. [Google Scholar]
  10. Patel, M. Real-time violence detection using CNN-LSTM. arXiv 2021, arXiv:2107.07578. [Google Scholar] [CrossRef]
  11. Khan, M.; Gueaieb, W.; Elsaddik, A.; De Masi, G.; Karray, F. Graph-based knowledge driven approach for violence detection. IEEE Consum. Electron. Mag. 2024, 14, 77–85. [Google Scholar] [CrossRef]
  12. Lai, Z.; Liang, G.; Zhou, J.; Kong, H.; Lu, Y. A joint learning framework for optimal feature extraction and multi-class SVM. Inf. Sci. 2024, 671, 120656. [Google Scholar] [CrossRef]
  13. Salman, H.A.; Kalakech, A.; Steiti, A. Random forest algorithm overview. Babylon. J. Mach. Learn. 2024, 2024, 69–79. [Google Scholar] [CrossRef]
  14. Schapire, R.E. Explaining adaboost. In Empirical Inference: Festschrift in Honor of Vladimir N. Vapnik; Springer: Berlin/Heidelberg, Germany, 2013; pp. 37–52. [Google Scholar]
  15. Hassner, T.; Itcher, Y.; Kliper-Gross, O. Violent flows: Real-time detection of violent crowd behavior. In Proceedings of the 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, Providence, RI, USA, 16–21 June 2012; IEEE: New York, NY, USA, 2012; pp. 1–6. [Google Scholar]
  16. Deniz, O.; Serrano, I.; Bueno, G.; Kim, T.-K. Fast violence detection in video. In Proceedings of the 2014 International Conference on Computer Vision Theory and Applications (VISAPP), Lisbon, Portugal, 5–8 January 2014; IEEE: New York, NY, USA, 2014; Volume 2, pp. 478–485. [Google Scholar]
  17. Li, J.; Jiang, X.; Sun, T.; Xu, K. Efficient violence detection using 3d convolutional neural networks. In Proceedings of the 2019 16th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Madrid, Spain, 29 November 2019; IEEE: New York, NY, USA, 2022; pp. 1–8. [Google Scholar]
  18. Ghosh, D.K.; Chakrabarty, A. Two-stream multi-dimensional convolutional network for real-time violence detection. arXiv 2022, arXiv:2211.04255. [Google Scholar]
  19. Negre, P.; Alonso, R.S.; González-Briones, A.; Prieto, J.; Rodríguez-González, S. Literature Review of Deep-Learning-based detection of violence in video. Sensors 2024, 24, 4016. [Google Scholar] [CrossRef]
  20. Kavathia, A.; Sayer, S. Optimizing Violence Detection in Video Classification Accuracy through 3D Convolutional Neural Networks. arXiv 2024, arXiv:2411.01348. [Google Scholar] [CrossRef]
  21. Scarselli, F.; Gori, M.; Tsoi, A.C.; Hagenbuchner, M.; Monfardini, G. The graph neural network model. IEEE Trans. Neural Netw. 2008, 20, 61–80. [Google Scholar] [CrossRef]
  22. Yang, H.; Ren, Z.; Yuan, H.; Wei, W.; Zhang, Q.; Zhang, Z. Multi-scale and attention enhanced graph convolution network for skeleton-based violence action recognition. Front. Neurorobotics 2022, 16, 1091361. [Google Scholar] [CrossRef]
  23. Tian, J.; Li, D. Video Violence Detection Method Based on Multi-Feature and Graph Convolutional Network. In The International Conference on 3D Imaging Technologies; Springer: Singapore, 2023; pp. 167–177. [Google Scholar]
  24. Lu, X.; Chen, Y.; Chen, Y.; Gao, X.; Yang, T.; Chen, G. STIG-Net: A spatial–temporal interactive graph framework for recognizing violent behaviors in videos. Vis. Comput. 2025, 41, 7447–7458. [Google Scholar] [CrossRef]
  25. He, L.-H.; Zhou, Y.-Z.; Liu, L.; Cao, W.; Ma, J.-H. Research on object detection and recognition in remote sensing images based on YOLOv11. Sci. Rep. 2025, 15, 14032. [Google Scholar] [CrossRef]
  26. Byeon, Y.-H.; Kim, D.; Lee, J.; Kwak, K.-C. Body and hand–object ROI-based behavior recognition using deep learning. Sensors 2021, 21, 1838. [Google Scholar] [CrossRef]
  27. Yang, Y.; Zou, D.; He, X. Graph neural network-based node deployment for throughput enhancement. IEEE Trans. Neural Netw. Learn. Syst. 2023, 35, 14810–14824. [Google Scholar] [CrossRef]
  28. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  29. Giuliari, F.; Skenderi, G.; Cristani, M.; Del Bue, A. Spatial commonsense graph for object localisation in partial scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 19518–19527. [Google Scholar]
  30. Nievas, E.B.; Suarez, O.D.; Garcia, G.B.; Sukthankar, R. Hockey fight detection dataset. In Computer Analysis of Images and Patterns; Springer: Berlin/Heidelberg, Germany, 2011; pp. 332–339. [Google Scholar]
  31. Cheng, M.; Cai, K.; Li, M. RWF-2000: An open large scale video database for violence detection. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; IEEE: New York, NY, USA, 2021; pp. 4183–4190. [Google Scholar]
  32. Zolfaghari, M.; Singh, K.; Brox, T. Eco: Efficient convolutional network for online video understanding. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 695–712. [Google Scholar]
  33. Wang, Y.; Sun, Y.; Liu, Z.; Sarma, S.E.; Bronstein, M.M.; Solomon, J.M. Dynamic graph cnn for learning on point clouds. ACM Trans. Graph. 2019, 38, 146. [Google Scholar] [CrossRef]
  34. Zhang, Y.; Li, Y.; Guo, S. Lightweight mobile network for real-time violence recognition. PLoS ONE 2022, 17, e0276939. [Google Scholar] [CrossRef] [PubMed]
  35. Mohammadi, H.; Nazerfard, E.; Firoozi, T. Reinforcement Learning-based Mixture of Vision Transformers for Video Violence Recognition. arXiv 2023, arXiv:2310.03108. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
