Early Detection of Aggressive Human Behavior in Video Streams Using Deep Spatiotemporal Models

Issembayeva, Aida; Shaushenova, Anargul; Nurpeisova, Ardak; Ispussinov, Aidar; Suleimenova, Buldyryk; Bekenova, Anargul; Satybaldieva, Aliya; Zholmukhanova, Aigul; Mauina, Galiya

doi:10.3390/computers15050267

Open Access Article

by Aida Issembayeva ¹, Anargul Shaushenova ¹,*, Ardak Nurpeisova ¹,*, Aidar Ispussinov ², Buldyryk Suleimenova ³, Anargul Bekenova ⁴, Aliya Satybaldieva ⁵, Aigul Zholmukhanova ¹ and Galiya Mauina

¹ Institute of Business and Digital Technologies, Saken Seifullin Kazakh Agrotechnical University, Astana 010011, Kazakhstan

² Administration, Astana IT University, Astana 010000, Kazakhstan

³ Department of Smart Technologies, Faculty of Computer Science and Artificial Intelligence, Yessenov University, Aktau 130000, Kazakhstan

⁴ Institute of Digital Economy and Sustainable Development, Zhangir Khan University, Uralsk 090009, Kazakhstan

⁵ Department of Physics and Computer Science, Faculty of Natural Sciences, M. Kh. Dulaty Taraz University, Taraz 080000, Kazakhstan

* Authors to whom correspondence should be addressed.

Computers 2026, 15(5), 267; https://doi.org/10.3390/computers15050267

Submission received: 19 March 2026 / Revised: 13 April 2026 / Accepted: 20 April 2026 / Published: 23 April 2026

(This article belongs to the Special Issue Deep Learning and Explainable Artificial Intelligence (2nd Edition))

Abstract

In this paper, we propose a spatiotemporal approach for binary classification of violent and non-violent behavior in real-world settings. The experimental pipeline includes video preprocessing, stratified data splitting, generation of temporally structured clips, and comparative evaluation of baseline models, including a convolutional neural network. We also developed the Residual Adaptive Motion Temporal Binary Heat Network (RAMT-BinaryHeatNet) model, which combines frame color characteristics, residual motion descriptions, temporal feature fusion, an early risk assessment mechanism, and interpretable localization maps. Experiments were conducted on a balanced dataset of 2000 video clips. The proposed model demonstrated the best early warning performance: at an observation ratio of 0.6, it achieved an F1 score of 0.9527 and a balanced accuracy of 0.9533. With full observation, the F1 score was 0.9342, and the area under the receiver operating characteristic curve (AUC) was 0.9871. In practical terms, the proposed approach can serve as a decision support tool for the preliminary identification of potentially dangerous video fragments with subsequent manual verification; it is not intended for autonomous use in high-risk scenarios.

Keywords: video analytics; video stream analysis; aggressive behavior; spatiotemporal analysis; interpretable artificial intelligence

1. Introduction

Automatic analysis of human behavior from video streams remains one of the most sought-after tasks in modern computer vision, as it underlies intelligent video surveillance, public safety systems, transport infrastructure monitoring, and digital rapid response platforms [1,2]. In recent years, interest in this area has increased significantly due to the shift from offline analysis of short clips to practical scenarios of continuous surveillance, which require not only the recognition of events that have already occurred but also the early detection of signs of potentially dangerous behavior [3,4]. In this regard, the tasks of violence recognition, anomaly analysis, and early event prediction are increasingly considered as part of the broader problem of understanding real-time video streams, where accuracy, computational efficiency, and resilience to scene noise are critical [5,6].

Despite the progress made, the practical formulation of the problem of early detection of aggressive behavior remains challenging. In real-world video surveillance scenes, violent episodes are often characterized by low object spatial resolution, partial occlusions, background motion, variable camera angles, and blurred boundaries between normal and dangerous interactions [7,8]. Furthermore, many modern models perform well when the clip is fully available, but significantly lose stability when decisions must be made based on the first 20–60% of the video sequence [9,10]. Limited interpretability poses an additional challenge, as for applied security systems, it is important to understand which regions of the frame and which stages of the event influenced the model’s final decision [11,12]. Recent literature has demonstrated several viable approaches to addressing this problem. One of these is the development of computationally efficient architectures for real-time violence recognition, including lightweight models and adaptation schemes for pre-trained video baselines [13,14]. Another approach focuses on weakly supervised, open-vocabulary video anomaly detection, designed to handle incomplete labeling, rare events, and a broader set of potentially dangerous scenarios [15,16]. In addition, explainable video analytics is actively developing, with visual or textual explanations generated alongside detection, increasing confidence in the model and the practical suitability of the results [17,18]. Research on early action prediction deserves special mention, confirming that partial observation can be informative even at an early stage of an event’s development [19,20].

However, existing approaches often focus on either the final classification of a video clip, general anomaly detection, or interpreting decisions as a separate post-processing step [21,22]. For applied security systems, this proves insufficient, as a real-world analytical framework requires a model that simultaneously considers the visual content of a scene, interframe motion dynamics, the temporal evolution of risk, and the spatial localization of the most significant zones [23,24]. This methodological gap defines the scientific and practical motivations for this study.

This paper proposes a hybrid spatiotemporal approach, RAMT-BinaryHeatNet, which combines RGB representation, residual motion encoding, adaptive feature fusion, temporal modeling, anticipatory risk estimation, and motion-guided localization. Unlike typical baseline models, which focus primarily on a single classification output, the proposed architecture produces a multi-component output that combines binary behavior classification, early risk assessment, and interpretable spatial localization of regions associated with aggressive interactions. A distinguishing feature of the study is that the model’s performance is analyzed not only in full-observation mode but also in an early-warning scenario with various observation ratios, allowing an assessment of its suitability for proactive response [25,26].

The goal of this work is to develop and evaluate an interpretable hybrid spatiotemporal model for the early detection of aggressive human behavior in real-world surveillance video streams. To achieve this goal, a reproducible video data preparation pipeline is developed, comparative training of baseline architectures and the proposed model is conducted, recognition performance is examined under partial and full clip observation, and the model’s robustness and interpretability are analyzed. The scientific and applied significance of this work lies in the fact that the proposed approach is aimed not only at improving the final quality of binary recognition but also at developing an interpretable research framework for decision support in which early warning, transparency of assessment, and robustness to complex scenes are considered important properties for the subsequent development of video analytics systems. In this study, such properties are evaluated within a controlled experimental setup and are not interpreted as sufficient confirmation of the model’s readiness for immediate operational deployment [27,28,29].

2. Materials and Methods

The methodological framework defines the principles for developing the experimental framework, preparing video data, constructing the input spatiotemporal representation, and organizing the comparative analysis of models. The primary focus is on developing a reproducible approach to the early detection of aggressive behavior in video streams, treating the observed scene as a sequence of interrelated visual and dynamic states. The methodological framework encompasses dataset selection and validation; deterministic preprocessing of video clips; the formation of training, validation, and test sets; the configuration of comparable training conditions; and a description of the baseline and proposed architectures. This approach ensures the correctness of the experimental setup, comparability of results, and the scientific validity of the subsequent analysis of model effectiveness.

2.1. Dataset Description

The primary source of experimental data for this study was the publicly available Real Life Violence and Non-Violence Dataset (https://www.kaggle.com/datasets/karandeep98/real-life-violence-and-nonviolence-data) (accessed on 25 February 2026), hosted on the Kaggle platform and originally presented by M. Soliman et al. This dataset was chosen for several reasons. First, it is specifically designed to recognize real scenes of interpersonal aggression, rather than staged or laboratory-based actions. Second, the videos cover a variety of viewing conditions, including different backgrounds, camera angles, object distances, motion dynamics, and visual noise levels. Third, the dataset is widely used in research and engineering studies focused on automatic violence recognition from video data, making it a suitable baseline for reproducible comparative analysis. According to the official source description, the initial dataset includes 1000 videos classified as “violence” and 1000 videos classified as “non-violence,” compiled primarily from open-source YouTube videos. The “violence” class consists of real-life street fights and other conflict situations recorded in various environments and filming conditions. In contrast, the “non-violence” class reflects everyday, neutral actions without signs of physical aggression. Thus, the dataset forms a balanced binary classification problem, which is methodologically justified for building early-detection systems for potentially dangerous behavior. It should be noted that the publicly available Kaggle dataset also provides information on a dataset of images extracted from video frames. However, in this study, a working dataset in video format was used, as the goal was to model the spatiotemporal dynamics of behavior rather than to classify individual frames statically. 
This choice is crucial, as aggressive actions are determined not only by the visual content of a single image but also by the sequence of movements, changes in posture, the intensity of interaction between people, and the development of the event over time. Therefore, for a correct formulation of the video analytics problem, using video clips is more justified than working solely with extracted frames.

For the experimental design, the data were organized into two target categories: NonViolence and Violence. Before training, a mandatory dataset validation step was performed, including checking the presence of both classes, file integrity, the correctness of video sequence reading, and the extraction of service metadata for each video. For each video, the frame rate, total number of frames, duration, spatial resolution, and file size were determined. Damaged, empty, or incorrectly opened videos were excluded from further processing. This approach allowed us to produce a technically clean sample and minimize the risk of distorting results from defective source files.

Additionally, a preliminary analysis of the dataset structure was conducted, including an assessment of class balance, clip duration distribution, and resolution variability. This step was necessary not only for descriptive characterization of the sample but also to ensure that training and testing were performed on data reflecting the heterogeneity of real-world surveillance scenes. Differences in duration, object scale, and frame visual quality enhance the practical value of the experiment by bringing the evaluation conditions closer to real-world scenarios of intelligent video surveillance systems.

To construct a reproducible experimental protocol, the sample was divided into training, validation, and test sets in the ratios 70%/15%/15%, respectively. The partitioning was performed stratified by class, ensuring a balance between violence and non-violence in all subsets. The integrity of the partitioning was additionally monitored to prevent data leakage between subsets. This protocol is particularly important for behavior recognition tasks, as even partial overlap between training and test video scenes can lead to an overestimation of the model’s performance. The control mechanism used ensured the correctness of the experimental evaluation. To standardize input data and improve computational robustness, all video clips were converted into fixed-length clips during preparation. The study used short sequences of 8 frames, resampled to 96 × 96 pixels. This choice was driven by the need to achieve a reasonable compromise between preserving essential dynamic information and computational efficiency. For real-time video analytics, this format is justified, as it allows the study not only of recognition accuracy but also of the potential applicability of models in near-real-time scenarios. It is important to emphasize that the selected dataset is consistent with the logic of this study, as it addresses not the abstract problem of multi-class action recognition, but the applied binary problem of identifying aggressive and non-aggressive behavior. Thus, the dataset used forms a balanced binary experimental setup, which is convenient for controlled model comparisons during the architecture exploration phase. However, it should be emphasized that this sample structure does not represent the actual frequency of aggressive incidents in an operational environment, where the positive class is typically much less common. 
For this reason, the obtained metrics should be interpreted as characteristics of the model’s discriminatory ability within a balanced protocol, rather than as a direct assessment of its operational accuracy under rare events.
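The class-stratified 70%/15%/15% split described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `stratified_split`, the seed, and the assumption that every video is an independent source (so a clip-level shuffle cannot leak) are all hypothetical.

```python
import random
from collections import Counter


def stratified_split(labels, ratios=(0.70, 0.15, 0.15), seed=42):
    """Split item indices into train/val/test while keeping class proportions.

    `labels` is a list of class names, one per video. The seed and the
    clip-level shuffle are illustrative assumptions.
    """
    rng = random.Random(seed)
    splits = {"train": [], "val": [], "test": []}

    # Group item indices by class so each class is split separately.
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    for label, indices in sorted(by_class.items()):
        rng.shuffle(indices)
        n_train = round(len(indices) * ratios[0])
        n_val = round(len(indices) * ratios[1])
        splits["train"] += indices[:n_train]
        splits["val"] += indices[n_train:n_train + n_val]
        splits["test"] += indices[n_train + n_val:]  # remainder goes to test
    return splits


# Balanced corpus as described in the text: 1000 clips per class.
labels = ["NonViolence"] * 1000 + ["Violence"] * 1000
splits = stratified_split(labels)
counts = {name: Counter(labels[i] for i in idx) for name, idx in splits.items()}
```

With the balanced 2000-clip corpus, each class contributes 700 training, 150 validation, and 150 test clips, which matches the per-class counts reported for the split.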

2.2. Deterministic Video Data Preparation Pipeline and Experimental Design

The methodological framework of the study defines the logic of video sequence processing, the structure of the analytical framework, and the principles of formalizing the target task. The focus is on developing a holistic approach to video data analysis in which observed human behavior is considered a dynamic process unfolding over time and dependent not only on individual visual features but also on the nature of changes across a frame sequence. Therefore, the methodological component of the work focuses on integrating procedures for preparing input data, spatiotemporal scene representation, identification of informative movement patterns, and subsequent analytical description of the detected behavioral states.

Correct data analysis is particularly important in this work, as a video stream is a complex, multidimensional source of information, encompassing spatial, temporal, and contextual components. For this reason, this section considers not only the technical aspects of generating the input representation but also the general principles of ensuring the reproducibility, comparability, and applicability of the study. The methodological emphasis is on ensuring an objective examination of video scenes with varying dynamics, heterogeneous visual structures, and variable levels of expression of target behavioral manifestations. This approach provides a scientifically sound basis for constructing an analytical model in which video data is interpreted as a sequence of interconnected events reflecting the development of the observed behavioral scenario over time. Within the methodological component of the study, a strictly deterministic approach to video data preparation is particularly important, as it ensures the correctness of subsequent spatiotemporal feature analysis and eliminates the influence of uncontrollable factors on the structure of input samples. The presented scheme outlines the complete pipeline for generating experimental data before model training. It encompasses two interrelated phases: dataset governance and splitting protocol, as well as deterministic preprocessing and tensor representation assembly. Substantively, this scheme captures not the model architecture, but rather the procedural logic of data preparation, which is fundamentally important for ensuring the reproducibility and methodological transparency of the study.

Figure 1 shows the RAMT-BinaryHeatNet Data Preparation Pipeline, which reflects the logic for generating the initial video corpus, its supervised splitting, deterministic preprocessing, and subsequent tensor assembly for model training. The first stage, designated Phase I: Dataset Governance and Split Protocol, involves the formation of a binary video corpus comprising two target classes: NonViolence and Violence. As shown in the diagram, the total sample size is 2000 labeled video clips, corresponding to a binary classification problem of aggressive versus non-aggressive behavior. Next, the Audit and Metadata Extraction stage is performed, during which file integrity is verified and frame rate, duration, frame count, and spatial resolution are analyzed. Additionally, the scheme includes a motion proxy for difficulty scoring, i.e., a proxy assessment of motion intensity as an auxiliary indicator of scene complexity. Thus, Figure 1 demonstrates that even at the initial stage, not only the formal characteristics of video files are taken into account, but also differences in the dynamics of video content, making subsequent sample splitting more justified and methodologically robust. Particular attention is paid to the challenge-aware data split stage, in which a separate 10% subset of increased complexity is identified, and the remaining data is distributed among training, validation, and test sets in a 70%/10%/10% ratio. A crucial requirement is that source identities remain disjoint, meaning that derivative clips belonging to the same source do not cross boundaries between subsets. This restriction is directly aimed at preventing data leakage, as it eliminates the situation in which statistically or visually similar fragments of the same video are included in both the training and validation sets.
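The source-disjointness requirement can be verified mechanically once every derived clip carries an identifier of the video it came from. The sketch below is one possible check; the record layout `(clip_id, source_id, split_name)` and the function name `find_leaking_sources` are assumptions made for illustration.

```python
def find_leaking_sources(assignments):
    """Return sources whose derived clips appear in more than one partition.

    `assignments` is an iterable of (clip_id, source_id, split_name) tuples.
    An empty result means the partitions are source-disjoint.
    """
    splits_per_source = {}
    for _clip_id, source_id, split_name in assignments:
        splits_per_source.setdefault(source_id, set()).add(split_name)
    return {
        source: sorted(names)
        for source, names in splits_per_source.items()
        if len(names) > 1
    }


# Hypothetical assignments: two clips from vid_a, one or two from vid_b.
clean = [
    ("a_0", "vid_a", "train"),
    ("a_1", "vid_a", "train"),
    ("b_0", "vid_b", "test"),
]
leaky = clean + [("b_1", "vid_b", "train")]

clean_report = find_leaking_sources(clean)
leaky_report = find_leaking_sources(leaky)
```

Running such a check after every re-split is a cheap safeguard, since a single shared source is enough to inflate test metrics.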

In the second phase, designated Phase II. Deterministic Preprocessing and Tensor Assembly, the scheme describes a sequence of transformations that convert the source video material into a standardized tensor representation. First, temporal clips are constructed by uniformly sampling 12 RGB frames, with each frame resampled to 112 × 112 pixels. For the training subset, only supervised augmentations, such as flip, roll, and photometric scaling, are allowed, while the validation and test data are processed using a fixed protocol. Next, normalization and motion encoding are performed: the RGB data is scaled to the [0, 1] range, after which normalization is applied according to the ImageNet mean/std scheme, and the interframe differences are formed into a separate motion tensor. The final step generates a CUDA-ready model input, including a T × 3 × H × W RGB tensor and a (T − 1) × 3 × H × W motion tensor, after which the mini-batches are fed into the training computational circuit.
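The Phase II transformations can be illustrated with a short NumPy sketch. It assumes frames are already decoded and resized, and that interframe differences are taken on the [0, 1]-scaled frames before ImageNet normalization; the text does not fix that ordering, so treat it as an assumption. Clip length and frame size are parameters, and the helper names are hypothetical.

```python
import numpy as np

# Standard ImageNet channel statistics, shaped for broadcasting over (T, 3, H, W).
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32).reshape(1, 3, 1, 1)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32).reshape(1, 3, 1, 1)


def uniform_indices(num_frames, clip_len):
    """Pick `clip_len` frame indices spread evenly over the whole video."""
    return np.linspace(0, num_frames - 1, clip_len).round().astype(int)


def build_clip(video_uint8, clip_len=12):
    """Turn decoded frames (N, H, W, 3) uint8 into RGB and motion tensors.

    Returns rgb of shape (T, 3, H, W) and motion of shape (T - 1, 3, H, W).
    """
    idx = uniform_indices(video_uint8.shape[0], clip_len)
    frames = video_uint8[idx].astype(np.float32) / 255.0   # scale to [0, 1]
    frames = frames.transpose(0, 3, 1, 2)                  # (T, 3, H, W)
    motion = frames[1:] - frames[:-1]                      # interframe differences
    rgb = (frames - IMAGENET_MEAN) / IMAGENET_STD          # ImageNet normalization
    return rgb, motion


# Synthetic stand-in for a decoded 90-frame video at 112 x 112.
video = np.random.default_rng(0).integers(0, 256, size=(90, 112, 112, 3), dtype=np.uint8)
rgb, motion = build_clip(video, clip_len=12)
```

Because no random operation is involved, the same video always yields the same tensors, which is the property the fixed evaluation protocol relies on.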

The explanatory blocks below reinforce the diagram’s methodological significance. The Split Integrity block establishes a rule that preserves all derivative clips from a single source within a single partition. The Training-time Augmentation block specifies that stochastic transformations are permitted only during training. The Deterministic Evaluation Policy block emphasizes that fixed preprocessing conditions are used for the validation, test, and challenge subsets, including uniform resizing, uniform sampling, ImageNet normalization, and residual motion tensor generation. The diagram thus demonstrates that the study utilizes a rigorous, reproducible, and methodically controlled data preparation pipeline, focused on minimizing leakage, standardizing the input representation, and ensuring the correctness of subsequent experimental evaluation.

Figure 2 shows the distribution of video clips across the target classes NonViolence and Violence, used in the study to solve the problem of binary classification of non-aggressive and aggressive human behavior. As shown in the diagram, each of the two categories contains 1000 video clips, yielding a strictly balanced sample with no quantitative bias toward one class over the other. This dataset structure is of significant methodological importance because it eliminates the influence of class imbalance on model training and testing results. With both categories equally represented, a more accurate assessment of classification quality is achieved, and the resulting metrics become more transparent, comparable, and reproducible.

Furthermore, the shown distribution confirms that the problem is formulated as a clear binary classification aimed at distinguishing between two contrasting behavioral states. This formulation is justified for intelligent video surveillance systems, where the key goal is to separate potentially dangerous actions from a normal video stream promptly. Therefore, the presented figure not only describes the data structure but also confirms the methodological validity of the original experimental base. Figure 3 shows the distribution of video clips by the NonViolence and Violence classes across the training, validation, and test sets. As shown in the diagram, the partitioning structure is completely symmetrical for both classes. The training set includes 700 video clips from each class, while the validation and test sets each contain 150 video clips from the NonViolence and Violence classes. Thus, the overall split ratio is 70% for training, 15% for validation, and 15% for testing, with class balance maintained at each stage. From a methodological perspective, this distribution is crucial for the validity of the experimental protocol. Firstly, maintaining an equal number of examples in each class prevents the model from being biased toward one category or another during both the training stage and subsequent quality assessment. Secondly, the equal proportion of classes in the training, validation, and testing sections ensures a stratified split, ensuring that training and quality control conditions remain comparable. This is especially important for binary classification problems involving aggressive and non-aggressive behavior, where even a slight skew in the distribution can distort precision, recall, and F1 scores. Furthermore, the presented scheme demonstrates the high reproducibility of the experimental design. 
The model is trained on a sufficiently representative subset of the sample, and the validation and testing subsets maintain both quantitative and substantive balance. As a result, Figure 3 confirms that the used data partitioning protocol is statistically stable and provides a fair basis for an objective comparison of models.

This result indicates the absence of class imbalance not only in the original corpus but also after the final formation of the training, validation, and test sets. This is particularly important from a statistical validity perspective, as each subset of the data contains an identical proportion of aggressive and non-aggressive video scenes, meaning that differences between the subsets are not due to a quantitative bias in the classes. The training subset remains the largest, containing 1400 video clips, while the validation and test sets are the same size, each containing 300 video clips. Taken together, these values demonstrate that the final dataset structure after partitioning retained high internal consistency, and all three subsets remained comparable in class composition. Consequently, the data distribution can be considered quantitatively homogeneous and statistically aligned at all stages of the experimental protocol.

Thus, the presented methodological framework forms a coherent and reproducible research foundation, combining procedures for selecting and validating video data, deterministic preprocessing, standardization of the input spatiotemporal representation, a unified training protocol, and a comparable description of the baseline and the proposed architecture. The adopted experimental design supports valid model comparisons, limits the influence of uncontrolled factors on the results, and allows differences in recognition quality to be interpreted as consequences of architectural features rather than differences in training conditions. This creates a methodologically sound basis for subsequent analysis of the models’ effectiveness, the robustness of their training, and their ability to detect aggressive behavior early in video streams.

3. Results

This study’s results are based on a unified experimental protocol and aim to comprehensively evaluate models for early detection of aggressive behavior in video streams. The analysis includes both a quantitative comparison of baseline architectures and the proposed RAMT-BinaryHeatNet model, as well as an examination of their performance characteristics across varying levels of video clip completeness. Along with traditional binary classification metrics, early warning parameters, decision stability, and the interpretability of the generated spatiotemporal representations are considered, enabling a comprehensive assessment of the proposed approach’s effectiveness.

3.1. Comparative Analysis of Models

In the experiments, a single, fully implemented set of models was used for training and subsequent comparison. This set included compact baseline architectures, more modern official video models, and the authors’ proposed RAMT-BinaryHeatNet configuration. It enables a fair comparison of several classes of solutions: 2D-temporal approaches, classical 3D convolutional networks, factorized spatiotemporal architectures, transformer video models, and a hybrid risk-based model with localization. All models were trained with a single training loop in the notebook, meaning they shared the same input clip length, optimization scheme, validation thresholding algorithm, and multi-task loss function.

The lightweight baseline CNN+BiLSTM was implemented as a frame-by-frame RGB encoder based on MobileNetV3-Small, followed by a bidirectional LSTM. In this configuration, the hidden representation size is set to 128, and the LSTM uses bidirectional processing with an internal dimension of 64 per direction, resulting in a 128-dimensional temporal description of the sequence. The classification layer includes a two-class linear layer, and an additional risk branch generates a single logit for anticipatory risk estimation. For this model, the MobileNet backbone was used in frozen mode, and the actual runtime profile reported 173,699 trainable parameters.

The group of official video baselines included the MC3-18, R3D-18, R(2+1)D-18, Swin3D-T, and Swin3D-S models, loaded from torchvision. Official pretrained weights were used for these architectures, and the backbone was frozen whenever those weights were available. As a result, training was performed primarily on the new output layers: dropout, a two-class linear classifier, and a separate risk head. For the 3D convolutional architectures, the dropout value was set to 0.15, and the output classification and risk branches were built on top of the final backbone feature vector. In the executed version of the notebook, MC3-18, R3D-18, and R(2+1)D-18 each had 1539 trainable parameters, whereas Swin3D-T and Swin3D-S each had 2307. This confirms that the comparison was conducted in supervised head-layer adaptation mode rather than in full fine-tuning mode.

The MViTv2-S model was tested during development but was not included in the final executable version of the notebook because its positional structure is incompatible with the compact protocol of 8 frames at 96 × 96 pixels. In the final benchmark suite, it was therefore replaced with Swin3D-T and Swin3D-S, which are also considered strong official video baselines and work correctly under the adopted input clip configuration.
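The reported head sizes can be reproduced with simple arithmetic. The sketch below assumes that the trainable part consists of exactly two linear layers with biases (a two-class classifier and a one-logit risk head), and that the final feature vector has 512 dimensions for the 18-layer 3D convolutional backbones and 768 for Swin3D-T and Swin3D-S. The helper names are hypothetical.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


def head_params(feature_dim, num_classes=2):
    """Trainable parameters of a classifier head plus a single-logit risk head."""
    return linear_params(feature_dim, num_classes) + linear_params(feature_dim, 1)


conv3d_heads = head_params(512)  # MC3-18, R3D-18, R(2+1)D-18
swin3d_heads = head_params(768)  # Swin3D-T, Swin3D-S
print(conv3d_heads, swin3d_heads)  # prints: 1539 2307
```

Under these assumptions the counts match the reported 1539 and 2307, which is consistent with only the output layers being trained.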

The proposed RAMT-BinaryHeatNet model was implemented as a hybrid risk-aware architecture. It is based on MobileNetV3-Small as an RGB encoder, but unlike CNN+BiLSTM, it unfreezes the last three backbone blocks for training and increases the feature space size to 160. The model implements two spatial heads, an attention head and a localization head, both implemented as 1 × 1 convolutions. For motion analysis, a separate MotionEncoder is added, comprising a sequence of convolutional blocks (3 → 24 → 48 → 160), followed by a temporal 1D depthwise–pointwise module. Spatiotemporal fusion of RGB and motion features is performed via a fusion gate, which accepts the concatenation of the RGB vector, the motion vector, and their absolute difference. Next, a cascade of three TemporalConvBlocks is used, followed by MultiheadAttention with 5 heads and a dropout of 0.12. The final decision is built not only from the classification logits but also from the risk head, with the trainable coefficient risk_scale initialized to 0.35. The proposed model had 1,203,791 trainable parameters in the completed notebook, the largest trainable parameter count among the compared models.

Table 1 presents the common training hyperparameters and elements of the computational protocol used for all compared models. To ensure a fair comparison of the architectures, training was performed in a single experimental configuration with a fixed clip length of 8 frames and an input image size of 96 × 96 pixels. All experiments used 15 training epochs, a base batch size of 16, the AdamW optimizer with an initial learning rate of 3 × 10⁻⁴, and a weight decay of 1 × 10⁻⁴. To stabilize training, gradient clipping at 2.0, mixed precision based on torch.autocast and GradScaler, and label smoothing with a value of 0.02 were used. Learning rate adaptation was implemented using the CosineAnnealingLR scheduler with T_max = 15. Additionally, a single decision threshold selection protocol was used for all models: the threshold was chosen on the validation set from 61 candidates in the range from 0.20 to 0.80, and early recognition was analyzed at observation ratios of 0.20, 0.40, 0.60, 0.80, and 1.00. This unification of experimental conditions ensures comparability of results and makes the comparative analysis of models methodologically correct.
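A grid of 61 candidates between 0.20 and 0.80 corresponds to a step of 0.01. The sketch below shows one way such a validation-based threshold search could look. The selection criterion (F1 on the validation set, first maximum wins) is an assumption, since the text does not name the criterion, and all function names are hypothetical.

```python
def f1_score(y_true, y_pred):
    """F1 for the positive class (label 1) from two equal-length label lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def candidate_thresholds(low=0.20, high=0.80, count=61):
    """Evenly spaced thresholds; 61 points over [0.20, 0.80] gives a 0.01 step."""
    step = (high - low) / (count - 1)
    return [round(low + i * step, 4) for i in range(count)]


def select_threshold(y_true, probs, thresholds):
    """Return the first threshold with the highest validation F1 (assumed criterion)."""
    best_t, best_f1 = thresholds[0], -1.0
    for t in thresholds:
        preds = [1 if p >= t else 0 for p in probs]
        score = f1_score(y_true, preds)
        if score > best_f1:
            best_t, best_f1 = t, score
    return best_t, best_f1


# Toy validation scores (hypothetical): probability of the Violence class.
y_val = [0, 0, 0, 1, 1, 1]
p_val = [0.10, 0.30, 0.55, 0.45, 0.70, 0.90]
thresholds = candidate_thresholds()
best_t, best_f1 = select_threshold(y_val, p_val, thresholds)
```

The chosen threshold is then frozen and applied unchanged to the test set, so that threshold tuning never sees test labels.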

For training, all compared models were governed by a single multi-task loss function. The base classification component was calculated using the cross-entropy function, while the risk branch was trained using binary cross-entropy with logits, with a weight of 0.40. For the RAMT-BinaryHeatNet model, additional specialized loss components were activated, including the decision loss (0.30), the consistency loss (0.10), the localization binary cross-entropy (0.12), and the localization alignment and localization sparsity regularizers (0.015 and 0.003, respectively). As a result, the proposed model was optimized not only for the final binary classification criterion but also for the consistency of the risk assessment, the final decision, and spatial localization.

Table 2 presents the architectural configurations and model hyperparameters used in the comparative experiment. The study includes compact baseline architectures, official modern video models, and the authors’ proposed RAMT-BinaryHeatNet configuration. This set supports a sound comparison of several classes of solutions, including two-dimensional temporal approaches, classical three-dimensional convolutional networks, factorized spatiotemporal architectures, transformer video models, and a hybrid risk-based model with localization. For all models, the table lists the architectural basis, key configuration, dropout value, effective batch size, and number of trainable parameters, allowing them to be compared not only by qualitative results but also by computational and parametric complexity.
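The weighted multi-task loss can be written as a plain weighted sum of its components. In the sketch below the weights are those quoted in the text, while the unit weight on the base cross-entropy term, the dictionary keys, and the function names are assumptions. A numerically stable form of binary cross-entropy with logits is included for the risk branch.

```python
import math

# Weights quoted in the text; the 1.00 on the base term is an assumption.
LOSS_WEIGHTS = {
    "classification": 1.00,
    "risk": 0.40,
    "decision": 0.30,
    "consistency": 0.10,
    "localization_bce": 0.12,
    "localization_alignment": 0.015,
    "localization_sparsity": 0.003,
}

# Terms active for the baselines; the proposed model uses all seven.
BASELINE_TERMS = ("classification", "risk")


def bce_with_logits(logit, target):
    """Binary cross-entropy on a raw logit, stable for large |logit|."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def total_loss(components, active_terms=None):
    """Weighted sum of the per-term loss values listed in `active_terms`."""
    active = LOSS_WEIGHTS.keys() if active_terms is None else active_terms
    return sum(LOSS_WEIGHTS[name] * components[name] for name in active)


# With every component equal to 1.0, the totals reduce to the sums of weights.
ones = {name: 1.0 for name in LOSS_WEIGHTS}
baseline_total = total_loss(ones, BASELINE_TERMS)  # 1.00 + 0.40
proposed_total = total_loss(ones)                  # sum of all seven weights
risk_term = bce_with_logits(0.0, 1.0)              # ln 2 for an uninformative logit
```

The small weights on the alignment and sparsity regularizers mean they shape the localization map without dominating the classification objective.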

As shown in Table 2, the CNN+BiLSTM model is implemented as a lightweight baseline with a frame-by-frame RGB encoder based on MobileNetV3-Small followed by a bidirectional long short-term memory network. The baseline video models include the MC3-18, R3D-18, R(2+1)D-18, Swin3D-T, and Swin3D-S architectures, loaded with official pretrained weights and used in supervised output layer adaptation mode. The proposed RAMT-BinaryHeatNet model has the most complex configuration, combining an RGB encoder, a separate motion encoder, a spatiotemporal feature fusion mechanism, attention and localization heads, and a multi-head attention module. Table 2 thus reflects the structural differences between the compared models and serves as the basis for the subsequent analysis of their performance in the early detection of aggressive behavior. Each architectural class occupies a specific comparative role: CNN+BiLSTM provides a lightweight 2D temporal baseline; MC3-18, R3D-18, and R(2+1)D-18 represent classic 3D solutions; Swin3D-T and Swin3D-S represent modern official Transformer video models; and RAMT-BinaryHeatNet represents the authors’ hybrid configuration with motion feature integration, learnable localization, and risk-based decision-making. This setup makes the architectural comparison methodologically transparent and technically reproducible.

The proposed RAMT-BinaryHeatNet architecture (Figure 4) is a specialized hybrid spatiotemporal model designed for binary analysis of video sequences. It generates three interrelated outputs: classification logits, an anticipatory risk score, and a spatial localization map. Unlike standard video models, which focus primarily on a single classification output, the scheme is built from the outset as a multi-component analytical pipeline in which the decision is formed by jointly considering RGB features, interframe motion information, and an internal risk score. Structurally, the scheme is divided into four logical stages: input representation, RGB and localization analysis, motion processing and feature fusion, and temporal heads with final decision making.

At Stage I (Inputs), the model accepts two matched inputs. The first is an RGB Clip Input of dimension B × T × 3 × H × W, where the implemented protocol uses T = 8 frames normalized by ImageNet statistics. The second is a Motion Residual Input of dimension B × (T − 1) × 3 × H × W, calculated as the difference between adjacent RGB frames. The main difference between this architecture and many baseline video models is already apparent at the input level: it analyzes not only the visual content of the scene but also an explicitly specified motion component, which is especially important for tasks in which aggressive behavior is determined by the dynamics of interactions rather than by static spatial features alone. The scheme also fixes the base tensor sizes: T = 8, H = W = 96, and a feature width after encoding of d = 160.
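The two inputs of Stage I can be illustrated with a small sketch that builds motion residuals from a clip and reports the resulting tensor shapes. Frames are represented here as flat lists of floats for brevity, and the direction of the difference (next frame minus previous frame) is an assumption, since the text says only that the residual is the difference between adjacent frames.

```python
def motion_residuals(clip):
    """clip: list of T frames, each a flat list of pixel values.

    Returns T - 1 residual frames, each the elementwise difference
    between a frame and the one before it (assumed direction).
    """
    return [
        [curr - prev for curr, prev in zip(clip[t + 1], clip[t])]
        for t in range(len(clip) - 1)
    ]


def input_shapes(batch, frames=8, height=96, width=96):
    """Shapes of the RGB clip input and the motion residual input."""
    rgb = (batch, frames, 3, height, width)
    motion = (batch, frames - 1, 3, height, width)
    return rgb, motion


# A toy clip of three two-pixel frames (hypothetical values).
toy_clip = [[0.0, 1.0], [0.5, 1.0], [0.5, 0.0]]
print(motion_residuals(toy_clip))
print(input_shapes(batch=16))
```

A static scene produces residuals near zero, so the motion branch receives signal mainly where something in the frame changes.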

At the second stage, Stage II (RGB and Localization), features are extracted from the RGB stream using the MobileNetV3-Small backbone with its last blocks trainable, followed by a 1 × 1 Conv2d projection from 576 to 160 channels. Two specialized heads are then formed. The first, the Attention Head, constructs an attention map using a 160 → 1 convolution with sigmoid activation and implements attention-weighted pooling. The second, Motion-Guided Localization, is also built on a 160 → 1 convolution but is additionally modulated by the motion map, resulting in a localization heatmap normalized over the spatial coordinates. As a result, the RGB branch goes beyond simple feature averaging: it simultaneously extracts informative regions and produces spatially consistent localization. This distinguishes it from conventional convolutional and transformer baseline architectures, where spatial interpretation is either absent or not directly integrated into the main computational graph.

At Stage III (Motion and Fusion), the diagram shows a separate motion processing pipeline. The Spatial Motion Encoder sequentially transforms the motion tensor through blocks 3 → 24 → 48 → 160, combining regular and depthwise–pointwise convolutions with BatchNorm2d and GELU. The resulting sequence is then fed to the Temporal Motion Encoder, where depthwise and pointwise 1D convolutions with normalization and nonlinearity are applied along the time axis. The Motion Projection block then maps the motion features to the same 160-dimensional feature space as the RGB branch. The central element of this stage is Risk-Aware Gated Fusion, which combines the RGB and motion features along with their absolute difference. The diagram shows that the final fused representation is computed by a trainable sigmoid gate, enabling adaptive weighting of visual content and motion dynamics. This block is one of the key advantages of the model: instead of rigid feature summation, it uses adaptive gated fusion that is sensitive to differences between static content and scene kinematics.
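The gating idea at Stage III can be sketched for a single time step. The gate input follows the description above (the RGB vector, the motion vector, and their absolute difference, concatenated). The blending rule, an elementwise convex combination g · rgb + (1 − g) · motion, and the single linear layer producing the gate are assumptions made for illustration; the text does not give the exact formula.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(rgb, motion, gate_weights, gate_bias):
    """Fuse two d-dimensional vectors with an elementwise sigmoid gate.

    gate_weights: d rows of 3 * d weights; gate_bias: d values.
    Both stand in for trainable parameters (hypothetical layout).
    """
    diff = [abs(r - m) for r, m in zip(rgb, motion)]
    gate_input = rgb + motion + diff  # list concatenation, length 3 * d
    fused = []
    for i, (r, m) in enumerate(zip(rgb, motion)):
        pre = sum(w * x for w, x in zip(gate_weights[i], gate_input)) + gate_bias[i]
        g = sigmoid(pre)
        fused.append(g * r + (1.0 - g) * m)  # assumed convex blend
    return fused


rgb_vec = [1.0, 2.0]
motion_vec = [3.0, 0.0]
zero_weights = [[0.0] * 6 for _ in range(2)]

# A neutral gate (0.5) averages the two streams.
print(gated_fusion(rgb_vec, motion_vec, zero_weights, [0.0, 0.0]))
# A strongly positive bias pushes the gate toward the RGB stream.
print(gated_fusion(rgb_vec, motion_vec, zero_weights, [50.0, 50.0]))
```

Unlike fixed summation, the gate can differ per feature and per clip, which is what allows the weighting between appearance and motion to adapt to the content.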

At the final stage, Stage IV (Temporal Heads and Decision), processing is performed in several sequential steps. First, the fused sequence passes through LayerNorm and three TemporalConvBlock modules, which implement deterministic temporal modeling. Next, MultiHeadAttention with five heads is applied, after which a separate temporal score head generates a temporal importance distribution using softmax. The next block, Temporal Pooling, combines weighted mean pooling and temporal max pooling into a final 320-dimensional descriptor (2 × 160). This descriptor feeds two parallel heads: the Anticipatory Risk Head, which produces a risk score, and the Classification Head, which generates binary class logits. The final block, Decision Fusion and Reported Outputs, implements the final decision rule: the margin between the class logits is augmented by an additional contribution from the risk branch, scaled by the risk_scale parameter, to form the decision logit. Thus, the architecture not only classifies an event but also introduces risk-based decision correction, making it conceptually distinct from baseline models, where the final prediction is built solely from a single classification head.
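The pooling and decision steps of Stage IV can be sketched as follows. The pooling function concatenates a softmax-weighted mean with a temporal maximum, which turns d = 160 features into a 320-dimensional descriptor. The decision rule shown, margin plus risk_scale times the risk logit, is one plausible reading of the description and is an assumption, as is treating index 1 as the Violence class.

```python
import math


def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_pooling(sequence, scores):
    """sequence: T vectors of size d; scores: T raw importance scores.

    Returns a descriptor of size 2 * d: weighted mean followed by temporal max.
    """
    weights = softmax(scores)
    dim = len(sequence[0])
    weighted_mean = [
        sum(w * step[j] for w, step in zip(weights, sequence)) for j in range(dim)
    ]
    temporal_max = [max(step[j] for step in sequence) for j in range(dim)]
    return weighted_mean + temporal_max


def decision_logit(class_logits, risk_logit, risk_scale=0.35):
    """Assumed rule: class margin plus a scaled contribution of the risk branch."""
    margin = class_logits[1] - class_logits[0]
    return margin + risk_scale * risk_logit


toy_sequence = [[1.0, 0.0], [3.0, 4.0]]
print(temporal_pooling(toy_sequence, [0.0, 0.0]))
print(len(temporal_pooling([[0.0] * 160 for _ in range(8)], [0.0] * 8)))
print(decision_logit([0.2, 1.2], risk_logit=2.0))
```

Under this reading, a positive risk logit raises the decision logit even when the class margin is small, which is how the risk branch can shift a borderline prediction toward the Violence class.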

From a scientific and methodological perspective, the model qualifies as a distinct proposed architecture because it combines, in a single differentiable pipeline, several components that are typically absent from standard video networks: explicit residual motion input, motion-guided localization, gated RGB and motion fusion, a separate anticipatory risk head, and final decision fusion involving risk. Its advantage lies not simply in its complexity but in its more targeted adaptation to the task of early detection of aggressive behavior. Unlike standard 3D-CNN and transformer models, which are primarily optimized for final clip classification, RAMT-BinaryHeatNet is designed to simultaneously identify spatially significant regions, account for the temporal evolution of motion, and correct binary decisions based on an internal risk assessment. For this reason, the diagram depicts not another variation of a standard backbone but an independent model for applied use, focused on interpretable and risk-sensitive video analysis.

Table 3 presents the profiles of the compared models in terms of the number of trainable parameters and the latency of processing a single video clip in the implemented experimental configuration. This analysis allows the architectures to be compared not only by recognition quality but also by computational feasibility, which is especially important for the early detection of aggressive behavior in near-real-time conditions. It should be emphasized that the presented values reflect the actual laptop profile, i.e., they correspond to the configuration in which the models were run in the experimental pipeline. For this reason, the trainable-parameter column should be interpreted as the number of parameters updated during training in the current configuration, rather than the full parametric capacity of each architecture. This is especially important for the official video baselines, in which the backbone was frozen and optimization was performed primarily on the output classification and risk-oriented heads.

The data show that the models differ not only in architectural type but also in computational profile. Swin3D-S has the lowest latency in this implementation, 21.5001 ms/clip, followed by Swin3D-T (58.5502 ms/clip) and R3D-18 (74.5958 ms/clip). The proposed RAMT-BinaryHeatNet model has a latency of 85.7341 ms/clip, which remains within a practically acceptable range even though the model implements a significantly more complex internal analysis loop that includes motion processing, localization, and risk-based decision fusion. The slowest model in this configuration is CNN+BiLSTM, at 3419.6635 ms/clip, indicating significantly lower computational efficiency per clip. In terms of trainable parameters, RAMT-BinaryHeatNet has the largest count, 1,203,791, consistent with its extended hybrid structure. For CNN+BiLSTM, the trainable portion is considerably smaller, at 173,699 parameters. The MC3-18, R3D-18, and R(2+1)D-18 models share the same number of trainable parameters, 1539, while Swin3D-T and Swin3D-S each have 2307, reflecting the partial adaptation mode with a frozen backbone. Thus, RAMT-BinaryHeatNet occupies an intermediate position in speed but has by far the largest trainable portion, whereas the official baseline architectures in this protocol serve as lightweight adapted comparisons with a minimal number of updated parameters. Reporting both quantities makes the comparison transparent and supports a correct interpretation of the differences in the subsequent analysis.
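The small trainable counts of the frozen baselines can be checked with simple arithmetic. In torchvision, the final feature width is 512 for the MC3-18, R3D-18, and R(2+1)D-18 models and 768 for Swin3D-T and Swin3D-S. The sketch below assumes that the only trainable part is one linear layer with three outputs (two class logits and one risk logit); that head layout is an inference from the reported numbers, not something the text states.

```python
def linear_layer_params(in_features, out_features):
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features


R3D_FAMILY_WIDTH = 512  # final feature width of torchvision MC3-18, R3D-18, R(2+1)D-18
SWIN3D_WIDTH = 768      # final feature width of torchvision Swin3D-T and Swin3D-S
HEAD_OUTPUTS = 3        # assumption: 2 class logits + 1 risk logit

reported = {"r3d_family": 1539, "swin3d": 2307}
computed = {
    "r3d_family": linear_layer_params(R3D_FAMILY_WIDTH, HEAD_OUTPUTS),
    "swin3d": linear_layer_params(SWIN3D_WIDTH, HEAD_OUTPUTS),
}
print(computed == reported)  # True
```

Both reported counts match this layout exactly, which is consistent with the statement that the baselines were trained only at the output classification and risk-oriented heads.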

Table 4 demonstrates that recognition quality depends significantly not only on the model architecture but also on the fraction of the observed video clip. The data show that, for most models, increasing the observation ratio from 0.2 to 1.0 is generally associated with increases in F1, Balanced Accuracy, and ROC-AUC; however, the magnitude of this increase is uneven. CNN+BiLSTM demonstrates the weakest early stability: at an observation ratio of 0.2000, F1 = 0.8224; with full observation, it increases to 0.8932. For R3D-18, the dynamics also remain moderate: F1 varies from 0.8315 to 0.8822, indicating a relatively limited ability to confidently recognize an event in the early stages of its development. At the same time, some models demonstrate greater suitability for early analysis. Thus, MC3-18, already with an observation ratio of 0.2000, achieves F1 = 0.8997 and ROC-AUC = 0.9544, and Swin3D-T, with an observation ratio of 0.4000, shows F1 = 0.9333 and Balanced Accuracy = 0.9333. The most stable official comparative solution for the set of values is Swin3D-S, which, with full observation, achieves F1 = 0.9320, Balanced Accuracy = 0.9300, and ROC-AUC = 0.9866, maintaining high indicators in the intermediate parts of the clip.
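How an observation ratio might translate into frames can be shown with a short sketch. It assumes that the ratio is applied to the 8-frame clip by keeping the leading frames and rounding up; the text does not describe the truncation rule, so both the rounding and the function names are hypothetical.

```python
import math

OBSERVATION_RATIOS = (0.20, 0.40, 0.60, 0.80, 1.00)


def observed_frame_count(total_frames, ratio):
    # Assumption: round up, and always keep at least one frame.
    # round() removes floating-point noise such as 8 * 0.6 = 4.800000000000001.
    return max(1, math.ceil(round(total_frames * ratio, 6)))


def truncate_clip(frames, ratio):
    """Keep only the leading part of a clip that has been observed so far."""
    return frames[:observed_frame_count(len(frames), ratio)]


clip = list(range(8))  # stand-in for 8 frames
counts = [len(truncate_clip(clip, r)) for r in OBSERVATION_RATIOS]
print(counts)  # [2, 4, 5, 7, 8]
```

Under this assumed rule, the lowest ratio leaves only two frames and therefore a single motion residual, which gives one possible reason why models that rely on temporal context could benefit from larger observation ratios.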

The most important result in the table is that RAMT-BinaryHeatNet performs best in the early warning mode at an observation ratio of 0.6000, achieving F1 = 0.9527 and Balanced Accuracy = 0.9533. That is, after observing 60% of the clip, the model achieves higher recognition quality than all other compared solutions. With the full clip observed, it also maintains strong performance: F1 = 0.9342, Balanced Accuracy = 0.9333, and ROC-AUC = 0.9871, the highest ROC-AUC among all the models in the table. The results thus show that the proposed architecture has high discriminative performance under both partial and full observation of video clips within the experimental protocol. However, these results should be interpreted as confirmation of the model’s research potential for early detection of suspicious video segments, rather than as sufficient grounds for its standalone use in scenarios where a prediction directly entails disciplinary, legal, or other significant consequences.

A key aspect of correctly interpreting the results is the stability of the training process during the final optimization stage. Table 5 summarizes this with the mean values and standard deviations of Validation F1 and Validation ROC-AUC over the last five epochs, together with the Final Train-Val F1 Gap, which is the difference between the final F1 values on the training and validation sets. These values make it possible to assess not only the final performance level but also how much each model fluctuates during the final training phase, and whether there are signs of overfitting or, conversely, conservative model behavior.

The data show that RAMT-BinaryHeatNet has the highest mean validation quality: Validation F1 mean = 0.9446 and Validation ROC-AUC mean = 0.9804. The standard deviations remain low (0.0043 and 0.0013, respectively), indicating fairly stable behavior in the final epochs. The Final Train-Val F1 Gap of 0.0443 is positive but not large enough to indicate critical overfitting; rather, it reflects a closer fit to the training set while strong generalization on validation is maintained. Among the official baseline models, Swin3D-S and Swin3D-T show the smoothest behavior. For Swin3D-S, the Validation F1 standard deviation is 0.0000, while for Swin3D-T it is 0.0013, indicating almost constant validation dynamics in the final epochs. Swin3D-T has a minimal positive gap of 0.0087 and Swin3D-S a gap of 0.0132, which can be interpreted as the most balanced relationship between training and validation performance. In turn, the MC3-18, R(2+1)D-18, and R3D-18 models have negative train-val gap values, meaning that validation F1 exceeds training F1 at the final epoch; this indicates no signs of overfitting and a less tight fit to the training set. R3D-18 shows the weakest performance, with the lowest mean Validation F1 (0.8737) and Validation ROC-AUC (0.9328) and the largest F1 standard deviation (0.0050). Thus, the table confirms that RAMT-BinaryHeatNet achieves the best overall validation performance with controlled training stability, while Swin3D-S and Swin3D-T are the smoothest and most stable official benchmarks.
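The stability indicators of Table 5 can be reproduced from per-epoch histories along the following lines. The sketch assumes the population standard deviation and defines the gap as training F1 minus validation F1 at the last epoch, which matches the sign convention in the text (positive when the training score is higher). The histories below are invented for illustration and are not the paper's training logs.

```python
from statistics import mean, pstdev


def final_phase_summary(train_f1, val_f1, last_k=5):
    """Mean and std of validation F1 over the last k epochs, plus the final gap."""
    tail = val_f1[-last_k:]
    return {
        "val_f1_mean": mean(tail),
        "val_f1_std": pstdev(tail),  # assumption: population std, not sample std
        "final_gap": train_f1[-1] - val_f1[-1],
    }


# Hypothetical 7-epoch histories, for illustration only.
toy_train_f1 = [0.80, 0.88, 0.92, 0.95, 0.96, 0.97, 0.98]
toy_val_f1 = [0.85, 0.90, 0.93, 0.94, 0.95, 0.94, 0.94]

summary = final_phase_summary(toy_train_f1, toy_val_f1)
print({key: round(value, 4) for key, value in summary.items()})
```

A negative gap from this function corresponds to the case described for MC3-18, R(2+1)D-18, and R3D-18, where the validation score ends above the training score.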

The loss dynamics across epochs show that all the models examined follow a similar general pattern of error decay in the early stages of training. However, the nature of the subsequent stabilization, the depth of the train-loss reduction, and the relationship between the training and validation curves differ significantly. The most pronounced reduction in training error is observed for RAMT-BinaryHeatNet: train-loss decreases from approximately 0.75–0.80 in the first epoch to around 0.18–0.20 by the final epochs. The validation loss for this model also decreases over time, but after a rapid initial drop it stabilizes around 0.52–0.57, forming a noticeable but manageable gap between the train and validation curves. This profile corresponds to intensive fitting of the training set combined with a stable validation trajectory that shows no sharp degradation in the final epochs.

Swin3D-S and Swin3D-T exhibit smoother, more consistent convergence. For Swin3D-S, both curves decrease gradually and converge by the end of training: the train loss is approximately 0.31–0.33, and the validation loss is approximately 0.37–0.38. A similar pattern is observed for Swin3D-T, with values of approximately 0.29–0.31 on the training set and 0.33–0.34 on the validation set. These two models have the smallest distance between the curves, consistent with the stability indicators reported above and indicating balanced convergence without significant overfitting (Figure 5).

For the MC3-18 and R(2+1)D-18 models, the training curves also have a stable downward profile, but the resulting errors remain higher than those of the transformer architectures. For MC3-18, the train-loss at the end of training is approximately 0.42–0.44, while the validation-loss is around 0.36–0.38. For R(2+1)D-18, both curves converge in the range of approximately 0.43–0.45, with a minimal gap between them. This configuration indicates a smooth but more conservative convergence, in which the model does not show a sharp reduction in training error but maintains close values on validation.

CNN+BiLSTM and R3D-18 stand out in particular. For CNN+BiLSTM, the train-loss decreases to around 0.21–0.24, while the validation-loss remains higher, at approximately 0.40–0.45, creating one of the most noticeable gaps between the curves among baseline models. For R3D-18, both trajectories decline significantly more slowly: by the final epochs, train-loss and validation-loss are approximately 0.48–0.50, and their proximity does not indicate high optimization but rather limited depth of convergence. Taken together, the presented curves show that RAMT-BinaryHeatNet achieves the deepest reduction in training error, Swin3D-S and Swin3D-T demonstrate the smoothest and most balanced stabilization regime, while R3D-18, MC3-18, and R(2+1)D-18 are characterized by a more moderate optimization rate and a higher final loss function level.

Figure 6 shows the F1 score dynamics across epochs. All the models studied reach relatively high values early in training. Still, the growth rate, level of stabilization, and nature of the divergence between the training and validation curves differ significantly. RAMT-BinaryHeatNet demonstrates the highest final profile. For this model, the training F1 score rapidly increases from approximately 0.80 in the first epoch to values around 0.98–0.99 in the final epoch, while the validation F1 score stabilizes in the range of 0.94–0.95. This indicates the highest absolute quality among the presented solutions, while maintaining a stable validation trajectory with no signs of sharp deterioration in the final epochs.

Among the official baseline models, the smoothest and strongest curves are observed for Swin3D-S and Swin3D-T. For Swin3D-S, the training F1 reaches approximately 0.93–0.94, while the validation F1 stays at or slightly above 0.92 with minimal fluctuations. For Swin3D-T, the dynamics are similar: the training F1 increases to 0.92–0.93, while the validation F1 gradually reaches a comparable level. These two models exhibit the smallest gap between the training and validation values, consistent with their high stability during training.

MC3-18 and R(2+1)D-18 follow a more moderate trajectory. For MC3-18, the validation F1 remains in the range of 0.90–0.92 throughout most of training, while the training curve stays lower and ultimately reaches approximately 0.87–0.89. A similar effect is observed for R(2+1)D-18, where the validation F1 stabilizes around 0.90–0.91 and the training F1 fluctuates mainly in the range of 0.86–0.89. This configuration indicates smooth convergence without overfitting, but with a somewhat shallower fit to the training set.

CNN+BiLSTM and R3D-18 deserve special mention. For CNN+BiLSTM, the training F1 reaches approximately 0.95, while the validation F1 remains closer to 0.89–0.90, forming a noticeable positive gap. For R3D-18, both curves lie below those of the other models: the training F1 finishes around 0.86–0.87, and the validation F1 is close to 0.87–0.88.

Taken together, the results show that RAMT-BinaryHeatNet achieves the highest final F1 score, Swin3D-S and Swin3D-T demonstrate the most balanced stabilization, and R3D-18 remains the weakest in absolute values of this metric.

Figure 7 shows how the validation ROC-AUC changes across epochs. All models achieve a fairly high level of class discrimination early in training, but the rate at which they reach a plateau and the final metric value differ significantly. RAMT-BinaryHeatNet has the strongest trajectory. By the fourth epoch, its ROC-AUC rises to approximately 0.979, and it remains in a narrow range of 0.978–0.982 until the end of training. This indicates that the model quickly reaches a high level of discriminative ability and then stabilizes without significant drops. Similar but slightly lower results are observed for Swin3D-T and MC3-18: by the final epochs, Swin3D-T remains stable at approximately 0.977–0.978, while MC3-18 stays in the range of approximately 0.969–0.970. Swin3D-S also shows a stable upward trend, with the metric increasing from approximately 0.949 in the first epoch to 0.966–0.967 in the final epoch. Both transformer models therefore provide stable and competitive class ranking quality but fall short of the proposed architecture in peak performance. CNN+BiLSTM and R(2+1)D-18 exhibit more moderate trajectories. For CNN+BiLSTM, ROC-AUC increases from approximately 0.935 to 0.966–0.967, with the curve becoming almost horizontal after the middle of training. For R(2+1)D-18, the initial value is close to 0.907; the metric then gradually increases and stabilizes at approximately 0.956–0.958. R3D-18 maintains the lowest trajectory, starting at around 0.871 and reaching only 0.932–0.934 by the final epochs. Overall, the graph confirms that RAMT-BinaryHeatNet provides the highest and most consistent ROC-AUC level, while Swin3D-T and Swin3D-S are the closest strong comparative solutions, and R3D-18 remains the least effective model for this metric.

Test accuracy results show that, on the final independent sample, all models considered achieve a high level of correct binary classification, although a clear gradation in absolute quality remains. RAMT-BinaryHeatNet achieves the highest accuracy, 0.933, the best final result among the compared architectures. The closest model is Swin3D-S with an accuracy of 0.930; the gap between these two solutions is only 0.003, which places both at the highest quality level within this experiment (Figure 8). The next group consists of R(2+1)D-18 and MC3-18, with 0.913 and 0.910, respectively. Their results are at or above 0.91 but fall 0.020 and 0.023 short of RAMT-BinaryHeatNet. Swin3D-T ranks slightly lower, with a final accuracy of 0.903, a high but weaker level of final recognition than Swin3D-S. Thus, among the strong baseline architectures, Swin3D-S is the most competitive on the test set, while Swin3D-T, MC3-18, and R(2+1)D-18 achieve an intermediate level of performance.

The lowest test accuracies are observed for CNN+BiLSTM and R3D-18, at 0.890 and 0.883, respectively. The gap between the most and least accurate models in this comparison is therefore 0.050, or 5 percentage points. Even the minimum result remains relatively high for the binary problem formulation, confirming that the entire set of solutions studied is viable. Taken together, the data show that RAMT-BinaryHeatNet provides the best overall generalization on the test set, Swin3D-S is the closest official comparison, and the remaining architectures have consistently lower accuracy. In terms of overall test accuracy, the proposed model therefore ranks first among all the options considered.

The results shown in Figure 9 for the F1 score on the test set confirm the general hierarchy of models previously observed for other integrated quality metrics. However, in this case, the emphasis shifts to the trade-off between precision and recall in the binary classification of aggressive and non-aggressive video scenes. RAMT-BinaryHeatNet demonstrates the highest value, with a Test F1 score of 0.934. This means that the proposed model provides the best balance between correctly identifying the positive class and minimizing miss and false-positive errors in the final test score. The closest comparable solution is Swin3D-S with a score of 0.932, and the difference between the two leading models is only 0.002, indicating a virtually identical level of final prediction consistency. The next group consists of R(2+1)D-18 and MC3-18, which achieve F1 scores of 0.917 and 0.911, respectively. Their values exceed 0.91 but are 0.017 and 0.023 lower than the leader’s. This indicates that these architectures retain a fairly strong ability to discriminate between classes, but remain below the two best models in terms of overall decision balance. Swin3D-T achieves F1 scores of 0.908, occupying an intermediate position between the more powerful Swin3D-S and classic 3D baselines. Thus, among the official comparative models, Swin3D-S demonstrates the best overall F1 score. The lowest scores are recorded for CNN+BiLSTM and R3D-18, which achieved F1 scores of 0.893 and 0.882, respectively. The gap between the best and worst-performing models is 0.052, or 5.2 percentage points. Even the minimum value remains relatively high for a practical binary setting, confirming the overall validity of the model series studied. Taken together, the presented results show that RAMT-BinaryHeatNet ranks first in the final test F1 score, providing the most balanced recognition on an independent sample, while Swin3D-S is the closest and strongest official baseline analog.

The results of the F1-score dependence on the observed video clip fraction are shown in Figure 10. The models differ significantly in their ability to recognize an aggressive event at its early stages. RAMT-BinaryHeatNet demonstrates the most pronounced and effective trajectory. Even at an observation ratio of 0.4, the F1-score reaches 0.9110, and at 0.6, the maximum result of 0.9527 is observed, which is the highest point among all the presented curves. After this, only a slight decrease is observed to 0.9459 at 0.8 and to 0.9342 with full observation, indicating very high efficiency of the model, particularly in the predictive recognition mode when only a portion of the video sequence is still available. Swin3D-S shows a strong but less pronounced trend. Its F1-score increases from 0.8723 at 0.2 to 0.9320 at 1.0, with the increase almost being monotonic. This indicates a steady accumulation of discriminatory information as the observed fragment increases. In contrast, Swin3D-T exhibits a more uneven trajectory: a high level of 0.9333 is reached at 0.4, but then the values decrease to 0.8968 at 0.6, after which they partially recover. This curve shape indicates good sensitivity to early fragments, but less stable dynamics as the time window expands. MC3-18 is characterized by a strong early start (0.8997 at 0.2) followed by an oscillatory plateau in the range of 0.8961–0.9195. R(2+1)D-18, in contrast, shows a smoother and more consistent increase in quality: from 0.8561 at 0.2 to 0.9189 at 0.8, after which it maintains a similar level. CNN+BiLSTM and R3D-18 form the bottom group: the former model significantly improves from 0.8224 to 0.8932, while the latter remains the weakest in absolute terms, reaching only 0.8822 with full observation. Taken together, the presented curves confirm that RAMT-BinaryHeatNet is the best model for the early warning scenario, while Swin3D-S is the most stable official benchmark as the proportion of observed clips increases.

The results of the ROC-AUC dependence on the observed video clip ratio (Figure 11) show that all models maintain a fairly high ability to rank classes even at early stages of observation, but differ in the rate of quality improvement and the level of final stabilization. RAMT-BinaryHeatNet demonstrates the strongest trajectory. Even at an observation ratio of 0.2, the model shows an ROC-AUC of 0.9630, and by 0.6, it reaches 0.9836, one of the highest values on the entire graph. Subsequently, the metric remains consistently high: 0.9803 at 0.8 and 0.9871 at 1.0. This dynamic indicates high stability in class separation even with partial event observation and confirms the proposed model’s ability to extract informative features before the end of the video scene. Swin3D-S demonstrates comparable results, with ROC-AUC increasing from 0.9534 at 0.2 to 0.9866 at 1.0. Unlike RAMT-BinaryHeatNet, the increase here is smoother and almost monotonic. Swin3D-T shows a pronounced early rise: 0.9812 is reached at 0.4, but then the values fluctuate slightly and culminate at 0.9716. This indicates that the model is highly sensitive to early segments but less stable than the two leading solutions. MC3-18 and R(2+1)D-18 exhibit similar but somewhat more moderate trajectories. MC3-18 increases from 0.9544 to 0.9784, while R(2+1)D-18 increases from 0.9543 to 0.9702. CNN+BiLSTM demonstrates the most significant improvement relative to the starting point: from 0.9091 at 0.2 to 0.9577 at 1.0, indicating a significant dependence on the completeness of the observed clip. The lowest trajectory is maintained by R3D-18, with values ranging from 0.9412 to 0.9576 and remaining below those of the other models over almost the entire interval. 
Taken together, the presented results confirm that RAMT-BinaryHeatNet and Swin3D-S achieve the highest level of performance in terms of ROC-AUC in the early warning mode, with the proposed model providing the strongest combination of early discrimination and final robustness.

To comprehensively evaluate the contribution of each architectural component, an ablation study was conducted, with the results summarized in Table 6. Starting with the full model (A0), key modules were progressively removed or simplified to analyze their individual impact on classification performance, localization quality, and computational efficiency. All reported values are presented as the mean ± standard deviation over three independent runs (

n

= 3) with different random seeds (42, 73, and 101) under a clip observation setting of 0.60. Removing the motion branch (A1) results in a noticeable decrease across all metrics: F1 drops from 0.952 to 0.926, and mIoU decreases from 0.604 to 0.403, underscoring the importance of motion information for both classification and localization. Similarly, turning off motion-guided localization (A2) reduces localization quality, confirming the importance of explicit motion cues for region-level prediction. Replacing supervised fusion with simple concatenation (A3) results in decreased performance across all evaluation metrics, demonstrating the importance of adaptive weighting for effective multimodal integration. A similar degradation is observed when removing the absolute difference term between the RGB and motion features (A8), suggesting that this operation captures informative temporal discrepancies between the two modalities. It is also shown that temporal modeling components are crucial. Removing TemporalConvBlocks (A4) or Multi-Head Attention (A5) results in a consistent performance degradation, with the largest drop observed after removing the attention mechanism (F1 = 0.887), highlighting its importance for modeling long-term temporal dependencies. The importance of multi-task learning is particularly evident in A7. When only the classification loss function is retained, and the localization/coherence targets are removed, localization performance drops sharply to mIoU = 0.138. Removing the risk analysis module and decision fusion module (A6) has a relatively small impact on classification performance. However, a slight decrease in overall robustness is still observed, while localization remains virtually unchanged. 
Finally, removing the fusion modules (A9–A10) shows that the combination of weighted average pooling and max pooling is more effective than either strategy alone, suggesting that both global contextual information and meaningful activations contribute to the final prediction.
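
To make the pooling comparison in A9–A10 concrete, the following minimal Python sketch combines attention-weighted average pooling with max pooling over the temporal axis. It is an illustration only: the function names, the softmax weighting of per-frame scores, and the concatenation of the two pooled vectors are assumptions, since the text does not state how the two pooled representations are merged in RAMT-BinaryHeatNet.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of per-frame scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_avg_pool(frames, scores):
    """Average frame features with softmax weights (global context)."""
    weights = softmax(scores)
    dim = len(frames[0])
    return [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]

def max_pool(frames):
    """Keep the strongest activation per feature across frames."""
    dim = len(frames[0])
    return [max(f[d] for f in frames) for d in range(dim)]

def combined_pool(frames, scores):
    """Assumed merge rule: concatenate the two pooled vectors."""
    return weighted_avg_pool(frames, scores) + max_pool(frames)

# Toy input: 3 frames with 2 features each, uniform attention scores.
frames = [[0.0, 1.0], [2.0, 1.0], [4.0, 1.0]]
scores = [0.0, 0.0, 0.0]
pooled = combined_pool(frames, scores)
print(pooled)
```

In this toy case the weighted average summarizes the whole clip, while the max-pooled half preserves the single strongest frame response, which is the intuition behind using both.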

The results of the normalized confusion matrix (Figure 12) for RAMT-BinaryHeatNet indicate that the proposed model achieves high, well-balanced binary recognition performance on an independent test set. The matrix’s main diagonal contains the largest values: for the NonViolence class, the proportion of correct predictions is 0.92, while for the Violence class, it reaches 0.95. This means that the model correctly identifies 92% of non-aggressive scenes and 95% of aggressive scenes, confirming high sensitivity to the target dangerous class while maintaining consistent performance for the neutral class. The structure of the off-diagonal elements is particularly important. The proportion of non-aggressive video clips incorrectly classified as violent is 0.08, while the proportion of aggressive scenes incorrectly classified as non-violent is 0.05. Consequently, within the test protocol, the model produces fewer false negatives for the Violence class than false positives for the Non-Violence class. From a practical perspective, the results demonstrate that the developed approach can be used as a component of a preliminary video analytics filtering system to identify fragments requiring further operator analysis. However, in its current form, the work does not consider the model a standalone tool for making final decisions, as its operational use in environments with rare incidents requires additional quantitative evaluation of false positives, threshold calibration, prevalence sensitivity analysis, and formalized human accountability procedures.
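
The relationship between the normalized confusion matrix and the summary metrics can be checked with a short calculation. The Python sketch below derives balanced accuracy, precision, and F1 for the Violence class from the two diagonal rates quoted above. It assumes equal numbers of NonViolence and Violence clips in the test set, which is plausible for a balanced, stratified split but is not stated for the test subset itself; the function name is hypothetical.

```python
def metrics_from_rates(tnr, tpr):
    """Derive metrics from a row-normalized 2x2 confusion matrix.

    tnr: share of NonViolence clips predicted correctly.
    tpr: share of Violence clips predicted correctly.
    Precision and F1 are valid only under the equal-class-size assumption.
    """
    fpr = 1.0 - tnr  # NonViolence clips flagged as Violence
    fnr = 1.0 - tpr  # Violence clips missed
    balanced_accuracy = (tnr + tpr) / 2.0
    precision = tpr / (tpr + fpr)
    f1 = 2.0 * tpr / (2.0 * tpr + fpr + fnr)
    return {
        "balanced_accuracy": balanced_accuracy,
        "precision": precision,
        "f1": f1,
    }

m = metrics_from_rates(tnr=0.92, tpr=0.95)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

Under this assumption the rounded matrix entries give a balanced accuracy of 0.935 and an F1 of about 0.936, close to the full-observation values reported in Table 4 (0.9333 and 0.9342); the small gap is expected because the matrix entries are rounded to two decimals.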

A comparison of the two matrix rows shows that the difference in class recognition performance remains small: 0.95 − 0.92 = 0.03. This indicates no pronounced bias toward either category under the test protocol. At the same time, the slight advantage for the Violence class is consistent with the proposed model’s intended application, which is to reliably identify potentially dangerous behavior. The confusion matrix therefore shows that RAMT-BinaryHeatNet achieves high recognition accuracy, a low rate of missed aggression, and an acceptable false-positive rate for the task at hand. Overall, the comparative analysis demonstrates that the proposed RAMT-BinaryHeatNet model occupies the strongest position across key characteristics, combining high early warning performance, the highest discriminative power, and a considerably richer analytical framework than the baseline architectures. The official pretrained video models, primarily Swin3D-S and Swin3D-T, also demonstrate high training stability and competitive values for key metrics, while the classic 3D convolutional solutions and CNN+BiLSTM serve as simpler benchmarks with varying degrees of computational and predictive efficiency. Taken together, the obtained data confirm that integrating spatial features, motion information, risk-based analysis, and localization mechanisms within RAMT-BinaryHeatNet provides the most balanced solution to the problem of early detection of aggressive behavior and justifies a transition to subsequent visual analysis of the model’s interpretability.

3.2. Visualization of Predictions and Interpretation of Decisions of the RAMT-BinaryHeatNet Model

This subsection presents a visual and qualitative analysis of the proposed RAMT-BinaryHeatNet model to support a more in-depth interpretation of the experimental data. The examples in Figure 13 show that the model generates spatially consistent and interpretable localization maps in the areas of the frame where violent interaction is visually concentrated.

All four examples display an overlaid localization map on the right. In all cases, the final Violence classification score is 0.99, demonstrating the model’s high confidence in analyzing the corresponding scenes. Activations are not distributed randomly across the frame, but are concentrated in limited zones that coincide with areas of physical contact, struggle, strikes, or forceful restraints. This is particularly important, as it indicates that the model is not relying on random background elements, but rather on structurally significant parts of the violent episode. In the first example, the most intense localization is concentrated in the lower center of the scene, where intense physical contact and the person being pressed against a surface are observed. In the second and third examples, activation is shifted toward the upper body and the collision zone between the participants, corresponding to the most dynamic and conflict-filled portion of the interaction. In the fourth example, the heatmap region is localized above the person located at the bottom of the scene, i.e., in the part of the frame where the most pronounced forceful action occurs. In all cases, background objects, road surfaces, cars, free spaces, and extraneous areas of the frame do not become dominant focal points, despite their visual presence. These qualitative results suggest that the motion-guided localization mechanism built into RAMT-BinaryHeatNet can identify not only moving areas but also semantically relevant zones of aggressive action. This distinguishes the proposed model from conventional classification architectures, in which the final decision lacks an explicit spatial interpretation. Consequently, the examples shown demonstrate that the model combines high binary classification confidence with consistent localization of contact and conflict areas, thereby enhancing the explainability and applicability of the results.

Figure 14 shows that RAMT-BinaryHeatNet can simultaneously generate a confident final decision, an anticipatory risk assessment, and a spatial localization of the most significant area of the scene. The upper-right panel shows that the model assigns high probabilities to an aggressive scenario for the video clip: Violence = 0.98 and Risk = 0.98. This indicates that, already within the analyzed time window, the system interprets the observed group dynamics as clearly dangerous, with the classification and risk outputs agreeing. The time curve at the bottom of the figure is particularly revealing. The probability of an aggressive event increases gradually rather than abruptly, starting from low values and exceeding 0.9 after approximately 3–4 s of observation. Moreover, the anticipatory risk marker appears earlier, at around 2 s, when the main curve has not yet reached its maximum.

This means that the model captures early precursors of a dangerous scenario before the primary decision has fully stabilized. The spatial heatmap is concentrated in the central part of the group of people rather than in the periphery of the frame, which supports the relevance of the localization. Figure 15 continues the previously discussed example of external validation on the same video clip, this time showing a later time segment of the event’s development. While the previous figure focused on the early stage of the escalation of the dangerous scenario, here we observe a sustained phase of aggressive interaction, as evidenced by both the spatial localization and the temporal dynamics of the probability. In the upper-right panel, the model maintains high values of Violence = 0.98 and Risk = 0.98, indicating stable predictions over a long observation period. The lower time curve is most revealing. Unlike the previous figure, this one clearly shows that the probability of aggressive behavior is not monotonic but oscillating, reflecting the changing phases of intense conflict, short-term relaxation, and subsequent re-escalation. Despite localized dips, the model repeatedly returns to high values, close to 0.9–1.0, demonstrating its ability to maintain recognition in complex, changing group dynamics. The spatial heatmap remains concentrated in the central zone of greatest physical contact, rather than dispersed along the periphery of the scene, confirming the spatial stability of localization within the same video clip.
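
The lead of the anticipatory risk marker over the main decision, described for Figure 14, can be quantified as the gap between two threshold crossings. The Python sketch below is a hedged illustration: the curves are hypothetical toy values shaped like the verbal description (not data from the paper), and the thresholds of 0.5 for the risk marker and 0.9 for the Violence probability, as well as the function names, are assumptions.

```python
def first_crossing(times, values, threshold):
    """Return the first timestamp at which values reach the threshold, else None."""
    for t, v in zip(times, values):
        if v >= threshold:
            return t
    return None

def lead_time(times, risk, violence, risk_thr=0.5, violence_thr=0.9):
    """Seconds by which the risk marker precedes the confident decision."""
    t_risk = first_crossing(times, risk, risk_thr)
    t_violence = first_crossing(times, violence, violence_thr)
    if t_risk is None or t_violence is None:
        return None
    return t_violence - t_risk

# Hypothetical curves sampled every 0.5 s (illustrative only).
times = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
risk = [0.05, 0.10, 0.20, 0.35, 0.55, 0.70, 0.85, 0.95, 0.98]
violence = [0.02, 0.05, 0.10, 0.20, 0.35, 0.55, 0.75, 0.92, 0.98]

lead = lead_time(times, risk, violence)
print(f"risk marker leads the decision by {lead} s")
```

A positive lead time is the quantity an operator-facing system would care about, because it measures how much earlier a clip could be flagged for manual review.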

Thus, the combined results confirm that the proposed RAMT-BinaryHeatNet model provides the most balanced solution to the problem of early detection of aggressive behavior in video streams, combining high binary classification performance, pronounced early warning efficiency, training stability, and interpretability of the generated spatiotemporal representations. A comparison with baseline architectures showed that, despite an overall high level of quality, it is the integration of RGB features, motion information, risk-based analysis, and a localization mechanism that enables the proposed model to achieve the best results in both key metrics and the meaningful consistency of visual activations. Overall, the results of this section demonstrate that the developed approach is not only quantitatively effective but also methodologically sound for practical use in intelligent video surveillance systems focused on the timely and interpretable detection of potentially dangerous scenarios.

4. Discussion

The results obtained allow us to consider the proposed RAMT-BinaryHeatNet model a scientific and applied solution for the early detection of aggressive, potentially dangerous human behavior in video streams in real-world surveillance settings. The key task addressed by this model is not only the post-factum classification of a completed event, but also the early detection of signs of conflict escalation when only a partial video sequence is available to the system. This problem formulation best suits the practical requirements of intelligent video surveillance, transportation security, educational infrastructure, and public monitoring systems, where it is critical not only to record aggression but to detect its development at a stage when prompt intervention is still possible.

The experimental analysis shows that the proposed architecture provides the best overall performance among the compared models across a range of key metrics. On an independent test set, RAMT-BinaryHeatNet demonstrated Accuracy = 0.933 and F1 = 0.934, and in early warning mode with full clip observation, the ROC-AUC reached 0.9871. Particularly significant is that, even in a more complex and practically important setting with partial video scene observation (i.e., an observation ratio of 0.6), the model achieved F1 = 0.9527 and Balanced Accuracy = 0.9533, demonstrating its high suitability for predictive recognition. Thus, the results confirm that the developed architecture is not limited to merely good final class separation, but also demonstrates the ability to identify dangerous scenarios before the event is fully completed. The scientific and applied novelty of this work stems primarily from the fact that the proposed model addresses the problem of aggressive behavior recognition not through conventional binary clip classification, but through a joint assessment of three interrelated aspects of a video scene: the final class, an internal anticipatory risk assessment, and the spatial localization of a significant conflict zone. Unlike standard 3D-CNN and transformer video models, which generally produce only a final classification decision, RAMT-BinaryHeatNet combines an RGB representation, an explicit residual motion input, motion-guided localization, risk-aware gated fusion, and decision-level fusion between classification and risk. This allows the model to consider not only the scene’s appearance but also the dynamics of interframe changes, and the final decision is formed based on the severity of early precursors of dangerous behavior.

This is precisely its fundamental difference from the baseline models used in this study. Comparative architectures, including CNN+BiLSTM, MC3-18, R3D-18, R(2+1)D-18, Swin3D-T, and Swin3D-S, represent important and strong classes of video models, but they are primarily focused on classifying a video clip as a whole. The proposed model, in contrast, was specifically designed for an application scenario that requires simultaneously answering three questions: whether the observed event is aggressive, when risk indicators first appear, and where in the frame the most significant conflict episode is concentrated. This integration of recognition, early threat assessment, and interpretable localization makes the model not just another variation of an existing video classifier, but a targeted analytical tool for preventive video analytics.

The interpretability of the proposed approach deserves special attention. Localization results showed that the model consistently concentrates activations in areas of direct physical contact, struggle, impact, or restraint, rather than random background objects. This was observed in both the Violence class test cases and external validation procedures, where heatmap regions consistently coincided with the zones of the most intense interaction among scene participants. Thus, the proposed architecture demonstrates not only high accuracy but also meaningful explainability of its decisions, which increases its scientific and applied value. This matters because modern requirements for intelligent security systems increasingly go beyond simple quality metrics and call for interpretable analysis mechanisms.

Another strength of the work is that the model was tested not only on the main test set but also in an external validation format on independent video clips containing various scenarios of aggression and ambiguous behavior. In these examples, RAMT-BinaryHeatNet maintained high Violence and Risk values, and the time curves demonstrated meaningfully plausible dynamics: the probability of an aggressive event increased not randomly, but in accordance with the visually observed development of the conflict, with anticipatory risk markers appearing before the final stabilization of the main decision. This suggests that the model can operate not only in a laboratory-based protocol but also in more variable outdoor settings, where background, participant numbers, camera angles, and movement patterns vary significantly.

Equally important is the computational aspect. Despite its more complex internal structure, RAMT-BinaryHeatNet maintained moderate latency of about 85.7 ms per clip (Table 3). In the same configuration it was faster than CNN+BiLSTM, R(2+1)D-18, and MC3-18, although slower than Swin3D-S, Swin3D-T, and R3D-18, while also delivering a higher final quality. This means that the model’s novelty is not limited to a simple increase in the number of parameters, but rather lies in a more efficient organization of the analytical framework, where each module addresses a specific application task: scene feature extraction, motion detection, localization, risk assessment, and final decision correction. Consequently, the proposed approach can be considered a compact applied architecture for real-world video analytics use, rather than merely an experimental design oriented toward an internal benchmark.

This evaluation was conducted within a controlled reference protocol designed to ensure reproducibility and comparability of results across models. While this protocol provides a methodologically consistent basis for evaluating the proposed architecture, it does not encompass the full diversity of real-world observation scenarios. Accordingly, these results should be read as a benchmark-level estimate of the model’s performance, and additional experiments on independent datasets are needed to strengthen conclusions regarding robustness and generalizability across different deployment settings.

For the sake of scientific completeness, several limitations of this study should be noted. The experimental evaluation is conducted within a binary classification framework. It relies on a single publicly available dataset, which ensures controlled, reproducible conditions but does not fully reflect the diversity of real-world video surveillance environments. Furthermore, some baseline architectures are evaluated with a frozen backbone, which ensures methodological consistency within a single training protocol while leaving scope for further extension through comprehensive fine-tuning. In addition, publicly available video datasets collected from open internet sources may contain a certain degree of annotation uncertainty. In violence recognition, this can arise from ambiguous interaction boundaries, incomplete contextual information, or visually subtle manifestations of aggressive behavior, which can introduce noise into labels and influence both the optimization dynamics and the interpretation of boundary predictions. Although this study uses official reference annotations to maintain comparability, this factor should be taken into account when interpreting the results. These limitations do not undermine the validity of the obtained results but rather point to areas for further work, including validation across multiple datasets, broader quantitative evaluation across heterogeneous video sources, extension to multi-class behavioral scenarios, and the development of noise-robust learning strategies. In this regard, the proposed RAMT-BinaryHeatNet model should be viewed as an interpretable decision support system for early video analysis, combining spatiotemporal classification, preliminary risk assessment, and motion-based localization, rather than as a ready-to-deploy solution for real-world security systems.

5. Conclusions

This paper examined the early detection of aggressive human behavior in real-world surveillance video streams as a relevant area of intelligent video analytics for security systems. Unlike approaches focused solely on final clip classification, the proposed RAMT-BinaryHeatNet model implements a comprehensive spatio-temporal analysis that combines RGB features, residual motion description, adaptive feature fusion, early risk assessment, and interpretable spatial localization. The proposed approach allowed us to move from a simple classification of a completed video clip to a more meaningful task of early identification of video segments with an increased probability of dangerous interaction, subject to subsequent verification by a human operator. An experimental study demonstrated that the proposed architecture provides the strongest results among all the models considered. The best performance was achieved in early warning mode with an observation ratio of 0.6, where the model demonstrated an F1 score of 0.9527 and a balanced accuracy of 0.9533. With full video clip observation, RAMT-BinaryHeatNet also maintained high performance, achieving an F1 score of 0.9342 and an ROC-AUC of 0.9871, confirming its strong discriminative ability and robustness in binary classification of aggressive and non-aggressive behavior. Additional analysis demonstrated that the model produces meaningfully interpretable localization maps and can detect early warning signs of a dangerous scenario before the main decision is fully stabilized.

From a practical standpoint, the obtained results demonstrate the potential of the developed approach for use in urban monitoring systems, transportation security, educational infrastructure, and other applications where timely threat detection, robustness to complex scene conditions, and the explainability of automated decisions are particularly important. However, the study also has certain limitations associated with the use of a single primary binary dataset, a compact input clip format, and a fixed experimental configuration. Future developments could include testing the model on more diverse multi-scenario video datasets, adapting it to real-time streaming processing, expanding the interpretation mechanism, and exploring more universal early behavior prediction schemes in open environments. Thus, the RAMT-BinaryHeatNet model can be considered a methodologically sound research approach to interpretable early detection of aggressive behavior from a video stream. The obtained results confirm the potential of the chosen architectural scheme but do not, in themselves, prove the model’s readiness for autonomous implementation in high-cost error scenarios without additional external validation and rigorous human-in-the-loop verification.

Author Contributions

Conceptualization, A.I. (Aida Issembayeva), A.S. (Anargul Shaushenova) and A.N.; methodology, A.I. (Aida Issembayeva), A.S. (Anargul Shaushenova), A.N., A.I. (Aidar Ispussinov) and G.M.; software, A.I. (Aidar Ispussinov) and B.S.; validation, A.S. (Anargul Shaushenova), A.N. and A.S. (Aliya Satybaldieva); formal analysis, B.S., A.B., A.Z. and G.M.; investigation, A.I. (Aida Issembayeva), A.S. (Anargul Shaushenova), A.N. and A.I. (Aidar Ispussinov); resources, A.N., A.S. (Anargul Shaushenova) and A.B.; data curation, A.I. (Aida Issembayeva), A.Z. and G.M.; writing—original draft preparation, A.I. (Aida Issembayeva), A.S. (Anargul Shaushenova) and A.I. (Aidar Ispussinov); writing—review and editing, A.N., A.B. and A.S. (Aliya Satybaldieva); visualization, B.S. and A.Z.; supervision, A.S. (Anargul Shaushenova) and A.N.; project administration, A.S. (Anargul Shaushenova); funding acquisition, A.S. (Aliya Satybaldieva) and A.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research has been funded by the Science Committee of the Ministry of Science and Higher Education of the Republic of Kazakhstan (Grant No. AP23486538, “Research and development of a system for recognizing images in video streams based on artificial intelligence”).

Data Availability Statement

Dataset available on request from the authors.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

CNN	Convolutional Neural Network
BiLSTM	Bidirectional Long Short-Term Memory
RGB	Red, Green, Blue
ROC-AUC	Area Under the Receiver Operating Characteristic Curve
F1-score	Harmonic Mean of Precision and Recall
RAMT-BinaryHeatNet	Residual Adaptive Motion Temporal Binary Heat Network
R3D-18	3D Residual Network, 18 layers
R(2+1)D-18	Residual Network with Spatial and Temporal Decomposition of 3D Convolutions, 18 layers
MC3-18	Mixed Convolution 3D Network, 18 layers
MViTv2-S	Multiscale Vision Transformer V2, Small

Figure 1. Data preparation scheme for RAMT-BinaryHeatNet.

Figure 2. Distribution of video clips by class.

Figure 3. Distribution of video clips by classes in the training, validation and test samples.

Figure 4. Architectural diagram of the RAMT-BinaryHeatNet model.

Figure 5. Dynamics of changes in the training and validation loss function by epoch for models.

Figure 6. Dynamics of changes in the training and validation F1-measures by epochs for models.

Figure 7. Dynamics of changes in the validation ROC-AUC by epoch for models.

Figure 8. Comparison of the final test accuracy of the models on an independent sample.

Figure 9. Comparison of the final test F1-measure of the models on an independent sample.

Figure 10. Change in the F1-measure of the models depending on the proportion of the observed video clip in the early warning mode.

Figure 11. Change in ROC-AUC of RAMT-BinaryHeatNet, Swin3D-S, R(2+1)D-18, MC3-18, Swin3D-T, CNN+BiLSTM and R3D-18 models depending on the proportion of the observed video clip in the early warning mode.

Figure 12. Normalized confusion matrix of the proposed RAMT-BinaryHeatNet model.

Figure 13. Examples of trained motion-guided localization of the proposed RAMT-BinaryHeatNet model.

Figure 14. An example of external validation of the proposed RAMT-BinaryHeatNet model with simultaneous display of the original video stream, heatmap localization, probability of the Violence class, anticipatory risk, and the temporal dynamics of the increase in the dangerous scenario.

Figure 15. External validation of RAMT-BinaryHeatNet on a video scene of aggressive interaction.

Table 1. Common training and computational protocol hyperparameters used for all models.

Parameter	Value
Clip length	8 frames
Frame size	96 × 96
Number of epochs	15
Base batch size	16
Number of workers	0
Learning rate	3 × 10⁻⁴
Weight decay	1 × 10⁻⁴
Gradient clipping	2.0
Optimizer	AdamW
LR scheduler	CosineAnnealingLR, T_max = 15
Mixed precision	torch.autocast + GradScaler
Label smoothing	0.02
Decision threshold selection	by validation, range 0.20–0.80, 61 candidates
Observation ratios for early observation	0.20, 0.40, 0.60, 0.80, 1.00
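
Two rows of Table 1 can be illustrated with a short calculation: the decision threshold grid (61 candidates between 0.20 and 0.80, i.e., a step of 0.01) and the cosine learning-rate schedule. The Python sketch below is a minimal illustration under stated assumptions: selecting the threshold by maximum validation F1 is an assumed criterion (the table only says "by validation"), the validation scores are toy values, and `cosine_lr` reproduces the standard closed-form cosine annealing curve with a minimum learning rate of 0 rather than calling any library scheduler.

```python
import math

def threshold_grid(lo=0.20, hi=0.80, n=61):
    """61 evenly spaced candidates from 0.20 to 0.80 (step 0.01)."""
    step = (hi - lo) / (n - 1)
    return [round(lo + i * step, 4) for i in range(n)]

def f1_at(threshold, probs, labels):
    """F1 of the positive class when predicting 1 for probs >= threshold."""
    tp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 0)
    fn = sum(1 for p, y in zip(probs, labels) if p < threshold and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def select_threshold(probs, labels):
    """Assumed criterion: best validation F1; ties go to the lowest threshold."""
    return max(threshold_grid(), key=lambda t: f1_at(t, probs, labels))

def cosine_lr(epoch, base_lr=3e-4, t_max=15, eta_min=0.0):
    """Closed-form cosine annealing from base_lr down to eta_min over t_max epochs."""
    return eta_min + (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max)) / 2

# Toy validation scores (hypothetical, for illustration only).
val_probs = [0.10, 0.30, 0.45, 0.60, 0.90]
val_labels = [0, 0, 1, 1, 1]

grid = threshold_grid()
best = select_threshold(val_probs, val_labels)
print(len(grid), grid[0], grid[-1], best)
print([f"{cosine_lr(e):.2e}" for e in (0, 5, 10, 15)])
```

Selecting the threshold on the validation split, as Table 1 specifies, keeps the test set untouched by any tuning decision.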

Table 2. Architectural configuration and model hyperparameters.

Model	Architectural Basis	Key Configuration	Dropout	Effective Batch Size	Trainable Parameters
CNN+BiLSTM	MobileNetV3-Small + BiLSTM	hidden_dim = 128, backbone frozen, BiLSTM bidirectional, classifier(128 → 2), risk head(128 → 1)	–	16	173,699
MC3-18	torchvision video backbone	official pretrained weights, frozen backbone, classifier + risk head	0.15	8	1539
R3D-18	torchvision video backbone	official pretrained weights, frozen backbone, classifier + risk head	0.15	8	1539
R(2+1)D-18	torchvision video backbone	official pretrained weights, frozen backbone, classifier + risk head	0.15	8	1539
Swin3D-T	torchvision transformer video backbone	official pretrained weights, frozen backbone, classifier + risk head	0.15	2	2307
Swin3D-S	torchvision transformer video backbone	official pretrained weights, frozen backbone, classifier + risk head	0.15	1	2307
RAMT-BinaryHeatNet	MobileNetV3-Small + motion encoder + temporal fusion	feature_dim = 160, 3 trainable MobileNet blocks, attention/localization heads 1 × 1, MotionEncoder 3 → 24 → 48 → 160, 3 TemporalConvBlock, MultiheadAttention num_heads = 5, risk_scale = 0.35	0.12	16	1,203,791

Table 3. Profiles of lightweight models by the number of trainable parameters and the latency of processing one video clip in the implemented experimental configuration.

Model	Trainable Parameters	Latency (ms/Clip)
CNN+BiLSTM	173,699	3419.6635
MC3-18	1539	197.2369
R3D-18	1539	74.5958
R(2+1)D-18	1539	496.5344
Swin3D-T	2307	58.5502
Swin3D-S	2307	21.5001
RAMT-BinaryHeatNet	1,203,791	85.7341

Table 4. Early-warning results of the models under partial observation of a video clip (F1, balanced accuracy, and ROC-AUC).

Model	Observation Ratio	F1	Balanced Accuracy	ROC-AUC
CNN+BiLSTM	0.2000	0.8224	0.8200	0.9091
CNN+BiLSTM	0.4000	0.8774	0.8733	0.9366
CNN+BiLSTM	0.6000	0.8675	0.8667	0.9488
CNN+BiLSTM	0.8000	0.8750	0.8733	0.9552
CNN+BiLSTM	1.0000	0.8932	0.8900	0.9577
MC3-18	0.2000	0.8997	0.9033	0.9544
MC3-18	0.4000	0.9195	0.9200	0.9716
MC3-18	0.6000	0.9049	0.9033	0.9745
MC3-18	0.8000	0.8961	0.8933	0.9736
MC3-18	1.0000	0.9115	0.9100	0.9784
R3D-18	0.2000	0.8315	0.8500	0.9412
R3D-18	0.4000	0.8652	0.8733	0.9572
R3D-18	0.6000	0.8737	0.8767	0.9580
R3D-18	0.8000	0.8690	0.8733	0.9551
R3D-18	1.0000	0.8822	0.8833	0.9576
R(2+1)D-18	0.2000	0.8561	0.8667	0.9543
R(2+1)D-18	0.4000	0.8811	0.8867	0.9693
R(2+1)D-18	0.6000	0.9078	0.9100	0.9721
R(2+1)D-18	0.8000	0.9189	0.9200	0.9756
R(2+1)D-18	1.0000	0.9167	0.9133	0.9702
Swin3D-T	0.2000	0.8850	0.8900	0.9557
Swin3D-T	0.4000	0.9333	0.9333	0.9812
Swin3D-T	0.6000	0.8968	0.8933	0.9767
Swin3D-T	0.8000	0.9226	0.9200	0.9801
Swin3D-T	1.0000	0.9079	0.9033	0.9716
Swin3D-S	0.2000	0.8723	0.8800	0.9534
Swin3D-S	0.4000	0.9116	0.9133	0.9675
Swin3D-S	0.6000	0.9097	0.9100	0.9786
Swin3D-S	0.8000	0.9272	0.9267	0.9833
Swin3D-S	1.0000	0.9320	0.9300	0.9866
RAMT-BinaryHeatNet	0.2000	0.8623	0.8733	0.9630
RAMT-BinaryHeatNet	0.4000	0.9110	0.9133	0.9704
RAMT-BinaryHeatNet	0.6000	0.9527	0.9533	0.9836
RAMT-BinaryHeatNet	0.8000	0.9459	0.9467	0.9803
RAMT-BinaryHeatNet	1.0000	0.9342	0.9333	0.9871
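
To make the relationship between the F1 and balanced accuracy columns of Table 4 concrete, the sketch below computes both from a confusion matrix. The counts are hypothetical: they assume a balanced test split of 150 clips per class, which is not stated in this section, and they were chosen only because they reproduce the RAMT-BinaryHeatNet row at an observation ratio of 0.60 to four decimals. They are not reported data.

```python
def f1_and_balanced_accuracy(tp: int, fn: int, fp: int, tn: int):
    """Return (F1 for the positive class, balanced accuracy)."""
    f1 = 2 * tp / (2 * tp + fp + fn)
    recall_pos = tp / (tp + fn)  # sensitivity on violent clips
    recall_neg = tn / (tn + fp)  # specificity on non-violent clips
    balanced_acc = (recall_pos + recall_neg) / 2
    return f1, balanced_acc


# Hypothetical counts, assuming 150 violent and 150 non-violent test clips.
f1, bal_acc = f1_and_balanced_accuracy(tp=141, fn=9, fp=5, tn=145)
print(round(f1, 4), round(bal_acc, 4))
```

On a balanced split, balanced accuracy coincides with plain accuracy, so the two columns differ only through how errors are divided between false positives and false negatives.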

Table 5. Summary analysis of the stability of model training based on the values of Validation F1, Validation ROC-AUC, and the final gap between the training and validation F1 scores.

Model	Validation F1 Mean	Validation F1 Std	Validation ROC-AUC Mean	Validation ROC-AUC Std	Final Train-Val F1 Gap
RAMT-BinaryHeatNet	0.9446	0.0043	0.9804	0.0013	0.0443
Swin3D-S	0.9191	0.0000	0.9670	0.0007	0.0132
R(2+1)D-18	0.9048	0.0015	0.9568	0.0008	−0.0260
MC3-18	0.9171	0.0031	0.9711	0.0007	−0.0450
Swin3D-T	0.9214	0.0013	0.9764	0.0001	0.0087
CNN+BiLSTM	0.8957	0.0007	0.9636	0.0020	0.0493
R3D-18	0.8737	0.0050	0.9328	0.0005	−0.0086

Table 6. Results of RAMT-BinaryHeatNet ablation analysis.

No.	Variant (disabled component)	F1	Balanced Accuracy	ROC-AUC	Localization mIoU	Δ Latency (%)
A0	Full model	0.952 ± 0.003	0.953 ± 0.004	0.984 ± 0.002	0.604 ± 0.018	0
A1	RGB-only (without motion branch)	0.926 ± 0.004	0.928 ± 0.005	0.970 ± 0.003	0.403 ± 0.027	−12%
A2	Without motion-guided localization	0.930 ± 0.005	0.931 ± 0.006	0.972 ± 0.004	0.455 ± 0.021	−3%
A3	Without gated fusion (just concat)	0.919 ± 0.006	0.921 ± 0.007	0.967 ± 0.004	0.422 ± 0.024	−2%
A4	Without TemporalConvBlocks	0.903 ± 0.007	0.906 ± 0.007	0.958 ± 0.006	0.391 ± 0.025	−4%
A5	Without Multi-Head Attention	0.887 ± 0.010	0.889 ± 0.011	0.949 ± 0.005	0.377 ± 0.019	−5%
A6	Without risk head and decision fusion	0.914 ± 0.006	0.915 ± 0.005	0.961 ± 0.004	0.601 ± 0.017	−1%
A7	Classification loss only (without localization/consistency)	0.892 ± 0.008	0.895 ± 0.009	0.947 ± 0.006	0.138 ± 0.012	−1%
A8	Without abs (RGB − Motion) in fusion	0.938 ± 0.004	0.939 ± 0.004	0.976 ± 0.003	0.486 ± 0.020	−2%
A9	Mean-pool only (without max-pool)	0.944 ± 0.004	0.945 ± 0.004	0.978 ± 0.003	0.575 ± 0.018	−2%
A10	Max-pool only (without weighted mean)	0.940 ± 0.005	0.941 ± 0.005	0.975 ± 0.004	0.561 ± 0.020	−2%
