This study’s results are based on a unified experimental protocol and aim to comprehensively evaluate models for early detection of aggressive behavior in video streams. The analysis includes both a quantitative comparison of baseline architectures and the proposed RAMT-BinaryHeatNet model, as well as an examination of their performance characteristics across varying levels of video clip completeness. Along with traditional binary classification metrics, early warning parameters, decision stability, and the interpretability of the generated spatio-temporal representations are considered, enabling a comprehensive assessment of the proposed approach’s effectiveness.
3.1. Comparative Analysis of Models
In the experimental work, a single, fully implemented set of models was used for training and subsequent comparison. This set included compact baseline architectures, more recent official video models, and the author’s proposed RAMT-BinaryHeatNet configuration. This model set ensures a fair comparison of several classes of solutions: 2D-temporal approaches, classical 3D convolutional networks, factorized spatiotemporal architectures, transformer video models, and a hybrid risk-based model with localization. It is important to emphasize that all models were trained within a single computational pipeline in the same notebook, meaning they shared the same input clip length, a common optimization scheme, the same validation thresholding algorithm, and a common multi-task loss function.
The lightweight baseline CNN+BiLSTM was implemented as a frame-by-frame RGB encoder based on MobileNetV3-Small, followed by a bidirectional LSTM. In this configuration, the hidden representation size is set to 128, and the LSTM uses bidirectional processing with an internal dimension of 64 per direction, resulting in a 128-dimensional temporal description of the sequence. The classification layer includes a two-class linear layer, and an additional risk branch generates a single logit for anticipatory risk estimation. For this model, the MobileNet backbone was used in frozen mode, and the actual runtime profile reported 173,699 trainable parameters.
The group of official video baselines included the MC3-18, R3D-18, R(2+1)D-18, Swin3D-T, and Swin3D-S models, downloaded from torchvision. Officially pretrained weights were used for these architectures, and the backbone was frozen when those weights were present. As a result, training was performed primarily on the new output layers: dropout, a two-class linear classifier, and a separate risk head. For 3D convolutional architectures, the dropout value was set to 0.15, and the output classification and risk branches were built on top of the final backbone feature vector. In the executed version of the notebook, the trainable parameters for MC3-18, R3D-18, and R(2+1)D-18 were 1539 each, whereas for Swin3D-T and Swin3D-S, they were 2307 each. This confirms that the comparison was conducted not in full fine-tuning mode, but in supervised head-layer adaptation mode. It is worth noting that the MViTv2-S model was tested during development but was not included in the final executable version of the notebook due to incompatibility between its positional structure and the compact 8-frame protocol with a size of 96 × 96. Therefore, in the final benchmark suite, it was transparently replaced with Swin3D-T and Swin3D-S, which are also considered strong official video baselines and work correctly under the adopted input clip configuration.
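The reported head-only parameter counts can be checked with simple arithmetic. The sketch below assumes that the two-class classifier and the single-logit risk head are plain linear layers with biases attached to the final backbone feature vector, and that this vector has width 512 for the ResNet-18-style video models and 768 for Swin3D-T and Swin3D-S. These widths and the helper names are assumptions for illustration; they are not stated in the text.

```python
def linear_params(in_features: int, out_features: int) -> int:
    """Number of weights plus biases in a fully connected layer."""
    return in_features * out_features + out_features


def head_params(feature_dim: int) -> int:
    """Two-class classifier plus single-logit risk head on one feature vector."""
    return linear_params(feature_dim, 2) + linear_params(feature_dim, 1)


# Assumed backbone output widths (hypothetical table, not taken from the paper).
FEATURE_DIMS = {
    "MC3-18": 512,
    "R3D-18": 512,
    "R(2+1)D-18": 512,
    "Swin3D-T": 768,
    "Swin3D-S": 768,
}

TRAINABLE = {name: head_params(dim) for name, dim in FEATURE_DIMS.items()}

for name, count in TRAINABLE.items():
    print(f"{name}: {count} trainable parameters")
```

Under these assumptions the counts come out to 1539 and 2307, which agrees with the values reported for the frozen-backbone baselines; dropout contributes no parameters.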
The proposed RAMT-BinaryHeatNet model was implemented as a hybrid risk-aware architecture. It is based on MobileNetV3-Small as an RGB encoder, but unlike CNN+BiLSTM, it keeps the last three backbone blocks trainable and increases the feature space size to 160. The model has two spatial heads, an attention head and a localization head, both implemented as 1 × 1 convolutions. For motion analysis, a separate MotionEncoder is added, comprising a sequence of convolutional blocks (3 → 24 → 48 → 160), followed by a temporal 1D depthwise–pointwise module. Spatiotemporal fusion of RGB and motion features is performed via a fusion gate, which accepts the concatenation of the RGB vector, the motion vector, and their absolute difference. Next, a cascade of three TemporalConvBlocks is used, followed by a MultiheadAttention module with 5 heads and a dropout of 0.12. The final decision is formed not only from the classification logits but also from the risk head, with the trainable coefficient risk_scale initialized to 0.35. The number of trainable parameters of the proposed model in the completed notebook was 1,203,791, the largest trainable parameter count among all the considered options.
Table 1 presents the common training hyperparameters and elements of the computational protocol used for all compared models. To ensure a fair comparison of the architectures, training was performed in a single experimental configuration with a fixed clip length of 8 frames and an input image size of 96 × 96 pixels. All experiments used 15 training epochs, a base batch size of 16, the AdamW optimizer with an initial learning rate of 3 × 10⁻⁴, and a weight decay of 1 × 10⁻⁴. To stabilize the process, gradient clipping at 2.0 and mixed precision based on torch.autocast and GradScaler, along with label smoothing with a value of 0.02, were used. Learning rate adaptation was implemented using the CosineAnnealingLR scheduler with T_max = 15. Additionally, a single decision threshold selection protocol was used for all models, based on the validation set, in the range from 0.20 to 0.80, with 61 candidates, and early recognition was analyzed at observation ratios of 0.20, 0.40, 0.60, 0.80, and 1.00. This unification of experimental conditions ensures comparability of results and makes the comparative analysis of models methodologically correct.
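The threshold selection protocol can be illustrated with a minimal sketch. A range from 0.20 to 0.80 with 61 candidates corresponds to a step of 0.01. The selection criterion is not stated in the text; the sketch assumes that the threshold maximizing validation F1 is chosen and that ties are resolved in favor of the lowest threshold. The function names and the toy data are hypothetical.

```python
def f1_score(labels, preds):
    """Binary F1 computed from true positives, false positives, false negatives."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def candidate_thresholds(low=0.20, high=0.80, count=61):
    """Evenly spaced candidates; 61 points on [0.20, 0.80] give a 0.01 step."""
    step = (high - low) / (count - 1)
    return [round(low + i * step, 4) for i in range(count)]


def select_threshold(labels, probs):
    """Return (threshold, F1) with the best validation F1 (assumed criterion)."""
    best_t, best_f1 = None, -1.0
    for t in candidate_thresholds():
        preds = [1 if p >= t else 0 for p in probs]
        score = f1_score(labels, preds)
        if score > best_f1:  # strict comparison keeps the lowest tied threshold
            best_t, best_f1 = t, score
    return best_t, best_f1


# Toy validation data, for illustration only.
val_labels = [0, 0, 1, 1]
val_probs = [0.10, 0.345, 0.60, 0.90]
best_threshold, best_f1 = select_threshold(val_labels, val_probs)
print(best_threshold, best_f1)
```

The selected threshold would then be fixed and applied unchanged to the test set and to each observation ratio.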
For training, all compared models were governed by a single multi-task loss function. The base classification component was calculated using the cross-entropy function, while the risk branch was trained using binary cross-entropy with logits, with a weight of 0.40. For the RAMT-BinaryHeatNet model, additional specialized loss components were activated, including the decision loss (0.30), the consistency loss (0.10), the localization binary cross-entropy (0.12), and the localization alignment and localization sparsity regularizers (0.015 and 0.003, respectively). This optimized the proposed model not only for the final binary classification criterion but also for the consistency of the risk assessment, the final decision, and spatial localization.
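The way these weights combine can be sketched as a plain weighted sum. The weights below are those listed above; the unit weight on the base cross-entropy term, the additive form, and all names are assumptions for illustration, since the text does not give the exact formula.

```python
# Loss weights from the text; the classification weight of 1.0 is an assumption.
LOSS_WEIGHTS = {
    "classification": 1.0,
    "risk": 0.40,
    "decision": 0.30,
    "consistency": 0.10,
    "localization_bce": 0.12,
    "localization_alignment": 0.015,
    "localization_sparsity": 0.003,
}

# Terms active for every model; the remaining terms apply to the proposed model only.
BASELINE_TERMS = ("classification", "risk")


def total_loss(terms, proposed_model):
    """Weighted sum of the active loss terms (hypothetical helper).

    `terms` maps a term name to its already computed scalar value.
    """
    active = tuple(LOSS_WEIGHTS) if proposed_model else BASELINE_TERMS
    return sum(LOSS_WEIGHTS[name] * terms[name] for name in active)


# Illustrative values only: every component equal to 1.0.
example_terms = {name: 1.0 for name in LOSS_WEIGHTS}
print(total_loss(example_terms, proposed_model=False))
print(total_loss(example_terms, proposed_model=True))
```

With every component equal to 1.0, the auxiliary terms of the proposed model add 0.538 on top of the shared classification and risk terms, so they act as regularizers rather than dominating the objective.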
Table 2 presents the architectural configurations and model hyperparameters used in the comparative experiment. The study includes both compact baseline architectures and official modern video models, as well as the author’s proposed RAMT-BinaryHeatNet configuration. This set ensures the correct comparison of several classes of solutions, including two-dimensional temporal approaches, classical three-dimensional convolutional networks, factorized spatio-temporal architectures, transformer video models, and a hybrid risk-based model with localization. For all models, the table lists the architectural basis, key configuration, dropout value, effective batch size, and number of trainable parameters, allowing them to be compared not only by qualitative results but also by computational and parametric complexity.
As shown in Table 2, the CNN+BiLSTM model is implemented as a lightweight baseline with a frame-by-frame RGB encoder based on MobileNetV3-Small and a subsequent bidirectional long short-term memory network. The baseline video models include the MC3-18, R3D-18, R(2+1)D-18, Swin3D-T, and Swin3D-S architectures, loaded with official pretrained weights and used in supervised output layer adaptation mode. The proposed RAMT-BinaryHeatNet model features the most complex configuration, combining an RGB encoder, a separate motion encoder, a spatiotemporal feature fusion mechanism, attention and localization heads, and a multi-head attention module. Table 2 therefore reflects the structural differences between the compared models and serves as the basis for the subsequent analysis of their performance in the early detection of aggressive behavior.
Overall, this study utilized a hierarchically structured set of models, with each architectural class occupying a specific comparative role. CNN+BiLSTM provided a lightweight 2D temporal baseline; MC3-18, R3D-18, and R(2+1)D-18 represented classic 3D solutions; Swin3D-T and Swin3D-S represented modern official transformer video models; and RAMT-BinaryHeatNet represented the author’s hybrid configuration with motion feature integration, learnable localization, and risk-based decision-making. This setup makes the architectural comparison methodologically transparent, technically reproducible, and sufficient for a peer-reviewed description of the experimental part.
The proposed architecture (Figure 4) of RAMT-BinaryHeatNet is a specialized hybrid spatiotemporal model designed for binary analysis of video sequences, generating three interrelated outputs: classification logits, an anticipatory risk score, and a spatial localization map. Unlike standard video models, which focus primarily on a single classification output, this scheme is built from the outset as a multi-component analytical pipeline in which a decision is formed based on the combined consideration of RGB features, interframe motion information, and an internal risk score. Structurally, the scheme is divided into four logical stages: input representation, RGB and localization analysis, motion processing and feature fusion, and temporal heads with final decision making.
At Stage I. Inputs, the model accepts two matched inputs. The first is an RGB Clip Input of dimension B × T × 3 × H × W, where the implemented protocol uses T = 8 frames normalized by ImageNet statistics. The second is a Motion Residual Input of dimension B × (T − 1) × 3 × H × W, calculated as the interframe difference in adjacent RGB frames. Already at the input level, the main difference between this architecture and many baseline video models becomes apparent: it analyzes not only the visual content of the scene but also an explicitly specified motion component, which is especially important for tasks in which aggressive behavior is determined by the dynamics of interactions rather than just static spatial features. The scheme also fixes the base tensor sizes: T = 8, H = W = 96, and the width of the feature space after encoding is set to d = 160.
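The relation between the two inputs can be sketched in a few lines. The example below treats each frame as a flat list of pixel values and computes the residual as the next frame minus the previous one; the sign convention and the helper names are assumptions, since the text only states that adjacent frames are differenced. For a tensor named clip of shape B × T × 3 × H × W, the same operation would be clip[:, 1:] - clip[:, :-1].

```python
def motion_residuals(frames):
    """Differences between adjacent frames: T frames in, T - 1 residuals out.

    Each frame is a flat list of pixel values; the direction (next minus
    previous) is an assumed convention.
    """
    return [
        [nxt_px - prev_px for prev_px, nxt_px in zip(prev, nxt)]
        for prev, nxt in zip(frames, frames[1:])
    ]


def residual_shape(batch, frames, channels, height, width):
    """Shape of the motion input that accompanies a B x T x C x H x W RGB clip."""
    return (batch, frames - 1, channels, height, width)


# Toy clip: 8 frames of 4 pixels each, brightness increasing by 1 per frame.
toy_clip = [[float(t)] * 4 for t in range(8)]
toy_residuals = motion_residuals(toy_clip)
print(len(toy_residuals), toy_residuals[0])
print(residual_shape(16, 8, 3, 96, 96))
```

With T = 8 and a batch size of 16, the motion input therefore has shape 16 × 7 × 3 × 96 × 96, and a static scene produces an all-zero residual.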
At the second stage, designated Stage II. RGB and Localization, features are extracted from the RGB stream using the MobileNetV3-Small backbone, whose last blocks are trainable, followed by a 1 × 1 Conv2d projection from 576 to 160 channels. Two specialized heads are then formed. The first, Attention Head, constructs an attention map using a 160 → 1 convolution with sigmoid activation and implements attention-weighted pooling. The second, Motion-Guided Localization, is also built on a 160 → 1 convolution but is additionally modulated by the motion map, resulting in a localization heatmap normalized over spatial coordinates. As a result, the RGB branch in this model goes beyond simple feature averaging. It simultaneously performs informative region extraction and spatially consistent localization, distinguishing it from conventional convolutional and transformer baseline architectures, where spatial interpretation is either absent or not directly integrated into the main computational graph.
At Stage III. Motion and Fusion, the diagram shows a separate motion processing pipeline. The Spatial Motion Encoder sequentially transforms the motion tensor through blocks 3 → 24 → 48 → 160, combining regular and depthwise–pointwise convolutions with BatchNorm2d and GELU. The resulting sequence is then fed to the Temporal Motion Encoder, where depthwise and pointwise 1D convolutions with normalization and nonlinearity are applied in the time domain. After this, the Motion Projection block maps motion to the same 160-dimensional feature space as the RGB branch. The central element of this stage is Risk-Aware Gated Fusion, which combines RGB and motion features, along with their absolute difference. The diagram shows that the final fused representation is computed by a trainable sigmoid gate, enabling adaptive weighting of visual content and motion dynamics. This block is one of the key advantages of the model: instead of rigid feature summation, it uses an adaptive gated fusion that is sensitive to differences between static content and scene kinematics.
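A minimal sketch of such a gate for one time step is given below. It assumes that the gate is a single linear layer with sigmoid activation applied to the concatenation of the RGB vector, the motion vector, and their element-wise absolute difference, and that the output is a per-dimension convex combination of the two streams. The exact form used in RAMT-BinaryHeatNet is not given in the text, so the combination rule, the parameter layout, and the names are all assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(rgb, motion, weight, bias):
    """Fuse two d-dimensional vectors with a learned per-dimension gate.

    weight: d rows of 3 * d values, bias: d values (hypothetical layout).
    The gate sees [rgb, motion, |rgb - motion|]; the output is
    gate * rgb + (1 - gate) * motion, an assumed combination rule.
    """
    gate_input = list(rgb) + list(motion) + [abs(r - m) for r, m in zip(rgb, motion)]
    gate = [
        sigmoid(sum(w * x for w, x in zip(row, gate_input)) + b)
        for row, b in zip(weight, bias)
    ]
    return [g * r + (1.0 - g) * m for g, r, m in zip(gate, rgb, motion)]


# Toy example with d = 2: zero weights give a gate of 0.5 in every dimension.
rgb_vec, motion_vec = [1.0, 2.0], [3.0, 0.0]
zero_weight = [[0.0] * 6 for _ in range(2)]
balanced = gated_fusion(rgb_vec, motion_vec, zero_weight, [0.0, 0.0])
rgb_dominant = gated_fusion(rgb_vec, motion_vec, zero_weight, [50.0, 50.0])
print(balanced, rgb_dominant)
```

A gate value near 1 passes the RGB feature through, a value near 0 passes the motion feature, and 0.5 reduces to simple averaging, which is the rigid case the learned gate is meant to avoid.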
At the final stage, Stage IV. Temporal Heads and Decision, processing is performed in several sequential steps. First, the fused sequence passes through LayerNorm and three TemporalConvBlock blocks, implementing deterministic temporal modeling. Next, MultiheadAttention with five heads is applied, after which a separate temporal score head generates a temporal importance distribution using softmax. The next block, Temporal Pooling, combines weighted mean pooling and temporal max pooling to form a final 320-dimensional descriptor. This descriptor feeds two parallel heads: the Anticipatory Risk Head, which produces a risk score, and the Classification Head, which generates binary class logits. The final block, Decision Fusion and Reported Outputs, illustrates how the final decision is formed: the margin between class logits is augmented by an additional contribution from the risk branch via the risk_scale parameter, after which the decision logit is formed. Thus, the architecture not only classifies an event but also introduces risk-based decision correction, making it conceptually distinct from baseline models, where the final prediction is built solely from a single classification head.
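The two closing operations of this stage can be sketched as follows. The pooling function concatenates a softmax-weighted mean and a temporal maximum, which explains the 320-dimensional descriptor for d = 160. The decision function assumes an additive form, namely the class-logit margin plus risk_scale times the risk logit; the text does not give the exact expression, so this form and the function names are assumptions.

```python
import math


def temporal_pool(sequence, scores):
    """Concatenate softmax-weighted mean pooling and temporal max pooling.

    sequence: T vectors of width d; scores: T raw temporal importance scores.
    Returns a vector of width 2 * d.
    """
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]  # numerically stable softmax
    total = sum(exps)
    weights = [e / total for e in exps]
    width = len(sequence[0])
    weighted_mean = [
        sum(w * step[i] for w, step in zip(weights, sequence)) for i in range(width)
    ]
    temporal_max = [max(step[i] for step in sequence) for i in range(width)]
    return weighted_mean + temporal_max


def decision_logit(class_logits, risk_logit, risk_scale=0.35):
    """Class margin corrected by the risk branch (assumed additive form)."""
    margin = class_logits[1] - class_logits[0]
    return margin + risk_scale * risk_logit


# Toy sequence: 8 time steps of width 160, with uniform importance scores.
toy_sequence = [[float(t)] * 160 for t in range(8)]
descriptor = temporal_pool(toy_sequence, [0.0] * 8)
print(len(descriptor))
print(decision_logit([0.2, 0.8], risk_logit=2.0))
```

In this sketch a positive risk logit pushes the decision toward the aggressive class even when the classification margin alone is small, which is the intended effect of the risk-based correction; 0.35 is the initial value of risk_scale, which is trainable in the model.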
From a scientific and methodological perspective, this model can be regarded as a distinct proposed architecture, as it combines, in a single differentiable pipeline, several components that are typically absent in standard video networks: explicit residual motion input, motion-guided localization, gated RGB and motion fusion, a separate anticipatory risk head, and final decision fusion involving risk. Its advantage lies not simply in its complexity, but in its more targeted adaptation to the task of early detection of aggressive behavior. Unlike standard 3D-CNN and transformer models, which are primarily optimized for final clip classification, RAMT-BinaryHeatNet is designed as an architecture capable of simultaneously identifying spatially significant regions, accounting for the temporal evolution of motion, and correcting binary decisions based on an internal risk assessment. This is why the diagram reflects not another variation of a standard backbone, but an independent proposed model for applied purposes, focused on interpretable and risk-sensitive video analysis.
Table 3 presents the profiles of the compared models by the number of trainable parameters and the latency of processing a single video clip in the implemented experimental configuration. This analysis allows us to compare architectures not only in terms of recognition quality but also in terms of computational feasibility, which is especially important for tasks such as the early detection of aggressive behavior in near-real-time conditions. It should be emphasized that the presented values reflect the actual notebook profile, i.e., they correspond to the configuration in which the models were used in the experimental pipeline. For this reason, the column for the number of trainable parameters should be interpreted as the number of parameters updated during training in the current configuration, rather than the full parametric capacity of the entire architecture. This is especially important for official video baselines in which the backbone was frozen, and optimization was performed primarily on the output classification and risk-oriented heads.
The data presented shows that the models differ not only in their architectural type but also in their computational profile. Swin3D-S demonstrates the lowest latency in this implementation with a value of 21.5001 ms/clip, followed by Swin3D-T (58.5502 ms/clip) and R3D-18 (74.5958 ms/clip). The proposed RAMT-BinaryHeatNet model shows a latency of 85.7341 ms/clip, i.e., remains within the range of practically acceptable values, while implementing a significantly more complex internal analysis loop that includes motion processing, localization, and risk-based decision fusion. The slowest model in this configuration is CNN+BiLSTM, with a latency of 3419.6635 ms/clip, indicating significantly lower computational efficiency per clip. In terms of trainable parameters, RAMT-BinaryHeatNet has the largest count, with 1,203,791, consistent with its extended hybrid structure. For CNN+BiLSTM, the trainable portion is significantly smaller, comprising 173,699 parameters. Meanwhile, the MC3-18, R3D-18, and R(2+1)D-18 models share the same number of trainable parameters, 1539, while Swin3D-T and Swin3D-S have 2307, reflecting the partial adaptation mode with a frozen backbone. Thus, the table shows that RAMT-BinaryHeatNet occupies an intermediate position in terms of speed but significantly outperforms other models in the volume of the trainable specialized portion, while the official baseline architectures in this protocol serve as lightweight adaptable comparisons with a minimal number of updated parameters. This makes the comparison of models methodologically transparent and allows for the correct interpretation of their differences in further analysis.
Table 4 demonstrates that recognition quality depends significantly not only on the model architecture but also on the fraction of the observed video clip. The data show that, for most models, increasing the observation ratio from 0.2 to 1.0 is generally associated with increases in F1, Balanced Accuracy, and ROC-AUC; however, the magnitude of this increase is uneven. CNN+BiLSTM demonstrates the weakest early stability: at an observation ratio of 0.2000, F1 = 0.8224; with full observation, it increases to 0.8932. For R3D-18, the dynamics also remain moderate: F1 varies from 0.8315 to 0.8822, indicating a relatively limited ability to confidently recognize an event in the early stages of its development. At the same time, some models demonstrate greater suitability for early analysis. Thus, MC3-18, already with an observation ratio of 0.2000, achieves F1 = 0.8997 and ROC-AUC = 0.9544, and Swin3D-T, with an observation ratio of 0.4000, shows F1 = 0.9333 and Balanced Accuracy = 0.9333. The most stable official comparative solution for the set of values is Swin3D-S, which, with full observation, achieves F1 = 0.9320, Balanced Accuracy = 0.9300, and ROC-AUC = 0.9866, maintaining high indicators in the intermediate parts of the clip.
The most important fact emerging from the table is that RAMT-BinaryHeatNet performs best in the early warning mode at an observation ratio of 0.6000, achieving F1 = 0.9527 and Balanced Accuracy = 0.9533. This means that after observing 60% of the clip, the model demonstrates higher recognition accuracy than all other compared solutions. With the full clip observed, it also maintains strong performance: F1 = 0.9342, Balanced Accuracy = 0.9333, and ROC-AUC = 0.9871, the highest ROC-AUC among all the models in the table. Thus, the results presented demonstrate that the proposed architecture exhibits high discriminatory performance under conditions of partial and full observation of video clips within the experimental protocol. However, these results should be interpreted as confirmation of the model’s research potential for early detection of suspicious video segments, rather than as sufficient grounds for its standalone use in scenarios where prediction directly entails disciplinary, legal, or other significant consequences.
One key aspect of correctly interpreting the results is analyzing the stability of the training process during the final optimization stage. For this purpose, summary Table 5 was generated, including the mean values and standard deviations for Validation F1 and Validation ROC-AUC over the last five epochs, as well as the Final Train-Val F1 Gap, which reflects the discrepancy between the final performance values on the training and validation sets. These values allow us to assess not only the final performance level but also the degree of model fluctuations during the final training phase, as well as the presence of signs of overfitting or, conversely, conservative model behavior.
The data show that RAMT-BinaryHeatNet demonstrates the highest average validation quality indicators: Validation F1 mean = 0.9446 and Validation ROC-AUC mean = 0.9804. At the same time, the standard deviations remain low (0.0043 and 0.0013, respectively), indicating fairly stable behavior in the final epochs. The Final Train-Val F1 Gap of 0.0443 is positive but not large enough to indicate critical overfitting; rather, it reflects a closer fit to the training data while maintaining strong generalization during validation. Among the official baseline models, Swin3D-S and Swin3D-T demonstrate the smoothest behavior. For Swin3D-S, the Validation F1 standard deviation is 0.0000, while for Swin3D-T it is 0.0013, indicating almost constant validation dynamics in the final epochs. Swin3D-T exhibits a minimal positive gap of 0.0087, while Swin3D-S exhibits a gap of 0.0132, which can be interpreted as the most balanced ratio between training and validation. In turn, the MC3-18, R(2+1)D-18, and R3D-18 models have negative train-val gap values, indicating no signs of overfitting at the final point and a less pronounced fit to the training set. R3D-18 demonstrates the weakest performance, with the lowest average Validation F1 (0.8737) and Validation ROC-AUC (0.9328), and the largest F1 standard deviation (0.0050). Thus, the table confirms that RAMT-BinaryHeatNet achieves the best overall validation performance while maintaining controlled training stability, while Swin3D-S and Swin3D-T are the smoothest and most statistically robust official benchmarks.
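The stability indicators discussed here can be computed from per-epoch training logs as in the sketch below. It assumes that the gap is the final training F1 minus the final validation F1 and that the standard deviation is the population form; the text does not state which form was used, and the epoch values in the example are invented for illustration only.

```python
from statistics import mean, pstdev


def stability_summary(train_f1, val_f1, val_auc, window=5):
    """Mean and standard deviation over the last `window` epochs, plus final gap.

    Population standard deviation is an assumption; the sample form
    (statistics.stdev) would give slightly larger values.
    """
    last_f1 = val_f1[-window:]
    last_auc = val_auc[-window:]
    return {
        "val_f1_mean": mean(last_f1),
        "val_f1_std": pstdev(last_f1),
        "val_auc_mean": mean(last_auc),
        "val_auc_std": pstdev(last_auc),
        "final_train_val_f1_gap": train_f1[-1] - val_f1[-1],
    }


# Invented per-epoch values, not taken from the paper.
toy_train_f1 = [0.80, 0.90, 0.94, 0.96, 0.97, 0.98, 0.98]
toy_val_f1 = [0.80, 0.90, 0.94, 0.94, 0.95, 0.95, 0.94]
toy_val_auc = [0.95, 0.97, 0.98, 0.98, 0.98, 0.98, 0.98]
summary = stability_summary(toy_train_f1, toy_val_f1, toy_val_auc)
for key, value in summary.items():
    print(f"{key}: {value:.4f}")
```

A standard deviation of exactly 0.0000, as reported for Swin3D-S, corresponds to an identical validation F1 in each of the last five epochs, and a negative gap means the final validation F1 exceeds the final training F1.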
The loss function dynamics across epochs show that all the models examined exhibit a similar general pattern of error decay in the early stages of training. However, the nature of subsequent stabilization, the depth of the train-loss reduction, and the ratio between the training and validation curves differ significantly. The most pronounced reduction in training error is observed for RAMT-BinaryHeatNet: train-loss decreases from approximately 0.75–0.80 in the first epoch to around 0.18–0.20 by the final epochs. The validation error for this model also decreases over time, but after a rapid initial drop it stabilizes around 0.52–0.57, forming a noticeable but manageable gap between the train and validation curves. This profile corresponds to intensive fitting of the training set while maintaining a stable validation trajectory with no signs of sharp degradation in the final epochs. Swin3D-S and Swin3D-T exhibit smoother, more consistent convergence. For Swin3D-S, both curves decrease gradually and converge by the end of training: the train loss is approximately 0.31–0.33, and the validation loss is approximately 0.37–0.38. A similar pattern is observed for Swin3D-T, with values of approximately 0.29–0.31 on the training set and 0.33–0.34 on the validation set. These two models are characterized by the smallest distance between the curves, consistent with the previously obtained high stability indicators and indicating balanced convergence without significant overfitting (Figure 5).
For the MC3-18 and R(2+1)D-18 models, the training curves also have a stable downward profile, but the resulting errors remain higher than those of the transformer architectures. For MC3-18, the train-loss at the end of training is approximately 0.42–0.44, while the validation-loss is around 0.36–0.38. For R(2+1)D-18, both curves converge in the range of approximately 0.43–0.45, with a minimal gap between them. This configuration indicates a smooth but more conservative convergence, in which the model does not show a sharp reduction in error during training but maintains fairly close values during validation.
CNN+BiLSTM and R3D-18 stand out in particular. For CNN+BiLSTM, the train-loss decreases to around 0.21–0.24, while the validation-loss remains higher, at approximately 0.40–0.45, creating one of the most noticeable gaps between the curves among baseline models. For R3D-18, both trajectories decline significantly more slowly: by the final epochs, train-loss and validation-loss are approximately 0.48–0.50, and their proximity does not indicate high optimization but rather limited depth of convergence. Taken together, the presented curves show that RAMT-BinaryHeatNet achieves the deepest reduction in training error, Swin3D-S and Swin3D-T demonstrate the smoothest and most balanced stabilization regime, while R3D-18, MC3-18, and R(2+1)D-18 are characterized by a more moderate optimization rate and a higher final loss function level.
Figure 6 shows the F1 score dynamics across epochs. All the models studied reach relatively high values early in training. Still, the growth rate, level of stabilization, and nature of the divergence between the training and validation curves differ significantly. RAMT-BinaryHeatNet demonstrates the highest final profile. For this model, the training F1 score rapidly increases from approximately 0.80 in the first epoch to values around 0.98–0.99 in the final epoch, while the validation F1 score stabilizes in the range of 0.94–0.95. This indicates the highest absolute quality among the presented solutions, while maintaining a stable validation trajectory with no signs of sharp deterioration in the final epochs.
Among the official baseline models, the smoothest and strongest curves are observed for Swin3D-S and Swin3D-T. For Swin3D-S, the training F1 value reaches approximately 0.93–0.94, while the validation F1 value remains at approximately 0.92 or slightly above, with minimal fluctuations. For Swin3D-T, the dynamics are similar: the training F1 value increases to 0.92–0.93, while the validation F1 value gradually reaches a comparable level. The shape of the curves shows that these two models exhibit the smallest gap between the training and validation values, consistent with their high stability during training.
MC3-18 and R(2+1)D-18 exhibit a more moderate trajectory. For MC3-18, the validation F1 value remains in the range of 0.90–0.92 throughout most of the training process, while the training curve remains lower and ultimately reaches approximately 0.87–0.89. A similar effect is observed in R(2+1)D-18, where the validation F1 stabilizes around 0.90–0.91, and the training F1 fluctuates mainly in the range of 0.86–0.89. This configuration indicates smooth convergence without overfitting, but with a slightly shallower fit to the training set.
CNN+BiLSTM and R3D-18 deserve special mention. In CNN+BiLSTM, the training F1 reaches approximately 0.95, while the validation F1 remains closer to 0.89–0.90, forming a noticeable positive gap. In R3D-18, both curves lie below those of the other models: the training F1 finishes around 0.86–0.87, and the validation F1 is close to 0.87–0.88.
Taken together, the results show that RAMT-BinaryHeatNet achieves the highest final F1 score, Swin3D-S and Swin3D-T demonstrate the most balanced stabilization, and R3D-18 remains the weakest in absolute values of this metric.
Figure 7 shows the results of the validation ROC-AUC change across epochs. All models achieve a fairly high level of class discrimination early in training, but the rate at which they reach a plateau and the final metric value differ significantly. RAMT-BinaryHeatNet demonstrates the strongest trajectory. By the fourth epoch, the ROC-AUC value for this model rises to approximately 0.979, and it remains in a narrow range of 0.978–0.982 until the end of training. This dynamic indicates rapid achievement of a high level of discriminatory ability, followed by stable stabilization without significant drop-offs. Similar but slightly lower results are observed for Swin3D-T and MC3-18. By the final epochs, Swin3D-T remains stable at approximately 0.977–0.978, while MC3-18 remains in the range of approximately 0.969–0.970. Swin3D-S also shows a stable upward trend: the metric increases from approximately 0.949 in the first epoch to 0.966–0.967 in the final epoch. This indicates that both transformer models produce stable and competitive class ranking quality, but are inferior to the proposed architecture in terms of absolute maximum performance. CNN+BiLSTM and R(2+1)D-18 exhibit more moderate trajectories. For CNN+BiLSTM, ROC-AUC increases from approximately 0.935 to 0.966–0.967, with the curve becoming almost horizontal after the middle of training. For R(2+1)D-18, the initial value is close to 0.907; the metric then gradually increases and stabilizes at approximately 0.956–0.958. R3D-18 maintains the lowest trajectory, starting at around 0.871 and reaching only 0.932–0.934 by the final epochs. Thus, based on the totality of the results, the graph confirms that RAMT-BinaryHeatNet provides the highest and most consistent ROC-AUC level, while Swin3D-T and Swin3D-S are the closest strong comparative solutions, and R3D-18 remains the least effective model for this metric.
Test accuracy results show that, on the final independent sample, all models considered achieve a sufficiently high level of correct binary classification. However, there remains a clear gradation in absolute quality between them. RAMT-BinaryHeatNet achieves the highest Accuracy of 0.933, the best final result among the compared architectures. The closest model is Swin3D-S with an accuracy of 0.930, and the gap between these two solutions is only 0.003, indicating that they belong to the highest quality level within the framework of the conducted experiment (Figure 8).
The next group consists of R(2+1)D-18 and MC3-18, which showed 0.913 and 0.910, respectively. Their results stay at or above 0.91 but trail RAMT-BinaryHeatNet by 0.020 and 0.023, respectively. Swin3D-T ranks slightly lower, with a final accuracy of 0.903, indicating a high but weaker level of final recognition compared to Swin3D-S. Thus, among the strong baseline architectures, Swin3D-S is the most competitive in the test, while Swin3D-T, MC3-18, and R(2+1)D-18 achieve an intermediate level of performance. The lowest test accuracies are observed for CNN+BiLSTM and R3D-18, with scores of 0.890 and 0.883, respectively. This means that the gap between the best and least accurate models in this comparison is 0.050, or 5 percentage points. Moreover, even the minimum result remains relatively high for the binary problem formulation, confirming the overall performance of the entire set of solutions studied. Taken together, the presented data show that RAMT-BinaryHeatNet provides the best overall generalization on the test set, Swin3D-S is the closest official comparison, and the remaining architectures demonstrate consistently lower accuracy values. Consequently, in terms of overall test accuracy, the proposed model ranks first among all the options considered.
The results shown in Figure 9 for the F1 score on the test set confirm the general hierarchy of models previously observed for other integrated quality metrics. However, in this case, the emphasis shifts to the trade-off between precision and recall in the binary classification of aggressive and non-aggressive video scenes. RAMT-BinaryHeatNet demonstrates the highest value, with a Test F1 score of 0.934. This means that the proposed model provides the best balance between correctly identifying the positive class and minimizing missed detections and false positives in the final test score. The closest comparable solution is Swin3D-S with a score of 0.932, and the difference between the two leading models is only 0.002, indicating a virtually identical level of final prediction consistency. The next group consists of R(2+1)D-18 and MC3-18, which achieve F1 scores of 0.917 and 0.911, respectively. Their values exceed 0.91 but are 0.017 and 0.023 lower than the leader’s. This indicates that these architectures retain a fairly strong ability to discriminate between classes, but remain below the two best models in terms of overall decision balance. Swin3D-T achieves an F1 score of 0.908, placing it just below R(2+1)D-18 and MC3-18 and above the two weakest models. Thus, among the official comparative models, Swin3D-S demonstrates the best overall F1 score. The lowest scores are recorded for CNN+BiLSTM and R3D-18, which achieved F1 scores of 0.893 and 0.882, respectively. The gap between the best and worst-performing models is 0.052, or 5.2 percentage points. Even the minimum value remains relatively high for a practical binary setting, confirming the overall validity of the model series studied. Taken together, the presented results show that RAMT-BinaryHeatNet ranks first in the final test F1 score, providing the most balanced recognition on an independent sample, while Swin3D-S is the closest and strongest official baseline.
The dependence of the F1 score on the observed fraction of the video clip is shown in Figure 10. The models differ significantly in their ability to recognize an aggressive event at its early stages. RAMT-BinaryHeatNet demonstrates the most pronounced and effective trajectory. Even at an observation ratio of 0.4, its F1 score reaches 0.9110, and at 0.6 it attains 0.9527, the highest point among all the presented curves. Only a slight decrease follows, to 0.9459 at 0.8 and to 0.9342 with full observation, indicating very high efficiency of the model in the predictive recognition mode, when only part of the video sequence is available. Swin3D-S shows a strong but less pronounced trend. Its F1 score rises almost monotonically from 0.8723 at 0.2 to 0.9320 at 1.0, which indicates a steady accumulation of discriminative information as the observed fragment grows. In contrast, Swin3D-T exhibits a more uneven trajectory: it reaches 0.9333 at 0.4, drops to 0.8968 at 0.6, and then partially recovers. This curve shape indicates good sensitivity to early fragments but less stable behavior as the time window expands. MC3-18 is characterized by a strong early start (0.8997 at 0.2) followed by an oscillating plateau in the range of 0.8961–0.9195. R(2+1)D-18, in contrast, shows a smoother and more consistent increase in quality, from 0.8561 at 0.2 to 0.9189 at 0.8, after which it maintains a similar level. CNN+BiLSTM and R3D-18 form the bottom group: the former improves markedly from 0.8224 to 0.8932, while the latter remains the weakest in absolute terms, reaching only 0.8822 with full observation. Taken together, the presented curves confirm that RAMT-BinaryHeatNet is the best model for the early warning scenario, while Swin3D-S is the most stable official benchmark as the observed fraction of the clip increases.
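To make the partial-observation protocol concrete, the following Python sketch evaluates F1 when a classifier sees only a prefix of each clip. It is a minimal illustration under stated assumptions: clips are shortened by keeping their first frames (the exact truncation or padding scheme used in the study is not described here), and `truncate_clip`, `binary_f1`, `f1_by_observation_ratio`, and `predict_fn` are hypothetical names introduced for this example.

```python
import numpy as np

def truncate_clip(frames: np.ndarray, ratio: float) -> np.ndarray:
    """Keep only the first `ratio` share of frames (at least one frame)."""
    if not 0.0 < ratio <= 1.0:
        raise ValueError("ratio must be in (0, 1]")
    n_keep = max(1, int(round(len(frames) * ratio)))
    return frames[:n_keep]

def binary_f1(y_true, y_pred) -> float:
    """F1 score for the positive class (label 1)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def f1_by_observation_ratio(predict_fn, clips, labels,
                            ratios=(0.2, 0.4, 0.6, 0.8, 1.0)):
    """Return {ratio: F1} where the model sees only a prefix of each clip.

    `predict_fn` is an assumed callable mapping a frame array to 0 or 1.
    """
    results = {}
    for ratio in ratios:
        preds = [predict_fn(truncate_clip(clip, ratio)) for clip in clips]
        results[ratio] = binary_f1(labels, preds)
    return results
```

A curve such as those in Figure 10 would then correspond to plotting the returned dictionary for each model, with the observation ratio on the horizontal axis.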
The dependence of ROC-AUC on the observed fraction of the video clip (Figure 11) shows that all models maintain a fairly high ability to rank classes even at early stages of observation, but differ in the rate of improvement and in the level at which they stabilize. RAMT-BinaryHeatNet demonstrates the strongest trajectory. Even at an observation ratio of 0.2, the model shows an ROC-AUC of 0.9630, and by 0.6 it reaches 0.9836, one of the highest values on the entire graph. The metric then remains consistently high: 0.9803 at 0.8 and 0.9871 at 1.0. This behavior indicates stable class separation even with partial event observation and confirms the proposed model's ability to extract informative features before the end of the video scene. Swin3D-S demonstrates comparable results, with ROC-AUC increasing from 0.9534 at 0.2 to 0.9866 at 1.0. Unlike RAMT-BinaryHeatNet, the increase here is smoother and almost monotonic. Swin3D-T shows a pronounced early rise, reaching 0.9812 at 0.4, after which the values fluctuate slightly and end at 0.9716. The model is thus highly sensitive to early segments but less stable than the two leading solutions. MC3-18 and R(2+1)D-18 exhibit similar but somewhat more moderate trajectories: MC3-18 increases from 0.9544 to 0.9784, and R(2+1)D-18 from 0.9543 to 0.9702. CNN+BiLSTM shows the largest improvement relative to its starting point, from 0.9091 at 0.2 to 0.9577 at 1.0, indicating a strong dependence on the completeness of the observed clip. R3D-18 has the lowest trajectory, with values ranging from 0.9412 to 0.9576 that stay below those of the other models over almost the entire interval. Taken together, these results confirm that RAMT-BinaryHeatNet and Swin3D-S achieve the highest ROC-AUC in the early warning mode, with the proposed model providing the strongest combination of early discrimination and final robustness.
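ROC-AUC is a threshold-free ranking measure, which helps explain why it can stay high at small observation ratios even when F1 at a fixed threshold is lower. The sketch below computes it directly from its pairwise definition: the probability that a randomly chosen positive clip receives a higher score than a randomly chosen negative one. The function name `roc_auc` and the toy inputs are illustrative; the study's own evaluation code is not shown in the text.

```python
import numpy as np

def roc_auc(y_true, scores) -> float:
    """ROC-AUC via pairwise comparison of positive and negative scores.

    Ties between a positive and a negative score count as one half.
    """
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("both classes must be present")
    # Compare every positive score with every negative score.
    diff = pos[:, None] - neg[None, :]
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return float(wins / (len(pos) * len(neg)))

# Toy example: three of the four positive/negative pairs are ordered correctly.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

The pairwise form costs memory proportional to the number of positive-negative pairs, so for large test sets a rank-based implementation would be preferable; it is used here only because it mirrors the definition.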
To evaluate the contribution of each architectural component, an ablation study was conducted, with the results summarized in Table 6. Starting with the full model (A0), key modules were progressively removed or simplified to analyze their individual impact on classification performance, localization quality, and computational efficiency. All reported values are presented as the mean ± standard deviation over three independent runs (n = 3) with different random seeds (42, 73, and 101) at a clip observation ratio of 0.60. Removing the motion branch (A1) results in a noticeable decrease across all metrics: F1 drops from 0.952 to 0.926, and mIoU decreases from 0.604 to 0.403, underscoring the importance of motion information for both classification and localization. Similarly, turning off motion-guided localization (A2) reduces localization quality, confirming the importance of explicit motion cues for region-level prediction. Replacing supervised fusion with simple concatenation (A3) decreases performance across all evaluation metrics, demonstrating the importance of adaptive weighting for effective multimodal integration. A similar degradation is observed when the absolute difference term between the RGB and motion features is removed (A8), suggesting that this operation captures informative discrepancies between the two modalities. The temporal modeling components are also crucial. Removing TemporalConvBlocks (A4) or Multi-Head Attention (A5) results in a consistent performance degradation, with the largest drop observed after removing the attention mechanism (F1 = 0.887), highlighting its importance for modeling long-term temporal dependencies. The importance of multi-task learning is particularly evident in A7: when only the classification loss is retained and the localization/coherence targets are removed, localization performance drops sharply to mIoU = 0.138. Removing the risk analysis module and the decision fusion module (A6) has a relatively small impact on classification performance; a slight decrease in overall robustness is still observed, while localization remains virtually unchanged.
Finally, the ablations of the pooling fusion (A9–A10) show that the combination of weighted average pooling and max pooling is more effective than either strategy alone, suggesting that both global contextual information and the strongest activations contribute to the final prediction.
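The mean ± standard deviation format used for the ablation results can be reproduced with the standard library. The sketch below is illustrative only: the helper name `summarize_runs` is hypothetical, the per-seed numbers are placeholders and not results from the study, and the use of the sample standard deviation (n − 1 in the denominator) is an assumption, since the text does not say which estimator was applied.

```python
from statistics import mean, stdev

def summarize_runs(values, digits=3):
    """Format per-seed metric values as 'mean ± std' (sample std, n - 1)."""
    if len(values) < 2:
        raise ValueError("need at least two runs")
    return f"{mean(values):.{digits}f} ± {stdev(values):.{digits}f}"

# Placeholder values keyed by the seeds named in the text (42, 73, 101).
# These numbers are NOT taken from the study.
placeholder_metric = {42: 0.80, 73: 0.82, 101: 0.84}

print(summarize_runs(list(placeholder_metric.values())))
```

With only three runs the standard deviation is itself a rough estimate, so small differences between ablation variants should be read with that in mind.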
The normalized confusion matrix for RAMT-BinaryHeatNet (Figure 12) indicates that the proposed model achieves high, well-balanced binary recognition performance on an independent test set. The main diagonal of the matrix contains the largest values: for the NonViolence class, the proportion of correct predictions is 0.92, while for the Violence class it reaches 0.95. The model thus correctly identifies 92% of non-aggressive scenes and 95% of aggressive scenes, confirming high sensitivity to the target dangerous class while maintaining consistent performance for the neutral class. The structure of the off-diagonal elements is particularly important. The proportion of non-aggressive video clips incorrectly classified as violent is 0.08, while the proportion of aggressive scenes incorrectly classified as non-violent is 0.05. Consequently, within the test protocol, the miss rate for the Violence class is lower than the false-alarm rate for the NonViolence class. From a practical perspective, the results demonstrate that the developed approach can be used as a component of a preliminary video analytics filtering system to identify fragments requiring further operator analysis. In its current form, however, the model is not regarded as a standalone tool for making final decisions, because operational use in environments with rare incidents requires additional quantitative evaluation of false positives, threshold calibration, prevalence sensitivity analysis, and formalized human accountability procedures.
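The need for a prevalence sensitivity analysis can be illustrated with Bayes' rule applied to the per-class rates of the confusion matrix. The sketch below takes the four rates reported above and computes the precision of a Violence alert for several shares of violent clips. The prevalence levels are hypothetical values chosen for illustration, and the function name `violence_precision` is not from the study.

```python
# Row-normalized confusion matrix rates as reported in the text (Figure 12).
tnr = 0.92  # NonViolence correctly recognized (specificity)
fpr = 0.08  # NonViolence flagged as Violence (false-alarm rate)
fnr = 0.05  # Violence missed (miss rate)
tpr = 0.95  # Violence correctly recognized (recall)

def violence_precision(prevalence: float) -> float:
    """Share of Violence alerts that are correct at a given prevalence."""
    if not 0.0 < prevalence < 1.0:
        raise ValueError("prevalence must be in (0, 1)")
    true_alerts = tpr * prevalence
    false_alerts = fpr * (1.0 - prevalence)
    return true_alerts / (true_alerts + false_alerts)

# Hypothetical prevalence levels, for illustration only.
for prevalence in (0.5, 0.1, 0.01):
    precision = violence_precision(prevalence)
    print(f"prevalence={prevalence:.2f}  precision={precision:.3f}")
```

Under these assumptions precision is about 0.92 for balanced classes but falls to roughly 0.11 when only 1% of clips are violent, even though both per-class rates are unchanged. This is the effect that threshold calibration and operator review are meant to address.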
A comparison of the two matrix rows reveals that the difference in class recognition performance remains small: 0.95 − 0.92 = 0.03. This indicates the absence of a significant model bias toward one category and a consistent separation of the classes. At the same time, the slight advantage for the Violence class is consistent with the proposed model's intended application, which is to reliably identify potentially dangerous behavior. The confusion matrix thus shows that RAMT-BinaryHeatNet achieves high recognition accuracy, a low rate of missed aggression, and an acceptable false-positive rate, making its results practically relevant for the task at hand.
Overall, the comparative analysis demonstrates that the proposed RAMT-BinaryHeatNet model occupies the strongest position across key characteristics, combining high early warning performance, maximum discriminatory power, and a considerably more elaborate analytical framework than the baseline architectures. The official video models, primarily Swin3D-S and Swin3D-T, demonstrate high training stability and competitive values for key metrics, while the classic 3D convolutional solutions and CNN+BiLSTM serve as simpler benchmarks with varying degrees of computational and predictive efficiency. Taken together, the obtained data confirm that integrating spatial features, motion information, risk-based analysis, and localization mechanisms within RAMT-BinaryHeatNet provides the most balanced solution to the problem of early detection of aggressive behavior and justifies a transition to subsequent visual analysis of the model's interpretability.