This section evaluates the proposed method through a series of systematic experiments, comparing its performance against contemporary state-of-the-art methods on the public UCF-Crime [3] and XD-Violence [27] datasets. Our experimental design focuses on the fundamental challenges identified in the preceding section so that each proposed technical solution is evaluated systematically. First, we assess the temporal modelling capability by comparing it with conventional models, examining whether the HMTE module can effectively capture both the long- and short-term dependencies of complex events. We then evaluate the background suppression capability of the model through targeted tests and performance comparisons in highly cluttered environments, and assess the anomaly-prototype fitting ability of the bi-directional memory bank through its performance on the label-ambiguous XD-Violence dataset. Finally, we validate the discriminative power of the final features through fine-grained anomaly classification tests that measure how well the model separates similar, ambiguous anomalies. This problem-driven validation is intended to show that each design provides a distinct technical solution and thus a specific advantage over existing methods.
4.1. Dataset and Evaluation Criteria
To assess the efficacy of the proposed strategy, we performed comprehensive evaluations on two prominent anomaly detection datasets, UCF-Crime [3] and XD-Violence [27].
The UCF-Crime dataset comprises 13 categories of real-world anomalous events, with videos taken from unprocessed surveillance recordings. The training set includes 800 normal and 810 anomalous videos, while the test set contains 150 normal and 140 anomalous videos. The primary challenge of this dataset is that the anomalous signals are weak while the background dynamics are pronounced: surveillance footage frequently contains substantial dynamic background elements (e.g., pedestrians, traffic) that are irrelevant to the event, and many anomalous behaviours are visually close to the boundary of normal behaviour. This makes it a clear technical validation case for our dual-stream attention method: in low signal-to-noise conditions, the model must concentrate on critical spatial and temporal regions while efficiently attenuating background noise. Our attention module is designed precisely for this setting, learning to identify and extract the most informative and suspicious regions from complex scenes, thereby validating its background suppression capability.
XD-Violence is a large and heterogeneous dataset of 4754 untrimmed videos sourced from web content, live sports broadcasts, and surveillance footage. The primary challenge of this dataset is the morphological diversity of its anomalies and the ambiguity of its video-level labels: we only know that a long video contains anomalies, without knowing their precise timing. This characteristic underscores the necessity and benefit of our memory module. The model must learn a stable and generalisable set of prototype features despite imprecise labels and highly varied input patterns. Our learnable bi-directional memory module is designed for exactly this purpose: by storing and updating typical normal and abnormal patterns, it enables stable and reliable feature matching on highly diverse data, thereby directly demonstrating its capacity to fit anomalous prototypes.
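As a concrete illustration of the prototype-matching idea described above, the following minimal PyTorch sketch shows how a bi-directional memory bank could store learnable normal and abnormal prototypes and match snippet features against them via cosine similarity. The class name, bank size, and feature dimension are illustrative assumptions rather than the exact design of our module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiMemoryBank(nn.Module):
    """Illustrative bi-directional memory: learnable normal and abnormal
    prototype banks, read via cosine-similarity attention (sizes are assumed)."""
    def __init__(self, num_prototypes=60, feat_dim=1024):
        super().__init__()
        self.normal_mem = nn.Parameter(torch.randn(num_prototypes, feat_dim))
        self.abnormal_mem = nn.Parameter(torch.randn(num_prototypes, feat_dim))

    def read(self, feats, memory):
        # feats: (B, T, D) snippet features; memory: (K, D) prototypes
        sim = F.normalize(feats, dim=-1) @ F.normalize(memory, dim=-1).t()  # (B, T, K)
        readout = sim.softmax(dim=-1) @ memory       # attention-weighted prototype readout
        return readout, sim.max(dim=-1).values       # best-match similarity per snippet

    def forward(self, feats):
        norm_read, norm_sim = self.read(feats, self.normal_mem)
        abn_read, abn_sim = self.read(feats, self.abnormal_mem)
        # Augment each snippet with both readouts for downstream anomaly scoring.
        return torch.cat([feats, norm_read, abn_read], dim=-1), norm_sim, abn_sim
```

Training with ambiguous video-level labels would then encourage abnormal snippets to match the abnormal bank more strongly than the normal bank, which is the behaviour the XD-Violence experiments probe.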
For the experimental evaluation, consistent with previous studies, we measure the performance of WSVAD using the area under the receiver operating characteristic curve (AUC) at the frame level on the UCF-Crime dataset. Meanwhile, we use the area under the precision–recall curve (AP) at the frame level for the XD-Violence dataset as the evaluation metric. Higher AUC and AP values indicate better network performance.
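For reference, the frame-level AUC and AP described above can be computed with standard scikit-learn routines; the snippet-to-frame expansion helper below is a simplified assumption about how snippet-level scores are broadcast to frames before evaluation.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def snippet_to_frames(snippet_scores, frames_per_snippet=16):
    # Broadcast each snippet-level score to its frames (assumed 16-frame snippets).
    return np.repeat(np.asarray(snippet_scores), frames_per_snippet)

def frame_level_metrics(frame_scores, frame_labels):
    """Frame-level AUC (reported on UCF-Crime) and AP (reported on XD-Violence)."""
    auc = roc_auc_score(frame_labels, frame_scores)            # area under the ROC curve
    ap = average_precision_score(frame_labels, frame_scores)   # area under the PR curve
    return auc, ap
```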
4.2. Experimental Details
We extract snippet features from the I3D [28] model pretrained on the Kinetics-400 dataset. The configuration is consistent for both datasets. During training, a multi-crop aggregation method is employed to obtain the final anomaly scores, with the number of crops set to 10 for UCF-Crime and 5 for XD-Violence. To ensure robust performance when processing videos of extreme duration (tens of minutes to hours), our framework adopts a hierarchical temporal sampling strategy that manages computational complexity while maintaining anomaly localisation accuracy. Specifically, for videos exceeding 10 min, we first apply uniform frame down-sampling to reduce temporal redundancy while preserving essential motion and appearance cues. Next, we divide the entire video into overlapping temporal segments of fixed length, which are processed independently by the feature extractor and memory module. This segmentation enables the model to capture local spatio-temporal dynamics without being affected by the excessive length of the video. Finally, we aggregate the segment-level anomaly scores by aligning the predicted scores to the original timeline. This hierarchical design keeps the model computationally efficient and minimises the risk of missed anomaly localisation in long videos.
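The following sketch outlines the hierarchical temporal sampling strategy described above: uniform down-sampling for videos longer than 10 minutes, overlapping fixed-length segments, and segment-level scores aggregated back onto the original timeline. The segment length, stride, and down-sampling factor are illustrative values, not the exact settings of our implementation.

```python
import numpy as np

def hierarchical_segments(num_frames, fps, downsample=2, seg_len=512, overlap=0.5):
    """Split a long video into overlapping frame-index segments (illustrative values)."""
    idx = np.arange(num_frames)
    if num_frames / fps > 10 * 60:          # videos longer than 10 minutes
        idx = idx[::downsample]             # uniform temporal down-sampling
    stride = max(int(seg_len * (1 - overlap)), 1)
    return [idx[s:s + seg_len] for s in range(0, max(len(idx) - seg_len, 0) + 1, stride)]

def aggregate_segment_scores(segments, segment_scores, num_frames):
    """Map per-frame scores of each segment back to the original timeline;
    frames covered by several overlapping segments take the mean score."""
    acc, cnt = np.zeros(num_frames), np.zeros(num_frames)
    for frames, scores in zip(segments, segment_scores):
        acc[frames] += scores
        cnt[frames] += 1
    return acc / np.maximum(cnt, 1)
```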
The optimal values for key hyperparameters were determined via a systematic grid search on a held-out validation set. The model is trained with a learning rate of 0.0001 and a batch size of 64 over 3000 iterations, using the same settings for both UCF-Crime and XD-Violence. The four loss weights (the memory loss weight, the triplet loss weight, the KL divergence loss weight, and the distance loss weight, analysed in Section 4.5) are set to (0.1, 0.1, 0.001, 0.0001) to ensure a balanced total loss; after 3000 training iterations on each dataset, the loss value of the model stabilises, as illustrated in Figure 5. The optimizer used is Adam. Our experiments were conducted on a machine equipped with an Intel(R) Xeon(R) Gold 6430 CPU (Intel Corporation, Santa Clara, CA, USA) and an NVIDIA GeForce RTX 4090 GPU (NVIDIA Corporation, Santa Clara, CA, USA), using CUDA 12.1, Python 3.9.21, and PyTorch 2.1.0.
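A minimal stand-in for the optimisation setup reported above (Adam, learning rate 1e-4, batch size 64, 3000 iterations); the tiny scoring head and random features below are placeholders so the snippet runs on its own and do not reflect the actual network or data pipeline.

```python
import torch
import torch.nn as nn

# Placeholder scoring head and synthetic features; only the optimiser settings
# (Adam, lr 1e-4, batch 64, 3000 iterations) mirror the reported configuration.
model = nn.Sequential(nn.Linear(1024, 512), nn.ReLU(), nn.Linear(512, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for step in range(3000):
    feats = torch.randn(64, 1024)                      # stand-in snippet features
    labels = torch.randint(0, 2, (64, 1)).float()      # video-level weak labels
    scores = torch.sigmoid(model(feats))
    loss = nn.functional.binary_cross_entropy(scores, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```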
4.5. Ablation Study
To systematically assess the efficacy of each core component and to attribute performance improvements to specific mechanisms rather than mere structural complexity, we conducted a series of ablation experiments that follow the modular design of the framework. Starting from a baseline model (Baseline) consisting solely of the backbone network, we sequentially incorporate the HMTE, PGRN, DFRL, and DS-AEMN modules and monitor the resulting performance variations to establish a clear causal relationship. The experimental findings are presented in Table 8.
Analysis of the baseline and individual modules: The baseline model attained an AUC of 86.26% on UCF-Crime and an AP of 82.80% on XD-Violence. We then assessed the individual contribution of each module. Introducing the HMTE module alone improves performance to 86.37% and 84.00%, illustrating the standalone benefit of capturing local temporal dependencies through temporal convolutional networks. Introducing the DFRL module as a standalone support structure also improves performance, validating the efficacy of residual connections in stabilising training and facilitating the learning of robust features. Incorporating the core DS-AEMN module yields the largest single-module gains, 86.60% and 84.35%, confirming that our fundamental principle of “attention first, then memory” is a highly effective strategy for differentiating ambiguous anomalous events and exhibits strong independent discriminative capability.
Examination of the synergistic impact of multi-module integration: Having confirmed the independent validity of each module, we further investigated their synergistic effects. Integrating HMTE and DFRL raises the AUC on UCF-Crime to 86.99%, demonstrating that a stronger feature backbone and improved temporal modelling jointly establish a more robust feature foundation for subsequent processing. The core DS-AEMN module provides substantial performance improvements when combined with the other modules: Base + HMTE + DS-AEMN attains an 85.30% AP on XD-Violence. This illustrates the synergy between the modules: given an informative temporal context (provided by HMTE), the DS-AEMN module can perform attentional focusing and memory matching with greater precision, showing that its efficacy is enhanced not only in isolation but also within the collaborative framework.
The full model, incorporating all modules, attained optimal performance with an AUC of 87.43% and an AP of 85.51%. To specifically isolate the contribution of the PGRN, we conducted an additional experiment in which only the PGRN module was removed from the full model. As shown in Table 8, this led to a notable performance drop to 87.18% AUC on UCF-Crime and 85.12% AP on XD-Violence. This result quantitatively confirms the contribution of PGRN in capturing long-range, content-driven global dependencies, which complement the local features from HMTE. The outcome also illustrates the synergistic interaction among the modules: HMTE captures the temporal context, PGRN establishes global relations, DFRL ensures the depth and stability of feature extraction, and DS-AEMN identifies anomalies through accurate prototype matching within this framework. Each component addresses the problem from a distinct perspective, and their combination is a synergistic partnership rather than a mere superposition of gains, culminating in the best overall performance.
To further verify the contribution of each component within our dual-stream attention mechanism, we performed a fine-grained ablation study, as shown in Table 9. The results indicate that incorporating either the spatial or the channel attention branch individually yields a performance gain over the baseline, confirming the effectiveness of each component. Critically, the synergistic use of both branches achieves the best performance on both datasets. This strongly demonstrates that our dual-stream design is more than a simple stacking of components; it effectively captures complementary ‘where’ (spatial) and ‘what’ (channel) information to achieve efficient feature purification, thereby validating the rationale and necessity of our approach.
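To make the ‘where’ versus ‘what’ distinction concrete, the sketch below shows one simple way a spatial (snippet-weighting) branch and a channel branch could jointly gate snippet features. It is an illustration of the dual-stream idea under assumed dimensions and layer choices, not our exact DS-AEMN implementation.

```python
import torch
import torch.nn as nn

class DualStreamGate(nn.Module):
    """Illustrative dual-branch gating over snippet features of shape (B, T, D):
    one branch decides 'where' (which snippets matter), the other 'what'
    (which channels matter); dimensions and layers are assumptions."""
    def __init__(self, dim=1024):
        super().__init__()
        self.where_gate = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())   # spatial/temporal branch
        self.what_gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())  # channel branch

    def forward(self, x):                                   # x: (B, T, D)
        where = self.where_gate(x)                          # (B, T, 1) snippet weights
        what = self.what_gate(x.mean(dim=1, keepdim=True))  # (B, 1, D) channel weights
        return x * where * what                             # purified features
```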
The efficacy of the composite loss function (Equation (8)) depends in part on the weights assigned to each loss term. We therefore conducted a systematic sensitivity analysis on both the UCF-Crime and XD-Violence datasets to show that the selected hyperparameters (the memory loss weight, the triplet loss weight, the KL divergence loss weight, and the distance loss weight) do not overfit a single dataset but represent a generalisable choice that performs robustly across differing data characteristics. The tests use the control-variable method, varying one parameter while holding all others at their final selected values.
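As a hedged reading of the weighted objective analysed here (Equation (8)), the helper below sums a classification loss with the four auxiliary terms using the selected weights (0.1, 0.1, 0.001, 0.0001); the term names and the unweighted classification term are assumptions based on the description in this section, not the paper's exact notation. The sensitivity analysis then amounts to varying one of these weights while keeping the others fixed.

```python
def composite_loss(cls_loss, mem_loss, triplet_loss, kl_loss, dist_loss,
                   w_mem=0.1, w_tri=0.1, w_kl=0.001, w_dist=0.0001):
    """Assumed form of the weighted total loss: classification term plus four
    weighted auxiliary terms (memory, triplet, KL divergence, distance)."""
    return (cls_loss + w_mem * mem_loss + w_tri * triplet_loss
            + w_kl * kl_loss + w_dist * dist_loss)
```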
Impact of the memory loss weight: Figure 7 illustrates that the effect of the memory loss weight follows a very similar pattern across both datasets. Performance on both UCF-Crime and XD-Violence peaks at the selected value of 0.1. When the weight is very low (0.01), the supervision imposed on the memory module is insufficient, so its capacity is under-utilised. Conversely, when the weight exceeds 0.5, the large auxiliary loss interferes with the primary classification objective, and performance declines on both datasets. This consistency across datasets strongly indicates that the chosen memory loss weight is a robust and near-optimal setting, irrespective of data distribution.
Effect of the triplet loss weight: The analysis of the triplet loss weight shows a comparable trend. Removing this term (setting its weight to 0) results in a notable decline in performance on both datasets, confirming the importance of explicitly enforcing a metric structure on the feature space to improve model generalisation. Optimal performance is reliably achieved at the selected value of 0.1, whereas larger weights degrade performance due to over-regularisation.
Combined impact of the KL divergence and distance losses: The KL divergence and distance losses act as finer-grained regularisation terms, which we ablate jointly. The results indicate that both datasets exhibit a minor decline in performance relative to the optimal setting when both losses are removed simultaneously. Although the effect is smaller than that of the memory and triplet loss weights, it suggests that these two regularisation terms, used to stabilise the latent space and explicitly separate features, are generally beneficial for the final fine-tuning of model performance.
This series of ablation experiments on two datasets with distinct characteristics establishes a robust empirical basis for our final hyperparameter choices. The consistent performance patterns in the results suggest that the selected weights are not “fortunate values” that overfit a particular data distribution, but comparatively resilient choices that balance the loss terms and enhance the generalisation capability of the model.
4.6. Qualitative Results
To visualise the efficacy of our method more intuitively and to substantiate the preceding quantitative analysis, we present qualitative results on the UCF-Crime and XD-Violence datasets.
The anomaly score curves depicted in Figure 8 provide visual confirmation of the temporal modelling proficiency of our model. In particular, the figure provides qualitative evidence of the ability of the model to handle anomalies of varying durations. For instance, the model successfully localises both a long-duration event with a gradually rising score (left plot) and a series of brief, explosive anomalies with sharp, spiky scores (right plot). This demonstrates that our HMTE module, designed to capture both long- and short-term dependencies, is effective in practice. The close alignment of the model's scores with the actual anomalous periods (shown by the pink shaded areas) indicates that, even in a weakly supervised setting without precise frame-level annotations, our approach can learn reliable spatio-temporal patterns and achieve accurate anomaly localisation. A detailed quantitative analysis of this behaviour is a valuable direction for future work.
Figure 9 illustrates the performance of our method on Fighting028_x264 and Vandalism004_x264 from the UCF-Crime dataset, offering the most direct visual demonstration of the background suppression capability of our model. The figure shows that, after applying the dual-stream attention mechanism (from (c) to (d)), the focus of the network progressively shifts towards the primary subjects in the frame (e.g., the people fighting) and away from scattered background regions. This incremental purification successfully mitigates the influence of irrelevant background information, improving the accuracy with which the model learns video features and ultimately enhancing overall recognition performance.