1. Introduction
Violent behavior poses a direct threat to public safety, jeopardizing lives and property, and the timely detection of such incidents is crucial for mitigating harm and preventing further escalation. In recent years, the development of deep learning, particularly convolutional neural networks (CNNs), has led to significant breakthroughs in video-based violence detection (VVD). Consequently, developing efficient VVD methods has become an urgent priority for addressing emerging security threats.
Early research primarily adopted unsupervised approaches [1,2]. These methods constructed models of normal behavior using non-violent video data, subsequently detecting violence by identifying deviations in feature space from the learned normal patterns. However, the lack of violent video data during training often resulted in high false-positive rates, particularly in complex or dynamic scenes [3].
To balance annotation costs and detection performance, weakly supervised VVD (WSVVD) was introduced, requiring only video-level labels (i.e., indicating the presence or absence of violence). For example, Zhu et al. proposed the inter-clip feature similarity-based video violence detection (IFS-VVD) method [4], which leveraged a multi-scale temporal multi-layer perceptron (MLP) to integrate global and local temporal relations and improve detection performance. Wu et al. proposed the STPrompt method [5], which learned temporal prompt embeddings for violence detection and localization using pre-trained vision-language models. However, existing WSVVD methods, particularly those based on the Multiple-Instance Learning (MIL) framework, rely predominantly on features extracted from the temporal domain (e.g., I3D or C3D features). Such features are effective at capturing global, low-frequency motion patterns, but they often smooth over or lose the subtle, abrupt high-frequency variations that characterize the instantaneous start and end of a violent event (e.g., a sudden strike or a rapid fall). This over-reliance on low-frequency temporal information compromises the model's capacity for accurate frame-level localization. For instance, methods relying solely on temporal averaging often struggle to distinguish a fast, non-violent movement from a sudden violent strike, leading to a temporal offset in the predicted anomaly boundaries. This limitation manifests as a significant gap between video-level classification accuracy and frame-level localization precision (e.g., AUC/AP scores), which hinders real-world application. Our goal is to reduce this localization error by introducing complementary frequency cues.
To address this critical limitation, this paper introduces the Wavelet-Based Time–Frequency Feature Fusion (WTFF) method. WTFF adopts a cascaded dual-module architecture consisting of the Wavelet-Dilated Separable Convolution Module (WDCM) for frequency-domain feature extraction and the Time–Frequency Feature Fusion (TFFF) network for cross-domain feature integration. The primary advantage of WTFF lies in its use of frequency-domain analysis to extract features that complement standard temporal features: the wavelet transform allows the model to decompose the video signal and capture the detailed, high-frequency motion characteristics essential for precise event boundary detection. The TFFF network then fuses the extracted temporal and spectral features, leveraging their complementary nature to generate more discriminative representations for violence detection. Maintaining separate feature extraction branches prior to fusion not only allows temporal and frequency-domain features to be captured simultaneously but also preserves the distinct characteristics of each domain, so that the fused representation integrates both low-frequency temporal context and high-frequency motion details, significantly boosting the accuracy of weakly supervised violence localization. Experimental results on the UCF-Crime, XD-Violence, and ShanghaiTech datasets demonstrate the effectiveness of the proposed method. It should be noted that this study focuses on enhancing the visual feature representation by exploring the time–frequency domain, setting aside multimodal fusion for future investigation.
The main contributions of this manuscript can be summarized as follows:
We propose a novel framework named Wavelet-Based Time–Frequency Feature Fusion (WTFF) for weakly supervised video violence detection, which addresses the limitations of purely temporal feature analysis by incorporating frequency-domain information.
We design the Wavelet-Dilated Separable Convolution Module (WDCM), which innovatively employs the wavelet transform to decompose the video features and effectively isolate and enhance the high-frequency components related to subtle and abrupt violent motions.
We introduce the Time–Frequency Feature Fusion (TFFF) Network to achieve a complementary feature fusion, ensuring that the final anomaly score benefits from the synergistic integration of both low-frequency temporal context and high-frequency motion details.
Extensive experiments on three challenging benchmarks (UCF-Crime, XD-Violence, and ShanghaiTech) demonstrate that our proposed WTFF method achieves superior frame-level anomaly localization performance compared to state-of-the-art methods.
The structure of this paper is arranged as follows. Section 2 reviews related work on violence detection. Section 3 elaborates the design and implementation details of the proposed WDCM and TFFF. Section 4 presents the experimental results and analysis. Finally, Section 5 concludes the study and discusses potential directions for future research.
2. Related Works
2.1. Unsupervised Video-Based Violent Behavior Detection
In early research, unsupervised video violence detection gained significant attention due to the difficulty in collecting and annotating large-scale video data of violent events. Unsupervised Violence Behavior Detection (UVBD) aimed to identify data points that deviated significantly from normal behavior patterns by learning the distribution or characteristics of non-violent data. This field held significant importance in video surveillance applications [6]. UVBD methods could be broadly categorized into traditional methods and deep learning-based methods.
Specifically, traditional UVBD methods were typically based on data's statistical properties, inter-sample proximity, or data distribution. For example, Breunig et al. [7] proposed the Local Outlier Factor (LOF) method, which identified deviations from normal patterns by measuring the local density deviation of a data point. Ester et al. [8] proposed the DBSCAN clustering method, which regarded points not belonging to any known cluster or forming very small clusters as anomalies. Schölkopf et al. [9] proposed the One-Class Support Vector Machine method, which built a model for normal data and labeled points with low probability as anomalies (in UVBD, these might indicate violent behavior). However, these traditional methods often faced challenges in computational efficiency and performance when dealing with high-dimensional, complex-structured, or large-scale data.
In recent years, deep learning's powerful representation capabilities have spurred the development of unsupervised violence detection, especially for complex data such as images and videos. Building on these capabilities, researchers have developed various deep learning-based methods for UVBD. For example, Sakurada et al. [10] applied autoencoders to this task, training deep models to precisely learn the encoding–decoding process of normal data; the models minimized reconstruction errors for normal samples, while the significantly higher reconstruction errors produced by violent events were used for violence detection. Schlegl et al. [11] proposed the AnoGAN and f-AnoGAN methods, which learned the manifold of normal data, identified the closest reconstruction for violence samples on this manifold, and detected violent events using reconstruction errors or latent-space discrepancies.
Although unsupervised violence detection methods achieved a series of advancements, because they relied primarily on learning the distribution or characteristics of normal data, they still struggled to identify subtle violent events that did not deviate significantly from normal patterns and to generalize to novel violent patterns that had not appeared in the training data.
2.2. Weakly Supervised Video Violence Detection
While unsupervised violence detection methods made progress by learning normal data distributions, the inherent absence of violent video data during training often led to high false-positive rates. They also faced challenges in detecting subtle deviations from normal patterns and in generalizing to novel violent patterns, which together prompted the development of weakly supervised video-based violence detection (WSVVD).
To balance the trade-off between annotation costs and detection performance, WSVVD emerged as a mainstream solution. It uses video-level binary labels (indicating the presence or absence of violence) to guide model learning without requiring costly frame-level annotations, while these labels also provide implicit cues for identifying violent segments, thereby addressing the key limitations of unsupervised methods. Currently, mainstream WSVVD methods focus on two core aspects: learning to locate abnormal segments from video-level labels (often termed temporal localization) and effectively modeling the temporal dynamics of violence patterns.
Early work by Sultani et al. [12] pioneered the application of Multiple-Instance Learning (MIL) to this task, treating videos as 'bags' of segments and employing attention mechanisms for violence detection. Subsequent research advanced the MIL paradigm; for example, Shin et al. [13] integrated MIL with temporal attention and self-supervised learning for refined localization and feature extraction. Other WSVVD methods focused on advanced temporal modeling. For example, Ren et al. [14] employed temporal convolutional networks (TCNs) within their WSVVD framework to capture both local and global temporal dependencies for improved violence localization; Zhai et al. [15] proposed a WSVVD method utilizing transformers to enable long-range temporal reasoning about violent patterns under weak supervision; Gao et al. [16] introduced temporal graph neural networks (T-GNNs) to model spatiotemporal relationships between weakly labeled video segments; and Zhang et al. [17] designed a multi-scale temporal fusion network (MSTFN) that integrates short-term motion patterns with long-term semantic representations for violence detection. Tan et al. [18] proposed a deep differentiation segmentation neural network for video-based foreign object detection in urban rail transit, which enhances detection accuracy through attention mechanisms and morphological post-processing.
2.3. Feature Fusion-Based Weakly Supervised Video Violence Detection
To comprehensively characterize violent events, feature fusion has emerged as an essential research direction for WSVVD. By integrating different types of feature representations, such methods capture complementary manifestations of violent events simultaneously and enhance detection performance in complex scenarios. On this basis, researchers have proposed various feature fusion-based methods. For example, Yang et al. [19] introduced TADST, which integrated temporal convolutional networks with graph attention influence networks to effectively capture spatiotemporal dependencies, and further employed an encoder–decoder structure with adaptive violence detection to enhance temporal sequence modeling. Liang et al. [20] proposed the STFFE framework, which improved violence feature discrimination through temporally constrained fusion of the top-k most salient video segments. Jeon et al. [21] developed FEFM, combining a multi-dimensional feature extractor with a temporal feature extractor to jointly capture spatial and temporal dynamics, followed by fusion layers for robust violence detection. Cai et al. [22] proposed a multi-scale predictive framework, which extracted features with different receptive fields to capture spatial information and incorporated temporal constraints through ConvGRU modules for spatio-temporal feature fusion in video violence detection.
In summary, although existing methods based on feature fusion have advanced violence detection, they primarily rely on temporal analysis, often overlooking critical frequency-domain features. However, most violent events can be characterized by distinct frequency-domain patterns, such as high-frequency texture variations, which are challenging for single-domain analysis to fully capture. To address this limitation, this paper proposes a Wavelet-Based Time–Frequency Feature Fusion method for violence detection, utilizing a novel cascaded dual-module architecture.
2.4. Frequency-Domain Analysis and Feature Learning in Video Violence Detection
Frequency-domain analysis, traditionally powerful in signal processing, has recently gained traction in advanced video feature learning. Over the past three years in particular, several state-of-the-art studies have begun exploring this domain to address the limitations of purely temporal models. For instance, Li et al. [23] proposed the Frequency-Enhanced and Decomposed Transformer for Violent Behavior Detection (FDTAD), which integrates time-domain and frequency-domain decomposition within a transformer architecture to enhance model generalization and reduce false positives in unstable multivariate time series data. Chen et al. [24] proposed the LTFAD model, a lightweight All-MLP time–frequency violent behavior detection framework that achieves high efficiency and accuracy in IIoT time series analysis through dual-branch reconstruction and time–frequency joint learning. Xu et al. [25] proposed the FCDATA model, which enhances time series violent behavior detection by integrating frequency-domain rectification with a comprehensive dependency-aware attention mechanism to capture both synchronous and asynchronous inter-variable relationships. Zhang et al. [26] proposed the FreCT model, which enhances time series violent behavior detection by integrating frequency-domain analysis with a convolutional transformer to jointly capture long-term dependencies and local topology information.
3. Methodology
3.1. The Overall Framework
The overall structure of the proposed Wavelet-Based Time–Frequency Feature Fusion (WTFF) framework is illustrated in Figure 1. Unlike previous weakly supervised approaches that primarily focus on temporal cues, WTFF introduces a frequency-domain analysis branch to capture subtle variations such as abrupt motion, texture discontinuities, and illumination changes, which are common in violent events. The motivation is that temporal-only modeling (e.g., with RNN or TCN) tends to overlook these fine-grained spatial fluctuations that appear as high-frequency patterns in the spectral domain.
As shown in Figure 1, the proposed WTFF framework comprises two core components: the Wavelet-Dilated Separable Convolution Module (WDCM) and the Time–Frequency Feature Fusion (TFFF) network. These two modules are jointly optimized in an end-to-end manner to integrate temporal and frequency-domain representations for more discriminative violence detection. The detailed architectures of WDCM and TFFF are introduced in Section 3.2 and Section 3.3, respectively.
The WTFF pipeline proceeds as follows: untrimmed videos are first divided into non-overlapping 16-frame segments. The features $x$ of each segment are extracted through a pre-trained I3D network [27] and then processed in parallel by the Temporal Context Aggregation (TCA) module [27] and the WDCM to obtain temporal features $F_t$ and frequency-domain features $F_f$, respectively. Next, the TFFF module fuses $F_t$ and $F_f$ into $F_{tf}$. The fused representation is then passed through a two-layer MLP for dimensionality reduction, followed by a classifier that produces the segment-level violence scores $S$.
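To make the dataflow above concrete, the following PyTorch sketch mirrors the pipeline under explicit assumptions: the TCA, WDCM, and TFFF blocks are replaced by simple stand-in layers (their actual designs are detailed in Section 3.2 and Section 3.3), and the feature and hidden dimensions are illustrative rather than the values used in the paper.

```python
# Minimal sketch of the WTFF pipeline described above (not the authors' code).
# Module names (TCA, WDCM, TFFF) follow the paper; their bodies here are
# simplified stand-ins, and all layer sizes are illustrative assumptions.
import torch
import torch.nn as nn

class WTFF(nn.Module):
    def __init__(self, feat_dim=1024, hidden_dim=512):
        super().__init__()
        # Stand-ins for the TCA and WDCM branches (details in Sections 3.2-3.3).
        self.tca = nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=1)   # temporal branch
        self.wdcm = nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=1)  # frequency branch
        self.fuse = nn.Conv1d(2 * feat_dim, feat_dim, kernel_size=1)         # TFFF stand-in
        self.mlp = nn.Sequential(nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, hidden_dim // 4), nn.ReLU())
        self.classifier = nn.Linear(hidden_dim // 4, 1)

    def forward(self, x):                      # x: (B, T, D) I3D segment features
        x = x.transpose(1, 2)                  # (B, D, T) for 1D convolutions
        f_t = self.tca(x)                      # temporal features F_t
        f_f = self.wdcm(x)                     # frequency-domain features F_f
        f_tf = self.fuse(torch.cat([f_t, f_f], dim=1)).transpose(1, 2)  # fused (B, T, D)
        s = torch.sigmoid(self.classifier(self.mlp(f_tf))).squeeze(-1)  # segment scores S
        return s                               # (B, T)

scores = WTFF()(torch.randn(2, 32, 1024))      # 2 videos, 32 segments each
print(scores.shape)                            # torch.Size([2, 32])
```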
The role of WTFF is dual. First, through the WDCM, the input temporal features are decomposed using the wavelet transform to isolate and enhance the high-frequency components of the video signal. These components correspond to rapid, short-duration motion and texture changes (e.g., sudden impact, quick body movements) that are highly indicative of violence but often suppressed by conventional low-pass operations. Second, the TFFF network systematically fuses the high-frequency features (from WDCM) with the baseline low-frequency temporal features (e.g., from the backbone network), forming a comprehensive spatio-temporal-frequency representation that captures both global context and transient motion details.
In essence, the advantage of WTFF over existing methods lies in its transition from a purely time-domain approach to a time–frequency hybrid modeling paradigm, which is fundamentally more effective for localizing and recognizing anomalies characterized by rapid transients in violent events. This design enables WTFF to simultaneously exploit complementary temporal and frequency cues, thereby improving the robustness and discriminative capability of violence detection in complex real-world surveillance environments.
3.2. Wavelet-Dilated Separable Convolution Module
The WDCM is the core module for extracting frequency-domain representations. It is specifically designed to combine the advantages of wavelet transform and dilated separable convolution. The rationale is twofold: (1) Wavelet transformation decomposes input features into multiple frequency sub-bands, providing localized time–frequency analysis; (2) dilated convolution enlarges the receptive field without increasing parameters, facilitating multi-scale feature aggregation. This combination allows WDCM to model both global context and fine-grained local variations in motion intensity, which are critical for distinguishing violent from non-violent activities.
The WDCM is designed to capture frequency-domain information from video data; it primarily consists of one depth-wise separable convolution layer, a wavelet decomposition and reconstruction pair, and multiple dilated convolutional layers.
The process of the WDCM is illustrated in Figure 2. Initially, a depth-wise separable convolution (DSConv) is applied to the input, which can either be images or feature tensors. Following this, a first-level wavelet decomposition is performed, dividing the feature map into four frequency components: one low-frequency component (LL) and three high-frequency components (LH, HL, HH). Subsequently, separate dilated convolutional operations (DConv) are applied to each of these components. Finally, an inverse wavelet transform is used to reconstruct the processed frequency components. By combining wavelet decomposition with dilated convolutions, this architecture effectively captures frequency-domain features while simultaneously expanding the receptive fields.
3.2.1. Wavelet Transformation
For the wavelet transformation in WDCM, this paper adopts the Daubechies-1 (db1) basis [28]. The wavelet transform is chosen instead of the Fourier transform because it preserves both spatial and frequency localization, and the db1 basis is favored for its simplicity and low computational cost. Alternative wavelet bases (e.g., db2, db3) were evaluated but incurred higher computational overhead without proportional performance gains.
For the db1 wavelet transform implementation, a one-dimensional operation is first performed on the input feature map $F$ along either the width or the height dimension. This single-level decomposition is achieved through depth-wise convolutions with the low-pass and high-pass kernels $f_L = \frac{1}{\sqrt{2}}[1,\ 1]$ and $f_H = \frac{1}{\sqrt{2}}[1,\ -1]$. To extend this to a two-dimensional db1 wavelet decomposition, the 1D transforms are composed along both spatial dimensions. This is implemented using the following four distinct filter groups, applied through depth-wise convolutions with a stride of 2 [28]:
$$f_{LL} = f_L^{\top} f_L,\qquad f_{LH} = f_L^{\top} f_H,\qquad f_{HL} = f_H^{\top} f_L,\qquad f_{HH} = f_H^{\top} f_H. \tag{1}$$
Here, $f_{LL}$ captures the low-frequency (smooth) content, while $f_{LH}$, $f_{HL}$, and $f_{HH}$ capture horizontal, vertical, and diagonal high-frequency details, respectively. Applying these filters through DSConv produces four sub-band feature maps, separating motion-related high-frequency regions from background-dominant low-frequency areas and providing more interpretable features for violence analysis.
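For illustration, the following sketch implements the single-level decomposition of Equation (1) as a stride-2 depth-wise (grouped) convolution in PyTorch. It is a minimal reconstruction from the description above rather than the authors' code, and the exact sign convention and sub-band ordering are assumptions.

```python
# A minimal sketch of the single-level db1 decomposition in Equation (1),
# implemented as a depth-wise convolution with stride 2 (not the authors' code;
# filter sign conventions may differ from the original implementation).
import torch
import torch.nn.functional as F

def dwt_db1(x):
    """x: (B, C, H, W) -> (LL, LH, HL, HH), each of shape (B, C, H/2, W/2)."""
    c = x.shape[1]
    f_l = torch.tensor([1.0, 1.0]) / 2 ** 0.5           # 1D low-pass kernel f_L
    f_h = torch.tensor([1.0, -1.0]) / 2 ** 0.5          # 1D high-pass kernel f_H
    # Outer products give the four 2x2 kernels f_LL, f_LH, f_HL, f_HH.
    kernels = torch.stack([torch.outer(a, b) for a in (f_l, f_h) for b in (f_l, f_h)])
    kernels = kernels.repeat(c, 1, 1).unsqueeze(1)       # (4*C, 1, 2, 2), depth-wise
    y = F.conv2d(x, kernels.to(x), stride=2, groups=c)   # (B, 4*C, H/2, W/2)
    y = y.view(x.shape[0], c, 4, *y.shape[-2:])
    return y[:, :, 0], y[:, :, 1], y[:, :, 2], y[:, :, 3]

ll, lh, hl, hh = dwt_db1(torch.randn(1, 8, 32, 32))
print(ll.shape, hh.shape)                                # (1, 8, 16, 16) each
```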
3.2.2. Wavelet Decomposition
Notably, since the kernels in Equation (1) form an orthogonal basis, the inverse wavelet transform (IWT) can be implemented through transposed convolution,
$$\hat{F} = \mathrm{IWT}(F_{LL}, F_{LH}, F_{HL}, F_{HH}) = \sum_{k \in \{LL, LH, HL, HH\}} \mathrm{TConv}(F_k; f_k), \tag{2}$$
where $\mathrm{TConv}(\cdot\,; f_k)$ denotes a stride-2 transposed depth-wise convolution with kernel $f_k$. In WDCM, the decomposition is recursively applied to the low-frequency component to form a hierarchical representation,
$$\{F^{(l+1)}_{LL},\ F^{(l+1)}_{LH},\ F^{(l+1)}_{HL},\ F^{(l+1)}_{HH}\} = \mathrm{WT}\big(F^{(l)}_{LL}\big), \quad l = 0, 1, \ldots, \tag{3}$$
where $\mathrm{WT}(\cdot)$ denotes the decomposition in Equation (1) and $F^{(0)}_{LL} = F$. Such multi-level decomposition progressively enhances the model's ability to separate fine-grained spectral components while preserving the essential global context. Empirically, a two-level decomposition achieves the best trade-off between efficiency and discriminative power.
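The sketch below illustrates Equations (2) and (3): the inverse transform is realized as a stride-2 transposed depth-wise convolution with the same kernels, and the decomposition is applied recursively to the LL band. It reuses the dwt_db1 helper from the previous sketch and is an assumed reconstruction, not the authors' implementation.

```python
# Sketch of Equation (2) (inverse transform via transposed convolution) and the
# recursive decomposition of Equation (3); reuses dwt_db1 from the sketch above.
# A two-level decomposition is shown, matching the empirical choice in the text.
import torch
import torch.nn.functional as F

def iwt_db1(ll, lh, hl, hh):
    """Inverse of dwt_db1: four (B, C, H, W) sub-bands -> (B, C, 2H, 2W)."""
    b, c, h, w = ll.shape
    f_l = torch.tensor([1.0, 1.0]) / 2 ** 0.5
    f_h = torch.tensor([1.0, -1.0]) / 2 ** 0.5
    kernels = torch.stack([torch.outer(a, b2) for a in (f_l, f_h) for b2 in (f_l, f_h)])
    kernels = kernels.repeat(c, 1, 1).unsqueeze(1).to(ll)        # (4*C, 1, 2, 2)
    y = torch.stack([ll, lh, hl, hh], dim=2).reshape(b, 4 * c, h, w)
    # Orthogonality of the db1 kernels makes the transposed convolution an exact inverse.
    return F.conv_transpose2d(y, kernels, stride=2, groups=c)

def multi_level_dwt(x, levels=2):
    """Recursively decompose the low-frequency band, as in Equation (3)."""
    pyramid = []
    for _ in range(levels):
        ll, lh, hl, hh = dwt_db1(x)                              # from the previous sketch
        pyramid.append((lh, hl, hh))                             # keep high-frequency details
        x = ll                                                   # recurse on the LL band
    return x, pyramid

x = torch.randn(1, 8, 32, 32)
assert torch.allclose(iwt_db1(*dwt_db1(x)), x, atol=1e-5)        # perfect reconstruction
```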
3.2.3. Inverse Wavelet Transformation
After capturing the frequency-domain information of the input data, the WDCM processes the different frequency components through dilated convolutional operations,
$$\tilde{F}_k = \mathrm{DConv}(F_k), \quad k \in \{LL, LH, HL, HH\}. \tag{4}$$
These dilated convolutions (DConv) further model spatial dependencies within each sub-band, where small dilation rates (e.g., 2–4) are used to capture localized motion variations without excessive background interference. The WDCM then reconstructs the final output by combining the convolved frequency components through the IWT,
$$F_f = \mathrm{IWT}\big(\tilde{F}_{LL}, \tilde{F}_{LH}, \tilde{F}_{HL}, \tilde{F}_{HH}\big). \tag{5}$$
This yields a feature map that preserves both low-frequency stability and high-frequency sensitivity. Compared with conventional CNN layers, WDCM provides multi-resolution analysis, improving robustness to illumination changes, occlusion, and noise.
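Putting the pieces together, a condensed WDCM forward pass might look as follows. This sketch reuses the dwt_db1 and iwt_db1 helpers from the previous sketches, shows only a single decomposition level, and treats the channel count and dilation rate as illustrative assumptions rather than the paper's settings.

```python
# A condensed sketch of the WDCM dataflow (Figure 2): depth-wise separable
# convolution, db1 decomposition, per-sub-band dilated convolution, and inverse
# reconstruction. Not the authors' implementation; channel sizes, dilation rate,
# and the single decomposition level shown here are illustrative assumptions.
import torch
import torch.nn as nn

class WDCM(nn.Module):
    def __init__(self, channels=64, dilation=2):
        super().__init__()
        # Depth-wise separable convolution = depth-wise conv + point-wise conv.
        self.dsconv = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels),
            nn.Conv2d(channels, channels, 1))
        # One dilated convolution per frequency sub-band (LL, LH, HL, HH).
        self.dconv = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=dilation, dilation=dilation)
            for _ in range(4)])

    def forward(self, x):                          # x: (B, C, H, W)
        x = self.dsconv(x)
        bands = dwt_db1(x)                         # from the earlier sketch
        bands = [conv(b) for conv, b in zip(self.dconv, bands)]
        return iwt_db1(*bands)                     # frequency-domain features F_f

out = WDCM()(torch.randn(1, 64, 32, 32))
print(out.shape)                                   # torch.Size([1, 64, 32, 32])
```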
3.3. Time–Frequency Feature Fusion Network
After frequency features and temporal features are extracted, the TFFF network fuses them into a unified representation that captures complementary information. The key insight is that violent events often exhibit temporal discontinuities (e.g., sudden acceleration) accompanied by spectral disturbances (e.g., texture deformation). Hence, fusing these modalities enhances discriminative representation.
Figure 3 illustrates the detailed architecture of the TFFF, where $B$ denotes the batch size, $D$ represents the feature dimension, and $T$ indicates the sequence length.
After the temporal features $F_t$ and frequency-domain features $F_f$ are extracted by the TCA [27] and WDCM modules, respectively, the following feature fusion steps are performed. First, the temporal features $F_t$ undergo dimension adjustment via a 2D convolution. Next, the frequency-domain features $F_f$ are normalized using layer normalization. Finally, the processed temporal and frequency-domain features are concatenated along the channel dimension and fused through a 2D convolutional layer. The fused feature is formulated as
$$F_{tf} = \mathrm{Conv2D}\big(\mathrm{Cat}\big(\mathrm{Conv2D}(F_t),\ \mathrm{LN}(F_f)\big)\big), \tag{6}$$
where $\mathrm{Cat}(\cdot,\cdot)$ denotes the channel-wise concatenation operation of tensors and $\mathrm{LN}(\cdot)$ denotes layer normalization.
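A minimal sketch of Equation (6) is given below, assuming the temporal and frequency features arrive as (batch, sequence, dimension) tensors that are reshaped so the 2D convolutions operate over the channel dimension; the kernel sizes and feature dimension are assumptions, not values reported in the paper.

```python
# A minimal sketch of the TFFF fusion step in Equation (6) (not the authors'
# code): a 2D convolution adjusts the temporal features, layer normalization is
# applied to the frequency features, and the concatenated result is fused by a
# second 2D convolution. Tensor shapes and kernel sizes are assumptions.
import torch
import torch.nn as nn

class TFFF(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.adjust = nn.Conv2d(dim, dim, kernel_size=1)      # dimension adjustment of F_t
        self.norm = nn.LayerNorm(dim)                         # normalization of F_f
        self.fuse = nn.Conv2d(2 * dim, dim, kernel_size=1)    # fusion after concatenation

    def forward(self, f_t, f_f):                              # both: (B, T, D)
        f_t = self.adjust(f_t.permute(0, 2, 1).unsqueeze(-1)) # (B, D, T, 1)
        f_f = self.norm(f_f).permute(0, 2, 1).unsqueeze(-1)   # (B, D, T, 1)
        f_tf = self.fuse(torch.cat([f_t, f_f], dim=1))        # channel-wise concat + Conv2D
        return f_tf.squeeze(-1).permute(0, 2, 1)              # back to (B, T, D)

fused = TFFF()(torch.randn(2, 32, 512), torch.randn(2, 32, 512))
print(fused.shape)                                             # torch.Size([2, 32, 512])
```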
To further enhance the semantic consistency, the Prompt-Enhanced Learning (PEL) module [27] from the baseline is retained after fusion. It provides adaptive prompts that guide the model to focus on violence-related patterns. Compared with traditional fusion strategies such as element-wise addition or attention-based fusion, TFFF offers a balanced trade-off between efficiency and representation richness, as verified by ablation studies.
3.4. Loss Function
Following [27], WTFF employs a combination of a binary cross-entropy (BCE) loss and a cross-modal alignment loss [27]. The BCE loss supervises video-level classification, while the alignment term encourages consistency between the temporal and frequency modalities. Specifically, the video-level prediction $\hat{y}_i$ is determined by averaging the top-$k$ segment-level violence scores. Given a mini-batch of $B$ samples with video-level labels $y_i \in \{0, 1\}$, the binary cross-entropy loss is formulated as
$$\mathcal{L}_{\mathrm{BCE}} = -\frac{1}{B}\sum_{i=1}^{B}\Big[\,y_i \log \hat{y}_i + (1 - y_i)\log\big(1 - \hat{y}_i\big)\Big]. \tag{7}$$
During the training phase, the overall objective function of our model is formulated as
$$\mathcal{L} = \mathcal{L}_{\mathrm{BCE}} + \lambda\,\mathcal{L}_{\mathrm{align}}, \tag{8}$$
where $\mathcal{L}_{\mathrm{align}}$ denotes the Kullback–Leibler divergence-based alignment loss and $\lambda$ controls the balance between classification and alignment. This joint optimization encourages WTFF to learn complementary and coherent representations across both the time and frequency domains.
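As a sketch of the overall objective, the snippet below combines a top-k-based BCE term with a KL-divergence alignment term. The specific alignment formulation used here (a KL divergence between the two branches' segment-score distributions) is a simplifying assumption; the actual cross-modal alignment loss follows [27].

```python
# A sketch of the training objective in Equations (7)-(8), assuming the KL-based
# alignment term compares temporal and frequency score distributions; the exact
# alignment formulation follows [27] and may differ from this simplification.
import torch
import torch.nn.functional as F

def wtff_loss(scores, labels, f_t_scores, f_f_scores, k=8, lam=1.0):
    """scores: (B, T) segment scores, labels: (B,) video-level labels in {0, 1}."""
    # Video-level prediction: average of the top-k segment scores.
    y_hat = scores.topk(k, dim=1).values.mean(dim=1)
    bce = F.binary_cross_entropy(y_hat, labels.float())
    # KL-divergence alignment between the two branches' score distributions.
    align = F.kl_div(F.log_softmax(f_t_scores, dim=1),
                     F.softmax(f_f_scores, dim=1), reduction='batchmean')
    return bce + lam * align

loss = wtff_loss(torch.rand(2, 32), torch.tensor([1, 0]),
                 torch.randn(2, 32), torch.randn(2, 32))
print(loss.item())
```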
5. Conclusions
In this paper, we proposed the Wavelet-Based Time–Frequency Feature Fusion (WTFF) method, a novel approach designed to enhance the accuracy of frame-level anomaly localization in weakly supervised video violence detection. The core innovation of WTFF lies in moving beyond traditional temporal analysis by systematically incorporating frequency-domain features to capture the subtle, high-frequency motion transients inherent in violent events. Specifically, the Wavelet-Dilated Separable Convolution Module (WDCM) was introduced to decompose the video features, effectively isolating the high-frequency components that are critical for pinpointing the exact onset and cessation of anomalous behaviors. Furthermore, the Time–Frequency Feature Fusion (TFFF) Network ensured a complementary integration between these detailed high-frequency cues and the overall low-frequency temporal context. The superiority of our proposed method was comprehensively validated through extensive experiments on three large-scale public benchmarks. On the UCF-Crime dataset, our WTFF method achieved a state-of-the-art AUC of approximately 85.87%, demonstrating robust frame-level anomaly localization. On the XD-Violence dataset, our approach attained an AP of approximately 84.77%, highlighting its effectiveness in accurately localizing violent events under challenging conditions. On the ShanghaiTech dataset, our method reached an AP of approximately 97.91%, further validating its generalizability and precision across diverse scenes. In addition, the ablation study clearly confirmed that the integration of the WDCM alone leads to significant performance improvements, directly supporting the hypothesis that frequency-domain features are critical for robust violence detection. In summary, this work successfully leverages the multi-resolution power of the wavelet transform to solve the long-standing problem of coarse localization in WS-VAD, offering a powerful new direction for feature representation learning. In future work, we plan to explore adaptive wavelet selection and extend the WTFF framework to other multi-modal anomaly detection tasks. In future work, we plan to explore adaptive wavelet selection and extend the WTFF framework to fuse visual time–frequency features with audio features for robust multimodal anomaly detection.