1. Introduction
Violent behavior poses a direct threat to public safety, jeopardizing lives and property, and the timely detection of such incidents is crucial for mitigating harm and preventing further escalation. In recent years, the development of deep learning, particularly convolutional neural networks (CNNs), has led to significant breakthroughs in video-based violence detection (VVD). Consequently, developing efficient VVD methods has become an urgent priority for addressing emerging security threats.
Early research primarily adopted unsupervised approaches [1,2]. These methods constructed models of normal behavior using non-violent video data, subsequently detecting violence by identifying deviations in feature space from the learned normal patterns. However, the lack of violent video data during training often resulted in high false-positive rates, particularly in complex or dynamic scenes [3].
To balance annotation costs and detection performance, weakly supervised VVD (WSVVD) was introduced, requiring only video-level labels (i.e., indicating the presence or absence of violence). For example, Zhu et al. proposed the inter-clip feature similarity-based video violence detection (IFS-VVD) method [4], which leveraged a multi-scale temporal multi-layer perceptron (MLP) to integrate global and local temporal relations and improve detection performance. Wu et al. proposed the STPrompt method [5], which learned temporal prompt embeddings for violence detection and localization using pre-trained vision-language models. However, existing WSVVD methods, particularly those based on the Multiple-Instance Learning (MIL) framework, rely predominantly on features extracted from the temporal domain (e.g., I3D or C3D features). Such features are effective at capturing global, low-frequency motion patterns, but they often smooth over or lose the subtle, abrupt high-frequency variations that characterize the instantaneous start and end of a violent event (e.g., a sudden strike or a rapid fall). This over-reliance on low-frequency temporal information compromises the model's capacity for accurate frame-level localization. For instance, methods relying solely on temporal averaging often struggle to distinguish a fast, non-violent movement from a sudden violent strike, leading to a temporal offset in the predicted anomaly boundaries. This limitation manifests as a significant gap between video-level classification accuracy and frame-level localization precision (e.g., AUC/AP scores), which hinders real-world application. Our goal is to reduce this localization error by introducing complementary frequency cues.
To address this critical limitation, this paper introduces the Wavelet-Based Time–Frequency Feature Fusion (WTFF) method. WTFF adopts a cascaded dual-module architecture consisting of the Wavelet-Dilated Separable Convolution Module (WDCM) for frequency-domain feature extraction and the Time–Frequency Feature Fusion (TFFF) network for cross-domain feature integration. The primary advantage of WTFF lies in its use of frequency-domain analysis to extract features that complement standard temporal features: the wavelet transform allows the model to decompose the video signal and capture the detailed, high-frequency motion characteristics essential for precise event boundary detection. The TFFF network then fuses the extracted temporal and spectral features, leveraging their complementary nature to generate more discriminative representations for violence detection. Maintaining separate feature extraction branches prior to fusion not only allows temporal and frequency-domain features to be captured simultaneously but also preserves the distinct characteristics of each domain, so that the fused representation integrates both low-frequency temporal context and high-frequency motion details, significantly boosting the accuracy of weakly supervised violence localization. Experimental results on the UCF-Crime, XD-Violence, and ShanghaiTech datasets demonstrate the effectiveness of the proposed method. It should be noted that this study focuses on enhancing the visual feature representation by exploring the time–frequency domain, setting aside multimodal fusion for future investigation.
The main contributions of this manuscript can be summarized as follows:
We propose a novel framework named Wavelet-Based Time–Frequency Feature Fusion (WTFF) for weakly supervised video violence detection, which addresses the limitations of purely temporal feature analysis by incorporating frequency-domain information.
We design the Wavelet-Dilated Separable Convolution Module (WDCM), which innovatively employs the wavelet transform to decompose the video features and effectively isolate and enhance the high-frequency components related to subtle and abrupt violent motions.
We introduce the Time–Frequency Feature Fusion (TFFF) Network to achieve a complementary feature fusion, ensuring that the final anomaly score benefits from the synergistic integration of both low-frequency temporal context and high-frequency motion details.
Extensive experiments on three challenging benchmarks (UCF-Crime, XD-Violence, and ShanghaiTech) demonstrate that our proposed WTFF method achieves superior frame-level anomaly localization performance compared to state-of-the-art methods.
The structure of this paper is arranged as follows. Section 2 reviews related work on violence detection. Section 3 elaborates the design and implementation details of the proposed WDCM and TFFF. Section 4 presents the experimental results and analysis. Finally, Section 5 concludes the study and discusses potential directions for future research.
2. Related Works
2.1. Unsupervised Video-Based Violent Behavior Detection
In early research, unsupervised video violence detection gained significant attention due to the difficulty in collecting and annotating large-scale video data of violent events. Unsupervised Violence Behavior Detection (UVBD) aimed to identify data points that deviated significantly from normal behavior patterns by learning the distribution or characteristics of non-violent data. This field held significant importance in video surveillance applications [6]. UVBD methods could be broadly categorized into traditional methods and deep learning-based methods.
Specifically, traditional UVBD methods were typically based on data's statistical properties, inter-sample proximity, or data distribution. For example, Breunig et al. [7] proposed the Local Outlier Factor (LOF) method, which identified deviations from normal patterns by measuring the local density deviation of a data point. Ester et al. [8] proposed the DBSCAN clustering method, which regarded points not belonging to any known cluster or forming very small clusters as anomalies. Schölkopf et al. [9] proposed the One-Class Support Vector Machine method, which built a model for normal data and labeled points with low probability as anomalies (in UVBD, these might indicate violent behavior). However, these traditional methods often faced challenges in computational efficiency and performance when dealing with high-dimensional, complex-structured, or large-scale data.
In recent years, deep learning's powerful representation capabilities have spurred the development of unsupervised violence detection, especially for complex data such as images and videos. Building on these capabilities, researchers have developed various deep learning-based methods for UVBD. For example, Sakurada et al. [10] applied autoencoders to this task, training deep models to precisely learn the encoding–decoding process of normal data; the models minimized reconstruction errors for normal samples, while the significantly higher reconstruction errors produced by violent events were used for violence detection. Schlegl et al. [11] proposed the AnoGAN and f-AnoGAN methods, which learned the manifold of normal data, identified the closest reconstruction for violence samples on this manifold, and detected violent events using reconstruction errors or latent-space discrepancies.
Although unsupervised violence detection methods achieved a series of advancements, because they relied primarily on learning the distribution or characteristics of normal data, they still struggled to identify subtle violent events that did not deviate significantly from normal patterns and to generalize to novel violent patterns that had not appeared in the training data.
2.2. Weakly Supervised Video Violence Detection
While unsupervised violence detection methods made progress by learning normal data distributions, the inherent absence of violent video data during training often led to high false-positive rates. They also faced challenges in detecting subtle deviations from normal patterns and in generalizing to novel violent patterns, which together prompted the development of weakly supervised video-based violence detection (WSVVD).
To balance the trade-off between annotation costs and detection performance, WSVVD emerged as a mainstream solution. It uses video-level binary labels (indicating the presence or absence of violence) to guide model learning without requiring costly frame-level annotations, while these labels also provide implicit cues for identifying violent segments, thereby addressing the key limitations of unsupervised methods. Currently, mainstream WSVVD methods focus on two core aspects: learning to locate abnormal segments from video-level labels (often termed temporal localization) and effectively modeling the temporal dynamics of violence patterns.
Early work by Sultani et al. [12] pioneered the application of Multiple-Instance Learning (MIL) to this task, treating videos as 'bags' of segments and employing attention mechanisms for violence detection. Subsequent research advanced the MIL paradigm; for example, Shin et al. [13] integrated MIL with temporal attention and self-supervised learning for refined localization and feature extraction. Other WSVVD methods focused on advanced temporal modeling. For example, Ren et al. [14] employed temporal convolutional networks (TCNs) within their WSVVD framework to capture both local and global temporal dependencies for improved violence localization; Zhai et al. [15] proposed a WSVVD method utilizing transformers to enable long-range temporal reasoning about violent patterns under weak supervision; Gao et al. [16] introduced temporal graph neural networks (T-GNNs) to model spatiotemporal relationships between weakly labeled video segments; and Zhang et al. [17] designed a multi-scale temporal fusion network (MSTFN) that integrates short-term motion patterns with long-term semantic representations for violence detection. Tan et al. [18] proposed a deep differentiation segmentation neural network for video-based foreign object detection in urban rail transit, which enhances detection accuracy through attention mechanisms and morphological post-processing.
2.3. Feature Fusion-Based Weakly Supervised Video Violence Detection
To comprehensively characterize violent events, feature fusion has emerged as an essential research direction for WSVVD. By integrating different types of feature representations, such methods capture complementary manifestations of violent events simultaneously and enhance detection performance in complex scenarios. On this basis, researchers have proposed various feature fusion-based methods. For example, Yang et al. [19] introduced TADST, which integrated temporal convolutional networks with graph attention influence networks to effectively capture spatiotemporal dependencies, and further employed an encoder–decoder structure with adaptive violence detection to enhance temporal sequence modeling. Liang et al. [20] proposed the STFFE framework, which improved violence feature discrimination through temporally constrained fusion of the top-k most salient video segments. Jeon et al. [21] developed FEFM, combining a multi-dimensional feature extractor with a temporal feature extractor to jointly capture spatial and temporal dynamics, followed by fusion layers for robust violence detection. Cai et al. [22] proposed a multi-scale predictive framework, which extracted features with different receptive fields to capture spatial information and incorporated temporal constraints through ConvGRU modules for spatio-temporal feature fusion in video violence detection.
In summary, although existing methods based on feature fusion have advanced violence detection, they primarily rely on temporal analysis, often overlooking critical frequency-domain features. However, most violent events can be characterized by distinct frequency-domain patterns, such as high-frequency texture variations, which are challenging for single-domain analysis to fully capture. To address this limitation, this paper proposes a Wavelet-Based Time–Frequency Feature Fusion method for violence detection, utilizing a novel cascaded dual-module architecture.
2.4. Frequency-Domain Analysis and Feature Learning in Video Violence Detection
Frequency-domain analysis, traditionally powerful in signal processing, has recently gained traction in advanced video feature learning. Over the past three years in particular, several state-of-the-art studies have begun exploring this domain to address the limitations of purely temporal models. For instance, Li et al. [23] proposed the Frequency-Enhanced and Decomposed Transformer for Violent Behavior Detection (FDTAD), which integrates time-domain and frequency-domain decomposition within a transformer architecture to enhance model generalization and reduce false positives in unstable multivariate time series data. Chen et al. [24] proposed the LTFAD model, a lightweight All-MLP time–frequency violent behavior detection framework that achieves high efficiency and accuracy in IIoT time series analysis through dual-branch reconstruction and time–frequency joint learning. Xu et al. [25] proposed the FCDATA model, which enhances time series violent behavior detection by integrating frequency-domain rectification with a comprehensive dependency-aware attention mechanism to capture both synchronous and asynchronous inter-variable relationships. Zhang et al. [26] proposed the FreCT model, which enhances time series violent behavior detection by integrating frequency-domain analysis with a convolutional transformer to jointly capture long-term dependencies and local topology information.
3. Methodology
3.1. The Overall Framework
The overall structure of the proposed Wavelet-Based Time–Frequency Feature Fusion (WTFF) framework is illustrated in Figure 1. Unlike previous weakly supervised approaches that primarily focus on temporal cues, WTFF introduces a frequency-domain analysis branch to capture subtle variations such as abrupt motion, texture discontinuities, and illumination changes, which are common in violent events. The motivation is that temporal-only modeling (e.g., with RNN or TCN) tends to overlook these fine-grained spatial fluctuations that appear as high-frequency patterns in the spectral domain.
As shown in Figure 1, the proposed WTFF framework comprises two core components: the Wavelet-Dilated Separable Convolution Module (WDCM) and the Time–Frequency Feature Fusion (TFFF) network. These two modules are jointly optimized in an end-to-end manner to integrate temporal and frequency-domain representations for more discriminative violence detection. The detailed architectures of WDCM and TFFF are introduced in Section 3.2 and Section 3.3, respectively.
The WTFF pipeline proceeds as follows: untrimmed videos are first divided into non-overlapping 16-frame segments. The features $x$ of each segment are extracted through a pre-trained I3D network [27] and then processed in parallel by the Temporal Context Aggregation (TCA) module [27] and the WDCM to obtain temporal features $F_t$ and frequency-domain features $F_f$, respectively. Next, the TFFF module fuses $F_t$ and $F_f$ into $F_{tf}$. The fused representation is then passed through a two-layer MLP for dimensionality reduction, followed by a classifier that produces the segment-level violence scores $S$.
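To make the dataflow above concrete, the following PyTorch sketch mirrors the pipeline under explicit assumptions: the TCA, WDCM, and TFFF blocks are replaced by simple stand-in layers (their actual designs are detailed in Section 3.2 and Section 3.3), and the feature and hidden dimensions are illustrative rather than the values used in the paper.

```python
# Minimal sketch of the WTFF pipeline described above (not the authors' code).
# Module names (TCA, WDCM, TFFF) follow the paper; their bodies here are
# simplified stand-ins, and all layer sizes are illustrative assumptions.
import torch
import torch.nn as nn

class WTFF(nn.Module):
    def __init__(self, feat_dim=1024, hidden_dim=512):
        super().__init__()
        # Stand-ins for the TCA and WDCM branches (details in Sections 3.2-3.3).
        self.tca = nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=1)   # temporal branch
        self.wdcm = nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=1)  # frequency branch
        self.fuse = nn.Conv1d(2 * feat_dim, feat_dim, kernel_size=1)         # TFFF stand-in
        self.mlp = nn.Sequential(nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, hidden_dim // 4), nn.ReLU())
        self.classifier = nn.Linear(hidden_dim // 4, 1)

    def forward(self, x):                      # x: (B, T, D) I3D segment features
        x = x.transpose(1, 2)                  # (B, D, T) for 1D convolutions
        f_t = self.tca(x)                      # temporal features F_t
        f_f = self.wdcm(x)                     # frequency-domain features F_f
        f_tf = self.fuse(torch.cat([f_t, f_f], dim=1)).transpose(1, 2)  # fused (B, T, D)
        s = torch.sigmoid(self.classifier(self.mlp(f_tf))).squeeze(-1)  # segment scores S
        return s                               # (B, T)

scores = WTFF()(torch.randn(2, 32, 1024))      # 2 videos, 32 segments each
print(scores.shape)                            # torch.Size([2, 32])
```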
The role of WTFF is dual. First, through the WDCM, the input temporal features are decomposed using the wavelet transform to isolate and enhance the high-frequency components of the video signal. These components correspond to rapid, short-duration motion and texture changes (e.g., sudden impact, quick body movements) that are highly indicative of violence but often suppressed by conventional low-pass operations. Second, the TFFF network systematically fuses the high-frequency features (from WDCM) with the baseline low-frequency temporal features (e.g., from the backbone network), forming a comprehensive spatio-temporal-frequency representation that captures both global context and transient motion details.
In essence, the advantage of WTFF over existing methods lies in its transition from a purely time-domain approach to a time–frequency hybrid modeling paradigm, which is fundamentally more effective for localizing and recognizing anomalies characterized by rapid transients in violent events. This design enables WTFF to simultaneously exploit complementary temporal and frequency cues, thereby improving the robustness and discriminative capability of violence detection in complex real-world surveillance environments.
3.2. Wavelet-Dilated Separable Convolution Module
The WDCM is the core module for extracting frequency-domain representations. It is specifically designed to combine the advantages of wavelet transform and dilated separable convolution. The rationale is twofold: (1) Wavelet transformation decomposes input features into multiple frequency sub-bands, providing localized time–frequency analysis; (2) dilated convolution enlarges the receptive field without increasing parameters, facilitating multi-scale feature aggregation. This combination allows WDCM to model both global context and fine-grained local variations in motion intensity, which are critical for distinguishing violent from non-violent activities.
The WDCM is designed to capture frequency-domain information from video data; it primarily consists of one depth-wise separable convolution layer, a wavelet decomposition and reconstruction pair, and multiple dilated convolutional layers.
The process of the WDCM is illustrated in Figure 2. Initially, a depth-wise separable convolution (DSConv) is applied to the input, which can either be images or feature tensors. Following this, a first-level wavelet decomposition is performed, dividing the feature map into four frequency components: one low-frequency component (LL) and three high-frequency components (LH, HL, HH). Subsequently, separate dilated convolutional operations (DConv) are applied to each of these components. Finally, an inverse wavelet transform is used to reconstruct the processed frequency components. By combining wavelet decomposition with dilated convolutions, this architecture effectively captures frequency-domain features while simultaneously expanding the receptive fields.
3.2.1. Wavelet Transformation
For the wavelet transformation in WDCM, this paper adopts the Daubechies-1 (db1) basis [28]. The wavelet transform is chosen instead of the Fourier transform because it preserves both spatial and frequency localization, and the db1 basis is favored for its simplicity and low computational cost. Alternative wavelet bases (e.g., db2, db3) were evaluated but incurred higher computational overhead without proportional performance gains.
For the db1 wavelet transform implementation, a one-dimensional operation is first performed on the input feature map $F$ along either the width or the height dimension. This single-level decomposition is achieved through depth-wise convolutions with the low-pass and high-pass kernels $f_L = \frac{1}{\sqrt{2}}[1,\ 1]$ and $f_H = \frac{1}{\sqrt{2}}[1,\ -1]$. To extend this to a two-dimensional db1 wavelet decomposition, the 1D transforms are composed along both spatial dimensions. This is implemented using the following four distinct filter groups, applied through depth-wise convolutions with a stride of 2 [28]:
$$f_{LL} = f_L^{\top} f_L,\qquad f_{LH} = f_L^{\top} f_H,\qquad f_{HL} = f_H^{\top} f_L,\qquad f_{HH} = f_H^{\top} f_H. \tag{1}$$
Here, $f_{LL}$ captures the low-frequency (smooth) content, while $f_{LH}$, $f_{HL}$, and $f_{HH}$ capture horizontal, vertical, and diagonal high-frequency details, respectively. Applying these filters through DSConv produces four sub-band feature maps, separating motion-related high-frequency regions from background-dominant low-frequency areas and providing more interpretable features for violence analysis.
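For illustration, the following sketch implements the single-level decomposition of Equation (1) as a stride-2 depth-wise (grouped) convolution in PyTorch. It is a minimal reconstruction from the description above rather than the authors' code, and the exact sign convention and sub-band ordering are assumptions.

```python
# A minimal sketch of the single-level db1 decomposition in Equation (1),
# implemented as a depth-wise convolution with stride 2 (not the authors' code;
# filter sign conventions may differ from the original implementation).
import torch
import torch.nn.functional as F

def dwt_db1(x):
    """x: (B, C, H, W) -> (LL, LH, HL, HH), each of shape (B, C, H/2, W/2)."""
    c = x.shape[1]
    f_l = torch.tensor([1.0, 1.0]) / 2 ** 0.5           # 1D low-pass kernel f_L
    f_h = torch.tensor([1.0, -1.0]) / 2 ** 0.5          # 1D high-pass kernel f_H
    # Outer products give the four 2x2 kernels f_LL, f_LH, f_HL, f_HH.
    kernels = torch.stack([torch.outer(a, b) for a in (f_l, f_h) for b in (f_l, f_h)])
    kernels = kernels.repeat(c, 1, 1).unsqueeze(1)       # (4*C, 1, 2, 2), depth-wise
    y = F.conv2d(x, kernels.to(x), stride=2, groups=c)   # (B, 4*C, H/2, W/2)
    y = y.view(x.shape[0], c, 4, *y.shape[-2:])
    return y[:, :, 0], y[:, :, 1], y[:, :, 2], y[:, :, 3]

ll, lh, hl, hh = dwt_db1(torch.randn(1, 8, 32, 32))
print(ll.shape, hh.shape)                                # (1, 8, 16, 16) each
```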
3.2.2. Wavelet Decomposition
Notably, since the kernels in Equation (1) form an orthogonal basis, the inverse wavelet transform (IWT) can be implemented through transposed convolution,
$$\hat{F} = \mathrm{IWT}(F_{LL}, F_{LH}, F_{HL}, F_{HH}) = \sum_{k \in \{LL, LH, HL, HH\}} \mathrm{TConv}(F_k; f_k), \tag{2}$$
where $\mathrm{TConv}(\cdot\,; f_k)$ denotes a stride-2 transposed depth-wise convolution with kernel $f_k$. In WDCM, the decomposition is recursively applied to the low-frequency component to form a hierarchical representation,
$$\{F^{(l+1)}_{LL},\ F^{(l+1)}_{LH},\ F^{(l+1)}_{HL},\ F^{(l+1)}_{HH}\} = \mathrm{WT}\big(F^{(l)}_{LL}\big), \quad l = 0, 1, \ldots, \tag{3}$$
where $\mathrm{WT}(\cdot)$ denotes the decomposition in Equation (1) and $F^{(0)}_{LL} = F$. Such multi-level decomposition progressively enhances the model's ability to separate fine-grained spectral components while preserving the essential global context. Empirically, a two-level decomposition achieves the best trade-off between efficiency and discriminative power.
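The sketch below illustrates Equations (2) and (3): the inverse transform is realized as a stride-2 transposed depth-wise convolution with the same kernels, and the decomposition is applied recursively to the LL band. It reuses the dwt_db1 helper from the previous sketch and is an assumed reconstruction, not the authors' implementation.

```python
# Sketch of Equation (2) (inverse transform via transposed convolution) and the
# recursive decomposition of Equation (3); reuses dwt_db1 from the sketch above.
# A two-level decomposition is shown, matching the empirical choice in the text.
import torch
import torch.nn.functional as F

def iwt_db1(ll, lh, hl, hh):
    """Inverse of dwt_db1: four (B, C, H, W) sub-bands -> (B, C, 2H, 2W)."""
    b, c, h, w = ll.shape
    f_l = torch.tensor([1.0, 1.0]) / 2 ** 0.5
    f_h = torch.tensor([1.0, -1.0]) / 2 ** 0.5
    kernels = torch.stack([torch.outer(a, b2) for a in (f_l, f_h) for b2 in (f_l, f_h)])
    kernels = kernels.repeat(c, 1, 1).unsqueeze(1).to(ll)        # (4*C, 1, 2, 2)
    y = torch.stack([ll, lh, hl, hh], dim=2).reshape(b, 4 * c, h, w)
    # Orthogonality of the db1 kernels makes the transposed convolution an exact inverse.
    return F.conv_transpose2d(y, kernels, stride=2, groups=c)

def multi_level_dwt(x, levels=2):
    """Recursively decompose the low-frequency band, as in Equation (3)."""
    pyramid = []
    for _ in range(levels):
        ll, lh, hl, hh = dwt_db1(x)                              # from the previous sketch
        pyramid.append((lh, hl, hh))                             # keep high-frequency details
        x = ll                                                   # recurse on the LL band
    return x, pyramid

x = torch.randn(1, 8, 32, 32)
assert torch.allclose(iwt_db1(*dwt_db1(x)), x, atol=1e-5)        # perfect reconstruction
```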
3.2.3. Inverse Wavelet Transformation
After capturing the frequency-domain information of the input data, the WDCM processes the different frequency components through dilated convolutional operations,
$$\tilde{F}_k = \mathrm{DConv}(F_k), \quad k \in \{LL, LH, HL, HH\}. \tag{4}$$
These dilated convolutions (DConv) further model spatial dependencies within each sub-band, where small dilation rates (e.g., 2–4) are used to capture localized motion variations without excessive background interference. The WDCM then reconstructs the final output by combining the convolved frequency components through the IWT,
$$F_f = \mathrm{IWT}\big(\tilde{F}_{LL}, \tilde{F}_{LH}, \tilde{F}_{HL}, \tilde{F}_{HH}\big). \tag{5}$$
This yields a feature map that preserves both low-frequency stability and high-frequency sensitivity. Compared with conventional CNN layers, WDCM provides multi-resolution analysis, improving robustness to illumination changes, occlusion, and noise.
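Putting the pieces together, a condensed WDCM forward pass might look as follows. This sketch reuses the dwt_db1 and iwt_db1 helpers from the previous sketches, shows only a single decomposition level, and treats the channel count and dilation rate as illustrative assumptions rather than the paper's settings.

```python
# A condensed sketch of the WDCM dataflow (Figure 2): depth-wise separable
# convolution, db1 decomposition, per-sub-band dilated convolution, and inverse
# reconstruction. Not the authors' implementation; channel sizes, dilation rate,
# and the single decomposition level shown here are illustrative assumptions.
import torch
import torch.nn as nn

class WDCM(nn.Module):
    def __init__(self, channels=64, dilation=2):
        super().__init__()
        # Depth-wise separable convolution = depth-wise conv + point-wise conv.
        self.dsconv = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels),
            nn.Conv2d(channels, channels, 1))
        # One dilated convolution per frequency sub-band (LL, LH, HL, HH).
        self.dconv = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=dilation, dilation=dilation)
            for _ in range(4)])

    def forward(self, x):                          # x: (B, C, H, W)
        x = self.dsconv(x)
        bands = dwt_db1(x)                         # from the earlier sketch
        bands = [conv(b) for conv, b in zip(self.dconv, bands)]
        return iwt_db1(*bands)                     # frequency-domain features F_f

out = WDCM()(torch.randn(1, 64, 32, 32))
print(out.shape)                                   # torch.Size([1, 64, 32, 32])
```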
3.3. Time–Frequency Feature Fusion Network
After frequency features and temporal features are extracted, the TFFF network fuses them into a unified representation that captures complementary information. The key insight is that violent events often exhibit temporal discontinuities (e.g., sudden acceleration) accompanied by spectral disturbances (e.g., texture deformation). Hence, fusing these modalities enhances discriminative representation.
Figure 3 illustrates the detailed architecture of the TFFF, where $B$ denotes the batch size, $D$ represents the feature dimension, and $T$ indicates the sequence length.
After the temporal features $F_t$ and frequency-domain features $F_f$ are extracted by the TCA [27] and WDCM modules, respectively, the following feature fusion steps are performed. First, the temporal features $F_t$ undergo dimension adjustment via a 2D convolution. Next, the frequency-domain features $F_f$ are normalized using layer normalization. Finally, the processed temporal and frequency-domain features are concatenated along the channel dimension and fused through a 2D convolutional layer. The fused feature is formulated as
$$F_{tf} = \mathrm{Conv2D}\big(\mathrm{Cat}\big(\mathrm{Conv2D}(F_t),\ \mathrm{LN}(F_f)\big)\big), \tag{6}$$
where $\mathrm{Cat}(\cdot,\cdot)$ denotes the channel-wise concatenation operation of tensors and $\mathrm{LN}(\cdot)$ denotes layer normalization.
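A minimal sketch of Equation (6) is given below, assuming the temporal and frequency features arrive as (batch, sequence, dimension) tensors that are reshaped so the 2D convolutions operate over the channel dimension; the kernel sizes and feature dimension are assumptions, not values reported in the paper.

```python
# A minimal sketch of the TFFF fusion step in Equation (6) (not the authors'
# code): a 2D convolution adjusts the temporal features, layer normalization is
# applied to the frequency features, and the concatenated result is fused by a
# second 2D convolution. Tensor shapes and kernel sizes are assumptions.
import torch
import torch.nn as nn

class TFFF(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.adjust = nn.Conv2d(dim, dim, kernel_size=1)      # dimension adjustment of F_t
        self.norm = nn.LayerNorm(dim)                         # normalization of F_f
        self.fuse = nn.Conv2d(2 * dim, dim, kernel_size=1)    # fusion after concatenation

    def forward(self, f_t, f_f):                              # both: (B, T, D)
        f_t = self.adjust(f_t.permute(0, 2, 1).unsqueeze(-1)) # (B, D, T, 1)
        f_f = self.norm(f_f).permute(0, 2, 1).unsqueeze(-1)   # (B, D, T, 1)
        f_tf = self.fuse(torch.cat([f_t, f_f], dim=1))        # channel-wise concat + Conv2D
        return f_tf.squeeze(-1).permute(0, 2, 1)              # back to (B, T, D)

fused = TFFF()(torch.randn(2, 32, 512), torch.randn(2, 32, 512))
print(fused.shape)                                             # torch.Size([2, 32, 512])
```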
To further enhance the semantic consistency, the Prompt-Enhanced Learning (PEL) module [27] from the baseline is retained after fusion. It provides adaptive prompts that guide the model to focus on violence-related patterns. Compared with traditional fusion strategies such as element-wise addition or attention-based fusion, TFFF offers a balanced trade-off between efficiency and representation richness, as verified by ablation studies.
3.4. Loss Function
Following [27], WTFF employs a combination of a binary cross-entropy (BCE) loss and a cross-modal alignment loss [27]. The BCE loss supervises video-level classification, while the alignment term encourages consistency between the temporal and frequency modalities. Specifically, the video-level prediction $\hat{y}_i$ is determined by averaging the top-$k$ segment-level violence scores. Given a mini-batch of $B$ samples with video-level labels $y_i \in \{0, 1\}$, the binary cross-entropy loss is formulated as
$$\mathcal{L}_{\mathrm{BCE}} = -\frac{1}{B}\sum_{i=1}^{B}\Big[\,y_i \log \hat{y}_i + (1 - y_i)\log\big(1 - \hat{y}_i\big)\Big]. \tag{7}$$
During the training phase, the overall objective function of our model is formulated as
$$\mathcal{L} = \mathcal{L}_{\mathrm{BCE}} + \lambda\,\mathcal{L}_{\mathrm{align}}, \tag{8}$$
where $\mathcal{L}_{\mathrm{align}}$ denotes the Kullback–Leibler divergence-based alignment loss and $\lambda$ controls the balance between classification and alignment. This joint optimization encourages WTFF to learn complementary and coherent representations across both the time and frequency domains.
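As a sketch of the overall objective, the snippet below combines a top-k-based BCE term with a KL-divergence alignment term. The specific alignment formulation used here (a KL divergence between the two branches' segment-score distributions) is a simplifying assumption; the actual cross-modal alignment loss follows [27].

```python
# A sketch of the training objective in Equations (7)-(8), assuming the KL-based
# alignment term compares temporal and frequency score distributions; the exact
# alignment formulation follows [27] and may differ from this simplification.
import torch
import torch.nn.functional as F

def wtff_loss(scores, labels, f_t_scores, f_f_scores, k=8, lam=1.0):
    """scores: (B, T) segment scores, labels: (B,) video-level labels in {0, 1}."""
    # Video-level prediction: average of the top-k segment scores.
    y_hat = scores.topk(k, dim=1).values.mean(dim=1)
    bce = F.binary_cross_entropy(y_hat, labels.float())
    # KL-divergence alignment between the two branches' score distributions.
    align = F.kl_div(F.log_softmax(f_t_scores, dim=1),
                     F.softmax(f_f_scores, dim=1), reduction='batchmean')
    return bce + lam * align

loss = wtff_loss(torch.rand(2, 32), torch.tensor([1, 0]),
                 torch.randn(2, 32), torch.randn(2, 32))
print(loss.item())
```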
5. Conclusions
In this paper, we proposed the Wavelet-Based Time–Frequency Feature Fusion (WTFF) method, a novel approach designed to enhance the accuracy of frame-level anomaly localization in weakly supervised video violence detection. The core innovation of WTFF lies in moving beyond traditional temporal analysis by systematically incorporating frequency-domain features to capture the subtle, high-frequency motion transients inherent in violent events. Specifically, the Wavelet-Dilated Separable Convolution Module (WDCM) was introduced to decompose the video features, effectively isolating the high-frequency components that are critical for pinpointing the exact onset and cessation of anomalous behaviors. Furthermore, the Time–Frequency Feature Fusion (TFFF) Network ensured a complementary integration between these detailed high-frequency cues and the overall low-frequency temporal context. The superiority of our proposed method was comprehensively validated through extensive experiments on three large-scale public benchmarks. On the UCF-Crime dataset, our WTFF method achieved a state-of-the-art AUC of approximately 85.87%, demonstrating robust frame-level anomaly localization. On the XD-Violence dataset, our approach attained an AP of approximately 84.77%, highlighting its effectiveness in accurately localizing violent events under challenging conditions. On the ShanghaiTech dataset, our method reached an AP of approximately 97.91%, further validating its generalizability and precision across diverse scenes. In addition, the ablation study clearly confirmed that the integration of the WDCM alone leads to significant performance improvements, directly supporting the hypothesis that frequency-domain features are critical for robust violence detection. In summary, this work successfully leverages the multi-resolution power of the wavelet transform to solve the long-standing problem of coarse localization in WS-VAD, offering a powerful new direction for feature representation learning. In future work, we plan to explore adaptive wavelet selection and extend the WTFF framework to other multi-modal anomaly detection tasks. In future work, we plan to explore adaptive wavelet selection and extend the WTFF framework to fuse visual time–frequency features with audio features for robust multimodal anomaly detection.