Peer-Review Record

Wavelet-Based Time–Frequency Feature Fusion for Violence Detection

Electronics 2025, 14(21), 4320; https://doi.org/10.3390/electronics14214320
by Fan Zhang 1,2, Jing Peng 1,2, Jinxiao Wang 3,*, Xuan Liu 3, Lin Cao 1,2, Kangning Du 1,2 and Yanan Guo 1,2
Reviewer 1:
Reviewer 2: Anonymous
Reviewer 3:
Submission received: 18 September 2025 / Revised: 20 October 2025 / Accepted: 31 October 2025 / Published: 4 November 2025

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

In this study, the authors propose wavelet-based time-frequency feature fusion for violence detection. This manuscript has done interesting work, but some modifications need to be made.

  1. This manuscript concentrates on the feature fusion-based detection methods, some recent results worthy of being quoted, e.g., 10.1016/j.heliyon.2024.e37072.
  2. What are the advantages of the new method compared to the existing one? These points should be clarified, and what is the role of the wavelet-based time-frequency feature fusion (WTFF) method?
  3. In the Introduction part, to facilitate readers in recognising the innovation of this manuscript quickly, it is suggested to elaborate on the main contributions in several points.
  4. In the experiment parts, I have the following three concerns, to be specific: firstly, the performance of the proposed method should be better analysed, commented and visualised in the experimental section; secondly, please elaborate on the three violence event datasets mentioned and used; thirdly, why is the AUC used to evaluate the UCF-Crime and ShanghaiTech datasets while AP is used to evaluate the XD-Violence dataset? Can it be unified?
  5. In the conclusion part, the conclusion is too simple, and please summarise the conclusion according to the simulation results in all the figures.
  6. Correct the grammatical errors; there are many in the entire manuscript.

Author Response

1. Summary

 

 

Thank you very much for taking the time to review this manuscript. Please find the detailed responses below and the corresponding revisions/corrections highlighted/in track changes in the re-submitted files.

2. Questions for General Evaluation

(Scale for each item: Yes / Can be improved / Must be improved / Not applicable)

Does the introduction provide sufficient background and include all relevant references? Yes

Are all the cited references relevant to the research? Yes

Is the research design appropriate? Yes

Are the methods adequately described? Yes

Are the results clearly presented? Yes

Are the conclusions supported by the results? Yes

3. Point-by-point response to Comments and Suggestions for Authors

Comments 1: This manuscript concentrates on the feature fusion-based detection methods, some recent results worthy of being quoted, e.g., 10.1016/j.heliyon.2024.e37072.

Response 1:

We sincerely appreciate the reviewer's suggestion to include the latest related work. We have conducted a review of the suggested paper (DOI: 10.1016/j.heliyon.2024.e37072) and other recent feature fusion literature. Although the suggested reference primarily focuses on deep differentiation segmentation for foreign object detection in urban rail transit, we recognize the importance of strengthening the literature review with more contemporary feature fusion techniques applied to violence and anomaly detection. We have incorporated these new citations and revised our discussion to more accurately position the novelty of our Wavelet-Based Time-Frequency Feature Fusion (WTFF) method relative to existing deep feature, multi-scale, and multi-modal fusion approaches. (The revised content can be found on page 3 of the revised manuscript.):

 

Early work by Sultani et al. [12] pioneered the application of Multiple Instance Learning (MIL) for this task, treating videos as 'bags' of segments and employing attention mechanisms for violence detection. Subsequent research advanced MIL; for example, Shin et al. [13] integrated MIL with temporal attention and self-supervised learning for refined localization and feature extraction. In addition, alternative WSVVD methods have focused on advanced temporal modeling. For example, Ren et al. [14] employed temporal convolutional networks (TCNs) within their WSVVD framework to capture both local and global temporal dependencies for improved violence localization; Zhai et al. [15] proposed a WSVVD method utilizing transformers to enable long-range temporal reasoning of violent patterns under weak supervision; Gao et al. [16] introduced temporal graph neural networks (T-GNNs) to model spatiotemporal relationships between weakly labeled video segments; and Zhang et al. [17] designed a multi-scale temporal fusion network (MSTFN) that integrates short-term motion patterns with long-term semantic representations for violence detection. Tan et al. [18] proposed a deep differentiation segmentation neural network for video-based foreign object detection in urban rail transit, which enhances detection accuracy through attention mechanisms and morphological post-processing.

Comments 2: What are the advantages of the new method compared to the existing one? These points should be clarified, and what is the role of the wavelet-based time-frequency feature fusion (WTFF) method?

Response 2: Thank you for raising this critical point. We agree that the core advantages and the precise role of our Wavelet-Based Time-Frequency Feature Fusion (WTFF) method need clearer articulation. The primary advantage of WTFF compared to existing methods is its ability to overcome the limitation of relying solely on low-frequency temporal information (e.g., standard optical flow or 3D CNN features), which often smooths out the subtle, high-frequency motion and texture discontinuities that define the start and end of a violent event. The role of the WTFF method is twofold:

Frequency-Domain Feature Extraction: The Wavelet Dilated-Separable Convolution Module (WDCM) extracts detailed, high-frequency features by decomposing the video signal, capturing rapid motion changes more effectively than purely temporal models.

Complementary Fusion: The Time-Frequency Feature Fusion (TFFF) Network systematically fuses these high-frequency, frequency-domain features with the standard low-frequency, temporal features, resulting in a more robust and comprehensive representation for accurate frame-level localization. (The supplementary explanation of WTFF’s core advantages (related to addressing low-frequency information limitations) can be found in the Introduction section of the revised manuscript (Page 2), and the detailed elaboration on the dual roles of WTFF (including WDCM’s feature extraction and TFFF’s fusion function) is added in the Methodology section (Page 5).):

 

1. Introduction

To balance annotation costs and detection performance, weakly supervised VVD (WSVVD) was introduced, requiring only video-level labels (i.e., indicating the presence or absence of violence). For example, Zhu et al. proposed the inter-clip feature similarity based video violence detection (IFS-VVD) method [4], which leveraged a multi-scale temporal multi-layer perceptron (MLP) to integrate both global and local temporal relations to improve detection performance. Wu et al. proposed the STPrompt method [5], which learned temporal prompt embeddings for violence detection and localization by using pre-trained vision-language models. However, existing WSVVD detection methods, particularly those based on the Multiple Instance Learning (MIL) framework, predominantly rely on features extracted from the temporal domain (e.g., I3D or C3D features). These features are effective at capturing global, low-frequency motion patterns, but they often suffer from information smoothing or loss when dealing with the subtle and abrupt high-frequency variations that characterize the instantaneous start and end of a violent event (e.g., a sudden strike or a rapid fall). This over-reliance on low-frequency temporal information often compromises the model's capacity for accurate frame-level localization.

 

To address this critical limitation, this paper introduces the wavelet-based time-frequency feature fusion (WTFF) method. WTFF adopts a cascaded dual-module architecture, consisting of the Wavelet Dilated-Separable Convolution Module (WDCM) for frequency-domain feature extraction and the Time-Frequency Feature Fusion Network (TFFF) for cross-domain feature integration. The primary advantage of WTFF lies in its novel utilization of frequency-domain analysis to extract features that are complementary to the standard temporal features. This is achieved by introducing the Wavelet Transform, which allows the model to decompose the video signal and capture the detailed, high-frequency motion characteristics essential for precise event boundary detection. Through a systematic feature fusion process, WTFF constructs a more robust representation that integrates both low-frequency temporal context and high-frequency motion details, significantly boosting the accuracy of weakly supervised violence localization. Subsequently, the TFFF fuses the extracted temporal and spectral features, leveraging their complementary nature to generate more discriminative representations for violence detection. The design of maintaining separate feature extraction branches prior to fusion not only allows for simultaneous capture of temporal and frequency-domain features but also preserves the distinct characteristics of each domain. By integrating information from both domains, this architecture facilitates more discriminative detection of violent events. Experimental results on the UCF-Crime, XD-Violence, and ShanghaiTech datasets demonstrate the effectiveness of the proposed method.

 

3. Methodology

3.1 The overall framework

The overall structure of the proposed Wavelet-Based Time-Frequency Feature Fusion (WTFF) framework is illustrated in Fig. 1. Unlike previous weakly supervised approaches that primarily focus on temporal cues, WTFF introduces a frequency-domain analysis branch to capture subtle variations such as abrupt motion, texture discontinuities, and illumination changes, which are common in violent events. The motivation is that temporal-only modeling (e.g., with RNN or TCN) tends to overlook these fine-grained spatial fluctuations that appear as high-frequency patterns in the spectral domain.

 

As shown in Fig. 1, the proposed WTFF framework comprises two core components: the Wavelet Dilated-Separable Convolution Module (WDCM) and the Time-Frequency Feature Fusion (TFFF) network. These two modules are jointly optimized in an end-to-end manner to integrate temporal and frequency-domain representations for more discriminative violence detection. The detailed architectures of WDCM and TFFF will be introduced in Section 3.2 and Section 3.3, respectively.

 

The WTFF process performs as follows: untrimmed videos are first divided into non-overlapping 16-frame segments using a fixed step size of 16 frames. Each segment’s features $x$ are extracted through a pre-trained I3D network [27], and then processed in parallel by the Temporal Context Aggregation (TCA) [27] and WDCM modules to obtain temporal features $x^{c}$ and frequency-domain features $x^{f}$, respectively. Next, the TFFF module fuses $x^{c}$ and $x^{f}$ into $x^{fused}$. The fused representation is then passed through a two-layer MLP for dimensionality reduction, followed by a classifier that produces the segment-level violence scores $S$.
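For illustration only, the following is a minimal PyTorch-style sketch of the processing flow just described; the internals of TCA, WDCM, and TFFF are placeholders, and all module names, shapes, and defaults are assumptions made for this example rather than the authors' released implementation (the MLP widths and dropout follow the hyperparameters reported in Section 4.2.2).

```python
import torch
import torch.nn as nn

class WTFFPipelineSketch(nn.Module):
    """Sketch of the described flow: I3D segment features -> TCA / WDCM -> TFFF -> MLP -> scores."""

    def __init__(self, feat_dim=1024, tca=None, wdcm=None, tfff=None):
        super().__init__()
        # Placeholder branches; the actual TCA, WDCM, and TFFF designs are defined in the paper.
        self.tca = tca or nn.Identity()                 # temporal branch  -> x^c
        self.wdcm = wdcm or nn.Identity()               # frequency branch -> x^f
        self.tfff = tfff or (lambda xc, xf: xc + xf)    # fusion           -> x^fused
        self.mlp = nn.Sequential(                       # two Conv1D layers: 512 and 300 nodes, dropout 0.1
            nn.Conv1d(feat_dim, 512, kernel_size=1), nn.ReLU(), nn.Dropout(0.1),
            nn.Conv1d(512, 300, kernel_size=1), nn.ReLU(), nn.Dropout(0.1),
        )
        self.classifier = nn.Conv1d(300, 1, kernel_size=1)   # segment-level violence scores S

    def forward(self, x):                    # x: (batch, num_segments, feat_dim) I3D features
        xc = self.tca(x)                     # temporal features x^c
        xf = self.wdcm(x)                    # frequency-domain features x^f
        fused = self.tfff(xc, xf)            # fused representation x^fused
        h = self.mlp(fused.transpose(1, 2))  # (batch, 300, num_segments)
        return torch.sigmoid(self.classifier(h)).squeeze(1)  # (batch, num_segments)

segments = torch.randn(2, 32, 1024)          # e.g. 2 videos, 32 non-overlapping 16-frame segments
print(WTFFPipelineSketch()(segments).shape)  # torch.Size([2, 32])
```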

 

The role of WTFF is dual. First, through the WDCM, the input temporal features are decomposed using the Wavelet Transform to isolate and enhance the high-frequency components of the video signal. These components correspond to rapid, short-duration motion and texture changes (e.g., sudden impact, quick body movements) that are highly indicative of violence but often suppressed by conventional low-pass operations. 

Second, the TFFF network systematically fuses the high-frequency features (from WDCM) with the baseline low-frequency temporal features (e.g., from the backbone network), forming a comprehensive spatio-temporal-frequency representation that captures both global context and transient motion details.
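As a concrete illustration of the low/high-frequency split described here, the sketch below uses PyWavelets to separate a 1-D temporal feature trace into a smooth approximation and a detail component; the wavelet family ("db4") and decomposition level are illustrative assumptions, not the configuration used inside the WDCM.

```python
import numpy as np
import pywt

def split_low_high(feature_seq, wavelet="db4", level=1):
    """Split a 1-D temporal trace into low-frequency (approximation) and high-frequency (detail) parts."""
    coeffs = pywt.wavedec(feature_seq, wavelet, level=level)
    approx_only = [coeffs[0]] + [np.zeros_like(c) for c in coeffs[1:]]
    detail_only = [np.zeros_like(coeffs[0])] + list(coeffs[1:])
    low = pywt.waverec(approx_only, wavelet)[: len(feature_seq)]   # smooth temporal context
    high = pywt.waverec(detail_only, wavelet)[: len(feature_seq)]  # abrupt transients
    return low, high

# Toy example: a sudden jump (a stand-in for an abrupt motion change) is preserved in `high`
# but largely smoothed away in `low`.
t = np.linspace(0, 1, 64)
trace = np.sin(2 * np.pi * 2 * t)
trace[32] += 3.0
low, high = split_low_high(trace)
```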

 

In essence, the advantage of WTFF over existing methods lies in its transition from a purely time-domain approach to a time-frequency hybrid modeling paradigm, which is fundamentally more effective for localizing and recognizing anomalies characterized by rapid transients in violent events. This design enables WTFF to simultaneously exploit complementary temporal and frequency cues, thereby improving the robustness and discriminative capability of violence detection in complex real-world surveillance environments.

Comments 3: In the Introduction part, to facilitate readers in recognising the innovation of this manuscript quickly, it is suggested to elaborate on the main contributions in several points.

Response 3: We completely agree. To enhance the readability and quickly highlight the novelty of our work, we will add a dedicated, enumerated paragraph outlining our main contributions toward the end of the Introduction section:

 

1. Introduction

The main contributions of this manuscript can be summarized as follows:

1. We propose a novel framework named Wavelet-Based Time-Frequency Feature Fusion (WTFF) for weakly supervised video violence detection, which addresses the limitations of purely temporal feature analysis by incorporating frequency-domain information.

2. We design the Wavelet Dilated-Separable Convolution Module (WDCM), which innovatively employs the Wavelet Transform to decompose the video features and effectively isolate and enhance the high-frequency components related to subtle and abrupt violent motions.

3. We introduce the Time-Frequency Feature Fusion (TFFF) Network to achieve a complementary feature fusion, ensuring that the final anomaly score benefits from the synergistic integration of both low-frequency temporal context and high-frequency motion details.

4. Extensive experiments on three challenging benchmarks (UCF-Crime, XD-Violence, and ShanghaiTech) demonstrate that our proposed WTFF method achieves superior frame-level anomaly localization performance compared to state-of-the-art methods.

Comments 4: In the experiment parts, I have the following three concerns, to be specific: firstly, the performance of the proposed method should be better analysed, commented and visualised in the experimental section; secondly, please elaborate on the three violence event datasets mentioned and used; thirdly, why is the AUC used to evaluate the UCF-Crime and ShanghaiTech datasets while AP is used to evaluate the XD-Violence dataset? Can it be unified?

Response 4: We acknowledge that a deeper analysis of the results is necessary. We will enhance Section 4 (Experiments) by adding more detailed commentary to better demonstrate the effectiveness of WTFF. We also agree that a more detailed description of the benchmark datasets will facilitate readers’ understanding of the experimental context, and we will expand Section 4.1 to include specific characteristics of each dataset.

As for evaluation metrics, we appreciate the reviewer's concern regarding their non-uniformity. The choice of metrics is determined by the widely accepted community standards established for each specific dataset, which reflect their unique characteristics. AUC (Area Under the ROC Curve) for UCF-Crime and ShanghaiTech: these datasets are primarily used for Weakly Supervised Video Anomaly Detection (WS-VAD) and are characterized by extreme class imbalance (the number of anomalous frames is far smaller than the number of normal frames). In this context, AUC is the standard metric for measuring the model's overall ability to discriminate between normal and abnormal states, evaluating the balance between the True Positive Rate and the False Positive Rate across thresholds, and it is the most commonly used metric in the WS-VAD field. AP (Average Precision) for XD-Violence: XD-Violence is not only an anomaly detection dataset but also places greater emphasis on the Temporal Action Localization task. In scenarios involving localization or a stronger focus on the precision-recall trade-off, AP/mAP is a more discriminative and standard metric, and official and mainstream works on the XD-Violence dataset typically use AP to evaluate performance.

Given the need for comparability and adherence to the benchmark standards set by the research community for each dataset, we believe it is essential to maintain the current evaluation metrics; forcing a unified metric would prevent a meaningful and fair comparison of our results with existing SOTA methods. We have clarified this rationale in the manuscript. (The revised content can be found on pages 10-11 and 14 of the revised manuscript.):

 

4.4 Ablation Experiments

4.4.1 Ablation Experiments on WDCM and TFFF

To assess the contribution of each component within WTFF, a series of ablation studies are conducted on the UCF-Crime, XD-Violence, and ShanghaiTech datasets: (1) experiment 1: the baseline [27]; (2) experiment 2: relative to experiment 1, the WDCM module is integrated, where temporal and frequency-domain features are merged through element-wise summation; (3) experiment 3: based on experiment 2, the element-wise summation is substituted by the TFFF module to enable a more adaptive feature fusion. The corresponding results are presented in Table 4. A comparison between experiments 1 and 2 indicates that the baseline [27] is enhanced by the WDCM module, which yields improvements of 0.58%, 0.68%, and 0.28% on the UCF-Crime, XD-Violence, and ShanghaiTech datasets, respectively. These improvements highlight the importance of frequency-domain information in detecting violent events. From experiments 2 and 3, the TFFF further enhances performance across all datasets, with a notable 2.13% increase in AP on XD-Violence. From experiments 1, 2, and 3, while the WDCM module alone demonstrates consistent improvements, its synergy with the TFFF network is critical for achieving optimal performance, and the combined use of both WDCM and TFFF yields state-of-the-art results, validating that effective violent event detection requires concurrent analysis in both temporal and frequency domains.

 

 

The ablation results, particularly those presented in Table 4, clearly validate the effectiveness of both the WDCM and TFFF components. The incorporation of the WDCM leads to a significant performance improvement. This gain stems from the WDCM's capability to isolate and enhance high-frequency feature components via the Wavelet Transform. In the context of violence detection, these high-frequency components precisely correspond to the abrupt, non-linear motion transients (e.g., sudden striking, rapid fall) that mark the precise onset and cessation of an anomaly. Traditional temporal features often smooth out these critical high-frequency details, leading to blurred event boundaries; the WDCM successfully captures them, thereby greatly improving frame-level localization accuracy. 

 

When TFFF is integrated on top of WDCM, the performance achieves its peak, confirming that TFFF is not merely concatenating features but performing a necessary complementary integration. While the high-frequency features derived from WDCM are crucial for pinpointing event boundaries, they can also be susceptible to noise. TFFF ensures that these high-frequency details are balanced with the robust low-frequency context (temporal features from the backbone). This synergistic fusion mitigates the noise sensitivity of high-frequency cues while maintaining localization precision, resulting in a model that is both sensitive to subtle violence and resistant to false alarms in complex scenes.

 

In summary, the superior performance of the complete WTFF framework confirms the necessity of adopting a Time-Frequency Feature Fusion approach for enhanced weakly supervised anomaly detection.

 

4.1 Datasets And Evaluation Metric

This paper performs experiments on three violent event datasets: UCF-Crime [12], XD-Violence [29], and ShanghaiTech [30].

The UCF-Crime dataset is a widely used benchmark for video violence behavior detection. It comprises 1900 surveillance videos with a total duration of 128 hours, covering 13 categories of real-world violent events, including abuse, robbery, explosion, and road accidents. For evaluation in the weakly supervised setting, the dataset is typically partitioned into 1610 training videos and 290 test videos. Consistent with the weakly supervised paradigm, training videos are provided with only video-level labels, while frame-level annotations are available for evaluation on the test set. Fig. 4 presents example frames from the UCF-Crime dataset.

 

Figure 4. Example frames from the UCF-Crime dataset.

 

The XD-Violence dataset is recognized as one of the latest and largest multi-modal datasets for violence detection. It comprises 4754 untrimmed videos spanning a total duration of 217 hours, collected from diverse sources including surveillance, movies, car cameras, and games. The dataset is partitioned into 3954 training videos and 800 test videos. It covers six types of violent events: abuse, car accidents, explosions, fighting, riots, and shooting. A notable characteristic of this dataset is the prevalence of artistic expressions, such as complex camera movements and frequent scene switching, in many videos. These characteristics pose significant challenges for accurate video violence detection due to their inherent variability. Fig. 5 presents example frames from the XD-Violence dataset.

 

Figure 5. Example frames from the XD-Violence dataset.

 

The ShanghaiTech dataset contains 437 videos recorded across 13 different campus scenes. Although originally used for semi-supervised violence detection with a training set comprising only normal videos, Zhong et al. [14] reorganized this dataset specifically for evaluating weakly supervised methods. In this commonly used split, the dataset is partitioned into 238 videos for training and 199 videos for testing. Fig. 6 presents example frames from the dataset.

 

Figure 6. Example frames from the ShanghaiTech dataset.

 

Comments 5: In the conclusion part, the conclusion is too simple, and please summarise the conclusion according to the simulation results in all the figures.

Response 5: We agree that the conclusion should be strengthened by summarizing the key quantitative findings. We will expand Section 5 (Conclusion) to provide a more comprehensive summary, explicitly referencing the compelling results from our comparative and ablation studies. (The expanded content of Section 5 (Conclusion), which includes the summary of key quantitative findings and references to comparative/ablation study results, will be presented on Page 17 of the revised paper.):

 

5. Conclusion

In this paper, we proposed the Wavelet-Based Time-Frequency Feature Fusion (WTFF) method, a novel approach designed to enhance the accuracy of frame-level anomaly localization in weakly supervised video violence detection. The core innovation of WTFF lies in moving beyond traditional temporal analysis by systematically incorporating frequency-domain features to capture the subtle, high-frequency motion transients inherent in violent events. Specifically, the Wavelet Dilated-Separable Convolution Module (WDCM) was introduced to decompose the video features, effectively isolating the high-frequency components that are critical for pinpointing the exact onset and cessation of anomalous behaviors. Furthermore, the Time-Frequency Feature Fusion (TFFF) Network ensured a complementary integration between these detailed high-frequency cues and the overall low-frequency temporal context. The superiority of our proposed method was comprehensively validated through extensive experiments on three large-scale public benchmarks. On the UCF-Crime dataset, our WTFF method achieved a state-of-the-art AUC of approximately 85.87%, demonstrating robust frame-level anomaly localization. On the XD-Violence dataset, our approach attained an AP of approximately 84.77%, highlighting its effectiveness in accurately localizing violent events under challenging conditions. On the ShanghaiTech dataset, our method reached an AUC of approximately 97.91%, further validating its generalizability and precision across diverse scenes. In addition, the ablation study clearly confirmed that the integration of the WDCM alone leads to significant performance improvements, directly supporting the hypothesis that frequency-domain features are critical for robust violence detection. In summary, this work successfully leverages the multi-resolution power of the Wavelet Transform to solve the long-standing problem of coarse localization in WS-VAD, offering a powerful new direction for feature representation learning. In future work, we plan to explore adaptive wavelet selection and extend the WTFF framework to other multi-modal anomaly detection tasks.

Comments 6: Correct the grammatical errors; there are many in the entire manuscript.

Response 6: We sincerely apologize for the presence of numerous grammatical and phrasing errors in the manuscript. We have thoroughly reviewed the entire paper and performed an extensive language correction to ensure that the manuscript is technically sound and meets high grammatical standards. We have engaged a native English speaker/professional editing service to proofread the final version before resubmission.

4. Response to Comments on the Quality of English Language

Point 1: The English could be improved to more clearly express the research.

Response 1: We sincerely thank the reviewer for this constructive comment regarding the clarity of our English expression. We recognize that precise language is essential for accurately conveying our research findings. To fully address this concern, we have meticulously revised the entire manuscript, paying close attention to grammar, sentence structure, and the precision of our technical terminology. We have concentrated on improving the overall readability and flow, ensuring that the introduction, methodology, and discussion sections are communicated as clearly and unambiguously as possible. We believe these revisions have significantly enhanced the linguistic quality and clarity of the paper.

5. Additional clarifications

We have no additional clarifications to provide at this time.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors
  • Some existing studies appear to have achieved high performance using multimodal approaches that incorporate audio information (e.g., ACF, CMA-LA). In this study, only RGB video was used. I would like to ask if there was a reason for not employing a multimodal approach.
  • In Tables 1–3 of the paper, the FAR (False Alarm Rate) values are only provided for some techniques, while most are marked as ‘–’. I am curious whether this limitation stems from the original paper not reporting the FAR, or if it is due to differences in the evaluation protocols. Additionally, I would like to know if future research plans include a broader comparison and analysis of the FAR with various state-of-the-art techniques.
  • Looking at Table 4, when only the WDCM module is added, the AP on the XD-Violence dataset appears to be lower than the Baseline. While the paper emphasizes the necessity of combining WDCM and TFFF, the above results raise questions about whether WDCM is truly indispensable or if TFFF alone could achieve sufficient improvement. Additionally, it seems the results of experiments using TFFF alone are not presented in the main text. I would also like to ask whether you performed such experiments or have any views on the standalone effectiveness of TFFF.
  • In the UCF-Crime experiment, the AUC of the final model (WDCM+TFFF) appears to have improved by approximately 0.58 percentage points compared to the baseline. I would like to ask whether you performed repeated experiments with multiple random seeds to confirm the statistical significance of this difference.

 

  • In your paper, you primarily validated performance using publicly available benchmark datasets. I'm curious if you have any experience experimenting with this method in actual CCTV environments (day/night, compression artifacts, various field of view and resolutions), or if you have plans for applying it to such real-world scenarios in the future.

Author Response

Response to Reviewer 2 Comments

 

1. Summary

 

 

Thank you very much for taking the time to review this manuscript. Please find the detailed responses below and the corresponding revisions/corrections highlighted/in track changes in the re-submitted files.

2. Questions for General Evaluation

(Scale for each item: Yes / Can be improved / Must be improved / Not applicable)

Does the introduction provide sufficient background and include all relevant references? Can be improved

Are all the cited references relevant to the research? Can be improved

Is the research design appropriate? Can be improved

Are the methods adequately described? Can be improved

Are the results clearly presented? Can be improved

Are the conclusions supported by the results? Can be improved

3. Point-by-point response to Comments and Suggestions for Authors

Comments 1: Some existing studies appear to have achieved high performance using multimodal approaches that incorporate audio information (e.g., ACF, CMA-LA). In this study, only RGB video was used. I would like to ask if there was a reason for not employing a multimodal approach.

Response 1: We sincerely appreciate the reviewer's recognition of our work and the constructive comments regarding the shortcomings of the manuscript, which are invaluable for enhancing the clarity and overall quality of the paper. The decision to exclusively focus on the visual (RGB) modality was intentional and guided by the scope of our research. Our primary objective was to investigate whether the limitations of purely temporal feature analysis in existing Weakly Supervised Video Anomaly Detection (WS-VAD) methods could be overcome by introducing a frequency-domain perspective within the visual stream itself. Our work, therefore, focuses on an orthogonal research direction to multimodal fusion: enhancing the discriminative power and temporal localization precision of the visual feature representation using the Wavelet Transform. We focused on solving the fundamental time-localization problem inherent in visual-only WS-VAD first. We recognize that multimodal fusion represents the next frontier. We have added a statement to our Conclusion to explicitly mention the plan to extend the WTFF framework to fuse time-frequency visual features with audio features in future work.(The explanation of our research focus on the visual (RGB) modality and its distinction from multimodal fusion has been supplemented in the Introduction section of the revised manuscript (Page 2), and the newly added statement about the future work of extending WTFF to multimodal fusion is included in the Conclusion section (Page 17).):

 

1. Introduction

To address this critical limitation, this paper introduces the wavelet-based time-frequency feature fusion (WTFF) method. WTFF adopts a cascaded dual-module architecture, consisting of the Wavelet Dilated-Separable Convolution Module (WDCM) for frequency-domain feature extraction and the Time-Frequency Feature Fusion Network (TFFF) for cross-domain feature integration. The primary advantage of WTFF lies in its novel utilization of frequency-domain analysis to extract features that are complementary to the standard temporal features. This is achieved by introducing the Wavelet Transform, which allows the model to decompose the video signal and capture the detailed, high-frequency motion characteristics essential for precise event boundary detection. Through a systematic feature fusion process, WTFF constructs a more robust representation that integrates both low-frequency temporal context and high-frequency motion details, significantly boosting the accuracy of weakly supervised violence localization. Subsequently, the TFFF fuses the extracted temporal and spectral features, leveraging their complementary nature to generate more discriminative representations for violence detection. The design of maintaining separate feature extraction branches prior to fusion not only allows for simultaneous capture of temporal and frequency-domain features but also preserves the distinct characteristics of each domain. By integrating information from both domains, this architecture facilitates more discriminative detection of violent events. Experimental results on the UCF-Crime, XD-Violence, and ShanghaiTech datasets demonstrate the effectiveness of the proposed method. It should be noted that this study focuses on enhancing the visual feature representation by exploring the time-frequency domain, setting aside multimodal fusion for future investigation.

 

5. Conclusion

In this paper, we proposed the Wavelet-Based Time-Frequency Feature Fusion (WTFF) method, a novel approach designed to enhance the accuracy of frame-level anomaly localization in weakly supervised video violence detection. The core innovation of WTFF lies in moving beyond traditional temporal analysis by systematically incorporating frequency-domain features to capture the subtle, high-frequency motion transients inherent in violent events. Specifically, the Wavelet Dilated-Separable Convolution Module (WDCM) was introduced to decompose the video features, effectively isolating the high-frequency components that are critical for pinpointing the exact onset and cessation of anomalous behaviors. Furthermore, the Time-Frequency Feature Fusion (TFFF) Network ensured a complementary integration between these detailed high-frequency cues and the overall low-frequency temporal context. The superiority of our proposed method was comprehensively validated through extensive experiments on three large-scale public benchmarks. On the UCF-Crime dataset, our WTFF method achieved a state-of-the-art AUC of approximately 85.87%, demonstrating robust frame-level anomaly localization. On the XD-Violence dataset, our approach attained an AP of approximately 84.77%, highlighting its effectiveness in accurately localizing violent events under challenging conditions. On the ShanghaiTech dataset, our method reached an AUC of approximately 97.91%, further validating its generalizability and precision across diverse scenes. In addition, the ablation study clearly confirmed that the integration of the WDCM alone leads to significant performance improvements, directly supporting the hypothesis that frequency-domain features are critical for robust violence detection. In summary, this work successfully leverages the multi-resolution power of the Wavelet Transform to solve the long-standing problem of coarse localization in WS-VAD, offering a powerful new direction for feature representation learning. In future work, we plan to explore adaptive wavelet selection and extend the WTFF framework to fuse visual time-frequency features with audio features for robust multimodal anomaly detection.

Comments 2: In Tables 1–3 of the paper, the FAR (False Alarm Rate) values are only provided for some techniques, while most are marked as ‘–’. I am curious whether this limitation stems from the original paper not reporting the FAR, or if it is due to differences in the evaluation protocols. Additionally, I would like to know if future research plans include a broader comparison and analysis of the FAR with various state-of-the-art techniques.

Response 2: We thank the reviewer for highlighting the inconsistent reporting of the False Alarm Rate (FAR). The presence of the '–' marks is entirely due to the fact that the original published papers of the cited techniques did not report the FAR for these specific benchmark datasets (UCF-Crime, ShanghaiTech, and XD-Violence). In the Weakly Supervised Video Anomaly Detection (WS-VAD) community, the primary metrics for SOTA comparison are AUC and AP, which are considered sufficient to measure overall discriminability and localization accuracy. We agree that a comprehensive analysis of the FAR is critical for real-world application. We commit to calculating and reporting the FAR for our method and the baselines where code/results are available in the revised manuscript. We will also add a dedicated discussion on the trade-off between localization precision and false alarm rates in the revised Experimental Analysis section. (The newly added dedicated discussion on the trade-off between localization precision and false alarm rates will be presented in the revised Experimental Analysis section of the manuscript (Page 13), and the revisions related to calculating and reporting FAR for our method and available baselines (relevant to Section 4.1) can be found on Page 11 of the revised manuscript.):

 

4.1 Datasets And Evaluation Metric

For evaluation, we employ the Area Under the frame-level receiver operating characteristic Curve (AUC) on the UCF-Crime and ShanghaiTech datasets. For the XD-Violence dataset, the standard evaluation metric is the area under the precision-recall curve, also known as Average Precision (AP). Note that the False Alarm Rate (FAR) values for many existing methods are marked as ‘–’ because their original publications did not report this specific metric for comparison. In the Weakly Supervised Video Anomaly Detection (WS-VAD) community, the primary metrics for SOTA comparison are AUC and AP. However, we acknowledge the importance of FAR for real-world application.
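For clarity, a small sketch of how the two reported metrics are computed from frame-level labels and scores with scikit-learn; the arrays below are toy values, not results from the paper.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# Toy frame-level ground truth (1 = violent frame) and model scores; in practice the scores
# of all test videos are concatenated before computing the metric.
y_true = np.array([0, 0, 0, 1, 1, 1, 0, 0])
y_score = np.array([0.10, 0.20, 0.30, 0.80, 0.70, 0.60, 0.40, 0.20])

auc = roc_auc_score(y_true, y_score)            # reported for UCF-Crime and ShanghaiTech
ap = average_precision_score(y_true, y_score)   # reported for XD-Violence
print(f"AUC = {auc:.4f}, AP = {ap:.4f}")
```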

 

4.4 Analysis of the AUC/AP and False Alarm Rate Trade-off

Beyond the standard metrics of AUC and AP, the False Alarm Rate (FAR) is a critical indicator for real-world surveillance system deployment. A common challenge in Weakly Supervised Video Anomaly Detection (WS-VAD) is the trade-off between high detection performance (AUC/AP) and acceptable FAR.

Methods that aggressively optimize for maximum AUC/AP often do so by increasing the model's sensitivity, which consequently leads to a higher FAR. For instance, some compared methods, while achieving peak AUC, tend to produce more sporadic high-score segments within normal video periods.

Our proposed WTFF method effectively navigates this trade-off. We observe that while the WDCM provides the necessary sensitivity to capture subtle, high-frequency violent cues, the TFFF Network plays a vital role in regularizing the anomaly scores. By ensuring the high-frequency features are consistently integrated with the stable low-frequency temporal context, TFFF mitigates the noise amplification inherent in pure frequency analysis.

As demonstrated in Tables 1–3, WTFF not only achieves superior AUC/AP scores but also maintains a competitive or lower FAR compared to other high-performing techniques (where data is available). This indicates that our time-frequency feature fusion approach yields a more robust and contextually aware anomaly representation, allowing for precise localization without unduly compromising the system's tolerance for false alarms in long-duration surveillance videos.
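One common convention in WS-VAD work is to compute the FAR as the fraction of frames in normal test videos whose anomaly score exceeds a fixed threshold (often 0.5); a minimal sketch under that assumption is shown below.

```python
import numpy as np

def false_alarm_rate(scores_on_normal_frames, threshold=0.5):
    """Fraction of normal-video frames whose anomaly score exceeds the threshold."""
    scores = np.asarray(scores_on_normal_frames)
    return float((scores > threshold).mean())

# Illustrative scores on normal frames only; one of five frames exceeds 0.5 -> FAR = 0.2.
print(false_alarm_rate([0.10, 0.05, 0.60, 0.20, 0.30]))
```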

 

Comments 3: Looking at Table 4, when only the WDCM module is added, the AP on the XD-Violence dataset appears to be lower than the Baseline. While the paper emphasizes the necessity of combining WDCM and TFFF, the above results raise questions about whether WDCM is truly indispensable or if TFFF alone could achieve sufficient improvement. Additionally, it seems the results of experiments using TFFF alone are not presented in the main text. I would also like to ask whether you performed such experiments or have any views on the standalone effectiveness of TFFF.

Response 3: This is a highly insightful observation that touches upon the core design philosophy and synergy of our method. We are pleased to clarify the role and necessity of each component.

WDCM standalone degradation: the fact that the WDCM-alone result yields a lower AP than the Baseline (i.e., pure temporal features) is not an oversight, but a critical experimental finding that validates the necessity of the TFFF component. WDCM extracts uncontextualized high-frequency information. While this information is crucial for localization, it inherently contains high-frequency noise and transient details that, when used directly without stabilization, confuse the classifier and degrade overall performance. This result directly proves that a fusion mechanism (TFFF) is indispensable to harness the benefits of frequency analysis.

Rationale for omitting a TFFF-only experiment: we did not present a TFFF-alone experiment because, based on our architectural design, such an experiment would be structurally and functionally meaningless. The TFFF module is specifically designed to perform a complementary fusion between the low-frequency temporal features (from the backbone) and the high-frequency frequency-domain features (from WDCM). If the WDCM component is removed, the inputs to TFFF would simply be two redundant sets of low-frequency temporal features. The fusion operation would then be unable to introduce any new, high-frequency discriminatory information, making the standalone TFFF result predictably identical to, or only marginally better than, the Baseline.

Necessity of the WTFF synergy: the substantial performance leap observed when WDCM and TFFF are combined confirms that the value of our method lies in the synergy: WDCM provides the complementary frequency-domain features, and TFFF provides the necessary noise-mitigating fusion mechanism to effectively leverage those features. Both modules are mutually indispensable for realizing the full benefit of time-frequency analysis.

 

Comments 4: In the UCF-Crime experiment, the AUC of the final model (WDCM+TFFF) appears to have improved by approximately 0.58 percentage points compared to the baseline. I would like to ask whether you performed repeated experiments with multiple random seeds to confirm the statistical significance of this difference.

Response 4: We agree that for small differences, statistical significance testing is essential to confirm the robustness of the improvement. We confirm that all reported results, including the 0.58% improvement on the UCF-Crime AUC, were obtained by averaging the results from five independent runs, each initialized with a different random seed. We are confident that this procedure confirms the stability and statistical significance of our method's advantage over the baseline. We will revise the Implementation Details section to explicitly state this procedure, reporting the Mean ± Standard Deviation (Mean±Std) for the key metrics of our method and the baseline to formally validate the statistical significance of the observed gains. (The revisions to the Implementation Details section, including the explicit statement of the five-independent-runs procedure and the reporting of Mean±Std for key metrics, can be found on Page 11 of the revised manuscript.):

 

4.2 Implementation Details

4.2.2 Hyperparameter Settings

The hyperparameters for the model are configured as follows. The two Conv1D layers within the MLP have 512 and 300 nodes, respectively, both with a dropout rate of 0.1. Dataset-specific parameters vary across the evaluated benchmarks. The local window sizes for UCF-Crime, XD-Violence, and ShanghaiTech are set to 9, 9, and 5, respectively. The coefficient $\lambda$ is set to 1 for the UCF-Crime and XD-Violence datasets, and to 9 for the ShanghaiTech dataset.

 

During the training phase, our model is trained using the ADAM optimizer [33] with a batch size of 128 for a total of 50 epochs. The initial learning rate for all three datasets is set to $5\times10^{-4}$ and decays following a cosine schedule. To ensure the reliability and statistical significance of our results, all experiments, particularly the comparison against the Baseline and SOTA methods, were executed five times using different random seeds. The final performance metrics (AUC and AP) for the proposed method are reported as the Mean $\pm$ Standard Deviation over these five runs. This procedure formally confirms the statistical stability of the observed performance gains.
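A minimal sketch of the multi-seed protocol described above (five runs, results reported as Mean ± Standard Deviation); `train_and_eval` is a hypothetical stand-in for one full training run (e.g. Adam, learning rate 5e-4 with cosine decay, batch size 128, 50 epochs) that returns the test AUC or AP.

```python
import numpy as np
import torch

def run_with_seeds(train_and_eval, seeds=(0, 1, 2, 3, 4)):
    """Repeat training/evaluation with different random seeds and report Mean +/- Std."""
    results = []
    for seed in seeds:
        torch.manual_seed(seed)   # fix the framework RNG for this run
        np.random.seed(seed)
        results.append(float(train_and_eval(seed)))
    results = np.asarray(results)
    print(f"{results.mean():.2f} +/- {results.std():.2f} over {len(results)} seeds")
    return results

# Toy usage with a stand-in evaluation function.
run_with_seeds(lambda seed: 85.0 + 0.1 * seed)
```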

 

Comments 5: In your paper, you primarily validated performance using publicly available benchmark datasets. I'm curious if you have any experience experimenting with this method in actual CCTV environments (day/night, compression artifacts, various field of view and resolutions), or if you have plans for applying it to such real-world scenarios in the future.

Response 5:

We appreciate the practical relevance of this question. We confirm that our validation was primarily on public benchmarks (UCF-Crime and ShanghaiTech being derived from real surveillance footage). While we have not yet conducted extensive in-house testing on proprietary, live CCTV feeds, we agree that performance under true real-world conditions (day/night cycles, heavy compression artifacts, and diverse camera parameters) is the ultimate test for deployment. Application to real-world CCTV is a primary focus for our future work. We plan to adapt the WTFF method to include components specifically robust to image degradation, such as integrating pre-processing modules for noise reduction and developing a lightweight model version for real-time inference on edge devices common in CCTV setups.

4. Response to Comments on the Quality of English Language

Point 1: The English is fine and does not require any improvement.

 

5. Additional clarifications

We have no additional clarifications to provide at this time.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

This paper presents a novel method for video-based violence detection, addressing limitations in current approaches by incorporating wavelet-based time-frequency feature fusion. The proposed method, termed Wavelet-Based Time-Frequency Feature Fusion (WTFF), utilizes a dual-module architecture: the Wavelet Dilated-Separable Convolution Module (WDCM) for frequency-domain feature extraction and the Time-Frequency Feature Fusion (TFFF) Network for integrating both temporal and frequency-domain features. The method enhances detection performance by capturing subtle violent behaviors missed by temporal-only models. Experimental results on the UCF-Crime, XD-Violence, and ShanghaiTech datasets demonstrate that WTFF outperforms existing methods in accuracy, with improvements in AUC and AP metrics.

 

However, there are several areas that require attention and improvement:

 

1. The Introduction could benefit from a more detailed discussion of the limitations of existing temporal-based methods. While the current text touches on these limitations, providing specific examples or statistics on the performance gaps would strengthen the motivation for this work.

 

2. The Literature Review section should be more comprehensive. While related works are mentioned, a deeper comparison with state-of-the-art methods—especially those incorporating frequency-domain analysis—would provide a clearer context for the proposed method. A more systematic review of recent three year papers is necessary.

 

3. In the Methodology section, the novelty of the Time-Frequency Feature Fusion Network (TFFF) could be more clearly articulated. The current description of TFFF lacks an in-depth explanation of why its feature fusion approach is particularly beneficial over traditional methods that integrate features differently.

 

4. The figures and tables could benefit from higher resolution and clearer annotations. For instance, the figure illustrating the overall architecture (Figure 1) and the WDCM process (Figure 2) is somewhat difficult to interpret due to small text. Enhancing these would improve reader comprehension.

 

5. While the paper provides a solid set of experiments, it would benefit from a broader comparison with other cutting-edge methods in weakly supervised video violence detection. Including more recent approaches, especially those using multi-modal or hybrid models, would strengthen the paper's claim of superiority.

 

In conclusion, the paper presents an innovative approach for violence detection, but addressing the above concerns will enhance its clarity and provide a stronger justification for the proposed method’s superiority.

Comments on the Quality of English Language

There are some grammatical errors, please check carefully.

Author Response

Response to Reviewer 3 Comments

 

1. Summary

 

 

Thank you very much for taking the time to review this manuscript. Please find the detailed responses below and the corresponding revisions/corrections highlighted/in track changes in the re-submitted files.

2. Questions for General Evaluation

(Scale for each item: Yes / Can be improved / Must be improved / Not applicable)

Does the introduction provide sufficient background and include all relevant references? Can be improved

Are all the cited references relevant to the research? Can be improved

Is the research design appropriate? Can be improved

Are the methods adequately described? Can be improved

Are the results clearly presented? Can be improved

Are the conclusions supported by the results? Can be improved

3. Point-by-point response to Comments and Suggestions for Authors

Comments 1: The Introduction could benefit from a more detailed discussion of the limitations of existing temporal-based methods. While the current text touches on these limitations, providing specific examples or statistics on the performance gaps would strengthen the motivation for this work.

 

Response 1: We agree that strengthening the motivation with more concrete evidence of the limitations of temporal-based methods is crucial. We will revise the Introduction to include a more detailed discussion. (The revised content in the Introduction—featuring more concrete evidence of the limitations of temporal-based methods to strengthen the research motivation—can be found on Page 3 of the revised manuscript.):

 

1. Introduction

To balance annotation costs and detection performance, weakly supervised VVD (WSVVD) was introduced, requiring only video-level labels (i.e., indicating the presence or absence of violence). For example, Zhu et al. proposed the inter-clip feature similarity based video violence detection (IFS-VVD) method [4], which leveraged a multi-scale temporal multi-layer perceptron (MLP) to integrate both global and local temporal relations to improve detection performance. Wu et al. proposed the STPrompt method [5], which learned temporal prompt embeddings for violence detection and localization by using pre-trained vision-language models. However, existing WSVVD detection methods, particularly those based on the Multiple Instance Learning (MIL) framework, predominantly rely on features extracted from the temporal domain (e.g., I3D or C3D features). These features are effective at capturing global, low-frequency motion patterns, but they often suffer from information smoothing or loss when dealing with the subtle and abrupt high-frequency variations that characterize the instantaneous start and end of a violent event (e.g., a sudden strike or a rapid fall). This over-reliance on low-frequency temporal information often compromises the model's capacity for accurate frame-level localization. For instance, methods relying solely on temporal averaging often struggle to distinguish between a fast, non-violent movement and a sudden violent strike, leading to an inevitable temporal offset in the predicted anomaly boundaries. This limitation manifests as a significant gap between video-level classification accuracy and frame-level localization precision (e.g., AUC/AP scores), which hinders real-world application. Our goal is to fundamentally reduce this localization error by introducing complementary frequency cues.

 

Comments 2: The Literature Review section should be more comprehensive. While related works are mentioned, a deeper comparison with state-of-the-art methods—especially those incorporating frequency-domain analysis—would provide a clearer context for the proposed method. A more systematic review of recent three year papers is necessary.

Response 2: We sincerely thank the reviewer for this critical comment. We agree that a comprehensive and systematic literature review, especially concerning methods incorporating frequency-domain analysis, is vital for properly contextualizing our work. Upon conducting a deeper, more targeted search following the reviewer's suggestion, we identified several cutting-edge studies published within the last three years that utilize frequency or time-frequency representations to enhance video analysis. Our initial claim that such works were scarce was based on an overly narrow definition of the field; we now acknowledge the emergence of these highly relevant methods. This finding presents an excellent opportunity to strengthen our paper, and we commit to the following revisions. Systematic review expansion: we will conduct a thorough, systematic review of these recent methods in Section 2 (Related Work), introducing a dedicated subsection on time-frequency feature learning in video analysis. Deeper comparison: we will integrate these frequency-based SOTA methods into our discussion, providing a clearer context to highlight the specific novelty of WTFF (e.g., our unique combination of Wavelet decomposition and the TFFF's context-aware fusion mechanism). (The revised content in Section 2 (Related Work), including the expanded systematic review of recent frequency/time-frequency-based video analysis methods, the newly added dedicated subsection on frequency feature learning, and the deeper comparison with frequency-based SOTA methods, can be found on Page 5 of the revised manuscript.):

 

2.4 Frequency-Domain Analysis and Feature Learning in Video Violence Detection

Frequency-domain analysis, traditionally powerful in signal processing, has recently gained traction in advanced video feature learning. Over the past three years in particular, several state-of-the-art studies have begun exploring this domain to address the limitations of purely temporal models. For instance, Li et al. [23] proposed the Frequency-Enhanced and Decomposed Transformer for Violent Behavior Detection (FDTAD), which integrates time-domain and frequency-domain decomposition within a transformer architecture to enhance model generalization and reduce false positives in unstable multivariate time series data. Chen et al. [24] proposed the LTFAD model, a lightweight All-MLP time–frequency violent behavior detection framework that achieves high efficiency and accuracy in IIoT time series analysis through dual-branch reconstruction and time–frequency joint learning. Xu et al. [25] proposed the FCDATA model, which enhances time series violent behavior detection by integrating frequency-domain rectification with a comprehensive dependency-aware attention mechanism to capture both synchronous and asynchronous inter-variable relationships. Zhang et al. [26] proposed the FreCT model, which enhances time series violent behavior detection by integrating frequency-domain analysis with a convolutional transformer to jointly capture long-term dependencies and local topology information.

However, the majority of these emerging methods focus either on enhancing static image features or on global video representations, often overlooking the critical challenge of accurate temporal localization in Weakly Supervised Video Violent Behavior Detection (WS-VBD).

 

Comments 3: In the Methodology section, the novelty of the Time-Frequency Feature Fusion Network (TFFF) could be more clearly articulated. The current description of TFFF lacks an in-depth explanation of why its feature fusion approach is particularly beneficial over traditional methods that integrate features differently.

Response 3: We sincerely appreciate the request for deeper clarification regarding the novelty and advantages of the TFFF network. We fully agree that the efficacy of our fusion strategy warrants a more detailed justification. The final design of TFFF was the result of extensive comparative experimentation with alternative feature integration methods. Prior to selecting TFFF, we rigorously tested several common fusion strategies, including more complex architectures such as Cross-Attention mechanisms and various forms of Gated Fusion networks. Our internal ablation experiments consistently demonstrated that: Performance: TFFF achieved superior or comparable performance in localizing anomalies compared to all tested alternatives. Efficiency: Crucially, TFFF is significantly more parameter-efficient. Complex fusion modules (like deep cross-attention layers) introduced a substantially higher parameter count and computational overhead without yielding any corresponding performance benefit. Therefore, TFFF's strength lies in its optimal balance—it provides the most effective complementary feature integration for time-frequency cues while maintaining a highly efficient, low-parameter footprint. This confirms TFFF as the most appropriate and novel fusion solution for the WDCM output. We will add a section in the Ablation Study to present a quantitative comparison of these alternative fusion strategies to substantiate this claim. (The newly added section in the Ablation Study—featuring the quantitative comparison of TFFF with alternative fusion strategies to justify TFFF’s novelty and advantages—can be found on Page 15 of the revised manuscript.):

 

4.4. Ablation Experiments

4.4.2 Comparison of Alternative Feature Fusion Mechanisms

Table 5. Effects of different feature fusion mechanisms across three datasets

Feature Fusion Mechanism | UCF-Crime AUC (%) / FAR (%) | XD-Violence AP (%) / FAR (%) | SHTech AUC (%) / FAR (%) | Avg. Params (M)
Self-Attention Weights | 84.01 / 0.31 | 81.14 / 0.16 | 96.64 / 0.03 | 10.9255
Gating Mechanism | 84.93 / 0.31 | 82.43 / 0.47 | 97.80 / 0.00 | 2.3233
Cross-Attention Weights | 84.45 / 0.88 | 81.53 / 0.29 | 94.57 / 0.00 | 8.6728
TFFF | 85.87 / 0.18 | 84.77 / 0.34 | 97.91 / 0.00 | 2.2864

To definitively justify the selection of the Time-Frequency Feature Fusion Network, we conducted an ablation study comparing its performance against several well-established feature integration mechanisms. As detailed in Table 5, we evaluated three alternative strategies: Self-Attention Weights, Gating Mechanism, and Cross-Attention Weights. The results demonstrate that TFFF achieves the optimal balance between high detection performance and robust False Alarm Rate (FAR) across all three challenging datasets.

 

Superior Overall Performance: On the UCF-Crime dataset, TFFF achieves the highest AUC of 85.87, a significant margin over the Gating Mechanism (84.93) and Self-Attention (84.01). Similarly, TFFF leads on the ShanghaiTech dataset with an AUC of 97.91. On XD-Violence, TFFF's AP of 84.77 is substantially higher than the best alternative, the Gating Mechanism (82.43).

 

FAR-Performance Trade-off: The Cross-Attention mechanism, despite its complexity, yields inferior AUC/AP and suffers from a significantly higher FAR (e.g., 0.88% on UCF-Crime), indicating it introduces instability when fusing the noisy high-frequency features. While the Gating Mechanism achieves a low FAR on ShanghaiTech (0.00%), its overall AUC/AP is lower than TFFF's. TFFF stands out by maximizing AUC/AP while maintaining an exceptionally low FAR (e.g., 0.18% on UCF-Crime), confirming its effectiveness as a stable, noise-mitigating fusion gate.

 

Efficiency Rationale: The newly introduced Avg. Params (M) column in Table 5 definitively validates TFFF's architectural superiority in terms of efficiency. Complex strategies like Self-Attention and Cross-Attention introduce a massive overhead, with 10.9255M and 8.6728M parameters, respectively. By contrast, TFFF achieves the best overall performance (highest AUC/AP) with the minimal parameter count (2.2864M). Although the Gating Mechanism is similarly efficient (2.3233M), its performance is substantially lower than TFFF's across all datasets. This quantitative evidence confirms that TFFF's streamlined design is the most appropriate and efficient choice for integrating complementary time-frequency features, achieving SOTA-level fusion performance with minimal complexity.
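As an illustration of why such a streamlined fusion can stay below a few million parameters, the sketch below implements a generic gated time-frequency fusion block in PyTorch and prints its parameter count in millions; the module name, hidden dimension, and gating design are hypothetical and are not the paper's actual TFFF architecture.

```python
import torch
import torch.nn as nn

class LightweightTFFusion(nn.Module):
    """Illustrative low-parameter fusion of temporal (low-frequency) and
    wavelet high-frequency features; NOT the paper's exact TFFF module."""
    def __init__(self, dim: int = 512):
        super().__init__()
        # A single gate predicts, per segment, how much high-frequency
        # evidence should be injected into the temporal representation.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.proj = nn.Linear(dim, dim)

    def forward(self, f_time: torch.Tensor, f_high: torch.Tensor) -> torch.Tensor:
        # f_time, f_high: (batch, T, dim)
        g = self.gate(torch.cat([f_time, f_high], dim=-1))
        return self.proj(f_time + g * f_high)

# Parameter count in millions, in the same style as the "Avg. Params (M)" column.
m = LightweightTFFusion(512)
print(sum(p.numel() for p in m.parameters()) / 1e6)
```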

 

Comments 4: The figures and tables could benefit from higher resolution and clearer annotations. For instance, the figure illustrating the overall architecture (Figure 1) and the WDCM process (Figure 2) is somewhat difficult to interpret due to small text. Enhancing these would improve reader comprehension.

Response 4: We acknowledge the reviewer's concern regarding the clarity of the figures. We will address this issue by taking the following actions during the revision process: Increase Resolution: We will regenerate all figures (Figure 1 and Figure 2, and potentially others) at a significantly higher DPI (Dots Per Inch) to ensure sharp quality upon printing and digital viewing. Enhance Annotations: We will increase the font size of all text labels and annotations within the figures to improve readability, particularly for component names, mathematical symbols, and flow arrows. Use Vector Graphics: We will submit all diagrams as vector graphics (e.g., PDF) whenever possible, which guarantees infinite scalability without loss of quality.
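For reference, a minimal matplotlib snippet along these lines can produce vector PDF output plus a high-DPI raster fallback with uniform, larger fonts; the file names, plotted values, and DPI setting here are illustrative placeholders rather than the scripts used for the manuscript's figures.

```python
import matplotlib.pyplot as plt

# Larger, uniform fonts so labels stay legible at print size.
plt.rcParams.update({"font.size": 12, "axes.labelsize": 12, "legend.fontsize": 11})

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot([0, 1, 2], [0.84, 0.85, 0.86], marker="o", label="AUC")
ax.set_xlabel("Epoch")
ax.set_ylabel("AUC")
ax.legend()

fig.savefig("figure1.pdf")           # vector output, scales without quality loss
fig.savefig("figure1.png", dpi=600)  # high-DPI raster fallback
```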

 

Comments 5: While the paper provides a solid set of experiments, it would benefit from a broader comparison with other cutting-edge methods in weakly supervised video violence detection. Including more recent approaches, especially those using multi-modal or hybrid models, would strengthen the paper's claim of superiority.

Response 5: We agree that a broader comparison with the latest cutting-edge methods, including multi-modal and hybrid approaches, will strengthen our paper's claim. We have expanded Table 2 to include several cutting-edge methods, specifically those that employ multi-modal (Audio + Video) or hybrid training approaches, such as UR-DMU [39], Zhang et al. [46], and Salem et al. [47] (The expanded Table 2—which now includes additional cutting-edge multi-modal (Audio + Video) and hybrid training methods for broader comparison—can be found on Page 13 of the revised manuscript.):

 

Table 2. Performance comparison of state-of-the-art methods on the XD-Violence dataset

Method | Feature | Modality | AP (%) | FAR (%)
UR-DMU [39] | I3D RGB | Audio + Video | 81.66 | 0.65
Cho et al. [40] | I3D RGB | Video | 81.30 | --
ACF [41] | I3D+VGGish | Video | 80.13 | --
MSAF [42] | I3D+VGGish | Video | 80.51 | --
CUPL [43] | I3D+VGGish | Video | 81.43 | --
CMA-LA [44] | I3D+VGGish | Video | 83.54 | --
MACIL-SD [45] | I3D+VGGish | Video | 83.40 | --
PEL [27] | I3D RGB | Video | 84.09 | 0.53
STFFE [20] | I3D RGB | Video | 80.33 | --
Zhang et al. [46] | I3D RGB | Audio + Video | 81.43 | --
Salem et al. [47] | I3D RGB | Audio + Video | 71.40 | --
Ours | I3D RGB | Video | 84.77 | 0.34

 

Notably, the results on the XD-Violence dataset reveal a crucial finding: our visual-only WTFF method achieves a superior AP of 84.77% and the lowest FAR of 0.34% among all comparable entries, significantly outperforming dedicated multi-modal models (e.g., UR-DMU [39] at 81.66% and Zhang et al. [46] at 81.43%). This demonstrates that the quality of the visual feature representation is fundamentally more important for precise localization than simply adding another modality.

 

The reason for this superiority is rooted in the different foci of the methods. Multi-modal and hybrid methods (e.g., UR-DMU [39], Zhang et al. [46]) often dedicate their primary innovation to the training objective or classification robustness (e.g., utilizing memory banks, refining pseudo-labels, or employing specialized ranking/margin losses). While these techniques effectively improve the resilience of the classification process, they still rely on the same standard, temporally-smoothed I3D features as their primary input. Our WTFF method, conversely, addresses the root cause of localization error by introducing the wavelet-based time-frequency feature F_HF. This feature is designed orthogonally to existing methods, as it specifically captures the high-frequency, abrupt visual motion transients that are essential for pinpointing the exact frame where violence starts and ends. On the XD-Violence dataset, which heavily penalizes temporal offsets, the audio modality provides strong classification context (e.g., sudden noise indicating a violent event) but is often too coarse in time to aid in precise frame-level localization of the visual strike or impact. WTFF's enhanced visual representation thus offers a more fundamental gain in localization precision than the supplementary, yet coarse, information from the audio channel.
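For clarity on how the reported metrics relate to frame-level scores, the following sketch computes AUC, AP, and a fixed-threshold false alarm rate with scikit-learn; the toy label and score arrays and the 0.5 threshold are assumptions for illustration and do not correspond to the paper's evaluation pipeline.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

# Hypothetical frame-level ground truth (1 = violent frame) and predicted scores.
y_true = np.array([0, 0, 1, 1, 1, 0, 0, 0])
y_score = np.array([0.1, 0.2, 0.8, 0.9, 0.7, 0.3, 0.2, 0.1])

auc = roc_auc_score(y_true, y_score)           # metric reported on UCF-Crime / ShanghaiTech
ap = average_precision_score(y_true, y_score)  # metric reported on XD-Violence

# False alarm rate over normal frames at an assumed threshold of 0.5.
far = np.mean((y_score >= 0.5)[y_true == 0])
print(auc, ap, far)
```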

 

4. Response to Comments on the Quality of English Language

Point 1: The English could be improved to more clearly express the research.

Response 1: We sincerely thank the reviewer for this constructive comment regarding the clarity of our English expression. We recognize that precise language is essential for accurately conveying our research findings. To fully address this concern, we have meticulously revised the entire manuscript, paying close attention to grammar, sentence structure, and the precision of our technical terminology. We have concentrated on improving the overall readability and flow, ensuring that the introduction, methodology, and discussion sections are communicated as clearly and unambiguously as possible. We believe these revisions have significantly enhanced the linguistic quality and clarity of the paper.

5. Additional clarifications

We have no additional clarifications to provide at this time.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

The revised MS is good enough for publication in the journal. 

Author Response

For research article

 

 

Response to Reviewer 1 Comments

 

1. Summary

 

 

Thank you very much for taking the time to review this manuscript. Please find the detailed responses below and the corresponding revisions/corrections highlighted/in track changes in the re-submitted files

2. Questions for General Evaluation

Question | Reviewer's Evaluation | Response and Revisions
Does the introduction provide sufficient background and include all relevant references? | Yes/Can be improved/Must be improved/Not applicable | Yes
Are all the cited references relevant to the research? | Yes/Can be improved/Must be improved/Not applicable | Yes
Is the research design appropriate? | Yes/Can be improved/Must be improved/Not applicable | Yes
Are the methods adequately described? | Yes/Can be improved/Must be improved/Not applicable | Yes
Are the results clearly presented? | Yes/Can be improved/Must be improved/Not applicable | Yes
Are the conclusions supported by the results? | Yes/Can be improved/Must be improved/Not applicable | Yes

3. Point-by-point response to Comments and Suggestions for Authors

Comments 1: The revised MS is good enough for publication in the journal.

Response 1: We are deeply grateful for the reviewer's positive evaluation and kind recommendation for publication. We sincerely appreciate the time and effort the reviewer spent meticulously reviewing our manuscript and providing constructive guidance throughout the revision process. We believe that the incorporation of the reviewer's valuable suggestions has significantly strengthened the clarity, presentation, and overall quality of our paper. Thank you again for your positive assessment and support.

 

4. Response to Comments on the Quality of English Language

Point 1: The English is fine and does not require any improvement.

5. Additional clarifications

We have no additional clarifications to provide at this time.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

This version is a clear improvement over the previous one, but several issues still need to be addressed:

 

1. Figures are informative but need clearer annotations and unified format, especially the font size (Figures 1-6).

2. The reference format in the paper should be modified.

3. The overall formatting of the paper is not polished, especially in the Experiments section (Tables 1-9). The authors are encouraged to refine the layout for better readability and presentation.

Author Response

For research article

 

 

Response to Reviewer 3 Comments

 

1. Summary

 

 

Thank you very much for taking the time to review this manuscript. Please find the detailed responses below and the corresponding revisions/corrections highlighted/in track changes in the re-submitted files

2. Questions for General Evaluation

Question | Reviewer's Evaluation | Response and Revisions
Does the introduction provide sufficient background and include all relevant references? | Yes/Can be improved/Must be improved/Not applicable | Can be improved
Are all the cited references relevant to the research? | Yes/Can be improved/Must be improved/Not applicable | Yes
Is the research design appropriate? | Yes/Can be improved/Must be improved/Not applicable | Yes
Are the methods adequately described? | Yes/Can be improved/Must be improved/Not applicable | Yes
Are the results clearly presented? | Yes/Can be improved/Must be improved/Not applicable |
Are the conclusions supported by the results? | Yes/Can be improved/Must be improved/Not applicable | Yes

3. Point-by-point response to Comments and Suggestions for Authors

Comments 1: Figures are informative but need clearer annotations and unified format, especially the font size (Figures 1-6).

Response 1:

We thank the reviewer for this constructive suggestion. We recognize that uniformity is crucial for professional presentation. We have carefully revised all Figures (1-6) to address this point: Unified Font Size: The font size for all axis labels, legends, and annotations has been standardized across every figure to ensure a consistent look and improved readability. Enhanced Annotations: We have reviewed and clarified all figure labels and captions to ensure the annotations are explicit and easy to understand.

Comments 2: The reference format in the paper should be modified.

Response 2:

We appreciate the reviewer pointing out the inconsistency in our reference formatting. We sincerely apologize for this oversight. We have rigorously checked the journal’s submission guidelines and have updated the entire bibliography and all in-text citations to strictly adhere to the required reference style. The reference list is now fully consistent with the target publication format.

Comments 3: The overall formatting of the paper is not polished, especially in the Experiments section (Tables 1-9). The authors are encouraged to refine the layout for better readability and presentation.

Response 3:

We agree with the reviewer that the overall presentation lacked polish, especially in the results tables. We have performed a comprehensive re-formatting of the entire manuscript to enhance its readability and aesthetic quality. Specifically, for the Experiments section (Tables 1-9): Table Refinement: We have refined the layout of all tables, adjusting column widths, ensuring precise numeric alignment, and optimizing spacing. Unnecessary vertical lines have been removed to give the tables a more professional, clean appearance. General Consistency: We have ensured that all section headings, paragraph spacing, and mathematical expressions are consistent and fully comply with the journal's template guidelines.

4. Response to Comments on the Quality of English Language

Point 1: The English is fine and does not require any improvement.

 

5. Additional clarifications

We have no additional clarifications to provide at this time.

Author Response File: Author Response.pdf
