Enhancing Weakly Supervised Video Anomaly Detection with Object-Centric Features
Abstract
1. Introduction
- Object-level information, such as object motion and count, is incorporated into training effectively and efficiently by combining the base model with supplementary object detection features. The approach can be adapted to any existing object detector, which gives it significant potential for future applications;
- A new feature format is designed to adapt the raw YOLO outputs to the proposed framework so that they concatenate cleanly with the existing pre-trained I3D features;
- Attention-based mechanisms are incorporated to capture contextual information and enhance the quality of the extracted object features, improving performance over previous methods;
- Experiments are conducted on two benchmark datasets, ShanghaiTech [12] and UCF-Crime [13]. The proposed method outperforms the baseline model on both, offering a promising alternative for weakly supervised video anomaly detection that previous methods have not fully explored.
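To make the first two contributions concrete, the following is a minimal sketch of how per-frame detector outputs could be summarised into a fixed-length vector per snippet and concatenated with I3D features. It is not the paper's feature format: the three statistics (mean object count, mean box area, mean centroid displacement), the box layout, and the names `snippet_object_features` and `fuse` are assumptions made purely for illustration.

```python
import numpy as np


def snippet_object_features(detections_per_frame):
    """Summarise detector outputs for one snippet as a fixed-length vector.

    detections_per_frame: non-empty list with one array per frame, each of
    shape (n_i, 4) holding boxes as (x_center, y_center, width, height),
    normalised to [0, 1]. This layout is an assumption for the sketch.
    Returns [mean object count, mean box area, mean centroid displacement].
    """
    counts, areas, centroids = [], [], []
    for boxes in detections_per_frame:
        boxes = np.asarray(boxes, dtype=np.float64).reshape(-1, 4)
        counts.append(len(boxes))
        if len(boxes) > 0:
            areas.append(float(np.mean(boxes[:, 2] * boxes[:, 3])))
            centroids.append(boxes[:, :2].mean(axis=0))
        else:
            # Frames without detections contribute zero area and no centroid.
            areas.append(0.0)
            centroids.append(None)

    # Motion proxy: shift of the mean centroid between consecutive frames
    # that both contain at least one object.
    shifts = [
        float(np.linalg.norm(b - a))
        for a, b in zip(centroids[:-1], centroids[1:])
        if a is not None and b is not None
    ]
    mean_shift = float(np.mean(shifts)) if shifts else 0.0
    return np.array([np.mean(counts), np.mean(areas), mean_shift])


def fuse(i3d_features, object_features):
    """Concatenate per-snippet I3D features (T, D) with object features (T, K)."""
    i3d_features = np.asarray(i3d_features)
    object_features = np.asarray(object_features)
    if i3d_features.shape[0] != object_features.shape[0]:
        raise ValueError("both feature streams need the same number of snippets")
    return np.concatenate([i3d_features, object_features], axis=1)
```

Because the object vector has a fixed length regardless of how many objects are detected, it can be appended to each snippet's I3D feature without changing the downstream network beyond its input width.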
2. Related Works
2.1. Video Anomaly Detection
2.2. Object Detection
3. Proposed Methods
3.1. Generation of Additional Features
3.2. Feature Pre-Processing
3.3. Fusion of Features
3.4. Training Phase
4. Experiments and Discussions
4.1. Datasets
4.2. Metrics
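The result tables in this article report AUC (%). As a hedged illustration of that metric, the sketch below computes the area under the ROC curve from anomaly scores and binary labels using its pairwise-ranking definition. The function name `roc_auc_percent` is hypothetical, and treating each frame as one sample is an assumption; the article's exact evaluation protocol is not reproduced here.

```python
import numpy as np


def roc_auc_percent(labels, scores):
    """Area under the ROC curve, in percent.

    Uses the ranking definition: the probability that a randomly chosen
    anomalous sample scores higher than a randomly chosen normal one,
    counting ties as one half.
    """
    labels = np.asarray(labels).astype(bool)
    scores = np.asarray(scores, dtype=np.float64)
    pos = scores[labels]
    neg = scores[~labels]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("AUC needs at least one positive and one negative sample")
    # All positive/negative pairs; memory grows as len(pos) * len(neg),
    # so this form is only suitable for small illustrative inputs.
    diff = pos[:, None] - neg[None, :]
    wins = np.count_nonzero(diff > 0) + 0.5 * np.count_nonzero(diff == 0)
    return 100.0 * wins / diff.size
```

For full-length videos a rank-based or threshold-sweep implementation is preferable, since the pairwise matrix becomes large.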
4.3. Implementation Details
4.4. Benchmark Performance
4.5. Ablation Study
4.5.1. Comparison with Feature Formats
4.5.2. Comparison with Object Detectors
4.6. Qualitative Analysis
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
1. Pang, G.; Shen, C.; Cao, L.; Hengel, A.v.d. Deep Learning for Anomaly Detection: A Review. ACM Comput. Surv. 2022, 54, 38.
2. Fernandes, G.; Rodrigues, J.J.P.C.; Carvalho, L.F.; Al-Muhtadi, J.F.; Proença, M.L. A comprehensive survey on network anomaly detection. Telecommun. Syst. 2019, 70, 447–489.
3. Hilal, W.; Gadsden, S.A.; Yawney, J. Financial fraud: A review of anomaly detection techniques and recent advances. Expert Syst. Appl. 2022, 193, 116429.
4. Fernando, T.; Gammulle, H.; Denman, S.; Sridharan, S.; Fookes, C. Deep Learning for Medical Anomaly Detection – A Survey. ACM Comput. Surv. 2022, 54, 141.
5. Avola, D.; Cinque, L.; Di Mambro, A.; Diko, A.; Fagioli, A.; Foresti, G.L.; Marini, M.R.; Mecca, A.; Pannone, D. Low-Altitude Aerial Video Surveillance via One-Class SVM Anomaly Detection from Textural Features in UAV Images. Information 2021, 13, 2.
6. Hasan, M.; Choi, J.; Neumann, J.; Roy-Chowdhury, A.K.; Davis, L.S. Learning temporal regularity in video sequences. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 733–742.
7. Tian, Y.; Pang, G.; Chen, Y.; Singh, R.; Verjans, J.W.; Carneiro, G. Weakly-supervised video anomaly detection with robust temporal feature magnitude learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 4975–4986.
8. Carreira, J.; Zisserman, A. Quo vadis, action recognition? A new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6299–6308.
9. Chen, Y.; Liu, Z.; Zhang, B.; Fok, W.; Qi, X.; Wu, Y.C. Mgfn: Magnitude-contrastive glance-and-focus network for weakly-supervised video anomaly detection. Proc. AAAI Conf. Artif. Intell. 2023, 37, 387–395.
10. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475.
11. Wang, C.Y.; Yeh, I.H.; Mark Liao, H.Y. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. In Computer Vision–ECCV 2024; Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G., Eds.; Lecture Notes in Computer Science Series; Springer Nature: Cham, Switzerland, 2025; Volume 15089, pp. 1–21.
12. Liu, W.; Luo, W.; Lian, D.; Gao, S. Future frame prediction for anomaly detection–a new baseline. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6536–6545.
13. Sultani, W.; Chen, C.; Shah, M. Real-world anomaly detection in surveillance videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6479–6488.
14. Acsintoae, A.; Florescu, A.; Georgescu, M.I.; Mare, T.; Sumedrea, P.; Ionescu, R.T.; Khan, F.S.; Shah, M. Ubnormal: New benchmark for supervised open-set video anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 20143–20153.
15. Gong, D.; Liu, L.; Le, V.; Saha, B.; Mansour, M.R.; Venkatesh, S.; Hengel, A.v.d. Memorizing normality to detect anomaly: Memory-augmented deep autoencoder for unsupervised anomaly detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1705–1714.
16. Zaheer, M.Z.; Lee, J.H.; Astrid, M.; Lee, S.I. Old is gold: Redefining the adversarially learned one-class classifier training paradigm. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 14183–14193.
17. Pang, G.; Yan, C.; Shen, C.; Hengel, A.v.d.; Bai, X. Self-trained deep ordinal regression for end-to-end video anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 12173–12182.
18. Zhao, H.; Jia, J.; Koltun, V. Exploring self-attention for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10076–10085.
19. Luo, W.; Liu, W.; Gao, S. Remembering history with convolutional lstm for anomaly detection. In Proceedings of the 2017 IEEE International Conference on Multimedia and Expo (ICME), Hong Kong, China, 10–14 July 2017; pp. 439–444.
20. Nguyen, D.T.; Lou, Z.; Klar, M.; Brox, T. Anomaly detection with multiple-hypotheses predictions. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 4800–4809.
21. Nguyen, T.N.; Meunier, J. Anomaly detection in video sequence with appearance-motion correspondence. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1273–1283.
22. Lv, H.; Yue, Z.; Sun, Q.; Luo, B.; Cui, Z.; Zhang, H. Unbiased multiple instance learning for weakly supervised video anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 8022–8031.
23. Zhong, J.X.; Li, N.; Kong, W.; Liu, S.; Li, T.H.; Li, G. Graph convolutional label noise cleaner: Train a plug-and-play action classifier for anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 1237–1246.
24. Wan, B.; Fang, Y.; Xia, X.; Mei, J. Weakly supervised video anomaly detection via center-guided discriminative learning. In Proceedings of the 2020 IEEE International Conference on Multimedia and Expo (ICME), London, UK, 6–10 July 2020; pp. 1–6.
25. Feng, D.; Harakeh, A.; Waslander, S.L.; Dietmayer, K. A review and comparative study on probabilistic object detection in autonomous driving. IEEE Trans. Intell. Transp. Syst. 2021, 23, 9961–9980.
26. Yang, R.; Yu, Y. Artificial convolutional neural network in object detection and semantic segmentation for medical imaging analysis. Front. Oncol. 2021, 11, 638182.
27. Raghunandan, A.; Raghav, P.; Aradhya, H.R. Object detection algorithms for video surveillance applications. In Proceedings of the 2018 International Conference on Communication and Signal Processing (ICCSP), Chennai, India, 3–5 April 2018; pp. 0563–0568.
28. Diwan, T.; Anirudh, G.; Tembhurne, J.V. Object detection using YOLO: Challenges, architectural successors, datasets and applications. Multimed. Tools Appl. 2023, 82, 9243–9275.
29. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
30. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
31. Bolya, D.; Fu, C.Y.; Dai, X.; Zhang, P.; Hoffman, J. Hydra Attention: Efficient Attention with Many Heads. In Computer Vision–ECCV 2022 Workshops; Karlinsky, L., Michaeli, T., Nishino, K., Eds.; Lecture Notes in Computer Science Series; Springer Nature: Cham, Switzerland, 2023; Volume 13807, pp. 35–49.
32. Luo, W.; Liu, W.; Gao, S. A revisit of sparse coding based anomaly detection in stacked rnn framework. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 341–349.
33. Park, H.; Noh, J.; Ham, B. Learning memory-guided normality for anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 14372–14381.
34. Yu, G.; Wang, S.; Cai, Z.; Zhu, E.; Xu, C.; Yin, J.; Kloft, M. Cloze Test Helps: Effective Video Anomaly Detection via Learning to Complete Video Events. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 583–591.
35. Zhang, J.; Qing, L.; Miao, J. Temporal convolutional network with complementary inner bag loss for weakly supervised anomaly detection. In Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 22–25 September 2019; pp. 4030–4034.
36. Wang, J.; Cherian, A. Gods: Generalized one-class discriminative subspaces for anomaly detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8201–8211.
37. Zaheer, M.Z.; Mahmood, A.; Khan, M.H.; Segu, M.; Yu, F.; Lee, S.I. Generative cooperative learning for unsupervised video anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 14744–14754.
38. Wu, P.; Liu, J.; Shi, Y.; Sun, Y.; Shao, F.; Wu, Z.; Yang, Z. Not only Look, But Also Listen: Learning Multimodal Violence Detection Under Weak Supervision. In Computer Vision–ECCV 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M., Eds.; Lecture Notes in Computer Science Series; Springer International Publishing: Cham, Switzerland, 2020; Volume 12375, pp. 322–339.
39. Doshi, K.; Yilmaz, Y. Continual learning for anomaly detection in surveillance videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 13–19 June 2020; pp. 254–255.




AUC (%) comparison with existing methods on ShanghaiTech.

| Supervision | Method | Features | AUC (%) |
|---|---|---|---|
| Unsupervised | Conv-AE [6] | - | 60.85 |
| Unsupervised | Stacked-RNN [32] | - | 68.00 |
| Unsupervised | Frame-Pred [12] | - | 73.40 |
| Unsupervised | Mem-AE [15] | - | 71.20 |
| Unsupervised | MNAD [33] | - | 70.50 |
| Unsupervised | VEC [34] | - | 74.80 |
| Weakly Supervised | GCN-Anomaly [23] | C3D RGB | 76.44 |
| Weakly Supervised | Zhang et al. [35] | I3D RGB | 82.50 |
| Weakly Supervised | GCN-Anomaly [23] | TSN Flow | 84.13 |
| Weakly Supervised | GCN-Anomaly [23] | TSN RGB | 84.44 |
| Weakly Supervised | AR-Net [24] | I3D Flow | 82.32 |
| Weakly Supervised | AR-Net [24] | I3D RGB | 85.38 |
| Weakly Supervised | AR-Net [24] | I3D RGB & I3D Flow | 91.24 |
| Weakly Supervised | RTFM [7] | I3D RGB | 97.21 |
| Weakly Supervised | RTFM * [7] | I3D RGB | 95.82 |
| Weakly Supervised | Ours | I3D RGB & YOLOv7 | 96.51 |
| Weakly Supervised | Ours | I3D RGB & YOLOv9 | |

AUC (%) comparison with existing methods on UCF-Crime.

| Supervision | Method | Features | AUC (%) |
|---|---|---|---|
| Unsupervised | SVM Baseline | - | 50.00 |
| Unsupervised | Conv-AE [6] | - | 50.60 |
| Unsupervised | BODS [36] | I3D RGB | 68.26 |
| Unsupervised | GODS [36] | I3D RGB | 70.46 |
| Unsupervised | Zaheer et al. [37] | ResNext | 71.04 |
| Weakly Supervised | Sultani et al. [13] | C3D RGB | 75.41 |
| Weakly Supervised | Sultani et al. [13] | I3D RGB | 77.92 |
| Weakly Supervised | Zhang et al. [35] | C3D RGB | 78.66 |
| Weakly Supervised | GCN-Anomaly [23] | C3D RGB | 81.08 |
| Weakly Supervised | GCN-Anomaly [23] | TSN Flow | 78.08 |
| Weakly Supervised | GCN-Anomaly [23] | TSN RGB | 82.12 |
| Weakly Supervised | Wu et al. [38] | I3D RGB | 82.44 |
| Weakly Supervised | RTFM [7] | I3D RGB | 84.30 |
| Weakly Supervised | RTFM * [7] | I3D RGB | 83.03 |
| Weakly Supervised | Ours | I3D RGB & YOLOv7 | 83.81 |
| Weakly Supervised | Ours | I3D RGB & YOLOv9 | |

Ablation: AUC (%) by feature format and pre-processing module.

| Dataset | Pre-Processing | FeatVec [39] AUC (%) | Proposed Format AUC (%) |
|---|---|---|---|
| ShanghaiTech | MTN | 94.73 | 95.16 |
| ShanghaiTech | SE Layer | 96.07 | 97.45 |
| ShanghaiTech | Hydra Attn | 96.15 | 95.93 |
| UCF-Crime | MTN | 82.89 | 83.18 |
| UCF-Crime | SE Layer | 83.49 | 83.50 |
| UCF-Crime | Hydra Attn | 82.73 | 84.42 |

Ablation: AUC (%) by object detector and pre-processing module.

| Dataset | Pre-Processing | YOLOv7 AUC (%) | YOLOv9 AUC (%) |
|---|---|---|---|
| ShanghaiTech | SE Layer | 96.51 | 97.45 |
| ShanghaiTech | Hydra Attn | 96.28 | 95.93 |
| UCF-Crime | SE Layer | 83.81 | 83.50 |
| UCF-Crime | Hydra Attn | 83.06 | 84.42 |
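
The ablation tables compare an SE layer [30] and Hydra attention [31] as pre-processing modules. The sketch below shows the core arithmetic of both on a feature matrix of shape (T, D), written with NumPy for readability. It is an illustration under assumptions rather than the configuration used in the experiments: squeezing over the temporal axis, the weight shapes, and the function names `se_gate` and `hydra_attention` are all hypothetical choices, and learned projections for the queries, keys, and values are omitted.

```python
import numpy as np


def se_gate(x, w1, w2):
    """Squeeze-and-excitation style channel gating for features x of shape (T, D).

    w1 has shape (D, H) and w2 has shape (H, D), with H a bottleneck width.
    Squeezing over the T axis is an assumption made for sequence features.
    """
    squeezed = x.mean(axis=0)                     # squeeze: (D,)
    hidden = np.maximum(0.0, squeezed @ w1)       # bottleneck + ReLU: (H,)
    gate = 1.0 / (1.0 + np.exp(-(hidden @ w2)))   # sigmoid gate per channel: (D,)
    return x * gate                               # rescale every snippet


def hydra_attention(q, k, v, eps=1e-12):
    """Hydra attention with a cosine-similarity kernel; q, k, v have shape (T, D).

    Each token's key gates its value, the products are summed into one global
    vector, and each query then gates that vector, so the cost is linear in T.
    """
    q = q / (np.linalg.norm(q, axis=-1, keepdims=True) + eps)
    k = k / (np.linalg.norm(k, axis=-1, keepdims=True) + eps)
    kv = (k * v).sum(axis=0, keepdims=True)       # global summary: (1, D)
    return q * kv                                 # broadcast back to (T, D)
```

Both functions return an array with the same shape as their input, so either can be placed in front of the concatenation with I3D features without changing the fused dimensionality.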
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Wang, Y.; Chen, Y.; Yeo, C.K. Enhancing Weakly Supervised Video Anomaly Detection with Object-Centric Features. Information 2025, 16, 1042. https://doi.org/10.3390/info16121042

