Proceeding Paper

TransLowNet: An Online Framework for Video Anomaly Detection, Classification, and Localization †

by Jonathan Flores-Monroy 1, Gibran Benitez-Garcia 2, Mariko Nakano-Miyatake 1,*, Hector Perez-Meana 1 and Hiroki Takahashi 2,3,4

1 Instituto Politecnico Nacional, ESIME Culhuacán, Mexico City 04440, Mexico
2 Graduate School of Informatics and Engineering, The University of Electro-Communications, Chofugaoka 1-5-1, Chofu-shi 182-8585, Tokyo, Japan
3 Artificial Intelligence eXploration Research Center (AIX), The University of Electro-Communications, Chofugaoka 1-5-1, Chofu-shi 182-8585, Tokyo, Japan
4 Meta-Networking Research Center (MEET), The University of Electro-Communications, Chofugaoka 1-5-1, Chofu-shi 182-8585, Tokyo, Japan
* Author to whom correspondence should be addressed.
Presented at the First Summer School on Artificial Intelligence in Cybersecurity, Cancun, Mexico, 3–7 November 2025.
Eng. Proc. 2026, 123(1), 28; https://doi.org/10.3390/engproc2026123028
Published: 9 February 2026
(This article belongs to the Proceedings of First Summer School on Artificial Intelligence in Cybersecurity)

Abstract

This work presents TransLowNet, an online framework for video anomaly detection, classification, and spatial localization. The system segments incoming video streams into clips processed by an X3D-S feature extractor to obtain spatio-temporal representations, which are analyzed by dedicated modules for anomaly detection and recognition, while a MoG2-based stage estimates the spatial regions of anomalous activity. Evaluated on the UCF-Crime dataset, TransLowNet achieved 80.0% AUC, 54.5% accuracy, and 20.3% mAP@0.5, offering an efficient and interpretable approach for continuous video surveillance.

1. Introduction

Video Anomaly Detection (VAD) is a key field in computer vision with wide applications in intelligent video surveillance. Its main goal is to identify atypical events within video sequences. However, most existing approaches are limited to binary detection (determining only whether an anomaly is present) without recognizing the anomaly type or its spatial location, which substantially reduces their practical utility. Moreover, many methods operate under an offline scheme, where analysis is performed only after the entire video has been processed. To enable more efficient and practically deployable systems, anomaly analysis should instead be conducted in an online manner, that is, while the video stream is being received, jointly integrating detection, classification, and localization within a unified framework.
In this work, we present TransLowNet, an online modular framework that unifies these three tasks. The framework begins by segmenting video streams into clips, each composed of N frames sampled at temporal intervals of G frames. Each clip undergoes a preprocessing step consisting of a central crop followed by a resize to H × W pixels, ensuring spatial consistency across clips. Once preprocessed, the clips are processed by an Expanded 3D ConvNet (X3D-S) [1], which extracts compact spatio-temporal representations of each clip. These representations are then evaluated by an anomaly detection module [2] that determines whether the observed patterns correspond to normal or abnormal events. When an anomaly is detected, the extracted features trigger two independent processes: (i) a multiclass classifier [3], which assigns the detected action to one of the predefined anomaly categories, and (ii) a spatial localization module based on Mixture of Gaussians (MoG2) [4], which operates at the pixel level to highlight the regions associated with the anomalous behavior.
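The clip segmentation and preprocessing described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the function names are our own, and the concrete values (N frames per clip, stride G, target resolution) are taken from Section 3 of this paper; a real pipeline would use cv2.resize rather than the nearest-neighbour indexing shown here.

```python
import numpy as np

# Values reported in Section 3: N = 13 frames per clip, temporal stride G = 6,
# target spatial resolution H x W = 182 x 182. Function names are hypothetical.
N, G, H = 13, 6, 182

def center_crop_resize(frame, size):
    """Center-crop a frame to a square, then resize to (size, size) by
    nearest-neighbour index sampling (a stand-in for cv2.resize)."""
    h, w = frame.shape[:2]
    s = min(h, w)
    y0, x0 = (h - s) // 2, (w - s) // 2
    crop = frame[y0:y0 + s, x0:x0 + s]
    idx = np.arange(size) * s // size            # nearest source row/col per pixel
    return crop[np.ix_(idx, idx)]

def clips_from_stream(frames, n=N, g=G):
    """Group an incoming frame sequence into non-overlapping clips of n frames
    sampled every g frames, yielding arrays of shape (n, size, size, C)."""
    sampled = frames[::g]
    for i in range(0, len(sampled) - n + 1, n):
        yield np.stack([center_crop_resize(f, H) for f in sampled[i:i + n]])

# Example: 200 dummy 240x320 RGB frames yield two 13-frame clips.
frames = [np.zeros((240, 320, 3), dtype=np.uint8) for _ in range(200)]
clips = list(clips_from_stream(frames))
print(len(clips), clips[0].shape)   # 2 (13, 182, 182, 3)
```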
TransLowNet was evaluated on the UCF-Crime dataset [5], achieving competitive results in detection, classification, and localization compared with state-of-the-art methods, while maintaining the advantage of being fully online and modular.

2. Methodology

Figure 1 provides an overview of the proposed framework, TransLowNet, built upon the previous works of Flores et al. [3,6], where the Anomaly Detection Module and the Anomaly Classification Module were independently trained. In this work, the same training configuration is preserved, while the inference stage is extended by incorporating an additional Anomaly Localization Module to estimate the active regions where the anomalous event occurs.
During inference, each processed clip x ∈ ℝ^{H×W×C_h}, where C_h denotes the number of channels, is fed into the Feature Extraction Module (X3D-S) [1], following the scheme described in [3,6]. This module produces a compact spatio-temporal feature representation f ∈ ℝ^{1×D}, where D is the dimensionality of the feature vector. The resulting representations are then evaluated by the Anomaly Detection Module [2], adapted in [6] to operate in an online mode. The detector assigns to each clip an anomaly score that quantifies its deviation from normal scene behavior. This score is compared against a threshold (thr) set to the 90th percentile of the ROC-derived AUC distribution, following the criterion established in [6]. If the score exceeds thr, the clip is considered anomalous and is simultaneously forwarded to the Anomaly Classification Module [3] and to the proposed Anomaly Localization Module.
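The clip-level decision logic can be sketched as follows. This is a simplified illustration: the paper derives thr from the 90th-percentile criterion of [6], whereas here a synthetic score distribution stands in for the calibration scores, and the routing function is hypothetical.

```python
import numpy as np

# Stand-in for a held-out distribution of anomaly scores used to calibrate
# the threshold; the real criterion follows [6].
rng = np.random.default_rng(0)
calibration_scores = rng.beta(2, 5, size=1000)
thr = np.percentile(calibration_scores, 90)      # 90th-percentile threshold

def route_clip(anomaly_score):
    """Decide which downstream modules a clip is forwarded to.

    A score above thr marks the clip as anomalous, triggering both the
    classification and localization modules; otherwise the clip is handled
    by the detector alone."""
    if anomaly_score > thr:
        return ("classification", "localization")
    return ()
```

Because scores are compared clip by clip as they arrive, this decision requires no future frames, which is what makes the framework online.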
The Anomaly Localization Module relies on the observation that anomalous events often produce abrupt variations in visual dynamics or motion magnitude relative to the background. Following this principle, the Mixture of Gaussians (MoG2) method [4] is employed to adaptively model the background using Gaussian distributions updated over time. This process isolates the regions exhibiting unusual motion, generating a binary mask that is subsequently refined to remove noise and merge nearby regions. Finally, a coherent region representing the area of highest anomalous activity is obtained and expressed as a bounding box that constitutes the final spatial localization estimate of the event.
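A minimal sketch of this localization idea is shown below. Note the simplification: MoG2 [4] maintains a mixture of Gaussians per pixel (OpenCV's createBackgroundSubtractorMOG2 in practice), whereas this sketch keeps a single Gaussian per pixel; the learning rate, deviation factor, and initial variance are illustrative, not values from the paper.

```python
import numpy as np

class GaussianBackground:
    """Single-Gaussian-per-pixel background model: a simplified stand-in for
    MoG2 [4]. Pixels deviating by more than k standard deviations from the
    running mean are flagged as foreground; background pixels update the model."""

    def __init__(self, shape, lr=0.05, k=3.0):
        self.mean = np.zeros(shape)
        self.var = np.full(shape, 15.0 ** 2)     # illustrative initial variance
        self.lr, self.k = lr, k

    def apply(self, frame):
        frame = frame.astype(float)
        fg = np.abs(frame - self.mean) > self.k * np.sqrt(self.var)
        upd = ~fg                                # update only matched pixels
        self.mean[upd] += self.lr * (frame - self.mean)[upd]
        self.var[upd] += self.lr * ((frame - self.mean) ** 2 - self.var)[upd]
        return fg

def bounding_box(mask):
    """Bounding box (x, y, w, h) enclosing all foreground pixels, or None."""
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        return None
    return (int(xs.min()), int(ys.min()),
            int(xs.max() - xs.min() + 1), int(ys.max() - ys.min() + 1))

# Warm up on a static dark background, then present a frame with a bright patch.
model = GaussianBackground((182, 182))
for _ in range(50):
    model.apply(np.zeros((182, 182)))
frame = np.zeros((182, 182))
frame[60:120, 40:100] = 255.0
print(bounding_box(model.apply(frame)))          # (40, 60, 60, 60)
```

The refinement step of the paper (noise removal and merging of nearby regions) would correspond to morphological opening and closing on the mask before the bounding box is extracted.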

3. Results

The proposed TransLowNet framework was evaluated on the UCF-Crime dataset [5], which consists of 1900 untrimmed videos divided into 1610 for training (810 anomalous and 800 normal) and 290 for testing (140 anomalous and 150 normal), covering 13 anomaly categories under the standard evaluation protocol. Following the current state-of-the-art practice [2,3,5,6,7,8,9], four metrics were used to assess the performance of the system, each corresponding to a specific task. The Area Under the ROC Curve (AUC) was employed to measure frame-level anomaly detection performance, while Accuracy (ACC) was used to evaluate video-level multi-class classification. For spatial localization, the mean Average Precision at IoU 0.5 (mAP@0.5) was computed using the ground-truth bounding boxes provided in [9]. Finally, GFLOPs were reported to quantify the computational complexity of the model, providing an additional measure of efficiency and scalability for online environments. The detection and classification modules preserved the same training configuration, hyperparameters, and loss functions reported in Flores et al. [3,6], ensuring consistency and reproducibility across experiments. Following those works, each video sequence was divided into clips composed of N = 13 frames sampled at a temporal interval of G = 6. Each frame was center-cropped and resized to a spatial resolution of H × W = 182 × 182 pixels with C_h = 3 RGB channels. The resulting feature dimensionality was D = 192.
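The localization criterion underlying mAP@0.5 can be made concrete with a short sketch: a predicted box counts as a true positive when its intersection-over-union (IoU) with the ground-truth box from [9] reaches at least 0.5. The boxes and values below are illustrative, not taken from the dataset.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x, y, w, h)."""
    ax1, ay1, ax2, ay2 = a[0], a[1], a[0] + a[2], a[1] + a[3]
    bx1, by1, bx2, by2 = b[0], b[1], b[0] + b[2], b[1] + b[3]
    iw = max(0, min(ax2, bx2) - max(ax1, bx1))   # overlap width
    ih = max(0, min(ay2, by2) - max(ay1, by1))   # overlap height
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union else 0.0

# Two 60x60 boxes offset by (10, 10): IoU = 2500 / 4700 ≈ 0.532 >= 0.5,
# so this prediction would count as a true positive at the 0.5 threshold.
pred, gt = (40, 60, 60, 60), (50, 70, 60, 60)
print(round(iou(pred, gt), 3))   # 0.532
```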
Table 1 summarizes the performance comparison between TransLowNet and several state-of-the-art methods. The proposed framework achieved an AUC of 80.00%, a classification accuracy (ACC) of 54.48%, and a spatial localization performance of 20.32% mAP@0.5, with a computational cost of only 2.844 GFLOPs. These results demonstrate a competitive balance between accuracy and efficiency, highlighting that TransLowNet operates in an online mode, processing video streams sequentially as they are received. This property contrasts with offline methods such as Sultani et al. [5] and Tan et al. [7], which require complete video sequences before identifying anomalies, limiting their applicability in continuous surveillance scenarios that demand immediate analysis. Compared with other approaches, Gao et al. [8] and Al-Lahham et al. [2] achieve strong detection accuracy but lack classification or localization capabilities, restricting them to binary detection tasks. Conversely, Flores et al. [3] report the highest classification accuracy (58.96%) but employ a heavier architecture (28.873 GFLOPs) and do not include a localization component, preventing spatial identification of anomalous regions. In contrast, TransLowNet integrates detection, classification, and localization into a unified, modular, and efficient framework, capable of operating online with a computational cost nearly ten times lower than most prior models. Although the mAP@0.5 value (20.32%) indicates that spatial anomaly localization remains an open challenge (mainly due to high visual variability and low contrast between normal and abnormal patterns), the inclusion of this module represents a meaningful step toward a more comprehensive understanding of anomalous behavior. Overall, TransLowNet maintains competitive detection and classification performance while extending the functional scope of surveillance systems by providing additional spatial interpretability with minimal computational overhead.

4. Conclusions

This work presented TransLowNet, an online modular framework that jointly integrates video anomaly detection, classification, and spatial localization. The results obtained on the UCF-Crime dataset demonstrate a competitive balance between accuracy, interpretability, and computational efficiency. The modular design of the system enables flexible and scalable deployment in continuous surveillance environments. Although the experimental validation was limited to the UCF-Crime dataset, future work will include evaluations on additional benchmarks, as well as per-class and temporal stability analyses to further strengthen the robustness and generalization of the proposed model.

Author Contributions

Conceptualization, J.F.-M., G.B.-G. and M.N.-M.; methodology, J.F.-M. and M.N.-M.; software, H.P.-M.; validation, G.B.-G. and M.N.-M.; formal analysis, G.B.-G.; investigation, J.F.-M.; data curation, H.P.-M.; writing—original draft preparation, J.F.-M.; writing—review and editing, J.F.-M., G.B.-G. and M.N.-M.; visualization, J.F.-M.; supervision, M.N.-M. and H.T.; project administration, H.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Feichtenhofer, C. X3D: Expanding architectures for efficient video recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 203–213. [Google Scholar]
  2. Al-Lahham, A.; Tastan, N.; Zaheer, M.Z.; Nandakumar, K. A coarse-to-fine pseudo-labeling (c2fpl) framework for unsupervised video anomaly detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024; pp. 6793–6802. [Google Scholar]
  3. Flores-Monroy, J.; Benitez-Garcia, G.; Nakano-Miyatake, M.; Takahashi, H. An online modular framework for anomaly detection and multiclass classification in video surveillance. Appl. Sci. 2025, 15, 9249. [Google Scholar] [CrossRef]
  4. Zivkovic, Z. Improved adaptive Gaussian mixture model for background subtraction. In Proceedings of the 17th International Conference on Pattern Recognition (ICPR), Cambridge, UK, 23–26 August 2004; Volume 2, pp. 28–31. [Google Scholar]
  5. Sultani, W.; Chen, C.; Shah, M. Real-world anomaly detection in surveillance videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6479–6488. [Google Scholar]
  6. Flores-Monroy, J.; Benitez-Garcia, G.; Nakano-Miyatake, M.; Takahashi, H. Detection and Classification of Abnormal Human Actions for Video Surveillance on Edge Devices. In Proceedings of the Forty-Third IEEE Convention of Central America and Panama (CONCAPAN 2025), CAPANA Council (Central America and Panama), San Salvador, El Salvador, 26–28 November 2025. [Google Scholar]
  7. Tan, W.; Yao, Q.; Liu, J. Overlooked video classification in weakly supervised video anomaly detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 4–8 January 2024; pp. 202–210. [Google Scholar]
  8. Gao, A. STEAD: Spatio-Temporal Efficient Anomaly Detection for Time and Compute Sensitive Applications. Master’s Thesis, San Jose State University, San Jose, CA, USA, 2025. [Google Scholar]
  9. Liu, K.; Ma, H. Exploring background-bias for anomaly detection in surveillance videos. In Proceedings of the 27th ACM International Conference on Multimedia (ACM MM), Nice, France, 21–25 October 2019; pp. 1490–1499. [Google Scholar]
Figure 1. Overview of the proposed modular framework for clip-level anomaly detection, classification, and spatial localization. The input video stream is segmented into consecutive clips, which are first processed by the feature extractor to obtain compact spatio-temporal representations. These representations are then evaluated by the anomaly detection module; if an anomaly is detected, the features are forwarded to the anomaly classification module to assign the event to a specific category, while the corresponding clips are simultaneously processed by the MoG2-based localization module to highlight the active regions associated with the anomalous action. The purple dashed arrow indicates the propagation of clip-level feature vectors extracted by the feature extractor toward the anomaly classifier module (purple box), whereas the pink dashed path denotes the direct use of raw RGB clips for spatial localization without additional feature extraction.
Table 1. Comparison of performance between state-of-the-art methods and the proposed framework on UCF-Crime.

Method                  GFLOPs    AUC (%)   ACC (%)   mAP@0.5 (%)
Sultani et al. [5]      ∼61.638   75.41     23.00     —
Tan et al. [7]          ∼38.950   82.69     28.40     —
Gao et al. [8]          ∼26.503   91.34     —         —
Flores et al. [3]       28.873    82.27     58.96     —
Al-Lahham et al. [2]    ∼17.824   83.40     —         —
Ours (TransLowNet)      2.844     80.00     54.48     20.32

Note: The last row reports the results obtained by the proposed method (TransLowNet).
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
