Article

Enhancing Weakly Supervised Video Anomaly Detection with Object-Centric Features

College of Computing and Data Science, Nanyang Technological University, Singapore 639768, Singapore
* Author to whom correspondence should be addressed.
Information 2025, 16(12), 1042; https://doi.org/10.3390/info16121042
Submission received: 23 October 2025 / Revised: 14 November 2025 / Accepted: 28 November 2025 / Published: 30 November 2025
(This article belongs to the Special Issue Computer Vision for Security Applications, 2nd Edition)

Abstract

Surveillance cameras are extensively deployed across public and private environments, driving the need for intelligent video monitoring systems. However, a major challenge arises in Weakly Supervised Video Anomaly Detection (WSVAD), where supervision is limited to video-level labels, making snippet-level anomaly localisation particularly difficult. This challenge is often formulated as a Multiple Instance Learning (MIL) problem. Although recent approaches have achieved encouraging results by modelling spatio-temporal dynamics, they often overlook the semantic information within videos that could further enhance anomaly detection. To bridge this gap, we propose enriching feature representations by applying object detection techniques to extract object-centric features. These features provide supplementary high-level semantic information that supports the discrimination of anomalous events. Experiments conducted on two benchmark datasets, UCF-Crime and ShanghaiTech, demonstrate that our approach achieves performance comparable to state-of-the-art (SOTA) methods. The results highlight that incorporating object-level semantics offers a promising direction for improving WSVAD, underscoring the potential of semantic-aware approaches for more effective anomaly detection.

1. Introduction

Anomaly detection (AD) is a fundamental research topic that focuses on identifying data samples that deviate from normal behavioural patterns or expected distributions [1]. Driven by growing demand across diverse domains, anomaly detection plays an important role in applications such as cybersecurity [2], financial surveillance [3], healthcare and medicine [4], and UAV-based surveillance and monitoring [5]. One of its most critical applications lies in video surveillance, where it can be used to identify irregular or hazardous events in surveillance footage, e.g., robbery, shooting, and shoplifting, to prevent social harm and ensure public safety.
Nevertheless, detecting anomalies in long and complex surveillance videos remains extremely challenging [6]: obtaining fine-grained annotations incurs high labour costs, suffers from low precision, and is prone to inconsistent quality. To address these issues, weakly supervised video anomaly detection (WSVAD) has been proposed as a promising alternative.
A representative example of WSVAD is the Robust Temporal Feature Magnitude (RTFM) framework proposed by Tian et al. [7]. It employs pre-trained Inflated 3D ConvNet (I3D) features [8] and learns feature magnitudes from the top-k instances. Chen et al. [9] further advanced this concept by designing a Magnitude-Contrastive Glance-and-Focus Network (MGFN) to better capture spatial–temporal dependencies. However, a notable limitation is that feature magnitude is affected not only by abnormality but also by other factors such as object motion and object count. Consequently, simply pushing normal features away in the feature space may not accurately reflect anomaly boundaries. This observation underscores the importance of incorporating object-level information into the network, as object dynamics and interactions convey rich video semantics that are crucial for improving anomaly detection performance.
In view of the above challenge, some previous studies leveraged optical flow features to capture motion information, which can indirectly reflect object-level dynamics. However, such approaches typically yield suboptimal results [1] and double the feature dimensionality, thereby complicating the training process.
To overcome these limitations, modern object detectors such as YOLO [10,11] are adopted as a practical and efficient solution for generating supplementary object-level features. Unlike pre-computed video features, detection outputs inherently encode additional rich semantic cues, including object categories, motion patterns, and object count, which are highly informative for anomaly understanding. Instead of designing and training additional networks to capture object-related contextual information, mature object detectors are applied directly to the original videos to generate the supplementary object-centric representations. Compared with optical flow, object detection offers lower feature dimensionality and achieves superior detection performance. To overcome the issue that YOLO’s native outputs, which consist of discrete class IDs and bounding box coordinates, are not ideal for direct network training, we instead extract the class probability distributions from the penultimate layer. These distributions are then fused with the corresponding positional information to form a compact but semantically informative feature representation. These object-level features are subsequently fused with pre-trained video representations obtained from the Inflated 3D ConvNet (I3D) model [8], yielding richer contextual cues while effectively reducing detection errors.
The contributions in this work are summarised as follows:
  • The base model is combined with supplementary object detection features, enabling the training network to effectively and efficiently incorporate object-level information such as object motion and count. Furthermore, the approach can be easily adapted to any existing object detector and holds significant potential for future applications;
  • A new feature format is designed to adapt the raw YOLO outputs to the proposed framework, ensuring they can be effectively concatenated with the existing pre-trained I3D features;
  • Attention-based mechanisms have been incorporated to capture the contextual information and enhance the quality of the extracted object features, resulting in improved performance compared with previous methods;
  • Experiments have been conducted on two benchmark datasets: ShanghaiTech [12] and UCF-Crime [13]. The proposed method achieves better results on both datasets than the baseline model, offering a promising alternative for tackling weakly supervised video anomaly detection tasks that previous methods have not fully explored.
This paper is structured as follows. Section 2 reviews recent research in both video anomaly detection and object detection. Section 3 presents the structure and implementation details of the proposed method. Section 4 introduces the dataset preparation and experimental setup, followed by an analysis of the results. Finally, Section 5 concludes the paper by summarising the key ideas and contributions.

2. Related Works

2.1. Video Anomaly Detection

Supervised anomaly detection [14] generally achieves the best performance across different categories, as it benefits from precise frame-level annotations that provide strong supervision during training. However, it is less commonly studied for complex or large-scale datasets due to the high cost of detailed annotation and the susceptibility to human error, which makes it difficult to ensure high-quality labels. In contrast, unsupervised anomaly detection methods [15,16,17,18,19,20,21] eliminate the need for expensive manual annotation but often perform suboptimally because they lack clearly defined anomalies and ground-truth supervision for training.
Between these two extremes, weakly supervised learning offers a balanced trade-off between accuracy and annotation cost. By relying solely on coarse video-level labels rather than detailed frame-level annotations, weakly supervised methods significantly reduce the burden of manual labelling while still retaining enough supervisory signal for effective model training. The mainstream WSVAD approaches are primarily based on the MIL framework [13], in which a video (or a set of snippets) is treated as a “bag” containing multiple “instances” (snippets or frames), and instance-level anomaly patterns are learned using only bag-level video annotations. The objective is to assign higher anomaly scores to abnormal snippets within abnormal videos, while reducing the scores of normal snippets [22]. Sultani et al. [13] initiated this direction by introducing the large-scale UCF-Crime dataset and a MIL-based framework using video-level labels to localise abnormal segments, sparking significant research interest. However, this MIL-based formulation introduces label noise, as normal snippets in an abnormal video may be incorrectly treated as highly abnormal, which can degrade the learning process. To mitigate this noise, Zhong et al. [23] adopted a graph convolutional neural network (GCN) and formulated the problem as a supervised learning task under noisy labels, aiming to reduce the influence of incorrectly labelled snippets. While the GCN can explicitly model relationships among snippets, joint training with the MIL framework is computationally costly and can lead to an unconstrained latent space, resulting in unstable performance. Building upon these foundations, Wan et al. [24] introduced the Anomaly Regression Network (AR-Net) to learn more discriminative features and improve anomaly detection accuracy to a practical level. Tian et al. [7] further advanced the field with the Robust Temporal Feature Magnitude (RTFM) network, which leverages Inflated 3D ConvNets [8] for feature extraction and employs a temporal network to model spatio-temporal dependencies, along with a tailored loss function that encourages separation between normal and abnormal patterns. Following the RTFM structure, Chen et al. [9] observed that a loss function which simply encourages abnormal features to have larger magnitudes than normal ones is fundamentally unreasonable, as this assumption contradicts the inherent magnitude distribution across different videos. To address this, a Magnitude-Contrastive Glance-and-Focus Network (MGFN) was proposed, which introduces a magnitude-contrastive loss and performs a global scan over the video while selectively focusing on specific portions for finer analysis, thereby achieving performance improvements and demonstrating the effectiveness of integrating global and local feature reasoning.
Although these approaches have achieved effective progress, most rely on pre-computed features extracted from models such as I3D [8] or C3D. These features, while effective for general video representation, often lack semantic richness and fail to capture object-level context that is essential for understanding anomalies. To address this limitation, recent progress in object detection offers a promising solution. Modern object detectors achieve high accuracy and real-time performance, making them practical for extracting supplementary features that provide richer semantic information. Building on these strengths, we adopt a weakly supervised learning approach as a balanced compromise between annotation cost and detection performance, enhanced with object-level features to better capture contextual information and improve network learning.

2.2. Object Detection

Object detection aims to develop computational models capable of identifying what objects are present and where they are located within an image or video frame, forming a foundation for many computer vision applications [25,26], including surveillance systems [27]. Among existing approaches, the YOLO family represents the mainstream in real-time object detection [28].
The original YOLO was introduced by Redmon et al. in 2015 [29] as the first one-stage detector in the deep learning era, achieving a significant improvement in detection speed. Since then, the YOLO series has evolved rapidly, with each iteration addressing prior limitations and enhancing detection performance. YOLOv7 [10], developed by the same authors as YOLOv4 and YOLOR, introduced a trainable bag-of-freebies method that allowed the model to outperform all detectors known at the time in both speed and accuracy across a wide range of frame rates (5–160 FPS). Owing to its superior detection accuracy and fast inference speed, YOLOv7 is adopted as one of the object detectors in this study. YOLOv9 [11] further improved upon earlier versions by addressing the issue of information loss during feature extraction and spatial transformation. It introduced Programmable Gradient Information (PGI) for more reliable gradient propagation and a lightweight Generalised Efficient Layer Aggregation Network (GELAN) architecture that improves parameter utilisation using standard convolution operations. As YOLOv9 builds on YOLOv7 and was developed by the same authors, jointly evaluating them provides a clearer assessment of their relative performance.
Since object detectors inherently capture motion cues, they can provide valuable supplementary information for anomaly detection, as sudden or unusual movements often signal anomalous events. However, traditional object detectors primarily output bounding boxes, which alone are insufficient as high-level features for WSVAD. Hence, a key research gap remains in transforming these detection outputs into rich, semantically meaningful representations suitable for anomaly analysis. This work aims to bridge that gap by developing strategies to extract and integrate object-centric features that enhance WSVAD.

3. Proposed Methods

Figure 1 shows the architectural design of the proposed framework, which equips the network with the capacity to adequately incorporate object information and effectively exploit these features. This integration is crucial, as it significantly impacts the anomaly scores.
Given a weakly labelled video set V, where each video is associated with a binary label $y \in \{0, 1\}$ indicating normal or abnormal, the video is first divided into T snippets, which are then independently processed by a feature extractor and an object detector. The feature extractor produces representations denoted as $F_{i3d}$, while the object detector produces YOLO label data, which is then used for feature generation to produce $F_{yolo}$. Both $F_{i3d}$ and $F_{yolo}$ are structured as $\mathbb{R}^{T \times C \times D}$, where T is the number of temporal snippets per video, C is the number of spatial crops per snippet, and D denotes the feature dimension.
Following feature adaptation and concatenation, the resulting representations are fed into a classifier that produces anomaly scores for each snippet. These scores are then passed through a Top-K magnitude selection mechanism, which identifies the top-k snippets based on their highest feature magnitudes. This mechanism ensures that the most informative snippets with the highest anomaly potential are prioritised for further analysis. Network optimisation is driven by the feature magnitude learning of the selected Top-K snippets, along with a binary cross-entropy loss, enabling the model to effectively maximise the separability between the normal and abnormal videos.
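To make the Top-K magnitude selection concrete, the following is a minimal PyTorch sketch of selecting the k snippets with the largest feature magnitude and gathering their scores; the function name, tensor shapes, and the random inputs are illustrative assumptions rather than the authors' implementation.

```python
import torch

def select_topk_by_magnitude(features: torch.Tensor, scores: torch.Tensor, k: int = 3):
    """Select the k snippets with the largest feature magnitude (l2-norm).

    features: (T, D) snippet features of one video.
    scores:   (T,)   per-snippet anomaly scores from the classifier.
    Returns the top-k feature magnitudes and the corresponding scores.
    """
    magnitudes = features.norm(p=2, dim=-1)            # (T,) l2-norm per snippet
    topk_mag, idx = torch.topk(magnitudes, k, dim=-1)  # k largest magnitudes
    topk_scores = scores[idx]                          # scores of the same snippets
    return topk_mag, topk_scores

# Example: 32 snippets with fused 2048 + 1280 dimensional features
feats = torch.randn(32, 2048 + 1280)
scores = torch.sigmoid(torch.randn(32))
mag, sc = select_topk_by_magnitude(feats, scores, k=3)
```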

3.1. Generation of Additional Features

Object detectors are employed to generate the feature representation $F_{yolo}$. The raw detection output produced by YOLO includes object information such as class IDs, bounding box coordinates, and bounding box dimensions (width & height). To ensure the features match the desired dimension $F_{yolo} \in \mathbb{R}^{T \times C \times D}$ for subsequent concatenation, a conversion strategy is necessary. Specifically, to match the number of snippets T and ensure consistency with the shape of pre-trained I3D features, object detection is performed on the first frame of every K-frame snippet, which is treated as the key frame. Given the high frames per second (fps) of surveillance videos, the temporal differences between adjacent frames are minimal. Therefore, using only the first frame of each snippet is sufficient to capture anomalies, reducing computational overhead and overall processing time. To match the required number of crops C, the extracted feature is duplicated C times to ensure dimensional consistency.
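The temporal and crop alignment described above can be sketched as follows in Python/NumPy; the equal-length snippet split and the helper names are assumptions made for illustration, not the authors' released code.

```python
import numpy as np

def snippet_keyframes(num_frames: int, T: int = 32) -> np.ndarray:
    """Return the index of the first (key) frame of each of T equal-length snippets."""
    boundaries = np.linspace(0, num_frames, T + 1, dtype=int)
    return boundaries[:-1]                      # first frame of every snippet

def tile_crops(feat: np.ndarray, C: int = 10) -> np.ndarray:
    """Duplicate a (T, D) per-snippet feature C times along a crop axis -> (T, C, D)."""
    return np.repeat(feat[:, None, :], C, axis=1)

# Example: a 5000-frame video, 32 snippets, 1280-dim YOLO feature per key frame
keys = snippet_keyframes(5000, T=32)            # frames on which detection is run
yolo_feat = np.zeros((32, 1280), dtype=np.float32)
yolo_feat_crops = tile_crops(yolo_feat, C=10)   # shape (32, 10, 1280)
```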
During the object detection process, a confidence threshold is set to filter out the less reliable results, ensuring that only objects detected with sufficient reliability are included in the textual output. However, focusing solely on the class category with the highest class confidence from the reliable object may result in limited and potentially inaccurate information, especially when the threshold value is low. Hence, to further incorporate the confidence levels of all classes, the final classification step of YOLO that produces discrete class IDs is bypassed, and the class probability distributions from the preceding stage are used as the desired features. As the object detector is trained to recognise N classes, these distributions provide N probabilities for each frame.
To incorporate the bounding box coordinates and dimensions, which are also known as the object position information, each input frame is uniformly divided into P patches, with each patch capable of representing N class probabilities. Consequently, the final feature dimension of $F_{yolo}$ becomes $D = P \times N$, as illustrated in Figure 2, and the feature is denoted as $F_{yolo} \in \mathbb{R}^{T \times D}$. Since each detected object is associated with a bounding box, its spatial overlap with each patch is computed. The ratio of the overlapping area to the total bounding box area is then calculated. If this ratio exceeds a predefined threshold, the corresponding patch is assigned the N-dimensional class probability vector of the detected object, as shown in Equation (1). This approach enables each patch to encode class information for one or more objects, allowing for multiple objects to be represented within the same patch if overlaps occur.
$$\mathrm{IoU} = \frac{\mathrm{Area}(\mathrm{Patch} \cap \mathrm{BoundingBox})}{\mathrm{Area}(\mathrm{BoundingBox})} > R_{threshold} \quad (1)$$
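As an illustration of Equation (1) and the patch layout of Figure 2, the sketch below builds the P × N per-frame feature from a list of detections. The detection tuple format, the square grid layout, and the choice to accumulate probabilities when several objects fall into the same patch are assumptions for this example; the paper does not prescribe how multiple assignments are combined.

```python
import numpy as np

def build_yolo_feature(detections, frame_w, frame_h, P=16, N=80, r_threshold=0.3):
    """Build the per-frame object feature of dimension P * N.

    detections: list of (x1, y1, x2, y2, probs), with probs an N-dimensional
                class probability vector taken before YOLO's final argmax step.
    A patch receives an object's probability vector when the overlap-to-box
    ratio exceeds r_threshold, as in Equation (1).
    """
    grid = int(np.sqrt(P))                        # e.g. a 4 x 4 grid when P = 16
    feat = np.zeros((P, N), dtype=np.float32)
    pw, ph = frame_w / grid, frame_h / grid       # patch width and height

    for (x1, y1, x2, y2, probs) in detections:
        box_area = max((x2 - x1) * (y2 - y1), 1e-6)
        for gy in range(grid):
            for gx in range(grid):
                px1, py1 = gx * pw, gy * ph
                px2, py2 = px1 + pw, py1 + ph
                # overlap between the bounding box and this patch
                ow = max(0.0, min(x2, px2) - max(x1, px1))
                oh = max(0.0, min(y2, py2) - max(y1, py1))
                if ow * oh / box_area > r_threshold:
                    patch_id = gy * grid + gx
                    feat[patch_id] += probs       # multiple objects may accumulate
    return feat.reshape(-1)                       # flattened P * N = 1280-dim feature
```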

3.2. Feature Pre-Processing

After obtaining $F_{i3d}$ and $F_{yolo}$, a multi-scale temporal network [7] is applied to $F_{i3d}$, while feature pre-processing is performed on $F_{yolo}$. Pre-processing is required before concatenation to transform the two distinct feature sets into a compatible representation, which facilitates more effective feature fusion and enhances overall training. Although no single pre-processing method consistently outperforms the others, all tested approaches yield better results than omitting pre-processing altogether, as reported in Section 4. To minimise the computational overhead introduced by the additional features, lightweight pre-processing techniques are preferentially adopted.
Inspired by the squeeze-and-excitation network (SE-Net) [30], we employ a lightweight channel recalibration mechanism, shown in Figure 3, to enhance the network’s representational capacity by capturing channel-wise dependencies at minimal computational cost.
The generated additional feature $F_{yolo}$ undergoes a global average pooling operation, serving as the squeeze step to embed global contextual information. Unlike the original SE-Net, which operates on spatial feature maps, $F_{yolo} \in \mathbb{R}^{T \times D}$ contains only temporal and channel dimensions, without spatial height and width. Nevertheless, the temporal dimension T can be analogously treated in the squeeze-and-excitation framework to achieve a similar recalibration effect. The squeeze operation can be formulated using Equation (2):
$$F_{yolo}^{sq} = \frac{1}{T} \sum_{i=1}^{T} F_{yolo}(i, :), \quad \text{with } F_{yolo} \in \mathbb{R}^{T \times D},\; F_{yolo}^{sq} \in \mathbb{R}^{D} \quad (2)$$
Once the squeeze is done, two fully connected (FC) layers are applied to $F_{yolo}^{sq}$, with a ReLU activation after the first FC layer and a sigmoid activation after the second. The resulting excitation vector is then used to re-weight the original feature through channel-wise multiplication across the temporal dimension, yielding the final recalibrated representation.
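A minimal PyTorch sketch of this temporal squeeze-and-excitation step is given below; the module name and the reduction ratio of 16 (the SE-Net default) are assumptions for illustration, not the exact configuration used in the experiments.

```python
import torch
import torch.nn as nn

class TemporalSELayer(nn.Module):
    """SE-style recalibration of F_yolo in R^{T x D}, following Equation (2).

    Squeeze: average over the temporal dimension T.
    Excitation: two FC layers (ReLU, then sigmoid) producing channel weights
    that re-weight every time step of the original feature.
    """
    def __init__(self, dim: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(dim, dim // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, D) batched YOLO features
        squeezed = x.mean(dim=1)           # (B, D): squeeze over T
        weights = self.fc(squeezed)        # (B, D): channel-wise excitation
        return x * weights.unsqueeze(1)    # broadcast the weights over T

# Example usage on the 1280-dimensional YOLO features
layer = TemporalSELayer(dim=1280)
out = layer(torch.randn(2, 32, 1280))      # shape preserved: (2, 32, 1280)
```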
Another pre-processing layer explored in this work is the Hydra attention module [31]. It is an efficient multi-head attention mechanism that increases the number of attention heads to match the number of feature channels, while reordering operations within the linear attention framework to maintain constant computational cost.
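For reference, a compact PyTorch sketch of Hydra attention is shown below, following the formulation in [31] with one head per feature channel and an l2-normalised (cosine-similarity) kernel; the single-layer projections and layer sizes are illustrative assumptions, not the exact module used in this work.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HydraAttention(nn.Module):
    """Hydra attention [31]: heads equal to the feature dimension, linear in sequence length."""
    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, D) sequence of snippet features
        q = F.normalize(self.q(x), dim=-1)      # cosine-similarity kernel
        k = F.normalize(self.k(x), dim=-1)
        v = self.v(x)
        kv = (k * v).sum(dim=1, keepdim=True)   # (B, 1, D): global mixing over T
        return self.out(q * kv)                 # (B, T, D), cost linear in T and D
```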

3.3. Fusion of Features

After pre-processing, the pre-trained feature $F_{i3d}$ is transformed into $X_{i3d}$ via the mapping $s_{\theta}: F_{i3d} \rightarrow X_{i3d}$, which is implemented by the multi-scale temporal network. Similarly, the object detection feature $F_{yolo}$ is transformed into $X_{yolo}$ through $s_{\phi}: F_{yolo} \rightarrow X_{yolo}$, which is realised by the feature pre-processing blocks. With both representations prepared, feature fusion is performed through direct concatenation, the simplest method of merging since the two features share the same shape along the temporal and crop dimensions. The resulting feature is denoted as $X = \mathrm{Concat}(X_{i3d}, X_{yolo})$, where $X \in \mathbb{R}^{T \times C \times D}$ and $D = D_{i3d} + D_{yolo}$.

3.4. Training Phase

After processing $X_{i3d}$ and $X_{yolo}$ and concatenating the resulting features into X, model training begins.
The training process involves the joint optimisation of feature magnitude learning loss and a binary cross-entropy (BCE)-based classification loss under the Multiple Instance Learning (MIL) framework adopted from the baseline work [7]. The corresponding loss functions are presented in Equations (3) and (4).
$$\mathcal{L}_{mag} = \mathbb{E}_{i}\!\left[\left( m - \left( \frac{1}{k} \sum_{t \in \Omega_k(X^{i}_{abn})} \left\| X^{i}_{abn} \right\|_2 - \frac{1}{k} \sum_{t \in \Omega_k(X^{i}_{nor})} \left\| X^{i}_{nor} \right\|_2 \right) \right)^{2}\right] \quad (3)$$
$$\mathcal{L}_{bce} = -\frac{1}{k} \sum_{t \in \Omega_k(X)} \left[\, y \log(p) + (1 - y) \log(1 - p) \,\right] \quad (4)$$
Equation (3) defines the feature magnitude learning loss $\mathcal{L}_{mag}$, which encourages a larger difference between the magnitudes of the top-k abnormal and normal snippet features. Here, $X^{i}_{abn}$ and $X^{i}_{nor}$ denote the abnormal and normal snippet features of the i-th video in the top-k set $\Omega_k(X^{i})$, containing the k snippets with the largest $\ell_2$-norm. The margin hyperparameter m defines the desired separation between the normal and abnormal features. The squared term penalises deviations from the margin symmetrically, and the loss is averaged across all videos in the batch.
Equation (4) defines the binary cross-entropy (BCE) loss $\mathcal{L}_{bce}$, which is computed over the top-k snippet scores selected from each video. Here, p and y denote the predicted probability and the ground-truth label of a snippet in the top-k set $\Omega_k(X)$, respectively.
In addition, two auxiliary loss components, namely, temporal smoothness and sparsity regularisation, are incorporated [13]. The loss components are summarised in Equations (5) and (6):
$$\mathcal{L}_{smooth} = \sum_{t=1}^{T-1} \left( s^{abn}_{t+1} - s^{abn}_{t} \right)^{2} \quad (5)$$
$$\mathcal{L}_{sparsity} = \sum_{t=1}^{T} \left\| s^{abn}_{t} \right\|_2 \quad (6)$$
Equations (5) and (6) define the auxiliary losses applied to the abnormal snippet scores $s^{abn}_{t}$, computed over all the abnormal snippets rather than the top-k. The temporal smoothness loss $\mathcal{L}_{smooth}$ encourages gradual changes between consecutive abnormal scores, penalising abrupt changes between adjacent snippets. The sparsity loss $\mathcal{L}_{sparsity}$ promotes sparse activation, keeping most abnormal scores low while allowing a few to be high.
The overall loss is defined in Equation (7), where the coefficients $\alpha$, $\lambda_1$ and $\lambda_2$ control the relative contributions of the losses.
$$\mathcal{L}_{total} = \alpha \mathcal{L}_{mag} + \mathcal{L}_{bce} + \lambda_1 \mathcal{L}_{smooth} + \lambda_2 \mathcal{L}_{sparsity} \quad (7)$$
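A sketch of Equations (3)–(7) in PyTorch is given below, assuming the top-k magnitudes and scores have already been gathered per video as in the selection step shown earlier; the tensor shapes, the default value of α (not stated in the text), and the broadcasting of video-level labels to the selected snippets are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def total_loss(topk_mag_abn, topk_mag_nor,   # (B, k) top-k feature magnitudes
               topk_scores, topk_labels,     # (B, k) snippet scores and broadcast video labels
               scores_abn,                   # (B, T) all abnormal-video snippet scores
               m=100.0, alpha=1e-4, lam1=1.0, lam2=1.0):
    # Eq. (3): squared deviation from the margin between mean top-k magnitudes
    separation = topk_mag_abn.mean(dim=1) - topk_mag_nor.mean(dim=1)   # (B,)
    l_mag = ((m - separation) ** 2).mean()

    # Eq. (4): BCE over the top-k snippet scores (scores already in [0, 1])
    l_bce = F.binary_cross_entropy(topk_scores, topk_labels)

    # Eq. (5): temporal smoothness of abnormal snippet scores
    l_smooth = ((scores_abn[:, 1:] - scores_abn[:, :-1]) ** 2).sum(dim=1).mean()

    # Eq. (6): sparsity of abnormal snippet scores
    l_sparsity = scores_abn.abs().sum(dim=1).mean()

    # Eq. (7): weighted sum (alpha = 1e-4 here is only an illustrative default)
    return alpha * l_mag + l_bce + lam1 * l_smooth + lam2 * l_sparsity
```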

4. Experiments and Discussions

4.1. Datasets

The experiments are performed on two benchmark datasets, the ShanghaiTech [12] and UCF-Crime [13] datasets. These two benchmarks only provide video-level labels and include both the normal and abnormal videos. For the normal videos, all snippets are considered normal, whereas in abnormal videos, only some snippets exhibit abnormal behaviours.
The ShanghaiTech Campus dataset is a medium-scale dataset covering 13 different scenes. It contains a total of 437 videos, of which 130 are abnormal. The videos are captured under complex lighting conditions and from different camera angles, but are less complex than those in the UCF-Crime dataset [12]. The original dataset was reorganised by Zhong et al., who moved a subset of the abnormal testing videos into the training set to construct a weakly supervised training split [23]. This configuration is widely adopted in the research community, and this work follows the same setup.
UCF-Crime is a large-scale dataset consisting of 1900 long untrimmed videos with an average length of 7247 frames. It covers 13 types of real-world anomalies, such as robbery and shooting, selected for their significant impact on public safety. Videos in this dataset span complex and diverse environments, including both indoor and outdoor scenes as well as day-time and night-time footage. For the experiments, 1610 videos are used as the training set with video-level labels, while the remaining 290 videos are used as the testing set with frame-level labels.

4.2. Metrics

In line with previous work and to ensure a fair comparison with the baseline, the frame-level Area Under the Receiver Operating Characteristic Curve (AUC) is used as the evaluation metric for both datasets. AUC measures the model’s ability to distinguish between normal and anomalous frames across all threshold settings, with a higher AUC indicating better discriminative performance. This makes it especially suitable for evaluating models on imbalanced datasets, where accuracy can be misleading.
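As a minimal illustration, frame-level AUC can be computed by repeating each snippet score over the frames it covers and scoring against the frame-level ground truth; the 16-frames-per-snippet expansion below is an assumption for the example, since the actual mapping from snippets to frames depends on the feature extraction pipeline.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def frame_level_auc(snippet_scores: np.ndarray,
                    frame_labels: np.ndarray,
                    frames_per_snippet: int = 16) -> float:
    """Expand per-snippet anomaly scores to frames and compute frame-level AUC."""
    frame_scores = np.repeat(snippet_scores, frames_per_snippet)
    frame_scores = frame_scores[: len(frame_labels)]   # trim to the video length
    return roc_auc_score(frame_labels, frame_scores)

# Example: 32 snippet scores evaluated against 500 frame-level labels
scores = np.random.rand(32)
labels = np.zeros(500, dtype=int)
labels[200:260] = 1                                     # an anomalous segment
print(frame_level_auc(scores, labels))
```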

4.3. Implementation Details

Following prior works, 2048-dimensional video features are extracted using the I3D network [8]. For a fair comparison, the same setup as the baseline RTFM model [7] is adopted. The basic hyperparameters follow the default settings: $T = 32$, $m = 100$, $C = 10$, $k = 3$, i.e., each video is divided into 32 snippets, the training margin is set to 100, and the top 3 snippets are selected for computing the separability. The smoothness and sparsity weights are set to $\lambda_1 = \lambda_2 = 1$. The Adam optimiser is used with a learning rate of 0.005 for ShanghaiTech and 0.001 for UCF-Crime. The batch size B is the same for both the normal and abnormal videos, meaning that each mini-batch contains B normal and B abnormal videos selected at random. B is set to 16 for ShanghaiTech and 64 for UCF-Crime.
For YOLO detector settings, the default configuration for generating text results is used, employing the model weights ‘yolov7.pt’ and ‘yolov9-c-converted.pt’, with a confidence threshold of 0.25 and an IoU threshold of 0.45. The text result includes object class IDs, bounding box coordinates, bounding box dimensions (width & height) and class confidence scores. To generate the object detection feature $F_{yolo}$, the feature dimension is set to $D_{yolo} = P \times N = 1280$, where $P = 16$, $N = 80$, and $R_{threshold} = 0.3$ are selected. We empirically found that $P = 16$ patches offer a good trade-off between detection accuracy and computational efficiency.

4.4. Benchmark Performance

The experiments were first conducted on the ShanghaiTech dataset, with the frame-level AUC performance shown in Table 1. Compared with the unsupervised approaches, the proposed method demonstrates superior performance. The baseline model is retrained using the I3D RGB features, and the proposed model is trained with the same parameters. The results show that the proposed method performs 1.63% better than the reproduced baseline model, with further improvement observed when using the YOLOv9 object detector instead of YOLOv7.
Table 2 presents the frame-level AUC results on the UCF-Crime dataset. The proposed method achieves a notable improvement over unsupervised approaches and consistently outperforms the retrained baseline under identical experimental settings, with a performance gain of 1.39%. Additionally, incorporating the more advanced YOLOv9 object detector further enhances performance compared to its YOLOv7 counterpart, highlighting the benefit of stronger object-level representation in video anomaly detection tasks.

4.5. Ablation Study

Ablation studies have been carried out to evaluate the effectiveness of the proposed additional feature format (Section 3.1), the pre-processing methods (Section 3.2), and to analyse the impact of different object detectors.

4.5.1. Comparison with Feature Formats

In the baseline RTFM model, the I3D features are first passed through an MTN block to unify the feature representations. However, this block is not applied to the YOLO-generated features, as they encode class probability distributions across spatial patches rather than temporal dynamics.
The FeatVec method proposed by Doshi et al. [39] was evaluated as a way to better align the object features with the training network. It combines the X and Y coordinates, the object area, and the class probabilities into a single feature vector (shown in Equation (8)) for further training. As an alternative, the proposed YOLO feature format, represented as $F_{yolo}$, has been tested in various experiments based on the YOLOv9 features to assess its effectiveness. The results, presented in Table 3, include comparisons across different pre-processing methods to provide a more comprehensive evaluation.
$$F^{i}_{t} = \left[\, C_x,\; C_y,\; \mathrm{Area},\; p(C_1),\; p(C_2),\; \ldots,\; p(C_n) \,\right] \quad (8)$$
The results show that (1) the proposed feature format consistently outperforms the traditional FeatVec approach, validating its effectiveness; and (2) applying the traditional MTN processing block to the additional YOLO features offers limited improvement over using the I3D features alone, highlighting the importance of dedicated pre-processing strategies.

4.5.2. Comparison with Object Detectors

Since the proposed method can be seamlessly integrated with all existing object detectors, it is essential to assess the impact of different detector choices. As YOLO serves as the primary feature generator, a comparative analysis between YOLOv7 and YOLOv9, developed by the same team and based on similar principles, is conducted. The results of this comparison are presented in Table 4.
Analysis of the table reveals that while YOLOv9 does not outperform YOLOv7 in every configuration, it generally achieves better overall results. In cases where YOLOv7 performs slightly better, the margin is minimal, whereas YOLOv9 often yields significantly stronger performance when it excels. This trend suggests that the more advanced architecture of YOLOv9, specifically its improved gradient flow via PGI, enhanced layer aggregation with GELAN, and more efficient multi-scale representations, enables more accurate object detection, which in turn enhances the effectiveness of the additional object-level features in our model. Since the proposed technique can be seamlessly applied to any object detector, it holds strong potential for future improvements.

4.6. Qualitative Analysis

Qualitative analysis is presented in Figure 4, which shows the predicted anomaly scores generated by our network. Four example videos from the UCF-Crime dataset, including three anomalous and one normal video, are analysed. The results demonstrate that our model can effectively detect abnormal snippets across different types of anomalous events and can identify multiple events within a single video, which is a more challenging task in weakly supervised video anomaly detection. By incorporating object-level information through the additional feature $F_{yolo}$, our model achieves better performance in capturing abnormal events. Specifically, it produces higher and more stable anomaly scores for abnormal snippets (Figure 4a–c) and lower scores for normal videos (Figure 4d), compared to using only the I3D feature.

5. Conclusions

In conclusion, this paper proposes a novel weakly supervised approach for video anomaly detection. The proposed approach enables the network to leverage additional object-level information by effectively integrating object detection capabilities and an attention mechanism into the baseline framework. A custom-designed feature format is developed to encapsulate all the relevant object information, allowing seamless integration with other features, such as I3D, to support effective network training. For this work, the YOLO object detector is employed to obtain accurate object class categories and positional data, thereby enhancing the reliability and effectiveness of anomaly detection in surveillance videos. The proposed method demonstrates improved performance over the baseline and achieves results competitive with state-of-the-art (SOTA) methods on benchmark datasets, offering a promising alternative for tackling WSVAD tasks that previous methods have not fully explored.
While the proposed method leverages object-level information, the object detector is limited to a fixed set of predefined object categories, which may not fully capture all relevant objects in specific surveillance scenarios, such as crime-related events. Some object classes may also be redundant or less informative for anomaly detection. In future work, we plan to investigate detectors capable of learning a larger variety of object categories and models that can focus on scenario-relevant objects, potentially further enhancing the effectiveness of WSVAD.

Author Contributions

Conceptualisation, Y.W. and Y.C.; methodology, Y.W.; software, C.K.Y.; validation, Y.W., Y.C. and C.K.Y.; formal analysis, Y.W.; investigation, Y.W. and Y.C.; resources, C.K.Y.; writing—original draft preparation, Y.W.; writing—review and editing, C.K.Y.; supervision, C.K.Y.; project administration, Y.W.; funding acquisition, C.K.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research is partially supported by the Singapore MOE grant RG100/23 and NTU grant 03INS001984C130.

Data Availability Statement

ShanghaiTech Campus Dataset is publicly available at: https://svip-lab.github.io/dataset/campus_dataset.html (accessed on 10 September 2025). UCF-Crime dataset is publicly available at: https://www.crcv.ucf.edu/projects/real-world/ (accessed on 10 September 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Pang, G.; Shen, C.; Cao, L.; Hengel, A.V.D. Deep Learning for Anomaly Detection: A Review. ACM Comput. Surv. 2022, 54, 38. [Google Scholar] [CrossRef]
  2. Fernandes, G.; Rodrigues, J.J.P.C.; Carvalho, L.F.; Al-Muhtadi, J.F.; Proença, M.L. A comprehensive survey on network anomaly detection. Telecommun. Syst. 2019, 70, 447–489. [Google Scholar] [CrossRef]
  3. Hilal, W.; Gadsden, S.A.; Yawney, J. Financial fraud: A review of anomaly detection techniques and recent advances. Expert Syst. Appl. 2022, 193, 116429. [Google Scholar] [CrossRef]
  4. Fernando, T.; Gammulle, H.; Denman, S.; Sridharan, S.; Fookes, C. Deep Learning for Medical Anomaly Detection—A Survey. ACM Comput. Surv. 2022, 54, 141. [Google Scholar] [CrossRef]
  5. Avola, D.; Cinque, L.; Di Mambro, A.; Diko, A.; Fagioli, A.; Foresti, G.L.; Marini, M.R.; Mecca, A.; Pannone, D. Low-Altitude Aerial Video Surveillance via One-Class SVM Anomaly Detection from Textural Features in UAV Images. Information 2021, 13, 2. [Google Scholar] [CrossRef]
  6. Hasan, M.; Choi, J.; Neumann, J.; Roy-Chowdhury, A.K.; Davis, L.S. Learning temporal regularity in video sequences. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 733–742. [Google Scholar] [CrossRef]
  7. Tian, Y.; Pang, G.; Chen, Y.; Singh, R.; Verjans, J.W.; Carneiro, G. Weakly-supervised video anomaly detection with robust temporal feature magnitude learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 4975–4986. [Google Scholar] [CrossRef]
  8. Carreira, J.; Zisserman, A. Quo vadis, action recognition? A new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6299–6308. [Google Scholar] [CrossRef]
  9. Chen, Y.; Liu, Z.; Zhang, B.; Fok, W.; Qi, X.; Wu, Y.C. Mgfn: Magnitude-contrastive glance-and-focus network for weakly-supervised video anomaly detection. Proc. AAAI Conf. Artif. Intell. 2023, 37, 387–395. [Google Scholar] [CrossRef]
  10. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar] [CrossRef]
  11. Wang, C.Y.; Yeh, I.H.; Liao, H.Y.M. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. In Computer Vision–ECCV 2024; Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G., Eds.; Lecture Notes in Computer Science Series; Springer Nature: Cham, Switzerland, 2025; Volume 15089, pp. 1–21. [Google Scholar]
  12. Liu, W.; Luo, W.; Lian, D.; Gao, S. Future frame prediction for anomaly detection–a new baseline. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6536–6545. [Google Scholar] [CrossRef]
  13. Sultani, W.; Chen, C.; Shah, M. Real-world anomaly detection in surveillance videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6479–6488. [Google Scholar] [CrossRef]
  14. Acsintoae, A.; Florescu, A.; Georgescu, M.I.; Mare, T.; Sumedrea, P.; Ionescu, R.T.; Khan, F.S.; Shah, M. Ubnormal: New benchmark for supervised open-set video anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 20143–20153. [Google Scholar] [CrossRef]
  15. Gong, D.; Liu, L.; Le, V.; Saha, B.; Mansour, M.R.; Venkatesh, S.; Hengel, A.v.d. Memorizing normality to detect anomaly: Memory-augmented deep autoencoder for unsupervised anomaly detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1705–1714. [Google Scholar] [CrossRef]
  16. Zaheer, M.Z.; Lee, J.h.; Astrid, M.; Lee, S.I. Old is gold: Redefining the adversarially learned one-class classifier training paradigm. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 14183–14193. [Google Scholar] [CrossRef]
  17. Pang, G.; Yan, C.; Shen, C.; Hengel, A.v.d.; Bai, X. Self-trained deep ordinal regression for end-to-end video anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 12173–12182. [Google Scholar] [CrossRef]
  18. Zhao, H.; Jia, J.; Koltun, V. Exploring self-attention for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10076–10085. [Google Scholar] [CrossRef]
  19. Luo, W.; Liu, W.; Gao, S. Remembering history with convolutional lstm for anomaly detection. In Proceedings of the 2017 IEEE International Conference on Multimedia and Expo (ICME), Hong Kong, China, 10–14 July 2017; pp. 439–444. [Google Scholar] [CrossRef]
  20. Nguyen, D.T.; Lou, Z.; Klar, M.; Brox, T. Anomaly detection with multiple-hypotheses predictions. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 4800–4809. [Google Scholar]
  21. Nguyen, T.N.; Meunier, J. Anomaly detection in video sequence with appearance-motion correspondence. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1273–1283. [Google Scholar] [CrossRef]
  22. Lv, H.; Yue, Z.; Sun, Q.; Luo, B.; Cui, Z.; Zhang, H. Unbiased multiple instance learning for weakly supervised video anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 8022–8031. [Google Scholar] [CrossRef]
  23. Zhong, J.X.; Li, N.; Kong, W.; Liu, S.; Li, T.H.; Li, G. Graph convolutional label noise cleaner: Train a plug-and-play action classifier for anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 1237–1246. [Google Scholar] [CrossRef]
  24. Wan, B.; Fang, Y.; Xia, X.; Mei, J. Weakly supervised video anomaly detection via center-guided discriminative learning. In Proceedings of the 2020 IEEE International Conference on Multimedia and Expo (ICME), London, UK, 6–10 July 2020; pp. 1–6. [Google Scholar] [CrossRef]
  25. Feng, D.; Harakeh, A.; Waslander, S.L.; Dietmayer, K. A review and comparative study on probabilistic object detection in autonomous driving. IEEE Trans. Intell. Transp. Syst. 2021, 23, 9961–9980. [Google Scholar] [CrossRef]
  26. Yang, R.; Yu, Y. Artificial convolutional neural network in object detection and semantic segmentation for medical imaging analysis. Front. Oncol. 2021, 11, 638182. [Google Scholar] [CrossRef] [PubMed]
  27. Raghunandan, A.; Raghav, P.; Aradhya, H.R. Object detection algorithms for video surveillance applications. In Proceedings of the 2018 International Conference on Communication and Signal Processing (ICCSP), Chennai, India, 3–5 April 2018; pp. 0563–0568. [Google Scholar] [CrossRef]
  28. Diwan, T.; Anirudh, G.; Tembhurne, J.V. Object detection using YOLO: Challenges, architectural successors, datasets and applications. Multimed. Tools Appl. 2023, 82, 9243–9275. [Google Scholar] [CrossRef] [PubMed]
  29. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar] [CrossRef]
  30. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar] [CrossRef]
  31. Bolya, D.; Fu, C.Y.; Dai, X.; Zhang, P.; Hoffman, J. Hydra Attention: Efficient Attention with Many Heads. In Computer Vision–ECCV 2022 Workshops; Karlinsky, L., Michaeli, T., Nishino, K., Eds.; Lecture Notes in Computer Science Series; Springer Nature: Cham, Switzerland, 2023; Volume 13807, pp. 35–49. [Google Scholar] [CrossRef]
  32. Luo, W.; Liu, W.; Gao, S. A revisit of sparse coding based anomaly detection in stacked rnn framework. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 341–349. [Google Scholar] [CrossRef]
  33. Park, H.; Noh, J.; Ham, B. Learning memory-guided normality for anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 14372–14381. [Google Scholar] [CrossRef]
  34. Yu, G.; Wang, S.; Cai, Z.; Zhu, E.; Xu, C.; Yin, J.; Kloft, M. Cloze Test Helps: Effective Video Anomaly Detection via Learning to Complete Video Events. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 583–591. [Google Scholar] [CrossRef]
  35. Zhang, J.; Qing, L.; Miao, J. Temporal convolutional network with complementary inner bag loss for weakly supervised anomaly detection. In Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 22–25 September 2019; pp. 4030–4034. [Google Scholar] [CrossRef]
  36. Wang, J.; Cherian, A. Gods: Generalized one-class discriminative subspaces for anomaly detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8201–8211. [Google Scholar] [CrossRef]
  37. Zaheer, M.Z.; Mahmood, A.; Khan, M.H.; Segu, M.; Yu, F.; Lee, S.I. Generative cooperative learning for unsupervised video anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 14744–14754. [Google Scholar] [CrossRef]
  38. Wu, P.; Liu, J.; Shi, Y.; Sun, Y.; Shao, F.; Wu, Z.; Yang, Z. Not only Look, But Also Listen: Learning Multimodal Violence Detection Under Weak Supervision. In Computer Vision–ECCV 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M., Eds.; Lecture Notes in Computer Science Series; Springer International Publishing: Cham, Switzerland, 2020; Volume 12375, pp. 322–339. [Google Scholar] [CrossRef]
  39. Doshi, K.; Yilmaz, Y. Continual learning for anomaly detection in surveillance videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 13–19 June 2020; pp. 254–255. [Google Scholar] [CrossRef]
Figure 1. Overview of the proposed framework. Untrimmed videos are taken as input and processed along two paths: (a) a pretrained I3D model is employed to extract the visual features, which are then refined through a multi-scale temporal network to obtain the final visual representation; and (b) an object detector is used to generate the YOLO-based labels, followed by the feature generation process to obtain the YOLO features. Feature pre-processing adaptations, including the SE (Squeeze-and-Excitation) layer and the Hydra attention block, are applied to produce the final object-centric representation. The two types of features are combined and then undergo a Top-K magnitude selection process. The reference frame shown here is taken from the UCF-Crime dataset.
Figure 2. Patch IDs and bounding boxes. A total of 16 patches are derived from a single frame. Orange rectangles indicate that the bounding box is totally inside the patch box; blue rectangles indicate that the bounding box spans across multiple patch boxes. E.g., Patch ID 1 receives the class confidence from the orange object, as it is fully enclosed within the patch. Additionally, since the overlap between the blue object and the patch exceeds a predefined threshold, the blue object is also considered.
Figure 3. Schematic illustration of the Feature Pre-processing module based on the Squeeze-and-Excitation (SE) mechanism.
Figure 4. Predicted anomaly scores over time for four example videos: (a) Shoplifting028; (b) Arrest001; (c) Burglary037; (d) Normal_Videos_019. Red areas denote ground truth anomalous events.
Table 1. Frame-level AUC performance on the ShanghaiTech dataset. * indicates the result from retraining by the authors. The best result is in bold.

| Supervision | Method | Features | AUC (%) |
|---|---|---|---|
| Unsupervised | Conv-AE [6] | - | 60.85 |
| | Stacked-RNN [32] | - | 68.00 |
| | Frame-Pred [12] | - | 73.40 |
| | Mem-AE [15] | - | 71.20 |
| | MNAD [33] | - | 70.50 |
| | VEC [34] | - | 74.80 |
| Weakly Supervised | GCN-Anomaly [23] | C3D RGB | 76.44 |
| | Zhang et al. [35] | I3D RGB | 82.50 |
| | GCN-Anomaly [23] | TSN Flow | 84.13 |
| | GCN-Anomaly [23] | TSN RGB | 84.44 |
| | AR-Net [24] | I3D Flow | 82.32 |
| | AR-Net [24] | I3D RGB | 85.38 |
| | AR-Net [24] | I3D RGB & I3D Flow | 91.24 |
| | RTFM [7] | I3D RGB | 97.21 |
| | RTFM * [7] | I3D RGB | 95.82 |
| | Ours | I3D RGB & YOLOv7 | 96.51 |
| | Ours | I3D RGB & YOLOv9 | **97.45** |
Table 2. Frame-level AUC performance on the UCF-Crime dataset. * indicates the result from retraining by the authors. The best result is in bold.

| Supervision | Method | Features | AUC (%) |
|---|---|---|---|
| Unsupervised | SVM Baseline | - | 50.00 |
| | Conv-AE [6] | - | 50.60 |
| | BODS [36] | I3D RGB | 68.26 |
| | GODS [36] | I3D RGB | 70.46 |
| | Zaheer et al. [37] | ResNext | 71.04 |
| Weakly Supervised | Sultani et al. [13] | C3D RGB | 75.41 |
| | Sultani et al. [13] | I3D RGB | 77.92 |
| | Zhang et al. [35] | C3D RGB | 78.66 |
| | GCN-Anomaly [23] | C3D RGB | 81.08 |
| | GCN-Anomaly [23] | TSN Flow | 78.08 |
| | GCN-Anomaly [23] | TSN RGB | 82.12 |
| | Wu et al. [38] | I3D RGB | 82.44 |
| | RTFM [7] | I3D RGB | 84.30 |
| | RTFM * [7] | I3D RGB | 83.03 |
| | Ours | I3D RGB & YOLOv7 | 83.81 |
| | Ours | I3D RGB & YOLOv9 | **84.42** |
Table 3. Performance of different feature formats, including FeatVec and the newly proposed format. Various pre-processing methods are evaluated, utilising the YOLOv9 detector. The best result in each dataset is highlighted in bold.

| Dataset | Pre-Processing | FeatVec [39] AUC (%) | Proposed Format AUC (%) |
|---|---|---|---|
| ShanghaiTech | MTN | 94.73 | 95.16 |
| | SE Layer | 96.07 | **97.45** |
| | Hydra Attn | 96.15 | 95.93 |
| UCF-Crime | MTN | 82.89 | 83.18 |
| | SE Layer | 83.49 | 83.50 |
| | Hydra Attn | 82.73 | **84.42** |
Table 4. Comparison between different object detectors—YOLOv7 and YOLOv9. The best result in each dataset is highlighted in bold.

| Dataset | Pre-Processing | YOLOv7 AUC (%) | YOLOv9 AUC (%) |
|---|---|---|---|
| ShanghaiTech | SE Layer | 96.51 | **97.45** |
| | Hydra Attn | 96.28 | 95.93 |
| UCF-Crime | SE Layer | 83.81 | 83.50 |
| | Hydra Attn | 83.06 | **84.42** |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
