Article

WSVAD-CLIP: Temporally Aware and Prompt Learning with CLIP for Weakly Supervised Video Anomaly Detection

School of Artificial Intelligence and Computer Science, North China University of Technology, No. 5 Jinyuanzhuang Road, Beijing 100144, China
*
Author to whom correspondence should be addressed.
J. Imaging 2025, 11(10), 354; https://doi.org/10.3390/jimaging11100354
Submission received: 7 September 2025 / Revised: 29 September 2025 / Accepted: 3 October 2025 / Published: 10 October 2025
(This article belongs to the Section Computer Vision and Pattern Recognition)

Abstract

Weakly Supervised Video Anomaly Detection (WSVAD) is a critical task in computer vision. It aims to localize and recognize abnormal behaviors using only video-level labels. Without frame-level annotations, it becomes significantly challenging to model temporal dependencies. Given the diversity of abnormal events, it is also difficult to model semantic representations. Recently, the cross-modal pre-trained model Contrastive Language-Image Pretraining (CLIP) has shown a strong ability to align visual and textual information. This provides new opportunities for video anomaly detection. Inspired by CLIP, WSVAD-CLIP is proposed as a framework that uses its cross-modal knowledge to bridge the semantic gap between text and vision. First, the Axial-Graph (AG) Module is introduced. It combines an Axial Transformer and Lite Graph Attention Networks (LiteGAT) to capture global temporal structures and local abnormal correlations. Second, a Text Prompt mechanism is designed. It fuses a learnable prompt with a knowledge-enhanced prompt to improve the semantic expressiveness of category embeddings. Third, the Abnormal Visual-Guided Text Prompt (AVGTP) mechanism is proposed to aggregate anomalous visual context for adaptively refining textual representations. Extensive experiments on UCF-Crime and XD-Violence datasets show that WSVAD-CLIP notably outperforms existing methods in coarse-grained anomaly detection. It also achieves superior performance in fine-grained anomaly recognition tasks, validating its effectiveness and generalizability.

1. Introduction

Video Anomaly Detection (VAD) aims to automatically detect anomalous patterns or events that deviate from normal behavior in video sequences [1,2,3], holding significant value in practical applications such as public safety. As anomalous behaviors are both uncommon and diverse in nature, the detection task faces considerable challenges [4]. Effective VAD technology can assist intelligent surveillance systems in promptly detecting potential threats, thereby enhancing public safety [5,6].
Early VAD methods primarily adopted one-class classification approaches (semi-supervised) [7,8,9], but they often misclassified unseen normal behaviors as anomalies. Fully supervised methods require frame-level annotations [10,11], which are prohibitively expensive, while unsupervised methods usually suffer from suboptimal performance due to a lack of explicit supervision [12]. In contrast, Weakly Supervised Video Anomaly Detection (WSVAD) has become a practical compromise, requiring only video-level labels while maintaining competitive detection performance [5,13,14,15,16]. Recently, many researchers have focused on weakly supervised VAD.
Most WSVAD methods extract frame-level features using pre-trained models such as Convolutional 3D Network (C3D) [5,17], Inflated 3D ConvNet (I3D) [3,18], and Vision Transformer (ViT) [19,20], followed by Multiple Instance Learning (MIL) to predict anomaly scores. Despite their progress, two major challenges remain: modeling temporal dependencies of anomalies with diverse durations and capturing the complex semantics of anomalies from weak supervision [15,16].
In the past two years, Large Language and Vision (LLV) models and Vision-Language Pretraining (VLP) models [21,22,23,24] such as CLIP [25] have achieved deep alignment between vision and language modalities, providing new opportunities to leverage textual semantics for anomaly detection.
Recently, several studies have attempted to leverage CLIP for video anomaly detection. However, existing methods still show clear gaps: some rely only on visual features from CLIP’s encoder [26,27], while others fail to effectively model vision-language relationships [28]. Consequently, they cannot fully exploit CLIP’s cross-modal capability for anomaly understanding.
To address these limitations, this paper introduces WSVAD-CLIP, a framework that enhances both temporal modeling and semantic alignment. Specifically, an Axial-Graph (AG) Module is proposed, which combines Lite Graph Attention Networks (LiteGAT) [29] and an Axial Transformer [30] to capture both local motion variations and long-range dependencies. Furthermore, unlike prior works that only use CLIP’s original text embeddings, a Text Prompt is designed to integrate learnable prompt [31] with knowledge-enhanced prompt [32], and it is further extended with an Abnormal Visual-Guided Text Prompt (AVGTP) [33] to inject abnormal visual cues into textual embeddings. These designs enable richer semantic representations and stronger cross-modal alignment for anomaly detection.
The main contributions of this work are summarized as follows:
(1) A temporal modeling AG Module is introduced that simultaneously captures temporal dependencies through an Axial Transformer and LiteGAT, effectively enhancing the performance of the WSVAD task.
(2) A Text Prompt mechanism is proposed that combines learnable prompt with knowledge-enhanced prompt to gain deeper insights into the specific semantics of video anomalies. Visual information and text are further integrated through an AVGTP mechanism to generate more informative class-level embeddings.
(3) Extensive experiments are conducted on two representative large-scale benchmark datasets, UCF-Crime [5] and XD-Violence [3], and the results demonstrate the superior performance of the proposed framework.

2. Related Work

2.1. Video Anomaly Detection

Under the current research paradigm, video anomaly detection tasks are commonly classified into four categories according to existing supervision methods. The first category is one-class classification, i.e., semi-supervised video anomaly detection, where the training process involves only normal instances [7,8,9,34]. The second category is fully supervised video anomaly detection, which requires detailed frame-level annotations in the training set [10,11]. The third category is unsupervised video anomaly detection, which identifies anomalies from completely unlabeled videos [12]. The fourth category is weakly supervised video anomaly detection, which is trained with video-level normal/abnormal annotations [5,13,14,15,16,35,36,37].
Among these methods, weakly supervised video anomaly detection has attracted growing attention in the last few years, becoming a research hotspot. WSVAD generally achieves encouraging performance while reducing annotation costs. Sultani et al. pioneered the formulation of WSVAD as a Multiple Instance Learning (MIL) task [5]. Each video was regarded as a bag, with its temporal segments considered as instances. By leveraging a video-level ranking loss, their model aimed to maximize the separation between the most anomalous instances in positively and negatively labeled video bags. Subsequently, Zhong et al. proposed a method based on Graph Convolutional Network (GCN) to capture the similarity of features and temporal consistency between video segments [38]. Tian et al. designed a robust temporal feature magnitude learning strategy to enhance the robustness of MIL-based methods against negative instances within anomalous videos, and integrated dilated convolutions and self-attention mechanisms to capture both long-range and short-range temporal dependencies [16]. Li et al. developed a Multi-Sequence Learning (MSL) model based on the Transformer architecture to estimate anomaly likelihoods at both the video and segment levels [20]. Zhou et al. proposed an Uncertainty-Regulated Dual-Memory Unit (UR-DMU) to distinguish between normal and anomalous instances, and embedded global and local self-attention strategies within a Transformer framework to effectively capture temporal dependencies [39]. The aforementioned methods focus solely on coarse-grained VAD. Wu et al. proposed a fine-grained WSVAD approach that differentiates between various types of anomalous frames [40]; this approach also enables anomaly detection using multimodal information. However, current WSVAD methods focus solely on visual patterns, neglecting the rich semantic information that could be obtained through vision-language models.

2.2. Large Language and Vision Models

Recently, large-scale pre-trained models have made significant advancements. Zanella et al. designed LAVAD [41], which leverages pre-trained large language models and vision-language models to perform anomaly detection without any training by generating and analyzing textual descriptions of video frames. Dev et al. proposed MCANet, a training-free multi-modal framework leveraging vision, audio, and language models to generate captions for anomaly detection without extra data or retraining [42]. CLIP is also one of the most representative of these models; it has demonstrated unprecedented performance in various image-related tasks such as image classification [31], object detection [43], and semantic segmentation [44]. Lately, CLIP has also been effectively adapted to the video field. For instance, VideoCLIP was designed to align video–text representations via contrasting time-overlapping video–text pairs with extracted hard negative samples [45]. Some recent works have also explored the application of CLIP to VAD. Joo et al. proposed leveraging CLIP to effectively extract discriminative visual features; however, their approach did not utilize textual information [26].
Zanella et al. employed a feature space transformation based on CLIP to learn anomalies and utilized semantic information to perform MIL segment selection and anomaly identification [28]. Yang et al. introduced TPWNG, which fuses visual features with textual features derived from CLIP's text encoder to produce pseudo labels, and then adopts a supervised learning approach for anomaly training [46]. The OPVAD method proposed by Wu et al. addresses a more realistic and open setting for video anomaly detection [47]. These three studies all adopt CLIP-based approaches, including multimodal transformer-based methods, that integrate visual and textual modalities for anomaly detection. However, the ways in which existing methods leverage textual information remain relatively simplistic. To address these limitations, this work proposes a novel approach that more deeply integrates the vision-language knowledge of the pre-trained CLIP model, enhancing the performance of weakly supervised video anomaly detection and enabling finer-grained recognition of abnormal events.

3. Method

In this section, the proposed WSVAD approach is presented. First, the overall architecture is introduced, followed by a detailed explanation of its core components.

3.1. Overall Architecture

The WSVAD task aims to train a model for frame-level anomaly detection using data that only contains video-level labels. In this task, the training data consists of several tuples (V, y), where V represents a complete video, and y is a binary label denoting the presence or absence of anomalous frames in V. If all frames in the video are normal, the label y is 0; if there exists at least one abnormal frame, the label y is 1.
As illustrated in Figure 1, both abnormal and normal training videos are fed into the CLIP image encoder to obtain visual features. The anomaly category labels are processed through the learnable prompt and the knowledge-enhanced prompt, respectively, and then passed through the text encoder to obtain fused anomaly feature embeddings. To capture temporal dependencies, the visual features are input into the AG Module to produce frame-level representations, which are then used for coarse-grained anomaly detection by training a binary classifier. Furthermore, the AVGTP mechanism combines the aggregated abnormal visual information with the fused textual features to form hybrid prompt features. These hybrid features are then aligned with visual frame features at a fine-grained level, enabling fine-grained abnormal event recognition [33].

3.2. AG Module

In the WSVAD task, video frames are fed into the CLIP image encoder to obtain video features. However, to bridge the gap between CLIP and video-based tasks, it is essential to model the temporal dependencies inherent in videos. To address this issue, a novel AG Module is proposed that captures both global and local temporal information by integrating LiteGAT with an Axial Transformer [29,30]. The architecture of the AG Module is illustrated in Figure 2.
The AG Module first feeds video features into the Axial Transformer, where positional encoding and dual-axis attention are employed to model the global temporal dependencies between frames, resulting in initial visual features. Then, two graph structures are constructed: an adjacency graph (adj) based on semantic similarity and a distance graph (disadj) based on temporal intervals. These are fed into the dual-branch LiteGAT module. Each branch incorporates motion enhancement, global statistics modeling, and Gumbel Softmax sampling implemented using PyTorch 2.5.1 to capture structural features between keyframes at both local and global levels. Finally, the outputs of both branches are fused with the Axial Transformer features to enhance the expressiveness of frame-level representations.
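To make the graph construction concrete, the sketch below shows one plausible way to build the similarity-based adjacency graph (adj) and the temporal-distance graph (disadj) in PyTorch. The Gaussian decay parameter and the row normalization are illustrative assumptions of this sketch, not the exact implementation.

```python
import torch
import torch.nn.functional as F

def build_graphs(feats: torch.Tensor, sigma: float = 1.0):
    """Build the two adjacency structures fed to the dual-branch LiteGAT.

    feats: (T, d) frame-level features produced by the Axial Transformer.
    Returns (adj, disadj), both of shape (T, T).
    """
    T = feats.size(0)

    # Semantic-similarity graph: pairwise cosine similarity between frames.
    normed = F.normalize(feats, dim=-1)
    adj = (normed @ normed.t()).clamp(min=0)        # keep non-negative affinities

    # Temporal-distance graph: nearby frames receive larger weights.
    idx = torch.arange(T, dtype=feats.dtype)
    dist = (idx[:, None] - idx[None, :]).abs()      # |i - j|
    disadj = torch.exp(-dist / sigma)               # decays with the temporal gap

    # Row-normalize so each frame's neighbours sum to one.
    adj = adj / adj.sum(dim=-1, keepdim=True).clamp(min=1e-6)
    disadj = disadj / disadj.sum(dim=-1, keepdim=True).clamp(min=1e-6)
    return adj, disadj
```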

3.2.1. Axial Transformer

To enhance the temporal modeling capability between frame-level features, the AxialTransformerWithMLP module is introduced [30], which leverages an axial attention mechanism to model dependencies along the row and column dimensions of the spatial layout independently. Given an input feature tensor $F \in \mathbb{R}^{B \times C \times H \times W}$, where $B$ denotes the batch size, $C$ is the channel dimension, and $H \times W$ represents the temporal-spatial layout, a learnable positional encoding $P \in \mathbb{R}^{1 \times H \times W \times C}$ is first incorporated to enhance the model's positional awareness. By adding it to the original input features, the position-enhanced feature $F_{pe}$ is obtained, as shown in Equation (1):
$F_{pe} = F + P$
Subsequently, the feature map is passed through two consecutive axial attention modules, which perform self-attention operations along the row and column dimensions, respectively. Let $A_{row}$ and $A_{col}$ denote the outputs of the row-wise and column-wise attention modules. The two are fused to obtain the attention-enhanced feature $F_{attn}$, as shown in Equation (2):
$F_{attn} = A_{row} + A_{col}$
To further enhance the model's capability in capturing complex temporal patterns, a Multilayer Perceptron (MLP) is introduced following the attention modules, combined with layer normalization (LayerNorm, LN) and residual connection mechanisms. The final output feature $F_{out}$ is obtained, as shown in Equation (3):
$F_{out} = F_{attn} + \mathrm{MLP}(\mathrm{LN}(F_{attn}))$
This module effectively integrates both local and global temporal dependencies, enhancing the feature representations’ sensitivity to abnormal dynamics. As a result, it facilitates more accurate identification of potential anomalous segments in WSVAD.
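As a reference, the block defined by Equations (1)–(3) can be sketched in PyTorch as follows. The head count, the MLP expansion ratio, and the use of nn.MultiheadAttention for the row- and column-wise attention are assumptions of this sketch rather than details taken from the paper.

```python
import torch
import torch.nn as nn

class AxialTransformerWithMLP(nn.Module):
    """Sketch of the axial attention block of Equations (1)-(3).
    Input shape (B, C, H, W); learnable positions have shape (1, H, W, C)."""

    def __init__(self, dim: int, h: int, w: int, heads: int = 4, mlp_ratio: int = 4):
        super().__init__()
        self.pos = nn.Parameter(torch.zeros(1, h, w, dim))          # P in Eq. (1)
        self.row_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.col_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_ratio * dim),
                                 nn.GELU(),
                                 nn.Linear(mlp_ratio * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:             # x: (B, C, H, W)
        b, c, h, w = x.shape
        x = x.permute(0, 2, 3, 1) + self.pos                        # F_pe = F + P -> (B, H, W, C)

        rows = x.reshape(b * h, w, c)                                # attend along each row
        a_row, _ = self.row_attn(rows, rows, rows)
        a_row = a_row.reshape(b, h, w, c)

        cols = x.permute(0, 2, 1, 3).reshape(b * w, h, c)            # attend along each column
        a_col, _ = self.col_attn(cols, cols, cols)
        a_col = a_col.reshape(b, w, h, c).permute(0, 2, 1, 3)

        f_attn = a_row + a_col                                       # Eq. (2)
        f_out = f_attn + self.mlp(self.norm(f_attn))                 # Eq. (3)
        return f_out.permute(0, 3, 1, 2)                             # back to (B, C, H, W)
```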

3.2.2. LiteGAT

To enhance the modeling of abnormal correlations among frame-level features, LiteGAT [29] is proposed, designed to effectively model local and global relationships within weakly supervised video anomaly detection. Specifically, LiteGAT takes as input a sequence of frame-level features $F_{out} \in \mathbb{R}^{T \times d}$ produced by the preceding stage (e.g., the Axial Transformer). It first applies linear transformations to generate the Query ($Q$), Key ($K$), and Value ($V$) matrices, and an attention map is constructed based on the scaled dot-product mechanism. To improve the model's capacity to attend to potential abnormal regions, a motion-aware mechanism is introduced. The L2 norm of the frame features is calculated to obtain a motion intensity vector $m \in \mathbb{R}^{T}$, which is then mapped via an MLP to a motion bias matrix $B \in \mathbb{R}^{T \times T}$ and added to the attention scores. The final attention calculation is shown in Equation (4):
$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{QK^{T}}{\sqrt{d}} + B\right)V$
The elements in $B$ reflect the motion differences between frames, thereby guiding the attention mechanism to focus more on regions with significant motion variations, which enhances the sensitivity to anomalies. Furthermore, to promote sparsity and discriminative capacity within the attention structure, Gumbel Softmax sampling is introduced during training to perturb the original attention map and construct a sparse adjacency structure. Let $A_{ij}$ denote the original attention scores; the perturbed attention map $\tilde{A}_{ij}$ is then defined as in Equation (5):
$\tilde{A}_{ij} = \frac{\exp\left((\log A_{ij} + g_{ij})/\tau\right)}{\sum_{k}\exp\left((\log A_{ik} + g_{ik})/\tau\right)}$
Here, $g_{ij}$ represents noise sampled from a Gumbel distribution, and the temperature coefficient $\tau$ is adopted to regulate the smoothness of sampling. This mechanism effectively guides the model to focus on a small number of key inter-frame connections, enhancing its ability to discriminate abnormal segments. Furthermore, to balance local modeling capability and global feature stability, a residual fusion mechanism is introduced that constructs a global modulation path $R \in \mathbb{R}^{T \times d}$ based on the mean and standard deviation statistics of the input features. This path is fused with the attention-aggregated intermediate features $F \in \mathbb{R}^{T \times d}$, resulting in the final output features $F_{out}$, as shown in Equation (6):
$F_{out} = \alpha F + (1 - \alpha) R$
Here, $\alpha \in (0, 1)$ is a learnable fusion coefficient that represents the weight balance between local attention and global statistics. By incorporating motion bias enhancement, sparse attention sampling, and global statistical fusion, the proposed LiteGAT module effectively models temporal abnormal dependencies in videos. This design improves the accuracy and robustness of anomaly localization under weak supervision without relying on frame-level labels.
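A minimal sketch of one LiteGAT branch implementing Equations (4)–(6) is given below. The MLP widths, the Gumbel temperature, and the exact construction of the global modulation path from mean/std statistics are illustrative assumptions of this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LiteGAT(nn.Module):
    """Sketch of a LiteGAT branch following Equations (4)-(6); layer sizes are illustrative."""

    def __init__(self, dim: int, tau: float = 0.5):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.motion_mlp = nn.Sequential(nn.Linear(1, dim), nn.ReLU(), nn.Linear(dim, 1))
        self.global_proj = nn.Linear(2 * dim, dim)    # builds the modulation path R
        self.alpha = nn.Parameter(torch.tensor(0.7))  # learnable fusion weight, init. 0.7
        self.tau = tau

    def forward(self, x: torch.Tensor) -> torch.Tensor:            # x: (T, d)
        T, d = x.shape
        q, k, v = self.q(x), self.k(x), self.v(x)

        # Motion bias B: pairwise differences of per-frame L2 norms, mapped by an MLP.
        m = x.norm(dim=-1, keepdim=True)                            # (T, 1)
        diff = (m[:, None, :] - m[None, :, :]).abs()                # (T, T, 1)
        bias = self.motion_mlp(diff).squeeze(-1)                    # (T, T)

        scores = q @ k.t() / d ** 0.5 + bias                        # Eq. (4), pre-softmax
        attn = scores.softmax(dim=-1)

        if self.training:                                           # Eq. (5): Gumbel perturbation
            g = -torch.log(-torch.log(torch.rand_like(attn) + 1e-9) + 1e-9)
            attn = F.softmax((torch.log(attn + 1e-9) + g) / self.tau, dim=-1)

        f_local = attn @ v                                          # attention-aggregated features

        # Global modulation path R from mean / std statistics of the input.
        stats = torch.cat([x.mean(0), x.std(0)], dim=-1)            # (2d,)
        r = self.global_proj(stats).expand(T, d)                    # (T, d)

        return self.alpha * f_local + (1 - self.alpha) * r          # Eq. (6)
```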

3.3. Text Prompt

For weakly supervised video anomaly detection tasks, raw textual labels (e.g., “fighting”, “accident”) are often overly simplistic and insufficient to comprehensively describe complex anomalous events. To overcome this limitation, a Text Prompt mechanism is proposed that integrates learnable prompt and knowledge-enhanced prompt, constructing more expressive textual representations through contextual expansion and external semantic enrichment [31,32].

3.3.1. Learnable Prompt

In this study, to enhance the expressiveness of label texts and improve cross-modal alignment, a Learnable Prompt mechanism is introduced. Inspired by the Context Optimization for Prompting (CoOp) method, the original discrete category labels are augmented with a set of learnable context vectors, enabling more adaptive and expressive text inputs [31].
Specifically, a category label (e.g., "fighting", "accident") is first tokenized into a category token using the CLIP tokenizer. It is then concatenated with $n$ learnable prompt vectors on each side, $c_1, \ldots, c_n$ and $c_{n+1}, \ldots, c_{2n}$, forming a complete prompt sequence that is fed into the text encoder as input $t_{prompt}$, as shown in Equation (7):
$t_{prompt} = [c_1, \ldots, c_n, t_{label}, c_{n+1}, \ldots, c_{2n}]$
Here, $t_{label}$ denotes the original category token. In this manner, the category token is placed at the center of the sequence. The resulting input sequence, along with the learnable prompts, is then fed into the CLIP text encoder, where it undergoes positional encoding to produce the final category embedding $t_{out} \in \mathbb{R}^{d}$.
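The prompt construction of Equation (7) can be sketched as follows; the context length, the initialization scale, and the downstream call to the frozen CLIP text encoder (abstracted away here) are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class LearnablePrompt(nn.Module):
    """Sketch of Eq. (7): 2n trainable context vectors placed around the class-token
    embeddings before they enter the frozen CLIP text encoder."""

    def __init__(self, n_ctx: int = 10, ctx_dim: int = 512):
        super().__init__()
        self.ctx_front = nn.Parameter(torch.randn(n_ctx, ctx_dim) * 0.02)  # c_1 .. c_n
        self.ctx_back = nn.Parameter(torch.randn(n_ctx, ctx_dim) * 0.02)   # c_{n+1} .. c_{2n}

    def forward(self, label_emb: torch.Tensor) -> torch.Tensor:
        # label_emb: (L, ctx_dim) token embeddings of one class name, e.g. "fighting".
        # The class tokens sit in the middle of the sequence, as in Eq. (7).
        return torch.cat([self.ctx_front, label_emb, self.ctx_back], dim=0)

# Usage: the returned sequence, plus positional encoding, is passed through the
# frozen CLIP text encoder to produce the class embedding t_out in R^d.
```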

3.3.2. Knowledge-Enhanced Prompt

To enhance the semantic expressiveness of category labels and improve alignment between textual and visual modalities, a Knowledge-Enhanced Prompt mechanism is introduced based on a knowledge graph [32]. Specifically, ConceptNet is leveraged as an external knowledge source to retrieve semantically related concept words for each category label, which are then incorporated as additional contextual information.
Formally, given a category label $c$, all concept words associated with it are first retrieved from ConceptNet based on a pre-defined set of semantic relations, forming its corresponding concept set $k_c$, as shown in Equation (8):
$k_c = \{k_1, k_2, \ldots, k_i, \ldots, k_m\}$
Each $k_i$ represents a node in the semantic graph that is closely connected to the category $c$. The concepts are first ranked based on the confidence scores of their associated edges, and only those with scores greater than 0.6 are retained. From this filtered set, the top-6 high-quality concepts are selected to form the final knowledge-enhanced prompt. These concept words are then individually encoded using the CLIP text encoder to obtain their embedding representations. Finally, the embeddings are averaged to produce the knowledge-enhanced prompt representation $t_c^{know}$ for category $c$, as shown in Equation (9):
$t_c^{know} = \frac{1}{M}\sum_{i=1}^{M} \mathrm{CLIP}_{text}(k_i)$
This representation captures not only the explicit semantics of the category label but also incorporates implicit semantic relations from the knowledge graph, thereby enhancing the discriminative power of the category embedding. It provides richer prior knowledge to support cross-modal anomaly detection. Finally, a linear weighted fusion of the learnable prompt and the knowledge-enhanced prompt is performed at the feature level to construct a unified textual representation with improved generalization capability.
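The concept filtering and averaging of Equations (8) and (9), together with the final linear fusion of the two prompts, can be sketched as follows. The ConceptNet retrieval itself is treated as a given input, and the fusion weight w is an illustrative assumption of this sketch.

```python
import torch

def knowledge_prompt(concept_scores, encode_text, score_thresh=0.6, top_k=6):
    """Sketch of Eqs. (8)-(9). `concept_scores` is a list of (concept, confidence)
    pairs already retrieved from ConceptNet for one class label; `encode_text`
    maps a string to a CLIP text embedding of shape (d,)."""
    # Keep concepts whose edge confidence exceeds the threshold, then take the top-k.
    kept = sorted([c for c in concept_scores if c[1] > score_thresh],
                  key=lambda c: c[1], reverse=True)[:top_k]
    if not kept:
        raise ValueError("no concept passed the confidence filter")
    # Eq. (9): average the CLIP text embeddings of the retained concepts.
    embs = torch.stack([encode_text(name) for name, _ in kept], dim=0)   # (M, d)
    return embs.mean(dim=0)                                              # t_c^know

def fuse_prompts(t_learn, t_know, w: float = 0.5):
    """Linear weighted fusion of the learnable and knowledge-enhanced prompts;
    the weight w here is an illustrative choice, not the paper's value."""
    return w * t_learn + (1.0 - w) * t_know
```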

3.4. Abnormal Visual-Guided Text Prompt

To enhance the semantic expressiveness of textual labels in representing anomalous events, visual context information is introduced to dynamically refine category embeddings. Compared to static textual descriptions, visual features offer more discriminative contextual cues, particularly in abnormal segments. To this end, an Abnormal Visual-Guided Text Prompt (AVGTP) is proposed [33], which focuses on extracting key visual features from anomalous segments and aggregating them into a video-level prompt signal. This prompt is then fused with the original text prompt to achieve fine-grained optimization of the category embedding.
Specifically, the anomaly confidence scores $A$ obtained from coarse-grained anomaly detection are initially treated as anomaly attention, which is then used to compute the video-level prompt via a dot product with the frame features $F_{out}$, as shown in Equation (10):
$V_{attn} = \mathrm{Norm}(A^{T} F_{out})$
Here, $\mathrm{Norm}$ denotes a normalization operation, and $V_{attn} \in \mathbb{R}^{d}$ represents the abnormal visual-guided prompt. $V_{attn}$ is then added to the category embedding $t_{text}$, and the result is passed through a simple Feed-Forward Network (FFN) [48], followed by a skip connection, to generate the final instance-specific category embedding $T$, as shown in Equation (11):
$T = \mathrm{FFN}(\mathrm{ADD}(V_{attn}, t_{text})) + t_{text}$
Here, $\mathrm{ADD}$ denotes element-wise addition. Through this process, the category embedding is able to incorporate visual contextual information from the anomalous segments of the current video, thereby enhancing its adaptability to abnormal semantics.
Finally, the updated category embedding $T$ is used to compute a similarity-based alignment map $M$ with the frame-level features $F_{out}$, which serves as the foundation for subsequent anomaly category recognition.
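Equations (10) and (11) can be sketched as follows; the FFN hidden width and the use of L2 normalization for Norm are assumptions of this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AVGTP(nn.Module):
    """Sketch of Eqs. (10)-(11): anomaly scores aggregate frame features into a
    video-level visual prompt that refines each class embedding."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, scores: torch.Tensor, frame_feats: torch.Tensor,
                text_emb: torch.Tensor) -> torch.Tensor:
        # scores: (T, 1) frame-level anomaly confidences; frame_feats: (T, d);
        # text_emb: (C, d) fused class embeddings from the Text Prompt module.
        v_attn = scores.t() @ frame_feats                 # (1, d), A^T F_out in Eq. (10)
        v_attn = F.normalize(v_attn, dim=-1)              # Norm(.)
        refined = self.ffn(v_attn + text_emb) + text_emb  # Eq. (11), broadcast over classes
        return refined                                    # (C, d) instance-specific class embeddings

# The alignment map M is then the cosine similarity between every frame feature and
# every refined class embedding, e.g. M = F.normalize(frame_feats) @ F.normalize(refined).t()
```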

3.5. Objective Function

To effectively distinguish abnormal events in videos, a multi-task training objective is designed, which consists of classification loss, semantic alignment loss, and contrastive loss. These components collaboratively optimize the model’s capacity to express abnormal semantics.
First, for video-level supervision, a Top-K strategy is used that selects the top $K$ scoring frames from videos labeled as normal and abnormal, using their scores as the final video-level prediction. The binary cross-entropy loss is employed to quantify the discrepancy between the predicted result and the ground-truth label, yielding $L_{cls}$ as shown in Equation (12):
$L_{cls} = \mathrm{BCE}(\hat{y}, y)$
Here, $\hat{y}$ denotes the predicted video-level score from the model, and $y$ represents the corresponding ground truth label.
Furthermore, to address the challenge of category alignment under weak supervision, a Multiple Instance Learning Alignment (MIL-Align) mechanism is introduced. Specifically, an alignment matrix $M$ is constructed between frame-level visual features and all category embeddings. For each category, the top $k$ similarity scores in the corresponding column of $M$ are selected and their average is computed, yielding the alignment score $s_i$ between the current video and the $i$-th category. The set of scores for all categories is represented as a vector $S = [s_1, s_2, \ldots, s_m]$, where $m$ represents the total number of categories. Each video is expected to have the highest similarity with its corresponding text label. To this end, the softmax function is applied to compute the predicted probability $p_i$ for each category, as shown in Equation (13):
$p_i = \frac{\exp(s_i/\tau)}{\sum_{j}\exp(s_j/\tau)}$
Here, $\tau$ is a temperature scaling factor. Based on this, the cross-entropy loss is employed to quantify the discrepancy between the ground-truth category and the predicted distribution, defining the semantic alignment loss $L_{align}$ as shown in Equation (14):
$L_{align} = -\log p_y$
Here, $p_y$ denotes the predicted probability corresponding to the ground-truth category.
To further enhance the discriminability among category embeddings, a contrastive loss $L_{contrast}$ is introduced. This loss aims to push the normal category embedding away from all abnormal category embeddings in the semantic space. Cosine similarity is adopted as the similarity measure, as shown in Equation (15):
$L_{contrast} = \sum_{j}\max\left(0, \frac{t_n^{T} \cdot t_a^{j}}{\|t_n\|_2 \cdot \|t_a^{j}\|_2}\right)$
Here, $t_n$ denotes the embedding of the normal category, and $t_a^{j}$ denotes the embedding of the $j$-th abnormal category. This loss penalizes excessively high similarity scores to enlarge the semantic distance between normal and anomalous categories.
In summary, our final loss function $L_{total}$ is composed of the three aforementioned components, as shown in Equation (16):
$L_{total} = L_{cls} + L_{align} + \lambda L_{contrast}$
Here, λ is a weighting coefficient used to balance the contribution of the contrastive loss.
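Putting Equations (12)–(16) together, a sketch of the full objective is shown below. The default values of k, τ, and λ, and the convention that index 0 is the normal category, are illustrative assumptions; frame scores are assumed to already lie in [0, 1].

```python
import torch
import torch.nn.functional as F

def total_loss(frame_scores, align_map, text_embs, video_label, gt_class,
               k: int = 16, tau: float = 0.07, lam: float = 0.1):
    """Sketch of Eqs. (12)-(16). frame_scores: (T,) anomaly scores in [0, 1];
    align_map: (T, C) frame-to-class similarities M; text_embs: (C, d) class
    embeddings with index 0 = normal; video_label: scalar 0/1 tensor;
    gt_class: index of the ground-truth category."""

    # Eq. (12): Top-K frame scores -> video-level prediction -> BCE.
    video_pred = frame_scores.topk(min(k, frame_scores.numel())).values.mean()
    l_cls = F.binary_cross_entropy(video_pred.clamp(1e-6, 1 - 1e-6),
                                   video_label.float())

    # Eqs. (13)-(14): MIL-Align. Average the top-k similarities per class, softmax, NLL.
    s = align_map.topk(min(k, align_map.size(0)), dim=0).values.mean(dim=0)   # (C,)
    p = F.softmax(s / tau, dim=0)
    l_align = -torch.log(p[gt_class] + 1e-9)

    # Eq. (15): push the normal embedding away from every abnormal embedding.
    t_n, t_a = text_embs[0], text_embs[1:]
    cos = F.cosine_similarity(t_n.unsqueeze(0), t_a, dim=-1)                  # (C-1,)
    l_contrast = torch.clamp(cos, min=0).sum()

    return l_cls + l_align + lam * l_contrast                                 # Eq. (16)
```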

4. Experiments

4.1. Datasets and Evaluation Metrics

4.1.1. Datasets

The study is conducted on two widely adopted benchmarks for weakly supervised video anomaly detection: UCF-Crime and XD-Violence [3,5]. Example video frames from the two datasets are shown in Figure 3. UCF-Crime is a large-scale real-world surveillance video dataset for WSVAD. It contains a total of 128 h of video, including 1900 surveillance clips that cover 13 real-world anomaly categories that pose major risks to public safety. Specifically, the dataset includes Abuse, Arrest, Arson, Assault, Burglary, Explosion, Fighting, RoadAccidents, Robbery, Shooting, Shoplifting, Stealing, and Vandalism. A notable characteristic of UCF-Crime is that most videos begin with normal segments before transitioning into abnormal events. Among them, 1610 videos are used for training and 290 for testing. XD-Violence is a comprehensive dataset for large-scale violence detection, sourced from diverse media such as films, online content, surveillance videos, and CCTV footage. It comprises 4754 video clips totaling 217 h, encompassing six categories of abnormal events: abuse, car accidents, explosions, fights, riots, and shootings. To prevent violence detection systems from relying on scene background rather than actual events, XD-Violence additionally collects a large number of non-violent videos that share similar backgrounds with violent ones. Out of these, 3954 clips are designated for training, while the remaining 800 are used for testing.
UCF-Crime and XD-Violence provide valuable resources for weakly supervised video anomaly detection, but both datasets also exhibit certain class imbalances and potential data biases. In UCF-Crime, the Burglary and Stealing categories each contain 100 videos, and RoadAccidents and Robbery each contain 150 videos; these four categories correspond to relatively common real-world anomalies, while the remaining nine categories each have only 50 videos. In XD-Violence, the number of video segments per category also varies: Fighting has 2363 segments, Shooting 1845, Abuse 1290, Car Accidents 1262, Explosion 1101, and Riot only 981, reflecting the difference between more common and less common events. Furthermore, XD-Violence is a multi-label dataset, as a single video may contain multiple types of violent behavior, which further increases the complexity of the data distribution.

4.1.2. Evaluation Metrics

Binary WSVAD, i.e., video-level abnormal vs. normal detection, is referred to as coarse-grained anomaly detection. Following previous works, frame-level Area Under the Curve (AUC) and anomalous video-level AUC (AnoAUC) are reported on UCF-Crime, and frame-level Average Precision (AP) on XD-Violence. Fine-grained WSVAD, i.e., identifying the specific abnormal category, is referred to as fine-grained anomaly recognition. The conventional evaluation protocol from video action detection is employed, using mean Average Precision (mAP) calculated at various Intersection over Union (IoU) thresholds. In this work, mAP is computed using IoU thresholds ranging from 0.1 to 0.5 with a step size of 0.1, which is a commonly adopted evaluation strategy in weakly supervised video anomaly detection. The average mAP (AVG) across these thresholds is reported as the main performance metric to ensure robustness and comparability. The range of 0.1 to 0.5 is chosen because event boundaries in anomaly detection often involve considerable uncertainty; excessively high thresholds may be overly strict, while overly low thresholds may fail to reflect localization accuracy, making this interval a balanced choice. Note that mAP is calculated only on the abnormal videos in the test set.
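For reference, the coarse-grained metrics can be computed with scikit-learn as sketched below; the fine-grained mAP at multiple IoU thresholds follows the standard temporal action detection protocol and is not shown here.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def coarse_grained_metrics(frame_scores: np.ndarray, frame_labels: np.ndarray):
    """Frame-level AUC (reported on UCF-Crime) and frame-level AP (reported on
    XD-Violence). Inputs are concatenated over the whole test set; labels are
    0 (normal) / 1 (abnormal) per frame."""
    auc = roc_auc_score(frame_labels, frame_scores)
    ap = average_precision_score(frame_labels, frame_scores)
    return auc, ap
```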

4.2. Implementation Details

For the visual and textual encoders in our network, the pretrained CLIP model (ViT-B/16) is adopted and kept frozen during training. The FFN module refers to the standard feed-forward block from the Transformer architecture. The visual input dimensions are set to 256 in length and 512 in width. The fusion weight α in Equation (6) is set to 0.7, and the temperature factor τ in Equation (13) to 0.07. The context length l is set to 20. The coefficient λ in Equation (16) is set to 1 × 10−1 for UCF-Crime and 1 × 10−4 for XD-Violence. For the knowledge-enhanced prompting, a confidence threshold of 0.7 is applied to filter low-relevance concepts. All experiments are conducted on a single NVIDIA GeForce RTX 2080 Ti GPU, with the random seed fixed at 234 to ensure reproducibility. The model is optimized using the AdamW optimizer with a batch size of 64. Gradient clipping with a maximum norm of 1.0 is applied during training. For UCF-Crime, the weight decay is set to 0.02, the learning rate to 5 × 10−5, and the model is trained for 10 epochs, which takes approximately 7 h. For XD-Violence, the weight decay is set to 0.01, the learning rate to 2 × 10−5, and the model is trained for 10 epochs, which takes approximately 15 h.

4.3. Comparison with State-of-the-Art Methods

The method’s performance is evaluated on the UCF-Crime and XD-Violence datasets against the latest state-of-the-art (SOTA) approaches. Specifically, the WSVAD-CLIP model is evaluated against recent advanced methods in both coarse-grained anomaly detection and fine-grained anomaly recognition tasks. The following sections present detailed results and analysis.
As shown in Table 1 and Table 2, WSVAD-CLIP yields competitive results on both the UCF-Crime and XD-Violence datasets for coarse-grained anomaly detection. Specifically, it achieves an AUC of 87.85% on UCF-Crime, significantly outperforming methods that do not leverage CLIP features (comparison is made with UR-DMU) [39] and showing clear advantages over other CLIP-based approaches (comparisons are performed against both CLIP-TSA and TPWNG) [26,46]. While the improvement over TPWNG is relatively modest at 0.06%, this is still a noteworthy gain given the maturity of that method. Compared to CLIP-TSA and UR-DMU, our method achieves absolute gains of 0.27% and 0.88%, respectively. On the XD-Violence dataset, WSVAD-CLIP also improves AP, outperforming TPWNG by 0.27%. Furthermore, it achieves absolute gains of 1.78% over CLIP-TSA and 2.29% over UR-DMU. These results across two large-scale benchmark datasets highlight the strong generalization ability and effectiveness of the method for WSVAD.
The ShanghaiTech dataset was employed as an auxiliary benchmark to preliminarily assess the cross-scenario generalization capability of the proposed model [7]. This dataset focuses on anomalous pedestrian behaviors in a campus environment, where the test set contains normal videos alongside anomalous videos. Here, anomalies are defined as behaviors that deviate from normal pedestrian activities, such as running, chasing, and excessively fast biking. The task involves detecting “anomalies” without requiring the identification of specific anomalous behavior types. The results of applying the proposed model to this dataset are presented in Table 3, and comparisons with several previous methods demonstrate competitive performance.
For fine-grained anomaly recognition, as illustrated in Table 4 and Table 5, the WSVAD-CLIP model also demonstrates outstanding performance. The approach is compared with previous works such as VadCLIP, AVVD, and Sultani et al. [5,33,40]. The earliest study to propose fine-grained WSVAD, AVVD, was implemented based on CLIP features, while the work by Sultani et al. also involved fine-tuning. As observed, fine-grained WSVAD presents greater challenges compared to coarse-grained WSVAD, as it requires precise multi-class classification and continuous segment-level localization, in contrast to the simpler binary classification setting of coarse-grained detection. In this task, WSVAD-CLIP consistently outperforms prior methods on both the XD-Violence and UCF-Crime datasets. Compared with the work of Sultani et al., substantial gains of 13.66% and 4.46% are achieved on the XD-Violence and UCF-Crime datasets, respectively. On XD-Violence, absolute AVG improvements of 0.61% and 5.1% are obtained over VadCLIP and AVVD, respectively. Similarly, on UCF-Crime, absolute gains of 1.02% over VadCLIP and 1.65% over AVVD are achieved. These results further confirm the effectiveness of the WSVAD-CLIP framework and demonstrate the strong potential of multimodal alignment for improving performance on fine-grained weakly supervised video anomaly detection.

4.4. Ablation Studies

In this section, ablation experiments are performed on the UCF-Crime dataset to verify the contribution of each module in the proposed framework.

4.4.1. Effectiveness of AG Module

As shown in Table 6, an ablation study on the AG module is conducted to verify its effectiveness in temporal modeling. Under the baseline setting without any temporal modeling module, the model achieves only 84.57% AUC and 3.82% AVG, indicating that the absence of temporal modeling hinders the ability to capture key abnormal patterns. Next, a Transformer Encoder (TF-encoder) module employing global self-attention [48] is introduced. Although the AUC slightly improves to 85.26%, the AVG remains relatively low at 5.42%, due to the lack of modeling for local temporal dependencies. To enhance local modeling capability, the proposed LiteGAT module is further introduced, which effectively captures local inter-frame structural relations and improves the AUC and AVG to 85.21% and 5.60%, respectively. However, its limited ability to capture global dependencies still leads to some local misjudgments. The Axial Transformer module is then incorporated to strengthen the modeling of long-range dependencies. This results in a significant performance boost, achieving 86.06% AUC and 9.14% AVG, demonstrating strong global contextual modeling capability. Finally, by integrating the strengths of both local and global modeling, LiteGAT is combined with the Axial Transformer to construct the AG Module. This fusion achieves the best performance, with 87.85% AUC and 7.70% AVG, confirming the complementary and synergistic benefits of jointly modeling temporal dependencies at multiple scales.

4.4.2. Effectiveness of Text Prompt

As shown in Table 7, the impact of different text prompt strategies on anomaly detection performance within the WSVAD-CLIP framework is further evaluated. When using static handcrafted prompts as a baseline, the model performance is limited due to the fixed nature of the prompts and lack of semantic adaptability, achieving only 86.32% AUC and 6.30% AVG. To address this, a Learnable Prompt mechanism is introduced, in which a set of trainable context tokens is appended to the original label semantics. This enables the model to optimize the prompt representations in a task-specific manner, thereby guiding the visual encoding process more effectively. This strategy improves performance to 87.75% AUC and 7.34% AVG, demonstrating its advantage in enhancing semantic expressiveness of prompts. Furthermore, a Knowledge-Enhanced Prompt is incorporated by leveraging external knowledge bases such as ConceptNet to construct semantically rich prompt templates. This enhances the model’s understanding of behavioral semantics and results in 86.50% AUC and 7.27% AVG. Although slightly lower than the learnable prompt strategy, this approach shows better generalization in certain categories. Finally, a hybrid prompt strategy is proposed that combines learnable prompt and knowledge-enhanced prompt, integrating the strengths of both task adaptability and semantic prior knowledge. This fusion leads to complementary and reinforced semantic representations, yielding the best performance of 87.85% AUC and 7.70% AVG. These results clearly demonstrate the critical role of prompt design in weakly supervised video anomaly detection.

4.4.3. Effectiveness of Abnormal Visual-Guided Text Prompt

As shown in Table 8, the results of the ablation study on the Abnormal Visual-Guided Text Prompt module are presented. Specifically, the differences in model performance with and without the inclusion of this module are compared. The experimental results show that without the abnormal visual-guided text prompt, the model achieves an AUC of 85.78% and an AVG of 6.12%. After incorporating this module, the performance significantly improves, reaching 87.85% AUC and 7.70% AVG. These results clearly demonstrate that integrating visual context from abnormal segments effectively enhances the quality of category embeddings, improving the model’s ability to perceive and discriminate anomalous events. This, in turn, leads to a notable improvement in overall anomaly detection performance.

4.5. Qualitative Results

Figure 4 illustrates the anomaly detection results on a test video. When an abnormal event occurs, the predicted anomaly scores increase sharply and drop quickly after the event ends, while the scores remain low during normal segments. This behavior demonstrates the model’s strong sensitivity to anomalies and its ability to distinguish normal frames effectively. The figure presents the frame-level anomaly score curve, with selected key frames illustrating the consistency between visual content and the predicted scores. The blue-shaded regions indicate the ground-truth abnormal intervals, while the green-shaded regions highlight false positives (model predicts anomaly but the frame is normal) and false negatives (model fails to detect an actual anomaly). These mispredictions typically occur due to subtle visual changes or ambiguous actions that resemble abnormal events, making it challenging for the model to fully discriminate between normal and anomalous behaviors. Overall, the results demonstrate the effectiveness of the proposed method in accurately and promptly detecting anomalous events.

4.6. Discussion

The proposed WSVAD-CLIP framework demonstrates notable performance in weakly supervised video anomaly detection, with 171.59 million parameters, an average inference time of 0.0442 s per frame, and a speed of 22.61 frames per second (FPS).
The AG module, which combines Axial Transformer for global temporal modeling and LiteGAT for local structural awareness, effectively captures abnormal dynamics, with a memory usage of 734.80 Megabytes (MB), a total of 16.55 million parameters, and 4.23 Giga Floating Point Operations (GFLOPs). The average inference time is 0.0137 s per frame, corresponding to 73.21 FPS. Qualitative results show that the anomaly score curves respond sensitively to abnormal events.
The fusion ratio between learnable prompt and knowledge-enhanced prompt has a measurable impact on the results, where an appropriate balance improves the discriminability of class embeddings and visual-semantic alignment; the Text Prompt module, which encodes both types of prompts, requires 766.00 MB of memory, contains 25.20 million parameters, and entails 54.26 GFLOPs per forward pass, highlighting its computational intensity in integrating semantic information.
The AVGTP mechanism further enhances temporal and visual-textual alignment, with a memory footprint of 728.80 MB, 2.10 million parameters, and 0.03 GFLOPs, contributing to improved feature interaction with minimal additional computational cost.
However, some limitations remain. The AVGTP mechanism may overfit because it learns anomaly-specific textual prompts, which can memorize training data patterns, especially when datasets are small or imbalanced. The knowledge-enhanced prompt carries a risk of context mismatch, as some concepts retrieved from ConceptNet may not accurately reflect the actual visual content, potentially introducing irrelevant information. Moreover, semantically similar categories (e.g., "Fighting" and "Riot") and short-duration or highly intertwined abnormal events remain challenging under weak supervision.
Future work may address the overfitting risk of AVGTP, for example, by exploring more robust prompt learning strategies, regularization techniques, or prompt diversification. Efforts may also reduce context mismatch in the knowledge-enhanced prompt, for example, by filtering concepts based on video context, and improve fine-grained anomaly detection, particularly for semantically similar or short/intertwined events. In addition, exploring more efficient temporal modeling and incorporating multimodal information (e.g., audio) could further enhance accuracy and robustness.

5. Conclusions

In this study, WSVAD-CLIP is introduced, a novel approach designed for weakly supervised video anomaly detection. To bridge the gap between pre-trained visual features and semantic representations across modalities, an AG Module is designed to enhance the model’s ability to capture long-range temporal dependencies and local motion variations. The module combines an Axial Transformer for global temporal modeling with LiteGAT for localized feature aggregation. This design enhances the model’s sensitivity to abnormal dynamics, although its performance partially depends on the generalization ability of the pre-trained CLIP, which may limit detection in rare or unusual scenarios.
For semantic modeling, a Text Prompt mechanism is integrated that combines a learnable prompt, which introduces trainable context tokens to class labels, with a knowledge-enhanced prompt that leverages external knowledge bases, thereby enriching the semantic representation of class labels and improving visual-semantic alignment. In addition, the proposed AVGTP mechanism further optimizes the textual feature representations by incorporating visual context. Despite these improvements, the model may still have difficulty distinguishing semantically similar anomaly categories, which could slightly affect fine-grained detection. Nonetheless, these innovations collectively improve the model's perception of dynamic abnormal behaviors and its generalization across diverse scenarios.
Experimental findings confirm that WSVAD-CLIP delivers notable enhancements in performance on both the UCF-Crime and XD-Violence benchmarks. Ablation studies further validate the effectiveness of each module. This framework offers an innovative solution for weakly supervised video anomaly detection. Future work can explore finer-grained modeling of anomaly types, improved cross-modal alignment strategies, and the incorporation of additional modalities such as audio and motion trajectories to enhance the model’s understanding and discrimination of complex anomalous scenarios, thereby further improving detection accuracy and robustness.

Author Contributions

Conceptualization, M.L. and J.S.; methodology, J.S.; software, J.S.; validation, M.L. and Y.L.; formal analysis, J.S.; investigation, L.D.; resources, M.L. and Y.L.; data curation, L.D.; writing—original draft preparation, J.S.; writing—review and editing, M.L. and J.S.; visualization, L.D.; supervision, M.L. and Y.L.; project administration, M.L.; funding acquisition, M.L. and Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Natural Science Foundation of China (Grant Nos. 61971007 and 61571013), and the North China University of Technology Research Start-up Fund Project (No. 11005136025XN076-043).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Wu, P.; Liu, J.; He, X.; Peng, Y.; Wang, P.; Zhang, Y. Towards video anomaly retrieval from video anomaly detection: New benchmarks and model. arXiv 2023, arXiv:2307.12545. [Google Scholar] [CrossRef] [PubMed]
  2. Sabokrou, M.; Fathy, M.; Hoseini, M.; Klette, R. Real-time anomaly detection and localization in crowded scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Boston, MA, USA, 7–12 June 2015; pp. 56–62. [Google Scholar] [CrossRef]
  3. Wu, P.; Liu, J.; Shi, Y.; Sun, Y.; Shao, F.; Wu, Z.; Yang, Z. Not only look, but also listen: Learning multimodal violence detection under weak supervision. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 322–339. [Google Scholar] [CrossRef]
  4. Cong, Y.; Yuan, J.; Liu, J. Sparse reconstruction cost for abnormal event detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Colorado Springs, CO, USA, 20–25 June 2011; pp. 3449–3456. [Google Scholar] [CrossRef]
  5. Sultani, W.; Chen, C.; Shah, M. Real-world anomaly detection in surveillance videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6479–6488. [Google Scholar] [CrossRef]
  6. Liu, K.; Ma, H. Exploring background-bias for anomaly detection in surveillance videos. In Proceedings of the 27th ACM International Conference on Multimedia, Nice, France, 21–25 October 2019; pp. 1490–1499. [Google Scholar] [CrossRef]
  7. Liu, W.; Luo, W.; Lian, D.; Gao, S. Future frame prediction for anomaly detection—A new baseline. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6536–6545. [Google Scholar] [CrossRef]
  8. Lv, H.; Chen, C.; Cui, Z.; Xu, C.; Li, Y.; Yang, J. Learning normal dynamics in videos with meta prototype network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 15420–15429. [Google Scholar] [CrossRef]
  9. Park, H.; Noh, J.; Ham, B. Learning memory-guided normality for anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 14360–14369. [Google Scholar] [CrossRef]
  10. Bai, S.; He, Z.; Lei, Y.; Wu, W.; Zhu, C.; Sun, M.; Yan, J. Traffic anomaly detection via perspective map based on spatial-temporal information matrix. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 16–20 June 2019; pp. 117–124. [Google Scholar]
  11. Wang, G.; Yuan, X.; Zheng, A.; Hsu, H.M.; Hwang, J.N. Anomaly candidate identification and starting time estimation of vehicles from traffic videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 16–20 June 2019; pp. 382–390. [Google Scholar]
  12. Zaheer, M.Z.; Mahmood, A.; Khan, M.H.; Segu, M.; Yu, F.; Lee, S.I. Generative cooperative learning for unsupervised video anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 14724–14734. [Google Scholar] [CrossRef]
  13. Li, G.; Cai, G.; Zeng, X.; Zhao, R. Scale-aware spatio-temporal relation learning for video anomaly detection. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 333–350. [Google Scholar] [CrossRef]
  14. Chen, Y.; Liu, Z.; Zhang, B.; Fok, W.; Qi, X.; Wu, Y.C. MGFN: Magnitude-contrastive glance-and-focus network for weakly-supervised video anomaly detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 7–15 February 2023; Volume 37, pp. 387–395. [Google Scholar] [CrossRef]
  15. Zhang, J.; Qing, L.; Miao, J. Temporal convolutional network with complementary inner bag loss for weakly supervised anomaly detection. In Proceedings of the IEEE International Conference on Image Processing, Taipei, Taiwan, 22–25 September 2019; pp. 4030–4034. [Google Scholar] [CrossRef]
  16. Tian, Y.; Pang, G.; Chen, Y.; Singh, R.; Verjans, J.W.; Carneiro, G. Weakly-supervised video anomaly detection with robust temporal feature magnitude learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 4955–4966. [Google Scholar] [CrossRef]
  17. Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; Paluri, M. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 4489–4497. [Google Scholar] [CrossRef]
  18. Carreira, J.; Zisserman, A. Quo vadis, action recognition? A new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4724–4733. [Google Scholar] [CrossRef]
  19. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar] [CrossRef]
  20. Li, S.; Liu, F.; Jiao, L. Self-training multi-sequence learning with transformer for weakly supervised video anomaly detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 22 February–1 March 2022; Volume 36, pp. 1395–1403. [Google Scholar] [CrossRef]
  21. Kim, W.; Son, B.; Kim, I. Vilt: Vision-and-language transformer without convolution or region supervision. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 5583–5594. [Google Scholar]
  22. Jia, C.; Yang, Y.; Xia, Y.; Chen, Y.T.; Parekh, Z.; Pham, H.; Le, Q.; Sung, Y.H.; Li, Z.; Duerig, T. Scaling up visual and vision-language representation learning with noisy text supervision. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 4904–4916. [Google Scholar]
  23. Wang, M.; Xing, J.; Liu, Y. Actionclip: A new paradigm for video action recognition. arXiv 2021, arXiv:2109.08472. [Google Scholar] [CrossRef]
  24. Chen, F.L.; Zhang, D.Z.; Han, M.L.; Chen, X.Y.; Shi, J.; Xu, S.; Xu, B. VLP: A survey on vision-language pre-training. Mach. Intell. Res. 2023, 20, 38–56. [Google Scholar] [CrossRef]
  25. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
  26. Joo, H.K.; Vo, K.; Yamazaki, K.; Le, N. CLIP-TSA: Clip-assisted temporal self-attention for weakly-supervised video anomaly detection. In Proceedings of the IEEE International Conference on Image Processing, Kuala Lumpur, Malaysia, 8–11 October 2023; pp. 3230–3234. [Google Scholar] [CrossRef]
  27. Lv, H.; Yue, Z.; Sun, Q.; Luo, B.; Cui, Z.; Zhang, H. Unbiased multiple instance learning for weakly supervised video anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 8022–8031. [Google Scholar] [CrossRef]
  28. Zanella, L.; Liberatori, B.; Menapace, W.; Poiesi, F.; Wang, Y.; Ricci, E. Delving into CLIP latent space for video anomaly recognition. Comput. Vis. Image Underst. 2024, 249, 104163. [Google Scholar] [CrossRef]
  29. Velickovic, P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio, P.; Bengio, Y. Graph attention networks. arXiv 2017, arXiv:1710.10903. [Google Scholar] [CrossRef]
  30. Ho, J.; Kalchbrenner, N.; Weissenborn, D.; Salimans, T. Axial attention in multidimensional transformers. arXiv 2019, arXiv:1912.12180. [Google Scholar] [CrossRef]
  31. Zhou, K.; Yang, J.; Loy, C.C.; Liu, Z. Learning to prompt for vision-language models. Int. J. Comput. Vis. 2022, 130, 2337–2348. [Google Scholar] [CrossRef]
  32. Pu, Y.; Wu, X.; Yang, L.; Wang, S. Learning prompt-enhanced context features for weakly-supervised video anomaly detection. IEEE Trans. Image Process. 2024, 33, 4923–4936. [Google Scholar] [CrossRef] [PubMed]
  33. Wu, P.; Zhou, X.; Pang, G.; Zhou, L.; Yan, Q.; Wang, P.; Zhang, Y. VadCLIP: Adapting vision-language models for weakly supervised video anomaly detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 6074–6082. [Google Scholar] [CrossRef]
  34. Xu, K.; Sun, T.; Jiang, X. Video anomaly detection and localization based on an adaptive intra-frame classification network. IEEE Trans. Multimed. 2020, 22, 394–406. [Google Scholar] [CrossRef]
  35. Cho, M.; Kim, M.; Hwang, S.; Park, C.; Lee, K.; Lee, S. Look around for anomalies: Weakly-supervised anomaly detection via context-motion relational learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 12137–12146. [Google Scholar] [CrossRef]
  36. Liu, T.; Lam, K.M.; Kong, J. Distilling privileged knowledge for anomalous event detection from weakly labeled videos. IEEE Trans. Neural Netw. Learn. Syst. 2024, 35, 12627–12641. [Google Scholar] [CrossRef] [PubMed]
  37. Lv, H.; Zhou, C.; Cui, Z.; Xu, C.; Li, Y.; Yang, J. Localizing anomalies from weakly-labeled videos. IEEE Trans. Image Process. 2021, 30, 4505–4515. [Google Scholar] [CrossRef] [PubMed]
  38. Zhong, J.X.; Li, N.; Kong, W.; Liu, S.; Li, T.H.; Li, G. Graph convolutional label noise cleaner: Train a plug-and-play action classifier for anomaly detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 1237–1246. [Google Scholar]
  39. Zhou, H.; Yu, J.; Yang, W. Dual memory units with uncertainty regulation for weakly supervised video anomaly detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 3769–3777. [Google Scholar] [CrossRef]
  40. Wu, P.; Liu, X.; Liu, J. Weakly supervised audio-visual violence detection. IEEE Trans. Multimed. 2023, 25, 1674–1685. [Google Scholar] [CrossRef]
41. Zanella, L.; Menapace, W.; Mancini, M.; Wang, Y.; Ricci, E. Harnessing large language models for training-free video anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 18527–18536. [Google Scholar] [CrossRef]
  42. Dev, P.P.; Hazari, R.; Das, P. MCANet: Multimodal caption aware training-free video anomaly detection via large language model. In Proceedings of the International Conference on Pattern Recognition, Kolkata, India, 1–5 December 2024; pp. 362–379. [Google Scholar] [CrossRef]
  43. Zhou, X.; Girdhar, R.; Joulin, A.; Krähenbühl, P.; Misra, I. Detecting twenty-thousand classes using image-level supervision. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 350–368. [Google Scholar] [CrossRef]
  44. Lin, Y.; Chen, M.; Wang, W.; Wu, B.; Li, K.; Lin, B.; Liu, H.; He, X. CLIP is also an efficient segmenter: A text-driven approach for weakly supervised semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 15305–15314. [Google Scholar] [CrossRef]
  45. Xu, H.; Ghosh, G.; Huang, P.Y.; Okhonko, D.; Aghajanyan, A.; Metze, F.; Zettlemoyer, L.; Feichtenhofer, C. VideoCLIP: Contrastive pre-training for zero-shot video-text understanding. arXiv 2021, arXiv:2109.14084. [Google Scholar] [CrossRef]
  46. Yang, Z.; Liu, J.; Wu, P. Text prompt with normality guidance for weakly supervised video anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 18899–18908. [Google Scholar] [CrossRef]
  47. Wu, P.; Zhou, X.; Pang, G.; Sun, Y.; Liu, J.; Wang, P.; Zhang, Y. Open-vocabulary video anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 18297–18307. [Google Scholar] [CrossRef]
  48. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 6000–6010. [Google Scholar]
49. Hasan, M.; Choi, J.; Neumann, J.; Roy-Chowdhury, A.K.; Davis, L.S. Learning temporal regularity in video sequences. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 733–742. [Google Scholar] [CrossRef]
50. Wang, J.; Cherian, A. GODS: Generalized one-class discriminative subspaces for anomaly detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8200–8210. [Google Scholar] [CrossRef]
Figure 1. The overall architecture of WSVAD-CLIP. The framework comprises the CLIP image encoder for extracting visual features, the Text Prompt module for generating the learnable and knowledge-enhanced prompts, the text encoder for producing semantic embeddings, and the AG module for temporal modeling. The AVGTP mechanism fuses abnormal visual information with textual features to form hybrid representations, enabling fine-grained anomaly recognition.
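To make the data flow in Figure 1 easier to follow, the PyTorch-style sketch below outlines one plausible forward pass. All module and argument names (image_encoder, prompt_builder, text_encoder, ag_module, avgtp) are illustrative placeholders chosen by us; this is a reading aid under stated assumptions, not the authors' released implementation.

```python
import torch

def wsvad_clip_forward(frames, image_encoder, text_encoder, prompt_builder,
                       ag_module, avgtp):
    """Sketch of the Figure 1 pipeline; every callable is a placeholder."""
    # 1. Frame-level visual features from the (frozen) CLIP image encoder: (T, D).
    with torch.no_grad():
        visual = image_encoder(frames)

    # 2. AG module: Axial Transformer + LiteGAT temporal modeling -> (T, D).
    visual = ag_module(visual)

    # 3. Text Prompt module: fuse the learnable prompt with the knowledge-enhanced
    #    prompt, then encode the category descriptions -> (C, D).
    class_emb = text_encoder(prompt_builder())

    # 4. AVGTP: abnormal visual context adaptively refines the textual
    #    embeddings into hybrid representations -> (C, D).
    class_emb = avgtp(class_emb, visual)

    # 5. Frame-category alignment scores used for fine-grained recognition: (T, C).
    return visual @ class_emb.t()
```

Pooling the per-frame rows of this similarity matrix (e.g., by top-k averaging) would give a video-level score for coarse-grained detection; the pooling choice here is an assumption on our part.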
Figure 2. The architecture of the AG Module. Video features are first processed by the Axial Transformer to capture global temporal dependencies, then modeled by the dual-branch LiteGAT using an adjacency graph (adj) and a distance graph (disadj), and finally fused to obtain enhanced frame-level representations.
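As a complement to Figure 2, the sketch below shows one way the dual-branch graph construction and fusion could look. The exact graph definitions are our assumptions (adj built from pairwise feature similarity, disadj from temporal frame distance, with additive fusion); LiteGAT itself is treated as an opaque graph-attention layer.

```python
import torch
import torch.nn.functional as F

def build_graphs(features, sigma=1.0):
    """Assumed construction of the two graphs used by the dual-branch LiteGAT:
    adj from pairwise feature similarity, disadj from temporal frame distance."""
    T = features.size(0)
    feats = F.normalize(features, dim=-1)
    adj = torch.softmax(feats @ feats.t(), dim=-1)       # similarity (adjacency) graph
    idx = torch.arange(T, dtype=features.dtype)
    dist = (idx[None, :] - idx[:, None]).abs()
    disadj = torch.softmax(-dist / sigma, dim=-1)        # distance (proximity) graph
    return adj, disadj

def ag_forward(features, axial_transformer, litegat_sim, litegat_dist):
    """Figure 2 flow: global axial attention, dual-branch graph attention, fusion."""
    x = axial_transformer(features)                      # global temporal dependencies
    adj, disadj = build_graphs(x)
    local_sim = litegat_sim(x, adj)                      # branch 1: adjacency graph
    local_dist = litegat_dist(x, disadj)                 # branch 2: distance graph
    return x + local_sim + local_dist                    # assumed additive fusion
```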
Figure 3. Example video frames from the UCF-Crime and XD-Violence datasets, including both abnormal and normal events.
Figure 4. Qualitative results on the UCF-Crime and XD-Violence datasets. The upper row shows visualizations for anomalous and normal frames from UCF-Crime, while the lower row presents the corresponding visualizations for XD-Violence. Blue-shaded regions indicate ground-truth abnormal intervals, and green-shaded regions highlight false positives and false negatives.
Table 1. Coarse-grained anomaly detection comparisons on UCF-Crime.
Supervision | Method | Feature | AUC (%)
Semi-supervised | SVM Baseline | – | 50.10
Semi-supervised | Hasan et al. [49] | – | 51.20
Semi-supervised | BODS [50] | I3D | 68.26
Semi-supervised | GODS [50] | I3D | 70.46
Unsupervised | Zaheer et al. [12] | ResNext | 71.04
Fully supervised | Liu & Ma [6] | NLN | 82.00
Weakly supervised | Sultani et al. [5] | C3D | 75.41
Weakly supervised | GCN [38] | TSN | 82.12
Weakly supervised | RTFM [16] | I3D | 84.30
Weakly supervised | MSL [20] | VideoSwin | 85.62
Weakly supervised | UR-DMU [39] | I3D | 86.97
Weakly supervised | MGFN [14] | I3D | 86.98
Weakly supervised | AVVD [40] | CLIP | 82.45
Weakly supervised | AnomalyCLIP [28] | CLIP | 86.36
Weakly supervised | OPVAD [47] | CLIP | 86.40
Weakly supervised | CLIP-TSA [26] | CLIP | 87.58
Weakly supervised | TPWNG [46] | CLIP | 87.79
Weakly supervised | WSVAD-CLIP | CLIP | 87.85
Table 2. Coarse-grained anomaly detection comparisons on XD-Violence.
Supervision | Method | Feature | AP (%)
Semi-supervised | SVM Baseline | – | 50.80
Semi-supervised | Hasan et al. [49] | – | 31.25
Weakly supervised | Sultani et al. [5] | C3D | 73.20
Weakly supervised | Wu et al. [3] | I3D | 73.20
Weakly supervised | RTFM [16] | I3D | 77.81
Weakly supervised | MSL [20] | VideoSwin | 78.58
Weakly supervised | UR-DMU [39] | I3D | 81.66
Weakly supervised | MGFN [14] | VideoSwin | 80.11
Weakly supervised | AVVD [40] | CLIP | 78.10
Weakly supervised | AnomalyCLIP [28] | CLIP | 78.51
Weakly supervised | OPVAD [47] | CLIP | 66.53
Weakly supervised | CLIP-TSA [26] | CLIP | 82.19
Weakly supervised | TPWNG [46] | CLIP | 83.68
Weakly supervised | WSVAD-CLIP | CLIP | 83.95
Table 3. Coarse-grained anomaly detection comparisons on ShanghaiTech.
Supervision | Method | Feature | AUC (%)
Semi-supervised | Zaheer et al. [12] | ResNext | 79.62
Unsupervised | Zaheer et al. [12] | ResNext | 78.93
Weakly supervised | GCN [38] | TSN | 84.44
Weakly supervised | Sultani et al. [5] | CLIP | 91.72
Weakly supervised | Wu et al. [3] | CLIP | 95.24
Weakly supervised | SSRL [13] | CLIP | 96.22
Weakly supervised | RTFM [16] | CLIP | 96.76
Weakly supervised | OPVAD [47] | CLIP | 96.98
Weakly supervised | MSL [20] | VideoSwin | 97.20
Weakly supervised | WSVAD-CLIP | CLIP | 97.31
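For context on Tables 1–3, the coarse-grained results are frame-level metrics: area under the ROC curve (AUC) on UCF-Crime and ShanghaiTech, and average precision (AP) on XD-Violence. A minimal sketch of how such numbers are typically computed from per-frame anomaly scores is shown below using scikit-learn; the evaluation scripts of the individual compared methods may differ in detail.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def coarse_grained_metrics(per_video_scores, per_video_labels):
    """per_video_scores: list of per-frame anomaly-score arrays (one per test video);
    per_video_labels: matching binary arrays (1 = abnormal frame)."""
    scores = np.concatenate(per_video_scores)
    labels = np.concatenate(per_video_labels)
    auc = 100.0 * roc_auc_score(labels, scores)           # AUC (%), as in Tables 1 and 3
    ap = 100.0 * average_precision_score(labels, scores)  # AP (%), as in Table 2
    return auc, ap
```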
Table 4. Fine-grained anomaly recognition comparisons on XD-Violence.
Method | mAP@IoU AVG (%)
Random Baseline | 0.71
Sultani et al. [5] | 11.65
AVVD [40] | 20.21
VadCLIP [33] | 24.70
WSVAD-CLIP | 25.31
Table 5. Fine-grained anomaly recognition comparisons on UCF-Crime.
Method | mAP@IoU AVG (%)
Random Baseline | 0.08
Sultani et al. [5] | 3.24
AVVD [40] | 6.05
VadCLIP [33] | 6.68
WSVAD-CLIP | 7.70
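The AVG column in Tables 4 and 5 averages mAP over a set of temporal IoU thresholds. The sketch below computes segment-level AP for one anomaly class at a single IoU threshold, greedily matching score-ranked predicted segments to ground-truth segments; it illustrates the general protocol under our own assumptions and is not the authors' evaluation code.

```python
import numpy as np

def temporal_iou(a, b):
    """IoU of two temporal segments given as (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def ap_at_iou(predictions, ground_truths, iou_thr=0.5):
    """predictions: list of (score, (start, end)); ground_truths: list of (start, end)."""
    predictions = sorted(predictions, key=lambda p: p[0], reverse=True)
    matched = [False] * len(ground_truths)
    tp, fp = np.zeros(len(predictions)), np.zeros(len(predictions))
    for i, (_, seg) in enumerate(predictions):
        ious = [0.0 if matched[j] else temporal_iou(seg, gt)
                for j, gt in enumerate(ground_truths)]
        best = int(np.argmax(ious)) if ious else -1
        if best >= 0 and ious[best] >= iou_thr:
            matched[best] = True   # each ground truth can be matched at most once
            tp[i] = 1
        else:
            fp[i] = 1
    tp_cum, fp_cum = np.cumsum(tp), np.cumsum(fp)
    recall = tp_cum / max(len(ground_truths), 1)
    precision = tp_cum / np.maximum(tp_cum + fp_cum, 1e-8)
    # 101-point interpolated average precision over the recall axis.
    return float(np.mean([precision[recall >= r].max() if (recall >= r).any() else 0.0
                          for r in np.linspace(0, 1, 101)]))
```

Averaging ap_at_iou over anomaly classes gives mAP at one threshold; averaging again over a threshold range (commonly 0.1 to 0.5 in steps of 0.1 for this task) yields an AVG figure comparable in spirit to the tables above.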
Table 6. Effectiveness of AG Module.
Method | AUC (%) | AVG (%)
Baseline (without temporal modeling) | 84.57 | 3.82
TF-encoder | 85.26 | 5.42
Only GAT | 85.21 | 5.60
Axial Transformer | 86.06 | 9.14
AG Module | 87.85 | 7.70
Table 7. Effectiveness of Text Prompt.
Method | AUC (%) | AVG (%)
Hand-crafted Prompt | 86.32 | 6.30
Learnable Prompt | 87.75 | 7.34
Knowledge-Enhanced Prompt | 86.50 | 7.27
Learnable + Knowledge-Enhanced (Text Prompt) | 87.85 | 7.70
Table 8. Effectiveness of AVGTP.
Method | AUC (%) | AVG (%)
w/o AVGTP | 85.78 | 6.12
w/ AVGTP | 87.85 | 7.70
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
