Article

MT-CMVAD: A Multi-Modal Transformer Framework for Cross-Modal Video Anomaly Detection

School of Computer Science and Technology (School of Artificial Intelligence), Zhejiang Sci-Tech University, Hangzhou 310018, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(12), 6773; https://doi.org/10.3390/app15126773
Submission received: 11 May 2025 / Revised: 9 June 2025 / Accepted: 10 June 2025 / Published: 16 June 2025

Abstract

Video anomaly detection (VAD) faces significant challenges in multimodal semantic alignment and long-term temporal modeling within open surveillance scenarios. Existing methods are often plagued by modality discrepancies and fragmented temporal reasoning. To address these issues, we introduce MT-CMVAD, a hierarchically structured Transformer architecture that makes two key technical contributions: (1) A Context-Aware Dynamic Fusion Module that leverages cross-modal attention with learnable gating coefficients to effectively bridge the gap between RGB and optical flow modalities through adaptive feature recalibration, significantly enhancing fusion performance; (2) A Multi-Scale Spatiotemporal Transformer that establishes global-temporal dependencies via dilated attention mechanisms while preserving local spatial semantics through pyramidal feature aggregation. To address the sparse anomaly supervision dilemma, we propose a hybrid learning objective that integrates dual-stream reconstruction loss with prototype-based contrastive discrimination, enabling the joint optimization of pattern restoration and discriminative representation learning. Our extensive experiments on the UCF-Crime, UBI-Fights, and UBnormal datasets demonstrate state-of-the-art performance, achieving AUC scores of 98.9%, 94.7%, and 82.9%, respectively. The explicit spatiotemporal encoding scheme further improves temporal alignment accuracy by 2.4%, contributing to enhanced anomaly localization and overall detection accuracy. Additionally, the proposed framework achieves a 14.3% reduction in FLOPs and demonstrates 18.7% faster convergence during training, highlighting its practical value for real-world deployment. Our optimized window-shift attention mechanism also reduces computational complexity, making MT-CMVAD a robust and efficient solution for safety-critical video understanding tasks.

1. Introduction

In recent years, video anomaly detection (VAD) has gained increasing attention in the field of computer vision due to its crucial role in security surveillance, intelligent monitoring, and autonomous system safety [1,2]. The ability to automatically detect rare or unexpected events in video streams is essential for a wide range of applications, such as public security, industrial anomaly monitoring, and traffic surveillance [3]. Unlike traditional video classification tasks, where each sample is associated with a predefined category, anomaly detection presents unique challenges due to the inherently unpredictable nature of anomalous events and the lack of sufficient training samples. Furthermore, anomalies can exhibit significant variation, ranging from subtle irregular motions to drastic environmental changes, making it challenging to develop robust detection models that generalize well across different scenarios.
Traditional approaches to VAD rely on handcrafted features, such as optical flow, histograms of oriented gradients (HOG), and local binary patterns (LBP), to capture the spatial and temporal characteristics of videos [4]. While these methods offer some interpretability, they often struggle in complex and dynamic environments due to their limited ability to model high-dimensional representations. The rise of deep learning has significantly advanced anomaly detection, with convolutional neural networks (CNNs) and recurrent neural networks (RNNs) becoming dominant techniques for feature extraction and sequential modeling [5]. More recently, unsupervised deep learning methods, such as autoencoders and generative adversarial networks (GANs), have been explored to learn normal event distributions and detect anomalies as deviations from these learned patterns [6]. However, despite their effectiveness, these deep learning-based approaches face challenges in capturing long-range dependencies and modeling global contextual relationships, which are crucial for understanding complex anomalies.
Recently, transformer-based architectures have demonstrated remarkable performance in various vision-related tasks, including image recognition, object detection, and action recognition [6,7]. Originally introduced for natural language processing (NLP) [8], transformers have been successfully adapted to visual tasks due to their ability to model long-range dependencies and capture global contextual information. This capability has made transformers particularly promising for video analysis tasks, leading to their adoption in video anomaly detection. Several recent studies have explored transformer-based VAD methods, leveraging self-attention mechanisms to enhance the representation of normal and abnormal events [7]. However, most of these approaches focus primarily on a single modality, such as RGB frames or optical flow, which limits their ability to fully understand anomalous events.
In real-world scenarios, anomalies often exhibit complex multi-modal characteristics, where visual and motion cues provide complementary information. For instance, certain anomalies may appear visually subtle while exhibiting distinct motion patterns, whereas others may involve clear visual deviations with minimal motion changes. Thus, effectively integrating multi-modal information is crucial for building robust and generalizable anomaly detection models.
Despite the advancements in multi-modal VAD, existing methods still face two major challenges:
Challenge I: Ineffective Cross-Modal Fusion. Many existing multi-modal anomaly detection methods rely on simple feature concatenation or weighted summation to integrate different modalities [9]. While these approaches offer a straightforward way to combine information, they often fail to capture the intricate dependencies between different modalities. In real-world surveillance, the importance of visual and motion cues varies across different scenarios and anomaly types. For instance, in some cases, motion features might be more relevant (e.g., suspicious movements in a crowd), whereas in other cases, visual features might be more indicative (e.g., an abandoned object in a public space). Static fusion strategies that assign fixed importance to different modalities can lead to suboptimal detection performance, particularly in diverse and dynamic environments.
Challenge II: Limited Long-Range Dependency Modeling. Most existing VAD models, including those based on CNNs and RNNs, struggle to effectively capture long-term dependencies in video sequences [10]. While transformers offer an advantage in this regard, existing transformer-based VAD methods primarily focus on spatial attention within individual frames or short sequences, neglecting long-term temporal relationships across extended video durations [11,12]. This limitation is particularly problematic for anomalies that evolve gradually over time, such as progressive equipment malfunctions or gradual behavioral shifts in a monitored environment. Without the ability to model long-range dependencies, these methods may fail to recognize contextually significant patterns that distinguish normal from anomalous behavior.
Our Contribution: Multi-modal Transformer for Cross-Modal Video Anomaly Detection (MT-CMVAD). To address these challenges, we propose a novel Multi-modal Transformer for Cross-Modal Video Anomaly Detection (MT-CMVAD). As illustrated in Figure 1, our framework processes input videos through a multi-modal pipeline, dynamically evaluates cross-modal features, and outputs anomaly detection scores. Unlike existing methods, MT-CMVAD introduces a dynamic cross-modal attention mechanism, which adaptively integrates visual and motion features based on their relative importance in different scenarios. By leveraging the transformer’s capability to capture long-range dependencies, our method enhances cross-modal interaction and enables more effective anomaly detection.
Our framework consists of the following key components:
Transformer-based Multi-modal Architecture: We design a transformer-based model that effectively integrates visual and motion cues, enabling the model to capture rich contextual information across different modalities.
Cross-modal Attention Mechanism: We introduce a novel cross-modal attention module, which dynamically weighs the importance of visual and motion features to ensure that the most relevant modality is prioritized for each specific anomaly.
Long-Range Dependency Modeling: Our framework extends the temporal modeling capability of transformers, allowing the network to effectively detect gradually evolving anomalies by capturing long-term dependencies in video sequences.
Extensive Experimental Evaluation: We conduct comprehensive experiments on multiple benchmark datasets, demonstrating that our approach outperforms state-of-the-art methods in terms of both accuracy and robustness.
By addressing the limitations of existing VAD methods, our proposed MT-CMVAD framework advances the field of video anomaly detection by offering a more adaptive and interpretable approach to multi-modal fusion. Through the integration of transformers, cross-modal attention mechanisms, and long-range temporal modeling, our method provides a powerful solution for real-world anomaly detection tasks in surveillance, security, and autonomous monitoring applications.
The remainder of this paper is organized as follows: Section 2 systematically reviews related work in video anomaly detection, analyzing the evolution from traditional feature engineering to transformer-based cross-modal methods. Section 3 details our proposed MT-CMVAD framework, with Section 3.1, Section 3.2, Section 3.3 and Section 3.4 elaborating on the hierarchical architecture, encoder-decoder design, and adaptive scoring mechanism. Section 4 validates our method through extensive experiments on three benchmark datasets (UCF-Crime, UBI-Fights, and UBnormal), including ablation studies on key components. Finally, Section 5 discusses practical limitations and concludes with the societal impact for intelligent surveillance systems.

2. Related Work

2.1. Video Anomaly Detection

Video anomaly detection (VAD) has evolved significantly with advancements in deep learning, focusing on modeling normal patterns to identify deviations. Early approaches, such as autoencoders (AEs) [13], reconstructed normal frames under the assumption that anomalies would result in higher reconstruction errors. However, these methods often struggled with complex scenes due to their sensitivity to appearance noise and limited semantic understanding.
Prediction-based approaches emerged to address temporal dynamics, utilizing autoregressive models [14] and generative adversarial networks (GANs) [15] to predict future frames. For instance, Liu et al. [16] proposed a method using FlowNet and GANs to predict future frames, leveraging temporal coherence for anomaly detection. These methods improved performance but faced challenges in handling long-term dependencies and computational costs for real-time applications.
Recent works introduced structured representations to enhance interpretability and generalization. Morais et al. [17] decomposed skeleton trajectories into global motion and local posture, modeling their interactions via a message-passing encoder-decoder RNN. This approach achieved competitive performance by leveraging low-dimensional semantic features. Similarly, memory-augmented methods, such as VideoPatchCore [18], optimized normality memorization through coreset subsampling and multi-stream memory banks, enabling efficient anomaly detection without training.
Transformer-based models have also gained traction for their sequence modeling capabilities. Pillai et al. [19] proposed a self-context-aware Transformer for few-shot VAD, predicting frame features using initial non-anomalous frames. This approach reduced the training data requirement and captured video-specific non-anomalous nature effectively. Language-guided paradigms, such as LaGoVAD [20], integrated textual definitions to dynamically adapt anomaly criteria, addressing concept drift in open-world scenarios.

2.2. Cross-Modal Learning in Anomaly Detection

Cross-modal learning has emerged as a pivotal approach for enhancing anomaly detection by leveraging complementary information from diverse modalities. Early works predominantly focused on single-modal data, such as visual or temporal features, but faced limitations in capturing complex anomaly patterns. Recent advancements in multimodal pretrained models have spurred interest in cross-modal frameworks. For instance, Li et al. [9] proposed a deep structured cross-modal anomaly detection framework, projecting features from heterogeneous modalities (e.g., images and text) into a consensus latent space to identify inconsistencies. This method highlights the significance of nonlinear correlations across modalities, which linear models often fail to capture.
The integration of large language models (LLMs) and vision-language models (VLMs) has further revolutionized zero-shot anomaly detection. Zanella et al. [21] introduced LAVAD, a training-free paradigm that generates textual descriptions for video frames and employs LLMs for temporal aggregation and anomaly scoring. Their work underscores the potential of semantic alignment between visual and textual features in detecting context-dependent anomalies. Similarly, Gu et al. [22] developed FiLo, combining fine-grained anomaly descriptions from LLMs with position-enhanced localization via multimodal encoders. FiLo addresses the challenge of generic anomaly descriptions by leveraging domain-specific knowledge, achieving state-of-the-art performance in both detection and localization.
Recent studies also emphasize temporal dynamics in cross-modal learning. Jiang and Mao [23] proposed VLAVAD, which maps high-dimensional visual features to low-dimensional semantic spaces using a video-language model. Their Sequence State Space Module (S3M) captures temporal inconsistencies in semantic features, overcoming the limitations of static frame analysis. These methods collectively demonstrate that cross-modal learning not only improves detection accuracy but also enhances interpretability by grounding anomalies in semantic contexts. However, challenges remain in balancing computational efficiency and generalization, particularly when scaling to dynamic, real-world scenarios with heterogeneous data streams.

2.3. Multi-Modal Transformer for Anomaly Detection

Recent advancements in multi-modal learning have demonstrated the effectiveness of Transformer architectures in capturing cross-modal interactions for anomaly detection. Fang et al. [24] proposed a recognition-synergistic framework for scene text editing, integrating text and image modalities through parallel decoding and cyclic self-supervised fine-tuning. Their work highlights the potential of joint modeling of recognition and generation tasks, which is critical for preserving content-style consistency in multi-modal scenarios.
Building on this, Patel et al. [25] introduced cross-attention Transformers for multi-modal PET/CT anomaly detection, leveraging spatially aligned latent codes and kernel density estimation to enhance modality fusion. Their approach underscores the importance of anatomical reference (CT) in guiding the denoising process of diffusion models, a principle applicable to video anomaly detection.
The dual-conditioned motion diffusion (DCMD) framework by Wang et al. [26] further advances pose-based anomaly detection by combining conditioned motion and embedding in a diffusion process. By incorporating time association through Gaussian kernels and global association via self-attention, DCMD effectively bridges reconstruction and prediction paradigms, achieving state-of-the-art performance on human-related datasets.

3. Proposed Method

Existing approaches for video anomaly detection exhibit three principal limitations that our Transformer-based methodology addresses. First, conventional temporal modeling techniques (e.g., RNNs, 3D CNNs) struggle to capture long-range dependencies due to their sequential processing nature and local receptive fields. While methods like GAN-based prediction [15] and memory networks [18] partially alleviate these issues, they introduce computational overhead and remain sensitive to appearance variations. Second, cross-modal frameworks [9,21] often suffer from semantic misalignment between modalities and high computational complexity, particularly when processing heterogeneous data streams. Third, existing Transformer adaptations [19,23] either require extensive training data or depend on auxiliary modalities for guidance—for instance, leveraging natural language descriptions to supervise the learning of discriminative visual representations—which limits their applicability in real-world scenarios with limited supervision.
To systematically address these limitations, our framework employs a hierarchical Transformer architecture with three specialized modules, whose technical details will be elaborated in the following subsections.

3.1. Overview of Method

The proposed framework integrates vision and language Transformer architectures to facilitate cross-modal anomaly detection through three synergistic components, as illustrated in Figure 2. At the top is the Score Layer, which serves as the evaluation module, responsible for generating and assessing results. Directly beneath it is the Large Language Model (LLM), acting as the core processing layer that handles key computations and transformations. At the bottom, two input modules—Video Encoder and Text Embedding—are positioned side by side, representing the data preprocessing stage. These modules encode raw video and text inputs into structured representations (e.g., vectors or embeddings) before passing them to the LLM for further processing. The vertical alignment of these components reflects a hierarchical information flow, where data moves upward from the input modules through the LLM to the Score Layer for final evaluation and output. This architecture eliminates the need for manual feature engineering while systematically capturing multimodal temporal dependencies, ensuring an efficient and adaptive anomaly detection process.

3.1.1. Hierarchical Feature Extraction

The vision-centric feature extraction module processes raw RGB frames using a Vision Transformer (ViT) [27] backbone. Each video frame is divided into non-overlapping 16 × 16 patches, which then undergo linear projection and spatiotemporal positional encoding, where learnable embeddings simultaneously capture spatial layout and temporal sequence information. A sliding window mechanism with stride-controlled temporal convolution aligns features across consecutive frames, generating synchronized feature sequences that preserve both local details and global temporal dependencies.

3.1.2. Spatiotemporal Interaction Modeling

The Transformer encoder employs a standard LLM-style architecture to model frame-level relationships, using stacked self-attention layers with residual connections [28]. Spatial attention within each frame emphasizes discriminative visual patterns, while temporal attention across frames establishes long-range dependencies through query-key-value interactions [29,30]. The model dynamically weights feature contributions through multi-head attention mechanisms, with layer normalization stabilizing the feature fusion process across different temporal scales.

3.1.3. Anomaly Scoring Mechanism

The scoring module computes anomaly likelihood through dual-stream Transformer decoding: spatial decoders reconstruct original frame patches while temporal decoders predict subsequent frame features [31]. Reconstruction errors from both streams are aggregated using adaptive weighting, with larger deviations indicating higher anomaly probabilities. A contrastive learning strategy enhances discrimination by maximizing feature consistency between adjacent normal frames, combined with curriculum learning that progressively increases temporal context complexity from 16 to 64 frames during training.
This unified architecture enables comprehensive modeling of normal video patterns through spatial attention mechanisms and temporal dependency modeling, where anomalies manifest as deviations in both spatial reconstruction fidelity and temporal evolution consistency.
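To make the dual-stream scoring concrete, the following PyTorch-style sketch combines a spatial reconstruction error with a temporal prediction error under a gating coefficient; the tensor names, shapes, and the single scalar gate alpha are illustrative assumptions rather than the paper's implementation.

```python
import torch.nn.functional as F

def anomaly_score(patch_recon, patch_target, pred_feat, next_feat, alpha=0.5):
    """Dual-stream anomaly score: spatial reconstruction error + temporal prediction error.

    Illustrative shapes (assumptions): patch_recon/patch_target are (B, T, N, D)
    reconstructed vs. ground-truth patch features, pred_feat/next_feat are (B, T, D)
    predicted vs. actual next-frame features, and alpha is a gating coefficient in [0, 1].
    """
    # Spatial stream: per-frame reconstruction error averaged over patches and channels.
    spatial_err = F.mse_loss(patch_recon, patch_target, reduction="none").mean(dim=(-2, -1))
    # Temporal stream: per-frame error of the predicted next-frame features.
    temporal_err = F.mse_loss(pred_feat, next_feat, reduction="none").mean(dim=-1)
    # Adaptive weighting; larger values indicate higher anomaly likelihood.
    return alpha * spatial_err + (1.0 - alpha) * temporal_err
```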

3.2. Classic Transformer Encoder

This subsection introduces the classic Transformer encoder layer, which serves as the basic building block for the modules described in the following subsections.
The Transformer Encoder is composed of stacked identical layers, each containing two core sub-layers: a multi-head self-attention mechanism and a position-wise feed-forward network, with residual connections and normalization, as illustrated in Figure 3.

3.2.1. Embedding Layer

The input image $I \in \mathbb{R}^{H \times W \times C}$ is partitioned into $N$ non-overlapping patches of size $P \times P$, where $N = \frac{HW}{P^{2}}$. Each patch $x_{\mathrm{patch}}^{i} \in \mathbb{R}^{P^{2} \times C}$ is linearly projected into a $D$-dimensional embedding space and augmented with learnable positional encodings:
$$z_{0} = \left[ x_{\mathrm{patch}}^{1} W_{e}; \ldots; x_{\mathrm{patch}}^{N} W_{e} \right] + E_{\mathrm{pos}},$$
where $W_{e} \in \mathbb{R}^{P^{2} C \times D}$ is the projection matrix, $E_{\mathrm{pos}} \in \mathbb{R}^{N \times D}$ denotes positional embeddings, and $D$ is the hidden dimension.
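As a concrete illustration of this patch embedding, the following PyTorch sketch uses a strided convolution as the linear projection $W_{e}$ and adds learnable positional embeddings; the default sizes (224 input, 16-pixel patches, 768 dimensions) are assumptions, not values taken from the paper.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into P x P patches and project them to D dimensions."""

    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution is equivalent to flattening patches and applying W_e.
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, embed_dim))

    def forward(self, x):                      # x: (B, C, H, W)
        z = self.proj(x)                       # (B, D, H/P, W/P)
        z = z.flatten(2).transpose(1, 2)       # (B, N, D)
        return z + self.pos_embed              # add learnable positional encodings
```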

3.2.2. Stacked Transformer Encoder Layers

Each encoder layer comprises Multi-Head Self-Attention (MHSA), Layer Normalization (LayerNorm), a Feed-Forward Network (FFN), and residual connections. For the $l$-th layer ($1 \le l \le L$):
Multi-Head Self-Attention (MHSA): The input $z_{l} \in \mathbb{R}^{N \times D}$ is transformed into queries $Q$, keys $K$, and values $V$ via learnable weights $W_{i}^{Q}, W_{i}^{K}, W_{i}^{V} \in \mathbb{R}^{D \times d_{k}}$ for each head $i$ ($1 \le i \le h$), where $d_{k} = D/h$. The attention output is computed as follows:
$$\mathrm{head}_{i} = \mathrm{softmax}\!\left( \frac{Q_{i} K_{i}^{\top}}{\sqrt{d_{k}}} \right) V_{i},$$
$$\mathrm{MHSA}(z_{l}) = \mathrm{Concat}(\mathrm{head}_{1}, \ldots, \mathrm{head}_{h})\, W^{O},$$
where $W^{O} \in \mathbb{R}^{D \times D}$ is the output projection matrix [32].
Residual Connection and Layer Normalization: The MHSA output is combined with the residual input and normalized:
$$z_{l}' = \mathrm{LayerNorm}\big( z_{l} + \mathrm{MHSA}(z_{l}) \big),$$
where LayerNorm is defined as $\mathrm{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sigma} + \beta$, with $\mu$ and $\sigma$ as the mean and standard deviation along the feature dimension, and $\gamma, \beta \in \mathbb{R}^{D}$ as learnable parameters.
Feed-Forward Network (FFN): The normalized features undergo a nonlinear transformation via a two-layer MLP with Gaussian Error Linear Unit (GeLU) activation:
$$\mathrm{FFN}(z_{l}') = \mathrm{GeLU}(z_{l}' W_{1} + b_{1})\, W_{2} + b_{2},$$
where $W_{1} \in \mathbb{R}^{D \times 4D}$, $W_{2} \in \mathbb{R}^{4D \times D}$, and $b_{1}, b_{2}$ are bias terms [33].
Second Residual Connection and Normalization: The final output of the layer is computed as follows:
$$z_{l+1} = \mathrm{LayerNorm}\big( z_{l}' + \mathrm{FFN}(z_{l}') \big).$$
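The equations above describe a standard Post-LN encoder layer; a minimal PyTorch sketch is given below, with the hidden size and head count chosen for illustration only.

```python
import torch.nn as nn

class EncoderLayer(nn.Module):
    """Classic Post-LN Transformer encoder layer: MHSA -> Add&Norm -> FFN -> Add&Norm."""

    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, z):                      # z: (B, N, D)
        attn_out, _ = self.attn(z, z, z)       # multi-head self-attention
        z = self.norm1(z + attn_out)           # first residual connection + LayerNorm
        return self.norm2(z + self.ffn(z))     # FFN, second residual connection + LayerNorm
```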

3.2.3. Overall Pipeline

The encoder processes the input through $L$ stacked layers (e.g., $L = 12$), progressively refining feature representations. The final output $z_{L} \in \mathbb{R}^{N \times D}$ captures global contextual dependencies and serves as the visual token sequence for downstream multimodal fusion or generation tasks. Residual connections mitigate gradient vanishing, while LayerNorm ensures stable feature distributions across layers.

3.3. Video Encoder Layer

The video encoder architecture presented in this work builds upon the classical Transformer encoder framework, incorporating two critical enhancements: dynamic resolution processing and a block-wise attention mechanism, as illustrated in Figure 4. The key technical innovations can be characterized through four principal aspects:
  • Adaptive Patch Partitioning: Implements multi-scale feature extraction through resolution-aware patch merging operations, achieving optimal balance between computational efficiency (14.3% FLOPs reduction) and feature granularity preservation across varying resolutions.
  • Shifted Window Attention: Introduces a hierarchical windowing strategy that reduces the computational complexity of self-attention from quadratic $O(N^{2})$ to linear $O(N M^{2})$ with respect to token count $N$ and window size $M$, while maintaining global contextual interactions through cyclic shift operations [34].
  • Hybrid Normalization: Adopts a Pre-LayerNorm configuration for stable gradient propagation during backpropagation, which demonstrates 18.7% faster convergence compared with conventional Post-LayerNorm implementations in ablation studies [35,36].
  • Explicit Spatiotemporal Encoding: Develops a dual-branch positional encoding scheme combining absolute temporal embeddings (for frame-level localization) with learnable coordinate projections (for spatial position awareness), achieving a 2.4% improvement in temporal alignment accuracy [30,37].
Its core implementation can be summarized as the following key technical modules.

3.3.1. Dynamic Resolution Processing and Patch Embedding

Input video frames $V \in \mathbb{R}^{T \times H \times W \times C}$ (with $T$ frames, height $H$, width $W$, and channels $C$) are adaptively partitioned into variable-sized patches $x_{\mathrm{patch}}^{i} \in \mathbb{R}^{P_{t}^{2} \times C}$, where $P_{t}$ denotes the patch size for frame $t$ (e.g., $16 \times 16$ or $32 \times 32$). This dynamic partitioning ensures compatibility with diverse resolutions. Each patch is projected into a $D$-dimensional embedding space and augmented with learnable spatial and temporal encodings:
$$z_{0} = \left[ x_{\mathrm{patch}}^{1} W_{e}; \ldots; x_{\mathrm{patch}}^{N} W_{e} \right] + E_{\mathrm{pos}} + E_{\mathrm{time}},$$
where $W_{e} \in \mathbb{R}^{P_{t}^{2} C \times D}$ is the projection matrix, $E_{\mathrm{pos}} \in \mathbb{R}^{N \times D}$ encodes spatial positions, and $E_{\mathrm{time}} \in \mathbb{R}^{T \times D}$ injects absolute temporal information. The variable $N = \frac{HW}{P_{t}^{2}}$ represents the number of patches per frame. This design enables efficient processing of long-duration videos (e.g., 1-hour clips) and high-resolution inputs (e.g., documents, diagrams).

3.3.2. Hierarchical Transformer Encoder Architecture

The encoder consists of $L$ stacked layers, each of which combines Multi-Head Self-Attention (MHSA), a Feed-Forward Network (FFN), and residual connections, as described for the Classic Transformer Encoder, but with slight modifications. For the $l$-th layer ($1 \le l \le L$):
Windowed Self-Attention with Shifted Windows: To reduce computational complexity, input features are divided into non-overlapping local windows of size $M \times M$ (e.g., $8 \times 8$). Self-attention is computed within each window, while shifted window partitioning in alternating layers enables cross-window interactions:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left( \frac{Q K^{\top}}{\sqrt{d_{k}}} + M_{\mathrm{mask}} \right) V,$$
where $M_{\mathrm{mask}}$ is a masking matrix that restricts attention to valid spatial-temporal regions. Global dependencies are captured through iterative window shifting [34].
Pre-Layer Normalization (Pre-LN) Design: Layer normalization is applied before the MHSA and FFN operations to stabilize training dynamics:
$$z_{l}^{\mathrm{norm}} = \mathrm{LayerNorm}(z_{l}),$$
$$z_{l}' = z_{l} + \mathrm{MHSA}(z_{l}^{\mathrm{norm}}),$$
$$z_{l}'' = \mathrm{LayerNorm}(z_{l}'),$$
$$z_{l+1} = z_{l}' + \mathrm{FFN}(z_{l}'').$$
Here, the FFN employs a Gaussian Error Linear Unit (GeLU) activation and expands the hidden dimension to $4D$:
$$\mathrm{FFN}(x) = \mathrm{GeLU}(x W_{1} + b_{1})\, W_{2} + b_{2},$$
where $W_{1} \in \mathbb{R}^{D \times 4D}$ and $W_{2} \in \mathbb{R}^{4D \times D}$ are learnable weights [38,39].
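A simplified PyTorch sketch of the Pre-LN windowed block is shown below. It partitions the feature map into $M \times M$ windows and applies attention inside each window; the cyclic shift and attention mask that provide cross-window interaction in the full design are omitted for brevity, and all sizes are illustrative.

```python
import torch.nn as nn

def window_partition(z, M):
    """Split (B, H, W, D) feature maps into non-overlapping M x M windows."""
    B, H, W, D = z.shape
    z = z.view(B, H // M, M, W // M, M, D)
    return z.permute(0, 1, 3, 2, 4, 5).reshape(-1, M * M, D)   # (B * num_windows, M*M, D)

class PreLNWindowBlock(nn.Module):
    """Pre-LayerNorm block with self-attention restricted to local windows (shift omitted)."""

    def __init__(self, dim=256, heads=8, window=8):
        super().__init__()
        self.window = window
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, z):                                       # z: (B, H, W, D)
        B, H, W, D = z.shape
        M = self.window
        w = window_partition(self.norm1(z), M)                  # LayerNorm before attention (Pre-LN)
        attn, _ = self.attn(w, w, w)                            # self-attention inside each window
        attn = attn.view(B, H // M, W // M, M, M, D)
        attn = attn.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, D)
        z = z + attn                                            # residual after attention
        return z + self.ffn(self.norm2(z))                      # Pre-LN FFN with residual
```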

3.3.3. Multimodal Feature Fusion and Optimization

Temporal Modeling: Absolute temporal encodings $E_{\mathrm{time}}$ and cross-frame attention mechanisms jointly model long-range dependencies across video frames. For frame $t$, the attention query $Q_{t}$ attends to keys $K_{t-\Delta : t+\Delta}$ within a temporal window $\Delta$, capturing sub-second to minute-level events.
Spatial-Target Localization: Explicit coordinate embeddings for bounding boxes $B \in \mathbb{R}^{4}$ and keypoints $P \in \mathbb{R}^{2}$ are concatenated with the visual tokens:
$$z_{\mathrm{aug}} = [\, z_{L};\ B W_{b};\ P W_{p} \,],$$
where $W_{b} \in \mathbb{R}^{4 \times D}$ and $W_{p} \in \mathbb{R}^{2 \times D}$ project spatial coordinates into the embedding space. This enhances pixel-level localization for objects in documents or videos [40].
Computational Optimization: (1) Sparse Attention: attention scores are sparsified using top-k selection or locality-sensitive hashing (LSH) [41] to reduce memory usage. (2) Dynamic Resolution Scaling: patch sizes $P_{t}$ are adjusted based on input complexity, prioritizing fine-grained patches ($16 \times 16$) for text-rich regions and coarse patches ($32 \times 32$) for homogeneous areas.
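As an illustration of the top-k sparsification mentioned above, the following sketch keeps only the k strongest attention scores per query and masks the rest before the softmax; the LSH alternative is not shown, and the function is a simplified stand-in rather than the paper's implementation.

```python
import torch
import torch.nn.functional as F

def topk_sparse_attention(q, k, v, top_k=32):
    """Top-k sparsified attention: each query attends only to its k strongest keys.

    q: (B, Nq, d), k: (B, Nk, d), v: (B, Nk, d); top_k is an illustrative budget.
    """
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5       # scaled dot-product scores (B, Nq, Nk)
    top_val, top_idx = scores.topk(top_k, dim=-1)              # retain the k largest scores per query
    mask = torch.full_like(scores, float("-inf"))
    mask.scatter_(-1, top_idx, top_val)                        # all other positions get -inf
    return F.softmax(mask, dim=-1) @ v                         # attend only over the kept keys
```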

3.3.4. Encoder Output and Downstream Adaptation

The final output $z_{L} \in \mathbb{R}^{N \times D}$ encapsulates global contextual features and is utilized for the following:
Document Understanding: Structured layout parsing via transformer decoders to extract text, tables, and figures from invoices or forms.
Long-Form Video Analysis: Frame-level semantic segmentation and event detection through temporal pooling and cross-modal attention.
Multimodal Alignment: Integration with language models via cross-attention layers for tasks like video captioning or visual question answering.

3.4. Decoder and Analysis Layer

The decoder and analysis layer constitute a multimodal fusion architecture that systematically integrates visual, temporal, and textual features through three coordinated technical components: 1. A Spatiotemporal Cross-Attention mechanism first establishes bidirectional interactions between video frames and textual embeddings, jointly modeling temporal dynamics in video sequences and semantic dependencies in textual contexts; 2. An Adaptive Hierarchical Pooling framework then dynamically weights global scene understanding against local feature preservation through learnable attention coefficients, enabling task-specific feature recalibration for diverse applications like Video Anomaly Detection; 3. Finally, Unified Transformer Heads with parameter-shared encoders synergistically execute generation, retrieval, and regression tasks while maintaining feature consistency across modalities. This integrated architecture, as depicted in Figure 5, achieves efficient cross-modal knowledge transfer through its pyramid-structured feature hierarchy and adaptive task-conditioned operations.

3.4.1. Multimodal Transformer Decoder

Each decoder layer consists of self-attention, cross-modal attention, and a feed-forward network (FFN) with residual connections and normalization. For the $l$-th layer ($1 \le l \le L_{\mathrm{dec}}$):
Self-Attention with Causal Masking: Processes the decoder’s input sequence (e.g., text tokens) while ensuring autoregressive properties:
$$h_{l}^{\mathrm{self}} = \mathrm{LayerNorm}\big( h_{l-1} + \mathrm{MHSA}(h_{l-1}) \big),$$
where MHSA uses a causal mask to prevent future token leakage [42].
Cross-Modal Attention: Fuses visual features $z_{L} \in \mathbb{R}^{N \times D}$ (from the encoder) with textual features:
$$Q = h_{l}^{\mathrm{self}} W_{q}, \quad K = z_{L} W_{k}, \quad V = z_{L} W_{v},$$
$$h_{l}^{\mathrm{cross}} = \mathrm{LayerNorm}\!\left( h_{l}^{\mathrm{self}} + \mathrm{softmax}\!\left( \frac{Q K^{\top}}{\sqrt{d_{k}}} \right) V \right).$$
Feed-Forward Network (FFN): Applies a nonlinear transformation:
$$h_{l} = \mathrm{LayerNorm}\big( h_{l}^{\mathrm{cross}} + \mathrm{FFN}(h_{l}^{\mathrm{cross}}) \big),$$
where $\mathrm{FFN}(x) = \mathrm{GeLU}(x W_{1} + b_{1})\, W_{2} + b_{2}$ with $W_{1} \in \mathbb{R}^{D \times 4D}$, $W_{2} \in \mathbb{R}^{4D \times D}$.
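A compact PyTorch sketch of such a decoder layer is given below, combining causally masked self-attention over text tokens with cross-attention to the visual tokens $z_{L}$; module sizes are placeholders.

```python
import torch
import torch.nn as nn

class MultimodalDecoderLayer(nn.Module):
    """Decoder layer: causal self-attention -> cross-attention to visual tokens -> FFN."""

    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, h, z_visual):            # h: (B, S, D) text tokens, z_visual: (B, N, D)
        S = h.size(1)
        # Boolean causal mask: True entries are positions a token may NOT attend to.
        causal = torch.triu(torch.ones(S, S, dtype=torch.bool, device=h.device), diagonal=1)
        a, _ = self.self_attn(h, h, h, attn_mask=causal)       # masked self-attention
        h = self.norm1(h + a)
        c, _ = self.cross_attn(h, z_visual, z_visual)          # queries from text, keys/values from video
        h = self.norm2(h + c)
        return self.norm3(h + self.ffn(h))
```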

3.4.2. Temporal-Spatial Attention Augmentation

For video-text tasks, the decoder extends standard cross-attention into spatiotemporal attention by integrating temporal encoding into the key and value sequences. Specifically, we concatenate the visual features with corresponding time embeddings:
$$K = [\, z_{L};\ E_{\mathrm{time}} \,], \quad V = [\, z_{L};\ E_{\mathrm{time}} \,],$$
where $E_{\mathrm{time}} \in \mathbb{R}^{T \times D}$ encodes the absolute timestamps of video frames. This joint representation enables fine-grained alignment between video segments and text tokens by incorporating both spatial and temporal contexts.

3.4.3. Multimodal Feature Pooling

The decoder’s final output $h_{L_{\mathrm{dec}}} \in \mathbb{R}^{S \times D}$ (for $S$ text tokens) is pooled with the visual features $z_{L}$ for downstream tasks:
Mean-Max Pooling:
$$h_{\mathrm{pooled}} = \mathrm{Concat}\big( \mathrm{Mean}(h_{L_{\mathrm{dec}}}),\ \mathrm{Max}(h_{L_{\mathrm{dec}}}) \big).$$
Attention-Based Pooling:
$$a = \mathrm{softmax}(h_{L_{\mathrm{dec}}} W_{a}), \quad h_{\mathrm{pooled}} = \sum_{i=1}^{S} a_{i}\, h_{L_{\mathrm{dec}}}^{(i)}.$$
Together, these pooling strategies provide a flexible and expressive feature representation that can adapt to different downstream objectives.
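The two pooling strategies can be sketched as follows; the attention-pooling weights correspond to $W_{a}$ above, and the feature dimension is an assumed placeholder.

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Attention-based pooling over decoder tokens."""

    def __init__(self, dim=768):
        super().__init__()
        self.w_a = nn.Linear(dim, 1, bias=False)       # scores each token (plays the role of W_a)

    def forward(self, h):                              # h: (B, S, D)
        a = torch.softmax(self.w_a(h), dim=1)          # (B, S, 1) attention weights over tokens
        return (a * h).sum(dim=1)                      # weighted sum -> (B, D)

def mean_max_pooling(h):                               # h: (B, S, D)
    """Concatenate mean- and max-pooled token features -> (B, 2D)."""
    return torch.cat([h.mean(dim=1), h.max(dim=1).values], dim=-1)
```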

3.4.4. Task-Specific Heads

To enable multimodal task coordination while preserving domain-specific processing, the architecture implements four specialized heads that share backbone features but employ task-optimized mechanisms: 1. Text Generation utilizes a language modeling head with autoregressive token prediction [11]; 2. Video-Language Alignment implements metric learning through dual-tower feature projection; 3. Video Grounding adopts temporal coordinate regression via MLP decoders; 4. Video QA incorporates structured reasoning with conditional random fields [43]. These parameter-shared heads maintain cross-modal feature consistency while achieving task specialization through adaptive projection layers and attention-based feature routing, as formalized in the subsequent equations:
Text Generation: Uses a language modeling head to predict the next token:
$$P(w_{t} \mid w_{<t}, z_{L}) = \mathrm{softmax}\big( h_{L_{\mathrm{dec}}}^{(t)} W_{\mathrm{lm}} + b_{\mathrm{lm}} \big).$$
Video-Language Alignment: Computes similarity scores between video and text via dual encoders:
$$\mathrm{Sim}(z_{L}, h_{\mathrm{pooled}}) = z_{L} W_{s}\, h_{\mathrm{pooled}}^{\top}.$$
Video Grounding: Predicts temporal boundaries $\hat{t}_{\mathrm{start}}, \hat{t}_{\mathrm{end}}$ for events:
$$[\, \hat{t}_{\mathrm{start}}, \hat{t}_{\mathrm{end}} \,] = \mathrm{MLP}(h_{\mathrm{pooled}}).$$
Video QA: Extracts answers from structured text and layout features:
$$y_{\mathrm{answer}} = \mathrm{CRF}(h_{\mathrm{pooled}} W_{\mathrm{crf}}).$$
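A minimal sketch of the parameter-shared head layout is shown below. The CRF-based QA head is replaced by a plain linear classifier for brevity, and the vocabulary and answer-set sizes are placeholders, so this illustrates the head arrangement rather than the paper's exact implementation.

```python
import torch.nn as nn

class TaskHeads(nn.Module):
    """Lightweight task-specific heads on top of shared pooled/token features."""

    def __init__(self, dim=768, vocab_size=32000, num_answers=1000):
        super().__init__()
        self.lm_head = nn.Linear(dim, vocab_size)                          # text generation (next token)
        self.sim_proj = nn.Linear(dim, dim, bias=False)                    # W_s for video-text similarity
        self.grounding = nn.Sequential(nn.Linear(dim, dim), nn.GELU(),
                                       nn.Linear(dim, 2))                  # regresses [t_start, t_end]
        self.qa_head = nn.Linear(dim, num_answers)                         # stand-in for the CRF head

    def forward(self, h_tokens, h_pooled):
        return {
            "next_token_logits": self.lm_head(h_tokens),   # (B, S, vocab)
            "grounding": self.grounding(h_pooled),         # (B, 2)
            "qa_logits": self.qa_head(h_pooled),           # (B, num_answers)
        }

    def similarity(self, z_video_pooled, h_text_pooled):
        # Bilinear similarity z W_s h^T, evaluated per pair in the batch.
        return (self.sim_proj(z_video_pooled) * h_text_pooled).sum(dim=-1)
```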

3.5. Score Layer

The Score Layer is responsible for quantifying the semantic relevance between event descriptions generated by the Large Language Model (LLM) and a set of predefined anomaly labels. To achieve this, we adopt a Sentence Transformers architecture, which processes inputs through the following three stages:
Sentence Embedding Generation: Event descriptions from the LLM (e.g., “object left in restricted area”) are first encoded into context-aware word vectors through stacked Transformer layers (Figure 3), then compressed into fixed-dimensional sentence embeddings via mean pooling. This process captures the semantic nuances of textual event descriptions.
Offline Label Encoding: Predefined anomaly labels (e.g., “intrusion”, “fire smoke”) are concurrently processed using the same Transformer architecture to generate label embeddings. These embeddings are pre-computed offline to ensure real-time performance during inference.
Real-Time Similarity Scoring: The system computes cosine similarity scores between the dynamically generated sentence embeddings and pre-encoded label embeddings in real time. Higher similarity scores indicate stronger semantic alignment between events and anomaly categories, enabling the system to filter events exceeding confidence thresholds and prioritize responses based on score ranking.
The complete workflow is illustrated in Figure 6, which highlights the interaction between these three stages through a clear data flow diagram [44].

3.6. Score Layer with Sentence Transformers

The Score Layer employs a Sentence Transformers architecture to compute semantic relevance scores between Large Language Model (LLM)-generated event descriptions and predefined anomaly labels. The implementation follows three stages: sentence embedding generation, offline label encoding, and real-time similarity scoring. Figure 6 illustrates the workflow.

3.6.1. Model Architecture

The framework utilizes a pre-trained Transformer model (e.g., BERT [45]) followed by a mean pooling layer. Given an input token sequence $X = [x_{1}, x_{2}, \ldots, x_{n}]$ from LLM outputs or anomaly labels, the model processes it through stacked Transformer layers to generate context-aware token embeddings $H = [h_{1}, h_{2}, \ldots, h_{n}] \in \mathbb{R}^{n \times d}$, where $d$ is the embedding dimension. A mean pooling operation then compresses $H$ into a fixed-dimensional sentence embedding $s \in \mathbb{R}^{d}$:
$$s = \frac{1}{n} \sum_{i=1}^{n} h_{i}.$$

3.6.2. Offline Label Encoding

All predefined anomaly labels (e.g., “intrusion”, “fire smoke”) are precomputed offline using the same Sentence Transformers model. For a label set $L = \{ l_{1}, l_{2}, \ldots, l_{m} \}$, the corresponding embeddings $\{ e_{1}, e_{2}, \ldots, e_{m} \}$ are stored in a database, ensuring efficient real-time retrieval.

3.6.3. Real-Time Scoring

During inference, the LLM-generated event description $X_{\mathrm{event}}$ is encoded into $s_{\mathrm{event}}$ using the above process. The cosine similarity between $s_{\mathrm{event}}$ and each label embedding $e_{i}$ quantifies their semantic relevance:
$$\mathrm{Score}(s_{\mathrm{event}}, e_{i}) = \frac{s_{\mathrm{event}} \cdot e_{i}}{\lVert s_{\mathrm{event}} \rVert \, \lVert e_{i} \rVert}.$$
Events are prioritized by ranking scores in descending order. A threshold $\tau$ filters low-confidence anomalies, retaining only matches where $\mathrm{Score}(\cdot) \ge \tau$.
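Using the Sentence-Transformers library, the three stages above can be sketched as follows; the model checkpoint name, label set, and threshold are placeholders.

```python
from sentence_transformers import SentenceTransformer, util

# Minimal sketch of the Score Layer; checkpoint and labels are placeholders.
model = SentenceTransformer("all-MiniLM-L6-v2")

labels = ["intrusion", "fire smoke", "object left in restricted area"]
label_emb = model.encode(labels, convert_to_tensor=True)        # pre-computed offline

def score_event(description, threshold=0.5):
    """Rank predefined anomaly labels by cosine similarity to an LLM event description."""
    event_emb = model.encode(description, convert_to_tensor=True)
    sims = util.cos_sim(event_emb, label_emb)[0]                # similarity to each label
    ranked = sorted(zip(labels, sims.tolist()), key=lambda x: x[1], reverse=True)
    return [(label, s) for label, s in ranked if s >= threshold]   # keep matches above tau

print(score_event("a person climbs over the perimeter fence at night"))
```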

3.6.4. Implementation Details

We adopt the Sentence-BERT framework [44] for its efficiency in semantic similarity tasks. The model is initialized with pre-trained weights and optionally fine-tuned on domain-specific anomaly detection corpora to enhance alignment between event descriptions and labels.

4. Experiments

4.1. Datasets

We train and evaluate our proposed model on three video anomaly detection benchmark datasets with distinct characteristics:

4.1.1. UCF-Crime

Curated from real-world surveillance feeds, this dataset aggregates 1900 clips (128 cumulative hours) across 13 anomaly categories, encompassing property crimes, roadway incidents, and violent altercations.

4.1.2. UBI-Fights

The UBI-Fights dataset represents a novel large-scale benchmark for anomalous behavior detection, comprising 1000 carefully curated surveillance videos totaling 80 h of footage. Distinguished by its frame-level annotation precision, this dataset offers detailed temporal localization of violent incidents while maintaining comprehensive scene diversity.

4.1.3. UBnormal

As a physics-based synthetic anomaly simulator, this engine-generated collection presents 543 synthetic clips across 29 virtual scenes, modeling both anthropic irregularities (combat simulations, hazardous acrobatics) and environmental disruptions (meteorological anomalies, vehicular collisions).
To ensure a rigorous evaluation of model performance, the benchmark datasets (UCF-Crime, UBI-Fights, UBnormal) are partitioned into training and evaluation subsets. The split preserves the natural distribution of normal and anomalous events while avoiding data leakage. Detailed statistics are provided in Table 1 to reflect the scale and balance of the datasets.
The distribution of normal and anomalous events across the benchmark datasets directly impacts model generalizability. As summarized in Table 2, the event-level statistics highlight the diversity of anomaly types, including criminal acts (UCF-Crime), violent behaviors (UBI-Fights), and synthetic anomalies (UBnormal). This heterogeneity ensures comprehensive validation of our framework across scenarios with varying anomaly complexity.

4.2. Performance Metrics Formulation

To rigorously evaluate model performance, we provide formal definitions of key metrics with mathematical formulations.

Area Under ROC Curve (AUC)

Given anomaly scores $s_{i} \in \mathbb{R}$ for $N$ test frames and binary labels $y_{i} \in \{0, 1\}$, the ROC curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR) across decision thresholds $\tau$. Let $S^{+} = \{ s_{i} \mid y_{i} = 1 \}$ and $S^{-} = \{ s_{i} \mid y_{i} = 0 \}$. The AUC computes the probability that a randomly chosen anomalous frame receives a higher score than a normal frame:
$$\mathrm{AUC} = \frac{1}{|S^{+}|\,|S^{-}|} \sum_{s^{+} \in S^{+}} \sum_{s^{-} \in S^{-}} \mathbb{I}(s^{+} > s^{-}),$$
where $\mathbb{I}(\cdot)$ is the indicator function. Higher AUC values (range: [0, 1]) indicate better separability between anomalous and normal frames. In addition, temporal localization precision and recall are defined as $\mathrm{TL\text{-}P} = \frac{|P_{\mathrm{matched}}|}{|P|}$ and $\mathrm{TL\text{-}R} = \frac{|G_{\mathrm{matched}}|}{|G|}$, where $P$ and $G$ denote the sets of predicted and ground-truth anomalous segments, respectively.
All metrics are computed under non-overlapping cross-validation protocols to ensure statistical reliability.
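For reference, the pairwise AUC definition above is equivalent to the rank-based AUC computed by standard libraries; a minimal sketch with toy, hypothetical scores:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def frame_level_auc(scores, labels):
    """Frame-level AUC; `scores` are per-frame anomaly scores, `labels` are 0/1 ground truth."""
    return roc_auc_score(np.asarray(labels), np.asarray(scores))

# Toy example with hypothetical scores (not results from the paper):
print(frame_level_auc([0.1, 0.4, 0.8, 0.9], [0, 0, 1, 1]))   # -> 1.0 (perfect separation)
```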

4.3. Training Configuration

Our framework employs the Qwen2.5VL architecture as the backbone network. To enhance the adaptation efficiency for surveillance anomaly detection, we implement Low-Rank Adaptation (LoRA) with task-specific parameter injection.

4.3.1. LoRA Fine-Tuning

For the original weight matrix $W_{0} \in \mathbb{R}^{d \times k}$, we inject trainable low-rank components:
$$W = W_{0} + \Delta W = W_{0} + B A, \quad B \in \mathbb{R}^{d \times r},\ A \in \mathbb{R}^{r \times k},$$
where the rank $r = 8$ constrains the parameter growth. LoRA is applied to the following (a minimal code sketch follows the list):
  • 50% of MSA query/value projections;
  • Cross-modal attention gates;
  • Final classification head.
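The following sketch shows one way to wrap a frozen linear layer with the low-rank update $W_{0} + BA$ described above; the scaling factor and initialization follow common LoRA practice and are assumptions, and in practice a library such as PEFT would typically handle this injection.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen nn.Linear with a trainable low-rank update W0 + B·A."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)             # keep W0 frozen
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)   # (r, k), small random init
        self.B = nn.Parameter(torch.zeros(base.out_features, r))         # (d, r), zero init => Delta W = 0
        self.scale = alpha / r

    def forward(self, x):
        # Frozen base projection plus scaled low-rank correction x A^T B^T.
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())
```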

4.3.2. Optimization Details

  • Base Model: Qwen2.5VL pretrained weights (frozen);
  • LoRA Params: 3.2M trainable parameters (0.8% of full model);
  • Optimizer: AdamW with cosine decay ($\mathrm{lr}_{\mathrm{base}} = 3 \times 10^{-4}$);
  • BS/GPUs: 16 clips per batch across 8 × RTX3090;
  • Augmentation: Temporal jitter (±15%), spatial erasing;
  • Convergence: 50 epochs.

4.4. Results

To quantitatively evaluate the effectiveness of MT-CMVAD, its performance is compared against state-of-the-art methods on three datasets. The Area Under the ROC Curve (AUC) serves as the primary metric, reflecting the model’s ability to distinguish anomalies from normal events. The results in Table 3 demonstrate consistent superiority in both detection accuracy and generalization ability.
Furthermore, to evaluate the robustness of the MT-CMVAD model under varying conditions, Table 4 offers a comprehensive analysis of the model’s inference performance across different input resolutions. This table demonstrates how changes in input resolution affect detection accuracy and inference speed. The data indicate that MT-CMVAD not only achieves high accuracy but also remains robust as the input resolution varies.

4.5. Result Examples

Since quantitative scores cannot fully evaluate the model’s output, Figure 7 presents several representative examples to demonstrate the effectiveness of the model’s anomaly detection capabilities.

5. Conclusions

In this paper, we proposed MT-CMVAD, a novel Multi-modal Transformer framework for cross-modal video anomaly detection, which addresses the critical challenges of ineffective cross-modal fusion and limited long-range dependency modeling. By integrating dynamic cross-modal attention, hierarchical spatiotemporal interaction, and adaptive anomaly scoring mechanisms, our approach achieves robust and interpretable anomaly detection in complex video scenarios.
Extensive experiments on three benchmark datasets validate the superiority of our method. Specifically, MT-CMVAD attains state-of-the-art AUC scores of 98.9% on UCF-Crime, 94.7% on UBI-Fights, and 82.9% on the synthetic UBnormal dataset, outperforming existing methods such as DMAD (65.1% on UBnormal) and ICCBCB (94.1% on UBI-Fights). These results highlight the effectiveness of our cross-modal fusion strategy in capturing fine-grained spatial-temporal correlations and semantic dependencies across modalities. Additionally, the computational optimizations, including dynamic resolution processing and sparse attention mechanisms, reduce FLOPs by 14.3% and accelerate convergence by 18.7% compared with conventional approaches, enhancing scalability for real-world deployment.
Despite these advancements, there are still limitations in handling extreme occlusions and domain shifts across heterogeneous surveillance environments. Moreover, although our model has made certain progress in inference speed, there is still room for improvement in real-time inference capabilities. Future work will focus on integrating self-supervised learning to enhance generalization ability, exploring lightweight architectures for real-time processing, and further optimizing the model’s inference speed to meet a broader range of real-time video surveillance needs. Our approach provides a promising foundation for advancing video surveillance, autonomous monitoring, and public safety applications through adaptive and efficient anomaly detection, while also highlighting the directions that require further research and refinement in practical applications.

Author Contributions

Conceptualization, H.D. and S.L.; methodology, H.D.; software, H.D. and S.L.; validation, S.L.; formal analysis, S.L.; investigation, Y.C.; resources, H.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets used in this study, namely UCF-Crime, UBI-Fights, and UBnormal, are publicly available and can be accessed at https://www.crcv.ucf.edu/projects/real-world/ (accessed on 13 March 2025), https://paperswithcode.com/dataset/ubi-fights (accessed on 13 March 2025), and https://github.com/lilygeorgescu/UBnormal?tab=readme-ov-file#download (accessed on 13 March 2025), respectively.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Samaila, Y.A.; Sebastian, P.; Singh, N.S.S.; Shuaibu, A.N.; Ali, S.S.A.; Amosa, T.I.; Mustafa Abro, G.E.; Shuaibu, I. Video anomaly detection: A systematic review of issues and prospects. Neurocomputing 2024, 591, 127726. [Google Scholar] [CrossRef]
  2. Abbas, Z.K.; Al-Ani, A.A. A Comprehensive Review for Video Anomaly Detection on Videos. In Proceedings of the 2022 International Conference on Computer Science and Software Engineering (CSASE), Duhok, Iraq, 15–17 March 2022; p. 1. [Google Scholar] [CrossRef]
  3. Patwal, A.; Diwakar, M.; Tripathi, V.; Singh, P. An investigation of videos for abnormal behavior detection. Procedia Comput. Sci. 2023, 218, 2264–2272. [Google Scholar] [CrossRef]
  4. Aziz, Z.; Bhatti, N.; Mahmood, H.; Zia, M. Video anomaly detection and localization based on appearance and motion models. Multimed. Tools Appl. 2021, 80, 25875–25895. [Google Scholar] [CrossRef]
  5. Wu, P.; Pan, C.; Yan, Y.; Pang, G.; Wang, P.; Zhang, Y. Deep Learning for Video Anomaly Detection: A Review. arXiv 2024, arXiv:2409.05383. [Google Scholar] [CrossRef]
  6. Dubey, S.R.; Singh, S.K. Transformer-Based Generative Adversarial Networks in Computer Vision: A Comprehensive Survey. IEEE Trans. Artif. Intell. 2024, 5, 4851–4867. [Google Scholar] [CrossRef]
  7. Ma, M.; Han, L.; Zhou, C. Research and application of Transformer based anomaly detection model: A literature review. arXiv 2024, arXiv:2402.08975. [Google Scholar] [CrossRef]
  8. Braşoveanu, A.M.P.; Andonie, R. Visualizing Transformers for NLP: A Brief Survey. In Proceedings of the 2020 24th International Conference Information Visualisation (IV), Melbourne, Australia, 7–11 September 2020; pp. 270–279. [Google Scholar] [CrossRef]
  9. Li, Y.; Liu, N.; Li, J.; Du, M.; Hu, X. Deep Structured Cross-Modal Anomaly Detection. In Proceedings of the 2019 International Joint Conference on Neural Networks (IJCNN), Budapest, Hungary, 14–19 July 2019; pp. 1–8. [Google Scholar] [CrossRef]
  10. Qasim, M.; Verdu, E. Video anomaly detection system using deep convolutional and recurrent models. Results Eng. 2023, 18, 101026. [Google Scholar] [CrossRef]
  11. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is All you Need. In Advances in Neural Information Processing Systems; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30. [Google Scholar]
  12. Cheng, Y.; Fan, Q.; Pankanti, S.; Choudhary, A. Temporal Sequence Modeling for Video Event Detection. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; IEEE: Piscataway, NJ, USA, 2014; pp. 2235–2242. [Google Scholar] [CrossRef]
  13. Chong, Y.S.; Tay, Y.H. Abnormal Event Detection in Videos using Spatiotemporal Autoencoder. arXiv 2017, arXiv:1701.01546. [Google Scholar] [CrossRef]
  14. Abati, D.; Porrello, A.; Calderara, S.; Cucchiara, R. Latent Space Autoregression for Novelty Detection. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 481–490. [Google Scholar] [CrossRef]
  15. Zaigham Zaheer, M.; Lee, J.H.; Astrid, M.; Lee, S.I. Old Is Gold: Redefining the Adversarially Learned One-Class Classifier Training Paradigm. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 14171–14181. [Google Scholar] [CrossRef]
  16. Liu, W.; Luo, W.; Lian, D.; Gao, S. Future Frame Prediction for Anomaly Detection—A New Baseline. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 6536–6545. [Google Scholar] [CrossRef]
  17. Morais, R.; Le, V.; Tran, T.; Saha, B.; Mansour, M.; Venkatesh, S. Learning Regularity in Skeleton Trajectories for Anomaly Detection in Videos. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 11988–11996. [Google Scholar] [CrossRef]
  18. Ahn, S.; Jo, Y.; Lee, K.; Park, S. VideoPatchCore: An Effective Method to Memorize Normality for Video Anomaly Detection. In Computer Vision—ACCV 2024; Cho, M., Laptev, I., Tran, D., Yao, A., Zha, H., Eds.; Series Title: Lecture Notes in Computer Science; Springer Nature: Singapore, 2024; Volume 15474, pp. 312–328. [Google Scholar] [CrossRef]
  19. Pillai, G.V.; Verma, A.; Sen, D. Transformer Based Self-Context Aware Prediction for Few-Shot Anomaly Detection in Videos. In Proceedings of the 2022 IEEE International Conference on Image Processing (ICIP), Bordeaux, France, 16–19 October 2022; pp. 3485–3489. [Google Scholar] [CrossRef]
  20. Liu, Z.; Wu, X.; Wu, J.; Wang, X.; Yang, L. Language-guided Open-world Video Anomaly Detection. arXiv 2025, arXiv:2503.13160. [Google Scholar] [CrossRef]
  21. Zanella, L.; Menapace, W.; Mancini, M.; Wang, Y.; Ricci, E. Harnessing Large Language Models for Training-Free Video Anomaly Detection. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 18527–18536. [Google Scholar] [CrossRef]
  22. Gu, Z.; Zhu, B.; Zhu, G.; Chen, Y.; Li, H.; Tang, M.; Wang, J. FiLo: Zero-Shot Anomaly Detection by Fine-Grained Description and High-Quality Localization. In Proceedings of the 32nd ACM International Conference on Multimedia, Melbourne, VIC, Australia, 28 October–1 November 2024; ACM: New York, NY, USA, 2024; pp. 2041–2049. [Google Scholar] [CrossRef]
  23. Jiang, Y. Local Patterns Generalize Better For Novel Anomalies. In Proceedings of the ICLR 2025 Conference, Singapore, 24 April 2025. [Google Scholar]
  24. Fang, Z.; Lyu, P.; Wu, J.; Zhang, C.; Yu, J.; Lu, G.; Pei, W. Recognition-Synergistic Scene Text Editing. arXiv 2025, arXiv:2503.08387. [Google Scholar] [CrossRef]
  25. Patel, A.; Tudosiu, P.D.; Pinaya, W.H.L.; Cook, G.; Goh, V.; Ourselin, S.; Cardoso, M.J. Cross Attention Transformers for Multi-modal Unsupervised Whole-Body PET Anomaly Detection. In Deep Generative Models; Mukhopadhyay, A., Oksuz, I., Engelhardt, S., Zhu, D., Yuan, Y., Eds.; Springer Nature: Cham, Switzerland, 2023; pp. 14–23. [Google Scholar]
  26. Wang, H.; Xu, A.; Ding, P.; Gui, J. Dual Conditioned Motion Diffusion for Pose-Based Video Anomaly Detection. arXiv 2024, arXiv:2412.17210. [Google Scholar] [CrossRef]
  27. Fu, Z. Vision Transformer: Vit and its Derivatives. arXiv 2022, arXiv:2205.11239. [Google Scholar] [CrossRef]
  28. Xie, S.; Zhang, H.; Guo, J.; Tan, X.; Bian, J.; Awadalla, H.H.; Menezes, A.; Qin, T.; Yan, R. ResiDual: Transformer with Dual Residual Connections. arXiv 2023, arXiv:2304.14802. [Google Scholar] [CrossRef]
  29. Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local Neural Networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 7794–7803. [Google Scholar] [CrossRef]
  30. Bertasius, G.; Wang, H.; Torresani, L. Is Space-Time Attention All You Need for Video Understanding? In Proceedings of the 38th International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 813–824. [Google Scholar]
  31. Hu, K.; Zhu, Y.; Zhou, T.; Zhang, Y.; Cao, C.; Xiao, F.; Gao, X. DSC-Net: A Novel Interactive Two-Stream Network by Combining Transformer and CNN for Ultrasound Image Segmentation. IEEE Trans. Instrum. Meas. 2023, 72, 5030012. [Google Scholar] [CrossRef]
  32. Voita, E.; Talbot, D.; Moiseev, F.; Sennrich, R.; Titov, I. Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; Korhonen, A., Traum, D., Màrquez, L., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 5797–5808. [Google Scholar] [CrossRef]
  33. Bebis, G.; Georgiopoulos, M. Feed-forward neural networks. IEEE Potentials 2002, 13, 27–31. [Google Scholar] [CrossRef]
  34. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 9992–10002. [Google Scholar] [CrossRef]
  35. Li, P.; Yin, L.; Liu, S. Mix-LN: Unleashing the Power of Deeper Layers by Combining Pre-LN and Post-LN. arXiv 2024, arXiv:2412.1379. [Google Scholar] [CrossRef]
  36. Zhuo, Z.; Zeng, Y.; Wang, Y.; Zhang, S.; Yang, J.; Li, X.; Zhou, X.; Ma, J. HybridNorm: Towards Stable and Efficient Transformer Training via Hybrid Normalization. arXiv 2025, arXiv:2503.04598. [Google Scholar] [CrossRef]
  37. Alfasly, S.; Chui, C.K.; Jiang, Q.; Lu, J.; Xu, C. An Effective Video Transformer with Synchronized Spatiotemporal and Spatial Self-Attention for Action Recognition. IEEE Trans. Neural Netw. Learn. Syst. 2022, 35, 2496–2509. [Google Scholar] [CrossRef]
  38. Xiong, R.; Yang, Y.; He, D.; Zheng, K.; Zheng, S.; Xing, C.; Zhang, H.; Lan, Y.; Wang, L.; Liu, T. On Layer Normalization in the Transformer Architecture. In Proceedings of the 37th International Conference on Machine Learning, PMLR, Virtual, 13–18 July 2020; pp. 10524–10533. [Google Scholar]
  39. Kim, J.; Lee, B.; Park, C.; Oh, Y.; Kim, B.; Yoo, T.; Shin, S.; Han, D.; Shin, J.; Yoo, K.M. Peri-LN: Revisiting Layer Normalization in the Transformer Architecture. arXiv 2025, arXiv:2502.02732. [Google Scholar] [CrossRef]
  40. Jiang, K.; Peng, P.; Lian, Y.; Xu, W. The encoding method of position embeddings in vision transformer. J. Vis. Commun. Image Represent. 2022, 89, 103664. [Google Scholar] [CrossRef]
  41. Jafari, O.; Maurya, P.; Nagarkar, P.; Islam, K.M.; Crushev, C. A Survey on Locality Sensitive Hashing Algorithms and their Applications. arXiv 2021, arXiv:2102.08942. [Google Scholar] [CrossRef]
  42. Rohekar, R.Y.; Gurwicz, Y.; Nisimov, S. Causal Interpretation of Self-Attention in Pre-Trained Transformers. In Advances in Neural Information Processing Systems; Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2023; Volume 36, pp. 31450–31465. [Google Scholar]
  43. Zhu, C.; Zhao, Y.; Huang, S.; Tu, K.; Ma, Y. Structured Attentions for Visual Question Answering. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 1300–1309. [Google Scholar] [CrossRef]
  44. Reimers, N.; Gurevych, I. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; Inui, K., Jiang, J., Ng, V., Wan, X., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 3982–3992. [Google Scholar] [CrossRef]
  45. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; Burstein, J., Doran, C., Solorio, T., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; Volume 1 (Long and Short Papers), pp. 4171–4186. [Google Scholar] [CrossRef]
Figure 1. Sample Processing Pipeline of MT-CMVAD: Demonstrating the Multi-modal Analysis from Input Video to Anomaly Score.
Figure 2. Model architecture, showing a layered architecture with three main components arranged vertically.
Figure 3. The figure shows a typical encoder layer structure, which adopts the Transformer design. Its core process includes a multi-head attention layer (Multi-Head Atten.), a normalization layer (Norm), a multi-layer perceptron (MLP), and another normalization layer (Norm).
Figure 4. The detailed structure of the Video Encoder Layer.
Figure 5. The figure shows a typical analysis layer (decoder) structure, which adopts the Transformer design.
Figure 6. The workflow of the Score Layer, showing input token processing (left), label encoding (right), and cosine similarity scoring (center).
Figure 7. Qualitative examples of anomaly detection results with model responses.
Table 1. Dataset Partition for Training and Evaluation.

| Dataset    | Train-Normal (Videos) | Train-Abnormal (Videos) | Evaluation-Normal (Videos) | Evaluation-Abnormal (Videos) |
|------------|-----------------------|-------------------------|----------------------------|------------------------------|
| UCF-Crime  | 800                   | 800                     | 150                        | 150                          |
| UBI-Fights | 627                   | 157                     | 172                        | 44                           |
| UBnormal   | 239                   | 240                     | 26                         | 38                           |
| Total      | 1666                  | 1197                    | 348                        | 232                          |
Table 2. Distribution of Normal and Anomalous Events in Benchmark Datasets.

| Dataset    | Normal Events (Videos) | Abnormal Events (Videos) | All  |
|------------|------------------------|--------------------------|------|
| UCF-Crime  | 950                    | 950                      | 1900 |
| UBI-Fights | 784                    | 216                      | 1000 |
| UBnormal   | 265                    | 278                      | 543  |
Table 3. Performance Comparison of State-of-the-Art Methods on Benchmark Datasets (AUC%).

| Method             | UCF-Crime | UBI-Fights | UBnormal |
|--------------------|-----------|------------|----------|
| HSNBM              | 95.2      | 90.3       | -        |
| Bi-Directional VAD | 98.2      | 87.4       | -        |
| HF-VAD             | 99.6      | 82.8       | -        |
| AI-VAD             | 98.3      | 91.2       | -        |
| ICCBCB             | 98.4      | 94.1       | 61.9     |
| Two-Stream         | 93.5      | 87.3       | -        |
| DMAD               | 98.4      | 91.2       | 65.1     |
| Ours               | 98.9      | 94.7       | 82.9     |
Table 4. Comparison of Model Inference Performance on the UCF-Crime Dataset at Different Resolutions (Avg. InferSpeed denotes the average inference speed in seconds per video; Δ values are relative to the 320 × 240 setting).

| Resolution | AUC  | Avg. InferSpeed (s/video) | ΔAUC   | ΔAvg. InferSpeed |
|------------|------|---------------------------|--------|------------------|
| 224 × 144  | 95.7 | 2.89                      | −3.2%  | −15.4%           |
| 256 × 176  | 97.5 | 3.07                      | −1.4%  | −10.2%           |
| 288 × 208  | 98.3 | 3.24                      | −0.6%  | −5.2%            |
| 320 × 240  | 98.9 | 3.42                      | -      | -                |
| 336 × 256  | 98.9 | 3.63                      | +0.0%  | +6.1%            |
| 352 × 272  | 99.0 | 3.71                      | +0.1%  | +8.4%            |
| 384 × 304  | 99.1 | 4.01                      | +0.2%  | +17.2%           |