Article

Dual-Branch Network for Video Anomaly Detection Based on Feature Fusion

1 Department of Investigation, Hunan Police Academy, Changsha 410138, China
2 School of Artificial Intelligence, Changsha University of Science and Technology, Changsha 410114, China
* Author to whom correspondence should be addressed.
Mathematics 2025, 13(24), 4022; https://doi.org/10.3390/math13244022
Submission received: 20 October 2025 / Revised: 1 December 2025 / Accepted: 8 December 2025 / Published: 18 December 2025

Abstract

Anomaly detection is a critical task in video surveillance, with significant applications in the management and prevention of criminal activities. Traditional convolutional neural networks often struggle with motion modeling and multi-scale feature fusion due to their localized field of view. To address these limitations, this work proposes a Dual-Branch Interactive Feature Fusion Network (DBIFF-Net). DBIFF-Net integrates a CNN branch and a swin transformer branch to extract multi-scale features, and an interactive fusion module fuses these multi-scale features efficiently through skip connections. A temporal shift module is then employed to exploit dependencies between video frames, thereby improving the identification of anomalous events. Finally, channel attention is applied in the decoder to better restore complex object features in the video. Performance is evaluated on three standard benchmark datasets. DBIFF-Net achieves an area under the receiver operating characteristic curve (AUC) of 97.7%, 84.5%, and 73.8% on the UCSD ped2, CUHK Avenue, and ShanghaiTech Campus datasets, respectively. Extensive experiments demonstrate that DBIFF-Net outperforms most state-of-the-art methods, validating the effectiveness of our approach.

1. Introduction

With the widespread adoption of video surveillance systems in public security, intelligent transportation, and urban management, video anomaly detection (VAD) has emerged as a critical task. It enables early warning, rapid response, and post-event analysis of criminal activities [1,2,3,4]. Compared to conventional surveillance, which heavily relies on manual monitoring, deep learning-based VAD algorithms can automatically identify anomalies. This technology effectively reduces the burden on operators while enhancing overall monitoring efficiency [5,6].
Due to the low incidence and diversity of anomalies, it is challenging to comprehensively label all anomalous patterns in real-world surveillance videos [7,8]. Therefore, VAD is frequently addressed as a one-class classification (OCC) problem [9,10], in which models are trained solely on normal video data and identify anomalies during inference [11]. Furthermore, video data possess intrinsic temporal dependencies that are absent in static images [12,13]. This requires models to simultaneously capture the spatial details within individual frames and the dynamic changes across a sequence, which increases the complexity of the VAD task. Early approaches to VAD relied on traditional machine learning techniques, which model normal events through hand-crafted features and shallow models. For example, Chuang et al. [14] utilized spatio-temporal ratio histograms to represent video features, with anomalies detected via a finite-state machine (FSM) model. Mo et al. [15] proposed a joint sparsity model with kernelization for multi-object anomaly detection, employing a greedy pursuit algorithm and linear sparsity to enhance class separability. Other methods, including the support vector machine (SVM) [16], the random forest (RF) [17], and probabilistic models [18], have also achieved notable success in VAD. However, these traditional methods are significantly limited by their dependence on manual feature engineering, which is inefficient and struggles to capture the full complexity of the data.
In recent years, deep learning has developed rapidly, effectively addressing the limitations of traditional machine learning. A prevalent deep learning approach for OCC is the use of reconstruction-based auto-encoders (AEs) [19,20]. These methods are generally trained by minimizing reconstruction error on normal video data and subsequently use reconstruction error as a measure of abnormality. However, they continue to face several key challenges, and several recent studies have sought to address them from different perspectives. For instance, Liu et al. [21] found that while existing VAD methods leverage multi-modal inputs to maintain robustness in complex scenes, they suffer from high computational costs, substantial storage demands, and strong multi-modal dependencies. To address this, they proposed a semi-supervised knowledge distillation framework that preserves the advantages of multi-modal learning during training while requiring only single-modal data for inference. Jiang et al. [22] pointed out that existing VAD methods often overlook the properties of motion features, leading to a failure to effectively integrate appearance and motion cues; they introduced a multi-scale differential perception network to enhance motion representation in the appearance branch and eliminate redundant information through fine-grained deep feature processing. Qiu et al. [23] argued that the blurred boundary between normal and abnormal video data leads to low detection accuracy. They introduced a convolution-enhanced self-attentive video auto-encoder based on U-Net, coupled with a dual-scale clustering module, which enhances feature representation, amplifies the normal-abnormal distinction, and provides anomaly scoring that is robust against temporal leakage. In addition, some studies have introduced two-stream networks to enhance models' ability to perceive anomalies in both the spatial and temporal streams [24]. However, the introduction of optical flow requires its explicit extraction, which complicates the training process and poses a non-trivial challenge in modeling inter-stream interactions. Meanwhile, other works have explored hybrid architectures that combine local convolutional features with global attention mechanisms to better characterize the spatio-temporal dynamics of abnormal behaviors [25]. Despite these advancements, such methods still struggle to learn fine-grained, localized appearance anomalies. These fine-grained abnormal features are not visually prominent, but they are widespread in real-world surveillance scenarios and are of significant value for detecting anomalies.
To address these limitations, this study proposes the Dual-Branch Interactive Feature Fusion Network (DBIFF-Net), a novel unsupervised framework for video anomaly detection. By leveraging the complementary strengths of the convolutional neural network (CNN) and the swin transformer [26], DBIFF-Net achieves precise prediction for anomaly detection. Its CNN branch captures detailed textures and structures at multiple spatial resolutions, while its swin transformer branch models human movements, localization, and global context through window-based self-attention. DBIFF-Net further integrates three components to enhance its performance: an interactive fusion module, a temporal shift module (TSM), and channel attention. The interactive fusion module employs an interactive attention mechanism to aggregate multi-scale features across different receptive fields from both branches. TSM [27] models temporal dependencies by shifting feature channels across consecutive frames. The decoder incorporates channel attention gates at each upsampling stage to better exploit inter-channel relationships when restoring image features. The main contributions are summarized as follows:
  • This work presents DBIFF-Net, a new video anomaly detection framework that tightly couples a CNN-based local encoder with a swin transformer–based global encoder. Unlike prior dual-branch architectures that simply concatenate or sum multi-scale features, DBIFF-Net is designed to jointly capture fine-grained spatial cues and long-range temporal dependencies through interactive fusion, leading to more discriminative multi-scale representations.
  • This work proposes an interactive fusion module to facilitate fusion of multi-scale features between local and global encoders. The module performs dynamic, cross-scale interaction, allowing features from different semantic levels and branches to reinforce each other and significantly enhance representation quality.
  • This work proposes an attention decoder module that employs channel attention to explicitly model inter-channel dependencies during upsampling, thereby improving the recovery of complex object features.
  • Comprehensive experiments show that DBIFF-Net surpasses most methods across three benchmark datasets, and extensive ablation studies validate the necessity and effectiveness of each module, confirming the soundness of the overall architectural design.

2. Related Work

Over the past years, a significant amount of research has been dedicated to the VAD problem. This section focuses on three categories of models: AE-based models, dual-stream models, and multi-scale feature models.
Due to the scarcity and diversity of anomalies in real-world scenarios, unsupervised reconstruction-based models have been widely applied in VAD [28,29]. Among these, many studies leverage the AE as the core of their detection framework. These frameworks aim to reconstruct future frames or masked frames using normal training data. Anomalies are then detected during inference by identifying frames with high reconstruction errors. For instance, a few works utilize convolutional AEs and their variants to reconstruct frames. The Mem-AE method [30] adopted a memory module to enhance memorization of normal patterns, which helps to reduce the improper reconstruction of anomalies and thus improves detection performance. The ML-MemAE-SC method [31] focused on predictive reconstruction, where the model learns to predict the next frame from a sequence of preceding frames. Furthermore, the ST-MemAE method [32] improved the memory structure by preserving spatio-temporal correlation features, leading to stable and accurate anomaly detection results on major benchmark datasets.
To address the limitations of autoencoders in jointly capturing local details and global context, visual transformers have emerged as an effective solution. For instance, Yuan et al. [33] presented trans-anomaly, a framework that combines a U-Net with a video vision transformer. The framework is designed to capture richer temporal information and broader global contexts, leading to improved anomaly detection performance. Lee et al. [34] proposed AnoViT, a vision transformer-based encoder-decoder model. The model utilizes an attention operation to learn global relationships between patches, thereby overcoming the local-feature limitation of traditional convolutional encoder-decoders. Beyond visual transformers, several recent studies have employed dual-stream encoder structures to model the relationship between appearance and motion information. For example, Cai et al. [35] proposed AMMC-Net, a dual-stream network that uses a memory mechanism to fuse information from two modalities. Thakare et al. [36] proposed an end-to-end multi-stream architecture for anomaly detection. This architecture employs two parallel spatio-temporal deep feature extractors and a fuzzy aggregation method to fuse the anomaly scores. While these architectures enhance the expressive power of motion features by using parallel networks for optical flow and RGB features, they often incur significant computational costs and data processing overheads. To mitigate these costs, recent studies have applied deep transfer learning (DTL) and deep domain adaptation (DDA) and have achieved a series of results [37].
The effective fusion of multi-scale features is crucial for anomaly detection, as it enables models to capture both fine-grained disturbances and overall pattern deviations. Common fusion strategies include element-wise summation, concatenation, and attention-based weighting mechanisms. For instance, the ASTNet method [38] employed the residual attention fusion and channel classification gates to aggregate features from temporal and spatial branches. Another class of methods employed feature alignment loss, which explicitly constrains the consistency of the embedding representations generated by different branches (e.g., spatial and temporal branches) within the feature space. By aligning these representations, such methods facilitate more effective collaborative fusion of multimodal or multi-stream information. The CMTSTFN method [39] attempted to guide networks to simultaneously focus on large-scale scene changes and small-scale target details by aligning global and local feature representations. This process enhanced the robustness and accuracy of its anomaly detection. However, most existing fusion methods are still limited by their reliance on single-resolution or shallow operations.
Different from existing methods, this work proposes a dual-branch architecture that applies swin transformer blocks and CNN blocks in the encoder and investigates multi-scale feature fusion. We believe this dual-branch mechanism, through multi-scale feature fusion, yields stronger video anomaly detection performance.

3. Methods

The overall architecture of the DBIFF-NET model is shown in Figure 1. The model consists of two main components: a dual-branch encoder and a decoder. The dual-branch encoder extracts features from the input image and generates global and local feature map sets. To effectively integrate information across different receptive fields, an interactive fusion module combines shallow-level features with deep-level ones. This fusion is crucial because features from convolutional layers at different depths exhibit significant differences: deeper layers possess larger receptive fields and extract more abstract features, at the cost of losing image details. By combining features from different depths, the model can capture more information. Finally, the decoder processes the three groups of fused features through an attention mechanism to predict the corresponding future frame. Due to the scarcity of datasets containing anomalous events, the model adopts an unsupervised approach to detect abnormal activities in videos.

3.1. Dual-Branch Encoder

This encoder comprises two parallel networks: a CNN local branch and a swin transformer global branch. The input of the dual-branch encoder is a video frame sequence $I = \{ I_t \mid I_t \in \mathbb{R}^{C \times H \times W} \}$, where $t$ is the length of the video frame sequence and $C$, $H$, and $W$ denote the number of channels, height, and width of the video frames, respectively. Following prior research, the last frame of the sequence is designated as the key frame $I^k$ and used to compute the difference from the predicted frame, while all other frames are concurrently fed into the dual-branch encoder. For convenience of presentation, we describe the encoding of a single frame.

3.1.1. CNN Local Branch

The local feature branch is designed to automatically learn spatial hierarchies within a frame through adaptive feature extraction. Its convolutional layers apply learnable filters across local regions, enabling effective capture of local patterns.
This module is organized into three stages, where each stage includes two convolutional layers, an activation layer, and a pooling layer. The output feature of the $n$-th stage is defined as follows:
$X_n^{C} = \mathrm{CNN}_n(X_{n-1}^{C}), \quad n = 1, 2, 3$
where $X_0^{C} \in I \setminus \{ I^k \}$ is an input frame and $X_n^{C}$ represents the output feature of the $n$-th stage.
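The following is a minimal PyTorch sketch of the CNN local branch described above (three stages, each with two convolutions, an activation, and a pooling layer). The channel widths and kernel sizes are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class CNNLocalBranch(nn.Module):
    def __init__(self, in_channels=3, widths=(64, 128, 256)):
        super().__init__()
        stages, prev = [], in_channels
        for w in widths:
            stages.append(nn.Sequential(
                nn.Conv2d(prev, w, kernel_size=3, padding=1),
                nn.Conv2d(w, w, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(kernel_size=2),   # halves H and W at each stage
            ))
            prev = w
        self.stages = nn.ModuleList(stages)

    def forward(self, x):
        # Returns the multi-scale outputs X_1^C, X_2^C, X_3^C.
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        return feats

# Example: one 256x256 RGB frame yields 128x128, 64x64, and 32x32 feature maps.
feats = CNNLocalBranch()(torch.randn(1, 3, 256, 256))
print([tuple(f.shape) for f in feats])
```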

3.1.2. Swin Transformer Global Branch

The swin transformer is utilized to capture global information for each frame via a window-based self-attention mechanism. Its shifted window design enables cross-window feature integration, effectively modeling long-range dependencies while maintaining computational efficiency.
The swin transformer module likewise comprises three stages. The first stage consists of patch partition, linear embedding, and a swin transformer block, while the second and third stages each include a patch merging operation and a swin transformer block. Specifically, each input frame $I_t \in \mathbb{R}^{C \times H \times W}$ is first partitioned into 2D patches $I_t^{p} \in \mathbb{R}^{(P^2 \cdot C) \times (\frac{H}{P} \times \frac{W}{P})}$, where $P$ denotes the patch size (typically set to four). These patches are then fed into the swin transformer block through a linear embedding layer, which constitutes the first stage. As illustrated in Figure 2, the block incorporates window-based multi-head self-attention (W-MSA) and shifted window-based multi-head self-attention (SW-MSA) to facilitate inter-window feature propagation. Subsequently, the output features are downsampled via a patch merging operation, and the next-stage features are generated through the following swin transformer block. The patch merging operation reduces spatial resolution while enlarging the receptive field, thereby enabling the model to capture broader contextual information at deeper stages. The output of the swin transformer module at the $n$-th stage, $X_n^{ST}$, is derived as follows:
$X_0^{ST} = \mathrm{MLP}(\mathrm{PP}(I_t)), \quad I_t \in I \setminus \{ I^k \}$
$X_1^{ST} = T_1(X_0^{ST})$
$X_n^{ST} = T_n(\mathrm{PM}_n(X_{n-1}^{ST})), \quad n = 2, 3$
where $\mathrm{PP}(\cdot)$ denotes the patch partition operation, $\mathrm{PM}_n(\cdot)$ the patch merging operation, and $T_n$ the swin transformer block.
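As a minimal sketch, the global branch can be approximated with torchvision's swin transformer by reading out the first three stage outputs of its `features` module; the choice of `swin_t` and the stage indices are assumptions, and the paper's exact configuration (depths [2, 2, 2], heads [4, 8, 16]) may differ.

```python
import torch
from torchvision.models import swin_t

backbone = swin_t(weights=None).features   # Sequential: patch embed, stages, mergings
frame = torch.randn(1, 3, 224, 224)

feats, x = [], frame
for i, stage in enumerate(backbone[:6]):   # patch embedding + first three stages
    x = stage(x)
    if i in (1, 3, 5):                     # outputs of swin stages 1-3
        # torchvision swin keeps tensors as (B, H, W, C); permute to (B, C, H, W)
        # so they can be fused with the CNN-branch feature maps.
        feats.append(x.permute(0, 3, 1, 2))

print([tuple(f.shape) for f in feats])     # e.g. 96, 192, 384 channels
```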

3.2. Decoder

Symmetrically to the encoder, the decoder of DBIFF-NET also comprises three stages. The output from each stage of the encoder is fed into the corresponding interactive fusion module (LDIF, MDIF, and HDIF). Each interactive fusion module is in turn followed by an attention decoder module that performs upsampling. This design implements skip connections between encoder and decoder stages, which preserves richer spatial details and thereby enables the decoder to restore image details more effectively during upsampling.

3.2.1. Interactive Fusion Module

Following the dual-branch encoder, we obtain the multi-scale output features $X_n^{C}$ and $X_n^{ST}$ for each frame. How to fuse these features is a difficult and crucial task. To fully mine and exploit both global and local feature information during fusion, this work designs an interactive fusion module, as shown in Figure 3. This module employs the global features from the swin transformer layer as weight vectors and those from the CNN layer as value vectors, executing an attention mechanism to facilitate dynamic interaction between features. According to the encoding stage, the interactive fusion block is divided into the low-dimensional interactive fusion block (LDIF), the mid-dimensional interactive fusion block (MDIF), and the high-dimensional interactive fusion block (HDIF). The process is described as follows:
$\sigma_n = \mathrm{Softmax}(\mathrm{MLP}_n(X_n^{ST}))$
$X_n^{fused} = \sigma_n \odot \mathrm{Conv}_{1 \times 1}(X_n^{C}) \oplus X_n^{ST}, \quad n = 3, 2, 1$
where $\mathrm{MLP}(\cdot)$ and $\mathrm{Conv}_{1 \times 1}(\cdot)$ are used to adjust the feature size, $\odot$ is the Hadamard product, and $\oplus$ is the element-wise addition operation. $X_1^{fused}$, $X_2^{fused}$, and $X_3^{fused}$ represent the low-dimensional, mid-dimensional, and high-dimensional fused features, respectively.
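A minimal sketch of this fusion rule follows: the swin feature is passed through an MLP and a softmax to produce weights, which re-weight the 1×1-projected CNN feature before the swin feature is added back. The MLP width and the choice of softmax over the channel dimension are assumptions.

```python
import torch
import torch.nn as nn

class InteractiveFusion(nn.Module):
    def __init__(self, cnn_channels, swin_channels):
        super().__init__()
        # MLP produces per-position weights from the swin (global) feature.
        self.mlp = nn.Sequential(
            nn.Linear(swin_channels, swin_channels),
            nn.GELU(),
            nn.Linear(swin_channels, swin_channels),
        )
        # 1x1 convolution aligns the CNN feature to the swin channel width.
        self.proj = nn.Conv2d(cnn_channels, swin_channels, kernel_size=1)

    def forward(self, x_cnn, x_swin):
        # x_cnn: (B, C_c, H, W), x_swin: (B, C_s, H, W) at the same resolution.
        b, c, h, w = x_swin.shape
        tokens = x_swin.flatten(2).transpose(1, 2)          # (B, H*W, C_s)
        weights = torch.softmax(self.mlp(tokens), dim=-1)   # channel-wise softmax
        weights = weights.transpose(1, 2).reshape(b, c, h, w)
        return weights * self.proj(x_cnn) + x_swin          # Hadamard product, then addition

fuse = InteractiveFusion(cnn_channels=64, swin_channels=96)
out = fuse(torch.randn(2, 64, 64, 64), torch.randn(2, 96, 64, 64))
print(out.shape)  # torch.Size([2, 96, 64, 64])
```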
Furthermore, Section 4.5.2 presents a comprehensive experimental analysis that systematically evaluates multiple other fusion strategies, including element-wise addition, Q-attention, and feature concatenation.

3.2.2. Temporal Shift Module

After HDIF, the high-dimensional fused feature $X_3^{fused}$ is fed into the Temporal Shift Module (TSM) to learn temporal information across video frames. TSM is a technique for efficient video understanding that shifts and permutes a portion of the channel features between adjacent frames, so that each layer's output contains channel information from neighboring frames. Specifically, the operation of TSM is shown in Figure 4. First, consecutive frame features are concatenated along the time dimension to form the input of TSM, denoted as $\hat{X}_3^{fused} \in \mathbb{R}^{T \times C \times H \times W}$. Then, a portion of the channel features in each frame is shifted to the next frame, while the corresponding portion of the channel features in the first frame is zero-padded. As shown in the figure, after the temporal shift, the green part of frame $I_2$ is replaced by the red part of frame $I_1$, while the original first two channel features of frame $I_1$ are zero-padded. This mechanism allows channels shifted backward in time to inject past information into the current frame's computation, while channels shifted forward introduce information from future frames. Meanwhile, the remaining channels are left unshifted to preserve the spatial characteristics of each frame. This dynamic redistribution of features between adjacent frames is applied prior to the convolution operation within the residual block, thereby enriching the temporal context and improving the model's effectiveness in temporal modeling. The TSM is computed as follows:
$\hat{X}_3^{fused} = \mathrm{concat}(\{ X_{3,t}^{fused} \mid t \in \{1, 2, 3, 4\} \})$
$X^{TSM} = \mathrm{TSM}(\hat{X}_3^{fused})$
where $X_{3,t}^{fused}$ is the high-dimensional fused feature of frame $I_t$.
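The following is a minimal sketch of the temporal shift itself: a block of channels is shifted forward in time, a block backward, and the rest left in place, with zero padding at the sequence boundaries. The bidirectional split is an assumption; Table 1 only fixes the number of shifted channels to 32.

```python
import torch

def temporal_shift(x, shift_channels=32):
    # x: (B, T, C, H, W) -- consecutive high-dimensional fused features.
    out = torch.zeros_like(x)
    c = shift_channels
    out[:, 1:, :c] = x[:, :-1, :c]              # shift forward: past -> current frame
    out[:, :-1, c:2 * c] = x[:, 1:, c:2 * c]    # shift backward: future -> current frame
    out[:, :, 2 * c:] = x[:, :, 2 * c:]         # remaining channels untouched
    return out

x = torch.randn(1, 4, 256, 16, 16)              # 4 frames of fused features
y = temporal_shift(x)
# The first frame's shifted-forward channels are zero-padded.
print(y.shape, torch.equal(y[:, 0, :32], torch.zeros_like(y[:, 0, :32])))
```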

3.2.3. Attention Decoder Module

In order to enhance the sensitivity of the network to key features, a channel attention mechanism is integrated into the decoder. The detailed structure of this attention decoder module is shown in Figure 5. The process can be represented by the following formulas:
$\hat{X}_n^{fused} = \mathrm{concat}(\{ X_{n,t}^{fused} \mid t \in \{1, 2, 3, 4\} \}), \quad n = 3, 2, 1$
$\hat{I}_n^{up} = U(\hat{I}_{n+1}^{up} \oplus \hat{X}_n^{fused}), \quad n = 3, 2, 1$
$\sigma_n^{out} = \mathrm{Sigmoid}(f_{\mathrm{Conv}}(\mathrm{AvgPool}(\hat{I}_n^{up}))), \quad n = 3, 2, 1$
$\hat{I}_n^{out} = \sigma_n^{out} \odot \hat{I}_n^{up}, \quad n = 3, 2, 1$
where $U(\cdot)$ represents the upsampling operation, which includes a ReLU activation, a deconvolution, and a normalization operation; $f_{\mathrm{Conv}}(\cdot)$ comprises two $1 \times 1$ convolution layers and a ReLU activation; and $\mathrm{AvgPool}(\cdot)$ is the global average pooling operation. $\hat{X}_n^{fused}$ represents the concatenated features of consecutive frames at the $n$-th stage. For $n = 3$, $\hat{I}_4^{up}$ is taken to be $X^{TSM}$.
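A minimal sketch of one attention decoder stage follows: upsampling (deconvolution, normalization, ReLU) followed by a channel attention gate built from global average pooling, two 1×1 convolutions, and a sigmoid. Channel widths and the reduction ratio are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AttentionDecoderStage(nn.Module):
    def __init__(self, in_channels, out_channels, reduction=4):
        super().__init__()
        self.up = nn.Sequential(                      # the U(.) operation
            nn.ConvTranspose2d(in_channels, out_channels, kernel_size=2, stride=2),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )
        self.gate = nn.Sequential(                    # channel attention sigma_out
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(out_channels, out_channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels // reduction, out_channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, prev_up, skip_fused):
        # prev_up: output of the previous stage; skip_fused: fused encoder feature.
        x = self.up(prev_up + skip_fused)
        return self.gate(x) * x                       # re-weight channels

stage = AttentionDecoderStage(in_channels=256, out_channels=128)
out = stage(torch.randn(1, 256, 16, 16), torch.randn(1, 256, 16, 16))
print(out.shape)  # torch.Size([1, 128, 32, 32])
```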

3.2.4. Loss Function

The objective of DBIFF-NET is to predict the next frame from a sequence of consecutive input frames. The discrepancy between the predicted and ground-truth frames is quantified using the $L_2$ distance:
$\mathcal{L}_{dis} = \| I^k - \hat{I} \|_2$
where $\hat{I}$ is the frame predicted by DBIFF-NET and $I^k$ is the key frame.
However, the $L_2$ loss is computed only from pixel-level differences and cannot effectively measure the structural similarity of images. To address this limitation, the gradient difference loss (GDL) and the multi-scale structural similarity loss are used to complement the pixel loss and improve the clarity of the predicted frames:
$\mathcal{L}_{gdl} = \sum_{i,j} \left\| \left| I_{i,j}^{k} - I_{i-1,j}^{k} \right| - \left| \hat{I}_{i,j} - \hat{I}_{i-1,j} \right| \right\|_1 + \left\| \left| I_{i,j}^{k} - I_{i,j-1}^{k} \right| - \left| \hat{I}_{i,j} - \hat{I}_{i,j-1} \right| \right\|_1$
$\mathcal{L}_{mss} = 1 - \prod_{m=1}^{M} \left( \frac{2 \mu_I \mu_{\hat{I}} + c_1}{\mu_I^2 + \mu_{\hat{I}}^2 + c_1} \right)^{\beta_m} \left( \frac{2 \sigma_{I\hat{I}} + c_2}{\sigma_I^2 + \sigma_{\hat{I}}^2 + c_2} \right)^{\gamma_m}$
where $\mu_I$ and $\mu_{\hat{I}}$ denote the means of the key frame and the predicted frame, respectively, $\sigma_{I\hat{I}}$ is the covariance between the key frame and the predicted frame, and $\sigma_I$ and $\sigma_{\hat{I}}$ are their standard deviations. $c_1$ and $c_2$ denote stabilization constants that prevent division by zero.
Finally, the various loss functions above are integrated into the complete loss function as follows:
$\mathcal{L}_{con}(I, \hat{I}) = \mathcal{L}_{dis} + \mathcal{L}_{gdl} + \mathcal{L}_{mss}$
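A minimal sketch of this combined objective is given below: an L2 term, a gradient difference term, and a multi-scale SSIM term with equal weighting, as in the equation above. The `pytorch_msssim` package is an assumed third-party dependency (pip install pytorch-msssim), and `F.mse_loss` stands in for the L2 distance.

```python
import torch
import torch.nn.functional as F
from pytorch_msssim import ms_ssim   # assumed dependency

def gradient_difference_loss(pred, target):
    # L1 differences of absolute vertical/horizontal image gradients.
    dy_p = (pred[..., 1:, :] - pred[..., :-1, :]).abs()
    dy_t = (target[..., 1:, :] - target[..., :-1, :]).abs()
    dx_p = (pred[..., :, 1:] - pred[..., :, :-1]).abs()
    dx_t = (target[..., :, 1:] - target[..., :, :-1]).abs()
    return (dy_p - dy_t).abs().mean() + (dx_p - dx_t).abs().mean()

def prediction_loss(pred, key_frame):
    l_dis = F.mse_loss(pred, key_frame)                       # pixel term
    l_gdl = gradient_difference_loss(pred, key_frame)         # sharpness term
    l_mss = 1.0 - ms_ssim(pred, key_frame, data_range=1.0)    # structural term
    return l_dis + l_gdl + l_mss

pred, key = torch.rand(2, 3, 256, 256), torch.rand(2, 3, 256, 256)
print(prediction_loss(pred, key).item())
```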

3.2.5. Anomaly Detection

During the testing phase, the model identifies anomalies by computing anomaly scores for consecutive frames. Following previous research, the anomaly score is a normalized value in the range [0, 1], formulated as follows:
$s(t) = 1 - \frac{\mathrm{PSNR}_t - \min_t(\mathrm{PSNR}_t)}{\max_t(\mathrm{PSNR}_t) - \min_t(\mathrm{PSNR}_t)}$
where $\mathrm{PSNR}$ is the peak signal-to-noise ratio, an assessment metric for image quality, computed as follows:
$\mathrm{PSNR}_t = 10 \log_{10} \frac{[\max_{\hat{I}_t}]^2}{\frac{1}{M} \sum_{i=1}^{M} (I_{t,i}^{k} - \hat{I}_{t,i})^2}$
where $M$ is the total number of pixels in a frame, $t$ denotes the frame index, and $\max_{\hat{I}_t}$ is the maximum pixel value of the predicted frame.
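The following is a minimal sketch of this scoring step: compute the PSNR between each predicted frame and its key frame, min-max normalize over the test video, and invert. Using a fixed peak value of 1.0 and a small epsilon against flat videos are assumptions.

```python
import torch

def psnr(pred, target, max_val=1.0):
    mse = torch.mean((target - pred) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)

def anomaly_scores(preds, keys):
    # preds, keys: lists of (C, H, W) tensors for one test video.
    p = torch.stack([psnr(p_t, k_t) for p_t, k_t in zip(preds, keys)])
    p_norm = (p - p.min()) / (p.max() - p.min() + 1e-8)   # min-max normalization
    return 1.0 - p_norm                                   # higher score = more anomalous

preds = [torch.rand(3, 256, 256) for _ in range(5)]
keys = [torch.rand(3, 256, 256) for _ in range(5)]
print(anomaly_scores(preds, keys))
```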

4. Comparative Experiments

4.1. Datasets

To evaluate the performance of DBIFF-NET, this work conducts a series of experiments on three benchmark datasets widely used in VAD: the UCSD ped2 dataset [40], CUHK Avenue [41], and the ShanghaiTech Campus dataset [42]. Detailed descriptions of these three datasets are as follows:
  • UCSD ped2 Dataset. The UCSD ped2 dataset was recorded on the pedestrian walkways of the University of California San Diego. Sample cases are illustrated in Figure 6a. In this dataset, walking is defined as normal behavior, while activities such as cycling and skateboarding are considered anomalies. It contains 16 training videos and 12 testing videos. In experiments, the training videos are extracted into 2550 frames, while the testing videos are extracted into 2010 frames, all with a resolution of 240 × 360 pixels.
  • CUHK Avenue. The CUHK Avenue dataset was recorded on the CUHK campus avenue. It consists of 16 training videos and 21 testing videos, covering 47 abnormal events, such as jumping, throwing objects, and dancing. Figure 6b shows sample cases of this dataset. In experiments, the training videos are extracted into 15,328 frames and the testing videos into 15,324 frames, all with a resolution of 360 × 640 pixels.
  • ShanghaiTech Campus dataset. The ShanghaiTech Campus dataset was created by ShanghaiTech University, featuring footage captured in diverse environments including the university campus, Shanghai streets, and commercial districts. It contains 130 abnormal events of various types, such as climbing, cycling, and jumping, thereby presenting considerable challenges for video anomaly detection. Figure 6c shows sample cases of this dataset. The dataset comprises 330 training videos and 107 testing videos. In experiments, the training videos are extracted into 274,515 frames and the testing videos into 42,883 frames, all with a resolution of 480 × 856 pixels.

4.2. Training Details

During training, five consecutive frames are selected, and each frame is resized to 256 × 256 pixels for UCSD Ped2, 192 × 320 pixels for CUHK Avenue, and 192 × 288 pixels for ShanghaiTech Campus. The first four frames are fed into the model to predict the fifth frame, and the predicted frame is compared with the ground-truth fifth frame to compute the loss. The Adam optimizer is employed with a learning rate of $2 \times 10^{-4}$. All experiments are conducted on four NVIDIA RTX 4090 GPUs. The hyperparameters of DBIFF-NET are listed in Table 1.
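A minimal sketch of one training epoch under this setup follows: clips of five consecutive frames, the first four predicting the fifth, optimized with Adam at a learning rate of 2e-4. `model` is the hypothetical DBIFF-NET module, and a simple MSE term stands in for the full combined loss of Section 3.2.4.

```python
import torch
import torch.nn.functional as F

def train_one_epoch(model, loader, optimizer, device="cuda"):
    model.train()
    for clip in loader:                      # clip: (B, 5, C, H, W)
        clip = clip.to(device)
        inputs, key_frame = clip[:, :4], clip[:, 4]
        pred = model(inputs)                 # predicted fifth frame
        loss = F.mse_loss(pred, key_frame)   # placeholder for the full L_con
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Usage (assumed): optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
```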

4.3. Evaluation Metric

This work uses the area under the ROC curve (ROC AUC), the most commonly used performance metric in VAD, to evaluate the model. AUC values range over [0, 1]; the higher the value, the better the anomaly detection performance.
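A minimal sketch of the evaluation follows, assuming per-frame anomaly scores and binary ground-truth labels (1 = anomalous) concatenated over all test videos; the scores and labels shown are illustrative placeholders.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

scores = np.concatenate([np.array([0.10, 0.20, 0.90, 0.80]),   # test video 1
                         np.array([0.05, 0.70, 0.60])])        # test video 2
labels = np.concatenate([np.array([0, 0, 1, 1]),
                         np.array([0, 1, 1])])
print(f"frame-level AUC: {roc_auc_score(labels, scores):.3f}")
```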

4.4. Comparison with State-of-the-Art

To better evaluate the effectiveness of the proposed method, ten state-of-the-art (SOTA) methods are adopted for comparative experiments, including AE-Conv2D [10], Mem-AE [30], MNAD-Recon [43], zxVAD [44], MAMC-Net [45], Dual GroupGAN [46], BDHF [47], CLAE [23], Dang et al. [20], and MemATr [48]. Table 2 presents the performance of DBIFF-NET and these comparison methods on the UCSD Ped2, CUHK Avenue, and ShanghaiTech Campus datasets. DBIFF-NET achieves 97.7%, 84.5%, and 73.8% AUC on the UCSD Ped2, CUHK Avenue, and ShanghaiTech Campus datasets, respectively, surpassing most SOTA methods. Compared to traditional autoencoder methods (e.g., AE-Conv2D and Mem-AE), our method yields a substantial improvement, indicating that the proposed model provides more effective modeling for anomaly detection. Table 3 presents a comparative analysis of our model against the method by Dang et al. (2025) [20] in terms of precision, recall, and F1-score. The results demonstrate that DBIFF-Net consistently surpasses theirs in F1-score across all three datasets. Furthermore, the notably higher recall of DBIFF-Net indicates a heightened sensitivity in detecting anomalies under complex scenarios.
It is worth noting that DBIFF-NET demonstrates more reliable performance on the UCSD Ped2 dataset than on the ShanghaiTech Campus dataset. This can be attributed to the larger scale and greater diversity of anomalous events in the ShanghaiTech Campus dataset, which makes anomaly detection on it more challenging. It should be emphasized that DBIFF-NET does not utilize any auxiliary techniques, such as adversarial learning, data augmentation, or memory enhancement, yet it still outperforms many advanced models. This suggests that a well-designed, simple multi-scale method can achieve performance competitive with or even superior to more complex approaches.
Although DBIFF-Net outperforms the comparison methods on the UCSD Ped2 and ShanghaiTech Campus datasets, its performance on CUHK Avenue is suboptimal. Further analysis revealed that the CUHK Avenue dataset contains noisy or corrupted videos. For instance, the training set includes videos of children running and jumping, which significantly contaminates the training data; additionally, in some test videos, only partial body segments of dancers are visible. These data quality issues substantially challenge the model's detection capability. We note that other existing methods also struggle to correctly identify such anomalous clips, and this particular issue remains understudied; it therefore represents a valuable and promising research direction, which we plan to prioritize in future work. From a modeling standpoint, DBIFF-NET shows significant improvements on the complex ShanghaiTech Campus dataset, while its results on CUHK Avenue are relatively weaker. This disparity is largely due to the use of the swin transformer as the global branch, which requires a larger dataset for effective training: the training set of ShanghaiTech Campus is roughly ten times larger than that of CUHK Avenue, which contributes to the superior detection performance observed there.

4.5. Ablation Study and Analysis

To evaluate the impact of each module of the proposed method on performance, this section conducts ablation experiments on the UCSD Ped2 dataset, covering the dual-branch network structure, the fusion method within DBIFF-NET, the channel attention mechanism, and the temporal shift module (TSM).

4.5.1. The Impact of Dual-Branch Networks on Performance

To investigate the impact of the dual-branch network on model performance, this experiment replaces the encoding module in the DBIFF-NET with a single-stream CNN and a single-stream swin transformer, respectively. The results are presented in Table 4. While both single-stream networks achieve AUC scores exceeding 80%, the model attains its best performance only when the dual-branch network is utilized. This improvement can be attributed to the multi-scale encoding approach, which enables the model to integrate local and global features for a comprehensive understanding of video frames, thereby enhancing its anomaly detection capability.

4.5.2. Impact of Fusion Methods on Performance

Multi-scale feature fusion is often challenging. In constructing DBIFF-NET, this study considered several fusion strategies, including element-wise addition (SUM), Q-attention (QA), and feature concatenation (CON). To illustrate these fusion methods, we take the two features $X_n^{C}$ and $X_n^{ST}$ as examples:
$\mathrm{SUM}(\cdot) = X_n^{C} \oplus X_n^{ST}$
$\mathrm{QA}(\cdot) = \mathrm{Softmax}(q X_n^{C}) \odot X_n^{C} \oplus \mathrm{Softmax}(q X_n^{ST}) \odot X_n^{ST}$
$\mathrm{CON}(\cdot) = \mathrm{concat}(X_n^{C}, X_n^{ST})$
where $q$ represents a learnable weight. Comparative results of the fusion strategies are presented in Table 5. On UCSD Ped2, the QA strategy performs on par with the interactive fusion strategy and outperforms the other strategies. However, further experiments on the ShanghaiTech Campus dataset show that the interactive fusion strategy achieves an AUC of 73.8%, notably higher than the 72.8% of the QA strategy. Therefore, the interactive fusion strategy is adopted in DBIFF-NET.
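A minimal sketch of these three baseline fusion strategies is given below, assuming both branch features have already been projected to the same shape (B, C, H, W). Modeling the weight q as a single learnable per-channel vector is an assumption.

```python
import torch
import torch.nn as nn

class SumFusion(nn.Module):
    def forward(self, x_cnn, x_swin):
        return x_cnn + x_swin                              # element-wise addition

class QAFusion(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.q = nn.Parameter(torch.ones(1, channels, 1, 1))  # learnable weight q
    def forward(self, x_cnn, x_swin):
        w_c = torch.softmax(self.q * x_cnn, dim=1)         # channel-wise attention
        w_s = torch.softmax(self.q * x_swin, dim=1)
        return w_c * x_cnn + w_s * x_swin

class ConcatFusion(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.reduce = nn.Conv2d(2 * channels, channels, kernel_size=1)  # restore width
    def forward(self, x_cnn, x_swin):
        return self.reduce(torch.cat([x_cnn, x_swin], dim=1))

x1, x2 = torch.randn(2, 96, 32, 32), torch.randn(2, 96, 32, 32)
print(QAFusion(96)(x1, x2).shape)  # torch.Size([2, 96, 32, 32])
```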

4.5.3. Effectiveness of Each Component in Decoder

Table 6 reports ablation studies on each component of the decoder, including the interactive fusion module, the TSM module, and the channel attention module. Both TSM and channel attention effectively enhance the anomaly detection capability of DBIFF-NET; the addition of TSM in particular boosts performance by 3.6%. This shows that the TSM and channel attention modules play a critical role in video event restoration. In addition, "more TSM" indicates that TSM is applied after the LDIF, MDIF, and HDIF modules, whereas in the final model TSM is applied only after HDIF. The results demonstrate that simply adding more TSM modules does not yield the best performance, which supports the rationality of the DBIFF-NET structure.

4.5.4. Visualization

In this section, we visualize the anomaly detection results of DBIFF-Net. Figure 7 and Figure 8 show the anomaly scores generated by DBIFF-Net on test videos from the UCSD Ped2 and ShanghaiTech Campus datasets, respectively. The x-axis represents the video frames, and the y-axis denotes the anomaly score. As the curves show, both test videos contain a large number of normal frames, presenting a significant challenge for anomaly detection. Nevertheless, our model accurately captures the anomalies, demonstrating the effectiveness of our approach.
Specifically, Figure 7 (UCSD Ped2 dataset) shows a consistently low anomaly score over the first 80 frames. At approximately frame 90, the anomaly score begins to increase rapidly. Upon examining the frames around frame 90, we observe that a bicycle starts to enter the scene. By frame 100, when the bicycle becomes fully visible, the anomaly score stabilizes at an elevated level and continues rising as the object becomes more distinct.
The anomaly score curve in Figure 8 exhibits significant fluctuations, reflecting the complexity of the anomalous behavior in this ShanghaiTech Campus video segment. The curve fluctuates but remains consistently low before frame 250. The anomaly score then increases rapidly starting around frame 280 and peaks at approximately frame 290 (marked (c) Anomaly frame), when a black backpack enters the scene. Subsequently, the anomaly score drops again as the backpack disappears from view, reaching a low point, and then climbs sharply as the backpack descends. When the backpack is thrown up a second time, the curve shows the same trend: the score spikes when the anomalous object appears and drops sharply when the anomaly concludes. This alignment between score peaks and anomalous events clearly demonstrates the model's ability to distinguish anomalies from normal activity.

5. Conclusions and Future Work

This article introduces a novel video anomaly detection model, DBIFF-NET, designed to enhance the accuracy and efficiency of anomaly detection in video surveillance. The model uses multi-scale encoding to extract global and local features from video frames, while leveraging residual skip connections, interactive fusion, channel attention, and TSM to better distinguish normal from anomalous data. DBIFF-NET demonstrates superior anomaly detection performance compared to most state-of-the-art methods across three benchmark datasets, and the ablation studies show that multi-scale encoding, the interactive fusion module, channel attention, and TSM collectively enhance its performance. Despite its strong overall performance, the comparisons on the CUHK Avenue and ShanghaiTech Campus datasets reveal a limitation: the model's representational capacity appears constrained under data-scarce conditions. In addition, the computational overhead remains non-negligible, with a model size of 371.57 MB and a cost of 232.4 GFLOPs for a four-frame input. In future work, data augmentation techniques will be explored on the CUHK Avenue dataset to enhance robustness and improve performance under limited data. We also plan to optimize parameter usage and improve the model's response time to facilitate more efficient deployment.

Author Contributions

Conceptualization, M.H. and Z.S.; methodology, M.H., J.L. and J.H.; software, J.L. and J.H.; validation, J.L.; formal analysis, M.H. and J.L.; investigation, J.L.; resources, M.H. and Z.S.; data curation, J.L. and J.H.; writing—original draft preparation, M.H.; writing—review and editing, M.H., Z.S., and J.H.; visualization, J.L.; supervision, M.H., Z.S., and J.L.; project administration, M.H. and J.L.; funding acquisition, Z.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Research Foundation of Education Bureau of Hunan Province, China (Grant No. 22A0687).

Data Availability Statement

The data presented in this study are openly available in UCSD ped2 http://www.svcl.ucsd.edu/projects/anomaly/dataset.htm (accessed on 10 May 2025), CUHK Avenue https://www.cse.cuhk.edu.hk/leojia/projects/detectabnormal/dataset.html (accessed on 10 May 2025), and ShanghaiTech Campus https://svip-lab.github.io/dataset/campus_dataset.html (accessed on 10 May 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Kashef, M.; Visvizi, A.; Troisi, O. Smart city as a smart service system: Human-computer interaction and smart city surveillance systems. Comput. Hum. Behav. 2021, 124, 106923. [Google Scholar] [CrossRef]
  2. Samaila, Y.A.; Sebastian, P.; Singh, N.S.S.; Shuaibu, A.N.; Ali, S.S.A.; Amosa, T.I.; Abro, G.E.M.; Shuaibu, I. Video anomaly detection: A systematic review of issues and prospects. Neurocomputing 2024, 591, 127726. [Google Scholar] [CrossRef]
  3. Lin, Y. Spoken instruction understanding in air traffic control: Challenge, technique, and application. Aerospace 2021, 8, 65. [Google Scholar] [CrossRef]
  4. Wu, P.; Pan, C.; Yan, Y.; Pang, G.; Wang, P.; Zhang, Y. Deep learning for video anomaly detection: A review. arXiv 2024, arXiv:2409.05383. [Google Scholar] [CrossRef]
  5. Nilsson, F.; Axis Communications. Intelligent Network Video: Understanding Modern Video Surveillance Systems; CRC Press: Boca Raton, FL, USA, 2023. [Google Scholar]
  6. Taiwo, O.; Ezugwu, A.E.; Oyelade, O.N.; Almutairi, M.S. Enhanced intelligent smart home control and security system based on deep learning model. Wirel. Commun. Mob. Comput. 2022, 2022, 9307961. [Google Scholar] [CrossRef]
  7. Liu, Y.; Yang, D.; Wang, Y.; Liu, J.; Liu, J.; Boukerche, A.; Sun, P.; Song, L. Generalized video anomaly event detection: Systematic taxonomy and comparison of deep models. ACM Comput. Surv. 2024, 56, 189. [Google Scholar] [CrossRef]
  8. Tur, A.O.; Dall’Asen, N.; Beyan, C.; Ricci, E. Exploring diffusion models for unsupervised video anomaly detection. In Proceedings of the 2023 IEEE International Conference on Image Processing (ICIP), Kuala Lumpur, Malaysia, 8–11 October 2023; IEEE: New York, NY, USA, 2023; pp. 2540–2544. [Google Scholar]
  9. Astrid, M.; Zaheer, M.Z.; Lee, J.Y.; Lee, S.I. Learning not to reconstruct anomalies. arXiv 2021, arXiv:2110.09742. [Google Scholar] [CrossRef]
  10. Hasan, M.; Choi, J.; Neumann, J.; Roy-Chowdhury, A.K.; Davis, L.S. Learning temporal regularity in video sequences. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 733–742. [Google Scholar]
  11. Su, Y.; Tan, Y.; Xing, M.; An, S. VPE-WSVAD: Visual prompt exemplars for weakly-supervised video anomaly detection. Knowl.-Based Syst. 2024, 299, 111978. [Google Scholar] [CrossRef]
  12. Baradaran, M.; Bergevin, R. A critical study on the recent deep learning based semi-supervised video anomaly detection methods. Multimed. Tools Appl. 2024, 83, 27761–27807. [Google Scholar] [CrossRef]
  13. Issa, O.; Shanableh, T. Static video summarization using video coding features with frame-level temporal subsampling and deep learning. Appl. Sci. 2023, 13, 6065. [Google Scholar] [CrossRef]
  14. Chuang, C.H.; Hsieh, J.W.; Tsai, L.W.; Chen, S.Y.; Fan, K.C. Carried object detection using ratio histogram and its application to suspicious event analysis. IEEE Trans. Circuits Syst. Video Technol. 2009, 19, 911–916. [Google Scholar] [CrossRef]
  15. Mo, X.; Monga, V.; Bala, R.; Fan, Z. Adaptive sparse representations for video anomaly detection. IEEE Trans. Circuits Syst. Video Technol. 2013, 24, 631–645. [Google Scholar] [CrossRef]
  16. Abidine, B.M.; Fergani, L.; Fergani, B.; Oussalah, M. The joint use of sequence features combination and modified weighted SVM for improving daily activity recognition. Pattern Anal. Appl. 2018, 21, 119–138. [Google Scholar] [CrossRef]
  17. Hu, C.; Chen, Y.; Hu, L.; Peng, X. A novel random forests based class incremental learning method for activity recognition. Pattern Recognit. 2018, 78, 277–290. [Google Scholar] [CrossRef]
  18. Saligrama, V.; Chen, Z. Video anomaly detection based on local statistical aggregates. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; IEEE: New York, NY, USA, 2012; pp. 2112–2119. [Google Scholar]
  19. Wang, Y.; Qin, C.; Bai, Y.; Xu, Y.; Ma, X.; Fu, Y. Making reconstruction-based method great again for video anomaly detection. In Proceedings of the 2022 IEEE International Conference on Data Mining (ICDM), Orlando, FL, USA, 28 November–1 December 2022; IEEE: New York, NY, USA, 2022; pp. 1215–1220. [Google Scholar]
  20. Dang, Y.; Chen, J.; Chen, P.; Gao, N.; Huan, R.; Zhao, D. Generate anomalies from normal: A partial pseudo-anomaly augmented approach for video anomaly detection. Vis. Comput. 2025, 41, 3843–3852. [Google Scholar] [CrossRef]
  21. Liu, T.; Lam, K.M.; Kong, J. Distilling privileged knowledge for anomalous event detection from weakly labeled videos. IEEE Trans. Neural Netw. Learn. Syst. 2023, 35, 12627–12641. [Google Scholar] [CrossRef] [PubMed]
  22. Jiang, M.; Wang, W.; Kong, J. Multi-scale Differential Perception Network for Video Anomaly Detection. In Proceedings of the International Conference on Neural Information Processing, Auckland, New Zealand, 2–6 December 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 243–257. [Google Scholar]
  23. Qiu, S.; Ye, J.; Zhao, J.; He, L.; Liu, L.; Huang, X. Video anomaly detection guided by clustering learning. Pattern Recognit. 2024, 153, 110550. [Google Scholar] [CrossRef]
  24. Zhang, M.; Wang, J.; Qi, Q.; Sun, H.; Zhuang, Z.; Ren, P.; Ma, R.; Liao, J. Multi-scale video anomaly detection by multi-grained spatio-temporal representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 17385–17394. [Google Scholar]
  25. Su, Y.; Tan, Y.; An, S.; Xing, M. Anomalies cannot materialize or vanish out of thin air: A hierarchical multiple instance learning with position-scale awareness for video anomaly detection. Expert Syst. Appl. 2024, 254, 124392. [Google Scholar] [CrossRef]
  26. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  27. Lin, J.; Gan, C.; Han, S. Tsm: Temporal shift module for efficient video understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 7083–7093. [Google Scholar]
  28. Liu, J.; Liu, Y.; Lin, J.; Li, J.; Cao, L.; Sun, P.; Hu, B.; Song, L.; Boukerche, A.; Leung, V.C. Networking systems for video anomaly detection: A tutorial and survey. ACM Comput. Surv. 2025, 57, 270. [Google Scholar] [CrossRef]
  29. Chen, S.; Guo, W. Auto-encoders in deep learning—a review with new perspectives. Mathematics 2023, 11, 1777. [Google Scholar] [CrossRef]
  30. Gong, D.; Liu, L.; Le, V.; Saha, B.; Mansour, M.R.; Venkatesh, S.; Hengel, A.v.d. Memorizing normality to detect anomaly: Memory-augmented deep autoencoder for unsupervised anomaly detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1705–1714. [Google Scholar]
  31. Liu, Z.; Nie, Y.; Long, C.; Zhang, Q.; Li, G. A hybrid video anomaly detection framework via memory-augmented flow reconstruction and flow-guided frame prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 13588–13597. [Google Scholar]
  32. Zhao, Y.; Deng, B.; Shen, C.; Liu, Y.; Lu, H.; Hua, X.S. Spatio-temporal autoencoder for video anomaly detection. In Proceedings of the 25th ACM International Conference on Multimedia, Mountain View, CA, USA, 23–27 October 2017; pp. 1933–1941. [Google Scholar]
  33. Yuan, H.; Cai, Z.; Zhou, H.; Wang, Y.; Chen, X. Transanomaly: Video anomaly detection using video vision transformer. IEEE Access 2021, 9, 123977–123986. [Google Scholar] [CrossRef]
  34. Lee, Y.; Kang, P. Anovit: Unsupervised anomaly detection and localization with vision transformer-based encoder-decoder. IEEE Access 2022, 10, 46717–46724. [Google Scholar] [CrossRef]
  35. Cai, R.; Zhang, H.; Liu, W.; Gao, S.; Hao, Z. Appearance-motion memory consistency network for video anomaly detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; Volume 35, pp. 938–946. [Google Scholar]
  36. Thakare, K.V.; Sharma, N.; Dogra, D.P.; Choi, H.; Kim, I.J. A multi-stream deep neural network with late fuzzy fusion for real-world anomaly detection. Expert Syst. Appl. 2022, 201, 117030. [Google Scholar] [CrossRef]
  37. Himeur, Y.; Al-Maadeed, S.; Kheddar, H.; Al-Maadeed, N.; Abualsaud, K.; Mohamed, A.; Khattab, T. Video surveillance using deep transfer learning and deep domain adaptation: Towards better generalization. Eng. Appl. Artif. Intell. 2023, 119, 105698. [Google Scholar] [CrossRef]
  38. Liu, Y.; Ju, B.; Yang, D.; Peng, L.; Li, D.; Sun, P.; Li, C.; Yang, H.; Liu, J.; Song, L. Memory-enhanced spatial-temporal encoding framework for industrial anomaly detection system. Expert Syst. Appl. 2024, 250, 123718. [Google Scholar] [CrossRef]
  39. Ma, H.; Sun, Z.; Su, Y.; Wang, H.; Li, S.; Yu, Z.; Kang, Y.; Xu, H. Cross-Modal Two-Stream Target Focused Network for Video Anomaly Detection. In Proceedings of the International Conference in Communications, Signal Processing, and Systems, Changbaishan, China, 23–24 July 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 69–78. [Google Scholar]
  40. Li, W.; Mahadevan, V.; Vasconcelos, N. Anomaly detection and localization in crowded scenes. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 36, 18–32. [Google Scholar] [CrossRef]
  41. Lu, C.; Shi, J.; Jia, J. Abnormal event detection at 150 fps in matlab. In Proceedings of the IEEE International Conference on Computer Vision, Sydney, Australia, 1–8 December 2013; pp. 2720–2727. [Google Scholar]
  42. Luo, W.; Liu, W.; Gao, S. A revisit of sparse coding based anomaly detection in stacked rnn framework. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 341–349. [Google Scholar]
  43. Park, H.; Noh, J.; Ham, B. Learning memory-guided normality for anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 14372–14381. [Google Scholar]
  44. Aich, A.; Peng, K.C.; Roy-Chowdhury, A.K. Cross-domain video anomaly detection without target domain adaptation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–7 January 2023; pp. 2579–2591. [Google Scholar]
  45. Ning, Z.; Wang, Z.; Liu, Y.; Liu, J.; Song, L. Memory-enhanced appearance-motion consistency framework for video anomaly detection. Comput. Commun. 2024, 216, 159–167. [Google Scholar] [CrossRef]
  46. Sun, Z.; Wang, P.; Zheng, W.; Zhang, M. Dual GroupGAN: An unsupervised four-competitor (2V2) approach for video anomaly detection. Pattern Recognit. 2024, 153, 110500. [Google Scholar] [CrossRef]
  47. Shen, G.; Ouyang, Y.; Lu, J.; Yang, Y.; Sanchez, V. Advancing video anomaly detection: A bi-directional hybrid framework for enhanced single-and multi-task approaches. IEEE Trans. Image Process. 2024, 33, 6865–6880. [Google Scholar] [CrossRef] [PubMed]
  48. Chang, J.; Zhen, P.; Yan, X.; Yang, Y.; Gao, Z.; Chen, H. MemATr: An Efficient and Lightweight Memory-augmented Transformer for Video Anomaly Detection. ACM Trans. Embed. Comput. Syst. 2025, 24, 38. [Google Scholar] [CrossRef]
Figure 1. The complete network structure of DBIFF-Net. DBIFF-Net consists of four parts: input as a sequence of frames, output as a predicted frame, a dual-branch encoder, and a decoder.
Figure 2. The overall architecture of the swin transformer block.
Figure 3. The overall architecture of interactive fusion module.
Figure 4. The overall architecture of TSM.
Figure 5. The overall architecture of attention decoder module.
Figure 6. Examples of normal and abnormal frames in the UCSD ped2, CUHK Avenue, and ShanghaiTech Campus datasets, respectively.
Figure 7. Anomaly score of the test video in the UCSD Ped2 dataset.
Figure 8. Anomaly score of the test video in the ShanghaiTech Campus dataset.
Table 1. The hyperparameters of DBIFF-NET.
Hyperparameter | Setting
Epoch | 120
Batch size | 16
Learning rate | $2 \times 10^{-4}$
Optimizer | Adam
Length of the input video frame sequence (t) | 5
Number of stages (N) of the convolution and swin transformer modules | 3
Patch size | 4
Depth of swin transformer blocks | [2, 2, 2]
Attention heads of swin transformer blocks | [4, 8, 16]
Dropout | 0.5
Number of channels for temporal shift | 32
Table 2. Comparison with SOTA methods in terms of AUC (%) on the UCSD Ped2, CUHK Avenue, and ShanghaiTech Campus datasets.
Methods | UCSD Ped2 | CUHK Avenue | ShanghaiTech Campus
AE-Conv2D [10] | 90.0 | 70.2 | 60.9
Mem-AE [30] | 94.1 | 71.1 | 71.2
MNAD-Recon [43] | 90.2 | 82.8 | 69.8
zxVAD [44] | 95.8 | 83.1 | 71.6
MAMC-Net [45] | 96.7 | 87.6 | 71.5
Dual GroupGAN [46] | 96.6 | 85.5 | 73.1
BDHF [47] | 96.4 | 86.5 | 73.6
CLAE [23] | 90.8 | 83.1 | 73.3
Dang et al. [20] | 96.0 | 85.9 | 73.3
MemATr [48] | 95.2 | 83.8 | 73.4
DBIFF-Net (Ours) | 97.7 | 84.5 | 73.8
The bold entries show the best results.
Table 3. The performance of DBIFF-Net on three datasets.
Dataset | Method | Precision | Recall | F1-Score
UCSD Ped2 | Dang et al. [20] | 96.0 | 93.6 | 94.8
UCSD Ped2 | DBIFF-Net (Ours) | 97.1 | 96.2 | 96.7
CUHK Avenue | Dang et al. [20] | 62.1 | 71.4 | 64.4
CUHK Avenue | DBIFF-Net (Ours) | 57.4 | 75.4 | 65.2
ShanghaiTech Campus | Dang et al. [20] | 53.2 | 85.6 | 65.0
ShanghaiTech Campus | DBIFF-Net (Ours) | 58.2 | 74.1 | 65.2
The bold entries show the best results.
Table 4. Ablation experiment of different branches on the UCSD Ped2 dataset.
Methods | UCSD Ped2
Single-CNN | 97.2
Single-swin transformer | 84.1
DBIFF-NET (dual-branch) | 97.7
The bold entries show the best results.
Table 5. Ablation experiment of fusion methods on the UCSD Ped2 dataset.
Fusion Methods | UCSD Ped2
SUM | 97.2
QA | 97.7
CON | 96.9
Interactive Fusion | 97.7
The bold entries show the best results.
Table 6. Ablation experiment of TSM and channel attention on the UCSD Ped2 dataset.
Interactive Fusion Module | TSM | Channel Attention | ROC AUC
96.4
96.5
94.1
97.2
✓✓97.0
97.7
The bold entries show the best results. ✓ denotes that the module is used. ✓✓ represents more TSM.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
