Industrial-AdaVAD: Adaptive Industrial Video Anomaly Detection Empowered by Edge Intelligence

Xiao, Jie; Shen, Haocheng; Ding, Yasan; Guo, Bin

doi:10.3390/math13172711

¹ School of Computer Science, Northwestern Polytechnical University, Xi’an 710072, China

² ZTE Corporation, Xi’an 710126, China

* Authors to whom correspondence should be addressed.

Mathematics 2025, 13(17), 2711; https://doi.org/10.3390/math13172711

Submission received: 24 July 2025 / Revised: 13 August 2025 / Accepted: 14 August 2025 / Published: 22 August 2025

(This article belongs to the Special Issue Mathematical Method for Artificial Intelligence and Mobile Edge Computing)

Abstract

The rapid advancement of Artificial Intelligence of Things (AIoT) has driven an urgent demand for intelligent video anomaly detection (VAD) to ensure industrial safety. However, traditional approaches struggle to detect unknown anomalies in complex and dynamic environments due to the scarcity of abnormal samples and limited generalization capabilities. To address these challenges, this paper presents an adaptive VAD framework powered by edge intelligence tailored for resource-constrained industrial settings. Specifically, a lightweight feature extractor is developed by integrating residual networks with channel attention mechanisms, achieving a 58% reduction in model parameters through dense connectivity and output pruning. A multidimensional evaluation strategy is introduced to dynamically select optimal models for deployment on heterogeneous edge devices. To enhance cross-scene adaptability, we propose a multilayer adversarial domain adaptation mechanism that effectively aligns feature distributions across diverse industrial environments. Extensive experiments on a real-world coal mine surveillance dataset demonstrate that the proposed framework achieves an accuracy of 86.7% with an inference latency of 23 ms per frame on edge hardware, improving both detection efficiency and transferability.

Keywords: video anomaly detection; edge intelligence; domain adaptation

MSC: 68T05

1. Introduction

With the rapid development of AI and IoT technologies, the Artificial Intelligence of Things (AIoT) is helping modern industrial production become more efficient, larger in scale, and more intelligent. Differences in product varieties and specifications, equipment status, environments, and processes make production conditions complex and variable. If a process anomaly is not detected and dealt with in a timely manner, it can propagate, evolve, and trigger a chain reaction [1], ultimately affecting the safety of industrial production and product quality. In industrial production environments, the prevalence of surveillance cameras generates large amounts of video data. Traditional industrial video surveillance systems are mainly based on algorithms such as object detection. However, since abnormal events are not known in advance, it is difficult for these systems to predict and identify the specific types or objects involved, which limits their ability to prevent abnormal events. Therefore, applying artificial intelligence to production video analysis to automatically detect abnormal events and issue timely warnings can compensate for the shortcomings of traditional monitoring systems and has broad engineering application prospects.

Depending on how prior information about abnormal events is used, existing methods can be divided into specific event detection (supervised) [2], new event detection (semi-supervised) [3], and unsupervised detection [4]. Specific event detection relies on normal and abnormal samples and their labels. It is suitable for situations where abnormal data is easy to collect, such as detecting violent events. However, it is limited to specific events and is difficult to apply to other situations, and when abnormal data is hard to collect, it is difficult to build a robust model. New event detection only requires normal samples, which suits situations where abnormal samples are rare. It uses a one-class classification strategy to determine anomalies based on the similarity between samples and normal events. However, it is limited by the sample distribution, which results in poor generalization. Unsupervised detection methods do not require training samples but are sensitive to samples with large differences, and the lack of supervision information may reduce effectiveness. This paper focuses on how to adaptively detect abnormal events in industrial production surveillance videos to achieve efficient identification of abnormal situations in industrial environments. Because abnormal events occur far less frequently than normal events, abnormal samples are relatively scarce, and current research generally treats this problem as an unsupervised learning problem: the model is trained using only samples of normal events and detects deviations from normal behavior. After an abnormal event is detected, classification techniques can be applied as needed to further refine the semantic information about the event. This research focuses on detecting anomalies in industrial surveillance videos, such as improper operating behaviors of personnel and equipment failures, which are key factors that affect the stability of industrial production.

In recent years, with the development of deep learning technology, methods such as few-shot learning [5] and transfer learning [6] have laid the foundation for adaptive models for the detection of abnormal events. In addition, with the development of edge computing [7], combining anomaly detection with edge devices helps to achieve low-latency anomaly detection in industrial production. However, for industrial production video anomaly detection applications, existing methods still face the following challenges:

How to implement lightweight adaptive anomaly detection on resource-constrained equipment. Most anomalies in the industrial production process are sudden, so they need to be detected as quickly as possible. Existing methods only consider the accuracy of detection and ignore detection efficiency in practical applications; because the models have a large number of parameters, they place a high computing load on edge devices and the detection delay is long. Therefore, it is necessary to explore lightweight, low-latency, and adaptive anomaly detection methods for industrial production scenarios.
How to realize anomaly detection methods that transfer between complex industrial scenes. Industrial production video scenes are more complex and changeable than ordinary surveillance videos. The same abnormal behavior may occur in different industrial scenes, but because the video backgrounds differ, the data distribution of abnormal events also changes, and a model trained in a single scenario may not perform well in other scenarios. Therefore, it is necessary to explore methods that improve the generalization ability of the model in unknown scenarios and thereby its adaptability to the scene.

To address the above challenges, this paper proposes an edge intelligence-equipped adaptive industrial video anomaly detection (VAD) method, which mainly includes a preprocessing module, a feature extractor, an anomaly detector, a scene adaptation module, and a model evaluation module. For the first challenge, we propose a lightweight feature adaptation method that reduces detection latency by reducing the number of network parameters, and we provide a variety of pruned architectures to adapt to different video scenarios. For the second challenge, we use domain adaptation to perform cross-scenario transfer of anomaly detection, which copes with changes in industrial monitoring scenes and adapts to the available data resources. In addition, we design a model selection method based on multidimensional evaluation, which selects an appropriate architecture from a series of pruned models for deployment and detection according to the environment in which the model runs. In summary, our main contributions are as follows:

We propose a lightweight, feature-adaptive method for video anomaly detection, centered on an enhanced feature extraction module. This module employs residual networks and channel attention mechanisms to minimize the parameter count and adopts dense connections to enable efficient output pruning to different parameter volumes. Furthermore, we propose a method for selecting among the pruned models through a multidimensional evaluation that balances computational efficiency and accuracy. Although there is still room for improvement, the current balance of accuracy and latency meets the basic requirements for real-time industrial video anomaly detection in edge environments.
In the scene adaptation module, we propose a scene-adaptive transferable video anomaly detection method to address the poor generalization ability of existing video anomaly detection methods in industrial production scenarios. This method applies multi-level adversarial domain adaptation at different feature levels to maintain the accuracy of industrial video anomaly detection.
We constructed a dataset based on coal mine industrial video surveillance data. Experimental results show that our model achieves low-latency detection on edge devices, performs cross-scenario detection on the dataset, and can effectively detect abnormal events in industrial production videos.

The rest of the paper is organized as follows: We briefly review the related work in Section 2. In Section 3, we present the details of our proposed method, and in Section 4, we experimentally show the performance of the model. Finally, we conclude our work in Section 5.

2. Related Works

2.1. Video Anomaly Detection

The video anomaly detection (VAD) task refers to the detection of events in video sequences that are inconsistent with normal behavior or scenes [7], such as traffic accidents, violent behavior, and intrusion events. The processing flow of VAD is shown in Figure 1. The data collected by the visual sensors in the environment are monitored; the raw visual data are preprocessed and features are extracted [8]; and the resulting features are passed to a modeling algorithm that models the behavior of the surveillance target and determines whether the behavior is abnormal. In traditional methods, local or global feature descriptors are extracted using techniques such as one-class support vector machines (SVMs) [9], sparse coding [10], and optical flow maps [11], and finally a classifier or clustering method is used to determine whether the video frame or segment is abnormal. The disadvantage of these methods is that feature extraction is based on manual design and has limited expressive power, making it difficult to adapt to complex and changeable video scenes.

In recent years, with the development of deep learning technology, video anomaly detection methods have begun to use deep neural networks to automatically learn high-level features of videos [12]. Deep learning methods can be divided into classification-based methods, reconstruction-based methods, and prediction-based methods. Classification-based methods usually use hand-designed features or features extracted by deep learning to represent objects or motion information in videos. For example, Nikouei et al. [13] proposed an edge computing-based intelligent monitoring solution, using SVM and lightweight convolutional neural networks (CNNs) to achieve real-time identification of people and objects. Xu et al. [14] proposed an anomaly detection method based on stacked sparse coding to understand the nature of objects in the scene. Reconstruction-based methods use models such as autoencoders to learn the internal representation of the video and then determine anomalies by calculating the reconstruction error or anomaly score of the video. For example, Wang et al. [15] proposed an anomaly detection method based on deep neural networks, using a one-stage model to learn features and train a one-class classifier, and then used reconstruction errors to detect anomalies. Jiang et al. [16] proposed an anomaly detection method based on generative adversarial networks for hyperspectral images, using a generator to reconstruct normal spectral features and a discriminator to distinguish normal and abnormal spectra. Prediction-based methods use a model to learn the dynamic features of a video and then determine anomalies by calculating the prediction error or anomaly score of the video. For instance, Liu et al. [17] proposed an anomaly detection method based on a video prediction framework, using spatial and motion constraints to predict future frames and then detecting anomalies based on prediction errors.

2.2. Domain Adaptation

Domain adaptation mainly solves the knowledge transfer problem where the source domain and target domain have different marginal probability distributions while having the same conditional probability distribution. Inspired by adversarial learning, researchers mainly focus on adversarial domain adaptation methods, which align the source domain distribution and target domain distribution by adding adversarial objectives to the domain adaptation network. For example, DANN [18] consists of a feature extractor, a label predictor, and a domain classifier. The domain classifier tries its best to determine the source of the input features, while the feature extractor tries to fool the domain classifier with the learned features. Unlike DANN, ADDA [19] learns two separate feature extractors for the source domain and the target domain. SymNets [20] lets the source domain label classifier and the target domain label classifier share neurons to form a domain discriminator, which confuses the overall features of the domains and fits the marginal distribution.

The application of domain adaptation technology in industrial production scenarios can solve the problems of sparse samples and inconsistent distribution of training samples and actual scene samples in industrial anomaly detection scenarios. For example, Wang et al. [21] applied DANN in machine fault diagnosis, improving the model’s adaptability to actual scene datasets. Santos et al. [22] proposed a method based on statistical learning theory to evaluate the generalizability of the feature space between different fields of video anomaly detection, mapping different video datasets to the same feature space to achieve cross-domain anomaly detection. Fan et al. [23] proposed an anomaly detection method based on a weighted adversarial autoencoder, which can align the distribution of normal data in the source domain and the target domain, but keep the distribution of abnormal data inconsistent in the target domain to distinguish abnormal samples.

2.3. Edge Intelligence

Edge intelligence refers to performing tasks such as data processing or model computation on resource-constrained edge devices, thereby improving the execution speed of applications. For instance, Nikouei et al. [13] introduced a lightweight convolutional neural network (L-CNN), using histogram of oriented gradients (HOG) features to train the network. To provide real-time target detection, a neural network composed of convolutional layers is used for faster object detection. Furthermore, Xu et al. [24] proposed a fast and lightweight YOLO named FL-YOLO, which contains depthwise separable convolutional layers. Chriki et al. [25] used a pre-trained CNN together with one-class classification techniques to extract features from drone videos taken from a complex dataset, captured by humans and machines, of drones passing through a parking lot.

In addition, deep neural networks (DNNs) are often used to detect anomalies on edge devices. Kim et al. [26] addressed the low efficiency of current video analysis systems in processing multiple video streams and proposed a DNN-based GPU edge server method to process large data streams. Kim et al. [27] proposed FrontCNN, a lightweight model for real-time video, which consists of a shallow 3D CNN and a pre-trained 2D CNN. This end-to-end trainable architecture can learn the spatio-temporal information of videos to achieve optimal performance. In the field of edge computing, integrating deep learning with edge devices reduces network occupancy, lowers response latency, and reduces communication overhead.

3. Methodology

In this section, we first introduce the problem definition of anomaly detection in industrial production videos and then introduce the detection process of our proposed adaptive framework and the methods of each module.

3.1. Problem Definition

Given a source industrial scene video $D_{s} = \{I_{i}^{s}\}_{i = 1}^{n_{s}}$ and a target industrial scene video $D_{t} = \{I_{j}^{t}\}_{j = 1}^{n_{t}}$, $n_{s}$ and $n_{t}$ represent the number of samples in the source scene and the number of samples in the target scene, respectively. The source scene and the target scene are sampled from the distributions $p_{s}(x)$ and $p_{t}(x)$ (satisfying $p_{s} \neq p_{t}$) and share a feature space $X_{s} = X_{t}$. The goal is to design a function that minimizes the distance between the features of the video frames of the source scene and the features of the target scene, thus capturing video features in different scenes and reconstructing the video frames. Based on the difference $P(\hat{I}_{t}, I_{t})$ between the reconstructed frame and the actual frame, we can determine whether the frame is abnormal.
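As a rough illustration of the last step, the sketch below scores a frame by the peak signal-to-noise ratio (PSNR) between the actual frame and its reconstruction. It assumes that the difference measure $P$ is PSNR, which Algorithm 1 names as the anomaly score; the function names, the pixel range `max_value`, and the threshold `threshold_db` are hypothetical and not taken from the paper.

```python
import math

def psnr(actual, reconstructed, max_value=1.0):
    """PSNR (in dB) between two frames given as equally sized 2-D lists of floats."""
    squared_errors = [
        (a - r) ** 2
        for row_a, row_r in zip(actual, reconstructed)
        for a, r in zip(row_a, row_r)
    ]
    mse = sum(squared_errors) / len(squared_errors)
    if mse == 0.0:
        return math.inf  # identical frames: reconstruction is perfect
    return 10.0 * math.log10(max_value ** 2 / mse)

def is_abnormal(actual, reconstructed, threshold_db=30.0):
    """Flag a frame whose reconstruction quality falls below a threshold.

    threshold_db is a hypothetical value; in practice it would be tuned
    on validation data from normal footage.
    """
    return psnr(actual, reconstructed) < threshold_db

# A model trained only on normal frames should reconstruct them well
# (high PSNR) and reconstruct abnormal frames poorly (low PSNR).
frame = [[0.5, 0.5], [0.5, 0.5]]
poor_reconstruction = [[0.4, 0.4], [0.4, 0.4]]
print(psnr(frame, poor_reconstruction), is_abnormal(frame, poor_reconstruction))
```

A lower PSNR means a larger reconstruction error, so frames are flagged when the score drops below the threshold rather than when it rises above it.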

3.2. Framework

The overall workflow of Industrial-AdaVAD, our proposed framework, is shown in Figure 2. Our adaptive model, Industrial-AdaVAD, dynamically adapts to different industrial environments by selecting the most suitable feature extractor and using adversarial domain adaptation to align feature distributions across scenes. This enables effective detection of abnormal events by comparing reconstructed frames with actual frames. Specifically, it can be divided into the following modules according to the detection flow:

The video preprocessing module preprocesses industrial production videos according to their characteristics to obtain the actual input video frames.
The lightweight adaptive feature extractor can be deployed on resource-constrained edge devices to adaptively extract video frame features based on computing resources and video scene complexity.
The anomaly detector reconstructs the input frame and determines whether it is abnormal by calculating the difference between the reconstructed frame and the actual frame.
The scene adaptation module takes source scene and target scene frames as input and calculates an adversarial loss, using adversarial domain adaptation to achieve unsupervised industrial scene adaptation.
The model evaluation module evaluates different pruned models in terms of FLOPs, energy consumption, and detection accuracy to select the appropriate pruned model for each operating environment.

3.2.1. Video Preprocessing

Industrial production videos differ greatly from standard surveillance videos and pose unique challenges. Strong light, smoke, and dust degrade image quality and personnel recognition. The diversity of industrial activities and the varied movements of workers demand robust and flexible feature extraction methods. Additionally, complex backgrounds and numerous devices in industrial settings make accurate human feature extraction difficult, necessitating algorithms capable of distinguishing people from objects in such cluttered scenes.

Taking into account the characteristics of the industrial scene, preprocessing includes background noise removal, contrast enhancement, and lighting correction to mitigate the effects of lighting and smoke. For worker appearances, human posture estimation captures diverse postures and actions. A background model, generated for subtraction, reduces interference in anomaly detection, yielding preprocessed video frames as actual input.

3.2.2. Adaptive Lightweight Feature Extraction

Traditional video anomaly detection methods usually use Unet as a feature extractor to reconstruct video frames. However, the ordinary Unet encoding unit has too many parameters for resource-constrained devices, resulting in a high computational load. Inspired by the residual network, this module uses residual neural units for video feature extraction. It also introduces a channel attention mechanism so that the feature extraction unit can adaptively predict the potential key features of video frames and thus extract features more relevant to the current industrial production scene. Our channel attention mechanism differs from existing works by dynamically adjusting to industrial lighting and interference, which enhances detection accuracy. This mechanism focuses on relevant features, improving efficiency and reducing model parameters by 58%. Preliminary experiments show that it outperforms traditional methods in accuracy and efficiency, making it suitable for real-time anomaly detection in industrial environments.

As shown in Figure 3, $F_{i}$ is the input feature of the $i$-th feature extraction unit. After two $3 \times 3$ convolution layers, the compressed feature $X_{i}$ is obtained. To capture the interdependence of the channels, $X_{i}$ is fed into the channel attention module, which calculates the channel weights as follows:

$$z_{c} = \frac{1}{H \times W} \sum_{i = 1}^{H} \sum_{j = 1}^{W} x_{c}(i, j), \tag{1}$$

where $z_{c}$ represents the weight of channel $c$, $x_{c}$ represents the input feature of that channel, and $H \times W$ is its spatial size. After the weight of each channel is obtained, it is applied to $X_{i}$ and the result is fused with the input feature of the unit:

$$s = f(W_{U}\, \delta(W_{D} z)), \tag{2}$$

$$F_{i}^{'} = F_{i} \oplus (X_{i} \otimes s(X_{i})), \tag{3}$$

where $f(\cdot)$ and $\delta(\cdot)$ represent the sigmoid and ReLU functions, and $W_{U}$ and $W_{D}$ represent the parameters of the upsampling layer and the convolution layer. The lightweight video feature extraction unit reduces the number of parameters of the overall video feature extractor, enabling faster detection. At the same time, the channel attention mechanism enables the feature extractor to extract features related to the current scene, thus preserving the accuracy of anomaly detection.
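The three channel attention steps in Equations (1) to (3) can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation: it assumes that $W_{D}$ and $W_{U}$ act on the pooled vector $z$ as small matrices (which is what a $1 \times 1$ convolution does on a $1 \times 1$ feature map), and the function names and toy weights are hypothetical.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def channel_weights(x, w_d, w_u):
    """Return (z, s) for a feature map x given as C channels of H x W values."""
    # Equation (1): global average pooling gives one statistic per channel.
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in x]
    # Equation (2): s = sigmoid(W_U * relu(W_D * z)).
    hidden = [max(0.0, h) for h in matvec(w_d, z)]
    s = [1.0 / (1.0 + math.exp(-u)) for u in matvec(w_u, hidden)]
    return z, s

def attention_unit(f, x, w_d, w_u):
    """Equation (3): add the channel-reweighted X back onto the unit input F."""
    _, s = channel_weights(x, w_d, w_u)
    return [
        [[fv + xv * s_c for fv, xv in zip(f_row, x_row)]
         for f_row, x_row in zip(f_ch, x_ch)]
        for f_ch, x_ch, s_c in zip(f, x, s)
    ]

# Toy example: two 2 x 2 channels, with hypothetical weights that squeeze
# 2 channels to 1 (w_d) and expand 1 back to 2 (w_u).
x = [[[1.0, 1.0], [1.0, 1.0]], [[3.0, 3.0], [3.0, 3.0]]]
f = [[[0.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]]]
w_d = [[1.0, 1.0]]
w_u = [[0.0], [1.0]]
z, s = channel_weights(x, w_d, w_u)
out = attention_unit(f, x, w_d, w_u)
print(z, s, out[0][0][0])
```

The squeeze to a single hidden value is what keeps the attention branch cheap: its parameter count depends only on the number of channels, not on the spatial size of the frame.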

Furthermore, we recognize that industrial surveillance videos encompass both complex and simple scenes. For anomaly detection in complex scenes, deeper network layers are essential for detailed feature extraction, improving the detection of anomalies by enhancing the analysis of video frame intricacies. In contrast, in simpler scenes, utilizing fewer layers for feature extraction suffices to effectively identify anomalies, thereby preventing unnecessary delays in detection.

Inspired by Unet++ proposed by Zhou et al. [28], we propose an adaptive video feature extraction algorithm, which uses a multi-level structure to enable the feature extractor to extract and utilize video frame feature information from multiple levels to adapt to complex video scenes, as shown in Figure 4. The lightweight feature extraction unit adopts a dense connection structure so that each layer is connected to other layers to adapt to video frames in different scenes and sizes. For each feature extraction unit, its calculation structure is as follows:

$$x^{i, j} = \begin{cases} H(x^{i - 1, j}), & j = 0 \\ H([[x^{i, k}]_{k = 0}^{j - 1}, u(x^{i + 1, j - 1})]), & j > 0 \end{cases} \tag{4}$$

where $H(\cdot)$ represents the feature extraction unit composed of the convolution layer and the activation function, $u(\cdot)$ represents the upsampling layer, $[\cdot]$ denotes concatenation, and $x^{i, j}$ represents the output of node $X^{i, j}$. Moreover, to ensure that the shallow sub-networks are trained, our approach incorporates a deep supervision loss, which aggregates the losses from four units, $X^{0, 1}$, $X^{0, 2}$, $X^{0, 3}$, and $X^{0, 4}$, to supervise the video frame reconstruction output at each stage:

$$\mathcal{L}(Y, \hat{Y}) = -\frac{1}{N} \sum_{b = 1}^{N} \left( \frac{1}{2} \cdot Y_{b} + \log \hat{Y}_{b} + \frac{2 \cdot Y_{b} \cdot \hat{Y}_{b}}{Y_{b} + \hat{Y}_{b}} \right) \tag{5}$$

where $\hat{Y}_{b}$ represents the reconstruction result of the $b$-th video frame, $Y_{b}$ represents the actual input frame, and $N$ represents the batch size. Based on the results of Equation (5), this method can prune the feature extraction network without affecting the weight update, to find the most suitable feature extractor network for the current video scene and effectively utilize limited computing resources. Figure 5a–d correspond to the results of pruning different numbers of layers, with the network parameters reduced in sequence.
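To make the pruning idea concrete, the sketch below enumerates which nodes $X^{i, j}$ are needed when the output is read from $X^{0, d}$ instead of the deepest output. It follows the dense connectivity of Equation (4) and the usual Unet++ pruning convention that a sub-network of depth $d$ keeps the nodes with $i + j \leq d$; that convention, and the function names, are assumptions rather than details confirmed by the paper.

```python
def kept_nodes(depth):
    """Nodes X^{i,j} kept when the output is taken from X^{0,depth}."""
    return [(i, j) for j in range(depth + 1) for i in range(depth + 1 - j)]

def inputs_of(i, j):
    """Nodes feeding X^{i,j} according to the dense connectivity of Equation (4)."""
    if j == 0:
        # Encoder column: fed by the node one level above; X^{0,0} reads the frame.
        return [(i - 1, 0)] if i > 0 else []
    # All earlier nodes on the same level, plus the upsampled node below.
    return [(i, k) for k in range(j)] + [(i + 1, j - 1)]

# The deeper the chosen output, the more nodes (and parameters) are retained.
for depth in range(1, 5):
    nodes = kept_nodes(depth)
    # Every input of a kept node is itself kept, so the pruned graph is complete.
    assert all(src in nodes for n in nodes for src in inputs_of(*n))
    print(depth, len(nodes))
```

Because each shallower sub-network is a closed subset of the full graph, it can be cut out after training and run on its own, which is what allows one trained model to serve devices with different budgets.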

3.2.3. Scene-Adaptive Transferable Video Anomaly Detection

To resolve the distribution differences between training scenes and actual industrial scenes, we propose an unsupervised scene adaptation method as shown in Figure 6. First, this method is based on a multi-level adversarial domain adaptation method to align the source scene and the target scene end-to-end. Secondly, this method uses the reconstruction results of source scene video frame features as supervision signals in the process of adversarial domain adaptation to achieve unsupervised adaptation, thus making up for the lack of abnormal event samples.

Taking single-level adversarial adaptation as an example, a scene classifier is established to determine whether the feature comes from the source scene or the target scene, and the feature extractor G tries to confuse the scene classifier C to achieve distribution alignment of source domain features and target domain features. In this method, adversarial training is performed in an unsupervised manner. For the input of the unlabeled video frame I, a fully convolutional scene classifier C is first trained using binary cross-entropy loss:

$$L_{c}(I) = -\left[ y \log C(G(I)) + (1 - y) \log(1 - C(G(I))) \right], \tag{6}$$

where $y = 0$ when the video frame sample comes from the target scene and $y = 1$ when it comes from the source scene. The loss of adversarial training consists of two parts: the reconstruction loss of the source scene video frame and the adversarial loss of the target scene video frame. For a source scene video frame $I_{i}^{s} \in D_{s}$, the feature extractor extracts its features, which are then fed into the reconstructor $D$ to reconstruct the frame, and the reconstruction loss is calculated:

$$L_{rec}(I_{i}^{s}) = L_{mse}(D(G(I_{i}^{s})), I_{i}^{s}), \tag{7}$$

where $L_{mse}$ is the mean square error loss function. For a target scene video frame, the extracted features are input into the scene classifier to predict its scene label. To make the distributions of the source scene and the target scene as close as possible, the adversarial loss is calculated for each input target scene video frame $I_{i}^{t} \in D_{t}$:

$$L_{adv}(I_{i}^{t}) = -\log(C(G(I_{i}^{t}))). \tag{8}$$

This loss is used to train the feature extractor and fool the scene classifier by maximizing the probability that the target scene is considered to be the source scene.

In industrial surveillance video scenarios, the single-level adaptive method pays insufficient attention to advanced features and tends to overlook local abnormal events. To compensate for these shortcomings, we introduce a multi-level adaptive mechanism, which improves the model’s sensitivity to local abnormal events. Inspired by the method of using strong–weak models to learn domain-invariant features [29], this method also adds an adversarial domain adaptation module on the low-level feature scale. The multi-level adversarial loss is

$$L(I_{s}, I_{t}) = \sum_{i} \lambda_{adv} L_{adv}(I_{i}^{t}) + \sum_{i} \lambda_{rec} L_{rec}(I_{i}^{s}), \tag{9}$$

where $\lambda_{adv}$ and $\lambda_{rec}$ are used to balance the two losses. Based on the above, the ultimate goal is the following:

$$\min_{G} \max_{D} L(I_{s}, I_{t}), \tag{10}$$

which minimizes the reconstruction loss of the source scene video frame while maximizing the probability that the target scene video frame is predicted as the source scene video frame. The general flow of the algorithm is shown in Algorithm 1. The multi-level scene adaptation method is preferred in complex industrial environments to enhance sensitivity to both global and local features.

Algorithm 1 Scene-adaptive algorithm.

Input: Source scene frame $I_{s}$, Target scene frame $I_{t}$

Output: Anomaly score (PSNR) of target scene frame

1: for $I$ in source video $D_{s}$, target video $D_{t}$ do
2:     Feature extractor extracts video frame features
3:     Reconstruct video frame $\hat{I}$
4:     Update scene classifier $C$
5:     for each $i$ in adaptive levels do
6:         if $I_{i} \in D_{s}$ then
7:             Calculate the reconstruction loss by Equation (7)
8:         end if
9:         if $I_{i} \in D_{t}$ then
10:             Calculate the adversarial loss by Equation (8)
11:         end if
12:     end for
13:     Calculate total loss by Equation (9)
14:     Update feature extractor and reconstructor by Equation (10)
15: end for
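The loss terms that Algorithm 1 combines can be written out directly. The following is a minimal sketch of Equations (6) to (9) on scalar classifier outputs and flattened frames, with the classifier loss written as the standard binary cross-entropy; it does not include the networks $G$, $C$, and $D$ or any gradient update. The clamping constant `EPS`, the function names, and the example weight values are hypothetical.

```python
import math

EPS = 1e-12  # hypothetical clamp to keep log() away from zero

def _clamp(p):
    return min(max(p, EPS), 1.0 - EPS)

def classifier_loss(p_source, y):
    """Equation (6): binary cross-entropy for the scene classifier.

    p_source stands for C(G(I)), the predicted probability that the frame
    comes from the source scene; y is 1 for source frames and 0 for target.
    """
    p = _clamp(p_source)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def reconstruction_loss(reconstructed, actual):
    """Equation (7): mean squared error between flattened frames."""
    return sum((r - a) ** 2 for r, a in zip(reconstructed, actual)) / len(actual)

def adversarial_loss(p_source):
    """Equation (8): small when a target frame is mistaken for a source frame."""
    return -math.log(_clamp(p_source))

def total_loss(adv_losses, rec_losses, lambda_adv, lambda_rec):
    """Equation (9): weighted sum of the per-level adversarial and reconstruction terms."""
    return lambda_adv * sum(adv_losses) + lambda_rec * sum(rec_losses)

# Two adaptation levels with illustrative classifier outputs for a target frame.
adv = [adversarial_loss(0.5), adversarial_loss(0.9)]
rec = [reconstruction_loss([1.0, 0.0], [0.0, 0.0])]
print(total_loss(adv, rec, lambda_adv=0.1, lambda_rec=1.0))
```

Note the opposing roles: the classifier is trained to lower `classifier_loss`, while the feature extractor is trained to lower `adversarial_loss`, which pushes target features toward what the classifier regards as source-like.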

3.2.4. Multidimensional Evaluation Pruned Model Selection Method

Increasing the number of network layers of the video feature extractor improves the accuracy of anomaly detection but also increases detection delay and energy consumption. When the anomaly detection model runs in the cloud, the constraints on computing resources and energy consumption are relatively loose, but the requirements for accuracy are higher; when it runs on the terminal, delay and energy consumption must be minimized while accuracy is maintained. Existing systems are often based on a single mode, that is, either in the cloud or on the terminal, and rely on manual selection of the operating mode. We propose a multidimensional model evaluation method that assigns different weights to the dimensions and calculates a comprehensive evaluation indicator for each operating environment, thereby helping the system select an appropriate architecture from a series of pruned models for deployment and detection.

Specifically, this method can predict the inference time and energy consumption of the model based on the model structure. For the amount of calculation, FLOPs are used for prediction, which represents the total number of floating-point operations involved in performing an operation. There is a positive correlation between the inference time of a model and FLOPs; that is, on the same hardware device, the more FLOPs, the longer the inference time. The reason is that more floating-point operations will consume more computing resources, resulting in increased time overhead in the inference phase. This paper calculates FLOPs based on the method proposed by Molchanov et al. [30].

$$\mathrm{FLOPs} = n_{c} \times \left[ 2 \times \left( H W K^{2} C_{in} C_{out} + C_{out} \right) \right] + n_{f} \times \left[ C_{in} \times C_{out} \right] \tag{11}$$

where $n_{c}$ and $n_{f}$ represent the numbers of convolutional layers and fully connected layers; $C_{in}$ and $C_{out}$ represent the numbers of input and output channels; $K$ is the convolution kernel size; and $HW$ is the size of the output feature map.
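As an illustration, Equation (11) can be evaluated directly from a list of layer shapes. The following is a minimal sketch rather than the implementation used in this paper: the class and function names are hypothetical, and the per-layer summation (so that layers with different shapes can be mixed) is an assumption, since Equation (11) is written for layers that share one shape.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConvLayer:
    h: int      # output feature map height
    w: int      # output feature map width
    k: int      # convolution kernel size K
    c_in: int   # input channels
    c_out: int  # output channels


@dataclass(frozen=True)
class FcLayer:
    c_in: int   # input features
    c_out: int  # output features


def conv_flops(layer: ConvLayer) -> int:
    """Convolutional term of Equation (11): 2 * (H*W*K^2*C_in*C_out + C_out)."""
    return 2 * (layer.h * layer.w * layer.k ** 2 * layer.c_in * layer.c_out
                + layer.c_out)


def fc_flops(layer: FcLayer) -> int:
    """Fully connected term of Equation (11): C_in * C_out."""
    return layer.c_in * layer.c_out


def total_flops(convs, fcs) -> int:
    """Sum the two terms over all layers (assumed per-layer summation)."""
    return sum(conv_flops(c) for c in convs) + sum(fc_flops(f) for f in fcs)


if __name__ == "__main__":
    # Hypothetical layer shapes, chosen only to exercise the functions.
    convs = [ConvLayer(h=32, w=32, k=3, c_in=64, c_out=128)]
    fcs = [FcLayer(c_in=512, c_out=10)]
    print(total_flops(convs, fcs))
```

Because the estimate depends only on layer shapes, it can be computed for every pruned architecture before deployment, without running the model on the target device.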

For inference energy consumption, we mainly predict the energy consumed on the edge device, which consists of three parts: computation, cache access, and off-chip memory access. The energy consumption of the whole model inference is modeled layer by layer and then predicted by accumulating over all layers. Following the experiments of Liu et al. [31], the prediction model is based on observations from cache simulation and uses the cache hit rate as the key parameter for measuring on-chip and off-chip memory access energy consumption. Existing energy consumption prediction methods are either tightly coupled to the hardware platform and therefore hard to port, or rely on a large amount of statistical simulation data for learning and fitting, which makes the initial data collection expensive. In contrast, this method combines the characteristics of analytical methods and fitted models and can be migrated quickly and effectively to different devices. The total energy consumption is calculated as follows:

$$\mathrm{Energy} = \sum_{i=0}^{\mathrm{layer}-1} \mathrm{Energy}_{i} \tag{12}$$

The energy consumption for the computation of layer $i$ can be modeled as:

$$\mathrm{Energy}_{i} = \sigma_{C} \times C_{i} + \alpha \times \sigma_{cache} \times N_{i} + (1 - \alpha) \times \sigma_{DRAM} \times N_{i} \tag{13}$$

where $\sigma_{C}$ is the energy coefficient of a single calculation; $C_{i}$ is the amount of calculation of the layer; $\alpha$ is the cache hit rate; $N_{i}$ is the total amount of data in the layer; and $\sigma_{cache}$ and $\sigma_{DRAM}$ are the energy coefficients for accessing the cache and DRAM.
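Equations (12) and (13) translate into a few lines of code. The sketch below is illustrative only: the function names are hypothetical, the coefficient values in the usage example are arbitrary placeholders rather than measured values, and a single cache hit rate shared by all layers is an assumption.

```python
def layer_energy(c_i, n_i, sigma_c, sigma_cache, sigma_dram, alpha):
    """Equation (13): computation energy plus cache and DRAM access energy."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("cache hit rate alpha must lie in [0, 1]")
    compute = sigma_c * c_i
    cache = alpha * sigma_cache * n_i          # accesses served by the cache
    dram = (1.0 - alpha) * sigma_dram * n_i    # accesses that go off-chip
    return compute + cache + dram


def model_energy(layers, sigma_c, sigma_cache, sigma_dram, alpha):
    """Equation (12): accumulate the per-layer energy over all layers.

    `layers` is a sequence of (C_i, N_i) pairs.
    """
    return sum(
        layer_energy(c_i, n_i, sigma_c, sigma_cache, sigma_dram, alpha)
        for c_i, n_i in layers
    )


if __name__ == "__main__":
    # Placeholder coefficients in arbitrary units, for illustration only.
    layers = [(100, 10), (200, 20)]
    print(model_energy(layers, sigma_c=1.0, sigma_cache=2.0,
                       sigma_dram=20.0, alpha=0.5))
```

Only the three energy coefficients and the cache hit rate are device-specific, which is what makes the model straightforward to port between devices.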

Finally, based on the FLOPs, inference energy consumption, and anomaly detection AUC of the model, the complete evaluation formula of the model can be obtained:

$$F = \omega_{1} \times \mathrm{AUC} + \omega_{2} \times \mathrm{FLOPs} + \omega_{3} \times \mathrm{Energy} \tag{14}$$

where $\omega_{1}$, $\omega_{2}$, and $\omega_{3}$ are the weights of the three indicators, namely inference accuracy, the amount of inference calculation, and inference energy consumption, respectively, which are calculated by the analytical hierarchy process (AHP). The weights differ across platforms.
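To make the selection step concrete, the following sketch derives weights from a pairwise comparison matrix and ranks candidate models with Equation (14). Several parts are assumptions that the text does not specify: the row geometric-mean approximation of AHP, min-max normalization of all three indicators, inversion of the two cost indicators so that a larger $F$ is better, and every numeric value (the candidates and pairwise judgments are placeholders, not results from this paper). All names are hypothetical.

```python
import math


def ahp_weights(pairwise):
    """Approximate AHP priorities with the row geometric-mean method.

    pairwise[i][j] states how much more important criterion i is than j.
    """
    n = len(pairwise)
    geo_means = [math.prod(row) ** (1.0 / n) for row in pairwise]
    total = sum(geo_means)
    return [g / total for g in geo_means]


def min_max(values):
    """Rescale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def evaluate(candidates, weights):
    """Return {name: F} following the form of Equation (14).

    Assumption: indicators are min-max normalized and the FLOPs and energy
    terms are inverted, so F = w1*AUC + w2*(1 - FLOPs) + w3*(1 - Energy).
    """
    w1, w2, w3 = weights
    auc = min_max([c["auc"] for c in candidates])
    flops = min_max([c["flops"] for c in candidates])
    energy = min_max([c["energy"] for c in candidates])
    return {
        c["name"]: w1 * a + w2 * (1.0 - f) + w3 * (1.0 - e)
        for c, a, f, e in zip(candidates, auc, flops, energy)
    }


def select_model(candidates, weights):
    """Pick the candidate with the highest evaluation score F."""
    scores = evaluate(candidates, weights)
    return max(scores, key=scores.get)


# Toy candidates; the numbers are placeholders, not results from this paper.
CANDIDATES = [
    {"name": "pruned", "auc": 0.80, "flops": 1.0e9, "energy": 1.0},
    {"name": "full", "auc": 0.90, "flops": 4.0e9, "energy": 4.0},
]

# Hypothetical pairwise judgments over (AUC, FLOPs, Energy) per platform.
CLOUD = ahp_weights([[1, 4, 4], [1 / 4, 1, 1], [1 / 4, 1, 1]])
EDGE = ahp_weights([[1, 1 / 2, 1 / 2], [2, 1, 1], [2, 1, 1]])

if __name__ == "__main__":
    print("cloud:", select_model(CANDIDATES, CLOUD))
    print("edge:", select_model(CANDIDATES, EDGE))
```

With accuracy-dominated judgments the unpruned candidate is preferred, while cost-dominated judgments favor the pruned one, which mirrors the cloud and terminal cases described above.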

4. Experiments

4.1. Experimental Settings

4.1.1. Datasets

This paper conducts scene transfer detection between different datasets. The dataset statistics and sample video frames are shown in Table 1 and Figure 7, respectively. Specifically, experiments are conducted on two public datasets, UCSD Ped and CUHK Avenue:

The UCSD Ped dataset [32] contains two sub-datasets, Ped1 and Ped2, both of which contain videos of pedestrians on outdoor walkways. The dataset includes abnormal events captured in crowd scenes ranging from sparse to dense, such as walking on the road, walking on the grass, and vehicle movement on the sidewalk. Ped1 consists of 34 training video samples and 36 testing video samples, and Ped2 consists of 16 training video samples and 12 testing video samples.
The CUHK Avenue dataset [3] includes 16 training videos and 21 test videos. Abnormal events in the dataset include people running, abandoned objects, and people walking with suspicious objects.

In addition, we constructed a real industrial production video test dataset from coal mine surveillance footage, covering abnormal passage through the tunnel, abnormal wall climbing, and unknown objects moving onto the underground rail. The dataset consists of 5 abnormal scenes; each scene includes 9–12 surveillance clips, and each clip contains 150–200 frames.

4.1.2. Baselines

We compare against two types of anomaly detection methods, namely non-adaptive methods and adaptive methods. All methods were compared using the same total amount of data:

(1) Non-adaptive Methods

MPPCA [33]: Proposes a spatio-temporal Markov random field (MRF) model to detect abnormal events in videos, in which the nodes of the MRF graph correspond to different regions of the video frame.
MDT [34]: Proposes a joint spatio-temporal anomaly detector that integrates temporal and spatial information to detect abnormal behaviors in crowded scenes.
Unmasking [35]: Proposes an unsupervised video abnormal event detection framework based on unmasking. By iteratively training a binary classifier, the classifier with higher training accuracy is finally used to detect abnormal events.
ConvAE [36]: Proposes two autoencoder-based methods. The first learns a fully connected autoencoder on traditional handcrafted spatio-temporal local features. The second builds a fully convolutional feedforward autoencoder in an end-to-end learning framework to capture multiple patterns in the dataset and detect anomalies.
MemAE [37]: Proposes a memory-augmented autoencoder (MemAE) that improves the robustness of autoencoder-based anomaly detection so that it handles anomalies better.
MNAD [38]: Proposes a memory module with an update scheme, in which memory items record typical patterns of normal data. This enhances the discriminative ability of the memory items and of the deep features learned from normal data, and improves anomaly detection.

(2) Adaptive Methods

Feature Generalization [22]: Analyzes the feature embeddings of a pre-trained CNN, uses cross-domain generalization metrics to study how well source features generalize to different target video domains, and verifies the practicality of feature generalization on different video datasets.
DANN [23]: Proposes a DANN-based adaptive method for transferring anomaly detection knowledge in an unsupervised manner. Unsupervised adversarial domain adaptation is used to produce a significant difference between the distributions of normal and abnormal data in the target domain, thereby achieving anomaly detection in new scenarios.
Finetune [39]: Conducts extensive benchmarking with 12 different CNN models trained on ImageNet as feature extractors, fine-tuned on seven video anomaly detection benchmark datasets to detect video anomalies.
Meta-Learning [40]: Formulates the problem of few-shot scene-adaptive anomaly detection, which aims to detect anomalies in previously unseen scenes using only a small number of frames, and proposes a meta-learning-based method to address the lack of video data from the target scene.

Video anomaly detection results are measured using AUC, the area under the receiver operating characteristic (ROC) curve. The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR) of an anomaly detection model under different thresholds. The higher the AUC, the better the model distinguishes abnormal data from normal data while reducing false positives and false negatives.
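For reference, the AUC can be computed without explicitly tracing the ROC curve, because it equals the probability that a randomly chosen abnormal frame receives a higher anomaly score than a randomly chosen normal frame. The sketch below uses this pairwise formulation; the function name and the convention that label 1 marks an abnormal frame are assumptions made for illustration.

```python
def roc_auc(labels, scores):
    """Frame-level AUC via pairwise comparison of abnormal and normal scores.

    labels: 1 for abnormal frames, 0 for normal frames (assumed convention).
    scores: anomaly scores, where higher means more likely abnormal.
    Ties between an abnormal and a normal score count as half a win.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one abnormal and one normal frame")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Four frames: two normal (0) and two abnormal (1), with toy scores.
    print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

The pairwise loop is quadratic in the number of frames, which is acceptable for a sketch; a rank-based implementation would be preferable for long videos.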

4.1.3. Implementation Details

In the adaptive feature extraction stage, for the lightweight video feature extractor unit, the downsampling and upsampling layers of the channel attention module are set to $1 \times 1 \times C$ convolutional layers, the number of channels is $C = 64$, and the dimensionality reduction ratio is $r = 16$. The downsampling layer and the upsampling layer use ReLU and sigmoid activation functions, respectively. For the adaptive video feature extraction algorithm, to use deep supervision, a $1 \times 1$ convolution layer followed by a sigmoid activation function is added after each of the four nodes $X^{0,1}$, $X^{0,2}$, $X^{0,3}$, and $X^{0,4}$, so four reconstructed video frames can be generated for a given input video frame and further averaged to generate the final reconstructed video frame. In addition, to compare the detection effect of the model under resource constraints, this section conducts experiments on two hardware platforms from NVIDIA Corporation: NVIDIA GeForce RTX 3090 and NVIDIA Jetson Nano (5V 15W 6CORE mode).

In the scene-adaptive anomaly detection stage, the height $H$ and width $W$ of the input video frame are set to 256. The domain adaptation module consists of 4 fully convolutional layers. The convolution kernel size is $4 \times 4$, the stride is 2, the numbers of channels are $\{64, 128, 256, 512, 1\}$, respectively, and a leaky ReLU layer is added after each layer. The feature extractor and sample reconstructor adopt the Unet architecture. The numbers of channels in the feature extractor are $\{64, 128, 256, 512\}$, and the number of channels and the spatial size of the extracted features are 512 and $32 \times 32$, respectively. The weights $\lambda_{adv}$ and $\lambda_{rec}$ in the adversarial loss are set to 1 and 0.1 in the high-level feature space and to 0.1 and 0.05 in the low-level feature space. The feature extractor has a learning rate of $1.5 \times 10^{-4}$, and the scene classifier has a learning rate of $1 \times 10^{-4}$.
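The settings above imply a few sizes that are easy to check by hand. The following sketch works through that arithmetic under stated assumptions: bias terms in the two $1 \times 1$ convolutions of the channel attention module, a padding of 1 in the $4 \times 4$ stride-2 convolutions (the padding is not given in the text), and a $2\times$ downsampling between consecutive encoder stages. The function names are hypothetical.

```python
def se_param_count(channels, reduction, bias=True):
    """Parameters of the two 1x1 convolutions in a channel attention module."""
    mid = channels // reduction                      # bottleneck width C / r
    down = channels * mid + (mid if bias else 0)     # C -> C/r
    up = mid * channels + (channels if bias else 0)  # C/r -> C
    return down + up


def conv_out_size(size, kernel, stride, padding):
    """Standard convolution output size: floor((n + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1


def chain_sizes(size, layers, kernel=4, stride=2, padding=1):
    """Spatial size after each of `layers` identical convolutions."""
    sizes = []
    for _ in range(layers):
        size = conv_out_size(size, kernel, stride, padding)
        sizes.append(size)
    return sizes


if __name__ == "__main__":
    # C = 64 and r = 16 give a bottleneck of 64 // 16 = 4 channels.
    print(se_param_count(64, 16))
    # Four encoder stages with 2x downsampling between them: 256 -> 32.
    print(256 // 2 ** 3)
    # Four 4x4 stride-2 convolutions applied to a 32 x 32 feature map.
    print(chain_sizes(32, layers=4))
```

Under these assumptions, three downsampling steps take a 256-pixel input to the stated $32 \times 32$ feature size, and the channel attention module adds only a few hundred parameters per unit.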

4.2. Results and Discussion

4.2.1. Performance on Adaptive Lightweight Feature Extraction

In order to evaluate the inference efficiency of our method, this experiment uses gradually increasing training subsets on the Ped1 → Coal Mine Video adaptive task to compare the video feature extractor based on the ordinary Unet architecture with the adaptive lightweight video feature extractor proposed in this paper on a PC (NVIDIA GeForce RTX 3090) and an NVIDIA Jetson Nano. In addition to the AUC, this study also measured the model’s training time, inference time, average power consumption, and peak power consumption on the Jetson. To simulate the data-limited scenario of a real environment, the sub-datasets we constructed contain 100, 500, and 1000 video frames, respectively. In addition, to ensure the fairness of the experiment, the adaptive video feature extraction algorithm was tested at level $L^{3}$ (the same number of layers as the ordinary Unet). The results of the experiment are shown in Table 2 and Table 3.

As shown in Table 3, the average inference time of our adaptive lightweight architecture is approximately 22 ms on the PC platform and about 82 ms on the Jetson platform. Compared to DANN, the inference delay is reduced by about 75%. The execution speed of our lightweight feature extractor on the NVIDIA Jetson Nano is approximately 52 frames per second (fps), enabling real-time detection. While detection accuracy drops by less than 0.5%, the average energy consumption during training and inference on the Jetson platform is reduced by approximately 70%, and the average power consumption is reduced by about 0.48 W. In addition, this method achieves better adaptive effects using only 1000 source-scene and target-scene video frames, so it can greatly shorten the training and detection time of the model while maintaining accuracy. The computational cost is also significantly lower than that of other similar adaptive detection methods: the detection rate of the DANN method is about 18 fps, whereas that of our proposed architecture is 52 fps. Therefore, our proposed lightweight feature extraction architecture can achieve low-latency video anomaly detection on edge platforms and smart terminals.

In addition, to verify the adaptive video feature extraction algorithm, we tested it on the Jetson using Ped1 as the source scene and Ped2 and coal mine rail monitoring as the target scenes, and evaluated the model pruning results at four different levels: $L^{1}$, $L^{2}$, $L^{3}$, and $L^{4}$. As shown in Figure 8, when the target scene is Ped2, the video scene is relatively simple, and the detection accuracy at the three pruning levels $L^{2}$, $L^{3}$, and $L^{4}$ does not differ much. Therefore, when low-latency detection is required, the $L^{2}$ or $L^{3}$ pruned model can be used to reduce detection latency while maintaining detection accuracy. When the target scene is the coal mine rail, the video scene is more complex, so the deeper the network, the better the perception and feature extraction of video frame details, and the better the anomaly detection effect. In this case, the unpruned model at level $L^{4}$ is selected for detection. In summary, the adaptive video feature extraction algorithm provides a variety of architecture options, so reasonable trade-offs can be made between detection accuracy and delay.
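The selection logic described here (use the shallowest level whose accuracy is close to the best one) can be sketched as a simple rule. This is an illustrative heuristic, not the evaluation method of this paper; the function name, the tolerance parameter, and the AUC values in the example are all hypothetical placeholders.

```python
def pick_pruning_level(auc_by_level, tolerance):
    """Return the shallowest level whose AUC is within `tolerance` of the best.

    auc_by_level maps a pruning level (1..4 for L^1..L^4) to its AUC.
    """
    best = max(auc_by_level.values())
    for level in sorted(auc_by_level):
        if best - auc_by_level[level] <= tolerance:
            return level
    raise ValueError("auc_by_level must not be empty")


if __name__ == "__main__":
    # Placeholder AUC values, not measurements from this paper.
    simple_scene = {1: 0.80, 2: 0.90, 3: 0.91, 4: 0.915}
    complex_scene = {1: 0.60, 2: 0.70, 3: 0.80, 4: 0.90}
    print(pick_pruning_level(simple_scene, tolerance=0.02))
    print(pick_pruning_level(complex_scene, tolerance=0.02))
```

When accuracy saturates early, the rule returns a shallow level; when accuracy keeps improving with depth, it falls back to the unpruned level.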

4.2.2. Performance on Scene-Adaptive Detection

In order to verify the performance of the scene-adaptive transferable video anomaly detection method, we set up two types of scene adaptation tasks in the experiment, namely Ped1 → Ped2 and Ped1 → Avenue, and compared them with the non-adaptive baseline and the adaptive baseline. The experimental results are shown in Table 4.

In the Ped1 → Ped2 task, the average AUC of the non-adaptive methods is 77.14%, and that of the adaptive methods is 87.60%. In the Ped2 → Coal Mine Video task, the average AUC of the non-adaptive methods is 77.14%, and that of the adaptive methods is 80.85%. The average performance of the adaptive methods on the two tasks is improved by about 9% and 5%, respectively. This shows that adaptive methods can better detect abnormal events in multiple scenarios. We also find that, because the video distributions of Ped1 and Ped2 are similar, the adaptive detection effect on Ped2 is better than that on Coal Mine Video. The AUC of our proposed method reached 92.03% on the Ped1 → Ped2 task and 84.31% on the Ped2 → Coal Mine Video task, which is the best among all methods. It is worth noting that, compared with DANN, a similar method based on adversarial domain adaptation, our method achieves better performance because it adapts at multiple feature levels and trains the scene classifier in each epoch, which shows that this method gives the anomaly detection model better scene generalization performance.

In addition, this experiment visualized the results of video anomaly detection. Figure 9 shows the PSNR curves of different methods for cross-scene video anomaly detection. The higher the PSNR value, the more similar the reconstructed frame is to the original frame; the dark blue line is the proposed scene-adaptive method, and the light blue line is the non-adaptive anomaly detection method. When an abnormal event occurs, such as an abnormal vehicle entering the scene, the reconstruction score of the video frame decreases. Compared with the non-adaptive method, our proposed method scores higher in both normal and abnormal frame reconstruction, and the gap between the two scores is larger, so the proposed adaptive method can better detect anomalies across scenes.
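For readers who want to reproduce such curves, PSNR follows the standard definition $\mathrm{PSNR} = 10 \log_{10}(\mathrm{MAX}^{2} / \mathrm{MSE})$ between a frame and its reconstruction. The sketch below computes it on flattened pixel lists and rescales the per-frame values of one video to $[0, 1]$. The min-max rescaling is a common convention in reconstruction-based anomaly detection and is an assumption here, as are the function names and the pixel range of $[0, 1]$.

```python
import math


def psnr(original, reconstructed, max_value=1.0):
    """Peak signal-to-noise ratio between two equally sized pixel sequences."""
    if len(original) != len(reconstructed) or not original:
        raise ValueError("frames must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_value ** 2 / mse)


def normalize_scores(psnr_values):
    """Min-max rescale per-frame PSNR within one video (assumed convention).

    A low normalized score marks a frame that is reconstructed poorly and is
    therefore more likely to be abnormal.
    """
    lo, hi = min(psnr_values), max(psnr_values)
    if hi == lo:
        return [1.0] * len(psnr_values)
    return [(p - lo) / (hi - lo) for p in psnr_values]


if __name__ == "__main__":
    frame = [0.0, 0.0, 0.0, 0.0]
    good = [0.01, 0.01, 0.01, 0.01]   # small reconstruction error
    bad = [0.10, 0.10, 0.10, 0.10]    # larger reconstruction error
    values = [psnr(frame, good), psnr(frame, bad)]
    print(values, normalize_scores(values))
```

A larger reconstruction error lowers the PSNR, which is why abnormal segments appear as dips in the curves.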

Finally, we tested on the real Coal Mine Video dataset, which mainly includes four industrial production monitoring scenes: cableway, east side of the equipment center, underground track, and control room. In order to ensure fairness, the source scene datasets all use Ped2 and are compared with three types of adaptive methods. The detection results are shown in Table 5. The results show that the method based on adversarial domain adaptation is overall better than the method based on fine-tuning. A reasonable explanation for this is that adversarial domain adaptation usually does not require label information from the target domain, so it is more suitable for application in unsupervised scenarios, which makes it more effective when the target scene labels are sparse.

Figure 10 shows the adaptive anomaly detection results of our method trained on Ped2 in different industrial production scenarios. Figure 10a, for example, shows the detection results in the new monitoring scenario. Under normal conditions, the PSNR value is high, but when an abnormal event occurs, the PSNR value decreases rapidly. In the first abnormal section, abnormal vehicles can be seen entering the track, while in the second and third abnormal sections, outsiders and other vehicles can be seen intruding. Figure 10b,c show that abnormal vehicles entering the warehouse and people climbing over the wall were also detected in the cable library monitoring and east wall monitoring equipment, respectively. This shows that the adaptive method proposed in this paper is effective in practice and can detect abnormal events in industrial production monitoring scenarios.

Furthermore, we compare the single-level and multi-level adaptive video frame reconstruction results in an actual industrial scenario of underground track monitoring. Figure 11 shows the heat map of video frame reconstruction, in which abnormal events (abnormal intrusion of transport vehicles and workers) are marked by red boxes. For the same abnormal frame, the reconstruction results using only a single level of adaptation also highlight the abnormal parts, but with multi-level feature reconstruction the abnormal parts stand out more clearly while the normal parts are darker. Therefore, using multi-level adaptation can enhance the contrast between abnormal and normal parts, thereby better distinguishing abnormal events.

5. Conclusions

This paper proposes an adaptive edge intelligence framework for industrial video anomaly detection, which addresses the challenges of abnormal sample scarcity and limited generalizability in complex environments. A lightweight feature extractor, combining residual networks and channel attention mechanisms, achieves a 58% parameter reduction via dense connectivity and output pruning. In addition, a multilayer adversarial domain adaptation mechanism enhances cross-scene transferability. Our proposed Industrial-AdaVAD framework achieved an accuracy of 86.7% with an inference latency of 23 ms per frame on the NVIDIA Jetson Nano, demonstrating effective and efficient anomaly detection in industrial settings.

Despite the promising results, our work has several limitations that should be acknowledged. Specifically, the model performance may degrade under extreme lighting conditions or when there is significant occlusion, which is common in some industrial settings. Additionally, the generalizability of the model to scenarios with different camera angles or resolutions requires further validation. To address these limitations, future work will focus on improving adaptability and robustness in various industrial scenarios. First, integration of multimodal sensory data, such as infrared, acoustic, and gas signals, will be pursued to improve anomaly detection under extreme or low-visibility conditions through cross-modal transfer learning. Second, continuous domain adaptation algorithms will be developed to address temporal shifts and evolving production environments without requiring full retraining. Lastly, we aim to design hierarchical edge–cloud collaborative architectures to support real-time feedback and decision making, paving the way for closed-loop autonomous industrial safety monitoring systems.

Author Contributions

Conceptualization, J.X. and B.G.; methodology, J.X. and H.S.; data curation, J.X. and H.S.; writing—original draft preparation, J.X., H.S. and Y.D.; writing—review and editing, Y.D. and B.G.; supervision, Y.D. and B.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research was partially supported by the National Natural Science Foundation of China (No. 62032020), and the China Postdoctoral Science Foundation under Grant Number 2025M774363.

Data Availability Statement

The CUHK Avenue dataset [3] used in this study is openly available at https://www.cse.cuhk.edu.hk/~leojia/projects/detectabnormal/index.html (accessed on 11 August 2025). The UCSD Ped dataset [32] used in this study is openly available at http://visal.cs.cityu.edu.hk/downloads (accessed on 11 August 2025).

Acknowledgments

We would like to thank the reviewers for their comments, which helped to considerably improve this article. During the preparation of this work, we used GPT-3.5 in order to correct grammatical errors and improve readability. After using this tool, we reviewed and edited the content as needed and take full responsibility for the content of the paper.

Conflicts of Interest

Haocheng Shen is an employee of the ZTE Corporation. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

AIoT	Artificial Intelligence of Things
VAD	Video Anomaly Detection
SVM	Support Vector Machine
CNN	Convolutional Neural Network
L-CNN	Lightweight Convolutional Neural Network
HOG	Histogram of Oriented Gradients
DNN	Deep Neural Network
AHP	Analytical Hierarchy Process
MRF	Markov Random Field
MemAE	Memory-Augmented Autoencoder
ROC	Receiver Operating Characteristic
TPR	True Positive Rate
FPR	False Positive Rate

References

1. Ma, L.; Dong, J.; Peng, K.; Zhang, C. Hierarchical monitoring and root-cause diagnosis framework for key performance indicator-related multiple faults in process industries. IEEE Trans. Ind. Inform. 2018, 15, 2091–2100.
2. Luo, W.; Liu, W.; Gao, S. A revisit of sparse coding based anomaly detection in stacked rnn framework. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 341–349.
3. Lu, C.; Shi, J.; Jia, J. Abnormal event detection at 150 fps in matlab. In Proceedings of the IEEE International Conference on Computer Vision, Sydney, NSW, Australia, 1–8 December 2013; pp. 2720–2727.
4. Pimentel, M.A.; Clifton, D.A.; Clifton, L.; Tarassenko, L. A review of novelty detection. Signal Process. 2014, 99, 215–249.
5. Wang, Y.; Yao, Q.; Kwok, J.T.; Ni, L.M. Generalizing from a few examples: A survey on few-shot learning. ACM Comput. Surv. (CSUR) 2020, 53, 1–34.
6. Zhuang, F.; Qi, Z.; Duan, K.; Xi, D.; Zhu, Y.; Zhu, H.; Xiong, H.; He, Q. A comprehensive survey on transfer learning. Proc. IEEE 2020, 109, 43–76.
7. Patrikar, D.R.; Parate, M.R. Anomaly detection using edge computing in video surveillance system. Int. J. Multimed. Inf. Retr. 2022, 11, 85–110.
8. Georgiou, T.; Liu, Y.; Chen, W.; Lew, M. A survey of traditional and deep learning-based feature descriptors for high dimensional data in computer vision. Int. J. Multimed. Inf. Retr. 2020, 9, 135–170.
9. Chen, Y.; Zhou, X.S.; Huang, T.S. One-class SVM for learning in image retrieval. In Proceedings of the 2001 International Conference on Image Processing (Cat. No. 01CH37205), Thessaloniki, Greece, 7–10 October 2001; Volume 1, pp. 34–37.
10. Zhao, B.; Li, F.-F.; Xing, E.P. Online detection of unusual events in videos via dynamic sparse coding. In Proceedings of the CVPR 2011, Colorado Springs, CO, USA, 20–25 June 2011; pp. 3313–3320.
11. Mehran, R.; Oyama, A.; Shah, M. Abnormal crowd behavior detection using social force model. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 935–942.
12. Nayak, R.; Pati, U.C.; Das, S.K. A comprehensive review on deep learning-based methods for video anomaly detection. Image Vis. Comput. 2021, 106, 104078.
13. Nikouei, S.Y.; Chen, Y.; Song, S.; Xu, R.; Choi, B.Y.; Faughnan, T. Smart surveillance as an edge network service: From harr-cascade, svm to a lightweight cnn. In Proceedings of the 2018 IEEE 4th International Conference on Collaboration and Internet Computing (CIC), Philadelphia, PA, USA, 18–20 October 2018; pp. 256–265.
14. Xu, R.; Nikouei, S.Y.; Chen, Y.; Polunchenko, A.; Song, S.; Deng, C.; Faughnan, T.R. Real-time human objects tracking for smart surveillance at the edge. In Proceedings of the 2018 IEEE International Conference on Communications (ICC), Kansas City, MO, USA, 20–24 May 2018; pp. 1–6.
15. Wang, C.; Dong, S.; Zhao, X.; Papanastasiou, G.; Zhang, H.; Yang, G. SaliencyGAN: Deep learning semisupervised salient object detection in the fog of IoT. IEEE Trans. Ind. Inform. 2019, 16, 2667–2676.
16. Jiang, T.; Li, Y.; Xie, W.; Du, Q. Discriminative reconstruction constrained generative adversarial network for hyperspectral anomaly detection. IEEE Trans. Geosci. Remote Sens. 2020, 58, 4666–4679.
17. Liu, W.; Luo, W.; Lian, D.; Gao, S. Future frame prediction for anomaly detection—A new baseline. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6536–6545.
18. Ganin, Y.; Ustinova, E.; Ajakan, H.; Germain, P.; Larochelle, H.; Laviolette, F.; Marchand, M.; Lempitsky, V. Domain-adversarial training of neural networks. J. Mach. Learn. Res. 2016, 17, 1–35.
19. Tzeng, E.; Hoffman, J.; Saenko, K.; Darrell, T. Adversarial discriminative domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7167–7176.
20. Zhang, Y.; Tang, H.; Jia, K.; Tan, M. Domain-symmetric networks for adversarial domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5031–5040.
21. Wang, Q.; Michau, G.; Fink, O. Domain adaptive transfer learning for fault diagnosis. In Proceedings of the 2019 Prognostics and System Health Management Conference (PHM-Paris), Paris, France, 2–5 May 2019; pp. 279–285.
22. dos Santos, F.P.; Ribeiro, L.S.; Ponti, M.A. Generalization of feature embeddings transferred from different video anomaly detection domains. J. Vis. Commun. Image Represent. 2019, 60, 407–416.
23. Fan, C.; Zhang, F.; Liu, P.; Sun, X.; Li, H.; Xiao, T.; Zhao, W.; Tang, X. Importance weighted adversarial discriminative transfer for anomaly detection. arXiv 2021, arXiv:2105.06649.
24. Xu, Z.; Li, J.; Zhang, M. A surveillance video real-time analysis system based on edge-cloud and fl-yolo cooperation in coal mine. IEEE Access 2021, 9, 68482–68497.
25. Chriki, A.; Touati, H.; Snoussi, H.; Kamoun, F. Deep learning and handcrafted features for one-class anomaly detection in UAV video. Multimed. Tools Appl. 2021, 80, 2599–2620.
26. Kim, W.J.; Youn, C.H. Lightweight online profiling-based configuration adaptation for video analytics system in edge computing. IEEE Access 2020, 8, 116881–116899.
27. Kim, J.H.; Kim, N.; Won, C.S. Deep edge computing for videos. IEEE Access 2021, 9, 123348–123357.
28. Zhou, Z.; Rahman Siddiquee, M.M.; Tajbakhsh, N.; Liang, J. Unet++: A nested u-net architecture for medical image segmentation. In International Workshop on Deep Learning in Medical Image Analysis; Springer: Berlin/Heidelberg, Germany, 2018; pp. 3–11.
29. Saito, K.; Ushiku, Y.; Harada, T.; Saenko, K. Strong-weak distribution alignment for adaptive object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 6956–6965.
30. Molchanov, P.; Tyree, S.; Karras, T.; Aila, T.; Kautz, J. Pruning convolutional neural networks for resource efficient inference. arXiv 2016, arXiv:1611.06440.
31. Liu, S.; Li, X.; Zhou, Z.; Guo, B.; Zhang, M.; Shen, H.; Yu, Z. Adaenlight: Energy-aware low-light video stream enhancement on mobile devices. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 2023, 6, 172.
32. Chan, A.B.; Vasconcelos, N. Modeling, clustering, and segmenting video with mixtures of dynamic textures. IEEE Trans. Pattern Anal. Mach. Intell. 2008, 30, 909–926.
33. Kim, J.; Grauman, K. Observe locally, infer globally: A space-time MRF for detecting abnormal activities with incremental updates. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 2921–2928.
34. Li, W.; Mahadevan, V.; Vasconcelos, N. Anomaly detection and localization in crowded scenes. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 36, 18–32.
35. Tudor Ionescu, R.; Smeureanu, S.; Alexe, B.; Popescu, M. Unmasking the abnormal events in video. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2895–2903.
36. Hasan, M.; Choi, J.; Neumann, J.; Roy-Chowdhury, A.K.; Davis, L.S. Learning temporal regularity in video sequences. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 733–742.
37. Gong, D.; Liu, L.; Le, V.; Saha, B.; Mansour, M.R.; Venkatesh, S.; Hengel, A.v.d. Memorizing normality to detect anomaly: Memory-augmented deep autoencoder for unsupervised anomaly detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1705–1714.
38. Park, H.; Noh, J.; Ham, B. Learning memory-guided normality for anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 14372–14381.
39. Gutoski, M.; Ribeiro, M.; Hattori, L.T.; Romero, M.; Lazzaretti, A.E.; Lopes, H.S. A comparative study of transfer learning approaches for video anomaly detection. Int. J. Pattern Recognit. Artif. Intell. 2021, 35, 2152003.
40. Lu, Y.; Yu, F.; Reddy, M.K.K.; Wang, Y. Few-shot scene-adaptive anomaly detection. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 125–141.

Figure 1. Video anomaly detection workflow.

Figure 2. The overall workflow of Industrial-AdaVAD.

Figure 3. Lightweight feature extractor unit.

Figure 4. Adaptive feature extractor.

Figure 5. Different pruned models for adaptive feature extractor output. Blue nodes represent activated network units involved in real-time inference calculations of the model, used for feature extraction and anomaly detection in industrial scenarios; gray nodes are units pruned and deactivated to adapt to edge devices for lightweighting, which can reduce the amount of computation. Besides, (a) is the unpruned fully-activated basic model, and (b–d) are the processes of progressively streamlining the structure through pruning strategies according to the edge-adaptation requirements of industrial scenarios, reflecting the evolution of the model towards lightweighting and balancing detection accuracy with the resource limitations of edge devices.

Figure 6. Multi-level scene adaptation method.

Figure 7. Frames in videos from different datasets.

Figure 8. Comparison of adaptive task pruning results in different scenarios.

Figure 9. PSNR of different methods for cross-scene anomaly detection. The light blue curve represents traditional video anomaly detection methods, while the dark blue curve represents our method. As can be seen, when anomalies occur in the video scene, PSNR decreases in both cases. However, our method is more sensitive to changes in anomaly events and can detect some less significant anomalies (such as shadow areas).

Figure 10. PSNR for anomaly detection in surveillance videos of different coal mine scenes. The vertical axis shows PSNR, which reflects frame quality; abnormal scenarios are often accompanied by a significant decrease in PSNR. The horizontal axis is the time axis of the video frame sequence. Red boxes in the inset video frames mark abnormal targets (such as personnel or equipment). (a) Underground track scene (orange curve): PSNR is stable under normal conditions and drops suddenly in the shaded area because of an abnormality (such as equipment failure or unauthorized personnel entry); the red box in the inset marks the abnormal area. (b) Cable warehouse scene (green curve): PSNR decreases during anomalies; the inset shows intrusive equipment. (c) Equipment center fence scene (blue curve): PSNR decreases during anomalies; the inset shows personnel scaling the fence. In all three scenes, the drops in PSNR coincide with actual anomalies.
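
The PSNR curves in Figures 9 and 10 can be illustrated with a minimal sketch. It assumes the standard definition PSNR = 10·log10(MAX² / MSE) between a frame and its reconstruction, followed by per-video min-max normalization and a fixed threshold. The function names, the flat pixel-list representation, and the threshold of 0.5 are assumptions made for illustration; the scoring rule actually used by the authors may differ.

```python
import math
from typing import List, Sequence


def psnr(frame: Sequence[float], recon: Sequence[float], max_val: float = 255.0) -> float:
    """PSNR in dB between two equally sized, flattened frames."""
    if len(frame) != len(recon) or len(frame) == 0:
        raise ValueError("frames must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(frame, recon)) / len(frame)
    if mse == 0:
        return math.inf  # identical frames: PSNR is unbounded
    return 10.0 * math.log10(max_val ** 2 / mse)


def flag_low_psnr(psnrs: Sequence[float], threshold: float = 0.5) -> List[bool]:
    """Min-max normalize finite per-frame PSNR values over one video and
    flag frames whose normalized score falls below `threshold` (assumed value)."""
    lo, hi = min(psnrs), max(psnrs)
    if hi == lo:
        return [False] * len(psnrs)  # flat curve: nothing stands out
    return [(p - lo) / (hi - lo) < threshold for p in psnrs]


if __name__ == "__main__":
    # Toy 4-pixel frame whose reconstruction is off by 10 in one pixel.
    print(round(psnr([100, 100, 100, 100], [100, 100, 100, 110]), 2))
    # Toy PSNR curve with one sharp drop at index 2.
    print(flag_low_psnr([35.0, 34.0, 22.0, 36.0]))
```

A frame is flagged only relative to the other frames of the same video, which is why a sudden drop stands out while a uniformly low-quality video does not.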

Figure 11. Adaptive video frame reconstruction heatmaps at different levels. (a) The original surveillance frame, with red boxes highlighting abnormal targets (such as vehicles or people); (b) the feature heatmap produced by traditional VAD methods (the blue-yellow-red color scale indicates feature response intensity), which does not focus sufficiently on the features of the abnormal targets; (c) the result of multi-level domain-adaptive processing, which localizes abnormalities more precisely. This illustrates the advantage of multi-level adaptive processing for anomaly recognition in industrial scenes (e.g., personnel violations, equipment abnormalities), which improves detection accuracy.

Table 1. Video anomaly detection datasets (number of video frames).

	UCSD Ped1	UCSD Ped2	Avenue	Coal Mine Video
Training Video Frames	6766	2533	15,328	12,687
Testing Video Frames	7164	1997	15,324	11,529

Table 2. Evaluation of the Ped1 → Coal Mine Video task based on different extractors (Part 1: Performance and Power).

	Training Data	AUC	Energy Consumption [mAh]	Peak Power [W]	Average Power [W]
DANN (Based on Unet)	Source	76.46%	-	-	-
	100	77.47%	25.60	7.648	7.579
	500	81.86%	126.80	7.637	7.651
	1000	84.68%	257.40	7.661	7.674
Lightweight Feature Extractor	Source	80.10%	-	-	-
	100	76.85%	9.40	7.246	7.233
	500	80.93%	44.30	7.271	7.264
	1000	84.31%	83.30	7.197	7.201
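
The energy figures in Table 2 can be turned into relative savings with a few lines of arithmetic. The sketch below only re-uses the AUC and energy values from the table; the dictionary layout and function names are illustrative assumptions, and the keys are simply the values of the Training Data column.

```python
# Values copied from Table 2 (Ped1 -> Coal Mine Video).
# Key: value of the "Training Data" column; value: (AUC in %, energy in mAh).
DANN_UNET = {100: (77.47, 25.60), 500: (81.86, 126.80), 1000: (84.68, 257.40)}
LIGHTWEIGHT = {100: (76.85, 9.40), 500: (80.93, 44.30), 1000: (84.31, 83.30)}


def energy_saving_pct(baseline_mah: float, candidate_mah: float) -> float:
    """Relative energy reduction of the candidate versus the baseline, in percent."""
    return 100.0 * (baseline_mah - candidate_mah) / baseline_mah


def compare(baseline: dict, candidate: dict) -> dict:
    """Per setting: (AUC gap in percentage points, energy saving in percent)."""
    rows = {}
    for n, (base_auc, base_mah) in baseline.items():
        cand_auc, cand_mah = candidate[n]
        rows[n] = (
            round(base_auc - cand_auc, 2),
            round(energy_saving_pct(base_mah, cand_mah), 1),
        )
    return rows


if __name__ == "__main__":
    for n, (auc_gap, saving) in compare(DANN_UNET, LIGHTWEIGHT).items():
        print(f"{n:>5}: AUC gap {auc_gap:.2f} pp, energy saving {saving:.1f}%")
```

With the values in the table, the lightweight extractor gives up less than one percentage point of AUC in each setting while using roughly 63% to 68% less energy.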

Table 3. Evaluation of the Ped1 → Coal Mine Video task based on different extractors (Part 2: Training and Inference Time).

	Training Data	Training Time, Jetson [min]	Training Time, PC [min]	Inference Time, Jetson [ms/frame]	Inference Time, PC [ms/frame]
DANN (Based on Unet)	Source	-	-	-	-
	100	0.89	1.68	226	95
	500	4.76	3.88	246	84
	1000	9.64	5.75	213	87
Lightweight Feature Extractor	Source	-	-	-	-
	100	0.32	1.07	83	23
	500	1.61	2.11	85	20
	1000	3.19	3.46	78	21
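
Per-frame latency is easier to compare once it is expressed as throughput and as a speed-up ratio. The sketch below applies both conversions to the inference times in Table 3; the data layout and function names are assumptions made for illustration.

```python
# Inference latencies in ms per frame, copied from Table 3.
# Outer key: platform; inner key: value of the "Training Data" column.
DANN_UNET_MS = {
    "Jetson": {100: 226, 500: 246, 1000: 213},
    "PC": {100: 95, 500: 84, 1000: 87},
}
LIGHTWEIGHT_MS = {
    "Jetson": {100: 83, 500: 85, 1000: 78},
    "PC": {100: 23, 500: 20, 1000: 21},
}


def fps(latency_ms: float) -> float:
    """Frames per second sustained at a given per-frame latency."""
    return 1000.0 / latency_ms


def speedup(baseline_ms: float, candidate_ms: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms


if __name__ == "__main__":
    for platform, settings in LIGHTWEIGHT_MS.items():
        for n, light_ms in settings.items():
            base_ms = DANN_UNET_MS[platform][n]
            print(
                f"{platform:>6} {n:>5}: {fps(light_ms):5.1f} fps, "
                f"{speedup(base_ms, light_ms):.2f}x faster than DANN (Unet)"
            )
```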

Table 4. Detection results (AUC) of different scene-adaptive tasks. The best result in each column is that of Our Method.

	Method	Ped1 → Ped2	Ped1 → Coal Mine Video
No Ada.	MPPCA [33]	69.30%	-
	MPPCA+SFA [33]	61.30%	-
	MDT [34]	82.90%	-
	Unmasking [35]	82.20%	-
	ConvAE [36]	81.10%	78.16%
	MNAD [38]	81.40%	76.96%
	MemAE (directly) [37]	81.83%	74.24%
Ada.	Feature Gen [22]	80.18%	77.59%
	MemAE (finetune) [37]	82.00%	78.63%
	DANN [23]	91.90%	83.89%
	Finetune [39]	89.30%	81.03%
	Meta-Learning [40]	90.21%	79.69%
	Our Method	92.03%	84.31%

Table 5. Adaptive anomaly detection results (AUC) on coal mine production videos in different coal mine scenes.

Methods	Cableway	Equipment Center	Track	Control Room	Average AUC
Feature Generalization	81.23%	79.14%	85.32%	80.45%	81.54%
MemAE (Finetune)	83.91%	84.25%	87.41%	82.13%	84.43%
DANN	86.77%	87.93%	86.82%	85.22%	86.69%
Our Method	87.45%	89.76%	90.23%	84.78%	88.06%
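
The Average AUC column of Table 5 is the arithmetic mean of the four per-scene values, which the following sketch re-computes as a consistency check. It also reports which method leads in each scene. The data layout, function names, and the 0.01 rounding tolerance are assumptions made for illustration.

```python
SCENES = ("Cableway", "Equipment Center", "Track", "Control Room")

# Per-scene AUC values (%) and the reported average, copied from Table 5.
TABLE5 = {
    "Feature Generalization": ((81.23, 79.14, 85.32, 80.45), 81.54),
    "MemAE (Finetune)": ((83.91, 84.25, 87.41, 82.13), 84.43),
    "DANN": ((86.77, 87.93, 86.82, 85.22), 86.69),
    "Our Method": ((87.45, 89.76, 90.23, 84.78), 88.06),
}


def mean_auc(values) -> float:
    """Arithmetic mean of the per-scene AUC values."""
    return sum(values) / len(values)


def averages_consistent(table: dict, tol: float = 0.01) -> dict:
    """True for each method whose reported average matches the recomputed mean."""
    return {m: abs(mean_auc(v) - avg) <= tol for m, (v, avg) in table.items()}


def best_per_scene(table: dict) -> dict:
    """Name of the method with the highest AUC in each scene."""
    return {
        scene: max(table, key=lambda m: table[m][0][i])
        for i, scene in enumerate(SCENES)
    }


if __name__ == "__main__":
    print(averages_consistent(TABLE5))
    print(best_per_scene(TABLE5))
```

On the table's values, every reported average matches its recomputed mean, and Our Method leads in three of the four scenes, while DANN is ahead in the Control Room scene (85.22% versus 84.78%).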

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Xiao, J.; Shen, H.; Ding, Y.; Guo, B. Industrial-AdaVAD: Adaptive Industrial Video Anomaly Detection Empowered by Edge Intelligence. Mathematics 2025, 13, 2711. https://doi.org/10.3390/math13172711

