1. Introduction
Visual anomaly detection (VAD), which recognizes anomalies after training only on unlabeled normal images, is an intriguing topic in computer vision because collecting anomalous samples can be challenging. During the training phase, VAD methods learn normal patterns solely from normal images. In the testing phase, anomalies are detected by identifying discrepancies between the test image and the learned normal patterns. In applications such as video surveillance [
1], medical image analysis [
2], and industrial defect detection [
3], although anomalies are rare, the demand for their precise detection makes VAD an indispensable technology.
Currently, most VAD methods focus on single-class detection tasks, training a separate model for each class. However, in real-world applications, especially in dynamic production environments such as flexible manufacturing, data often involve multiple classes. Existing methods typically rely on class-separation strategies, which not only increase the complexity of model deployment but also lead to inefficient resource usage. Moreover, when such methods are applied to multiple classes with a single set of shared network parameters, they struggle to efficiently capture features from multi-class data. Therefore, studying multi-class anomaly detection methods is both necessary and valuable.
There has been some work [
4,
5,
6,
7,
8,
9] on multi-class anomaly detection, and most of these methods focus on enhancing the multi-class feature extraction capability. For example, UniAD [
4] and DiAD [
5] detect anomalies by reconstructing the normal appearance of input data and comparing it with the original data. Recently, MambaAD [
8] introduced the Mamba framework into anomaly detection, achieving leading performance thanks to its enhanced global feature extraction and efficient computation. However, despite being effective, these methods still adhere to a static learning strategy: the trained model shares the same network parameters across all multi-class anomaly detection tasks. This static strategy prevents the model from dynamically adjusting its responses to the input data, thereby limiting its performance on multi-class data.
To address this limitation, this paper proposes a dynamic visual adaptation framework for multi-class anomaly detection. First, as shown in
Figure 1, we select the Mamba network as the base architecture to ensure powerful feature extraction capabilities. Next, we propose a network plug-in, the
Hyper AD Plug-in, which dynamically adjusts the model parameters based on the characteristics of input data. Through the collaboration of the Mamba blocks, CNN blocks, and the proposed
Hyper AD Plug-in, we extract the global, local, and dynamic features for multi-class data. Furthermore, we introduce the Mixture-of-Experts (MoE) module, which consists of multiple experts and a gating network responsible for the dynamic routing mechanism. Through this mechanism, the MoE module dynamically selects the most relevant expert to process the input samples, thereby enhancing the accuracy of multi-class anomaly detection. The source code is available at
https://github.com/bebop96/DVAD.git (accessed on 1 November 2025).
Our main contributions are summarized as follows:
We propose a dynamic visual adaptation framework for multi-class anomaly detection. This includes a network plug-in, the Hyper AD Plug-in, which dynamically adjusts the model parameters based on the input data. By integrating the Mamba blocks, CNN blocks, and the proposed Hyper AD Plug-in, we achieve the balance of global, local, and dynamic features for multi-class data, thereby enhancing detection performance.
We introduce the Mixture-of-Experts (MoE) module. Through the dynamic routing mechanism of the gating network and the collaboration of experts, the model can adaptively learn features from multi-class data.
The proposed method surpasses existing methods in the multi-class anomaly detection task and achieves leading performance.
2. Related Work
2.1. Single-Class Anomaly Detection
Single-class anomaly detection methods [
10] are divided into embedding-based and reconstruction-based methods. Embedding-based methods include one-class classification (OCC) methods, as well as memory bank-based, normalizing flow-based, and network distillation-based methods.
OCC methods [
11,
12,
13,
14] compress normal features into a hyperplane [
15] or hypersphere [
16]. In multi-class anomaly detection, OCC performs poorly as features from different classes are mapped to the same hypersphere, causing projection confusion.
Memory bank-based methods [
17,
18,
19,
20] store normal features, classifying test samples as anomalous if they do not match stored features. The size and quality of the memory bank are crucial. Therefore, methods like PatchCore [
17] and PaDiM [
19] reduce the memory bank size through coreset sampling and Gaussian distribution modeling, respectively, while GraphCore [
21] enhances features by integrating graph structures. However, in multi-class anomaly detection, the memory bank grows rapidly, increasing storage costs and reducing detection efficiency.
Normalizing flow [22,23] and network distillation methods [24,25,26] detect anomalies by comparing model responses. These methods still require further evaluation for multi-class anomaly detection, as normal samples are defined differently across classes.
Reconstruction-based methods [
27,
28,
29,
30] recover normal appearances and detect anomalies by localization based on differences. Some multi-class anomaly detection methods employ reconstruction strategies, which will be discussed next.
Recently, RealNet [
31], SimpleNet [
32], and PyramidFlow [
33] have advanced reconstruction- and flow-based anomaly detection capabilities. However, most of them remain single-class oriented and rely on static feature representations. Our DVAD framework differs by dynamically adapting model parameters and jointly learning global–local–dynamic features across multiple classes.
2.2. Multi-Class Anomaly Detection
Current methods [
4,
5,
6,
7,
8,
9] in multi-class anomaly detection use static learning, with fixed model parameters across tasks. UniAD [
4] improves multi-class reconstruction using neighborhood attention but struggles with complex textures. DiAD [
5] enhances reconstruction for complex structures and large-scale defects, but its high training and deployment costs limit its practical use. Recently, MambaAD [
8], leveraging the global feature perception of Mamba [
34], achieves leading performance. Our method also adopts the Mamba architecture. However, unlike MambaAD, our method emphasizes the importance of dynamic visual adaptation.
2.3. Mamba-MoE
Mamba can efficiently handle long-range dependencies among visual features through state space modeling. Compared to CNNs, Mamba offers superior global feature perception capabilities. In contrast to Transformers, Mamba captures long-range dependencies with linear computational complexity, making it more effective in processing visual data.
The Mixture-of-Experts (MoE) model [
35,
36] divides the learning task into multiple experts, each responsible for handling a specific subset of the input data. During training, a gating network dynamically selects which experts should collaborate based on the features of the given input. MoE enables the model to adaptively leverage the strengths of different experts to address various aspects of the problem. Recently, Pioro et al. [
37] integrated MoE with Mamba, achieving performance that surpasses both Transformers and standard Mamba models. In this paper, we adopt MoE to enhance the multi-class anomaly detection performance.
2.4. Hypernetwork
A hypernetwork [
38] was first proposed to reduce the number of parameters in large-scale neural networks. The output of a hypernetwork is the set of parameters (weights) of another network. Recently, Zhang et al. [
39] proposed HyperLLaVA, which applies the hypernetwork to the multimodal large language model LLaVA [
40], significantly improving LLaVA’s performance across different downstream tasks. Compared to HyperLLaVA, the proposed
Hyper AD Plug-in incorporates class-specific embeddings, providing more explicit class guidance to the model.
3. Method
3.1. Overview
The framework of our method is shown in
Figure 1. Given an input image
X from multi-class data, we first pass it through a pre-trained CNN visual encoder, ResNet-34 [41] by default, to extract the class embedding $E$ and visual features from the first, second, and third layers, which are denoted as $F_1$, $F_2$, and $F_3$, where $F_i \in \mathbb{R}^{B \times C_i \times H_i \times W_i}$.
Next, we apply several convolutional layers to perform multi-scale feature fusion, adjusting the sizes of $F_1$, $F_2$, and $F_3$ to match. The result is a merged feature map $F \in \mathbb{R}^{B \times C \times H \times W}$, where $B$ is the batch size, $C$ is the feature dimension, and $H$ and $W$ are the spatial dimensions of $F$.
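To make this step concrete, the following is a minimal PyTorch sketch (not the authors' code) of extracting the first three ResNet-34 stages and fusing them; the 1x1 projection convolutions, the common channel width, and the choice of the fusion resolution are assumptions for illustration.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet34


class MultiScaleFusion(nn.Module):
    """Extract F1-F3 from a pre-trained ResNet-34 and merge them into F."""

    def __init__(self, fused_dim=256):
        super().__init__()
        backbone = resnet34(weights="IMAGENET1K_V1")
        # Stem plus the first three residual stages of ResNet-34.
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1,
                                  backbone.relu, backbone.maxpool)
        self.stage1, self.stage2, self.stage3 = (
            backbone.layer1, backbone.layer2, backbone.layer3)
        # 1x1 convolutions project each scale to a common channel width.
        self.proj = nn.ModuleList([nn.Conv2d(c, fused_dim, kernel_size=1)
                                   for c in (64, 128, 256)])

    def forward(self, x):
        f1 = self.stage1(self.stem(x))   # (B,  64, H/4,  W/4)
        f2 = self.stage2(f1)             # (B, 128, H/8,  W/8)
        f3 = self.stage3(f2)             # (B, 256, H/16, W/16)
        # Resize every scale to the middle resolution and sum the projections.
        target = f2.shape[-2:]
        fused = sum(nn.functional.interpolate(p(f), size=target,
                                              mode="bilinear",
                                              align_corners=False)
                    for p, f in zip(self.proj, (f1, f2, f3)))
        return (f1, f2, f3), fused       # pre-trained features and merged map F
```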
Subsequently, the feature map $F$ is passed through the global–local–dynamic (GLD) module, which generates predictions for the feature maps at different resolutions, denoted as $\hat{F}_i$ ($i = 1, 2, 3$). The GLD module integrates three sub-components—the Mamba block, the CNN block, and the proposed Hyper AD Plug-in—to extract global, local, and dynamic features, respectively. These components are detailed in the following subsections.
The total loss function is the sum of the reconstruction losses over the feature maps predicted by the GLD module:
$$\mathcal{L} = \sum_{i=1}^{3} \left\| \hat{F}_i - F_i \right\|_2^2,$$
where $\hat{F}_i$ is the GLD prediction of the $i$-th pre-trained feature map $F_i$.
In the GLD module, we simultaneously extract the global ($F_g$), local ($F_l$), and dynamic ($F_d$) features using the Mamba block, the CNN block, and the proposed Hyper AD Plug-in, respectively. These features are then combined through the Mixture-of-Experts (MoE) module. The MoE module uses a dynamic routing mechanism within its gating network to adjust expert responses based on the input data. Finally, the GLD module outputs the collaborative result of the MoE module, achieving effective feature extraction for multi-class data by balancing global, local, and dynamic features. In the following subsections, we describe each module in detail.
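Assuming the MSE form of the reconstruction loss reconstructed above, the training objective can be sketched as follows; `encoder` and `gld` are hypothetical callables standing in for the pre-trained encoder and the GLD module.

```python
import torch.nn.functional as F_t


def reconstruction_loss(pred_feats, target_feats):
    """Sum of per-scale MSE losses between GLD predictions and pre-trained features."""
    return sum(F_t.mse_loss(p, t) for p, t in zip(pred_feats, target_feats))


# Illustrative training step (shapes and module names are assumptions):
#   (f1, f2, f3), fused = encoder(x)      # pre-trained F1..F3 and merged map F
#   preds = gld(fused)                    # predicted feature maps \hat{F}_1..\hat{F}_3
#   loss = reconstruction_loss(preds, (f1, f2, f3))
```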
3.2. Mamba Block
Mamba is a variant of state space models (SSMs) [
34,
42]. Given an input sequence $x = (x_1, x_2, \dots, x_L)$, it is transformed through hidden states $h = (h_1, h_2, \dots, h_L)$, producing an output sequence $y = (y_1, y_2, \dots, y_L)$. Since the SSM predicts the entire sequence $y$ directly, it captures global features more effectively than CNNs:
$$h_t = A h_{t-1} + B x_t, \qquad y_t = C h_t,$$
where $h_t$ denotes the hidden state at step $t$, while $A$, $B$, and $C$ are the state transition, input, and output matrices, respectively.
The feature extraction process of the Mamba block is shown in Figure 1. For the input feature $F$, we flatten it into a sequence, apply a linear layer followed by a convolutional layer, and process the result with the SwiGLU [43] activation function. The sequence is then passed through the SSM module to obtain the output sequence. A bypass branch with a linear layer and SwiGLU activation is also applied to the input feature. Finally, the outputs from both branches are concatenated and passed through a linear layer to produce the global feature $F_g$, with $F_g \in \mathbb{R}^{B \times C \times H \times W}$.
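For illustration, a minimal PyTorch sketch of this data flow is given below; it uses a naive, non-selective SSM recurrence in place of Mamba's selective scan, and the depthwise convolution, state dimension, and module names are assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn


class SwiGLU(nn.Module):
    """Swish-gated linear unit: silu(x W_g) * (x W_v)."""

    def __init__(self, dim):
        super().__init__()
        self.gate, self.value = nn.Linear(dim, dim), nn.Linear(dim, dim)

    def forward(self, x):
        return nn.functional.silu(self.gate(x)) * self.value(x)


class NaiveSSM(nn.Module):
    """Toy linear recurrence h_t = A h_{t-1} + B x_t, y_t = C h_t (diagonal A)."""

    def __init__(self, dim, state_dim=16):
        super().__init__()
        self.A = nn.Parameter(torch.full((state_dim,), 0.9))
        self.B = nn.Linear(dim, state_dim, bias=False)
        self.C = nn.Linear(state_dim, dim, bias=False)

    def forward(self, x):                       # x: (B, L, dim)
        h = x.new_zeros(x.size(0), self.A.numel())
        ys = []
        for t in range(x.size(1)):
            h = self.A * h + self.B(x[:, t])
            ys.append(self.C(h))
        return torch.stack(ys, dim=1)


class MambaBlockSketch(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.in_proj = nn.Linear(dim, dim)
        self.conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.act = SwiGLU(dim)
        self.ssm = NaiveSSM(dim)
        self.bypass = nn.Sequential(nn.Linear(dim, dim), SwiGLU(dim))
        self.out_proj = nn.Linear(2 * dim, dim)

    def forward(self, f):                       # f: (B, C, H, W)
        b, c, h, w = f.shape
        x = f.flatten(2).transpose(1, 2)        # flatten to a sequence (B, L, C)
        main = self.in_proj(x)
        main = self.conv(main.transpose(1, 2)).transpose(1, 2)
        main = self.ssm(self.act(main))         # SwiGLU, then the SSM, as described above
        side = self.bypass(x)                   # bypass branch: linear + SwiGLU
        y = self.out_proj(torch.cat([main, side], dim=-1))
        return y.transpose(1, 2).reshape(b, c, h, w)   # global feature F_g
```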
3.3. Hyper AD Plug-In
As shown in Figure 1, the class embedding $E$ is expanded to match the dimensions of $F$ and added to $F$, thereby incorporating class-specific information: $F \leftarrow F + \mathrm{Expand}(E)$. Subsequently, the Hyper AD Plug-in module is applied to extract the dynamic feature $F_d$. This module comprises a feature extraction network $N$ and a hypernetwork $H$, with $H$ dynamically adjusting the parameters of $N$ based on the input feature $F$.
The network $N$ adopts an Adapter structure [44] with a bottleneck layer and a residual connection. We first reshape $F$ into $\mathbb{R}^{B \times L \times C}$, where $L = H \times W$ represents the number of features and $C$ represents the channel dimension. $F$ is first reduced to the bottleneck feature $F_{down} \in \mathbb{R}^{B \times L \times d}$ ($d < C$) by the bottleneck layer; $F_{down}$ is then expanded back to the channel dimension $C$, with
$$F_d = F + \sigma(F W_1 + b_1) W_2 + b_2,$$
where $F_d$ is the dynamic feature output and $\sigma$ denotes the activation function, while $W_1$, $b_1$, $W_2$, and $b_2$ are dynamically generated by $H$ based on the input feature $F$. We first compute $(W_1, b_1)$ as follows:
$$(W_1, b_1) = H(z_1). \qquad (11)$$
To compute (11), we first adjust the dimension of the feature $F$ and perform a linear mapping into the latent space $Z$, where $Z$ is the predefined latent space of fixed dimension. We compute the mean of $F$ over the $L$ feature positions as $\bar{F}$. $\bar{F}$ is used to compute the latent code $z_1$, with
$$z_1 = \mathrm{Linear}(\bar{F}),$$
where $z_1 \in Z$. Using $z_1$, $(W_1, b_1)$ is computed. Similarly, the variables $(W_2, b_2)$ are obtained using the bottleneck feature $F_{down}$ as the input to $H$, where
$$(W_2, b_2) = H(z_2), \qquad z_2 = \mathrm{Linear}(\bar{F}_{down}).$$
The pseudocode of the
Hyper AD Plug-in is in Algorithm 1.
| Algorithm 1 Hyper AD Plug-in |
- Input: Training sample $X$; weights and biases of the linear layers; predefined bottleneck and latent dimensions.
- Output: Dynamic feature $F_d$.
- 1: Feature Dimension Reduction:
- 2: Pre-trained features $F_1, F_2, F_3 \leftarrow \mathrm{Encoder}(X)$
- 3: Obtain the feature map $F$ by merging $F_1, F_2, F_3$
- 4: Incorporate class-specific information: $F \leftarrow F + \mathrm{Expand}(E)$
- 5: Map feature $F$ into the latent space $Z$
- 6: Mean feature $\bar{F}$
- 7: Compute $z_1 = \mathrm{Linear}(\bar{F})$
- 8: Compute $(W_1, b_1) = H(z_1)$
- 9: $F_{down} = \sigma(F W_1 + b_1)$
- 10: Feature Dimension Expansion:
- 11: Map feature $F_{down}$ into the latent space $Z$
- 12: Mean feature $\bar{F}_{down}$
- 13: Compute $z_2 = \mathrm{Linear}(\bar{F}_{down})$
- 14: Compute $(W_2, b_2) = H(z_2)$
- 15: $F_{up} = F_{down} W_2 + b_2$
- 16: $F_d = F + F_{up}$
3.4. The Mixture-of-Experts (MoE) Module
As shown in Figure 1, the local feature $F_l$ is extracted through the CNN block. Then, by merging the three features $F_g$, $F_l$, and $F_d$, we obtain the mixed feature $F_m$.
We utilize the Mixture-of-Experts (MoE) module to process $F_m$ in order to balance the global, local, and dynamic features. The MoE module consists of $M$ experts and a gating network. Each expert processes $F_m$ independently, and the results are weighted by the gating network. The gating network computes a set of probabilities for the experts, which are then used to combine the experts' outputs via
$$G = \mathrm{Softmax}\big(\mathrm{Gate}(F_m)\big),$$
where $G$ is the output weight map of the gating network. The output of each expert is obtained as $E_i = \mathrm{Expert}_i(F_m)$, $i = 1, \dots, M$, where $M$ denotes the number of experts in the MoE module. Each expert consists of a convolution layer with independent parameters. Then, the MoE module performs a weighted sum of the expert outputs based on $G$, resulting in the following output feature:
$$F_{MoE} = \sum_{i=1}^{M} G_i \odot E_i,$$
where $G_i$ is the weight map assigned by the gating network to the $i$-th expert, $E_i$ is the output feature map of the $i$-th expert, and $\odot$ represents the Hadamard product. Finally, the GLD module outputs the predicted feature maps $\hat{F}_i$ based on $F_{MoE}$.
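A minimal PyTorch sketch of this gating-and-weighted-sum scheme is given below; the 3x3 expert convolutions, the softmax-normalized gate, and the default of four experts are assumptions for illustration.

```python
import torch
import torch.nn as nn


class ConvMoESketch(nn.Module):
    """M convolutional experts combined by Hadamard-weighted gating maps."""

    def __init__(self, channels, num_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Conv2d(channels, channels, kernel_size=3, padding=1)
             for _ in range(num_experts)])
        # Gating network: one weight map per expert, normalized with softmax.
        self.gate = nn.Conv2d(channels, num_experts, kernel_size=1)

    def forward(self, f_mix):                           # mixed feature F_m
        g = torch.softmax(self.gate(f_mix), dim=1)      # (B, M, H, W) weight maps G
        out = 0
        for i, expert in enumerate(self.experts):
            e_i = expert(f_mix)                         # expert output E_i
            out = out + g[:, i:i + 1] * e_i             # Hadamard-weighted sum over experts
        return out                                      # F_MoE
```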
3.5. Abnormal Score
For a sample $X_{test}$ from the test dataset $\mathcal{D}_{test}$, the feature maps $F_1$, $F_2$, and $F_3$ are obtained based on (1). Then we merge them into the feature map $F$. Next, based on (2), the GLD module predicts the feature maps $\hat{F}_i$ ($i = 1, 2, 3$). We compute the anomaly score map $S$ for the test sample $X_{test}$ based on the mean squared error between the GLD output feature maps and the pre-trained feature maps:
$$S = \frac{1}{3} \sum_{i=1}^{3} \mathrm{Upsample}\!\left( \left\| \hat{F}_i - F_i \right\|_2^2 \right),$$
where $\mathrm{Upsample}(\cdot)$ resizes each per-scale error map to match the dimensions of the input image.
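Assuming the per-scale squared errors are upsampled and averaged as reconstructed above, the scoring step can be sketched as follows; taking the maximum of the score map as the image-level score is an additional assumption.

```python
import torch
import torch.nn.functional as F_t


def anomaly_score_map(pred_feats, target_feats, image_size):
    """Pixel-level score map S and an image-level score for one batch."""
    maps = []
    for p, t in zip(pred_feats, target_feats):
        err = ((p - t) ** 2).mean(dim=1, keepdim=True)          # per-pixel MSE, (B, 1, h, w)
        maps.append(F_t.interpolate(err, size=image_size,
                                    mode="bilinear", align_corners=False))
    s = torch.stack(maps, dim=0).mean(dim=0)                    # average over the three scales
    return s, s.flatten(1).max(dim=1).values                    # map S and image-level score
```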
4. Experiments
4.1. Experimental Settings
4.1.1. Datasets
Assume the dataset contains $N$ classes, with the training dataset for each class denoted as $\mathcal{D}_n^{train}$ and the testing dataset as $\mathcal{D}_n^{test}$ ($n = 1, \dots, N$). We collect the training samples from all $N$ classes to form the unified training dataset $\mathcal{D}^{train} = \bigcup_{n=1}^{N} \mathcal{D}_n^{train}$ and, analogously, the unified testing dataset $\mathcal{D}^{test} = \bigcup_{n=1}^{N} \mathcal{D}_n^{test}$. We utilized two widely used industrial image datasets—MVTec AD [
45] and VisA [
46]—to evaluate the performance of our method.
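As a minimal illustration of this multi-class protocol, the per-class training sets can simply be concatenated into a single loader; the dataset objects and the batch size below are placeholders, not the paper's configuration.

```python
from torch.utils.data import ConcatDataset, DataLoader


def build_multiclass_loader(per_class_datasets, batch_size=16):
    """Merge the N per-class normal training sets into one unified training set."""
    train_all = ConcatDataset(per_class_datasets)   # D_train = union of the per-class D_train^n
    return DataLoader(train_all, batch_size=batch_size, shuffle=True)
```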
4.1.2. Comparison Methods
First, we compare the proposed method with multi-class anomaly detection methods, including UniAD [
4], DiAD [
5], ViTAD [
7], and MambaAD [
8]. Additionally, we also select mainstream single-class anomaly detection methods for comparison, including DRAEM [
27], SimpleNet [
32], RealNet [
31], CFA [
18], CFLOW-AD [
22], PyramidFlow [
33], RD [
25], and RD++ [
47].
4.1.3. Evaluation Metrics
We evaluate the detection performance at both the image level and the pixel level. The metrics include (1) mAU-ROC (mean area under the ROC curve), (2) mAP (mean average precision), and (3) mF1-max (mean of the maximum F1 score). Additionally, we use (4) mAU-PRO (mean area under the per-region-overlap curve) and (5) mIoU-max (mean of the maximum intersection over union) to evaluate the model's localization performance.
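A minimal sketch of the image-level metrics, assuming scikit-learn, is given below; pixel-level variants follow the same pattern on flattened score maps, while mAU-PRO and mIoU-max require region-level computations that are omitted here.

```python
import numpy as np
from sklearn.metrics import (average_precision_score, precision_recall_curve,
                             roc_auc_score)


def image_level_metrics(labels, scores):
    """labels: 0/1 per image (1 = anomalous); scores: predicted anomaly scores."""
    auroc = roc_auc_score(labels, scores)
    ap = average_precision_score(labels, scores)
    precision, recall, _ = precision_recall_curve(labels, scores)
    f1 = 2 * precision * recall / np.clip(precision + recall, 1e-8, None)
    return auroc, ap, f1.max()          # AU-ROC, AP, and F1-max for one class
```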
4.1.4. Implementation Details
We resized the input images to
pixels and normalized them using the mean and standard deviation of the ImageNet [
48] dataset, i.e., mean = [0.485, 0.456, 0.406] and std = [0.229, 0.224, 0.225]. We trained the models for 100 epochs using the Adam [
49] optimizer with a learning rate of
and a decay rate of
. The batch size is
. The pre-trained encoder utilized ResNet-34 [
41]. The pre-trained features include three resolutions, with feature map sizes of
,
, and
. The number of experts in the MoE module is $M = 4$ by default (see Section 5.3).
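For reference, a minimal sketch of the preprocessing and optimizer setup is given below; the ImageNet normalization statistics, the Adam optimizer, and the 100-epoch schedule follow the text, whereas the image size, learning rate, and weight decay are placeholders rather than the paper's exact values.

```python
import torch
from torchvision import transforms

IMAGE_SIZE = 256                         # placeholder resolution, not the paper's value
preprocess = transforms.Compose([
    transforms.Resize((IMAGE_SIZE, IMAGE_SIZE)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])


def make_optimizer(model, lr=1e-4, weight_decay=1e-4):   # placeholder hyperparameters
    return torch.optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)
```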
4.2. Experimental Results and Analysis
As shown in
Table 1, our method achieves leading performance at both the image and pixel levels on the MVTec AD dataset. First, compared to single-class anomaly detection methods, our method (98.8%) surpasses DRAEM (54.5%), SimpleNet (95.4%), RealNet (84.8%), CFA (57.6%), CFLOW-AD (91.6%), PyramidFlow (70.2%), RD (93.6%), and RD++ (97.9%) in terms of the image-level mAU-ROC metric. Additionally, when compared to multi-class anomaly detection methods, including DiAD (88.9%), ViTAD (98.3%), MambaAD (97.8%), and UniAD (92.5%), our method also outperforms them. Notably, despite both our method and MambaAD utilizing Mamba as the basic architecture, our method outperforms MambaAD due to the proposed dynamic visual adaptation framework.
As shown in
Table 2, we also evaluated the multi-class anomaly detection performance on the VisA dataset. Due to the inherent difficulty of detection in the VisA dataset, the performance of single-class anomaly detection methods significantly decreased. In contrast, multi-class anomaly detection methods performed better in maintaining higher detection accuracy. Although the performance of our method on this dataset has slightly decreased, it still outperforms other methods.
4.3. Visualization of Detection Results
As shown in
Figure 2, we present the defect detection results for 15 classes of the MVTec AD dataset. The proposed method demonstrates strong detection capabilities for defects in both object- and texture-class images.
5. Ablation Study
5.1. Backbone
As shown in
Table 3, we selected common pre-trained CNN backbones, including ResNet-18 [
41], ResNet-34, ResNet-50, and WideResNet-50 [
50]. The results indicate that shallow backbones, such as ResNet-18, do not provide sufficiently rich visual features, resulting in suboptimal performance. In contrast, ResNet-34, ResNet-50, and WideResNet-50 achieve higher accuracy. Furthermore, although WideResNet-50 demonstrates the best overall performance, we opted to use ResNet-34 for two reasons: First, the detection performance of ResNet-34 is nearly identical to that of WideResNet-50, yet its architecture is simpler, thereby consuming fewer resources during deployment. Second, we aim to maintain consistency with the backbone settings of MambaAD, thereby emphasizing the effectiveness of our proposed dynamic visual adaptation framework, which integrates the
Hyper AD Plug-in and MoE.
5.2. Feature Extraction Process
As shown in
Table 4, we analyze the feature extraction process with the image-level mAU-ROC metric as an example.
When using Mamba blocks, CNN blocks, or the Hyper AD Plug-in individually to extract features, the model's detection accuracies are 97.8%, 97.5%, and 97.0%, respectively. These are lower than the default setting, which uses all modules together (98.8%). This highlights the importance of the synergistic effect among the modules. Integrating all three—Mamba blocks, CNN blocks, and the Hyper AD Plug-in—allows the model to balance global and local features and adjust dynamically, achieving optimal detection performance.
When combining subsets of these modules, detection performance improves but remains below the default setting (98.8%). Specifically, combining (Mamba and the CNN), (the CNN and the Hyper AD Plug-in), and (Mamba and the Hyper AD Plug-in) achieves mAU-ROC scores of 98.3%, 98.2%, and 98.0%, respectively. This confirms that all three modules—Mamba, the CNN, and the Hyper AD Plug-in—are essential for optimal performance, and omitting any one of them reduces the model’s accuracy.
5.3. The Mixture-of-Experts (MoE) Module
We explored the number of experts for MoE. As shown in
Table 5, we set the number of experts to $M = 1$, $2$, $4$, and $8$. When $M = 1$, the MoE module is essentially equivalent to a single convolution layer, achieving an image-level mAU-ROC score of 98.1%. As $M$ increases, the model's detection performance progressively improves: with $M = 2$, $4$, and 8, the image-level mAU-ROC scores are 98.4%, 98.8%, and 98.9%, respectively. These results lead to two key conclusions:
Leveraging the MoE dynamic routing mechanism, the model benefits from the collaboration of multiple experts to perform dynamic feature extraction across different classes. This collaborative strategy results in improved performance for multi-class anomaly detection.
Increasing the number of experts correlates with better detection performance; for instance, $M = 8$ outperforms $M = 1$. However, the performance gains become marginal when $M > 4$.
6. Discussion
The proposed dynamic visual adaptation framework achieves leading performance on industrial anomaly detection benchmarks such as MVTec AD and VisA. This effectiveness is largely attributed to the intrinsic properties of industrial visual data: defects are typically local, while background textures are repetitive, and the normal samples exhibit high structural regularity. Under these conditions, the collaborative mechanism among the Mamba block, the CNN block, and Hyper AD Plug-in effectively captures both global consistency and local deviations. The Hyper AD Plug-in dynamically modulates feature representations according to the visual statistics of each class, allowing the model to adapt to intra-class variations while maintaining inter-class discrimination. Consequently, the proposed method performs well in industrial scenarios characterized by controllable lighting, limited texture complexity, and well-defined object boundaries.
Beyond industrial settings, the proposed framework also demonstrates potential in other structured domains. In medical imaging (e.g., X-ray, CT, and MRI), anatomical structures display consistent spatial layouts and morphological regularity, which are compatible with our global–local–dynamic feature modeling strategy. The multi-class design further enables unified handling of multiple anatomical regions or lesion types without retraining separate models, offering practical advantages for large-scale diagnostic systems. However, in natural images, where anomalies are often semantic or context-dependent rather than texture-based, the model faces challenges due to background clutter, scale variability, and open-world diversity. Although the Mixture-of-Experts (MoE) module provides adaptive routing among experts, its capability to handle semantic anomalies remains limited. Future research will focus on integrating large-scale pre-training and domain generalization techniques, such as foundation models and self-supervised adaptation, to improve robustness and cross-domain scalability.
From the perspective of computational efficiency, the DVAD framework is also designed with practical deployment considerations in mind. During backbone selection, we evaluated multiple ResNet variants and found that shallower networks such as ResNet-18 offer limited feature quality, whereas deeper models like ResNet-50 and WideResNet-50 achieve higher accuracy at the cost of increased complexity. The choice of ResNet-34 thus provides a good trade-off between detection performance and computational resources, balancing accuracy with inference efficiency for industrial applications. Moreover, several architectural components contribute to overall efficiency. The Hyper AD Plug-in employs lightweight bottleneck projections to achieve dynamic feature adaptation with minimal overhead. The Mixture-of-Experts (MoE) mechanism selectively activates only a subset of experts, reducing redundant computation and enabling parallel processing. The Mamba block further enhances scalability by replacing quadratic self-attention with linear-time structured state space updates. These design choices together ensure that the proposed method maintains competitive efficiency while delivering high detection accuracy, making it well-suited for real-world deployment scenarios.
7. Conclusions
This paper presents a multi-class anomaly detection method based on a dynamic visual adaptation framework. The proposed Hyper AD Plug-in enables dynamic feature extraction by adjusting network hyperparameters according to input data. Through the combination of the Mamba block, the CNN block, and Hyper AD Plug-in, the framework effectively balances global, local, and dynamic representations. In addition, the Mixture-of-Experts (MoE) module enhances adaptability by dynamically allocating feature learning across multiple experts. Experimental results on the MVTec AD and VisA datasets demonstrate that the proposed method achieves state-of-the-art performance in both image-level and pixel-level anomaly detection.
The current framework is well-suited for structured industrial data, but further work is needed to extend its generalization ability to domains with higher semantic variability. In future studies, emphasis will be placed on improving efficiency, enabling lightweight deployment, and incorporating large-scale pre-trained models for better transferability to complex real-world scenarios such as natural and medical images. These efforts will contribute to building a unified, adaptive, and efficient anomaly detection paradigm for diverse visual environments.
Author Contributions
H.G.: Methodology, Investigation, Writing—Original Draft; H.L.: Validation; F.S.: Data Curation; Z.Z.: Conceptualization, Resources, Funding Acquisition, Supervision. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded by the Beijing Municipal Natural Science Foundation, China, grant number L243018.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
The data presented in this study are available within the article.
Conflicts of Interest
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.
Abbreviations
The following abbreviations are used in this paper:
| DVAD | Dynamic visual adaptation framework for multi-class anomaly detection |
| GLD | Global–local–dynamic module (Mamba, CNN, and Hyper AD Plug-in) |
| MoE | Mixture of experts |
| SSM | Structured state space model |
| CNN | Convolutional neural network |
| SwiGLU | Swish–gated linear unit |
| AUROC | Area under the receiver operating characteristic curve |
| mAUROC | Mean AUROC across categories |
| MVTec AD | MVTec anomaly detection dataset |
| VisA | Visual anomaly (VisA) dataset |
| Hyper AD Plug-in | Proposed lightweight dynamic adapter for feature modulation |
References
- Zhang, S.; Gong, M.; Xie, Y.; Qin, A.K.; Li, H.; Gao, Y.; Ong, Y.S. Influence-Aware Attention Networks for Anomaly Detection in Surveillance Videos. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 5427–5437. [Google Scholar] [CrossRef]
- Zhao, H.; Li, Y.; He, N.; Ma, K.; Fang, L.; Li, H.; Zheng, Y. Anomaly Detection for Medical Images Using Self-Supervised and Translation-Consistent Features. IEEE Trans. Med. Imaging 2021, 40, 3641–3651. [Google Scholar] [CrossRef]
- Yao, H.; Yu, W.; Luo, W.; Qiang, Z.; Luo, D.; Zhang, X. Learning Global-Local Correspondence With Semantic Bottleneck for Logical Anomaly Detection. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 3589–3605. [Google Scholar] [CrossRef]
- You, Z.; Cui, L.; Shen, Y.; Yang, K.; Lu, X.; Zheng, Y.; Le, X. A unified model for multi-class anomaly detection. Adv. Neural Inf. Process. Syst. 2022, 35, 4571–4584. [Google Scholar]
- He, H.; Zhang, J.; Chen, H.; Chen, X.; Li, Z.; Chen, X.; Wang, Y.; Wang, C.; Xie, L. A diffusion-based framework for multi-class anomaly detection. Proc. AAAI Conf. Artif. Intell. 2024, 38, 8472–8480. [Google Scholar] [CrossRef]
- Zhang, J.; Wang, C.; Li, X.; Tian, G.; Xue, Z.; Liu, Y.; Pang, G.; Tao, D. Learning Feature Inversion for Multi-class Anomaly Detection under General-purpose COCO-AD Benchmark. arXiv 2024, arXiv:2404.10760. [Google Scholar]
- Zhang, J.; Chen, X.; Wang, Y.; Wang, C.; Liu, Y.; Li, X.; Yang, M.H.; Tao, D. Exploring plain vit reconstruction for multi-class unsupervised anomaly detection. arXiv 2023, arXiv:2312.07495. [Google Scholar]
- He, H.; Bai, Y.; Zhang, J.; He, Q.; Chen, H.; Gan, Z.; Wang, C.; Li, X.; Tian, G.; Xie, L. Mambaad: Exploring state space models for multi-class unsupervised anomaly detection. arXiv 2024, arXiv:2404.06564. [Google Scholar]
- Lu, R.; Wu, Y.; Tian, L.; Wang, D.; Chen, B.; Liu, X.; Hu, R. Hierarchical vector quantized transformer for multi-class unsupervised anomaly detection. Adv. Neural Inf. Process. Syst. 2023, 36, 8487–8500. [Google Scholar]
- Pang, G.; Shen, C.; Cao, L.; Hengel, A.V.D. Deep learning for anomaly detection: A review. ACM Comput. Surv. (CSUR) 2021, 54, 1–38. [Google Scholar] [CrossRef]
- Chen, Y.; Tian, Y.; Pang, G.; Carneiro, G. Deep one-class classification via interpolated gaussian descriptor. Proc. AAAI Conf. Artif. Intell. 2022, 36, 383–392. [Google Scholar] [CrossRef]
- Reiss, T.; Hoshen, Y. Mean-shifted contrastive loss for anomaly detection. Proc. AAAI Conf. Artif. Intell. 2023, 37, 2155–2162. [Google Scholar] [CrossRef]
- Ruff, L.; Vandermeulen, R.; Goernitz, N.; Deecke, L.; Siddiqui, S.A.; Binder, A.; Müller, E.; Kloft, M. Deep one-class classification. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; PMLR, 2018; pp. 4393–4402. [Google Scholar]
- Reiss, T.; Cohen, N.; Bergman, L.; Hoshen, Y. Panda: Adapting pretrained features for anomaly detection and segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 2806–2814. [Google Scholar]
- Schölkopf, B.; Williamson, R.C.; Smola, A.; Shawe-Taylor, J.; Platt, J. Support vector method for novelty detection. Adv. Neural Inf. Process. Syst. 1999, 12, 582–588. [Google Scholar]
- Tax, D.M.; Duin, R.P. Support vector data description. Mach. Learn. 2004, 54, 45–66. [Google Scholar] [CrossRef]
- Roth, K.; Pemula, L.; Zepeda, J.; Schölkopf, B.; Brox, T.; Gehler, P. Towards total recall in industrial anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 14318–14328. [Google Scholar]
- Lee, S.; Lee, S.; Song, B.C. Cfa: Coupled-hypersphere-based feature adaptation for target-oriented anomaly localization. IEEE Access 2022, 10, 78446–78454. [Google Scholar] [CrossRef]
- Defard, T.; Setkov, A.; Loesch, A.; Audigier, R. Padim: A patch distribution modeling framework for anomaly detection and localization. In Proceedings of the Pattern Recognition. ICPR International Workshops and Challenges: Virtual Event, 10–15 January 2021; Proceedings, Part IV. Springer: Berlin/Heidelberg, Germany, 2021; pp. 475–489. [Google Scholar]
- Cohen, N.; Hoshen, Y. Sub-image anomaly detection with deep pyramid correspondences. arXiv 2020, arXiv:2005.02357. [Google Scholar]
- Xie, G.; Wang, J.; Liu, J.; Jin, Y.; Zheng, F. Pushing the Limits of Fewshot Anomaly Detection in Industry Vision: Graphcore. In Proceedings of The Eleventh International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
- Gudovskiy, D.; Ishizaka, S.; Kozuka, K. Cflow-ad: Real-time unsupervised anomaly detection with localization via conditional normalizing flows. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2022; pp. 98–107. [Google Scholar]
- Yu, J.; Zheng, Y.; Wang, X.; Li, W.; Wu, Y.; Zhao, R.; Wu, L. Fastflow: Unsupervised anomaly detection and localization via 2d normalizing flows. arXiv 2021, arXiv:2111.07677. [Google Scholar] [CrossRef]
- Zolfaghari, M.; Sajedi, H. Unsupervised Anomaly Detection with an Enhanced Teacher for Student-Teacher Feature Pyramid Matching. In Proceedings of the 2022 27th International Computer Conference, Computer Society of Iran (CSICC), Tehran, Iran, 23–24 February 2022; IEEE: New York, NY, USA, 2022; pp. 1–4. [Google Scholar]
- Deng, H.; Li, X. Anomaly detection via reverse distillation from one-class embedding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 9737–9746. [Google Scholar]
- Salehi, M.; Sadjadi, N.; Baselizadeh, S.; Rohban, M.H.; Rabiee, H.R. Multiresolution Knowledge Distillation for Anomaly Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 14902–14912. [Google Scholar]
- Zavrtanik, V.; Kristan, M.; Skočaj, D. Draem-a discriminatively trained reconstruction embedding for surface anomaly detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 8330–8339. [Google Scholar]
- Zavrtanik, V.; Kristan, M.; Skočaj, D. Dsr–a dual subspace re-projection network for surface anomaly detection. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 539–554. [Google Scholar]
- Liu, Z.; Nie, Y.; Long, C.; Zhang, Q.; Li, G. A hybrid video anomaly detection framework via memory-augmented flow reconstruction and flow-guided frame prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 13588–13597. [Google Scholar]
- Liu, S.; Zhou, B.; Ding, Q.; Hooi, B.; Zhang, Z.; Shen, H.; Cheng, X. Time Series Anomaly Detection With Adversarial Reconstruction Networks. IEEE Trans. Knowl. Data Eng. 2023, 35, 4293–4306. [Google Scholar] [CrossRef]
- Zhang, X.; Xu, M.; Zhou, X. RealNet: A feature selection network with realistic synthetic anomaly for anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 16699–16708. [Google Scholar]
- Liu, Z.; Zhou, Y.; Xu, Y.; Wang, Z. Simplenet: A simple network for image anomaly detection and localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 20402–20411. [Google Scholar]
- Lei, J.; Hu, X.; Wang, Y.; Liu, D. Pyramidflow: High-resolution defect contrastive localization using pyramid normalizing flow. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 14143–14152. [Google Scholar]
- Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv 2023, arXiv:2312.00752. [Google Scholar] [CrossRef]
- Jacobs, R.A.; Jordan, M.I.; Nowlan, S.J.; Hinton, G.E. Adaptive mixtures of local experts. Neural Comput. 1991, 3, 79–87. [Google Scholar] [CrossRef] [PubMed]
- Shazeer, N.; Mirhoseini, A.; Maziarz, K.; Davis, A.; Le, Q.; Hinton, G.; Dean, J. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv 2017, arXiv:1701.06538. [Google Scholar]
- Pióro, M.; Ciebiera, K.; Król, K.; Ludziejewski, J.; Krutul, M.; Krajewski, J.; Antoniak, S.; Miłoś, P.; Cygan, M.; Jaszczur, S. Moe-mamba: Efficient selective state space models with mixture of experts. arXiv 2024, arXiv:2401.04081. [Google Scholar]
- Ha, D.; Dai, A.M.; Le, Q.V. HyperNetworks. In Proceedings of the 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, 24–26 April 2017. [Google Scholar]
- Zhang, W.; Lin, T.; Liu, J.; Shu, F.; Li, H.; Zhang, L.; He, W.; Zhou, H.; Lv, Z.; Jiang, H.; et al. HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models. arXiv 2024, arXiv:2403.13447. [Google Scholar] [CrossRef]
- Liu, H.; Li, C.; Wu, Q.; Lee, Y.J. Visual instruction tuning. Adv. Neural Inf. Process. Syst. 2024, 36, 34892–34916. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
- Gu, A.; Goel, K.; Ré, C. Efficiently Modeling Long Sequences with Structured State Spaces. In Proceedings of the The International Conference on Learning Representations (ICLR), Virtual, 25–29 April 2022. [Google Scholar]
- Ramachandran, P.; Zoph, B.; Le, Q.V. Searching for Activation Functions. arXiv 2017, arXiv:1710.05941. [Google Scholar] [CrossRef]
- Houlsby, N.; Giurgiu, A.; Jastrzebski, S.; Morrone, B.; De Laroussilhe, Q.; Gesmundo, A.; Attariyan, M.; Gelly, S. Parameter-efficient transfer learning for NLP. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; PMLR: Baltimore, MD, USA, 2019; pp. 2790–2799. [Google Scholar]
- Bergmann, P.; Fauser, M.; Sattlegger, D.; Steger, C. MVTec AD–A comprehensive real-world dataset for unsupervised anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9592–9600. [Google Scholar]
- Zou, Y.; Jeong, J.; Pemula, L.; Zhang, D.; Dabeer, O. SPot-the-Difference Self-supervised Pre-training for Anomaly Detection and Segmentation. In Proceedings of the Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, 23–27 October 2022; Proceedings, Part XXX. Springer: Berlin/Heidelberg, Germany, 2022; pp. 392–408. [Google Scholar]
- Tien, T.D.; Nguyen, A.T.; Tran, N.H.; Huy, T.D.; Duong, S.; Nguyen, C.D.T.; Truong, S.Q. Revisiting reverse distillation for anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 24511–24520. [Google Scholar]
- Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; IEEE: New York, NY, USA, 2009; pp. 248–255. [Google Scholar]
- Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
- Zagoruyko, S.; Komodakis, N. Wide Residual Networks. In Proceedings of the British Machine Vision Conference, York, UK, 19–22 September 2016; British Machine Vision Association: Sheffield, UK, 2016. [Google Scholar]
| Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).