1. Introduction
Visual anomaly detection (VAD), which recognizes anomalies after training only on unlabeled normal images, is an intriguing topic in computer vision because collecting anomalous samples can be challenging. During the training phase, VAD methods learn normal patterns solely from normal images. In the testing phase, anomalies are detected by identifying discrepancies between the test image and the learned normal patterns. In applications such as video surveillance [
1], medical image analysis [
2], and industrial defect detection [
3], although anomalies are rare, the demand for their precise detection makes VAD an indispensable technology.
Currently, most VAD methods focus on single-class detection tasks, training a separate model for each class. However, in real-world applications, especially in dynamic production environments such as flexible manufacturing, data often involve multiple classes. Existing methods typically rely on class-separation strategies, which not only increase the complexity of model deployment but also lead to inefficient resource usage. Moreover, when such methods are applied to multiple classes with a single set of shared network parameters, they struggle to efficiently capture features from multi-class data. Therefore, studying multi-class anomaly detection methods is both necessary and valuable.
There has been some work [
4,
5,
6,
7,
8,
9] on multi-class anomaly detection, and most of these methods focus on enhancing the multi-class feature extraction capability. For example, UniAD [
4] and DiAD [
5] detect anomalies by reconstructing the normal appearance of input data and comparing it with the original data. Recently, MambaAD [
8] introduced the Mamba framework into anomaly detection, achieving leading performance thanks to its enhanced global feature extraction and efficient computation. However, despite being effective, these methods still adhere to a static learning strategy: the trained model shares the same network parameters across all multi-class anomaly detection tasks. This static strategy prevents the model from dynamically adjusting its responses to the input data, thereby limiting its performance on multi-class data.
To address this limitation, this paper proposes a dynamic visual adaptation framework for multi-class anomaly detection. First, as shown in
Figure 1, we select the Mamba network as the base architecture to ensure powerful feature extraction capabilities. Next, we propose a network plug-in, the
Hyper AD Plug-in, which dynamically adjusts the model parameters based on the characteristics of input data. Through the collaboration of the Mamba blocks, CNN blocks, and the proposed
Hyper AD Plug-in, we extract the global, local, and dynamic features for multi-class data. Furthermore, we introduce the Mixture-of-Experts (MoE) module, which consists of multiple experts and a gating network responsible for the dynamic routing mechanism. Through this mechanism, the MoE module dynamically selects the most relevant expert to process the input samples, thereby enhancing the accuracy of multi-class anomaly detection. The source code is available at
https://github.com/bebop96/DVAD.git (accessed on 1 November 2025).
Our main contributions are summarized as follows:
We propose a dynamic visual adaptation framework for multi-class anomaly detection. This includes a network plug-in, the Hyper AD Plug-in, which dynamically adjusts the model parameters based on the input data. By integrating the Mamba blocks, CNN blocks, and the proposed Hyper AD Plug-in, we achieve the balance of global, local, and dynamic features for multi-class data, thereby enhancing detection performance.
We introduce the Mixture-of-Experts (MoE) module. Through the dynamic routing mechanism of the gating network and the collaboration of experts, the model can adaptively learn features from multi-class data.
The proposed method surpasses existing methods in the multi-class anomaly detection task and achieves leading performance.
2. Related Work
2.1. Single-Class Anomaly Detection
Single-class anomaly detection methods [
10] are divided into embedding-based and reconstruction-based methods. Embedding-based methods include one-class classification (OCC) methods, as well as memory bank-based, normalizing flow-based, and network distillation-based methods.
OCC methods [
11,
12,
13,
14] compress normal features into a hyperplane [
15] or hypersphere [
16]. In multi-class anomaly detection, OCC performs poorly as features from different classes are mapped to the same hypersphere, causing projection confusion.
Memory bank-based methods [
17,
18,
19,
20] store normal features, classifying test samples as anomalous if they do not match stored features. The size and quality of the memory bank are crucial. Therefore, methods like PatchCore [
17] and PaDiM [
19] reduce the memory bank size through coreset sampling and Gaussian distribution modeling, respectively, while GraphCore [
21] enhances features by integrating graph structures. However, in multi-class anomaly detection, the memory bank grows rapidly, increasing storage costs and reducing detection efficiency.
Normalizing flow [22,23] and network distillation methods [24,25,26] detect anomalies by comparing model responses. These methods still require further evaluation for multi-class anomaly detection, as normal samples are defined differently across classes.
Reconstruction-based methods [
27,
28,
29,
30] recover normal appearances and detect anomalies by localization based on differences. Some multi-class anomaly detection methods employ reconstruction strategies, which will be discussed next.
Recently, RealNet [
31], SimpleNet [
32], and PyramidFlow [
33] have advanced reconstruction- and flow-based anomaly detection capabilities. However, most of them remain single-class oriented and rely on static feature representations. Our DVAD framework differs by dynamically adapting model parameters and jointly learning global–local–dynamic features across multiple classes.
2.2. Multi-Class Anomaly Detection
Current methods [
4,
5,
6,
7,
8,
9] in multi-class anomaly detection use static learning, with fixed model parameters across tasks. UniAD [
4] improves multi-class reconstruction using neighborhood attention but struggles with complex textures. DiAD [
5] enhances reconstruction for complex structures and large-scale defects, but its high training and deployment costs limit its practical use. Recently, MambaAD [
8], leveraging the global feature perception of Mamba [
34], achieves leading performance. Our method also adopts the Mamba architecture. However, unlike MambaAD, our method emphasizes the importance of dynamic visual adaptation.
2.3. Mamba-MoE
Mamba can efficiently handle long-range dependencies among visual features through state space modeling. Compared to CNNs, Mamba offers superior global feature perception capabilities. In contrast to Transformers, Mamba captures long-range dependencies with linear computational complexity, making it more effective in processing visual data.
The Mixture-of-Experts (MoE) model [
35,
36] divides the learning task into multiple experts, each responsible for handling a specific subset of the input data. During training, a gating network dynamically selects which experts should collaborate based on the features of the given input. MoE enables the model to adaptively leverage the strengths of different experts to address various aspects of the problem. Recently, Pioro et al. [
37] integrated MoE with Mamba, achieving performance that surpasses both Transformers and standard Mamba models. In this paper, we adopt MoE to enhance the multi-class anomaly detection performance.
2.4. Hypernetwork
A hypernetwork [
38] was first proposed to reduce the number of parameters in large-scale neural networks. The output of a hypernetwork is the set of parameters (weights) of another network. Recently, Zhang et al. [
39] proposed HyperLLaVA, which applies the hypernetwork to the multimodal large language model LLaVA [
40], significantly improving LLaVA’s performance across different downstream tasks. Compared to HyperLLaVA, the proposed
Hyper AD Plug-in incorporates class-specific embeddings, providing more explicit class guidance to the model.
3. Method
3.1. Overview
The framework of our method is shown in
Figure 1. Given an input image
X from multi-class data, we first pass it through a pre-trained CNN visual encoder, ResNet-34 [41] by default, to extract the class embedding $E$ and visual features from the first, second, and third layers, which are denoted as $F_1$, $F_2$, and $F_3$, where $F_i \in \mathbb{R}^{B \times C_i \times H_i \times W_i}$.
Next, we apply several convolutional layers to perform multi-scale feature fusion, adjusting the sizes of $F_1$, $F_2$, and $F_3$ to match. The result is a merged feature map $F \in \mathbb{R}^{B \times C \times H \times W}$, where $B$ is the batch size, $C$ is the feature dimension, and $H$ and $W$ are the spatial dimensions of $F$.
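To make this step concrete, the following is a minimal PyTorch sketch (not the authors' code) of extracting the first three ResNet-34 stages and fusing them; the 1x1 projection convolutions, the common channel width, and the choice of the fusion resolution are assumptions for illustration.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet34


class MultiScaleFusion(nn.Module):
    """Extract F1-F3 from a pre-trained ResNet-34 and merge them into F."""

    def __init__(self, fused_dim=256):
        super().__init__()
        backbone = resnet34(weights="IMAGENET1K_V1")
        # Stem plus the first three residual stages of ResNet-34.
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1,
                                  backbone.relu, backbone.maxpool)
        self.stage1, self.stage2, self.stage3 = (
            backbone.layer1, backbone.layer2, backbone.layer3)
        # 1x1 convolutions project each scale to a common channel width.
        self.proj = nn.ModuleList([nn.Conv2d(c, fused_dim, kernel_size=1)
                                   for c in (64, 128, 256)])

    def forward(self, x):
        f1 = self.stage1(self.stem(x))   # (B,  64, H/4,  W/4)
        f2 = self.stage2(f1)             # (B, 128, H/8,  W/8)
        f3 = self.stage3(f2)             # (B, 256, H/16, W/16)
        # Resize every scale to the middle resolution and sum the projections.
        target = f2.shape[-2:]
        fused = sum(nn.functional.interpolate(p(f), size=target,
                                              mode="bilinear",
                                              align_corners=False)
                    for p, f in zip(self.proj, (f1, f2, f3)))
        return (f1, f2, f3), fused       # pre-trained features and merged map F
```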
Subsequently, the feature map $F$ is passed through the global–local–dynamic (GLD) module, which generates predictions for the feature maps at different resolutions, denoted as $\hat{F}_i$ ($i = 1, 2, 3$). The GLD module integrates three sub-components—the Mamba block, the CNN block, and the proposed Hyper AD Plug-in—to extract global, local, and dynamic features, respectively. These components are detailed in the following subsections.
The total loss function is the sum of the reconstruction losses over the feature maps predicted by the GLD module:
$$\mathcal{L} = \sum_{i=1}^{3} \left\| \hat{F}_i - F_i \right\|_2^2,$$
where $\hat{F}_i$ is the GLD prediction of the $i$-th pre-trained feature map $F_i$.
In the GLD module, we simultaneously extract the global ($F_g$), local ($F_l$), and dynamic ($F_d$) features using the Mamba block, the CNN block, and the proposed Hyper AD Plug-in, respectively. These features are then combined through the Mixture-of-Experts (MoE) module. The MoE module uses a dynamic routing mechanism within its gating network to adjust expert responses based on the input data. Finally, the GLD module outputs the collaborative result of the MoE module, achieving effective feature extraction for multi-class data by balancing global, local, and dynamic features. In the following subsections, we describe each module in detail.
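Assuming the MSE form of the reconstruction loss reconstructed above, the training objective can be sketched as follows; `encoder` and `gld` are hypothetical callables standing in for the pre-trained encoder and the GLD module.

```python
import torch.nn.functional as F_t


def reconstruction_loss(pred_feats, target_feats):
    """Sum of per-scale MSE losses between GLD predictions and pre-trained features."""
    return sum(F_t.mse_loss(p, t) for p, t in zip(pred_feats, target_feats))


# Illustrative training step (shapes and module names are assumptions):
#   (f1, f2, f3), fused = encoder(x)      # pre-trained F1..F3 and merged map F
#   preds = gld(fused)                    # predicted feature maps \hat{F}_1..\hat{F}_3
#   loss = reconstruction_loss(preds, (f1, f2, f3))
```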
3.2. Mamba Block
Mamba is a variant of state space models (SSMs) [
34,
42]. Given an input sequence $x = (x_1, x_2, \dots, x_L)$, it is transformed through hidden states $h = (h_1, h_2, \dots, h_L)$, producing an output sequence $y = (y_1, y_2, \dots, y_L)$. Since the SSM predicts the entire sequence $y$ directly, it captures global features more effectively than CNNs:
$$h_t = A h_{t-1} + B x_t, \qquad y_t = C h_t,$$
where $h_t$ denotes the hidden state at step $t$, while $A$, $B$, and $C$ are the state transition, input, and output matrices, respectively.
The feature extraction process of the Mamba block is shown in Figure 1. For the input feature $F$, we flatten it into a sequence, apply a linear layer followed by a convolutional layer, and process the result with the SwiGLU [43] activation function. The sequence is then passed through the SSM module to obtain the output sequence. A bypass branch with a linear layer and SwiGLU activation is also applied to the input feature. Finally, the outputs from both branches are concatenated and passed through a linear layer to produce the global feature $F_g$, with $F_g \in \mathbb{R}^{B \times C \times H \times W}$.
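For illustration, a minimal PyTorch sketch of this data flow is given below; it uses a naive, non-selective SSM recurrence in place of Mamba's selective scan, and the depthwise convolution, state dimension, and module names are assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn


class SwiGLU(nn.Module):
    """Swish-gated linear unit: silu(x W_g) * (x W_v)."""

    def __init__(self, dim):
        super().__init__()
        self.gate, self.value = nn.Linear(dim, dim), nn.Linear(dim, dim)

    def forward(self, x):
        return nn.functional.silu(self.gate(x)) * self.value(x)


class NaiveSSM(nn.Module):
    """Toy linear recurrence h_t = A h_{t-1} + B x_t, y_t = C h_t (diagonal A)."""

    def __init__(self, dim, state_dim=16):
        super().__init__()
        self.A = nn.Parameter(torch.full((state_dim,), 0.9))
        self.B = nn.Linear(dim, state_dim, bias=False)
        self.C = nn.Linear(state_dim, dim, bias=False)

    def forward(self, x):                       # x: (B, L, dim)
        h = x.new_zeros(x.size(0), self.A.numel())
        ys = []
        for t in range(x.size(1)):
            h = self.A * h + self.B(x[:, t])
            ys.append(self.C(h))
        return torch.stack(ys, dim=1)


class MambaBlockSketch(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.in_proj = nn.Linear(dim, dim)
        self.conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.act = SwiGLU(dim)
        self.ssm = NaiveSSM(dim)
        self.bypass = nn.Sequential(nn.Linear(dim, dim), SwiGLU(dim))
        self.out_proj = nn.Linear(2 * dim, dim)

    def forward(self, f):                       # f: (B, C, H, W)
        b, c, h, w = f.shape
        x = f.flatten(2).transpose(1, 2)        # flatten to a sequence (B, L, C)
        main = self.in_proj(x)
        main = self.conv(main.transpose(1, 2)).transpose(1, 2)
        main = self.ssm(self.act(main))         # SwiGLU, then the SSM, as described above
        side = self.bypass(x)                   # bypass branch: linear + SwiGLU
        y = self.out_proj(torch.cat([main, side], dim=-1))
        return y.transpose(1, 2).reshape(b, c, h, w)   # global feature F_g
```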
3.3. Hyper AD Plug-In
As shown in Figure 1, the class embedding $E$ is expanded to match the dimensions of $F$ and added to $F$, thereby incorporating class-specific information: $F \leftarrow F + \mathrm{Expand}(E)$. Subsequently, the Hyper AD Plug-in module is applied to extract the dynamic feature $F_d$. This module comprises a feature extraction network $N$ and a hypernetwork $H$, with $H$ dynamically adjusting the parameters of $N$ based on the input feature $F$.
The network $N$ adopts an Adapter structure [44] with a bottleneck layer and a residual connection. We first reshape $F$ into $\mathbb{R}^{B \times L \times C}$, where $L = H \times W$ represents the number of features and $C$ represents the channel dimension. $F$ is first reduced to the bottleneck feature $F_{down} \in \mathbb{R}^{B \times L \times d}$ ($d < C$) by the bottleneck layer; $F_{down}$ is then expanded back to the channel dimension $C$, with
$$F_d = F + \sigma(F W_1 + b_1) W_2 + b_2,$$
where $F_d$ is the dynamic feature output and $\sigma$ denotes the activation function, while $W_1$, $b_1$, $W_2$, and $b_2$ are dynamically generated by $H$ based on the input feature $F$. We first compute $(W_1, b_1)$ as follows:
$$(W_1, b_1) = H(z_1). \qquad (11)$$
To compute (11), we first adjust the dimension of the feature $F$ and perform a linear mapping into the latent space $Z$, where $Z$ is the predefined latent space of fixed dimension. We compute the mean of $F$ over the $L$ feature positions as $\bar{F}$. $\bar{F}$ is used to compute the latent code $z_1$, with
$$z_1 = \mathrm{Linear}(\bar{F}),$$
where $z_1 \in Z$. Using $z_1$, $(W_1, b_1)$ is computed. Similarly, the variables $(W_2, b_2)$ are obtained using the bottleneck feature $F_{down}$ as the input to $H$, where
$$(W_2, b_2) = H(z_2), \qquad z_2 = \mathrm{Linear}(\bar{F}_{down}).$$
The pseudocode of the
Hyper AD Plug-in is in Algorithm 1.
| Algorithm 1 Hyper AD Plug-in |
- Input: Training sample $X$; weights and biases of the linear layers; predefined bottleneck and latent dimensions.
- Output: Dynamic feature $F_d$.
- 1: Feature Dimension Reduction:
- 2: Pre-trained features $F_1, F_2, F_3 \leftarrow \mathrm{Encoder}(X)$
- 3: Obtain the feature map $F$ by merging $F_1, F_2, F_3$
- 4: Incorporate class-specific information: $F \leftarrow F + \mathrm{Expand}(E)$
- 5: Map feature $F$ into the latent space $Z$
- 6: Mean feature $\bar{F}$
- 7: Compute $z_1 = \mathrm{Linear}(\bar{F})$
- 8: Compute $(W_1, b_1) = H(z_1)$
- 9: $F_{down} = \sigma(F W_1 + b_1)$
- 10: Feature Dimension Expansion:
- 11: Map feature $F_{down}$ into the latent space $Z$
- 12: Mean feature $\bar{F}_{down}$
- 13: Compute $z_2 = \mathrm{Linear}(\bar{F}_{down})$
- 14: Compute $(W_2, b_2) = H(z_2)$
- 15: $F_{up} = F_{down} W_2 + b_2$
- 16: $F_d = F + F_{up}$
3.4. The Mixture-of-Experts (MoE) Module
As shown in Figure 1, the local feature $F_l$ is extracted through the CNN block. Then, by merging the three features $F_g$, $F_l$, and $F_d$, we obtain the mixed feature $F_m$.
We utilize the Mixture-of-Experts (MoE) module to process $F_m$ in order to balance the global, local, and dynamic features. The MoE module consists of $M$ experts and a gating network. Each expert processes $F_m$ independently, and the results are weighted by the gating network. The gating network computes a set of probabilities for the experts, which are then used to combine the experts' outputs via
$$G = \mathrm{Softmax}\big(\mathrm{Gate}(F_m)\big),$$
where $G$ is the output weight map of the gating network. The output of each expert is obtained as $E_i = \mathrm{Expert}_i(F_m)$, $i = 1, \dots, M$, where $M$ denotes the number of experts in the MoE module. Each expert consists of a convolution layer with independent parameters. Then, the MoE module performs a weighted sum of the expert outputs based on $G$, resulting in the following output feature:
$$F_{MoE} = \sum_{i=1}^{M} G_i \odot E_i,$$
where $G_i$ is the weight map assigned by the gating network to the $i$-th expert, $E_i$ is the output feature map of the $i$-th expert, and $\odot$ represents the Hadamard product. Finally, the GLD module outputs the predicted feature maps $\hat{F}_i$ based on $F_{MoE}$.
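A minimal PyTorch sketch of this gating-and-weighted-sum scheme is given below; the 3x3 expert convolutions, the softmax-normalized gate, and the default of four experts are assumptions for illustration.

```python
import torch
import torch.nn as nn


class ConvMoESketch(nn.Module):
    """M convolutional experts combined by Hadamard-weighted gating maps."""

    def __init__(self, channels, num_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Conv2d(channels, channels, kernel_size=3, padding=1)
             for _ in range(num_experts)])
        # Gating network: one weight map per expert, normalized with softmax.
        self.gate = nn.Conv2d(channels, num_experts, kernel_size=1)

    def forward(self, f_mix):                           # mixed feature F_m
        g = torch.softmax(self.gate(f_mix), dim=1)      # (B, M, H, W) weight maps G
        out = 0
        for i, expert in enumerate(self.experts):
            e_i = expert(f_mix)                         # expert output E_i
            out = out + g[:, i:i + 1] * e_i             # Hadamard-weighted sum over experts
        return out                                      # F_MoE
```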
3.5. Abnormal Score
For a sample $X_{test}$ from the test dataset $\mathcal{D}_{test}$, the feature maps $F_1$, $F_2$, and $F_3$ are obtained based on (1). Then we merge them into the feature map $F$. Next, based on (2), the GLD module predicts the feature maps $\hat{F}_i$ ($i = 1, 2, 3$). We compute the anomaly score map $S$ for the test sample $X_{test}$ based on the mean squared error between the GLD output feature maps and the pre-trained feature maps:
$$S = \frac{1}{3} \sum_{i=1}^{3} \mathrm{Upsample}\!\left( \left\| \hat{F}_i - F_i \right\|_2^2 \right),$$
where $\mathrm{Upsample}(\cdot)$ resizes each per-scale error map to match the dimensions of the input image.
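Assuming the per-scale squared errors are upsampled and averaged as reconstructed above, the scoring step can be sketched as follows; taking the maximum of the score map as the image-level score is an additional assumption.

```python
import torch
import torch.nn.functional as F_t


def anomaly_score_map(pred_feats, target_feats, image_size):
    """Pixel-level score map S and an image-level score for one batch."""
    maps = []
    for p, t in zip(pred_feats, target_feats):
        err = ((p - t) ** 2).mean(dim=1, keepdim=True)          # per-pixel MSE, (B, 1, h, w)
        maps.append(F_t.interpolate(err, size=image_size,
                                    mode="bilinear", align_corners=False))
    s = torch.stack(maps, dim=0).mean(dim=0)                    # average over the three scales
    return s, s.flatten(1).max(dim=1).values                    # map S and image-level score
```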
4. Experiments
4.1. Experimental Settings
4.1.1. Datasets
Assume the dataset contains $N$ classes, with the training dataset for each class denoted as $\mathcal{D}_n^{train}$ and the testing dataset as $\mathcal{D}_n^{test}$ ($n = 1, \dots, N$). We collect the training samples from all $N$ classes to form the unified training dataset $\mathcal{D}^{train} = \bigcup_{n=1}^{N} \mathcal{D}_n^{train}$ and, analogously, the unified testing dataset $\mathcal{D}^{test} = \bigcup_{n=1}^{N} \mathcal{D}_n^{test}$. We utilized two widely used industrial image datasets—MVTec AD [
45] and VisA [
46]—to evaluate the performance of our method.
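As a minimal illustration of this multi-class protocol, the per-class training sets can simply be concatenated into a single loader; the dataset objects and the batch size below are placeholders, not the paper's configuration.

```python
from torch.utils.data import ConcatDataset, DataLoader


def build_multiclass_loader(per_class_datasets, batch_size=16):
    """Merge the N per-class normal training sets into one unified training set."""
    train_all = ConcatDataset(per_class_datasets)   # D_train = union of the per-class D_train^n
    return DataLoader(train_all, batch_size=batch_size, shuffle=True)
```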
4.1.2. Comparison Methods
First, we compare the proposed method with multi-class anomaly detection methods, including UniAD [
4], DiAD [
5], ViTAD [
7], and MambaAD [
8]. Additionally, we also select mainstream single-class anomaly detection methods for comparison, including DRAEM [
27], SimpleNet [
32], RealNet [
31], CFA [
18], CFLOW-AD [
22], PyramidFlow [
33], RD [
25], and RD++ [
47].
4.1.3. Evaluation Metrics
We evaluate the detection performance at both the image level and the pixel level. The metrics include (1) mAU-ROC (mean area under the ROC curve), (2) mAP (mean average precision), and (3) mF1-max (mean of the maximum F1 score). Additionally, we use (4) mAU-PRO (mean area under the per-region-overlap curve) and (5) mIoU-max (mean of the maximum intersection over union) to evaluate the model's localization performance.
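A minimal sketch of the image-level metrics, assuming scikit-learn, is given below; pixel-level variants follow the same pattern on flattened score maps, while mAU-PRO and mIoU-max require region-level computations that are omitted here.

```python
import numpy as np
from sklearn.metrics import (average_precision_score, precision_recall_curve,
                             roc_auc_score)


def image_level_metrics(labels, scores):
    """labels: 0/1 per image (1 = anomalous); scores: predicted anomaly scores."""
    auroc = roc_auc_score(labels, scores)
    ap = average_precision_score(labels, scores)
    precision, recall, _ = precision_recall_curve(labels, scores)
    f1 = 2 * precision * recall / np.clip(precision + recall, 1e-8, None)
    return auroc, ap, f1.max()          # AU-ROC, AP, and F1-max for one class
```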
4.1.4. Implementation Details
We resized the input images to
pixels and normalized them using the mean and standard deviation of the ImageNet [
48] dataset, i.e., mean = [0.485, 0.456, 0.406] and std = [0.229, 0.224, 0.225]. We trained the models for 100 epochs using the Adam [
49] optimizer with a learning rate of
and a decay rate of
. The batch size is
. The pre-trained encoder utilized ResNet-34 [
41]. The pre-trained features include three resolutions, with feature map sizes of
,
, and
. The number of experts in the MoE module is $M = 4$ by default (see Section 5.3).
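For reference, a minimal sketch of the preprocessing and optimizer setup is given below; the ImageNet normalization statistics, the Adam optimizer, and the 100-epoch schedule follow the text, whereas the image size, learning rate, and weight decay are placeholders rather than the paper's exact values.

```python
import torch
from torchvision import transforms

IMAGE_SIZE = 256                         # placeholder resolution, not the paper's value
preprocess = transforms.Compose([
    transforms.Resize((IMAGE_SIZE, IMAGE_SIZE)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])


def make_optimizer(model, lr=1e-4, weight_decay=1e-4):   # placeholder hyperparameters
    return torch.optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)
```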
4.2. Experimental Results and Analysis
As shown in
Table 1, our method achieves leading performance at both the image and pixel levels on the MVTec AD dataset. First, compared to single-class anomaly detection methods, our method (98.8%) surpasses DRAEM (54.5%), SimpleNet (95.4%), RealNet (84.8%), CFA (57.6%), CFLOW-AD (91.6%), PyramidFlow (70.2%), RD (93.6%), and RD++ (97.9%) in terms of the image-level mAU-ROC metric. Additionally, when compared to multi-class anomaly detection methods, including DiAD (88.9%), ViTAD (98.3%), MambaAD (97.8%), and UniAD (92.5%), our method also outperforms them. Notably, despite both our method and MambaAD utilizing Mamba as the basic architecture, our method outperforms MambaAD due to the proposed dynamic visual adaptation framework.
As shown in
Table 2, we also evaluated the multi-class anomaly detection performance on the VisA dataset. Due to the inherent difficulty of detection in the VisA dataset, the performance of single-class anomaly detection methods significantly decreased. In contrast, multi-class anomaly detection methods performed better in maintaining higher detection accuracy. Although the performance of our method on this dataset has slightly decreased, it still outperforms other methods.
4.3. Visualization of Detection Results
As shown in
Figure 2, we present the defect detection results for 15 classes of the MVTec AD dataset. The proposed method demonstrates strong detection capabilities for defects in both object- and texture-class images.
5. Ablation Study
5.1. Backbone
As shown in
Table 3, we selected common pre-trained CNN backbones, including ResNet-18 [
41], ResNet-34, ResNet-50, and WideResNet-50 [
50]. The results indicate that shallow backbones, such as ResNet-18, do not provide sufficiently rich visual features, resulting in suboptimal performance. In contrast, ResNet-34, ResNet-50, and WideResNet-50 achieve higher accuracy. Furthermore, although WideResNet-50 demonstrates the best overall performance, we opted to use ResNet-34 for two reasons: First, the detection performance of ResNet-34 is nearly identical to that of WideResNet-50, yet its architecture is simpler, thereby consuming fewer resources during deployment. Second, we aim to maintain consistency with the backbone settings of MambaAD, thereby emphasizing the effectiveness of our proposed dynamic visual adaptation framework, which integrates the
Hyper AD Plug-in and MoE.
5.2. Feature Extraction Process
As shown in
Table 4, we analyze the feature extraction process with the image-level mAU-ROC metric as an example.
When using Mamba blocks, CNN blocks, or the Hyper AD Plug-in individually to extract features, the model's detection accuracies are 97.8%, 97.5%, and 97.0%, respectively. These are lower than the default setting, which uses all modules together (98.8%). This highlights the importance of the synergistic effect among the modules. Integrating all three—Mamba blocks, CNN blocks, and the Hyper AD Plug-in—allows the model to balance global and local features and adjust dynamically, achieving optimal detection performance.
When combining subsets of these modules, detection performance improves but remains below the default setting (98.8%). Specifically, combining (Mamba and the CNN), (the CNN and the Hyper AD Plug-in), and (Mamba and the Hyper AD Plug-in) achieves mAU-ROC scores of 98.3%, 98.2%, and 98.0%, respectively. This confirms that all three modules—Mamba, the CNN, and the Hyper AD Plug-in—are essential for optimal performance, and omitting any one of them reduces the model’s accuracy.
5.3. The Mixture-of-Experts (MoE) Module
We explored the number of experts for MoE. As shown in
Table 5, we set the number of experts to $M = 1$, $2$, $4$, and $8$. When $M = 1$, the MoE module is essentially equivalent to a single convolution layer, achieving an image-level mAU-ROC score of 98.1%. As $M$ increases, the model's detection performance progressively improves: with $M = 2$, $4$, and 8, the image-level mAU-ROC scores are 98.4%, 98.8%, and 98.9%, respectively. These results lead to two key conclusions:
Leveraging the MoE dynamic routing mechanism, the model benefits from the collaboration of multiple experts to perform dynamic feature extraction across different classes. This collaborative strategy results in improved performance for multi-class anomaly detection.
Increasing the number of experts correlates with better detection performance; for instance, $M = 8$ outperforms $M = 1$. However, the performance gains become marginal when $M > 4$.
6. Discussion
The proposed dynamic visual adaptation framework achieves leading performance on industrial anomaly detection benchmarks such as MVTec AD and VisA. This effectiveness is largely attributed to the intrinsic properties of industrial visual data: defects are typically local, while background textures are repetitive, and the normal samples exhibit high structural regularity. Under these conditions, the collaborative mechanism among the Mamba block, the CNN block, and Hyper AD Plug-in effectively captures both global consistency and local deviations. The Hyper AD Plug-in dynamically modulates feature representations according to the visual statistics of each class, allowing the model to adapt to intra-class variations while maintaining inter-class discrimination. Consequently, the proposed method performs well in industrial scenarios characterized by controllable lighting, limited texture complexity, and well-defined object boundaries.
Beyond industrial settings, the proposed framework also demonstrates potential in other structured domains. In medical imaging (e.g., X-ray, CT, and MRI), anatomical structures display consistent spatial layouts and morphological regularity, which are compatible with our global–local–dynamic feature modeling strategy. The multi-class design further enables unified handling of multiple anatomical regions or lesion types without retraining separate models, offering practical advantages for large-scale diagnostic systems. However, in natural images, where anomalies are often semantic or context-dependent rather than texture-based, the model faces challenges due to background clutter, scale variability, and open-world diversity. Although the Mixture-of-Experts (MoE) module provides adaptive routing among experts, its capability to handle semantic anomalies remains limited. Future research will focus on integrating large-scale pre-training and domain generalization techniques, such as foundation models and self-supervised adaptation, to improve robustness and cross-domain scalability.
From the perspective of computational efficiency, the DVAD framework is also designed with practical deployment considerations in mind. During backbone selection, we evaluated multiple ResNet variants and found that shallower networks such as ResNet-18 offer limited feature quality, whereas deeper models like ResNet-50 and WideResNet-50 achieve higher accuracy at the cost of increased complexity. The choice of ResNet-34 thus provides a good trade-off between detection performance and computational resources, balancing accuracy with inference efficiency for industrial applications. Moreover, several architectural components contribute to overall efficiency. The Hyper AD Plug-in employs lightweight bottleneck projections to achieve dynamic feature adaptation with minimal overhead. The Mixture-of-Experts (MoE) mechanism selectively activates only a subset of experts, reducing redundant computation and enabling parallel processing. The Mamba block further enhances scalability by replacing quadratic self-attention with linear-time structured state space updates. These design choices together ensure that the proposed method maintains competitive efficiency while delivering high detection accuracy, making it well-suited for real-world deployment scenarios.
7. Conclusions
This paper presents a multi-class anomaly detection method based on a dynamic visual adaptation framework. The proposed Hyper AD Plug-in enables dynamic feature extraction by adjusting network hyperparameters according to input data. Through the combination of the Mamba block, the CNN block, and Hyper AD Plug-in, the framework effectively balances global, local, and dynamic representations. In addition, the Mixture-of-Experts (MoE) module enhances adaptability by dynamically allocating feature learning across multiple experts. Experimental results on the MVTec AD and VisA datasets demonstrate that the proposed method achieves state-of-the-art performance in both image-level and pixel-level anomaly detection.
The current framework is well-suited for structured industrial data, but further work is needed to extend its generalization ability to domains with higher semantic variability. In future studies, emphasis will be placed on improving efficiency, enabling lightweight deployment, and incorporating large-scale pre-trained models for better transferability to complex real-world scenarios such as natural and medical images. These efforts will contribute to building a unified, adaptive, and efficient anomaly detection paradigm for diverse visual environments.
Author Contributions
H.G.: Methodology, Investigation, Writing—Original Draft; H.L.: Validation; F.S.: Data Curation; Z.Z.: Conceptualization, Resources, Funding Acquisition, Supervision. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded by the Beijing Municipal Natural Science Foundation, China, grant number L243018.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
The data presented in this study are available within the article.
Conflicts of Interest
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.
Abbreviations
The following abbreviations are used in this paper:
| DVAD | Dynamic visual adaptation framework for multi-class anomaly detection |
| GLD | Global–local–dynamic module (Mamba, CNN, and Hyper AD Plug-in) |
| MoE | Mixture of experts |
| SSM | Structured state space model |
| CNN | Convolutional neural network |
| SwiGLU | Swish–gated linear unit |
| AUROC | Area under the receiver operating characteristic curve |
| mAUROC | Mean AUROC across categories |
| MVTec AD | MVTec anomaly detection dataset |
| VisA | Visual anomaly (VisA) dataset |
| Hyper AD Plug-in | Proposed lightweight dynamic adapter for feature modulation |
References
- Zhang, S.; Gong, M.; Xie, Y.; Qin, A.K.; Li, H.; Gao, Y.; Ong, Y.S. Influence-Aware Attention Networks for Anomaly Detection in Surveillance Videos. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 5427–5437. [Google Scholar] [CrossRef]
- Zhao, H.; Li, Y.; He, N.; Ma, K.; Fang, L.; Li, H.; Zheng, Y. Anomaly Detection for Medical Images Using Self-Supervised and Translation-Consistent Features. IEEE Trans. Med. Imaging 2021, 40, 3641–3651. [Google Scholar] [CrossRef]
- Yao, H.; Yu, W.; Luo, W.; Qiang, Z.; Luo, D.; Zhang, X. Learning Global-Local Correspondence With Semantic Bottleneck for Logical Anomaly Detection. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 3589–3605. [Google Scholar] [CrossRef]
- You, Z.; Cui, L.; Shen, Y.; Yang, K.; Lu, X.; Zheng, Y.; Le, X. A unified model for multi-class anomaly detection. Adv. Neural Inf. Process. Syst. 2022, 35, 4571–4584. [Google Scholar]
- He, H.; Zhang, J.; Chen, H.; Chen, X.; Li, Z.; Chen, X.; Wang, Y.; Wang, C.; Xie, L. A diffusion-based framework for multi-class anomaly detection. Proc. AAAI Conf. Artif. Intell. 2024, 38, 8472–8480. [Google Scholar] [CrossRef]
- Zhang, J.; Wang, C.; Li, X.; Tian, G.; Xue, Z.; Liu, Y.; Pang, G.; Tao, D. Learning Feature Inversion for Multi-class Anomaly Detection under General-purpose COCO-AD Benchmark. arXiv 2024, arXiv:2404.10760. [Google Scholar]
- Zhang, J.; Chen, X.; Wang, Y.; Wang, C.; Liu, Y.; Li, X.; Yang, M.H.; Tao, D. Exploring plain vit reconstruction for multi-class unsupervised anomaly detection. arXiv 2023, arXiv:2312.07495. [Google Scholar]
- He, H.; Bai, Y.; Zhang, J.; He, Q.; Chen, H.; Gan, Z.; Wang, C.; Li, X.; Tian, G.; Xie, L. Mambaad: Exploring state space models for multi-class unsupervised anomaly detection. arXiv 2024, arXiv:2404.06564. [Google Scholar]
- Lu, R.; Wu, Y.; Tian, L.; Wang, D.; Chen, B.; Liu, X.; Hu, R. Hierarchical vector quantized transformer for multi-class unsupervised anomaly detection. Adv. Neural Inf. Process. Syst. 2023, 36, 8487–8500. [Google Scholar]
- Pang, G.; Shen, C.; Cao, L.; Hengel, A.V.D. Deep learning for anomaly detection: A review. ACM Comput. Surv. (CSUR) 2021, 54, 1–38. [Google Scholar] [CrossRef]
- Chen, Y.; Tian, Y.; Pang, G.; Carneiro, G. Deep one-class classification via interpolated gaussian descriptor. Proc. AAAI Conf. Artif. Intell. 2022, 36, 383–392. [Google Scholar] [CrossRef]
- Reiss, T.; Hoshen, Y. Mean-shifted contrastive loss for anomaly detection. Proc. AAAI Conf. Artif. Intell. 2023, 37, 2155–2162. [Google Scholar] [CrossRef]
- Ruff, L.; Vandermeulen, R.; Goernitz, N.; Deecke, L.; Siddiqui, S.A.; Binder, A.; Müller, E.; Kloft, M. Deep one-class classification. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; PMLR, 2018; pp. 4393–4402. [Google Scholar]
- Reiss, T.; Cohen, N.; Bergman, L.; Hoshen, Y. Panda: Adapting pretrained features for anomaly detection and segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 2806–2814. [Google Scholar]
- Schölkopf, B.; Williamson, R.C.; Smola, A.; Shawe-Taylor, J.; Platt, J. Support vector method for novelty detection. Adv. Neural Inf. Process. Syst. 1999, 12, 582–588. [Google Scholar]
- Tax, D.M.; Duin, R.P. Support vector data description. Mach. Learn. 2004, 54, 45–66. [Google Scholar] [CrossRef]
- Roth, K.; Pemula, L.; Zepeda, J.; Schölkopf, B.; Brox, T.; Gehler, P. Towards total recall in industrial anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 14318–14328. [Google Scholar]
- Lee, S.; Lee, S.; Song, B.C. Cfa: Coupled-hypersphere-based feature adaptation for target-oriented anomaly localization. IEEE Access 2022, 10, 78446–78454. [Google Scholar] [CrossRef]
- Defard, T.; Setkov, A.; Loesch, A.; Audigier, R. Padim: A patch distribution modeling framework for anomaly detection and localization. In Proceedings of the Pattern Recognition. ICPR International Workshops and Challenges: Virtual Event, 10–15 January 2021; Proceedings, Part IV. Springer: Berlin/Heidelberg, Germany, 2021; pp. 475–489. [Google Scholar]
- Cohen, N.; Hoshen, Y. Sub-image anomaly detection with deep pyramid correspondences. arXiv 2020, arXiv:2005.02357. [Google Scholar]
- Xie, G.; Wang, J.; Liu, J.; Jin, Y.; Zheng, F. Pushing the Limits of Fewshot Anomaly Detection in Industry Vision: Graphcore. In Proceedings of The Eleventh International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
- Gudovskiy, D.; Ishizaka, S.; Kozuka, K. Cflow-ad: Real-time unsupervised anomaly detection with localization via conditional normalizing flows. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2022; pp. 98–107. [Google Scholar]
- Yu, J.; Zheng, Y.; Wang, X.; Li, W.; Wu, Y.; Zhao, R.; Wu, L. Fastflow: Unsupervised anomaly detection and localization via 2d normalizing flows. arXiv 2021, arXiv:2111.07677. [Google Scholar] [CrossRef]
- Zolfaghari, M.; Sajedi, H. Unsupervised Anomaly Detection with an Enhanced Teacher for Student-Teacher Feature Pyramid Matching. In Proceedings of the 2022 27th International Computer Conference, Computer Society of Iran (CSICC), Tehran, Iran, 23–24 February 2022; IEEE: New York, NY, USA, 2022; pp. 1–4. [Google Scholar]
- Deng, H.; Li, X. Anomaly detection via reverse distillation from one-class embedding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 9737–9746. [Google Scholar]
- Salehi, M.; Sadjadi, N.; Baselizadeh, S.; Rohban, M.H.; Rabiee, H.R. Multiresolution Knowledge Distillation for Anomaly Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 14902–14912. [Google Scholar]
- Zavrtanik, V.; Kristan, M.; Skočaj, D. Draem-a discriminatively trained reconstruction embedding for surface anomaly detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 8330–8339. [Google Scholar]
- Zavrtanik, V.; Kristan, M.; Skočaj, D. Dsr–a dual subspace re-projection network for surface anomaly detection. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 539–554. [Google Scholar]
- Liu, Z.; Nie, Y.; Long, C.; Zhang, Q.; Li, G. A hybrid video anomaly detection framework via memory-augmented flow reconstruction and flow-guided frame prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 13588–13597. [Google Scholar]
- Liu, S.; Zhou, B.; Ding, Q.; Hooi, B.; Zhang, Z.; Shen, H.; Cheng, X. Time Series Anomaly Detection With Adversarial Reconstruction Networks. IEEE Trans. Knowl. Data Eng. 2023, 35, 4293–4306. [Google Scholar] [CrossRef]
- Zhang, X.; Xu, M.; Zhou, X. RealNet: A feature selection network with realistic synthetic anomaly for anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 16699–16708. [Google Scholar]
- Liu, Z.; Zhou, Y.; Xu, Y.; Wang, Z. Simplenet: A simple network for image anomaly detection and localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 20402–20411. [Google Scholar]
- Lei, J.; Hu, X.; Wang, Y.; Liu, D. Pyramidflow: High-resolution defect contrastive localization using pyramid normalizing flow. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 14143–14152. [Google Scholar]
- Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv 2023, arXiv:2312.00752. [Google Scholar] [CrossRef]
- Jacobs, R.A.; Jordan, M.I.; Nowlan, S.J.; Hinton, G.E. Adaptive mixtures of local experts. Neural Comput. 1991, 3, 79–87. [Google Scholar] [CrossRef] [PubMed]
- Shazeer, N.; Mirhoseini, A.; Maziarz, K.; Davis, A.; Le, Q.; Hinton, G.; Dean, J. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv 2017, arXiv:1701.06538. [Google Scholar]
- Pióro, M.; Ciebiera, K.; Król, K.; Ludziejewski, J.; Krutul, M.; Krajewski, J.; Antoniak, S.; Miłoś, P.; Cygan, M.; Jaszczur, S. Moe-mamba: Efficient selective state space models with mixture of experts. arXiv 2024, arXiv:2401.04081. [Google Scholar]
- Ha, D.; Dai, A.M.; Le, Q.V. HyperNetworks. In Proceedings of the 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, 24–26 April 2017. [Google Scholar]
- Zhang, W.; Lin, T.; Liu, J.; Shu, F.; Li, H.; Zhang, L.; He, W.; Zhou, H.; Lv, Z.; Jiang, H.; et al. HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models. arXiv 2024, arXiv:2403.13447. [Google Scholar] [CrossRef]
- Liu, H.; Li, C.; Wu, Q.; Lee, Y.J. Visual instruction tuning. Adv. Neural Inf. Process. Syst. 2024, 36, 34892–34916. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
- Gu, A.; Goel, K.; Ré, C. Efficiently Modeling Long Sequences with Structured State Spaces. In Proceedings of the The International Conference on Learning Representations (ICLR), Virtual, 25–29 April 2022. [Google Scholar]
- Ramachandran, P.; Zoph, B.; Le, Q.V. Searching for Activation Functions. arXiv 2017, arXiv:1710.05941. [Google Scholar] [CrossRef]
- Houlsby, N.; Giurgiu, A.; Jastrzebski, S.; Morrone, B.; De Laroussilhe, Q.; Gesmundo, A.; Attariyan, M.; Gelly, S. Parameter-efficient transfer learning for NLP. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; PMLR: Baltimore, MD, USA, 2019; pp. 2790–2799. [Google Scholar]
- Bergmann, P.; Fauser, M.; Sattlegger, D.; Steger, C. MVTec AD–A comprehensive real-world dataset for unsupervised anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9592–9600. [Google Scholar]
- Zou, Y.; Jeong, J.; Pemula, L.; Zhang, D.; Dabeer, O. SPot-the-Difference Self-supervised Pre-training for Anomaly Detection and Segmentation. In Proceedings of the Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, 23–27 October 2022; Proceedings, Part XXX. Springer: Berlin/Heidelberg, Germany, 2022; pp. 392–408. [Google Scholar]
- Tien, T.D.; Nguyen, A.T.; Tran, N.H.; Huy, T.D.; Duong, S.; Nguyen, C.D.T.; Truong, S.Q. Revisiting reverse distillation for anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 24511–24520. [Google Scholar]
- Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; IEEE: New York, NY, USA, 2009; pp. 248–255. [Google Scholar]
- Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
- Zagoruyko, S.; Komodakis, N. Wide Residual Networks. In Proceedings of the British Machine Vision Conference, York, UK, 19–22 September 2016; British Machine Vision Association: Sheffield, UK, 2016. [Google Scholar]
| Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).