Our method consists of two components: the patch-feature stream, which stores patch features in a patch-feature memory bank, and the segmentation-map stream, which stores segmentation maps in a segmentation-map memory bank. As illustrated in Figure 2, during the training phase, normal images are used to generate patch features and segmentation maps, which are then stored in their respective memory banks. Specifically, (1) the patch-feature memory bank stores patch-level features that capture fine-grained details of the image, and (2) the segmentation-map memory bank assists in determining the validity of various component combinations to identify logical anomalies. During the test phase, the differences between the patch features and segmentation maps of the test image and those of the training images are computed separately. Finally, the distances are combined to derive the anomaly score. The pseudo-code of the training and test procedures is shown in Algorithm 1.
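For orientation, the sketch below outlines the test-time flow through the two streams. It is an illustrative reading of the pipeline, not the exact Algorithm 1; all helper names (`extract_patches`, `patch_anomaly_score`, `build_segmentation_map`, `seg_stream_score`, `aggregate`) are hypothetical stand-ins for the modules detailed in the rest of this section:

```python
def test_image_score(image, model, patch_bank, seg_bank, stats):
    """Two-stream anomaly scoring (illustrative sketch of the test phase)."""
    patches = extract_patches(model, image)           # patch-feature stream
    s_p = patch_anomaly_score(patches, patch_bank)    # distance to stored normal patches
    seg_map = build_segmentation_map(model, image)    # segmentation-map stream
    s_seg = seg_stream_score(seg_map, seg_bank)       # component-relationship distance
    return aggregate(s_p, s_seg, stats)               # normalized sum (Section 3.3)
```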
3.2. Segmentation-Map Stream
To effectively detect logical anomalies in images, it is critical to explicitly model the intrinsic constraints among components, such as spatial layouts and topological dependencies. This is achieved by first segmenting the image into semantically meaningful component maps and then systematically analyzing pairwise relationships between components through statistical metrics. By quantifying deviations from normal constraints, the framework can pinpoint violations of logical consistency, such as misassembled parts or missing components, thereby enabling precise logical anomaly detection.
Feature Distillation Guidance. Our objective is to construct a unified framework for anomaly detection. The architecture of ResNet [30], comprising stacked convolutional layers, is well suited to extracting fine-grained features. However, it is less effective at generating segmentation maps directly, as evidenced by our ablation study on different teacher models. Our work is inspired by the knowledge distillation approach [38], in which a teacher model guides the student model to learn features enriched with global context, a prerequisite for constructing discriminative segmentation maps that detect logical anomalies.
Specifically, we employ a self-supervised Vision Transformer, DINO [39], as the teacher model, which excels at capturing holistic semantic relationships through its attention mechanism. Since the outputs of these two models differ, the student model needs to align with the output of the teacher model. Inspired by RD [13], we design a multi-scale feature fusion module to align their feature representations. The specific steps are as follows: for an image $x$ from the training set $\mathcal{X}_{\mathrm{train}}$, the multi-layer feature outputs $f_l^S$ ($l = 1, 2, \ldots, m$) of the student model are extracted by the feature extractor. We employ a transformation function $\phi_l$ to synchronize these features with the corresponding features of the teacher:

$$\hat{f}_l^S = \phi_l\big(f_l^S\big), \qquad \phi_l : \mathbb{R}^{d_l^S} \rightarrow \mathbb{R}^{d_l^T},$$

where $d_l^S$ and $d_l^T$ denote the feature dimensions of layer $l$ in the student model and the teacher model, respectively. Finally, we obtain $m$ dimensionally consistent features, which are concatenated to form a new feature $f^S$ as follows:

$$f^S = \mathrm{concat}\big(\hat{f}_1^S, \hat{f}_2^S, \ldots, \hat{f}_m^S\big).$$
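A minimal PyTorch sketch of such a fusion module is given below. The use of 1 × 1 convolutions for $\phi_l$ and bilinear interpolation for spatial alignment are our assumptions, since the text only specifies that the transformation synchronizes student and teacher feature dimensions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleFusion(nn.Module):
    """Project each student layer to the teacher dimension (the role of phi_l),
    align spatial sizes, and concatenate into the fused feature f^S."""
    def __init__(self, student_dims, teacher_dim, out_hw):
        super().__init__()
        self.out_hw = out_hw  # common spatial size (H, W)
        self.proj = nn.ModuleList(
            nn.Conv2d(d, teacher_dim, kernel_size=1) for d in student_dims)

    def forward(self, feats):  # feats: list of (B, C_l, H_l, W_l) tensors
        aligned = [
            F.interpolate(proj(f), size=self.out_hw,
                          mode="bilinear", align_corners=False)
            for proj, f in zip(self.proj, feats)
        ]
        return torch.cat(aligned, dim=1)  # fused student feature f^S
```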
After aligning the fused feature $f^S$ with the feature $f^T$ extracted by the teacher network through a two-dimensional convolution, we define the distillation loss $\mathcal{L}_{dis}$ between them. This alignment is achieved using a mean squared error (MSE) loss during training, expressed as:

$$\mathcal{L}_{dis} = \frac{1}{N} \sum_{i=1}^{N} \big\| f_i^S - f_i^T \big\|_2^2,$$

where $N$ denotes the total number of training samples, while $f_i^S$ and $f_i^T$ represent the features $f^S$ and $f^T$ of the $i$th sample, respectively.
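Continuing the sketch, the loss reduces to a single `mse_loss` call; we assume the aligning 2D convolution is a learnable layer trained jointly with the fusion module:

```python
import torch.nn.functional as F

def distillation_loss(fused_student, teacher_feat, align_conv):
    """MSE distillation loss L_dis between the conv-aligned fused student
    feature and the (frozen) teacher feature."""
    f_s = align_conv(fused_student)        # 2D conv: match teacher channels
    return F.mse_loss(f_s, teacher_feat)   # averaged over the batch
```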
Segmentation Module. After the feature distillation process, the features extracted by ResNet can be utilized to construct segmentation maps. As illustrated in the segmentation module in Figure 3, K-means clustering is applied to $f^S$ to obtain $K$ clusters. The cosine similarity between each cluster center and the original feature $f^S$ is then calculated to generate an initial segmentation map. Subsequently, the segmentation map is resized to match the original image dimensions through interpolation. Finally, a fully connected Gaussian Conditional Random Field (CRF) [40] is applied as a post-processing step to refine the segmentation results. To remove noise components, we further filter each refined map using an 11 × 11 mean filter, discarding regions with maximum values below 0.5 (normalized to [0, 1]) [14], thereby enhancing foreground–background separation and yielding the final segmentation map.
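The following sketch reproduces the clustering, similarity, and upsampling steps with scikit-learn and PyTorch; the default of four clusters is our assumption, and the CRF refinement [40] and 11 × 11 mean filtering [14] are left as comments since their exact configurations are implementation details we do not reproduce here:

```python
import numpy as np
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

def build_segmentation_map(feat, num_clusters=4, image_size=(256, 256)):
    """K-means over pixel features, cosine similarity to cluster centers,
    and upsampling to image resolution. feat: (C, H, W) numpy array."""
    C, H, W = feat.shape
    pixels = feat.reshape(C, -1).T                    # (H*W, C) pixel features
    centers = KMeans(n_clusters=num_clusters,
                     n_init=10).fit(pixels).cluster_centers_
    # cosine similarity of every pixel feature to every cluster center
    pn = pixels / (np.linalg.norm(pixels, axis=1, keepdims=True) + 1e-8)
    cn = centers / (np.linalg.norm(centers, axis=1, keepdims=True) + 1e-8)
    seg = (pn @ cn.T).T.reshape(num_clusters, H, W)   # initial segmentation map
    seg = F.interpolate(torch.from_numpy(seg).float()[None],
                        size=image_size, mode="bilinear",
                        align_corners=False)[0].numpy()
    # a fully connected Gaussian CRF [40] would refine seg here, then an 11x11
    # mean filter discards components whose maximum response is below 0.5 [14]
    return seg
```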
3.3. Anomaly Score Computation
The proposed method stores the patch features of normal images and their corresponding segmentation maps in separate memory banks. Patch feature-based methods have been shown to effectively detect structural anomalies by focusing on local features. In contrast, logical anomalies require modeling and analyzing the relationships between components in the segmentation maps. The core of our anomaly scoring mechanism relies on comparing the test sample against the representations of normal samples stored in the two memory banks to quantify deviations from normality. During testing, for each stream, we compute the discrepancy between the test sample and its nearest neighbors within the corresponding memory bank. This discrepancy serves as the foundational anomaly score for that stream. To remain robust to outliers, we use KNN-based anomaly scores, which naturally smooth out the influence of minor outliers. The following describes how anomaly scores are calculated from the two memory banks.
Patch-Feature Memory Bank. This memory bank $\mathcal{M}_p$ is constructed by storing patch features to detect structural defects, following established approaches [23,29]: for each training sample $x_i$, its patch features $p_i$ are extracted and used to build the memory bank $\mathcal{M}_p$:

$$\mathcal{M}_p = \bigcup_{i=1}^{N} \{\, p_i \,\},$$

where $p_i$ represents the patch features of the $i$th training sample. The anomaly score $s_p$ for the test sample $x^{test}$ is predicted as:

$$s_p = \max_{p \in p^{test}} \min_{m \in \mathcal{M}_p} \big\| p - m \big\|_2,$$

where $p^{test}$ denotes the patch features of the test sample.
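A compact sketch of this scoring rule, assuming the memory bank is held as a tensor of stored patch features and following the max–min formulation above:

```python
import torch

def patch_anomaly_score(test_patches, memory_bank):
    """s_p = max over test patches of the L2 distance to the nearest
    stored normal patch. test_patches: (P, C); memory_bank: (M, C)."""
    dists = torch.cdist(test_patches, memory_bank)  # (P, M) pairwise L2
    nearest = dists.min(dim=1).values               # per-patch NN distance
    return nearest.max().item()                     # image-level score s_p
```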
Segmentation-Map Memory Bank. This memory bank $\mathcal{M}_s$ is built by storing segmentation maps containing four components to detect logical errors; the same feature computation is applied during the test phase. Specifically, the method includes the following steps. (1) Area Feature: the area feature is computed by summing the total number of pixels within each segmented region. (2) Color Feature: the image is converted from RGB space to CIELAB space, which consists of three components: $L$ (luminance), $a$ (green–red), and $b$ (blue–yellow). For each pixel, luminance is ignored and the ratio $a/b$ is calculated; the average value over the entire region is then taken as the color feature. (3) Quantity Feature: it is derived by grouping regions using DBSCAN [41] and calculating their density. Combining these three features, the anomaly score $s_{seg}$ is obtained by calculating the average $\ell_1$ distance between the test image $x^{test}$ and its five nearest neighbors $\{x_1, \ldots, x_5\}$ in $\mathcal{M}_s$, defined as:

$$s_{seg} = \frac{1}{5} \sum_{j=1}^{5} \big\| h(x^{test}) - h(x_j) \big\|_1,$$

where $h(\cdot)$ denotes the concatenated area, color, and quantity features of an image.
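The three statistics can be sketched as follows; the DBSCAN radius `eps`, the use of scikit-image for the CIELAB conversion, and the exact reading of the "density" grouping are our assumptions:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from skimage.color import rgb2lab

def region_features(image_rgb, mask, eps=5.0):
    """Area, color (mean a/b in CIELAB, luminance ignored), and quantity
    (number of DBSCAN groups) for one segmented component mask."""
    area = float(mask.sum())                           # (1) pixel count
    lab = rgb2lab(image_rgb)                           # (2) RGB -> CIELAB
    a, b = lab[..., 1][mask], lab[..., 2][mask]
    color = float(np.mean(a / (b + 1e-8)))             # mean a/b ratio
    coords = np.stack(np.nonzero(mask), axis=1)        # (3) group region pixels
    labels = DBSCAN(eps=eps).fit(coords).labels_
    quantity = float(len(set(labels)) - (1 if -1 in labels else 0))
    return np.array([area, color, quantity])

def seg_anomaly_score(test_feat, train_feats, k=5):
    """Average L1 distance to the k nearest normal images."""
    d = np.abs(train_feats - test_feat).sum(axis=1)    # L1 to each normal image
    return float(np.sort(d)[:k].mean())
```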
However, we find that the computation of distances between normal and test images based on the aforementioned features is relatively independent, leading to insufficient anomaly scoring and limited generalization across complex scenarios. To address this problem, and based on the observation that anomalies often exhibit specific spatial–morphological patterns, we additionally design a new distance calculation method that incorporates spatial–morphological features, namely the centroid $\mu$, the area ratio $r$ (the ratio of the number of pixels in the segmented region to the total number of pixels in the image), and the variance $v$, to model inter-component relationships from a shape-based perspective. The centroid $\mu$ captures positional consistency, $r$ quantifies size conformity relative to the entire image, and $v$ measures shape regularity or distribution dispersion. The computing steps are as follows:
Step 1: Relationship Calculation: For each segmented component $c$ in an image $x$ in $\mathcal{X}_{\mathrm{train}}$, we extract three key spatial–morphological features: centroid $\mu_c$, area ratio $r_c$, and variance $v_c$. For every unique pair of components $(c_i, c_j)$ where $i < j$, we compute the absolute difference for each feature type between the two components:

$$\Delta \mu_{ij} = \big| \mu_{c_i} - \mu_{c_j} \big|, \qquad \Delta r_{ij} = \big| r_{c_i} - r_{c_j} \big|, \qquad \Delta v_{ij} = \big| v_{c_i} - v_{c_j} \big|.$$

Each pair $(c_i, c_j)$ thus contributes a 4-dimensional difference vector (the centroid difference has two coordinates):

$$d_{ij} = \big[ \Delta \mu_{ij}^{x},\; \Delta \mu_{ij}^{y},\; \Delta r_{ij},\; \Delta v_{ij} \big].$$

The complete inter-component relationship representation for each image is the set of all pairwise difference vectors:

$$R(x) = \big\{\, d_{ij} \mid 1 \le i < j \le K \,\big\},$$

where $K$ denotes the number of clusters.
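A sketch of Step 1 follows, computing per-component statistics and the pairwise difference vectors; taking the variance over the component's pixel coordinates is one plausible reading of $v$:

```python
import numpy as np
from itertools import combinations

def component_stats(mask):
    """Centroid, area ratio, and coordinate variance of one component mask."""
    ys, xs = np.nonzero(mask)
    centroid = np.array([xs.mean(), ys.mean()])        # mu_c (2-D)
    area_ratio = float(mask.sum()) / mask.size         # r_c
    variance = float(np.concatenate([xs, ys]).var())   # v_c (dispersion)
    return centroid, area_ratio, variance

def relationship_set(masks):
    """R(x): the 4-D absolute-difference vector d_ij for every component pair."""
    stats = [component_stats(m) for m in masks]
    return {
        (i, j): np.concatenate([
            np.abs(stats[i][0] - stats[j][0]),         # centroid diff (2-D)
            [abs(stats[i][1] - stats[j][1])],          # area-ratio diff
            [abs(stats[i][2] - stats[j][2])],          # variance diff
        ])
        for i, j in combinations(range(len(masks)), 2)
    }
```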
Step 2: Threshold Determination: Using the training set of normal images $\mathcal{X}_{\mathrm{train}}$, we learn the acceptable range for each type of pairwise difference feature. For each difference feature type $k$ and each possible component pair $(i, j)$, we collect all values $d_{ij}^{k}(x)$ across all training images $x \in \mathcal{X}_{\mathrm{train}}$, and the maximum and minimum values of $d_{ij}^{k}$ are obtained. The deviation threshold $T_{ij}^{k}$, which defines the normal range of deviations, is the width of that range:

$$T_{ij}^{k} = \max_{x \in \mathcal{X}_{\mathrm{train}}} d_{ij}^{k}(x) - \min_{x \in \mathcal{X}_{\mathrm{train}}} d_{ij}^{k}(x).$$
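Step 2 then reduces to a max-minus-min over the training images for each pair and feature type, as sketched below (assuming the `relationship_set` output from the previous sketch):

```python
import numpy as np

def learn_thresholds(train_relationships):
    """T_ij^k = max - min of difference feature k over all training images.
    train_relationships: list of dicts mapping (i, j) -> 4-D vector."""
    thresholds = {}
    for pair in train_relationships[0]:
        vals = np.stack([rel[pair] for rel in train_relationships])  # (N, 4)
        thresholds[pair] = vals.max(axis=0) - vals.min(axis=0)       # range width
    return thresholds
```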
Step 3: Distance Calculation: As shown in Figure 4, for a test image $x^{test}$, its inter-component relationship representation $R(x^{test})$ is extracted. The nearest neighbors $\{x_1, \ldots, x_n\}$ of the test image are retrieved from $\mathcal{M}_s$, and their relationships $R(x_j)$ are identified as the reference relationships. For each feature type $k$ within each pairwise difference vector $d_{ij}$, the deviation between the reference relationships and the relationships of the test image is computed as:

$$\delta_{ij}^{k} = \big| d_{ij}^{k}(x^{test}) - d_{ij}^{k}(x_j) \big|.$$

If the difference $\delta_{ij}^{k}$ exceeds the threshold $T_{ij}^{k}$, it is considered an anomaly and its $\ell_1$-distance is added as part of the anomaly score $s_{sm}$ for the segmentation map. The process is defined as:

$$s_{sm} = \sum_{i < j} \sum_{k} \mathbb{1}\big[ \delta_{ij}^{k} > T_{ij}^{k} \big] \cdot \delta_{ij}^{k}.$$
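Step 3 can then be sketched as a thresholded accumulation of deviations against one retrieved reference relationship; extending to $n$ neighbors by averaging the per-neighbor scores is straightforward:

```python
import numpy as np

def spatial_morph_score(test_rel, ref_rel, thresholds):
    """s_sm: sum of deviations delta_ij^k that exceed their thresholds T_ij^k."""
    score = 0.0
    for pair, T in thresholds.items():
        delta = np.abs(test_rel[pair] - ref_rel[pair])  # per-feature deviation
        score += float(delta[delta > T].sum())          # only violations count
    return score
```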
Aggregation of Anomaly Scores. The anomaly scores derived from the two memory banks originate from different feature spaces and calculation methods, leading to inherent differences in their numerical scales and distributions. Directly summing these uncalibrated scores would cause the score with the larger inherent scale to dominate the aggregated result, potentially masking the signal from the other anomaly type. This imbalance would compromise the accuracy of the final anomaly score. To ensure that both scores contribute proportionally to the final result based on their respective deviation magnitudes, we first normalize both scores, and the final aggregated anomaly score
$s$ is obtained after normalization, i.e.:

$$s = \mathcal{N}(s_p) + \mathcal{N}(s_{seg} + s_{sm}),$$

where $\mathcal{N}(\cdot)$ denotes the normalizing operation. Specifically, it is derived from $\mathcal{N}(S) = (S - \mu)/\sigma$, where $S$ represents the original anomaly score, while $\mu$ and $\sigma$ denote the mean and standard deviation, respectively.
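A minimal sketch of the aggregation, assuming each stream's mean and standard deviation are estimated from scores on normal (training or validation) data:

```python
def aggregate(s_p, s_seg, stats):
    """Final score s: z-score-normalize each stream's score, then sum.
    stats maps a stream name to its (mean, std) estimated on normal data."""
    def normalize(s, key):
        mu, sigma = stats[key]
        return (s - mu) / sigma
    return normalize(s_p, "patch") + normalize(s_seg, "segmentation")
```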