Article

Unsupervised Anomaly Detection via Normal Feature-Enhanced Reverse Teacher–Student Distillation

by Xiaodong Wang, Jiangtao Fan, Fei Yan, Hongmin Hu, Zhiqiang Zeng, Pengtao Wu, Haiyan Huang and Hangqi Zhang
1 College of Computer and Information Engineering, Xiamen University of Technology, Xiamen 361024, China
2 School of Computer Engineering, Guangzhou City University of Technology, Guangzhou 510812, China
3 Xiamen Yaxon Zhilian Technology, Ltd., Xiamen 361013, China
* Authors to whom correspondence should be addressed.
Electronics 2024, 13(20), 4125; https://doi.org/10.3390/electronics13204125
Submission received: 3 October 2024 / Revised: 15 October 2024 / Accepted: 17 October 2024 / Published: 20 October 2024

Abstract

In modern industrial production, unsupervised anomaly detection methods have gained significant attention because they address the scarcity of labeled anomaly samples. Among them, unsupervised anomaly detection methods based on reverse distillation (RD) have become a mainstream choice and have attracted extensive research owing to their excellent anomaly detection performance. However, the RD model suffers from a problem of “feature leakage”, which may lead to non-anomalous regions being incorrectly identified as defects. To solve this problem, we propose a Normal Feature-Enhanced Reverse teacher–student Distillation (NFERD) method. Specifically, we designed a normal feature bank (NFB) module and incorporated it into the basic RD network. This module stores normal features extracted by the teacher model and assists the student model in learning normal features more efficiently, thereby addressing the problem of “feature leakage”. In addition, to effectively fuse the feature maps extracted by the student model with the feature maps in the NFBs, we designed a Hybrid Attention Fusion Module (HAFM), which preserves key information during feature fusion by processing spatial and channel attention mechanisms in parallel. Experiments on two publicly available datasets, i.e., MVTec and KSDD, show that our method outperforms existing mainstream methods in both image-level and pixel-level anomaly detection. Specifically, we achieved an average I-AUROC score of 99.32% on MVTec and a P-AUROC of 98.75% on the KSDD, with clearer segmentation results, especially in complex scenarios. Furthermore, our method surpassed the second-best method by 1.3% in PRO on MVTec, demonstrating its effectiveness.

1. Introduction

In modern manufacturing, product quality directly affects brand reputation and market competitiveness. Efficient and accurate defect detection not only improves the product qualification rate but also effectively reduces the recall and rework costs caused by quality issues. Traditional manual defect detection relies on visual inspection, which is ineffective for small or hidden defects and is easily affected by operator fatigue, making it difficult to meet the modern industrial requirements for high precision and speed. In contrast, recent deep learning methods [1,2] have demonstrated significant advantages in the field of anomaly detection. Deep learning models, especially convolutional neural networks (CNNs), can automatically learn features from large amounts of image data without manually designed, complex feature extraction algorithms. This enables them to identify various types of defects more accurately, including subtle cracks, scratches, and other surface imperfections, and to perform detection tasks quickly and continuously, greatly improving detection efficiency.
In industrial production, abnormal samples are difficult to obtain because the defect rate is low [3]. Therefore, unsupervised defect detection methods, which rely mainly on normal samples for training, have become particularly important. These methods include reconstruction-based methods, memory-based methods, and knowledge-distillation-based methods (KDMs). The KDM has significant advantages: a smaller student model learns the features of normal samples from a complex teacher model, which enables effective anomaly detection without a large number of defect samples, while also improving the detection speed and reducing the demand for computing resources. The teacher model is usually a large pre-trained model that can learn rich feature representations. The goal of the KDM is to train a student model that mimics the teacher model’s feature representation of normal samples. In the inference stage, when an image containing anomalies is input, the teacher model extracts a complete feature representation, including both normal and abnormal features. Because the student model learns only the feature representations of normal samples during training, it can reproduce only normal feature representations at inference time. The KDM exploits this inconsistency to locate anomalous regions. However, in the traditional KDM, the network structures of the teacher model and the student model are overly similar, so the student model may inadvertently learn undesirable abnormal features from the teacher model, which can prevent anomalous regions from being segmented correctly. To address this issue, reverse distillation (RD) [4] inverts the structure of the student network, helping the student model focus on learning the feature representations of normal samples.
It is worth noting that from the third column of Figure 1, it can be seen that the RD method mistakenly identifies some non-anomaly areas as anomaly areas, which are actually caused by factors such as dust or light. However, these factors are not true anomalies but rather a part of normal features and are also present in large quantities in the training images (as shown in the first column of Figure 1). Nonetheless, the model still incorrectly identifies these normal areas as anomaly areas. We call this problem “feature leakage”. To solve this problem, we designed a Normal Feature-Enhanced Reverse teacher–student Distillation (NFERD) method that combines the normal feature bank mechanism with the reverse teacher–student distillation model. Specifically, we established three normal feature banks with the same structure. The teacher model stores the feature maps extracted at different stages into corresponding normal feature banks to help the student model integrate these normal features. In addition, another challenge we faced was how to effectively integrate the feature maps extracted by the student model with those in normal feature banks. Although adding feature maps directly or concatenating them by channel is a common feature fusion method, the direct addition may result in information loss, while simple channel concatenation may cause problems of dimensionality expansion and redundancy. Therefore, to address this challenge, we designed a Hybrid Attention Fusion Module (HAFM). It processes spatial attention and channel attention in parallel to ensure that key spatial details are preserved and inter-channel information is maintained during the fusion process. In summary, our main contributions are as follows:
  • We identified the issue of “feature leakage” in reverse distillation and proposed a framework, the Normal Feature-Enhanced Reverse teacher–student Distillation (NFERD) method, to address this problem.
  • We utilized the normal feature bank (NFB) mechanism to store the normal features extracted by the teacher model so that the student model can learn better from them.
  • We designed a Hybrid Attention Fusion Module (HAFM) that uses spatial and channel attention mechanisms in parallel to fuse the feature maps extracted by the student model with those retrieved from the normal feature bank, effectively reducing information loss and redundancy issues caused by simple fusion mechanisms.
  • We conducted extensive comparative experiments and ablation studies on two publicly available datasets to demonstrate the effectiveness of the proposed method.

2. Related Work

In this section, we briefly review relevant research in the field of unsupervised anomaly detection, which can be roughly classified into three categories: reconstruction-based methods, memory-based methods, and knowledge-distillation-based methods.

2.1. Reconstruction-Based Methods

Reconstruction-based anomaly detection methods mainly rely on an encoder–decoder structure. The core idea is to train the model on normal samples so that it learns the distribution of normal features, and to perform anomaly detection and localization by calculating the reconstruction error for abnormal samples. However, a traditional Autoencoder (AE) is prone to blurring during image reconstruction, resulting in poor pixel-level segmentation performance. Researchers have proposed a series of improvements to address this issue, such as adding contextual information [5] and using a Variational Autoencoder (VAE) [6]. In addition, to improve AE-based reconstruction, recent studies have proposed strategies such as introducing skip connections into the network structure [7], adding pseudo-anomaly data [8,9,10], adopting a pyramid design [5,11], and improving the loss functions [12]. For example, the Multi-Scale feature-clustering Fully Convolutional Autoencoder (MS-FCAE) [5] is used for image reconstruction in the presence of textured backgrounds. To enhance model robustness and avoid model degradation, some studies [13,14] also introduced attention mechanisms. Although reconstruction-based methods are scalable and easy to implement, they still suffer from high requirements on the representativeness of normal samples and the difficulty of selecting reconstruction-error thresholds.

2.2. Memory-Based Methods

Memory-based unsupervised anomaly detection methods rely on storing feature representations of normal samples and, during the inference stage, comparing the features of new samples with those in the normal feature bank to locate anomalous regions. The core of these methods lies in the design and maintenance of the normal feature bank to ensure an effective representation of the normal feature distribution. For instance, some studies [15,16] used pre-trained networks to extract image-level or patch-level features from normal data to construct memory banks and applied distance measurement algorithms to calculate anomaly scores. However, these methods often require significant computing resources and memory space. To address this challenge, PatchCore [17] uses a greedy algorithm to select a subset of the most representative normal features from the normal feature bank, thereby reducing memory usage and computational costs. Another study [18] used pre-trained networks to extract multi-scale features from normal samples and modeled each feature as a multivariate Gaussian distribution; in the inference stage, anomaly detection is performed by calculating the Mahalanobis distance between the feature vector of the sample under test and the normal distribution. Although memory-based methods are simple and effective, the demand for memory and computing resources remains high because of the need to store a large number of feature representations and the complex calculations involved in the detection process.
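To make the Gaussian-modeling idea concrete, the sketch below fits a single multivariate Gaussian to a set of normal feature vectors and scores a query by its Mahalanobis distance; it is a minimal, hypothetical illustration, and the per-patch statistics, feature extraction, and dimensionality reduction used by the actual methods are omitted.

```python
import torch

def fit_gaussian(normal_feats: torch.Tensor, eps: float = 1e-2):
    """Estimate the mean and inverse covariance of normal feature vectors.
    normal_feats: (N, D) tensor of features extracted from defect-free samples."""
    mean = normal_feats.mean(dim=0)                               # (D,)
    centered = normal_feats - mean
    cov = centered.T @ centered / (normal_feats.shape[0] - 1)     # sample covariance
    cov += eps * torch.eye(cov.shape[0])                          # regularize for invertibility
    return mean, torch.linalg.inv(cov)

def mahalanobis_score(query: torch.Tensor, mean: torch.Tensor, cov_inv: torch.Tensor):
    """Anomaly score: Mahalanobis distance of a query feature to the normal distribution."""
    diff = query - mean
    return torch.sqrt(diff @ cov_inv @ diff)

# Toy usage: 200 normal feature vectors of dimension 64, one query vector.
normal = torch.randn(200, 64)
mu, cov_inv = fit_gaussian(normal)
print(mahalanobis_score(torch.randn(64), mu, cov_inv).item())
```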

2.3. Knowledge-Distillation-Based Methods

Knowledge-distillation-based methods [19,20,21,22] locate anomalous regions by utilizing the output differences between different networks. These methods usually include a pre-trained teacher model and a student model that needs to be trained. The goal is to have the student model mimic the teacher model’s behavior when extracting normal features. In the inference stage, potential anomalous regions are located by comparing the output differences between the teacher model and the student model. However, a common issue with these methods is that the structural similarity between the teacher model and the student model often leads the student model to directly copy the output of the teacher model, thereby limiting the accuracy of anomaly localization. To solve this problem, RD improves anomaly detection by reversing the structure of the student model so that the teacher model and the student model do not share the same data flow. However, existing RD methods still face the problem of “feature leakage”.

3. Methodology

3.1. Network Structure

We designed an efficient network framework called Normal Feature-Enhanced Reverse teacher–student Distillation (NFERD) specifically for industrial anomaly detection. As shown in Figure 2, NFERD consists of three core components: a reverse knowledge distillation network, a normal feature bank (NFB), and a Hybrid Attention Fusion Module (HAFM). The reverse knowledge distillation network includes a teacher model T and a student model S, both of which use the Wide ResNet50 architecture. The teacher model was pre-trained on the ImageNet dataset, while the student model was trained from scratch. Our goal was to train the student model to mimic the output of the teacher model. As pointed out in the Introduction, the reverse knowledge distillation network suffers from “feature leakage”, which may result in some normal areas being incorrectly identified as anomalies. To address this issue, we introduce the normal feature bank, which aims to enhance the student model’s ability to represent normal features, thereby improving the overall anomaly detection accuracy.
Specifically, we designed three NFBs with the same structure, each of which stores normal feature maps extracted by the teacher model at a different stage of the network. The feature maps at each stage represent information at a different level of abstraction. We aimed to fuse the normal features from the NFB with those extracted by the student model. This allows the student model not only to learn normal feature representations imparted by the teacher but also to directly obtain normal features stored in the NFB, thereby strengthening its representation of normal features. However, how to fuse the feature maps extracted by the student model with those retrieved from the NFB is also a challenge. Therefore, we designed a Hybrid Attention Fusion Module (HAFM), which processes spatial attention and channel attention in parallel to ensure that key spatial details are preserved and inter-channel information is maintained during fusion. The specific implementation of NFERD is as follows: Before training, we first randomly selected N normal images from the training dataset and input them into the pre-trained teacher model. We then stored the normal feature maps extracted by the teacher model at different stages of the network in the corresponding NFBs. During training, the defect-free training images were input into the pre-trained teacher model, which extracted feature maps from three different stages. The OCBE module then fused these feature maps and input the fused feature map into the student model, whose structure is the reverse of the teacher’s. At the same time, the student model could access the NFBs. Afterward, the loss was computed between the feature maps extracted by the teacher model at different stages and those extracted by the student model at the corresponding stages. The following subsections provide a detailed introduction to each module.
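The data flow described above can be summarized by the following sketch, in which plain convolutions stand in for the Wide ResNet50 teacher, the OCBE bottleneck, and the reversed student decoder; the NFB recall and HAFM fusion (Sections 3.2 and 3.3) are indicated only by a comment, and all module names and channel sizes are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class NFERDSketch(nn.Module):
    """Schematic forward pass: teacher stages -> OCBE-like bottleneck -> reversed student.
    All layers are simplified stand-ins for the real architecture."""
    def __init__(self):
        super().__init__()
        self.t1 = nn.Conv2d(3,   64, 3, stride=2, padding=1)   # teacher stage 1 (frozen)
        self.t2 = nn.Conv2d(64, 128, 3, stride=2, padding=1)   # teacher stage 2 (frozen)
        self.t3 = nn.Conv2d(128, 256, 3, stride=2, padding=1)  # teacher stage 3 (frozen)
        self.ocbe = nn.Conv2d(256, 256, 1)                     # fuses features into a bottleneck
        self.s3 = nn.Conv2d(256, 256, 3, padding=1)            # student, deepest stage
        self.s2 = nn.ConvTranspose2d(256, 128, 2, stride=2)    # student, middle stage
        self.s1 = nn.ConvTranspose2d(128, 64, 2, stride=2)     # student, shallow stage

    def forward(self, x):
        with torch.no_grad():                                  # teacher weights are not updated
            f1 = self.t1(x); f2 = self.t2(f1); f3 = self.t3(f2)
        z = self.ocbe(f3)
        g3 = self.s3(z)                                        # student reconstructs the stages
        g2 = self.s2(g3)                                       # in reverse (decoder) order
        g1 = self.s1(g2)
        # The NFB recall and HAFM fusion would refine g1, g2, g3 here (Sections 3.2 and 3.3).
        return (f1, f2, f3), (g1, g2, g3)

model = NFERDSketch()
t_feats, s_feats = model(torch.randn(2, 3, 256, 256))
print([f.shape for f in t_feats])  # (B, 64, 128, 128), (B, 128, 64, 64), (B, 256, 32, 32)
print([g.shape for g in s_feats])  # student outputs with the same shapes, stage by stage
```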

3.2. Normal Feature Bank Module

We observed a phenomenon in the reverse knowledge distillation model, i.e., “feature leakage”, which may result in normal areas being incorrectly identified as anomaly areas. To address this issue, we introduced the NFB module, which stored some normal features extracted by the teacher model so that the student model could better learn the normal features. First, we randomly selected N normal samples from the training set, which were then input into the pre-trained teacher model. The teacher model processed these samples and generated feature maps within its four different stages. The feature maps from the first three stages were used for the subsequent NFB storage and network training, while the last stage was used to construct the one-class bottleneck embedding (OCBE) block, which was used to fuse the feature maps from the first three stages. To this end, we created three NFBs with the same structure, each dedicated to storing feature maps extracted by the teacher model at different stages. Before storing the feature maps in the NFB, the feature maps were first downsampled through bilinear interpolation to adjust their sizes, ensuring that each size matched the output of the student model in the corresponding stage. Then, we adjusted the number of channels in each feature map through a 1 × 1 convolution operation to match the number of channels extracted by the student model in the corresponding stage.
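As a rough sketch of this construction step (the tensor shapes and the 1 × 1 projection below are illustrative assumptions, not the exact layers used in the paper), the bilinear resizing and channel matching can be written as follows:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def build_nfb(teacher_feats, target_size, target_channels):
    """Store teacher feature maps as NFB entries after resizing and channel matching.
    teacher_feats: list of (C_t, H_t, W_t) tensors from N normal images at one stage."""
    # Hypothetical 1x1 projection standing in for the channel-matching convolution.
    proj = nn.Conv2d(teacher_feats[0].shape[0], target_channels, kernel_size=1)
    bank = []
    with torch.no_grad():
        for f in teacher_feats:
            f = F.interpolate(f.unsqueeze(0), size=target_size,
                              mode="bilinear", align_corners=False)   # match student H x W
            bank.append(proj(f).squeeze(0))                           # match student channels
    return torch.stack(bank)                                          # (N, C_s, H_s, W_s)

# Toy usage: N = 20 normal feature maps of shape (512, 64, 64) stored as (256, 32, 32) entries.
feats = [torch.randn(512, 64, 64) for _ in range(20)]
nfb = build_nfb(feats, target_size=(32, 32), target_channels=256)
print(nfb.shape)  # torch.Size([20, 256, 32, 32])
```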
In the student model stage, we hoped to fuse the feature maps extracted by the student model with those in the NFBs. For optimal utilization, we assessed the similarity between the student's feature maps and all NFB feature maps. Specifically, we flattened the feature maps in the NFBs and the feature maps extracted by the student model into one-dimensional vectors to calculate their similarity. The specific steps were as follows: First, we flattened the N feature maps stored in the NFBs into one-dimensional vectors and recorded them as $v_1, v_2, \ldots, v_N$. A similar flattening operation was also applied to the feature map extracted by the student model, recorded as $v_s$. Then, we calculated the cosine similarity between $v_s$ and each of $v_1, v_2, \ldots, v_N$ to obtain the similarity set $S$. In this way, we could quantify the similarity between the features extracted by the student model and the features stored in the NFBs. Note that $S_i$ represents the similarity between the feature map extracted by the student model and the i-th feature map in the NFB. The similarity $S_i$ and the similarity set $S$ can be calculated by the following formulas:
$$S_i(v_s, v_i) = \frac{v_s \cdot v_i}{\lVert v_s \rVert \, \lVert v_i \rVert},$$
$$S = \{\, S_i(v_s, v_i) \,\}, \quad i = 1, 2, \ldots, N.$$
Feature maps with a high similarity usually contain information similar to the feature map extracted by the student model, which means that these feature maps are more likely to contain information that is useful for the learning of the student model. In contrast, feature maps with a lower similarity may contain different or irrelevant information, and using this information may interfere with the learning process of the student model. If the similarity between the vector $v_s$ and the vectors $v_1, v_2, \ldots, v_N$ is low, excluding these vectors is advisable, as they may not contain relevant normal feature information. Therefore, we selected the K vectors with the highest similarity from $v_1, v_2, \ldots, v_N$. Specifically, we sorted the similarity set $S$ from high to low to obtain the sorted set $S_{\mathrm{sorted}}$. Then, we selected the top K entries of $S_{\mathrm{sorted}}$ to obtain the similarity subset $S_k$. The process of obtaining the top K most similar feature vectors is shown in Figure 3.
$$S_{\mathrm{sorted}} = \{ S_1(v_s, v_1), S_2(v_s, v_2), \ldots, S_N(v_s, v_N) \},$$
$$S_k = \{ S_1(v_s, v_1), S_2(v_s, v_2), \ldots, S_K(v_s, v_K) \}.$$
To enable the student model to obtain more valuable information from the NFB, thereby improving the learning effectiveness and model performance, we calculated the corresponding weight $w$ for the similarity set $S_k$, which represents how many relevant normal features need to be recalled from the corresponding feature maps. The main purpose of introducing the weight $w$ was to ensure that the student model could focus on extracting the most valuable information from the NFB, thereby optimizing the distillation process. By aggregating the weights and the corresponding feature vectors, the final weighted feature vector recalled from the NFB could be obtained. The weighted feature vector was then reshaped into a weighted feature map and returned to the student model. The weight $w_i$ and the final returned feature vector $v$ could be calculated using the following formulas:
$$w_i = \frac{S_i}{\sum_{j=1}^{K} S_j},$$
$$v = \sum_{i=1}^{K} w_i \cdot v_i.$$
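A compact sketch of this retrieval step, following the equations above (the tensor shapes and the value of K are illustrative choices, not the paper's exact settings):

```python
import torch
import torch.nn.functional as F

def recall_from_nfb(student_feat, bank, k=5):
    """Recall a weighted normal feature map from the NFB, following the equations above.
    student_feat: (C, H, W) student feature map; bank: (N, C, H, W) stored normal feature maps."""
    v_s = student_feat.flatten()                                  # flatten to a 1-D vector
    v_bank = bank.flatten(start_dim=1)                            # (N, C*H*W)
    sims = F.cosine_similarity(v_s.unsqueeze(0), v_bank, dim=1)   # S_i for i = 1..N
    top_sims, top_idx = sims.topk(k)                              # keep the K most similar entries
    weights = top_sims / top_sims.sum()                           # w_i = S_i / sum_j S_j
    v = (weights.unsqueeze(1) * v_bank[top_idx]).sum(dim=0)       # weighted aggregation
    return v.view_as(student_feat)                                # reshape back to a feature map

# Toy usage with the bank shape assumed in the previous sketch.
bank = torch.randn(20, 256, 32, 32)
recalled = recall_from_nfb(torch.randn(256, 32, 32), bank, k=5)
print(recalled.shape)  # torch.Size([256, 32, 32])
```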

3.3. Hybrid Attention Fusion Module

How to fully utilize these feature maps in NFBs is also an important challenge. Although adding feature maps directly or concatenating them by channel is a common feature fusion method, direct addition may result in information loss, while concatenating them by channel may cause problems of dimensionality expansion and redundant information. In addition, a Convolutional Block Attention Module (CBAM) [23] sequentially processes the Channel Attention Module (CAM) and Spatial Attention Module (SAM) through concatenation. This serial processing means that the input of the latter module depends on the output of the previous module. In deep network architectures, this serial processing order may cause delays or information loss during transmission. To overcome these limitations, we designed a Hybrid Attention Fusion Module (HAFM), which parallelizes the spatial attention and channel attention to ensure that key spatial details are preserved and inter-channel information is maintained during the fusion process. The designed HAFM is shown in Figure 4. Specifically, in the HAFM, the spatial attention mechanism is used to capture important regions in the feature map, emphasizing or suppressing feature responses at certain spatial positions by learning a weight matrix. Meanwhile, the channel attention mechanism focuses on the interdependence between different feature channels, assigning a weight to each channel to highlight the more critical features of the task. The parallel processing of these two attention mechanisms will result in a more robust and informative feature representation. Through this, the HAFM can effectively integrate feature information from different sources, providing more refined and targeted feature representations for subsequent task processing.
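Since the internal layer configuration of the HAFM is not fully specified here, the following is only a minimal sketch of a parallel channel/spatial attention fusion in that spirit: the CBAM-style channel MLP, the 7 × 7 spatial convolution, the concatenation of the two input maps, and the final 1 × 1 projection are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

class HAFMSketch(nn.Module):
    """A minimal parallel channel/spatial attention fusion in the spirit of the HAFM."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        c = 2 * channels                                   # student map + recalled NFB map
        self.channel_mlp = nn.Sequential(                  # channel attention branch
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(c, c // reduction, 1), nn.ReLU(),
            nn.Conv2d(c // reduction, c, 1), nn.Sigmoid())
        self.spatial_conv = nn.Sequential(                 # spatial attention branch
            nn.Conv2d(2, 1, kernel_size=7, padding=3), nn.Sigmoid())
        self.proj = nn.Conv2d(c, channels, kernel_size=1)  # back to the student's channel width

    def forward(self, student_feat, recalled_feat):
        x = torch.cat([student_feat, recalled_feat], dim=1)
        ca = self.channel_mlp(x)                                               # (B, 2C, 1, 1)
        sa = self.spatial_conv(torch.cat([x.mean(1, keepdim=True),
                                          x.max(1, keepdim=True).values], 1))  # (B, 1, H, W)
        fused = x * ca + x * sa                            # parallel branches, then combined
        return self.proj(fused)

# Toy usage with the feature shapes assumed earlier.
hafm = HAFMSketch(channels=256)
out = hafm(torch.randn(2, 256, 32, 32), torch.randn(2, 256, 32, 32))
print(out.shape)  # torch.Size([2, 256, 32, 32])
```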

3.4. Loss Function

Cosine similarity is an effective method for measuring the similarity between feature maps and is widely used in various applications. In this study, we used cosine similarity to measure the similarity between the output feature maps of the teacher model and the student model. Specifically, we use $F_T^i$ to denote the output feature map of the teacher model at the i-th stage of the network, and $F_S^i$ to denote the output feature map of the student model at the same stage. Our training objective was to make the output feature map of the student model as close as possible to that of the teacher model. Therefore, the two-dimensional anomaly score map $M_i$ at the i-th stage is calculated using the following formula:
$$M_i(x, y) = 1 - \frac{\left(F_T^i(x, y)\right)^{\mathrm{T}} \cdot F_S^i(x, y)}{\lVert F_T^i(x, y) \rVert \, \lVert F_S^i(x, y) \rVert},$$
where $(x, y)$ represents the spatial position in the feature map. When the value of $M_i(x, y)$ is large, the similarity between the feature maps $F_T^i$ and $F_S^i$ at position $(x, y)$ is small. Given that the teacher model and student model output feature maps at multiple stages, the distillation loss function $L_{KD}$ that guides the optimization of the student model can be defined as
$$L_{KD} = \sum_{i=1}^{I} \frac{1}{X_i Y_i} \sum_{x=1}^{X_i} \sum_{y=1}^{Y_i} M_i(x, y),$$
where $X_i$ and $Y_i$, respectively, represent the width and height of the feature map output by the network in the i-th stage, and $I$ represents the number of feature maps output by the network and was set to three in this paper.
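The anomaly maps $M_i$ and the loss $L_{KD}$ defined above translate directly into a few lines of PyTorch; the sketch below assumes lists of matching teacher and student feature maps such as those produced by the network.

```python
import torch
import torch.nn.functional as F

def anomaly_maps_and_loss(teacher_feats, student_feats):
    """Per-stage anomaly maps M_i and the distillation loss L_KD from the equations above.
    teacher_feats / student_feats: lists of (B, C_i, H_i, W_i) feature maps, stage by stage."""
    maps, loss = [], 0.0
    for ft, fs in zip(teacher_feats, student_feats):
        m = 1 - F.cosine_similarity(ft, fs, dim=1)   # M_i(x, y): cosine distance per position
        maps.append(m)                               # shape (B, H_i, W_i)
        loss = loss + m.mean()                       # average over the X_i * Y_i positions (and batch)
    return maps, loss

# Toy usage: three stages with the shapes used in the earlier sketches.
shapes = [(64, 128), (128, 64), (256, 32)]
tf = [torch.randn(2, c, s, s) for c, s in shapes]
sf = [torch.randn(2, c, s, s) for c, s in shapes]
maps, l_kd = anomaly_maps_and_loss(tf, sf)
print([m.shape for m in maps], l_kd.item())
```

At inference time, the per-stage maps can be upsampled to the input resolution and accumulated into the final anomaly map, as is common for distillation-based detectors.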

4. Experimental Results and Discussions

In this section, we evaluate the anomaly detection and localization performance of our NFERD method on the MVTec dataset [24] and Kolektor Surface-Defect Dataset (KSDD) [25] using several evaluation metrics.

4.1. Datasets

MVTec dataset: The MVTec dataset is a widely used dataset specifically designed for anomaly detection research. It contains 5354 high-resolution color images. These images are divided into 15 different object and texture categories.
KSDD: The KSDD is a publicly available dataset for surface defect detection, consisting of 52 images with visible defects and 347 images without defects, with an image size of 230 × 630 pixels. We adopted a three-fold cross-validation strategy and trained the model only on the anomaly-free images when evaluating its performance.

4.2. Evaluation Metrics

Image-level anomaly score: To evaluate the overall performance of anomaly detection on an image level, we utilized the Area Under the Receiver Operating Characteristic Curve (AUROC). This metric provides a comprehensive measure of how well a model can distinguish between normal and anomalous images. The AUROC score ranges from 0 to 1, where a higher value indicates better discrimination ability.
Pixel-level anomaly score: To assess the precision of anomaly localization at the pixel level, we employed two metrics: the AUROC and the Per-Region-Overlap (PRO) score [20]. The AUROC was again used here to evaluate the model’s capability in distinguishing between normal and anomalous pixels. The PRO, on the other hand, quantifies the overlap between detected anomalies and true anomalies on a per-region basis, which is particularly useful for understanding the spatial accuracy of the detection.
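For reference, both AUROC metrics can be computed with scikit-learn's roc_auc_score, as in the sketch below; the labels, masks, and scores are dummy values, and taking the maximum of the anomaly map as the image-level score is an assumption rather than a detail stated here. The PRO score additionally requires connected-component analysis of the ground-truth regions and is omitted from this sketch.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Image-level AUROC: one anomaly score per image against a binary image label.
image_labels = np.array([0, 0, 1, 1])               # 1 = anomalous image
image_scores = np.array([0.10, 0.25, 0.80, 0.65])   # e.g., max value of each anomaly map
print("I-AUROC:", roc_auc_score(image_labels, image_scores))

# Pixel-level AUROC: flatten ground-truth masks and anomaly maps over the whole test set.
gt_masks = np.random.randint(0, 2, size=(4, 64, 64))     # dummy binary masks
anomaly_maps = np.random.rand(4, 64, 64)                 # dummy anomaly maps
print("P-AUROC:", roc_auc_score(gt_masks.ravel(), anomaly_maps.ravel()))
```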

4.3. Experimental Settings

In our network architecture, both the teacher encoder and the student decoder use Wide ResNet50 as the backbone. Wide ResNet [26,27,28], which enhances the original ResNet architecture by increasing the network width, has been widely adopted. Notably, the teacher encoder was pre-trained on the ImageNet [29] dataset, while the student network mirrors the teacher's architecture in the reverse direction. The number N of feature maps stored in the normal feature bank (NFB) was set to 20. The input image dimensions varied with the dataset: 256 × 256 for the MVTec dataset and 704 × 256 for the KSDD. Our network was trained for 200 epochs with a batch size of 4. The learning rate was set to 0.005 for both the MVTec dataset and the KSDD. The Adam optimizer [30] was employed with parameters $\beta_1 = 0.5$ and $\beta_2 = 0.999$. Our framework was implemented in PyTorch 1.10.1 and trained on an NVIDIA RTX 3090 Ti GPU. For the comparative methods, we configured the hyper-parameters according to the recommendations in their original publications.
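The optimizer configuration reported above corresponds to the following PyTorch setup; the student module below is a placeholder, and only the hyper-parameters are taken from the text.

```python
import torch
from torch import optim

# Placeholder for the trainable part of the network (bottleneck + student decoder).
student = torch.nn.Sequential(torch.nn.Conv2d(256, 256, 3, padding=1))

# Adam with lr = 0.005, beta1 = 0.5, beta2 = 0.999, as reported in the experimental settings.
optimizer = optim.Adam(student.parameters(), lr=0.005, betas=(0.5, 0.999))

# Training schedule from the text: 200 epochs, batch size 4 (data loading omitted).
num_epochs, batch_size = 200, 4
```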

4.4. Experimental Results on the MVTec Dataset

In this experiment, we chose Spade [15], US [20], MKD [19], PaDiM [16], PatchCore [17], RD [4], and HFRA [31] as the baselines and present a detailed comparison of our proposed method with other mainstream methods using three metrics in Table 1, Table 2 and Table 3. These metrics were the Image-level AUROC (I-AUROC), Pixel-level AUROC (P-AUROC), and PRO score (PRO). Table 1 shows the results of our method on the I-AUROC. From the data, it can be seen that our method achieved an average performance of 99.32% in 15 categories, which was 0.22% higher than the second-best result and demonstrated significant advantages. It is worth mentioning that our method achieved the best level in different texture categories, which strongly proved our method’s excellent ability to handle anomaly detection in texture categories. Table 2 and Table 3 show the pixel-wise anomaly localization results of our method. It is evident from the tables that our method demonstrated excellent performance in all three metrics. In 15 different categories, our method achieved an average P-AUROC and PRO of 98.29% and 95.23%, respectively. Compared with the second-best results, the improvements were 0.48% and 1.3%, respectively. This further demonstrated its powerful ability for anomaly localization. Furthermore, compared with RD, our method demonstrated a higher anomaly localization performance. This also fully demonstrated the significant advantages of combining the reverse knowledge distillation with the normal feature bank.
From Figure 5, it can be seen that our method showed significant differences in the visualization effects compared with the RD methods. Specifically, in the third and fourth rows of the figure, there was a sharp contrast between the RD method and our method. The RD method was obviously susceptible to the influence of surrounding noise, which led to poor performance in defect segmentation. In contrast, our method demonstrated superior visualization performance, accurately segmenting the defect area completely while effectively avoiding interference from the surrounding noise areas.

4.5. Experimental Results on the KSDD

In this experiment, we selected six methods for comparative analysis: US [20], PaDim [16], PatchCore [17], Semi-Orthogonal (SO) [32], GMFF [33], and RD [4]. Following [32,34,35], we employed a three-fold cross-validation strategy to evaluate the performance of these methods. Table 4 presents the P-AUROC results, from which we can see that our method exhibited significant performance advantages. Specifically, on Fold1 and Fold2, our method improved by 0.04% and 0.25%, respectively, compared with the second-best results. This advantage was more prominent on Fold3, with an improvement of 0.4% compared with the second-best result. When averaging over all the folds, our method outperformed the second-best method by 0.25%. Figure 6 provides visual results that intuitively demonstrate the performance of our method and the comparative methods. It can be clearly seen from the first row that the KSDD had complex background noise, which posed a great challenge for the defect detection tasks. These complex backgrounds not only increase the difficulty of detection but may also introduce additional noise that interferes with the model's accurate identification of actual defects. As shown in the third row of the figure, the PatchCore method was susceptible to the influence of the surrounding noise, which resulted in the defect area not being correctly segmented. In contrast, the RD method (fourth row) and our method (fifth row) showed better visualization results and were able to completely segment the defect area. However, it is also evident from the figure that RD was susceptible to the influence of complex backgrounds, which resulted in many non-defective areas being incorrectly labeled. In contrast, our method performed better in handling complex backgrounds, effectively avoiding the mislabeling of non-defective areas as anomalies and ensuring more accurate segmentation results.

4.6. Ablation Study

Effect of the number of fused feature maps on the model performance: The number of fused feature maps from the NFB may affect the performance of the model, as it determines the richness of the feature information accessible to the model. Inadequate fusion may lead to information scarcity, potentially affecting the accuracy of the model. Conversely, excessive fusion may introduce redundant information, noise, and unnecessary computational burdens. Therefore, in this part, we delve into the impact of fusing different numbers of feature maps from the NFB on the model performance. Through an experimental analysis on the MVTec dataset and the KSDD, we evaluated the performance of the model when the number of fused feature maps was 0, 5, 10, 15, 20, and 25. When the number of fused feature maps was 0, the model degenerated into the baseline method (RD [4]). From the experimental results in Table 5 and Table 6, it can be seen that all cases where the number of fused feature maps was greater than 0 exhibited better performance than the case where it was 0 (RD), which strongly proves the effectiveness of our proposed NFERD method in anomaly detection tasks. Furthermore, from Table 5, it can be observed that when the number of fused feature maps was 20, the I-AUROC, P-AUROC, and PRO all reached their highest levels. Similar results can be obtained from Table 6; when the number of fused feature maps was set to 20, the P-AUROC reached its best value. It is worth noting that when the number of feature maps reached 25, these metrics stopped improving, which indicates that the number of fused feature maps is not simply a case of "the more, the better"; rather, there exists an optimal balance point for the selected datasets. In this paper, we set the number of fused feature maps to 20.
Effect of different feature fusion strategies on the model performance: As discussed in this section, we investigated the effectiveness of the Hybrid Attention Fusion Module (HAFM) in the anomaly detection tasks. Specifically, we compared the impact on the model performance by replacing the HAFM with simple fusion mechanisms, i.e., adding feature maps directly (adding) or concatenating by channel (CC). Furthermore, to validate the effectiveness of our designed HAFM with a parallel processing strategy, we also conducted two additional experiments: one by replacing the HAFM with a single Channel Attention Module (CAM) and Spatial Attention Module (SAM) independently, and the other by modifying the parallel mechanism in the HAFM into a serial processing mechanism of the SAM and CAM (SAM + CAM or CAM + SAM). These experiments were conducted on the MVTec dataset and the KSDD. The experimental results are shown in Table 7 and Table 8. First, on the MVTec dataset, when using simple fusion strategies, i.e., adding (the first row) and CC (the second row), the I-AUROC, P-AUROC, and PRO reached 98.79%, 97.94%, and 94.57%, and 98.22%, 97.84%, and 94.61%, respectively. When only the SAM (the third row) or CAM (the fourth row) was used for the feature fusion, the I-AUROC, P-AUROC, and PRO reached 98.96%, 97.89%, and 94.60%, and 99.01%, 97.97%, and 95.02%, respectively. It is worth noting that when processing the attention modules in series, the values of the three metrics decreased significantly (fifth and sixth rows). Our HAFM, which processes the CAM and SAM in parallel, produced the highest performance in terms of all these three metrics, showing significant improvement compared with the other fusion strategies. Similar observations were also found on the KSDD. These results strongly confirm the superiority of the HAFM in feature fusion and its positive impact on the model performance.

5. Conclusions

We propose an unsupervised anomaly detection method called NFERD. The core innovation of NFERD lies in combining reverse knowledge distillation with a normal feature bank (NFB) mechanism, aiming to improve the student model's ability to represent the features of normal samples. Specifically, the NFB stores normal feature maps extracted by the teacher model at different stages of the network, which then serve as references for the student model to learn from. To reduce the information loss and redundancy caused by simple fusion mechanisms (i.e., adding feature maps directly or concatenating them by channel) and to effectively fuse the feature maps extracted by the student model with those stored in the NFB, we designed a Hybrid Attention Fusion Module (HAFM). It processes spatial attention and channel attention in parallel to ensure that key spatial details are preserved and inter-channel information is maintained during the fusion process. Comparative experiments and ablation studies conducted on two publicly available datasets confirmed the effectiveness and superiority of the proposed method.

Author Contributions

Conceptualization, J.F.; Data curation, F.Y., H.H. (Hongmin Hu), Z.Z. and H.H. (Haiyan Huang); Formal analysis, P.W.; Methodology, J.F.; Supervision, H.Z.; Validation, F.Y., H.H. (Hongmin Hu) and Z.Z.; Writing—original draft, X.W.; Writing—review and editing, X.W. All authors read and agreed to the published version of this manuscript.

Funding

This study was supported by the Natural Science Foundation of Xiamen (3502Z20227073), National Natural Science Foundation of Fujian Province (grant nos. 2023J011428, 2022J011236, 2022J011235).

Data Availability Statement

The MVTec AD dataset and Kolektor Surface-Defect Dataset (KSDD) can be obtained from https://www.mvtec.com/company/research/datasets/mvtec-ad (accessed on 2 February 2024) and https://www.vicos.si/resources/kolektorsdd (accessed on 2 February 2024), respectively.

Conflicts of Interest

Author Hangqi Zhang was employed by the company Xiamen Yaxon Zhilian Technology. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Gao, T.; Yang, J.; Tang, Q. A multi-source domain information fusion network for rotating machinery fault diagnosis under variable operating conditions. Inf. Fusion 2024, 106, 102278. [Google Scholar] [CrossRef]
  2. Wang, X.; Xu, X.; Wang, Y.; Wu, P.; Yan, F.; Zeng, Z. A robust defect detection method for syringe scale without positive samples. Vis. Comput. 2023, 39, 5451–5467. [Google Scholar] [CrossRef] [PubMed]
  3. Wei, X.; Yang, Z.; Liu, Y.; Wei, D.; Jia, L.; Li, Y. Railway track fastener defect detection based on image processing and deep learning techniques: A comparative study. Eng. Appl. Artif. Intell. 2019, 80, 66–81. [Google Scholar] [CrossRef]
  4. Deng, H.; Li, X. Anomaly detection via reverse distillation from one-class embedding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 9737–9746. [Google Scholar]
  5. Yang, H.; Chen, Y.; Song, K.; Yin, Z. Multiscale feature-clustering-based fully convolutional autoencoder for fast accurate visual inspection of texture surface defects. IEEE Trans. Autom. Sci. Eng. 2019, 16, 1450–1467. [Google Scholar] [CrossRef]
  6. Dehaene, D.; Frigo, O.; Combrexelle, S.; Eline, P. Iterative energy-based projection on a normal data manifold for anomaly localization. arXiv 2020, arXiv:2002.03734. [Google Scholar]
  7. Collin, A.-S.; Vleeschouwer, C.D. Improved anomaly detection by training an autoencoder with skip connections on images corrupted with stain-shaped noise. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 7915–7922. [Google Scholar]
  8. Zavrtanik, V.; Kristan, M.; Skočaj, D. Draem-a discriminatively trained reconstruction embedding for surface anomaly detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 8330–8339. [Google Scholar]
  9. Lv, C.; Shen, F.; Zhang, Z.; Xu, D.; He, Y. A novel pixel-wise defect inspection method based on stable background reconstruction. IEEE Trans. Instrum. Meas. 2020, 70, 1–13. [Google Scholar] [CrossRef]
  10. Yang, M.; Wu, P.; Feng, H. Memseg: A semi-supervised method for image surface defect detection using differences and commonalities. Eng. Appl. Artif. Intell. 2023, 119, 105835. [Google Scholar] [CrossRef]
  11. Mei, S.; Yang, H.; Yin, Z. An unsupervised-learning-based approach for automated defect inspection on textured surfaces. IEEE Trans. Instrum. Meas. 2018, 67, 1266–1277. [Google Scholar] [CrossRef]
  12. Bergmann, P.; Löwe, S.; Fauser, M.; Sattlegger, D.; Steger, C. Improving unsupervised defect segmentation by applying structural similarity to autoencoders. arXiv 2018, arXiv:1807.02011. [Google Scholar]
  13. Liu, W.; Li, R.; Zheng, M.; Karanam, S.; Wu, Z.; Bhanu, B.; Radke, R.J.; Camps, O. Towards visually explaining variational autoencoders. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 8642–8651. [Google Scholar]
  14. Venkataramanan, S.; Peng, K.-C.; Singh, R.V.; Mahalanobis, A. Attention guided anomaly localization in images. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 485–503. [Google Scholar]
  15. Cohen, N.; Hoshen, Y. Sub-image anomaly detection with deep pyramid correspondences. arXiv 2020, arXiv:2005.02357. [Google Scholar]
  16. Defard, T.; Setkov, A.; Loesch, A.; Audigier, R. Padim: A patch distribution modeling framework for anomaly detection and localization. In Proceedings of the International Conference on Pattern Recognition, Milan, Italy, 10–15 January 2021; pp. 475–489. [Google Scholar]
  17. Roth, K.; Pemula, L.; Zepeda, J.; Schölkopf, B.; Brox, T.; Gehler, P. Towards total recall in industrial anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 14318–14328. [Google Scholar]
  18. Rippel, O.; Mertens, P.; Merhof, D. Modeling the distribution of normal data in pre-trained deep features for anomaly detection. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 6726–6733. [Google Scholar]
  19. Salehi, M.; Sadjadi, N.; Baselizadeh, S.; Rohban, M.H.; Rabiee, H.R. Multiresolution knowledge distillation for anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 14902–14912. [Google Scholar]
  20. Bergmann, P.; Fauser, M.; Sattlegger, D.; Steger, C. Uninformed students: Student-teacher anomaly detection with discriminative latent embeddings. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 4183–4192. [Google Scholar]
  21. Wang, G.; Han, S.; Ding, E.; Huang, D. Student-teacher feature pyramid matching for anomaly detection. arXiv 2021, arXiv:2103.04257. [Google Scholar]
  22. Tang, C.; Zhou, S.; Li, Y.; Dong, Y.; Wang, L. Advancing pre-trained teacher: Towards robust feature discrepancy for anomaly detection. arXiv 2024, arXiv:2405.02068. [Google Scholar]
  23. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  24. Bergmann, P.; Batzner, K.; Fauser, M.; Sattlegger, D.; Steger, C. The mvtec anomaly detection dataset: A comprehensive real-world dataset for unsupervised anomaly detection. Int. J. Comput. Vis. 2021, 129, 1038–1059. [Google Scholar] [CrossRef]
  25. Tabernik, D.; Šela, S.; Skvarč, J.; Skočaj, D. Segmentation-based deep-learning approach for surface-defect detection. J. Intell. Manuf. 2020, 31, 759–776. [Google Scholar] [CrossRef]
  26. Zagoruyko, S.; Komodakis, N. Wide residual networks. arXiv 2016, arXiv:1605.07146. [Google Scholar]
  27. Li, B.; Lima, D. Facial expression recognition via resnet-50. Int. J. Cogn. Comput. Eng. 2021, 2, 57–64. [Google Scholar] [CrossRef]
  28. Ikechukwu, A.V.; Murali, S.; Deepu, R.; Shivamurthy, R. Resnet-50 vs. vgg-19 vs training from scratch: A comparative analysis of the segmentation and classification of pneumonia from chest x-ray images. Glob. Transit. Proc. 2021, 2, 375–381. [Google Scholar] [CrossRef]
  29. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Li, F. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
  30. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  31. Chen, H.; Chen, P.; Mao, H.; Jiang, M. A hierarchically feature reconstructed autoencoder for unsupervised anomaly detection. arXiv 2024, arXiv:2405.09148. [Google Scholar]
  32. Kim, J.-H.; Kim, D.-H.; Yi, S.; Lee, T. Semi-orthogonal embedding for efficient unsupervised anomaly segmentation. arXiv 2021, arXiv:2105.14737. [Google Scholar]
  33. Zhang, F.; Kan, S.; Zhang, D.; Cen, Y.; Zhang, L.; Mladenovic, V. A graph model-based multiscale feature fitting method for unsupervised anomaly detection. Pattern Recognit. 2023, 138, 109373. [Google Scholar] [CrossRef]
  34. Wang, X.; Wang, Y.; Xu, X.; Yan, F.; Zeng, Z. Two-stage deep neural network with joint loss and multi-level representations for defect detection. J. Electron. Imaging 2022, 31, 063060. [Google Scholar] [CrossRef]
  35. Yang, H.; Zhu, Z.; Lin, C.; Hui, W.; Wang, S.; Zhao, Y. Self-supervised surface defect localization via joint de-anomaly reconstruction and saliency-guided segmentation. IEEE Trans. Instrum. Meas. 2023, 72, 5014710. [Google Scholar] [CrossRef]
Figure 1. Examples of “feature leakage” in training images (first column) and testing images (second column), both of which were defect-free but contained some dust or impurities (red squares). From the third column, it can be seen that RD was susceptible to “feature leakage” issues, which led to the misidentification of dust or impurities as defects. In contrast, our method (fourth column) showed superior performance and was less affected by feature leakage issues.
Figure 2. The overall architecture of our proposed NFERD network.
Figure 3. The process of obtaining the top K most similar feature vectors.
Figure 4. The structure of the Hybrid Attention Fusion Module.
Figure 5. The visualization results of different methods on the MVTec dataset indicate that the RD [4] method was easily affected by complex backgrounds, which resulted in unsatisfactory defect segmentation effects. In contrast, our method produced higher-quality defect visualization effects and was less affected by complex backgrounds.
Figure 6. The visualization results of different methods on the KSDD. Although the KSDD contained complex variable backgrounds, which often led to the performance degradation of the detection models due to cluttered backgrounds, it was encouraging to see that our method had a strong anti-interference ability. It could not only accurately locate abnormal areas but also accurately identified these areas, even in highly challenging backgrounds, which demonstrated its excellent robustness and accuracy.
Table 1. Anomaly detection results in terms of the I-AUROC on the MVTec dataset.
Category/Method | Spade [15] | US [20] | MKD [19] | PaDiM [16] | PatchCore [17] | RD [4] | HFRA [31] | Ours
Texture: Carpet | 92.80 | 91.60 | 79.30 | 99.80 | 98.70 | 98.90 | 100 | 99.96
Texture: Grid | 47.30 | 81.00 | 78.00 | 96.70 | 98.20 | 100 | 98.80 | 100
Texture: Leather | 95.40 | 88.20 | 95.10 | 100 | 100 | 100 | 100 | 100
Texture: Tile | 96.50 | 99.10 | 91.60 | 98.10 | 98.70 | 99.30 | 98.60 | 99.57
Texture: Wood | 95.80 | 97.70 | 94.30 | 99.20 | 99.20 | 99.20 | 98.60 | 99.30
Texture: Avg | 85.56 | 91.52 | 87.66 | 98.76 | 98.96 | 99.48 | 99.20 | 99.77
Object: Bottle | 97.20 | 99.00 | 99.40 | 99.90 | 100 | 100 | 100 | 100
Object: Cable | 84.80 | 86.20 | 89.20 | 92.70 | 99.50 | 95.00 | 98.90 | 98.13
Object: Capsule | 91.00 | 86.10 | 80.50 | 91.30 | 98.10 | 96.30 | 93.60 | 98.88
Object: Hazelnut | 88.10 | 93.10 | 98.40 | 92.00 | 100 | 99.90 | 100 | 100
Object: Metal nut | 71.00 | 82.00 | 73.60 | 98.70 | 100 | 100 | 99.90 | 100
Object: Pill | 80.10 | 87.90 | 82.70 | 93.30 | 96.60 | 96.60 | 97.70 | 98.15
Object: Screw | 66.70 | 54.90 | 83.30 | 85.80 | 98.10 | 97.00 | 96.00 | 99.26
Object: Toothbrush | 88.90 | 95.30 | 92.20 | 96.10 | 100 | 99.50 | 98.30 | 99.72
Object: Transistor | 90.30 | 81.80 | 85.60 | 97.40 | 100 | 96.70 | 99.70 | 98.25
Object: Zipper | 96.60 | 91.90 | 93.20 | 90.30 | 99.40 | 98.50 | 98.60 | 98.58
Object: Avg | 85.47 | 85.82 | 87.81 | 93.75 | 99.17 | 97.95 | 98.27 | 99.10
Total avg | 85.50 | 87.72 | 87.76 | 95.42 | 99.10 | 98.46 | 98.60 | 99.32
Table 2. Anomaly localization results in terms of the P-AUROC on the MVTec dataset.
Category/Method | Spade [15] | US [20] | MKD [19] | PaDiM [16] | PatchCore [17] | RD [4] | HFRA [31] | Ours
Texture: Carpet | 97.50 | 93.50 | 95.64 | 99.10 | 99.00 | 98.90 | 98.90 | 99.02
Texture: Grid | 93.70 | 89.90 | 91.78 | 97.30 | 98.70 | 99.30 | 98.30 | 99.35
Texture: Leather | 97.60 | 97.80 | 98.05 | 99.20 | 99.30 | 99.40 | 99.10 | 99.42
Texture: Tile | 87.40 | 92.50 | 82.77 | 94.10 | 95.40 | 95.60 | 95.10 | 96.38
Texture: Wood | 88.50 | 92.10 | 84.80 | 94.90 | 95.00 | 95.30 | 96.20 | 95.50
Texture: Avg | 92.94 | 93.16 | 90.61 | 96.92 | 97.48 | 97.70 | 97.52 | 97.93
Object: Bottle | 98.40 | 97.80 | 96.32 | 98.30 | 98.60 | 98.70 | 97.80 | 98.58
Object: Cable | 97.20 | 91.90 | 82.40 | 96.70 | 98.40 | 97.40 | 97.50 | 97.91
Object: Capsule | 99.00 | 96.80 | 95.86 | 98.50 | 98.80 | 98.70 | 98.20 | 98.83
Object: Hazelnut | 99.10 | 98.20 | 94.62 | 98.20 | 98.70 | 98.90 | 98.40 | 99.14
Object: Metal nut | 98.10 | 97.20 | 86.38 | 97.20 | 98.40 | 97.30 | 97.20 | 97.75
Object: Pill | 96.50 | 96.50 | 89.63 | 95.70 | 97.40 | 98.20 | 97.40 | 99.12
Object: Screw | 98.90 | 97.40 | 95.96 | 98.50 | 99.40 | 99.60 | 99.00 | 99.67
Object: Toothbrush | 97.90 | 97.90 | 96.12 | 98.80 | 98.70 | 99.10 | 98.30 | 99.15
Object: Transistor | 94.10 | 73.70 | 76.45 | 97.50 | 96.30 | 92.50 | 95.70 | 95.88
Object: Zipper | 96.50 | 95.60 | 93.90 | 98.50 | 98.80 | 98.20 | 97.70 | 98.64
Object: Avg | 97.57 | 94.30 | 90.76 | 97.79 | 98.35 | 97.86 | 97.72 | 98.47
Total avg | 96.03 | 93.92 | 90.71 | 97.50 | 98.06 | 97.81 | 97.70 | 98.29
Table 3. Anomaly localization results in terms of the PRO on the MVTec dataset.
Category/Method | Spade [15] | US [20] | MKD [19] | PaDiM [16] | PatchCore [17] | RD [4] | HFRA [31] | Ours
Texture: Carpet | 94.70 | 87.90 | - | 96.20 | 96.60 | 97.00 | 95.20 | 97.31
Texture: Grid | 86.70 | 95.20 | - | 94.60 | 96.00 | 97.60 | 94.40 | 97.77
Texture: Leather | 97.20 | 94.50 | - | 97.80 | 98.90 | 99.10 | 97.60 | 99.15
Texture: Tile | 75.90 | 94.60 | - | 86.00 | 87.30 | 90.60 | 79.80 | 92.14
Texture: Wood | 87.40 | 91.10 | - | 91.10 | 89.40 | 90.90 | 90.80 | 93.26
Texture: Avg | 88.38 | 92.66 | - | 93.14 | 93.64 | 95.04 | 91.56 | 95.93
Object: Bottle | 95.50 | 93.10 | - | 94.80 | 96.20 | 96.60 | 92.10 | 96.12
Object: Cable | 90.90 | 81.80 | - | 88.80 | 92.50 | 91.00 | 92.60 | 92.91
Object: Capsule | 93.70 | 96.80 | - | 93.50 | 95.50 | 95.80 | 90.50 | 96.38
Object: Hazelnut | 95.40 | 96.50 | - | 92.60 | 93.80 | 95.50 | 97.60 | 96.28
Object: Metal nut | 94.40 | 94.20 | - | 85.60 | 91.40 | 92.30 | 99.40 | 92.76
Object: Pill | 94.60 | 96.10 | - | 92.70 | 93.20 | 96.40 | 94.70 | 97.85
Object: Screw | 96.00 | 94.20 | - | 94.40 | 97.90 | 98.20 | 95.00 | 98.78
Object: Toothbrush | 93.50 | 93.30 | - | 93.10 | 91.50 | 94.50 | 86.50 | 94.72
Object: Transistor | 87.40 | 66.60 | - | 84.50 | 83.70 | 78.00 | 88.50 | 86.83
Object: Zipper | 92.60 | 95.10 | - | 95.90 | 97.10 | 95.40 | 93.10 | 96.18
Object: Avg | 93.40 | 90.77 | - | 91.59 | 93.28 | 93.37 | 93.00 | 94.88
Total avg | 91.73 | 91.40 | - | 92.11 | 93.40 | 93.93 | 92.50 | 95.23
Table 4. Anomaly localization results in terms of the P-AUROC on the KSDD, where we report the average value (Avg) and standard deviation (Std) of each fold.
Method | US [20] | PaDim [16] | PatchCore [17] | SO [32] | GMFF [33] | RD [4] | Ours
Fold1 | 90.40 | 93.90 | 93.30 | 95.30 | 96.20 | 97.86 | 97.90
Fold2 | 88.30 | 93.50 | 93.40 | 95.10 | 98.16 | 98.67 | 98.92
Fold3 | 90.20 | 96.20 | 93.90 | 97.60 | 98.32 | 99.03 | 99.43
Avg ± Std | 89.63 ± 1.20 | 94.53 ± 1.50 | 93.53 ± 0.15 | 96.00 ± 1.40 | 96.80 ± 0.00 | 98.52 ± 0.44 | 98.75 ± 0.74
Table 5. Comparison of the anomaly detection and localization performance of 15 categories via fusing different numbers of feature maps on the MVTec dataset.
Number | 0 | 5 | 10 | 15 | 20 | 25
I-AUROC | 98.46 | 99.08 | 99.19 | 99.13 | 99.32 | 98.81
P-AUROC | 97.81 | 98.20 | 98.22 | 98.09 | 98.26 | 98.23
PRO | 93.93 | 95.08 | 95.08 | 94.92 | 95.23 | 95.06
Table 6. Performance comparison in terms of the P-AUROC via fusing different numbers of feature maps on the KSDD.
Number | 0 | 5 | 10 | 15 | 20 | 25
Fold1 | 97.86 | 97.89 | 97.81 | 97.89 | 97.90 | 97.80
Fold2 | 98.67 | 98.31 | 98.63 | 98.38 | 98.92 | 98.56
Fold3 | 98.52 | 99.33 | 99.42 | 98.91 | 99.43 | 99.42
Avg | 98.35 | 98.51 | 98.62 | 98.39 | 98.75 | 98.59
Table 7. Anomaly detection and localization results of different fusion strategies on the MVTec dataset.
Fusion Strategy | I-AUROC | P-AUROC | PRO
Adding | 98.79 | 97.94 | 94.57
CC | 98.22 | 97.84 | 94.61
SAM | 98.96 | 97.89 | 94.60
CAM | 99.01 | 97.97 | 95.02
SAM + CAM | 96.28 | 97.51 | 93.30
CAM + SAM | 91.61 | 95.69 | 88.94
HAFM | 99.32 | 98.29 | 95.23
Table 8. Anomaly localization result of different fusion strategies in terms of the P-AUROC on the KSDD.
Fusion Strategy | Fold1 | Fold2 | Fold3 | Avg
Adding | 97.85 | 98.73 | 99.27 | 98.62
CC | 97.39 | 98.71 | 99.06 | 98.39
SAM | 97.70 | 98.68 | 98.70 | 98.36
CAM | 97.64 | 98.80 | 99.06 | 98.50
SAM + CAM | 97.86 | 98.83 | 99.13 | 98.61
CAM + SAM | 97.47 | 98.71 | 99.25 | 98.48
HAFM | 97.90 | 98.92 | 99.43 | 98.75