Noise-to-Norm Reconstruction for Industrial Anomaly Detection and Localization

Anomaly detection has a wide range of applications and is especially important in industrial quality inspection. Currently, many top-performing anomaly-detection models rely on feature-embedding methods. However, these methods do not perform well on datasets with large variations in object locations. Reconstruction-based methods use reconstruction errors to detect anomalies without considering positional differences between samples. In this study, a reconstruction-based method using the noise-to-norm paradigm is proposed, which avoids the invariant reconstruction of anomalous regions. Our reconstruction network is based on M-net and incorporates multiscale fusion and residual attention modules to enable end-to-end anomaly detection and localization. Experiments demonstrate that the method is effective in reconstructing anomalous regions into normal patterns and achieving accurate anomaly detection and localization. On the MPDD and VisA datasets, our proposed method achieved more competitive results than the latest methods, and it set a new state-of-the-art standard on the MPDD dataset.


Introduction
Anomaly detection has a wide range of applications in fields such as industrial quality inspection [2,7,30], medical diagnosis [15], and video surveillance [13,26]. In industrial quality inspection, anomaly detection can identify and locate defects in the appearance of a product, improve product quality, and ensure standards compliance. With the emergence of modern technologies in computer vision, anomaly detection using deep learning methods has rapidly developed as an effective solution for industrial quality inspections, addressing the challenges of low efficiency and difficulty in conducting large-scale manual inspections.
Supervised methods, which have high data-annotation costs and adapt poorly to new defects, have seen limited use in recent years. Therefore, most studies now focus on unsupervised learning methods. Among these, feature-embedding-based methods [3,17,22,6], which use pretrained models to extract image features and then model those features for measurement or comparison, are widely used. However, positional consistency of the inspected objects is crucial for these methods. In contrast, reconstruction-based methods [1,10,9,27] do not have this limitation and do not require additional training data, making them suitable for various scenarios.
Reconstruction-based methods exhibit better anomaly detection and localization performance for randomly placed objects. Unlike traditional image reconstruction methods, our proposed reconstruction model uses noisy images as input; the noise disrupts the abnormal areas so that they are difficult to distinguish from normal patterns, which addresses the problem that a network with strong reconstruction capabilities also reconstructs abnormal regions. In addition, our proposed reconstruction model is based on M-net [14] and employs a multiscale fusion structure. Before being fed into the reconstruction network, the noisy image is downsampled to several sizes to enlarge the model's receptive field, providing better robustness to anomalous regions of diverse sizes. The reconstruction network comprises three parts: an encoder, a decoder, and a feature fusion module. Both the encoder and the decoder contain residual attention modules, and skip connections link them. The feature fusion module fuses the multiscale features to generate the reconstructed image.
Numerous experiments on the MPDD [7] and VisA [30] datasets have demonstrated that the proposed end-to-end anomaly-detection method has excellent performance. The main contributions of this study are summarized as follows:
1. We introduce a novel unsupervised anomaly detection method based on the noise-to-norm paradigm.
2. We propose a residual attention module that can be embedded in the encoder and decoder to achieve high-quality reconstruction of noisy images.
3. Our method achieves state-of-the-art (SOTA) performance on the MPDD dataset.

Related Work
Unsupervised learning addresses the high annotation costs and the difficulty of collecting negative samples, making it the mainstream approach for image anomaly detection. Unsupervised learning methods can be divided into two main categories: reconstruction-based and feature embedding-based methods.

Feature Embedding-based Methods
Feature embedding-based methods aim to determine a feature distribution that can distinguish between normal and anomalous samples. Typically, these methods use a pre-trained network as a feature extractor to extract shallow features from images. A common approach fits the normal sample features to a Gaussian distribution and uses the Mahalanobis distance between the test samples and this distribution as the anomaly score [3,29], from which the anomaly localization is estimated. The method in [17] employs a coreset-subsampled memory bank to keep the inference cost low while achieving higher performance. Some studies attach a normalizing flow module to the feature extractor [6,18,22]: features are first extracted, and the normalizing flow module is then used to transform between the data distribution and a well-defined density. Subsequently, anomaly detection and localization are performed based on the probability density of the feature map.
In general, feature embedding-based methods have achieved better results on the MVTec AD [2] dataset than reconstruction-based methods because of the powerful representation capability of deep features. However, they rely on the uniformity of an object's location, which makes optimization difficult for cases in which the object's position varies significantly.

Reconstruction-based Methods
The reconstruction-based method trains an encoder and a decoder to reconstruct images with a low dependence on pretrained models. This method aims to train a reconstruction model that works well on positive samples but poorly on anomalous regions, and it achieves anomaly detection and localization by comparing the original image with the reconstructed image. Early studies used autoencoders [19,10,24] for image reconstruction, whereas some methods employed a generative adversarial network [1,20,9] to obtain better reconstruction performance. However, these models can overgeneralize, which leads to an accurate reconstruction of anomalous regions as well. To address this issue, some researchers proposed methods based on image inpainting [25,16,8,27], in which masks are used to remove parts of the original image, preventing the reconstruction of anomalous regions. However, for images with complex structures and irregular textures, excessive loss of the original information may limit the reconstruction ability and cause many false positives in normal regions.

Overview
The proposed anomaly detection framework is based on the noise-to-norm paradigm, as shown in Fig. 1.
Specifically, we introduce random Gaussian noise to corrupt the original image, and the process of adding noise is defined as follows:

$$x_\lambda = (1 - \lambda)\, x_0 + \lambda\, \epsilon \tag{1}$$

where λ ∈ (0, 1), x_0 is the data obtained by normalizing each channel of the original image according to a Gaussian distribution (µ = 0.5, σ = 0.5), and ε is the random noise. We add random noise generated from the same Gaussian distribution to the original image using weighted blending, thereby allowing us to control the degree to which the noise corrupts the original image. In contrast to the methods that simulate anomalies [24], our approach of adding noise is not intended to simulate anomalies. Instead, its purpose is to completely obscure the distinguishable appearance of anomalous regions, allowing the reconstruction network to transform the anomalous image into a normal image.
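The weighted blending described above can be illustrated with a short NumPy sketch. It assumes a convex combination of the normalized image and the noise, and it reads "the same Gaussian distribution" as noise with mean 0.5 and standard deviation 0.5; both readings, as well as the function names `normalize` and `add_noise`, are assumptions made for illustration.

```python
import numpy as np

def normalize(image, mean=0.5, std=0.5):
    """Channel-wise normalization with mu = 0.5, sigma = 0.5 (maps [0, 1] to [-1, 1])."""
    return (image - mean) / std

def add_noise(x0, lam, rng, noise_mean=0.5, noise_std=0.5):
    """Blend the normalized image x0 with Gaussian noise using weight lam.

    Assumption: convex combination (1 - lam) * x0 + lam * eps.
    The noise parameters are one possible reading of the text, not confirmed.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    eps = rng.normal(loc=noise_mean, scale=noise_std, size=x0.shape)
    return (1.0 - lam) * x0 + lam * eps

rng = np.random.default_rng(0)
image = rng.random((3, 8, 8))               # stand-in for an RGB image in [0, 1]
x0 = normalize(image)
x_noisy = add_noise(x0, lam=0.3, rng=rng)   # lam = 0.3 is the value selected in the ablation
```

With `lam = 0` the function returns the normalized image unchanged, which corresponds to the no-noise baseline of the ablation study.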
After noise is added, the images are downsampled to several sizes to serve as multiple inputs. These inputs are then used by the reconstruction network to generate anomaly-free images. During the training phase, only anomaly-free samples are used to train the reconstruction network. The reconstructed images are compared with the original images using a loss function, and the reconstruction capability of the model is continuously improved. During the inference phase, anomaly localization is achieved by generating an anomaly map that captures pixel-level differences between the reconstructed and original images. The specific details of the reconstruction network are described below.
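As a rough illustration of the multi-input step, the sketch below builds a pyramid of downsampled copies of a noisy image. The downsampling operator (average pooling) and the factors (1, 2, 4, 8) are assumptions chosen for illustration; the text only states that the image is downsampled to several sizes.

```python
import numpy as np

def downsample(x, factor):
    """Average-pool a (C, H, W) array by an integer factor."""
    c, h, w = x.shape
    if h % factor or w % factor:
        raise ValueError("H and W must be divisible by factor")
    # Split each spatial axis into (blocks, factor) and average inside each block.
    return x.reshape(c, h // factor, factor, w // factor, factor).mean(axis=(2, 4))

def build_pyramid(x, factors=(1, 2, 4, 8)):
    """Return one downsampled copy of x per factor (hypothetical factors)."""
    return [downsample(x, f) for f in factors]

noisy = np.arange(3 * 256 * 256, dtype=np.float64).reshape(3, 256, 256)
pyramid = build_pyramid(noisy)
shapes = [p.shape for p in pyramid]
```

Each pyramid level would then feed the corresponding scale of the reconstruction network.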

Reconstruction Network
The overall architecture of the proposed reconstruction network is shown in Fig. 2. The network is based on the M-net [14], which originated from the field of image segmentation and has been proven to be effective in the domain of denoising. Inspired by the SRMnet [5], we incorporated pixel shuffle operations into the encoder and decoder for upsampling and downsampling; this allows us to effectively manage resolution changes in the network and improve the reconstruction quality. The residual attention modules were merged after concatenating the features to enhance the feature representation and capture the relevant information. The encoder and decoder were connected through skip connections to facilitate the flow of information between different feature levels. The multiscale features were combined in the feature fusion module to generate the final reconstructed image. This design enables the network to effectively capture anomalies and produce high-quality reconstructions.
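The pixel shuffle operation mentioned above rearranges values between the channel and spatial dimensions without discarding information. The NumPy sketch below shows the standard rearrangement and its inverse for a single (C, H, W) tensor. It illustrates the operation itself and is not taken from the authors' implementation; in the network these rearrangements would presumably be combined with learned convolutions.

```python
import numpy as np

def pixel_unshuffle(x, r):
    """(C, H*r, W*r) -> (C*r*r, H, W): trade spatial resolution for channels."""
    c, hr, wr = x.shape
    if hr % r or wr % r:
        raise ValueError("spatial size must be divisible by r")
    h, w = hr // r, wr // r
    x = x.reshape(c, h, r, w, r)
    x = x.transpose(0, 2, 4, 1, 3)          # (c, r, r, h, w)
    return x.reshape(c * r * r, h, w)

def pixel_shuffle(x, r):
    """(C*r*r, H, W) -> (C, H*r, W*r): trade channels for spatial resolution."""
    crr, h, w = x.shape
    if crr % (r * r):
        raise ValueError("channel count must be divisible by r*r")
    c = crr // (r * r)
    x = x.reshape(c, r, r, h, w)
    x = x.transpose(0, 3, 1, 4, 2)          # (c, h, r, w, r)
    return x.reshape(c, h * r, w * r)

x = np.arange(16, dtype=np.float64).reshape(1, 4, 4)
down = pixel_unshuffle(x, 2)                # shape (4, 2, 2)
up = pixel_shuffle(down, 2)                 # back to shape (1, 4, 4)
```

Because the two functions are exact inverses, downsampling with pixel unshuffle loses no pixel values, unlike strided pooling.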

Residual Attention Module
The residual attention modules are integrated after the concatenation of features to enhance feature representation and capture relevant information. These modules leverage residual connections and attention mechanisms to selectively emphasize notable features and suppress irrelevant ones. By focusing on informative regions and enhancing feature discrimination, the residual attention modules improve the network's ability to generate high-quality reconstructions. In addition, the residual connections address the issue of vanishing gradients. By propagating gradients more effectively through the network, the residual connections enable faster convergence and improve the accuracy of the model. The specific structure of the residual attention module is shown in Fig. 3. It comprises global pooling, convolutional, and activation layers. In both pathways, a 1 × 1 convolutional layer is employed to adjust the number of feature channels.
Selective Kernel Feature Fusion (SKFF). Our decoder generates four feature maps with different resolutions, and we employed the SKFF [23] module for feature fusion. The SKFF allows for the selection of different convolutional kernels at different spatial positions to facilitate the fusion of features from different scales, enabling the integration of multiscale reconstruction features. This approach avoids directly connecting each feature map and instead aggregates weighted features, addressing the issues of a large number of parameters and higher computational complexity in the M-net.
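The weighted aggregation performed by SKFF can be sketched as a fuse-then-select step. The NumPy code below follows the commonly described structure of the module: sum the branches, apply global average pooling, pass the result through a channel-reducing projection, and compute a softmax across branches. Several points are assumptions for illustration: the branches are taken to be already resized to a common resolution, random matrices stand in for the learned 1 × 1 convolutions, ReLU is used as the activation, and the names `skff_fuse`, `w_down`, and `w_up` are hypothetical.

```python
import numpy as np

def softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)     # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def skff_fuse(features, w_down, w_up):
    """Fuse K feature maps of shape (C, H, W) with per-branch channel attention.

    w_down: (C // r, C) stand-in for the channel-reducing 1x1 convolution.
    w_up:   (K, C, C // r) stand-ins for the per-branch 1x1 convolutions.
    """
    stacked = np.stack(features)                # (K, C, H, W)
    summed = stacked.sum(axis=0)                # fuse: element-wise sum of branches
    s = summed.mean(axis=(1, 2))                # global average pooling -> (C,)
    z = np.maximum(w_down @ s, 0.0)             # compact descriptor -> (C // r,)
    logits = w_up @ z                           # one score per branch and channel -> (K, C)
    attn = softmax(logits, axis=0)              # select: weights sum to 1 across branches
    fused = (attn[:, :, None, None] * stacked).sum(axis=0)
    return fused, attn

rng = np.random.default_rng(0)
K, C, H, W, r = 4, 8, 16, 16, 4
features = [rng.standard_normal((C, H, W)) for _ in range(K)]
w_down = 0.1 * rng.standard_normal((C // r, C))
w_up = 0.1 * rng.standard_normal((K, C, C // r))
fused, attn = skff_fuse(features, w_down, w_up)
```

Because the output is a weighted sum rather than a concatenation, its channel count stays at C regardless of the number of branches, which is the source of the parameter savings noted above.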

Metric Function
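We employed a metric function that combines the MS-SSIM and ℓ1 losses, as proposed by Zhao et al. [28]. SSIM [21] is a widely used indicator for measuring the structural similarity between images. The SSIM for pixel p is defined in Eq. 2:

$$\mathrm{SSIM}(p) = \frac{2\mu_x\mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1} \cdot \frac{2\sigma_{xy} + C_2}{\sigma_x^2 + \sigma_y^2 + C_2} = l(p) \cdot cs(p) \tag{2}$$

where the means and standard deviations are computed using a Gaussian filter $G_{\sigma_G}$ with a standard deviation $\sigma_G$, and $C_1$ and $C_2$ are small constants that stabilize the division. MS-SSIM evaluates the images with Gaussian filters of different scales (σ = 0.5, 1, 2, 4, and 8) and is defined as follows:

$$\text{MS-SSIM}(p) = l_M^{\alpha}(p) \cdot \prod_{j=1}^{M} cs_j^{\beta_j}(p) \tag{3}$$

where $l_M$ and $cs_j$ are the terms defined in Eq. 2, and the index j represents Gaussian filters with different σ values. For convenience, we set $\alpha = \beta_j = 1$ for j = {1, ..., M}.
During the training phase, for an image of size H × W, the MS-SSIM loss can be expressed as follows:

$$\mathcal{L}^{\text{MS-SSIM}} = 1 - \frac{1}{H W} \sum_{p} \text{MS-SSIM}(p) \tag{4}$$

The total loss is calculated by adding the ℓ1 loss multiplied by the Gaussian filter and the weighted MS-SSIM loss, as shown in Eq. 5:

$$\mathcal{L} = \alpha \cdot \mathcal{L}^{\text{MS-SSIM}} + (1 - \alpha) \cdot G_{\sigma_G^M} \cdot \mathcal{L}^{\ell_1} \tag{5}$$

where α represents the weight coefficient.
During the inference stage, we calculated the anomaly localization by computing the MS-SSIM and ℓ1 error for each pixel.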
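A per-pixel anomaly map of this kind can be sketched in NumPy as follows. The code assumes single-channel images in [0, 1], the conventional SSIM constants C1 = 0.01² and C2 = 0.03², reflect padding at the borders, and a mixing weight `alpha` whose value the text does not report (0.5 is used purely as a placeholder). The helper names are hypothetical, and the way the two terms are combined at inference is an assumption modeled on the training loss.

```python
import numpy as np

def gaussian_kernel(sigma):
    radius = max(1, int(3 * sigma + 0.5))
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    k = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def gaussian_blur(img, sigma):
    """Separable Gaussian filter with reflect padding on a 2-D array."""
    k = gaussian_kernel(sigma)
    r = len(k) // 2
    padded = np.pad(img, r, mode="reflect")
    rows = np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 0, rows)

def ms_ssim_map(x, y, sigmas=(0.5, 1.0, 2.0, 4.0, 8.0), c1=0.01 ** 2, c2=0.03 ** 2):
    """Per-pixel MS-SSIM with all exponents set to 1."""
    cs_prod = np.ones_like(x)
    for sigma in sigmas:
        mu_x, mu_y = gaussian_blur(x, sigma), gaussian_blur(y, sigma)
        var_x = gaussian_blur(x * x, sigma) - mu_x ** 2
        var_y = gaussian_blur(y * y, sigma) - mu_y ** 2
        cov = gaussian_blur(x * y, sigma) - mu_x * mu_y
        cs_prod *= (2 * cov + c2) / (var_x + var_y + c2)
    # The luminance term uses the means from the last (coarsest) scale.
    l_m = (2 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)
    return l_m * cs_prod

def anomaly_map(original, reconstructed, alpha, sigmas=(0.5, 1.0, 2.0, 4.0, 8.0)):
    """Weighted sum of (1 - MS-SSIM) and the Gaussian-filtered absolute error."""
    ssim_term = 1.0 - ms_ssim_map(original, reconstructed, sigmas)
    l1_term = gaussian_blur(np.abs(original - reconstructed), sigmas[-1])
    return alpha * ssim_term + (1.0 - alpha) * l1_term

rng = np.random.default_rng(0)
reconstructed = 0.4 + 0.2 * rng.random((32, 32))   # stand-in for an anomaly-free reconstruction
original = reconstructed.copy()
original[12:20, 12:20] += 0.4                      # synthetic defect
amap = anomaly_map(original, reconstructed, alpha=0.5)   # alpha is a placeholder value
```

The map is close to zero where the two images agree and rises around the synthetic defect, which is the behavior thresholding would exploit for localization.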

Datasets
MPDD. MPDD [7] is a challenging dataset that focuses on detecting defects in the manufacturing process of painted metal parts. It reflects the real-world situations encountered by human workers on production lines. The dataset includes six categories of metal parts. The images were captured under various spatial orientations, positions, and distance conditions with different light intensities and non-uniform backgrounds. The training set consisted of 888 normal samples, whereas the test set consisted of 176 normal and 282 abnormal samples.
VisA. VisA [30] consists of 10,821 images, of which 9,621 are normal and 1,200 are abnormal. VisA contains 12 subsets, each corresponding to one class of objects. We assigned 90% of the normal images to the training set, whereas the remaining 10% of the normal images and all anomalous samples were grouped as the test set.

Experimental Details
Our work was implemented in PyTorch with an NVIDIA GeForce RTX 2080 Ti. We resized all images of the VisA and MPDD datasets to 256 × 256 for both training and testing. We held out 20% of the training dataset as a validation set. For each category of these two datasets, we used the AdamW optimizer [12] with β = (0.5, 0.999). We set the initial learning rate to 10^-6 and used cosine annealing [11] to adjust the learning rate with T_max = 100 and eta_min = 10^-6. The maximum number of training epochs was set to 500, and the training was stopped early if the loss did not decrease within 20 consecutive epochs.
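The learning-rate schedule and stopping rule can be sketched in plain Python. The closed-form cosine annealing expression below is the standard one; the early-stopping helper assumes that the monitored quantity is a per-epoch loss and that "did not decrease" means no strict improvement over the best value so far. The class name `EarlyStopping`, the simulated loss curve, and the base rate used in the demonstration are hypothetical.

```python
import math

def cosine_annealing_lr(epoch, base_lr, eta_min, t_max):
    """Closed-form cosine annealing: learning rate at a given epoch."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + math.cos(math.pi * epoch / t_max))

class EarlyStopping:
    """Signal a stop once the loss has not improved for `patience` consecutive epochs."""

    def __init__(self, patience=20):
        self.patience = patience
        self.best = math.inf
        self.bad_epochs = 0

    def step(self, loss):
        if loss < self.best:
            self.best = loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Simulated loss curve: improves for 30 epochs, then plateaus (hypothetical data).
losses = [max(100 - 2 * e, 40) / 100 for e in range(500)]
stopper = EarlyStopping(patience=20)
stopped_epoch = None
for epoch, loss in enumerate(losses):
    # A larger base rate than the reported one is used here only to make the decay visible.
    lr = cosine_annealing_lr(epoch, base_lr=1e-4, eta_min=1e-6, t_max=100)
    if stopper.step(loss):
        stopped_epoch = epoch
        break
```

In this simulation the last improvement occurs at epoch 30, so the rule triggers 20 epochs later.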
We evaluated our approach using the same metric at two levels for comparison with other baselines: the area under the curve (AUC) of the receiver operating characteristic (ROC) was used to evaluate the performance of both image-level anomaly detection and pixel-level anomaly localization.
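For reference, AUROC can be computed directly from its probabilistic interpretation: it is the fraction of (anomalous, normal) pairs in which the anomalous sample receives the higher score, with ties counted as one half. The pure-Python sketch below uses this pairwise definition; it is quadratic in the number of samples and meant only as an illustration. For the image-level metric each image contributes one score, whereas for the pixel-level metric the anomaly maps and ground-truth masks would be flattened first.

```python
def auroc(labels, scores):
    """AUROC via pairwise comparison (Mann-Whitney U statistic); label 1 = anomalous."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
result = auroc(labels, scores)   # 3 of the 4 (anomalous, normal) pairs are ranked correctly
```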

Comparative Experiments
MPDD. We compared our proposed method with several SOTA methods on the MPDD dataset, including reconstruction-based methods [20,1] and feature-based methods [3,6,17]. The image-level detection results are listed in Table 1, and the anomaly segmentation results are presented in Table 2. Experiments demonstrated that our proposed method outperformed previous SOTA methods on the MPDD dataset. Partial visualization results of the proposed method on the MPDD [7] dataset are shown in Fig. 4.
Specifically, as shown in Table 1, our proposed method achieved an overall improvement of 8.82% compared with the previous best-performing method, CFLOW [6]. The most significant improvement was observed in the tubes category, which contains multiple instances with a random distribution of positions. These results highlight the advantages of the proposed method. As shown in Table 2, our method also achieved the best average performance. However, the proposed method has some limitations. We were unable to achieve satisfactory performance in the brown bracket category. Most defects in the brown bracket category are deformation defects, and our method cannot accurately restore deformations, which hinders the accurate identification of such defects.
VisA. Further, to validate the generalizability and versatility of our method, we compared it with other SOTA methods [24,4,3,6,22,17] on the VisA [30] dataset. The anomaly detection results for the VisA dataset are listed in Table 3.
Experiments demonstrated that our proposed method performed competitively on the VisA dataset.

Ablation Studies
Effect of λ. In this study, we employed a noise-to-norm reconstruction paradigm. To validate the effectiveness of adding noise and the effect of the noise coefficient (λ) on the detection results, we conducted comparative experiments. The results, as shown in Table 4, demonstrate that the overall detection performance was the best when λ = 0.3. Compared with the case without added noise (λ = 0), the detection accuracy increased by 22.28%, and the segmentation accuracy increased by 9.55%. Therefore, we finally set λ = 0.3. These experimental results confirm the significant improvement in anomaly detection achieved using the noise-to-norm reconstruction approach.

Importance of Residual Attention Module
To demonstrate the effectiveness of the proposed residual attention module, we conducted an ablation experiment. In the control group, we replaced the residual attention module with a 1 × 1 convolutional layer, which was used to change the number of feature

Conclusion
In this study, an industrial image anomaly detection method based on noise-to-norm reconstruction is proposed. We enhanced the M-net by incorporating a residual attention module and feature fusion, obtaining a reconstruction network.
Experimental results demonstrate that our method achieves SOTA performance in anomaly detection and localization on the MPDD dataset, and it also exhibits competitive performance on the VisA dataset. Our proposed method has significant advantages for handling data with multiple instances and varying object positions. However, the proposed method has limitations in detecting object absences or displacement anomalies. In future work, we will explore methods that combine the feature distribution of positive samples with a reconstruction approach to improve the anomaly detection performance of the model.

Fig. 2. The overall architecture of the proposed reconstruction network.

Fig. 4. Visualization of examples on MPDD. (a) Original image (b) Ground truth (c) Reconstructed image (d) Anomaly map (e) Prediction of anomalous regions (f) Prediction of anomalous regions on the original image

Table 1. Comparison of image-level detection results (AUROC%) on the MPDD dataset. Best results are highlighted in bold.

Table 2. Comparison of pixel-level detection results (AUROC%) on the MPDD dataset. Best results are highlighted in bold.

Table 3. Comparison of image-level detection results (AUROC%) on the VisA dataset. Best results are highlighted in bold.

Table 4. Effect of noise coefficient (λ) on image/pixel-level detection results (AUROC%). Best results are highlighted in bold.

Table 5. Effect of residual attention module on image/pixel-level detection results (AUROC%). Best results are highlighted in bold.