1. Introduction
Defects on the surface of industrial products are incomplete, irregular, or non-compliant areas or traces that arise during manufacturing, processing, or usage. They can be caused by physical, chemical, mechanical, or other factors, and they can degrade the appearance, quality, and performance of the products. Defective products have a significant impact on both businesses and users. In mature industrial production processes, defective products exhibit three main characteristics. First, the number of defective products is extremely low compared to normal products. Second, the defects take various forms and diverse types. Third, the defect areas are relatively small, and the distribution of defect images is similar to that of normal images. Identifying the differences between normal and defective samples is therefore a highly challenging task.
Traditional detection methods rely primarily on additional human resources: product quality inspectors visually judge the quality of products. This approach is inefficient and incurs high costs. Machine vision-based defect detection methods have also been widely explored, including edge detection, threshold segmentation, and texture analysis. However, these techniques have significant limitations in practice. For example, noise and variations in illumination can directly cause inaccurate edge detection, unstable threshold segmentation, and interference with texture analysis results. Moreover, these methods typically rely on hand-designed feature extraction and lack adaptability to different defect types or image scenes, requiring problem-specific adjustments and optimizations, which in turn raises the challenge of parameter selection. In recent years, deep learning methods that emulate human capabilities have progressed rapidly, with the objective of substituting for humans in complex and high-risk tasks. With the swift advancement of computer technology and growing computational power, the performance of deep learning-based anomaly detection has been continuously improving. These techniques have found extensive applications in domains including agricultural production [1,2], industrial manufacturing [3,4], aerospace [5,6], and computer network security [7,8].
Supervised anomaly detection based on image data is one of the most commonly employed methods in deep learning. By learning the distinctive features of positive and negative samples, it typically achieves the desired task objectives. However, the stable performance of supervised methods relies on a massive dataset with a balanced distribution of positive and negative samples. The major challenge in surface defect detection is the extremely limited quantity of defect samples, which can cause overfitting during fully supervised learning and degrade detection accuracy. In comparison, reconstruction-based semi-supervised anomaly detection methods, which do not require labeled defect samples, have gained popularity as an alternative. The two most classical categories are based on Generative Adversarial Networks (GANs) and Autoencoders (AEs), two fundamental techniques in semi-supervised image reconstruction. These methods train extensively on a large number of normal samples, learning the close relationship between the high-dimensional and low-dimensional distributions of images so that the network can reconstruct output images that closely resemble the input images. During testing, defect images fed into the pre-trained model differ significantly from their reconstructions and are thereby identified and filtered out. Reconstruction-based anomaly detection has thus become an effective means of performing surface defect detection on industrial products. However, when the network is trained to be too robust, it tends to reconstruct defect images perfectly as well, allowing them to evade detection.
However, this type of image reconstruction network is trained only on normal samples; real defect images are never involved in the process, which biases the inference of the entire network. In practice, the scarcity of real defect images prevents their inclusion in training, and artificially synthesized defects generally differ significantly from real ones. As a result, the trained network generalizes poorly and fails to detect real defective products. Additionally, the authenticity of the reconstructed images serves as a criterion for assessing the performance of the reconstruction network. While autoencoders focus primarily on the reconstruction of high-dimensional images without considering low-dimensional features, Ganomaly [9] takes into account the reconstruction consistency of low-dimensional latent vectors. However, training Ganomaly is often challenging, and it struggles to converge to the global optimum.
In response to the aforementioned issues, this study was inspired by the DRAEM [10] concept to create more realistic and plausible synthetic anomaly images, addressing the problem of defect images not being involved in training. An image reconstruction network with deep feature consistency was designed, and the network's ability to separate defects was enhanced by the larger effective receptive field provided by oversized convolutional kernels, yielding defect region prediction maps. By computing the loss between the predicted maps and the real defect regions, the possibility of the network directly reconstructing defect images was eliminated, achieving more accurate surface defect detection in industrial products. The main contributions of this study are as follows:
A methodology for creating more realistic synthetic defect images is designed.
An image reconstruction network with depth feature consistency is constructed.
A defect prediction network with a large effective receptive field is constructed.
3. Method
The defect detection algorithm model proposed in this study, which is based on the prediction of defect maps through the learning of an abnormal distance function, is composed of an image reconstruction network and an anomaly separation network (as shown in Figure 2).
The image reconstruction network is trained to ensure that the reconstructed image and the original normal image have highly similar high-level semantic information and low-level semantic information, resulting in high visual similarity between the two. The anomaly separation network takes the reconstructed image and the synthesized abnormal image as inputs and aims to learn the distance function between the abnormal image and the real image, thereby generating accurate abnormal segmentation images and completing the defect detection task. The mechanism for synthesizing anomalies adopts a simple cut-and-patch method to mimic real anomalies and add a large number of realistic defect samples, thus compensating for the sample imbalance problem caused by the lack of defect images in the training data of the image reconstruction method.
3.1. Abnormal Synthesis Process
Defects can be commonly understood as a situation where the contextual information of a certain region on the foreground target differs significantly from that of the surrounding areas and is unrelated to the target background. Unlike DRAEM, we emphasize the authenticity of synthesized anomalies. Based on this principle, the process of generating synthetic abnormal images can be divided into three stages (as shown in Figure 3).
In the first stage, an input image I is selected, and a sample A is randomly extracted from the normal images in the same dataset to serve as the anomaly source. The foreground objects of the two images are obtained by edge detection with dilated padding or by directly setting a grayscale threshold, yielding the corresponding mask images M_I and M_A. A Perlin noise generator is used to produce a random noise texture image P, which is then compared with a preset threshold to obtain a binary mask image M_P.
In the second stage, since P is randomly generated, the unobstructed areas of M_P (the white area of M_P in Figure 3) may appear anywhere within the image, but we want the synthetic anomaly to appear on the foreground object. Therefore, the anomaly source mask image M_A is first multiplied pixel-wise with the Perlin noise mask image M_P to obtain an intermediate mask, constraining the defect region within the valid range. Then, the input image mask M_I is multiplied pixel-wise with this intermediate mask to obtain the final mask image M (the same as M in Figure 2). Therefore, the final mask image M is defined as:

M = M_I ⊙ M_A ⊙ M_P,  (1)
In the third stage, M is used to extract a portion of the region from sample A and, similarly, the corresponding region from input image I; the two are blended using random interpolation to obtain the final defect region. This region is then combined with the remaining regions (1 − M) of the input image I to obtain the final synthesized anomaly image. Therefore, the anomaly image I_a is defined as:

I_a = (1 − M) ⊙ I + (1 − β)(M ⊙ I) + β(M ⊙ A),  (2)

where ⊙ is pixel-wise multiplication and β is a random interpolation coefficient with β ∈ (0, 1). The defect region created by the random interpolation blending method contains both partial information from the original image I and information from the anomaly source image A, which makes the synthesized anomalies diverse and realistic.
Figure 4 presents a set of examples of synthesized anomaly images.
Therefore, our synthetic anomaly method ensures that the anomaly cases appear only on the foreground object, independent of the background, and the anomalies produced are more realistic.
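As a concrete illustration, the three stages above can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions: the function name and arguments are hypothetical, and a pre-generated random texture stands in for the Perlin noise image P.

```python
import numpy as np

def synthesize_anomaly(img, src, m_i, m_a, noise, thresh=0.5, rng=None):
    """Three-stage anomaly synthesis sketch.

    img, src : float arrays in [0, 1], shape (H, W) - input image I and anomaly source A
    m_i, m_a : binary foreground masks of I and A
    noise    : random texture in [0, 1] standing in for the Perlin noise image P
    """
    rng = rng or np.random.default_rng()
    # Stage 1: threshold the noise texture into a binary mask M_P.
    m_p = (noise > thresh).astype(np.float64)
    # Stage 2: constrain the defect region to the foreground of both images.
    m = m_i * m_a * m_p                      # final mask M = M_I * M_A * M_P
    # Stage 3: blend the masked regions of I and A with a random coefficient beta.
    beta = rng.uniform(0.1, 1.0)
    i_a = (1 - m) * img + (1 - beta) * (m * img) + beta * (m * src)
    return i_a, m
```

Outside the mask the image is untouched, so normal regions of the synthesized sample remain identical to the input image.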
3.2. Image Reconstruction Network
The reconstruction module consists of an autoencoder and a deep feature vector extractor, which aim to extract key information from synthesized defective images and reconstruct the original image with the reconstruction network (as shown on the left in Figure 2). The network structure of the deep feature vector extractor is identical to the encoder part of the autoencoder, but it does not participate in network parameter updates. Instead, before each training iteration, all parameters of the encoder are copied to the corresponding locations of the feature extractor. The intuition behind this design is that the entire reconstruction network, constrained by both the reconstruction loss function and the deep feature loss function, learns through continuous training to reconstruct normal images, or synthesized anomaly images, into normal images. In other words, the encoder of the autoencoder learns to extract the key information needed for faithful reconstruction from different input images, and its feature extraction ability keeps improving. It is therefore reasonable to use a feature extractor with identical parameters to extract deep features from the reconstructed image.
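The parameter-copying scheme can be sketched as follows. This is a hypothetical minimal version using plain dictionaries of NumPy arrays in place of real network modules; in a framework such as PyTorch the same effect would be achieved by copying the encoder's state into a frozen extractor module.

```python
import numpy as np

def copy_encoder_params(encoder_params, extractor_params):
    """Overwrite the (frozen) feature extractor's weights with the encoder's
    current weights. Called before each training iteration; the extractor
    never receives gradient updates of its own."""
    for name, w in encoder_params.items():
        extractor_params[name] = w.copy()   # a value copy, not a shared reference

# Minimal training-loop skeleton (hypothetical parameter dictionaries):
encoder = {"conv1": np.random.randn(3, 3), "conv2": np.random.randn(3, 3)}
extractor = {k: np.zeros_like(v) for k, v in encoder.items()}

for step in range(3):
    copy_encoder_params(encoder, extractor)  # sync before the step
    encoder["conv1"] += 0.01                 # stand-in for a gradient update
```

Because the copy happens before each iteration, the extractor always mirrors the encoder's state at the start of the step and contributes no gradients of its own.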
The L2 loss function is commonly employed to compute the sum of squared pixel differences between the generated and real images. However, it is heavily influenced by noise and outliers and recovers edge details poorly. The L2 loss is defined as follows:

L2(I, Î) = Σ_{h=1}^{H} Σ_{w=1}^{W} (I_{h,w} − Î_{h,w})²,  (3)

The SSIM [27] loss function measures the structural similarity between the generated image and the original image and can compensate for the shortcomings of the L2 loss function. The SSIM loss is defined as follows:

L_SSIM(I, Î) = (1 / (H·W)) Σ_{h=1}^{H} Σ_{w=1}^{W} (1 − SSIM(I, Î)_{h,w}),  (4)

The variables H and W in Equations (3) and (4) represent the height and width of the input image I, Î denotes the reconstructed image generated by the network, and SSIM(·, ·) is the similarity function used to measure the similarity between I and Î.

The two loss functions are combined proportionally to form the visual image reconstruction loss function L_vis, which measures the loss of image reconstruction in terms of visual perception:

L_vis = L_SSIM(I, Î) + λ L2(I, Î),  (5)

where λ is a hyperparameter used to balance the two loss functions.

In addition, a loss is computed between the deep feature vector z of the input image and that of the reconstructed image, ẑ, to ensure that the generated image is close to the original in terms of high-level semantic information. This part of the loss is denoted L_feat.

Therefore, the loss function of the image reconstruction network is formulated as follows:

L_re = λ1 L_vis + λ2 L_feat,  (6)

where λ1 and λ2 are hyperparameters used to balance the visual loss and the deep feature loss, respectively, in the loss function of the image reconstruction network.
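The combined reconstruction objective might be sketched as below. Note the hedges: ssim_global is a simplified single-window SSIM rather than the windowed, per-pixel SSIM map averaged in Equation (4), and the deep-feature term is assumed here to be a squared L2 distance between latent vectors, which the text does not spell out.

```python
import numpy as np

def l2_loss(i, i_hat):
    # Sum of squared pixel differences (the Equation-(3)-style term).
    return np.sum((i - i_hat) ** 2)

def ssim_global(i, i_hat, c1=0.01**2, c2=0.03**2):
    # Simplified single-window SSIM over the whole image; the paper's L_SSIM
    # instead averages a per-pixel windowed SSIM map.
    mu_x, mu_y = i.mean(), i_hat.mean()
    var_x, var_y = i.var(), i_hat.var()
    cov = ((i - mu_x) * (i_hat - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / \
           ((mu_x**2 + mu_y**2 + c1) * (var_x + var_y + c2))

def reconstruction_loss(i, i_hat, z, z_hat, lam=1.0, lam1=1.0, lam2=1.0):
    """L_re = lam1 * L_vis + lam2 * L_feat, with L_vis = L_SSIM + lam * L2."""
    l_vis = (1.0 - ssim_global(i, i_hat)) + lam * l2_loss(i, i_hat)
    l_feat = np.sum((z - z_hat) ** 2)   # assumed deep-feature consistency term
    return lam1 * l_vis + lam2 * l_feat
```

A perfectly reconstructed image with matching deep features yields a loss of zero, and any pixel or latent deviation increases it.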
3.3. The Large Convolutional Kernel Defect Prediction Network
The RepLKNet network proposed by Xiaohan Ding et al. [28] uses large 31 × 31 convolutional kernels, which provide a larger effective receptive field than stacking multiple small kernels into an equivalent large one, and demonstrates good performance on ImageNet [29] classification, COCO [30] detection, and ADE20K [31] segmentation tasks. The defect prediction network adopts an autoencoder architecture with U-Net [32] skip connections (as shown on the right in Figure 2). The reconstructed image Î and the synthesized abnormal image I_a are concatenated along the channel dimension and fed into the network, which learns an appropriate distance metric between the two, predicting the probability of a defect at each pixel. Large convolutional kernels are used in the encoder part of the network: the concatenated input has a size of 256 × 256 with six channels, and after four stem layers it becomes a feature map with 128 channels and a size of 64 × 64. The feature map then enters the stage block, whose four stages use large convolutional kernels of sizes [31, 29, 27, 13] to extract information. To address the resulting optimization problems, small kernel reparameterization is introduced. Because the synthesized defect regions are generated from random noise, the abnormal areas are randomly distributed, resulting in an imbalance between defect and normal areas. Focal Loss [33] performs well on sample imbalance and hard classification problems, and it is therefore selected as the loss function L_seg for the defect prediction network:

FL(p_t) = −α_t (1 − p_t)^γ log(p_t),  (7)

where p_t is defined as:

p_t = p if y = 1, and p_t = 1 − p otherwise,  (8)
In our model, p represents the probability that each pixel position in the predicted abnormal image outputted by the defect prediction network is an abnormal area.
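A per-pixel focal loss consistent with this formulation can be sketched as follows. This is a minimal NumPy version; the α and γ defaults are the common choices from the Focal Loss paper, not values confirmed by this study.

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-7):
    """Per-pixel focal loss: FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t).

    p : predicted probability that each pixel is anomalous
    y : ground-truth mask (1 = anomalous, 0 = normal)
    """
    p = np.clip(p, eps, 1.0 - eps)              # numerical safety for log
    p_t = np.where(y == 1, p, 1.0 - p)          # Equation-(8)-style p_t
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)
    return np.mean(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t))
```

The (1 − p_t)^γ factor down-weights easy, well-classified pixels, so the abundant normal pixels contribute little and hard defect pixels dominate the gradient.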
Taking into account the two parts mentioned above, the overall loss function L_total of the network is formulated as follows:

L_total = L_re + L_seg(M, M̂),  (9)

where M is the final mask image, representing the ground truth, and M̂ is the defect prediction image.
3.4. Abnormality Score
The defect prediction image M̂ can serve as a criterion for judging whether an anomaly is present. After the map is smoothed via mean filtering to aggregate local abnormal information, the final image-level abnormality score is obtained via maximum pooling:

η = max(M̂ ∗ f_{s×s}),  (10)

where ∗ represents the convolution operator, f_{s×s} is a mean filter with a size of s × s, max(·) is the maximum pooling operation, and the abnormality score η corresponds to the maximum value in the feature map after maximum pooling.
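The scoring step can be sketched directly: smooth the prediction map with an s × s mean filter, then take the global maximum. The function name and the default filter size are hypothetical.

```python
import numpy as np

def abnormality_score(pred_map, s=3):
    """Smooth the predicted anomaly map with an s x s mean filter, then take
    the global maximum (max pooling over the whole map) as the image-level score."""
    h, w = pred_map.shape
    pad = s // 2
    padded = np.pad(pred_map, pad, mode="edge")   # replicate borders
    smoothed = np.empty_like(pred_map)
    for i in range(h):
        for j in range(w):
            smoothed[i, j] = padded[i:i + s, j:j + s].mean()
    return smoothed.max()
```

Smoothing first means an isolated one-pixel spike scores much lower than a coherent blob of the same peak value, which suppresses spurious single-pixel responses.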
4. Experiments
The performance of this method was evaluated and compared with other advanced methods in the field of defect detection. Furthermore, the effectiveness of each component module of the proposed method was validated via ablation experiments.
4.1. Experimental Setup
We evaluated our method on the MVTec anomaly detection (MVTec AD) dataset, a challenging benchmark used to evaluate and compare defect detection algorithms. MVTec AD contains approximately 5000 real industrial images from 15 categories in 13 industrial sectors, including approximately 2500 defective images, together with pixel-level mask annotations indicating the location and shape of the defects. We used image-level AUROC, the standard metric for evaluating an algorithm's ability to detect anomalies, as the detection metric, and average precision (AP) as the benchmark for evaluating the model's ability to localize defects.
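Image-level AUROC can be computed without library support via the rank-sum formulation; a sketch (hypothetical helper, O(n²) pairwise comparison) is:

```python
import numpy as np

def image_auroc(scores, labels):
    """Image-level AUROC via the rank-sum (Mann-Whitney U) formulation: the
    probability that a randomly chosen anomalous image scores higher than a
    randomly chosen normal one, with ties counted as half."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos, neg = scores[labels == 1], scores[labels == 0]
    # Explicit pairwise comparison; fine for the few thousand images in MVTec AD.
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))
```

A perfect ranking yields 1.0 and a random one about 0.5, which is why AUROC is threshold-free: it depends only on the ordering of the abnormality scores.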
In the experiment, we trained the network on the MVTec AD dataset for 700 epochs. For fine-tuning toward the global optimum, the learning rate was multiplied by 0.1 at epochs 400 and 600, and the best-performing model was saved throughout training. The hyperparameters λ, λ1, and λ2 in the loss function were fixed throughout training.
During training, we also used data augmentation via image rotation to compensate for the limited number of training samples. We still used MVTec AD as a source of anomaly images for defect manufacturing to create more realistic defect images and improve the model’s robustness. The experiment was conducted on a computer equipped with an NVIDIA RTX 3090 GPU.
4.2. Anomaly Detection
Samet Akcay et al. [34] proposed the anomalib library, based on the PyTorch Lightning architecture, which includes several state-of-the-art anomaly detection algorithms. We reproduced these algorithms on a computer equipped with an NVIDIA RTX 3090 GPU, keeping the parameter settings of all methods consistent with the original papers, and conducted a quantitative comparison against our proposed algorithm (as shown in Table 1 and Table 2). Our method achieved the highest AUROC in 14 of the 15 categories in the dataset, with an average value of 99.70% (rounded to two decimal places). This is 1.1 percentage points higher than the previous best-performing method, and it outperforms the baseline method DRAEM in all respects. Furthermore, the optimal threshold for distinguishing defective from non-defective items was determined from the ROC curve; the accuracy of defect detection reached 98.41%, with an average inference time of 0.041 s per sample during testing. Moreover, the results demonstrate the exceptional stability of our method on texture-based datasets, where nearly all AUROC values approach 100%, as well as on several datasets of regular-shaped objects. Test results for some categories are shown in Figure 5, where the distribution of predicted defect locations almost coincides with the actual situation. Taking the cable dataset as an example, we show its ROC curve in Figure 6; the area under the curve is close to 1. Figure 7 shows box-plot visualizations of Table 1 and Table 2, which intuitively demonstrate the distributions of results for the various testing methods. Our method has the most concentrated distribution among all methods.
Figure 8 displays comparisons between our method and three other methods, PaDim, DRAEM, and STFPM, in terms of predicted and ground truth images for some samples. It can be observed that our method is closer to the ground truth images. The model performs poorly on several types of data, which can be explained by the fact that our defect synthesis method creates abnormal images that are relatively realistic, posing a greater challenge to anomaly detection.
4.3. Defect Localization
We compared our method with several recent pixel-level anomaly detection methods in terms of the AP metric (as shown in Table 3). Our method outperformed the baseline method DRAEM in AP in all 15 categories, with an improvement of 31.47%, and also surpassed the other detection methods (data sourced from DRAEM). Again taking the cable dataset as an example, the AP curve is shown in Figure 9. Precision remains relatively high even at high recall rates, indicating that after training our model can accurately predict the true anomaly distribution.
4.4. Ablation Experiments
In order to demonstrate the effectiveness of the network structure, we designed several sets of control experiments, mainly evaluating from three aspects: model design, abnormal image source selection, and network training.
4.4.1. Model Structure
We incorporated a deep feature extractor into the reconstruction network autoencoder and evaluated its impact on anomaly detection. Comparative experiments (items 1 and 2 in Table 4) showed that the reconstruction network with the added deep feature extractor improved detection performance over DRAEM. This can be explained by the fact that the deep feature loss brings the reconstructed image closer to the original input both visually and in deep feature space, making the information contained in the reconstructed image richer and more specific.
Next, we fixed the existing autoencoder reconstruction network and conducted comparative experiments on the encoding part of the defect prediction network using the RepLKNet structure, which showed significant improvement in performance compared to the baseline model. This is because the actual receptive field of the larger convolution kernel is larger than the effective receptive field of the stacked small convolution kernels, as proven in the RepLKNet paper. A larger receptive field allows the network to better understand the global structure and contextual information in the image, avoiding overfitting during network training and thus learning more general features in the image.
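The equivalence that makes small-kernel reparameterization possible, namely that a parallel small-kernel branch can be folded into the large kernel after training because convolution is linear in its kernel, can be verified with a toy NumPy example (naive convolution, square odd kernels, batch-norm folding omitted):

```python
import numpy as np

def conv2d_same(x, k):
    """Naive stride-1 cross-correlation with zero 'same' padding (odd square kernels)."""
    pad, kh = k.shape[0] // 2, k.shape[0]
    xp = np.pad(x, pad)
    out = np.empty_like(x)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kh] * k)
    return out

def merge_kernels(k_large, k_small):
    """Zero-pad the small kernel to the large size (center-aligned) and add it.
    Convolution is linear in the kernel, so the merged kernel reproduces the sum
    of the two branches; RepLKNet additionally folds in batch-norm parameters,
    which this toy version omits."""
    pad = (k_large.shape[0] - k_small.shape[0]) // 2
    return k_large + np.pad(k_small, pad)
```

At inference time this removes the extra branch entirely: one large-kernel convolution with the merged weights gives the same output as the trained two-branch block.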
4.4.2. Abnormal Appearance
We evaluated the proposed anomaly synthesis method by changing the anomaly source from the DTD dataset used by DRAEM to the MVTec anomaly detection dataset. The data (items 1 and 4 in Table 4) show that this slightly improved detection performance. This may be because the random linear interpolation used during anomaly synthesis allows the synthesized defective images to retain some of the original image information, from which the reconstruction network can more accurately recover the original image. Furthermore, for some of the object datasets, the defect positions we created appeared accurately on the foreground objects, which matches real-world defects and steers the network toward discriminating real defects. With the MVTec anomaly detection dataset as the anomaly source, further experiments added the deep feature extractor and the large kernel convolution encoder (items 4, 5, 6, and 7 in Table 4); the network including all parts (item 7 in Table 4) performed best, confirming the effectiveness and indispensability of the design of the reconstruction network and the defect prediction network.
Figure 10 presents examples of performance in each ablation experiment, and it can be observed that our final model displays the results that are closest to the ground truth images.
4.4.3. Training Method
The structure of the deep feature extractor is exactly the same as the encoder part of the reconstruction network, but its training strategy differs from both direct training and parameter sharing: it uses direct parameter copying. The experimental results show that direct training without parameter copying performs only comparably to the network that changed nothing but the anomaly synthesis method (items 4 and 8 in Table 4). This suggests that directly training a feature extractor of this form can in turn impair the model's anomaly detection capability, whereas our parameter copying training method achieved the best results (items 7 and 8 in Table 4). The reason is that the autoencoder is constrained by the loss between the input image and the reconstructed image: after many rounds of training, its encoder learns to extract key feature information from normal or synthesized abnormal inputs, and the decoder reconstructs these compact deep features into the original normal image. The feature extraction ability of this encoder is therefore well-founded and reliable, so copying all of its parameters to the deep feature extractor allows the extractor to capture the key features of the reconstructed image, ensuring deep feature consistency between the original and reconstructed images. If the deep feature extractor instead participated directly in parameter updates, the validity of the features it extracts would be questionable, since it lacks a direct constraint like the autoencoder's reconstruction loss. Although the deep feature loss pushes the extracted features toward the autoencoder's intermediate-layer features, the cost is that it severely misleads the training direction in the early stages, preventing the network from converging to the optimum. This is also one of the reasons why the anomaly detection performance of methods such as GANomaly, which train their feature extractors directly, is not good enough.
5. Conclusions
A semi-supervised defect detection algorithm based on defect map prediction with realistic synthetic anomalies is proposed in this paper. Our method demonstrates excellent performance in industrial product defect detection tasks. After conducting experiments on the MVTec dataset, which consists of 15 different categories, our method outperformed other recent detection methods by 1.1 percentage points on the AUROC evaluation metric, showcasing its strong generalization capability. Furthermore, our method surpassed the best-performing DRAEM by 31.5% on the defect localization evaluation metric AP, indicating a significant improvement in localization accuracy. This is because we only learn the distance function between normal and abnormal samples, rather than directly learning the features of anomalies. By employing various data preprocessing techniques such as affine transformations and image enhancement, combined with the utilization of synthetically generated realistic abnormal images as input samples for training, the network has acquired enhanced resistance to interference and robustness. We discussed the design of the two sub-modules, analyzed the benefits of parameter copying in the feature extractor, and demonstrated the effectiveness of large kernel convolution in expanding the receptive field in practical applications via experiments.