Anomaly Detection Neural Network with Dual Auto-Encoders GAN and Its Industrial Inspection Applications

Recently, researchers have been studying methods to introduce deep learning into automated optical inspection (AOI) systems to reduce labor costs. However, the integration of deep learning in the industry may encounter major challenges such as sample imbalance (defective products that only account for a small proportion). Therefore, in this study, an anomaly detection neural network, dual auto-encoder generative adversarial network (DAGAN), was developed to solve the problem of sample imbalance. With skip-connection and dual auto-encoder architecture, the proposed method exhibited excellent image reconstruction ability and training stability. Three datasets, namely public industrial detection training set, MVTec AD, with mobile phone screen glass and wood defect detection datasets, were used to verify the inspection ability of DAGAN. In addition, training with a limited amount of data was proposed to verify its detection ability. The results demonstrated that the areas under the curve (AUCs) of DAGAN were better than previous generative adversarial network-based anomaly detection models in 13 out of 17 categories in these datasets, especially in categories with high variability or noise. The maximum AUC improvement was 0.250 (toothbrush). Moreover, the proposed method exhibited better detection ability than the U-Net auto-encoder, which indicates the function of discriminator in this application. Furthermore, the proposed method had a high level of AUCs when using only a small amount of training data. DAGAN can significantly reduce the time and cost of collecting and labeling data when it is applied to industrial detection.


Introduction
To solve the problem of manual inspection, automated optical inspection (AOI) that uses image processing algorithms for industrial inspection has been developed [1][2][3]. Furthermore, the automatic detection system has been applied in computer diagnosis tasks, such as monitoring respiration symptoms in body area networks [4]. However, AOI is limited as it can only perform inspection tasks with a simple background and single defect type. Recently, researchers have started to apply convolutional neural networks (CNN) to image recognition, and successively proposed classic CNN architectures such as VGG [5], Inception [6][7][8], ResNet [9], and DenseNet [10]. CNNs have a greater classification ability compared with traditional image processing algorithms. A growing In recent years, researchers have improved the image reconstruction ability of GAN [24][25][26] using CNN and batch normalization [24], Wasserstein loss [25], and dual auto-encoder architecture [26]. This study proposed a GAN-based anomaly detection neural network with dual auto-encoders (DAGAN) to enhance GAN-based anomaly detection in the industry. Furthermore, a series of studies on DAGAN's industrial detection capabilities were conducted: 1. Training and verification of DAGAN using the public industrial inspection dataset, MVTec AD, and comparing it with previous GAN-based anomaly detection networks. 2. Verification of DAGAN's detection ability in an actual production line with two datasets (surface glass of mobile phone and wood defect detection datasets). 3. Verification of DAGAN's inspection capability with less training data.

Generative Adversarial Network (GAN)
GAN [27] is an unsupervised learning neural network that learns to generate images with a probability distribution similar to that of the training data. The network uses the game theory to design the loss function of the neural network, where the generator and discriminator compete, for training.

Boundary Equilibrium Generative Adversarial Network (BEGAN)
BEGAN [26] is a GAN model released by Google. By designing both the generator and discriminator as an auto-encoder, BEGAN ensures that the training is more stable and easier to converge to the expected balance point. Its image reconstruction ability is better than that of GAN, and it does not have to consider model collapse and training imbalance.

AnoGAN
AnoGAN [18] is the first attempt to use GAN for anomaly detection. Its main objective is to use normal samples to train GAN, which will generate a fake image with a probability distribution similar to that of the normal sample. By defining the threshold of residual score between the image to be tested and fake image, the network can recognize anomaly samples. However, it requires a significant amount of computing resources.

GANomaly
Samet Akcay et al. [19] developed GANomaly. Unlike AnoGAN, GANomaly does not need to minimize the residual score between the detection image and generated fake image through iteration, but directly creates the fake image after the image is imported by the encode-decode generator, which greatly reduces the computing resources and improves the anomaly detection ability of GANomaly. However, the image reconstruction ability of GANomaly is still not stable in all tasks.

Skip-GANomaly
Samet Akcay et al. [20] proposed an improved model of GANomaly, Skip-GANomaly. Inspired by U-Net [28], the architecture of skip-connection was added to Skip-GANomaly, which exhibits an outstanding ability to reconstruct images. The performance of Skip-GANomaly is more stable than that of AnoGAN and GANomaly. However, Skip-GANomaly does not perform well in all dataset categories, which might be caused by model collapse during the training process.
The advantages and limitations of AnoGAN, GANomaly, and Skip-GANomaly are presented in Table 1. Inspired by Skip-GANomaly, the proposed method, DAGAN, has been designed with a highly stable and excellent network architecture of GAN-based anomaly detection to overcome the limitation of the previous works. Table 1. Advantages and limitations of AnoGAN, GANomaly, and Skip-GANomaly.

Advantages
Training without anomaly data.
Significant improvement in detection time.
Better ability of image reconstruction.

Limitations
Excessive time to detection. Cannot reconstruct complex images.
Model collapse during training.

Pipeline
The pipeline of DAGAN, as shown in Figure 2, comprises a generator and discriminator. Inspired by Skip-GANomaly [20] and U-Net [28], generator G (.) is designed as an auto-encoder with skip-connection architecture. It can generate a fake image, x', with almost the same probability distribution as that of the input image, x. The skip-connection architecture provides DAGAN with an excellent reconstruction ability. Conversely, DAGAN's discriminator D (.) is inspired by BEGAN. This discriminator is used to receive the fake image, x'. D (.) can identify the difference between the image, x, and fake image, x'. The dual auto-encoder architecture ensures that DAGAN training is more stable and easier to converge to the best balance point. In the training process, only the normal samples are input, which provides the generator with better reconstruction ability for a normal sample than for an anomaly sample. Hence, one can identify normal and anomaly samples using a proper residual score, which is defined to represent the residual between image x to be tested and fake image x'. The proposed method maintains the advantages of Skip-GANomaly and BEGAN in architecture design and has strong image reconstruction ability and training stability.

Training Objective
To achieve the goal of anomaly detection, this study has referenced and improved the loss functions of Skip-GANormaly and BEGAN. The loss functions are presented as follows: 1. Adversarial loss: To provide the generator with the best image reconstruction ability, the adversarial loss function is referred. This loss function, as shown in Equation (1), will reduce the difference between input image x and generated fake image G(x) as much as possible when training generator G(.), whereas discriminator D(.) will distinguish the original input image, x, and fake image, x, generated by generator G(.) as much as possible. The goal is to minimize the adverse loss of generator G(.) and maximize the adverse loss of discriminator D(.). The adversarial loss can be expressed as: 2. Contextual loss of generator: To provide generator G(.) with better image reconstruction ability, the proposed method uses a contextualized loss function to represent the difference between x and G(x) pixels. It is defined as the L2 distance between the input graph, x, and generated fake image, G(x). This ensures that the fake image is consistent with the input image as much as possible. The equation of contextual loss of generator is defined as: 3. Contextual loss of discriminator: To converge to the best balance point shortly during training, a contextual loss of discriminator is set. This loss is used to represent the L2 distance between image x and image D(x) formed by the discriminator. This ensures that the original image and image generated by the discriminator is consistent with the input image as much as possible.
A contextual loss of discriminator is defined as: In the training process, DAGAN can be trained by the weighted summation of the above three loss functions. The definition of the weighted summation loss function is as follows: where λ adv " λ Gcon and λ Dcon are the weights of three loss functions.

Detection Process
To perform the task of anomaly detection, it is necessary to design the detection process, as shown in Figure 3. First, image x, which is to be tested, is input into generator G(.). After the generator generates the fake image, G(x), the residual score, R(x, G(x)), between x and G(x) is calculated through the residual score calculator, which is defined as The residual score, R(x, G(x)), of the normal sample will be lower because only normal samples are trained. Through the calculation of the residual score, the residual score, R(x, G(x)), of the entire dataset is linearized to the range of 0 ∼ 1 to facilitate the subsequent setting of thresholds and analysis. By adjusting different thresholds θ, when the residual score of the test image is greater than or equal to the test threshold, i.e., R(x, G(x)) ≥ θ, then the product to be tested is an anomaly product.

Datasets
To ensure that the proposed method has good detection ability in industrial inspection, three datasets were used to train and verify DAGAN. The datasets are described below:

MVTec AD
The MVTec AD dataset [29] was collected by the MVTec software GmbH team. It contains 15 common industrial inspection categories: five of them are texture categories and ten are object categories. The data are shown in Figure 4. This dataset is commonly used for validation of industrial detection deep learning models [30][31][32]. In this dataset, there are 3629 training images and 1725 verification images, and the resolution of the image is between 700 × 700 and 1024 × 1024.

Production Line Mobile Phone Screen Glass Dataset
In this study, a line scan camera was used to take the images of mobile phone screen glass pieces, and the images were divided into normal and anomaly samples. The normal and anomaly images in the dataset are shown in Figure 5. Notably, in the industrial inspection, there was dust adsorption on the mobile phone screen glass. However, the dust can be removed only by wiping, and thus, it poses a challenge in the mobile phone screen glass detection. In this dataset, 200 pieces of mobile phone cover glass were scanned using the camera. There were 329 training and 54 validation images. The image resolution was 128 × 128.

Production Line Wood Surface Dataset
The wood surface dataset contained images of normal and anomaly wood products that were captured by a line scan camera. This dataset comprised six labels, such as normal products, chalk, holes, black, and knots. The sample images of each category are shown in Figure 6. This dataset contained 3075 training data of normal samples and 740 validation data of normal samples and anomaly images. The resolution of the image was 256 × 256.

Training Detail
To ensure that the training is fast and effective, Adam was used as an optimizer, and the learning rate was set to 0.001. The loss function has been defined in Equation (4). The weights of the loss function were set to λ adv = 1, λ Gcon = 40, λ Dcon = 1, and the number of training steps was set to 20,000. Furthermore, the detection ability of a U-Net auto-encoder was applied to test the necessity of the discriminator. In this study, the model with the best detection ability in the training process was used to verify the results. The experimental hardware used in this work was an Intel i7-9700k 3.6 GHz CPU (INTEL MICROELECTRONICS ASIA LTD., TAIWAN, Taipei, Taiwan) and anlNvidia RTX 2080ti 11 Gb GPU (GIGABYTE Technology, New Taipei, Taiwan), and keras was used as the deep learning framework for training and verification.

Evaluation
The area under the curve (AUC) of the receiver operating characteristics (ROC) was used to evaluate the performance of detection in this study. The AUC is an effective method to evaluate the detection ability of a binary detection model, which is also widely used as a model evaluation method of deep learning.

MVTec AD Dataset
As summarized in Table 2 and Figure 7, the proposed method had the best performance in 9 of the 15 categories of MVTec AD. In addition, in the other 6 categories, while its AUC was not the highest, it was almost the same as the highest value obtained. Notably, the detection ability of the proposed method in the four categories of carpet, hazelnut, tile, and toothbrush was significantly higher than that of AnoGAN, GANomaly, and Skip-GANomaly. These four detection tasks were similar in that they have a relatively complex variation information. The backgrounds of carpet and tile had an irregular texture, hazelnut was the only category with inconsistent sample orientation in MVTec AD, and there were multiple colors of bristles in the toothbrush. When training these more complex tasks, AnoGAN and GANomaly had difficulty in reconstructing the images. Although Skip-GANomaly had the architecture of skip connection, the complexity of the image might have increased the possibility of mode collapse during its training. However, because the proposed method had a strong image reconstruction ability and was easy to converge to the best balance point in the training process, its advantages were remarkable in the categories with high complexity. Additionally, the AUCs of DAGAN were significantly higher than those of the U-Net were in the four categories, i.e., carpet, metal nut, toothbrush, and transistor. This is because without the discriminator, the only goal of the U-Net auto-encoder is to reconstruct the image perfectly. Consequently, it reconstructed the defect in the image, which decreased its detection ability in these categories. Therefore, the discriminator is necessary in this application. Moreover, Figure 8 shows the heat maps generated by the proposed method after detecting the MVTec AD dataset. Notably, the proposed method can classify the defects, and obtain the location, area, and contour of some defects from the generated heat maps, which is crucial to industrial inspection.   Table 3 presents the AUCs of the glass and wood surface defects detected by the proposed method, U-Net auto-encoder, and the other three GAN-based anomaly detection models. As mentioned in Section 5.1, the proposed method had a better detection ability than the other three when the detection image had a more complex variation. Owing to the noise of dust in the good products of the mobile phone screen glass and the variety of wood background textures in the actual production line, both of these datasets require a more solid reconstruction ability to avoid model collapse in the training process. Thus, the AUC was significantly higher when using the proposed method for training and verification. Further, the AUCs of DAGAN were better than those of the U-Net auto-encoder. This indicates that the removal of the discriminator will cause a decline in detection ability, as previously mentioned.  Figure 9 shows the heat maps generated by the proposed method after detecting the mobile phone screen glass and wood surface datasets on the production line. In the same way, the proposed method can also show the residual value of each pixel through the heat maps of these two datasets.

Training with Few Data
In this study, four categories, i.e., bottle, tile, actual production line wood surface, and actual production line glass, which represented the detection of object samples, detection of texture samples, and complex detection items on the production line, respectively, were selected as the test categories for training with few data. A total of 2 n (0 ≥ n ≥ 7) images were used for training and to examine the influence of reducing the number of training samples in the proposed method. The AUCs under different n are presented in Table 4 and Figure 10 In the four learning categories, a reduction in the number of training samples has little effect on AUCs, which indicates that the proposed method still has a high reducibility to unfamiliar normal product data. It can significantly reduce the time and cost of collecting and labeling data when it is applied to industrial detection.  10. AUCs of training the proposed method (DAGAN) with few data (2 n , 0 ≥ n ≥ 7).

Conclusions
In this study, a GAN-based anomaly detection model, DAGAN, was proposed and discussed. By combining the advantages of Skip-GANomaly and BEGAN, the model showed a great reconstruction ability and stability in the training process. Three datasets, MVTec AD, wood surface defects, and glass surface defects of mobile phones, were employed to train and verify the proposed method and for comparison with the previous GAN-based anomaly detection models. The AUCs of the proposed method were significantly higher than those of the other three GAN-based anomaly detection models were in the categories with high variability or noise. Furthermore, the proposed method exhibited better detection ability than the U-Net auto-encoder. Additionally, this study examined the influence of detection capability with different quantities of training data. In this study, four categories were used, and 2 n (0 ≥ n ≥ 7) images were utilized during the training process. The result demonstrated that the proposed method could maintain a high level of AUC even when a small quantity of training data was entered, which indicated that the proposed method has a good ability to reconstruct unfamiliar normal product data.