1. Introduction
Methods based on machine learning and deep learning have remarkably improved industrial defect detection performance [1,2,3]. However, practical industrial scenarios pose challenges to the current detection methods, such as data problems. Acquiring large datasets for manufacturing applications remains a challenging proposition, due to the time and costs involved [4]. The small number of defect samples and data imbalances can lead to overfitting during the training of supervised deep-learning methods and poor performance in testing [5].
Data augmentation is a very powerful method for building useful deep-learning models and for reducing validation errors [6]. Image data augmentation mainly includes traditional and learning-based methods. Traditional methods can increase the number of samples, but cannot create new defect samples. In contrast, learning-based methods such as GAN (Generative Adversarial Nets) [7], AAE (Adversarial AutoEncoders) [8], and VAE (Variational Auto-Encoder) [9] can model the distribution of a real dataset and synthesize new samples that are different from the original dataset, which increases both the number and the diversity of the dataset. Based on cutting-edge work in image synthesis [10,11,12,13], industrial image generation can be carried out, to achieve the augmentation of few-sample datasets. Currently, diverse learning-based augmentation methods are emerging, to alleviate data problems in industrial defect detection [14,15,16,17,18,19,20].
However, there are still some challenges that need to be addressed in the current learning-based defect image augmentation methods.
Insufficient retention of realistic background textures. Textures provide important and unique information for intelligent visual detection and identification systems [5]. In industrial defect detection, even a slight change to real textures can disturb the detection results. Researchers usually perform defect image generation based on non-defective samples, due to their easy availability in industrial manufacturing. This requires generation methods that preserve the real normal backgrounds to the maximum extent possible. Many works [15,16,17,19] have used CycleGAN [21] to generate a defective image from an input normal image, but since they do not constrain the treatment of the normal background, the normal backgrounds may be excessively falsified.
Independent control of the normal backgrounds, defect shapes, and defect textures is rarely considered. If independent control of the three were achieved, they could be arbitrarily combined to obtain an unlimited number of defective images from a single normal image. However, current effective methods such as SDGAN [15], Defect-GAN [16], and SIGAN [17] control the three as a whole and can only obtain one defect image from a normal image with a well-trained model, so their randomness and diversity are insufficient. Moreover, pixel-level annotations of the generated defect images could be acquired if the normal backgrounds and defect regions were processed separately. Defect-GAN generates a spatial distribution map to indicate what is modified in the source image compared with the generated image, but it does not decouple the backgrounds and defect regions, and accurate binary annotations cannot be obtained from the spatial distribution map.
Lack of exploration of the generation of non-uniform complex structure defects with binary annotations. A non-uniform structure image contains multiple semantic regions, whose texture features differ and whose corresponding defects are distinct. As shown in Figure 1, there are two types of texture in a zipper image: fabric and zipper teeth, whose defect contents are significantly disparate. For such non-uniform structure images, networks must generate conforming defects at the specified locations to obtain realistic synthetic results. In addition, obtaining binary annotations for the generated complex texture defect images is also challenging. Niu et al. [18] used random seeds to construct input masks, but this is limited to simple stripe defects in some uniform textures. Tsai et al. [19] adopted two CycleGANs to preserve normal backgrounds and threshold segmentation to obtain binary annotations. However, this work also only synthesized defects with uniform textures, and the overall networks contained four generators and four discriminators, which is overly complicated.
To tackle these challenges, we propose MDGAN (mask-guided defect generation adversarial network), based on CGAN [22]. MDGAN can generate realistic defects in regions specified by an input binary mask. First, we introduce a BRM (background replacement module) to extract normal backgrounds and, guided by a binary mask, replace the contents at the corresponding positions in the feature maps. The BRM preserves the normal backgrounds and facilitates the separate control of the normal backgrounds and the shapes of defects. In addition, the generated defect textures can be controlled by training MDGAN separately for different types of defects. Second, we propose a DDM (double discrimination module) to extract the defect features from the whole feature map under the guidance of binary masks and to measure the authenticity of both the whole image and the local defect with a single discriminator. In addition, we construct a pseudo-normal background for each defect image, to provide paired training inputs. This preprocessing ensures that MDGAN generates defects according to the normal features in the same regions, thus enabling the generation of defects in non-uniform structures. Finally, the outputs of MDGAN and the input binary masks are combined to construct our pixel-level annotated synthetic datasets.
In summary, the main contributions of this work are as follows.
- (1) We construct corresponding pseudo-normal backgrounds for defective images, which solves the problem of the lack of paired training inputs in industrial defect generation and avoids the dependence on CycleGAN.
- (2) We propose MDGAN to achieve independent control of the normal backgrounds, defect shapes, and defect textures of images. The addition of BRM preserves normal backgrounds and enables the acquisition of binary annotations. Our DDM focuses on the defect region and the whole image simultaneously, ensuring the quality of the generated results.
- (3) Since BRM achieves total preservation of the normal background in the generated defect images, our MDGAN can also achieve defect transfer between datasets with similar defect contents.
The subsequent sections of this article are organized as follows: Section 2 proposes MDGAN; Section 3 introduces the related datasets used in this work; Section 4 details the generation, ablation, comparison, and segmentation experiments; finally, we summarize our work in Section 5.
4. Experiments
MDGAN was used to synthesize defect images based on the above datasets. Except for the metal nut and capsule, which involved three-channel color images, all images were single-channel grayscale images. Section 4.2 presents the generation quality, diversity, and annotation accuracy of MDGAN. The effectiveness of BRM and DDM is demonstrated in Section 4.3. We also compare MDGAN with the widely used CycleGAN in Section 4.4, to certify the effectiveness of our method. Then, in Section 4.5, we assess the advantages of our synthetic samples over traditional augmented results. Finally, we explore the possibility of defect transfer with MDGAN in Section 4.6.
4.1. Implementations
To construct the pseudo-normal backgrounds, we first obtain, by thresholding, the main areas where defects may occur in defect image D and in the selected normal image N, and draw the minimum external rectangular boxes of these areas. Then we conduct affine transformations of D and N, so that the lengths and widths of the two boxes are parallel and the center points overlap. Finally, the defect contents in D are filled with N, and the filled image is reversely transformed to the attitude of the original D. These two affine transformations can be simplified as Equations (1)–(3). We performed the above processes using the OpenCV library, with functions such as cv2.getRotationMatrix2D() and cv2.warpAffine().
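The following is a minimal sketch of this construction, under stated assumptions: the threshold value, the source of the defect mask, and the collapsing of the two affine transformations into a single warp of N into the frame of D are ours, and Equations (1)–(3) are not reproduced exactly.

```python
import cv2
import numpy as np

def main_area_rect(img, thresh=30):
    """Threshold a grayscale image and return the minimum external
    rectangle (center, size, angle) of its largest connected area."""
    _, binary = cv2.threshold(img, thresh, 255, cv2.THRESH_BINARY)
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    return cv2.minAreaRect(max(contours, key=cv2.contourArea))

def pseudo_normal(defect_d, normal_n, defect_mask):
    """Fill the defect regions of D with the aligned contents of N."""
    (cx_d, cy_d), _, ang_d = main_area_rect(defect_d)
    (cx_n, cy_n), _, ang_n = main_area_rect(normal_n)

    # Rotate N so its box is parallel to D's box, then shift its center
    # onto D's center (the two transformations of Equations (1)-(3),
    # collapsed here into one warp of N into the frame of D).
    m = cv2.getRotationMatrix2D((cx_n, cy_n), ang_n - ang_d, 1.0)
    m[0, 2] += cx_d - cx_n
    m[1, 2] += cy_d - cy_n
    h, w = defect_d.shape[:2]
    n_aligned = cv2.warpAffine(normal_n, m, (w, h))

    # Replace the defect contents of D with the aligned normal contents.
    out = defect_d.copy()
    out[defect_mask > 0] = n_aligned[defect_mask > 0]
    return out
```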
MDGAN was trained for each type of defect under each item. We augmented the training image pairs by rotation, flipping, and random cropping. The binary mask was normalized to [−1, 1] when input to MDGAN and to [0, 1] when calculating losses. The numbers of output channels of the 12 convolution layers of the generator, from input to output, were 64, 128, 256, 256, 512, 512, 512, 256, 256, 128, and 64; those of the discriminator were 32, 128, 256, 256, 512, and 1. The dimension of the latent space was 8, and the hyperparameters of Equation (15) were set as , , , and . The Adam optimizer was used, with β1 = 0.5 and β2 = 0.999, a batch size of 20, and a learning rate of 0.0004. We trained MDGAN for 500 iterations on one NVIDIA GeForce RTX 3090 GPU of a server with an Intel(R) Xeon(R) Gold 6226R CPU @ 2.90 GHz. During training, the backgrounds of some datasets could be retained without the last BRM, which simplified the architecture and gave a better transition between normal backgrounds and defect regions. Therefore, we eliminated the last BRM for the defects of the grid.
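As a small illustration of these settings, the sketch below shows the optimizer configuration and the mask normalization in PyTorch; the framework and the stand-in modules are assumptions, since only the hyperparameters are taken from the text.

```python
import torch
import torch.nn as nn

# Stand-in modules; the real MDGAN generator/discriminator are those of
# Section 2 and are not reproduced here.
generator = nn.Conv2d(2, 1, 3, padding=1)
discriminator = nn.Conv2d(2, 1, 3, padding=1)

# Adam with beta1 = 0.5, beta2 = 0.999 and learning rate 0.0004.
opt_g = torch.optim.Adam(generator.parameters(), lr=4e-4, betas=(0.5, 0.999))
opt_d = torch.optim.Adam(discriminator.parameters(), lr=4e-4, betas=(0.5, 0.999))

def mask_for_input(mask01: torch.Tensor) -> torch.Tensor:
    """A binary mask in {0, 1} is scaled to [-1, 1] before entering MDGAN;
    it is kept in [0, 1] when used inside the losses."""
    return mask01 * 2.0 - 1.0
```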
4.2. Synthetic Results
Some synthetic defect samples are shown in Figure 5 and Figure 6. As shown in Figure 5, MDGAN achieved defect image generation for all categories. Good synthesis results were obtained for complex and weak defects, such as the fine-grained texture defects (zipper-fb, zipper-fi, grid-thread, etc.), color defects (metal nut-color), metallic defects (grid-mc), and weak defects (phone band). As seen from the Defect contents in Figure 5, the generated defect contents were realistic and accurately distributed in the locations marked by the mask, achieving an accurate correspondence between the synthetic defect images and the input binary masks. In addition, MDGAN preserved the background structures in the Grid reasonably well, despite canceling the last BRM. In summary, with the help of BRM and DDM, MDGAN was able to generate realistic defect images while preserving the original real backgrounds outside the annotations, and achieved a natural transition between the generated defects and the real background, so that the generated defect images were accurately labeled by the input binary masks.
Controllability of the backgrounds, defect shapes, and defect textures. We separately processed these three aspects, to verify that MDGAN could generate various defect images for a normal background and obtain accurate annotations.
Figure 6 shows the generated results when the normal backgrounds, defect shapes, and defect textures were changed, respectively. First, the defect images in Figure 6a shared the same binary mask. This shows that BRM added different backgrounds to the generator, to obtain multiple defect samples whose annotations were the same, with the defects varying with the background. Second, Figure 6b shows the generated defect images of the zipper, with the same background NB and five different masks; defect images in the same row shared the mask on their left side. It can be seen that BRM could replace different background regions according to the different masks, which assisted MDGAN in generating multiple defects with different shapes and contents on the same normal image. Third, Figure 6c shows multiple categories of generated defects in the grid, where the defect images shared the same normal background (NB) and annotations (Mask), but had different defective textures. As seen from Figure 6c and each row in Figure 6b, MDGAN could produce multiple types of defect in the same specified region of the same normal image when the training sets were different.
In summary, based on BRM and DDM, MDGAN achieved independent control of the background, defect shape, and defect texture, and was able to generate a huge number of diverse and high-quality defect samples for a normal image. In reality, various defects may appear in a normal image, and our experimental results fit actual situations. Moreover, since MDGAN accurately controls defect shapes and preserves backgrounds using the given binary mask, the output defect regions are precisely labeled by the mask. Thus, our synthetic samples can be used to train segmentation models, which is beneficial for detecting defects.
4.3. Ablation Experiments
To verify the effect of BRM on background retention and of DDM on generation quality, we separately removed the two modules and performed ablation experiments on the above datasets, with the remaining settings exactly the same as in the formal experiments.
Without BRM. Figure 7 shows comparison results with/without (w/o) BRM for the same pairs of test images. As can be seen, the results generated without BRM have drastically and unreasonably modified backgrounds, while MDGAN retained the details of the backgrounds well. In particular, substantial modifications to the original backgrounds led to the loss of real structures in the phone band. Moreover, without BRM, the generated defect contents did not appear accurately at the annotated positions, resulting in inaccurate binary annotations.
In order to quantify the background retention ability of BRM, we employed the structural similarity index measure (SSIM) to evaluate background similarity. The SSIM value lies in [0, 1], and the larger the value, the more similar the two images. Using the same test set, 1000 defective samples were synthesized with each of the with/without-BRM models, and the SSIM between the normal background areas of the outputs and the inputs was calculated. The mean SSIM (mSSIM) is shown in Table 2. It can be seen that the backgrounds generated by MDGAN were highly similar to the real backgrounds. Both the qualitative and quantitative results show that BRM can modulate the real backgrounds into the feature maps of the generator, which restricts the location of the generated defects, avoids the loss of backgrounds during training, and finally facilitates MDGAN in generating defect samples with realistic backgrounds and accurate binary annotations.
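As an illustration, the following is a minimal sketch of this mSSIM evaluation, assuming grayscale uint8 images and the scikit-image SSIM implementation; the paper does not state which implementation was used, and zeroing the defect region before comparison is our simplification.

```python
import numpy as np
from skimage.metrics import structural_similarity

def background_mssim(pairs):
    """Mean SSIM over the normal background areas of generated/input pairs.

    pairs: iterable of (generated, real_input, mask) grayscale uint8 arrays,
    where mask > 0 marks the annotated defect region to be excluded."""
    scores = []
    for gen, real, mask in pairs:
        bg = mask == 0
        # Zero out the defect region so only the backgrounds are compared.
        g = np.where(bg, gen, 0).astype(np.uint8)
        r = np.where(bg, real, 0).astype(np.uint8)
        scores.append(structural_similarity(g, r, data_range=255))
    return float(np.mean(scores))
```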
Without DDM. All DDMs were removed from the discriminator; the mask and image were concatenated channel-wise and input into the discriminator. To match the formal experiments, the numbers of output channels of the first three convolutional layers were doubled. As shown in Figure 8, the discriminator's constraint on defect quality decreased after removing the DDM. As shown in Figure 8a, the defects generated without DDM contained unrealistic stripes, and the zipper teeth were not smooth enough, totally different from the real images. In addition, as shown in Figure 8b, there were unreasonable black contents in the capsule-crack generated without DDM, while MDGAN smoothly stripped away the black blocks and generated clear red defects on them. On capsule-fm, the model without DDM covered the annotated region with only fuzzy uniform white blocks, while MDGAN generated realistic and detailed scratches. That is, without DDM the networks could not inherit and generate the real defects. Overall, the addition of DDM assisted the discriminator in focusing on both the whole image and the local defects, improving its judgment and enabling MDGAN to accurately capture the stripes and grayscale distribution of defects and generate more realistic, higher-quality images.
4.4. Comparison with CycleGAN
CycleGAN-based methods are leading the way in defect synthesis. To verify the advantages of MDGAN over such methods, CycleGANs were trained on the above datasets. To help CycleGAN retain the normal background, we added an L1 loss between the input and output to the original losses [17]. Some results generated by CycleGAN and MDGAN from the same backgrounds are shown in Figure 9. Despite the addition of the L1 loss, CycleGAN still modified the normal backgrounds, and it failed to convert the input normal images into defect images for zipper-fi, zipper-sqt, zipper-spt, and grid-broken. Moreover, there were stretching and artifacts in the generated results of the capsule and phone band. This indicates that, on few-sample datasets, CycleGAN could only convert the source image into the most similar target image seen during training, resulting in either a failed generation or a loss of the structures in the source image. In contrast, MDGAN generated realistic defect images for each category of input normal backgrounds and retained the original normal textures.
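For illustration, a minimal sketch of the background-retention term added to CycleGAN's original losses, assuming PyTorch; the weight lambda_l1 is a hypothetical value, not taken from [17].

```python
import torch
import torch.nn.functional as F

def background_l1(normal_in: torch.Tensor,
                  fake_defect: torch.Tensor,
                  lambda_l1: float = 10.0) -> torch.Tensor:
    """L1 penalty between the input normal image and the translated output,
    added to the original CycleGAN losses to discourage background changes."""
    return lambda_l1 * F.l1_loss(fake_defect, normal_in)
```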
To quantify the generation quality, 1000 samples were generated on the same testing set with CycleGAN and with MDGAN, and the FID [30] between the generated and the real defect datasets was calculated; the lower the FID, the closer the generated features are to the real ones. As shown in Table 3, MDGAN obtained a smaller FID than CycleGAN for most defect types. Nevertheless, CycleGAN had a lower FID for grid-glue, grid-bent, and zipper-fb. For grid-glue, as shown in Figure 9, CycleGAN simply memorized the training set and translated all test images into the most similar training images, resulting in a low FID. For the other two items, MDGAN generated defects in the given annotated regions, and when the mask shapes used in testing differed significantly from the real ones, the generated features deviated from the real features, resulting in a higher FID.
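The following is a minimal sketch of such an FID evaluation, assuming the torchmetrics implementation and uint8 image tensors; the paper does not state which FID implementation was used, and the random tensors below are placeholders for the real and generated sets.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

# Placeholder batches of shape (N, 3, H, W); grayscale images would need
# to be repeated to three channels first. The paper uses 1000 samples.
real_defects = torch.randint(0, 256, (100, 3, 299, 299), dtype=torch.uint8)
generated_defects = torch.randint(0, 256, (100, 3, 299, 299), dtype=torch.uint8)

fid = FrechetInceptionDistance(feature=2048)  # Inception pool3 features
fid.update(real_defects, real=True)
fid.update(generated_defects, real=False)
print(f"FID: {fid.compute().item():.2f}")  # lower = closer to the real set
```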
Overall, MDGAN constructs pseudo-normal images to efficiently acquire paired inputs without relying on CycleGAN, and obtained a higher quality than CycleGAN for most items. Moreover, unlike CycleGAN, MDGAN imports random noise to construct latent representations of the real defects, which improves both the randomness of the generation and the diversity of the results, making it more suitable for few-sample synthesis. Furthermore, due to the background preservation of BRM and the quality constraints of DDM, MDGAN successfully generated defect images with pixel-level annotations while preserving the real normal backgrounds, whereas CycleGAN generated samples with only image-level annotations, which cannot be used for defect segmentation experiments.
4.5. Detection Performance
To verify the advantages of MDGAN for detection over traditional augmentation, synthetic samples from MDGAN were added to the real segmentation training set RAW to construct EL, while traditional brightness adjustment, rotation, and noise injection were adopted to obtain the training set AUG. The dataset sizes are shown in Table 1. Two types of segmentation network, UNet and sResNet, were trained on the above three training sets. sResNet is a UNet-like segmentation model in which the skip-connections are removed and the convolution layers are replaced by the Res-blocks of ResNet [31]. UNet consists of six downsampling and six upsampling layers, whose numbers of output channels are 32, 64, 128, 256, 512, 256, 512, 256, 128, 64, 32, and 1. sResNet consists of four downsampling layers, four Res-blocks, and four upsampling layers, whose numbers of output channels are 32, 64, 128, 256, 256, 256, 256, 256, 128, 64, 32, and 1. The output of both networks is a single-channel map of the same size as the input. We used a cross-entropy loss to train the two models. The Adam optimizer was used, with β1 = 0.5, β2 = 0.999, a batch size of 50, and a learning rate of 0.0005, on an NVIDIA GeForce RTX 3090 GPU. The mIoU (mean Intersection over Union) and F1 score were calculated on the same test set at the 500th iteration.
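As an illustration of how these metrics are obtained from the single-channel outputs, the following sketch computes IoU and F1 for binary segmentation maps; the 0.5 threshold on the output is our assumption.

```python
import numpy as np

def iou_f1(pred_map: np.ndarray, gt_mask: np.ndarray, thresh: float = 0.5):
    """IoU and F1 for one single-channel prediction against a binary
    ground-truth mask; averaging over the test set gives mIoU and mean F1."""
    pred = pred_map > thresh
    gt = gt_mask > 0
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    iou = tp / (tp + fp + fn + 1e-8)
    f1 = 2 * tp / (2 * tp + fp + fn + 1e-8)
    return iou, f1
```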
The mean testing results for each item are shown in Table 4, where higher values indicate better detection performance. It can be seen that EL greatly improved the detection performance compared to AUG and RAW, and the mean results were substantially improved with EL. Among the mean results of the various items, EL outperformed RAW and AUG, with an mIoU improvement of up to 7.3% (UNet-capsule) and an F1 improvement of up to 6.2% (sResNet-capsule). On the contrary, AUG reduced the overall detection results on the zipper. Figure 10 shows a qualitative comparison of the segmentation results: the results of EL had less over-kill and fewer escapes than RAW and AUG, and were closer to the ground truths.
Overall, the inclusion of synthetic samples alleviates the data problems of class imbalance, lack of diversity, and few samples, and improves the detection performance. Compared with the traditionally augmented samples, our synthesized datasets contain varied backgrounds, defect contents, and shapes of binary annotations, which differ greatly from the original training sets and help the networks see and remember richer defect information during training. At test time, the EL-trained models had learned more knowledge and were therefore better at detecting unseen samples. This demonstrates that the synthetic samples from MDGAN can be used to train supervised segmentation networks and that our work has practical application value.
4.6. Defect Transfer
As seen above, MDGAN achieves defect image generation with complete retention of the backgrounds. Therefore, we can employ an MDGAN trained on a source dataset to perform defect synthesis on a target dataset, and use only the synthetic target defect images to train models to detect real defects in the target dataset, thus achieving defect transfer and zero-shot detection on the target dataset.
Figure 11 shows the related images and the procedure of defect transfer, where the first and second rows of the Source are from phone bands on the reverse side (phone band1) and curved glass, respectively, and those of the Target are from phone bands on the front side (phone band2) and phone cover glass, respectively. It can be seen that the Source and Target have different structures, but their defect contents are similar. Consequently, we trained MDGAN on phone band1 and curved glass and tested it on phone band2 and phone cover glass. The generated results are shown in the middle parts of Figure 11. It can be seen that, despite the differences in the backgrounds on the two sides of the defect transfer, MDGAN still generated defects in the annotated areas (Defect contents) and obtained defect images (GD) similar to the real ones. As shown in Res, MDGAN only modified the annotated regions and fully retained the target normal backgrounds after transfer.
In order to verify the effect of the transferred defect samples, we used only the transferred defect samples to train the segmentation networks and adopted the real images as the test set. The dataset sizes and the test results are shown in Table 5. It can be seen that good results were obtained on the real test set with segmentation models trained using only transferred defect samples: the AUC (area under curve) reached 0.971 (UNet) for phone band2 detection and 0.991 (sResNet) for phone cover glass detection.
Based on the generated samples from an MDGAN trained on another dataset, we achieved defect detection on two real industrial surface defect datasets, showing that MDGAN can achieve defect transfer between datasets with similar defects but different backgrounds. Hence, when the type of product changes, we can train MDGAN on the existing source defect datasets and then synthesize a large number of defect samples on the new target normal backgrounds. As long as the defect textures are similar, the resource consumption of recollecting and labeling datasets can be greatly reduced by MDGAN, which is highly valuable for intelligent manufacturing.
5. Conclusions
This paper proposed MDGAN, to tackle the problems of falsified backgrounds, the lack of pixel-level annotations, and the limited attention given to non-uniform complex structures in current defect synthesis methods. Guided by binary masks, MDGAN modulates the real background into the generator using BRM and employs DDM to discriminate both local and global information. Due to the background-preserving effect of BRM and the quality constraint of DDM, MDGAN solves the problem of falsified backgrounds and enriches the diversity of datasets. Defect samples with accurate pixel-level annotations were synthesized on multiple datasets with complex textures using MDGAN. In addition, the qualitative and quantitative results showed that MDGAN obtained better quality than the commonly used CycleGAN. The segmentation results demonstrated that the synthetic samples from MDGAN greatly improved the detection performance, with improvements of up to 7.3% in mIoU and 6.2% in F1. Furthermore, exploiting the excellent background retention capability of MDGAN, we successfully synthesized target defect images using an MDGAN trained on a source dataset and achieved defect detection on real target samples based only on the synthesized samples.
Some aspects of our work still need improvement. Since the binary masks are taken directly from the test sets, the variety and diversity of the generated defect shapes can be limited. In follow-up work, we will explore synthesis methods that produce both defect images and annotations using networks, as a way to enrich the defect shapes.