Mask-Guided Generation Method for Industrial Defect Images with Non-uniform Structures

Abstract: Defect generation is a crucial method for solving data problems in industrial defect detection. However, current defect generation methods suffer from background information loss, insufficient consideration of complex defects, and a lack of accurate annotations, which limits their application in defect segmentation tasks. To tackle these problems, we propose a mask-guided background-preserving defect generation method, MDGAN (mask-guided defect generation adversarial networks). First, to preserve the normal background and provide accurate annotations for the generated defect samples, we propose a background replacement module (BRM), which adds real background information to the generator and guides the generator to focus only on generating defect content in specified regions. Second, to guarantee the quality of the generated complex texture defects, we propose a double discrimination module (DDM), which assists the discriminator in measuring the realism of the input image and in distinguishing whether the defects are distributed at the specified locations. The experimental results on metal, fabric, and plastic products showed that MDGAN could generate diversified and high-quality defect samples, which improved detection compared with traditionally augmented samples. In addition, MDGAN can transfer defects between datasets with similar defect contents, thus achieving zero-shot defect detection.


Introduction
Methods based on machine learning and deep learning have remarkably improved industrial defect detection performance [1][2][3]. However, practical industrial scenarios pose challenges to the current detection methods, such as data problems. Acquiring large datasets for manufacturing applications remains a challenging proposition, due to the time and costs involved [4]. The small number of defect samples and data imbalances can lead to overfitting during the training of supervised deep-learning methods and poor performance in testing [5].
Data augmentation is a very powerful method for building useful deep-learning models and for reducing validation errors [6]. Image data augmentation mainly includes traditional and learning-based methods. Traditional methods can increase the number of samples, but cannot create new defect samples. In contrast, learning-based methods such as GAN (Generative Adversarial Nets) [7], AAE (Adversarial AutoEncoders) [8], and VAE (Variational Auto-Encoder) [9] can model the distribution of a real dataset and synthesize new samples that are different from the original dataset, which increases both the size and the diversity of the dataset. Based on cutting-edge work in image synthesis [10][11][12][13], industrial image generation can be carried out, to achieve the augmentation of few-sample datasets. Currently, diverse learning-based augmentation methods are emerging, to alleviate data problems in industrial defect detection [14][15][16][17][18][19][20].
However, there are still some challenges that need to be addressed in the current learning-based defect image augmentation methods.
Insufficient retention of realistic background textures. Textures provide important and unique information for intelligent visual detection and identification systems [5]. In industrial defect detection, any slight change to real textures can disturb the detection results. Researchers usually perform defect image generation based on non-defective samples, due to their easy availability in industrial manufacturing. This requires generation methods that preserve the real normal backgrounds to the maximum extent possible. Many works [15][16][17][19] have used CycleGAN [21] to generate a defective image for an input normal image, in which the normal backgrounds may be excessively altered, since these methods do not constrain how the normal background is treated.
Independent control of the normal backgrounds, defect shapes, and defect textures is rarely considered. If independent control and operation of the three are achieved, then they can be arbitrarily combined to obtain a virtually unlimited number of defective images from a normal image. However, current effective methods such as SDGAN [15], Defect-GAN [16], and SIGAN [17] control the three as a whole and, given a well-trained model, can only obtain one defect image from a normal image, so their randomness and diversity are insufficient. Moreover, the pixel-level annotations of the generated defect images can be acquired if we separately process the normal backgrounds and defect regions. Defect-GAN generates a spatial distribution map, to indicate what is modified in the source image compared with the generated image, but it does not decouple the backgrounds and defect regions and cannot obtain accurate binary annotations from the spatial distribution map.
Lack of exploration of the generation of non-uniform complex structure defects with binary annotations. There are multiple semantic regions in a non-uniform structure image, where texture features are different and the corresponding defects are distinct. As shown in Figure 1, there are two types of textures in a zipper image: fabric and zipper teeth, whose defect contents are significantly disparate. For such non-uniform structure images, networks must generate conforming defects at the specified locations, to obtain realistic synthetic results. In addition, obtaining binary annotations for the generated complex texture defect images is also a challenging task. Shuanlong N. et al. [18] used random seeds to construct input masks, but this is limited to simple stripe defects in some uniform textures. Du-Ming T. et al. [19] adopted two CycleGANs to preserve normal backgrounds and threshold segmentation to obtain binary annotations. However, this work also only synthesized defects with uniform textures, and the overall network contained four generators and four discriminators, which is overly complicated.
To tackle these challenges, we propose MDGAN (mask-guided defect generation adversarial network), based on CGAN [22]. MDGAN can generate realistic defects in regions specified by the input binary mask. First, we introduce a BRM (background replacement module), which uses a binary mask to extract the normal background and replace the contents at the corresponding positions in the feature maps. The BRM preserves the normal backgrounds and facilitates the separate control of the normal backgrounds and the shape of defects. In addition, the generated defect textures can be controlled by training MDGAN separately for different types of defects. Second, we propose a DDM (double discrimination module), which extracts the defect features from the whole feature map under the guidance of binary masks and measures the authenticity of both the whole image and the local defect region using a single discriminator. In addition, we construct a pseudo-normal background for each defect image, to provide paired training inputs. This preprocessing ensures that MDGAN generates defects according to the normal features in the same regions, thus enabling the generation of defects with non-uniform structures. Finally, the outputs of MDGAN and the input binary masks are combined to construct our pixel-level annotated synthetic datasets.
In summary, the main contributions of this work are as follows.
(1) We constructed corresponding pseudo-normal backgrounds for defective images, which solves the problem of the lack of paired training inputs in industrial defect generation and avoids the dependence on CycleGAN.
(2) We proposed MDGAN, to achieve independent control of the normal backgrounds, defect shapes, and defect textures of images. The addition of BRM achieves the preservation of normal backgrounds and enables the acquisition of binary annotations. Our DDM focuses on the defect region and the whole image simultaneously, ensuring the quality of the generated results.
(3) Since BRM can achieve total preservation of the normal background in the generated defect images, our MDGAN can also achieve defect transfer between datasets with similar defect contents.
The subsequent sections of this article are organized as follows: Section 2 presents MDGAN. Section 3 introduces the related datasets used in this work. Section 4 details the generation, ablation, comparison, and segmentation experiments. Finally, we summarize our work in Section 5.

Methods
This section introduces the construction of the paired training inputs, the architectures, and the training process of MDGAN.

Pseudo-Normal Background
Image-to-image (I2I) translation [23] is an effective approach for converting images from a source domain into a target domain, and it requires paired source-target images for training. Defect synthesis based on normal images is an I2I task, where the source domain consists of normal industrial images and the defect images constitute the target domain. However, it is almost impossible to obtain exactly corresponding normal-defective pairs in industrial manufacturing. Many works relied on CycleGAN to avoid this problem. Unfortunately, CycleGAN lacks randomness and cannot retain the background, as discussed in Section 1. Therefore, we construct pseudo-normal backgrounds for the defect images. First, we select a similar normal image $N$ for the defect image $D$, whose binary mask is $M$. $N$ contains the areas in which the defect contents of $D$ appear. Then, $N$ is calibrated to obtain $N^{c}$ by an affine transformation matrix $T$,
$$N^{c}_{i,j} = T\,N_{i,j},$$
where $N^{c}_{i,j}$ and $N_{i,j}$ are points in homogeneous form in $N^{c}$ and $N$, respectively, and $T$ is parameterized by the rotation angle $\theta$ in the anticlockwise direction and the translation parameters $T_x$ and $T_y$. The affine transformation ensures that the area used for filling in the normal image is aligned with the defect area. Finally, $N^{c}$ is used to replace the defect regions in $D$,
$$B = D \odot \bar{M} \oplus N^{c} \odot M,$$
where $\bar{M}$ means $(1 - M)$, and $\odot$ and $\oplus$ mean element-level multiplication and addition, respectively. Then $B$ is used as the pseudo-normal background in the source domain to train the MDGAN.
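The construction above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `affine_matrix` and `pseudo_normal_background` are hypothetical, and the matrix layout (rotation by θ in a y-up coordinate system followed by a translation) is an assumed standard form, since the exact matrix is not reproduced in the text. The compositing step follows the description of replacing the masked defect region of D with the aligned normal image.

```python
import numpy as np


def affine_matrix(theta_deg, tx, ty):
    """Assumed 3x3 homogeneous form: rotate by theta (anticlockwise, y-up axes), then translate."""
    t = np.deg2rad(theta_deg)
    return np.array([
        [np.cos(t), -np.sin(t), tx],
        [np.sin(t),  np.cos(t), ty],
        [0.0,        0.0,       1.0],
    ])


def pseudo_normal_background(defect, calibrated_normal, mask):
    """B = D * (1 - M) + N_c * M, element-wise; mask is 1 on defect pixels."""
    mask = mask.astype(defect.dtype)
    return defect * (1.0 - mask) + calibrated_normal * mask


# Toy example: a 4x4 defect image whose central 2x2 patch is defective.
D = np.full((4, 4), 0.2)
D[1:3, 1:3] = 0.9
N_c = np.full((4, 4), 0.2)      # normal image, assumed to be aligned already
M = np.zeros((4, 4))
M[1:3, 1:3] = 1.0

B = pseudo_normal_background(D, N_c, M)

# A homogeneous point (1, 0, 1) rotated by 90 degrees and shifted by T_x = 2.
point = affine_matrix(90, 2, 0) @ np.array([1.0, 0.0, 1.0])
```

In this toy case the defect patch is fully overwritten by the normal texture, so B is uniform, which is the property the pseudo-normal background is meant to have.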

MDGAN
As shown in Figure 2, MDGAN consists of a generator and a discriminator. A BRM is proposed to modulate the background in the generator using binary masks, as a way to avoid modification of the background and to control the generated defect shapes. A DDM is proposed to divide the feature map into a whole feature branch and a defect feature branch, which guides the discriminator to focus on both the whole image and the local regions. As a result, only one discriminator is needed to drive the generator to output high-quality defect images. Overall, MDGAN is able to generate images with defects appearing in regions specified by binary masks while preserving the normal backgrounds, and these images are combined with the input masks to obtain defect samples with pixel-level annotations.

Architectures
As shown in Figure 2, the generator is a UNet-like [24] architecture, whose inputs are Gaussian noise, a pseudo-normal background, and a 0-1 binary mask. The size of the output is the same as that of the input images. In the output image, the defective contents are at the "1" positions specified by the mask, and the original normal backgrounds are at the "0" positions. The discriminator adopts PatchGAN [23] to process the input defect image and binary mask, and its output indicates the probability that the input defect image annotated by the mask is real.
Background Replacement Module (BRM). As shown in Figure 2, BRM employs a binary mask to fuse the real background into the specific positions of the feature map. First, the input binary mask M is down-sampled by average pooling, convolved, and activated by Sigmoid, to obtain a weight map f with values in [0, 1]. f has the same size as the feature map d of the current layer. The input normal background B is down-sampled and convolved with 1 × 1 kernels, to obtain a background feature map with the same size as f. This background feature map is modulated into the generator to obtain F, which serves as both the input of the next layer and the skip-connected feature.
Double Discrimination Module (DDM). As shown in Figure 2, DDM divides the current feature map of the discriminator into whole image features and local defect features for processing. First, M is processed to obtain a weight map with the same size as the feature map, and then the weight map is used to extract the local defect information. Finally, the extracted contents are concatenated with the original feature map and input into the next layer. Thus, DDM assists the discriminator in processing both the local and global information and ensures the realism of the generated defects. Without DDM, two discriminators would be needed to achieve the above tasks, which would slow down the training and make the architecture more complex.
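The data flow of the two modules can be sketched in NumPy as follows. This is only an illustration under explicit assumptions: the blending rule `F = f * d + (1 - f) * background_features` is assumed (the modulation formula is not reproduced in the text), the learned convolutions are replaced by fixed stand-ins (`w_mask`, `b_mask`, `w_bg` are hypothetical parameters), and the DDM weight map is approximated by the pooled mask itself.

```python
import numpy as np


def avg_pool(x, k):
    """Non-overlapping k x k average pooling over the last two axes."""
    h, w = x.shape[-2] // k, x.shape[-1] // k
    x = x[..., :h * k, :w * k]
    return x.reshape(*x.shape[:-2], h, k, w, k).mean(axis=(-3, -1))


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def brm(d, mask, background, w_mask=8.0, b_mask=-4.0, w_bg=None):
    """Blend real-background features into the generator feature map d.

    d: (C, h, w) features, mask: (H, W) with 1 = defect, background: (C_in, H, W).
    """
    k = mask.shape[-1] // d.shape[-1]
    # Weight map f in [0, 1]; a scalar affine map stands in for the learned convolution.
    f = sigmoid(w_mask * avg_pool(mask, k) + b_mask)
    # A 1x1 convolution is a per-pixel linear map over channels.
    if w_bg is None:
        w_bg = np.ones((d.shape[0], background.shape[0])) / background.shape[0]
    bg_feat = np.einsum("oc,chw->ohw", w_bg, avg_pool(background, k))
    # Assumed blend: generated features where f is near 1, real background where f is near 0.
    return f * d + (1.0 - f) * bg_feat


def ddm(feat, mask):
    """Concatenate whole-image features with mask-weighted (local defect) features."""
    k = mask.shape[-1] // feat.shape[-1]
    weight = avg_pool(mask, k)          # stand-in for the learned mask processing
    return np.concatenate([feat, feat * weight], axis=0)


M = np.zeros((8, 8))
M[2:6, 2:6] = 1.0                       # defect region
B = np.full((1, 8, 8), 0.5)             # grayscale background
d = np.ones((4, 4, 4))                  # generator features (C, h, w)

F = brm(d, M, B)
both = ddm(d, M)
```

With these toy inputs, F stays close to the background value outside the mask and close to the generated features inside it, while the DDM output doubles the channel count.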

Training Objectives
First, a binary mask $M$, a real defect image $D$ (the ground truth of the generator), and a constructed pseudo-normal background $B$ are given, and we sample random noise $z$ from the Gaussian distribution $\mathcal{N}(0, I)$ as a code from the latent space. Next, $z$ is mapped and reshaped to obtain $x_z$ with the same size as $B$. The input of the first convolution layer of the generator is $x$; $x$, $M$, and $B$ are then processed by the generator to obtain the output $D_r = G(M, z, B)$ and the input feature map $d^{l}_{r}$ of the last BRM.
Defect region losses. First, we calculate the defect reconstruction loss between $D_r$ and $D$, using the L1 loss $\|\cdot\|_1$. In order to improve the diversity of the generated defects, we sample another code $z' \sim \mathcal{N}(0, I)$ to obtain the new outputs $D' = G(M, z', B)$ and $d'^{l}$, and calculate the diversity loss of the defect regions. Essentially, $D_r$ and $D'$ are synthetic defect images with the same background and annotations, but different defect contents.
Normal background losses. According to the above description, we calculate the reconstruction loss of the normal backgrounds. In addition, when the normal background of $d^{l}$ (the input of the last BRM) is as similar as possible to $D$, the last BRM only needs to modulate the details of $D$ into $d^{l}$. Otherwise, MDGAN will depend on the last BRM to substantially replace the normal background of $d^{l}$, which may lead to a large incoherence between the normal and the defect regions. Therefore, in order to avoid a dependence on the last BRM and obtain a more coherent transition between the defect contents and the normal background of the final output, we calculate $L_{r3}$ to constrain the backgrounds of $d^{l}$ to be as similar as possible to $D$.
Whole image losses. To ensure the coherence of the defect and background regions after replacement, the gradient loss [25] at the boundary between the two regions is calculated, where $\nabla$ is the gradient and $F(X)_{!0 \to 1}$ sets all the non-zero elements in $X$ to 1.
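The region-restricted losses described above can be sketched as masked L1 terms. The following NumPy example is a hedged illustration rather than the paper's exact formulation: `masked_l1`, `nonzero_to_one`, and `boundary_band` are hypothetical helper names, the normalization by the region area is an assumption, and the boundary band is derived from finite differences of the mask as one plausible reading of the operator that sets non-zero elements to 1.

```python
import numpy as np


def masked_l1(a, b, region):
    """Mean absolute difference between a and b over pixels where region == 1."""
    region = region.astype(float)
    return float((np.abs(a - b) * region).sum() / max(region.sum(), 1.0))


def nonzero_to_one(x):
    """Set every non-zero element of x to 1 (all others stay 0)."""
    return (x != 0).astype(float)


def boundary_band(mask):
    """Pixels where the finite-difference gradient of the mask is non-zero."""
    gy, gx = np.gradient(mask.astype(float))
    return nonzero_to_one(np.abs(gx) + np.abs(gy))


# Toy data: real defect image D, generated image D_r, and mask M (1 = defect).
M = np.zeros((6, 6))
M[2:4, 2:4] = 1.0
D = M.copy()
D_r = D.copy()
D_r[2, 2] = 0.5      # error inside the defect region
D_r[0, 0] = 0.1      # error in the background

defect_loss = masked_l1(D_r, D, M)              # restricted to the defect region
background_loss = masked_l1(D_r, D, 1.0 - M)    # restricted to the normal background
band = boundary_band(M)                         # where a boundary gradient loss would apply
```

Separating the two regions in this way is what allows the defect content and the background to be penalized independently, in line with the goal of decoupling them.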
In order to stabilize the training process, we add the gradient penalty of WGAN-GP [26,27] to the discriminator. We construct two hybrid samples based on the real and the generated defect samples, where $\alpha_1$ and $\alpha_2$ are random numbers in [0, 1]. The discriminator processes the two hybrid samples, which only back-propagate gradients to the discriminator Dis, and the gradient penalty loss is computed from them. Then we calculate the adversarial loss, to constrain the realism of the images.
Full objective. Ultimately, we obtain the final training objective as a weighted combination of the above losses, where $\lambda_r$, $\lambda_d$, $\lambda_g$, and $\lambda_{gp}$ control the contribution of each loss to the whole. Based on the trained MDGAN, when the normal background and the binary mask are input to the generator, the output is the synthesized defect image whose pixel-level annotation is the binary mask.
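As a reminder of how a WGAN-GP style penalty is formed, the sketch below builds a hybrid sample and evaluates the standard penalty, the batch mean of (‖∇Dis(x̂)‖₂ − 1)². It is an illustration only: the function names are hypothetical, how the two hybrid samples are paired in MDGAN is not specified here, and a toy linear critic with an analytic gradient is used so that the example runs without an automatic differentiation framework (a real implementation would obtain the gradients from one).

```python
import numpy as np


def hybrid_sample(real, fake, alpha):
    """Random convex combination of a real and a generated image."""
    return alpha * real + (1.0 - alpha) * fake


def gradient_penalty(grads):
    """Standard WGAN-GP term: batch mean of (||grad||_2 - 1)^2."""
    norms = np.sqrt((grads.reshape(grads.shape[0], -1) ** 2).sum(axis=1))
    return float(np.mean((norms - 1.0) ** 2))


rng = np.random.default_rng(0)
real = rng.random((3, 2, 2))       # batch of 3 toy real images
fake = rng.random((3, 2, 2))       # batch of 3 toy generated images
alpha = rng.random((3, 1, 1))      # one mixing coefficient in [0, 1) per sample
x_hat = hybrid_sample(real, fake, alpha)

# Toy linear critic Dis(x) = sum(w * x): its gradient with respect to x is w for every sample.
w = np.ones((2, 2))
grads = np.broadcast_to(w, x_hat.shape)
penalty = gradient_penalty(grads)  # each gradient has norm 2, so the penalty is (2 - 1)^2
```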

Training Sets
MVTec-AD [28] is a commonly used public dataset in industrial vision tasks. Its test set contains multiple classes of defect images with pixel-level annotations, and its training set contains many normal images. Therefore, we employed MVTec-AD to construct the training and testing sets of MDGAN.
In this work, experiments were conducted on four items in MVTec-AD: the grid, zipper, capsule, and metal nut. Figure 3 shows the training images cropped from the original large images. It can be seen that there are multiple complex texture regions in these items. In addition, phone band images from an actual production line were used in this work. We classified the phone band defects into dirty, roll, and scratch, to help the network avoid unnecessary blending of defect types. In addition, the defect samples for subsequent segmentation testing needed to be held out in advance. The sizes of the relevant datasets are shown in Table 1. The training sets for generation and segmentation were taken from the same original images. Zipper-combined means there are multiple classes of defects in one image; these images were only used in segmentation testing. As shown in Table 1, we replaced some long defect names with initials, to simplify the description.

Testing Sets
We employed two methods to obtain binary masks to construct rich MDGAN test sets. First, MVTec-AD provides many binary masks of defect samples, which characterize the shapes of industrial defects and can be cropped as inputs for MDGAN. Second, to generate more defects with different shapes, we acquired a large number of binary masks based on Perlin noise [29]. Figure 4 shows the process of Perlin-noise-based mask generation. The normal images and binary masks were cropped to obtain image pairs. Based on the above two methods, we could obtain test sets of arbitrary size.
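A Perlin-noise-based mask generator can be sketched as follows. This is a generic NumPy implementation of 2D Perlin noise followed by thresholding, not the exact pipeline of Figure 4: the lattice resolution `res`, the threshold value, and the function name `perlin_noise` are assumptions chosen for illustration.

```python
import numpy as np


def perlin_noise(shape, res, rng):
    """2D Perlin noise within [-1, 1]; each shape entry must be divisible by res."""
    d0, d1 = shape[0] // res[0], shape[1] // res[1]
    # Fractional position of each pixel inside its lattice cell.
    fy, fx = np.meshgrid((np.arange(shape[0]) / d0) % 1.0,
                         (np.arange(shape[1]) / d1) % 1.0, indexing="ij")
    # Random unit gradient vector at every lattice corner.
    ang = 2.0 * np.pi * rng.random((res[0] + 1, res[1] + 1))
    g = np.stack([np.cos(ang), np.sin(ang)], axis=-1)
    g = g.repeat(d0, axis=0).repeat(d1, axis=1)
    g00, g10 = g[:-d0, :-d1], g[d0:, :-d1]
    g01, g11 = g[:-d0, d1:], g[d0:, d1:]
    # Dot products between the corner gradients and the offsets to each corner.
    n00 = fy * g00[..., 0] + fx * g00[..., 1]
    n10 = (fy - 1) * g10[..., 0] + fx * g10[..., 1]
    n01 = fy * g01[..., 0] + (fx - 1) * g01[..., 1]
    n11 = (fy - 1) * g11[..., 0] + (fx - 1) * g11[..., 1]
    # Smooth interpolation with the quintic fade curve.
    ty = 6 * fy**5 - 15 * fy**4 + 10 * fy**3
    tx = 6 * fx**5 - 15 * fx**4 + 10 * fx**3
    n0 = n00 * (1 - ty) + n10 * ty
    n1 = n01 * (1 - ty) + n11 * ty
    return np.sqrt(2.0) * (n0 * (1 - tx) + n1 * tx)


rng = np.random.default_rng(0)
noise = perlin_noise((64, 64), (4, 4), rng)
mask = (noise > 0.2).astype(np.uint8)   # 1 marks a candidate defect region
```

Lower lattice resolutions give larger, smoother blobs, and the threshold controls how much of the image is marked as defect, so both can be varied to obtain masks of different shapes and sizes.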

Experiments
MDGAN was used to synthesize defect images based on the above datasets. Except for the metal nut and capsule, which involved three-channel color images, the images were single-channel grayscale images. Section 4.2 shows the generated quality, diversity, and annotation accuracy of MDGAN. The effectiveness of BRM and DDM is demonstrated in Section 4.3. We also compare MDGAN with the most commonly used CycleGAN in Section 4.5, to verify the effectiveness of our method. Then we assess the advantages of our synthetic samples over traditional augmented results. Finally, we explore the possibility of defect transfer using MDGAN in Section 4.6.

Implementations
To construct the pseudo-normal backgrounds, we first obtain, by thresholding, the main areas where defects may occur in the defect image D and in the selected normal image N, and draw the minimum bounding rectangles of these areas. Then we apply an affine transformation to D and N, so that the lengths and widths of the two boxes are parallel and their center points overlap. Finally, the defect contents in D are filled by N, and the filled image is transformed back to the pose of the original D. These two affine transformations can be simplified as Equations (1)-(3). We performed the above processes based on the OpenCV library, where functions such as cv2.getRotationMatrix2D() and cv2.warpAffine() were used.
MDGAN was trained for each type of defect under each item. We augmented the training image pairs by rotation, flipping, and random cropping. The binary mask was normalized to [−1, 1] when input to the MDGAN and to [0, 1] when calculating losses. The numbers of output channels of the 12 convolution layers of the generator from input to output were 64, 128, 256, 256, 512, 512, 512, 256, 256, 128, and 64, respectively; those of the discriminator were 32, 128, 256, 256, 512, and 1, respectively. The dimension of the latent space was 8, and the hyperparameters of Equation (15) were set as $\lambda_r = 10$, $\lambda_d = 15$, $\lambda_g = 10$, and $\lambda_{gp} = 10$. The Adam optimizer was used with $\beta_1 = 0.5$ and $\beta_2 = 0.999$ to train MDGAN, with a batch size of 20 and a learning rate of 0.0004. We trained the MDGAN for 500 iterations on one NVIDIA GeForce RTX 3090 GPU of a server with an Intel(R) Xeon(R) Gold 622306R CPU @ 2.90 GHz. During training, backgrounds could be retained without the last BRM in some datasets, which could simplify the architecture and obtain a better transition between normal backgrounds and defect regions. Therefore, we eliminated the last BRM for the defects of the grid.
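The paired augmentation and the two mask normalizations can be sketched as below. This is a minimal NumPy illustration with assumptions: rotations are restricted to multiples of 90 degrees, only horizontal flips are used, and the helper names `augment_together` and `to_generator_range` are hypothetical. The key point is that the same random transform is applied to the defect image, the pseudo-normal background, and the mask, so that they stay pixel-aligned.

```python
import numpy as np


def augment_together(arrays, crop, rng):
    """Apply one random rotation, flip, and crop to every array so they stay aligned."""
    k = int(rng.integers(0, 4))                 # rotate by k * 90 degrees
    flip = rng.random() < 0.5
    h, w = arrays[0].shape[:2]
    if k % 2:                                   # odd rotations swap height and width
        h, w = w, h
    top = int(rng.integers(0, h - crop + 1))
    left = int(rng.integers(0, w - crop + 1))
    out = []
    for a in arrays:
        a = np.rot90(a, k)
        if flip:
            a = np.fliplr(a)
        out.append(a[top:top + crop, left:left + crop].copy())
    return out


def to_generator_range(mask01):
    """Map a {0, 1} mask to {-1, 1} for the generator input; losses keep the {0, 1} version."""
    return mask01 * 2.0 - 1.0


rng = np.random.default_rng(0)
M = np.zeros((8, 8))
M[1:4, 2:7] = 1.0
D = 1.0 + 5.0 * M                # toy defect image that is brighter inside the mask
B = np.ones((8, 8))              # toy pseudo-normal background

D_aug, B_aug, M_aug = augment_together([D, B, M], crop=6, rng=rng)
M_in = to_generator_range(M_aug)
```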

Synthetic Results
Some synthetic defect samples are shown in Figures 5 and 6. As shown in Figure 5, MDGAN achieved defect image generation for all categories. Good synthesis results were obtained for complex and weak defects, such as the fine-grained texture defects (zipper-fb, zipper-fi, grid-thread, etc.), color defects (metal nut-color), metallic defects (grid-mc), and weak defects (phone band). As seen from the defect contents shown in Figure 5, the generated defect contents were realistic and accurately distributed in the locations marked by the mask, achieving an accurate correspondence between the synthetic defect images and the input binary masks. In addition, MDGAN preserved the background structures in the grid reasonably well, despite the removal of the last BRM. In summary, with the help of BRM and DDM, MDGAN was able to generate realistic defect images while preserving the original real backgrounds outside the annotations, and it achieved a natural transition between the generated defects and the real background, so that the generated defect images were accurately labeled by the input binary masks.
Controllability of the backgrounds, defect shapes, and defect textures. We separately processed these three aspects, to verify that MDGAN could generate various defect images for a normal background and obtain accurate annotations. Figure 6 shows the generated results when normal backgrounds, defect shapes, and defect textures were changed, respectively. First, the defect images in Figure 6a shared the same binary mask. This shows that BRM added different backgrounds to the generator, to obtain multiple defect samples whose annotations were the same and whose defects varied with the background. Second, Figure 6b shows the generated defect images of the zipper, with the same background NB and five different masks. Defect images in the same row shared the same mask on the left side. It can be seen that the BRM could replace different background regions with different masks, which assisted MDGAN in generating multiple defects with different shapes and contents on the same normal image. Third, Figure 6c shows multiple categories of generated defects in the grid, where the defect images shared the same normal background (NB) and annotations (Mask), but had different defective textures. As seen from Figure 6c and each row in Figure 6b, MDGAN could produce multiple types of defect in the same specified region of the same normal image when the training sets were different.
In summary, based on BRM and DDM, MDGAN achieved independent control of the background, defect shape, and defect texture, and was able to generate a large number of diverse and high-quality defect samples for a normal image. In reality, various defects may appear in a normal image, and our experimental results fit actual situations. Moreover, since MDGAN accurately controls defect shapes and preserves backgrounds using the given binary mask, the output defect regions are precisely labeled by the mask. Thus, our synthetic samples can be used to train segmentation models, which is beneficial for detecting defects.
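The relationship between the generated image, the normal background, and the binary mask described above can be illustrated with a minimal NumPy sketch. It is not the MDGAN implementation: `split_by_mask` is a hypothetical helper that only shows how the Defect contents can be cut out by the mask and how the change outside the mask can be checked on image arrays.

```python
import numpy as np

def split_by_mask(generated, background, mask):
    """Cut out the defect contents and measure the change outside the mask.

    generated, background: arrays of the same shape (grayscale images).
    mask: binary array of the same shape, 1 inside the annotated defect region.
    """
    inside = mask.astype(bool)
    # Defect contents: generated pixels inside the mask, zeros elsewhere.
    defect_contents = np.where(inside, generated, 0)
    # Largest absolute change of any pixel outside the mask.
    diff = np.abs(generated.astype(np.float64) - background.astype(np.float64))
    outside = ~inside
    max_bg_change = float(diff[outside].max()) if outside.any() else 0.0
    return defect_contents, max_bg_change

# Toy example: a flat background with a brighter 2 x 2 "defect" inside the mask.
background = np.full((4, 4), 100, dtype=np.uint8)
mask = np.zeros((4, 4), dtype=np.uint8)
mask[1:3, 1:3] = 1
generated = background.copy()
generated[1:3, 1:3] = 200
contents, change = split_by_mask(generated, background, mask)
```

When the background outside the mask is fully preserved, `max_bg_change` is zero, which is the condition under which the input mask can serve as a pixel-level annotation of the generated image.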

Ablation Experiments
To verify the effect of BRM on the background retention and DDM on generated quality, we separately removed the two modules and performed ablation experiments on the above datasets, where the remaining settings were exactly the same as in the formal experiments.
Without BRM. Figure 7 shows the comparison results with and without (wo) BRM for the same pairs of test images. As can be seen, the results generated without BRM had drastically and unreasonably modified backgrounds, while MDGAN retained the details of the backgrounds well. In particular, substantial modifications to the original backgrounds led to the loss of real structures in the phone band. The generated defect contents did not appear accurately in the annotated positions without BRM, resulting in inaccurate binary annotations.
To quantify the background retention ability of BRM, we employed the structure similarity index measure (SSIM) to evaluate the background similarity. The SSIM value lies in [0, 1], and the larger the value, the more similar the two images. Using the same test set, 1000 defective samples were synthesized with each of the with/wo BRM models, and the SSIM between the normal background areas of the output and the input was calculated. The mean SSIM (mSSIM) is shown in Table 2. It can be seen that the backgrounds generated by MDGAN were highly similar to the real backgrounds. Both the qualitative and quantitative results showed that BRM can modulate the real backgrounds in the feature map of the generator, which restricted the location of the generated defects, avoided the loss of backgrounds in the training, and finally facilitated MDGAN in generating defect samples with realistic backgrounds and accurate binary annotations.
Without DDM. All DDMs were removed from the discriminator. The mask and image were concatenated along the channel dimension and input into the discriminator. To match the formal experiments, the number of output channels of the first three convolutional layers was doubled. As shown in Figure 8, the discriminator's constraint on the defect quality weakened after removing the DDM. As shown in Figure 8a, the defects generated without DDM contained unrealistic stripes and the zipper teeth were not smooth enough, being totally different from the real images. In addition, as shown in Figure 8b, there were unreasonable black contents in the capsule-crack generated without DDM, while MDGAN smoothly stripped the black blocks and generated clear red defects on them. On capsule-fm, the model without DDM covered the annotated region with only fuzzy uniform white blocks, while MDGAN generated realistic and detailed scratches. That is, the networks could not inherit and generate the real defects without DDM. Overall, the addition of DDM assisted the discriminator in focusing on both the whole image and the local defects, which improved its judgment and enabled MDGAN to accurately capture the stripes and grayscale distribution of defects and generate more realistic and higher quality images.
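The background SSIM used in the BRM ablation above can be sketched as follows. This is a simplified illustration under stated assumptions: it evaluates the SSIM formula once over all background pixels using global statistics, rather than the sliding-window SSIM found in common image libraries, and the text does not specify which variant was used. The function names and the constants (the usual K1 = 0.01 and K2 = 0.03) are assumptions.

```python
import numpy as np

def background_ssim(output, reference, mask, data_range=255.0):
    """Global-statistics SSIM restricted to pixels outside the defect mask."""
    bg = ~mask.astype(bool)                      # background = outside the mask
    x = output.astype(np.float64)[bg]
    y = reference.astype(np.float64)[bg]
    c1 = (0.01 * data_range) ** 2                # stabilizing constants
    c2 = (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx ** 2 + my ** 2 + c1) * (vx + vy + c2)
    )

def mean_background_ssim(outputs, references, masks):
    """Mean background SSIM (mSSIM) over a set of generated samples."""
    values = [background_ssim(o, r, m) for o, r, m in zip(outputs, references, masks)]
    return float(np.mean(values))

# Toy example: the output differs from the reference only inside the mask.
reference = (np.arange(64, dtype=np.uint8) * 3).reshape(8, 8)
mask = np.zeros((8, 8), dtype=np.uint8)
mask[2:5, 2:5] = 1
output = reference.copy()
output[2:5, 2:5] = 255
score = background_ssim(output, reference, mask)
```

Because the masked region is excluded, a generator that leaves the background untouched obtains a score of 1 regardless of what it draws inside the annotation, which is the property the mSSIM comparison relies on.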

Comparison with CycleGAN
CycleGAN-based methods are leading the way in defect synthesis. To verify the advantages of MDGAN over other methods, CycleGANs were trained on the above datasets. To help CycleGAN retain the normal background, we added an L1 loss between the input and output to the original losses [17]. Some of the results generated by CycleGAN and MDGAN based on the same backgrounds are shown in Figure 9. Despite the addition of the L1 loss, CycleGAN still modified the normal backgrounds, and it could convert the input normal images into defect images for the zipper-fi, zipper-sqt, zipper-spt, and grid-broken. Moreover, there were stretches and artifacts in the generated results of the capsule and phone band. This indicated that, for few-sample datasets, CycleGAN could only convert the source image into the most similar target image seen during training, resulting in either a failed generation or a loss of structures in the source image. In contrast, MDGAN generated realistic defect images for each category of input normal backgrounds and retained the original normal textures.
To quantify the generation quality, 1000 samples were generated on the same testing set with CycleGAN and with MDGAN, and the FID [30] between the generated and the real defect datasets was calculated. The lower the FID, the closer the generated and real features. As shown in Table 3, MDGAN obtained a smaller FID than CycleGAN for most defect types. Nevertheless, CycleGAN had a lower FID for the grid-glue, grid-bent, and zipper-fb. In grid-glue, as shown in Figure 9, CycleGAN simply memorized the training sets and translated all test images into the most similar training images, resulting in a low FID. In the other two items, MDGAN generated defects in the given annotated regions, and when the mask shapes used in testing differed significantly from the real ones, the generated features were somewhat far from the real features, resulting in a high FID.
Overall, MDGAN constructed pseudo-normal images to efficiently acquire pairs of inputs without relying on CycleGAN, and it obtained a higher quality than CycleGAN for most items. Moreover, in contrast with CycleGAN, MDGAN imports random noise to construct latent representations for real defects, which improves both the randomness of the generation and the diversity of the results and is thus more suitable for few-sample synthesis. Furthermore, due to the background preservation with BRM and the constraints on quality from DDM, MDGAN successfully generated defect images with pixel-level annotations, while preserving the real normal backgrounds. Meanwhile, CycleGAN generated samples with only image-level annotations, which could not be used for defect segmentation experiments.
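For reference, the FID compares the Gaussian statistics of two feature sets through the Fréchet distance, $\|\mu_a-\mu_b\|^2 + \mathrm{Tr}\big(\Sigma_a + \Sigma_b - 2(\Sigma_a\Sigma_b)^{1/2}\big)$. The sketch below computes this distance for two feature arrays that are assumed to have been extracted already (in the standard FID these are Inception features; the extraction step is not shown, and `frechet_distance` is a hypothetical name). The trace of the matrix square root is obtained from the eigenvalues of the covariance product, which avoids an explicit matrix square root.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussians fitted to two (n_samples, dim) arrays."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative for
    # positive semi-definite covariances (tiny negatives are clipped).
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = float(((mu_a - mu_b) ** 2).sum())
    return mean_term + float(np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)

# Toy example: shifting every feature by 1 changes only the mean term.
rng = np.random.default_rng(0)
real_feats = rng.normal(size=(200, 4))
shifted_feats = real_feats + 1.0
d_same = frechet_distance(real_feats, real_feats)
d_shift = frechet_distance(real_feats, shifted_feats)
```

In the toy example the covariances are identical, so the distance reduces to the squared distance between the means (4 for a unit shift in four dimensions, up to floating-point error).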

Detection Performance
To verify the advantages of MDGAN for detection over traditional augmentation, synthetic samples from MDGAN were added to the real segmentation training set RAW to construct the training set EL, and traditional brightness adjustment, rotation, and noise injection were applied to obtain the training set AUG. The number of datasets is shown in Table 1. Two types of segmentation network, UNet and sResNet, were trained using the above three training sets. sResNet is a UNet-like segmentation model, where the skip-connections are removed and the convolution layers are replaced by the Res-blocks in ResNet [31]. UNet consists of six downsampling and six upsampling layers, whose numbers of output channels are 32, 64, 128, 256, 512, 256, 512, 256, 128, 64, 32, and 1. sResNet consists of four downsampling layers, four res-blocks, and four upsampling layers, whose numbers of output channels are 32, 64, 128, 256, 256, 256, 256, 256, 128, 64, 32, and 1. The output of the two networks is a single-channel map with the same size as the input. We used the cross-entropy loss to train the two models. The Adam optimizer was used with β1 = 0.5, β2 = 0.999, batch size = 50, and learning rate = 0.0005 on an NVIDIA GeForce RTX 3090 GPU. The mIoU (mean Intersection over Union) and F1 coefficient were calculated on the same test set at the 500th iteration.
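The traditional augmentation used to build AUG (brightness adjustment, rotation, and noise injection) can be sketched as below. The text does not give the parameter ranges, so `max_brightness_shift`, `noise_std`, and the restriction to rotations by multiples of 90 degrees are illustrative assumptions, and `augment` is a hypothetical helper. The point of the sketch is that geometric transforms must be applied to the mask as well, while photometric transforms affect only the image.

```python
import numpy as np

def augment(image, mask, rng, max_brightness_shift=30.0, noise_std=5.0):
    """Apply a random rotation, brightness shift, and Gaussian noise.

    image: uint8 grayscale image; mask: binary annotation of the same shape.
    All parameter values are assumptions chosen for illustration.
    """
    # Rotation (multiples of 90 degrees) is applied to image and mask alike.
    k = int(rng.integers(0, 4))
    image = np.rot90(image, k)
    mask = np.rot90(mask, k)
    # Brightness shift and noise change only the image, not the annotation.
    shift = rng.uniform(-max_brightness_shift, max_brightness_shift)
    noise = rng.normal(0.0, noise_std, size=image.shape)
    out = np.clip(image.astype(np.float64) + shift + noise, 0, 255)
    return out.astype(np.uint8), mask.copy()

# Toy example on a flat image with a small annotated region.
rng = np.random.default_rng(0)
image = np.full((8, 8), 128, dtype=np.uint8)
mask = np.zeros((8, 8), dtype=np.uint8)
mask[1:3, 2:6] = 1
aug_image, aug_mask = augment(image, mask, rng)
```

Such transforms leave the defect content itself unchanged, which is consistent with the limited diversity of AUG discussed in this section.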
The mean testing results of each item are shown in Table 4, where higher values indicate a better detection performance. It can be seen that EL greatly improved the detection performance compared with AUG and RAW. Among the mean results of the various items, EL outperformed RAW and AUG, with an mIoU improvement of up to 7.3% (UNet-capsule) and an F1 improvement of up to 6.2% (sResNet-capsule). In contrast, AUG reduced the overall detection results for the zipper. Figure 10 shows a qualitative comparison of the segmentation results; the results of EL had fewer over-kills (false alarms) and escapes (missed defects) than RAW and AUG and were closer to the ground truths.
Overall, the inclusion of synthetic samples alleviates the data problems of class imbalance, lack of diversity, and few samples, and improves the detection performance. Compared with the traditional augmented samples, our synthesized datasets contained various backgrounds, defect contents, and shapes of binary annotations, which were very different from the original training sets and could expose the networks to richer defect information during training. When testing, the EL-trained models had therefore learned more and were better at detecting unseen samples. This demonstrated that the synthetic samples from MDGAN could be used to train supervised segmentation networks and that our work has practical application value.
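The two segmentation metrics reported in Table 4 can be computed from binary prediction and ground-truth masks as in the following sketch. The function names are hypothetical, and averaging the IoU over test images is an assumption, since the text does not state how the mean is taken.

```python
import numpy as np

def iou_and_f1(pred, target):
    """Intersection over Union and F1 score for two binary masks."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    tp = int(np.logical_and(pred, target).sum())    # defect pixels found
    fp = int(np.logical_and(pred, ~target).sum())   # over-kill pixels
    fn = int(np.logical_and(~pred, target).sum())   # escaped pixels
    if tp + fp + fn == 0:
        return 1.0, 1.0          # both masks empty: treated as a perfect match
    iou = tp / (tp + fp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return iou, f1

def mean_iou(preds, targets):
    """Mean IoU over a list of (prediction, ground truth) mask pairs."""
    return float(np.mean([iou_and_f1(p, t)[0] for p, t in zip(preds, targets)]))

# Toy example: 4 predicted pixels, 4 true pixels, 2 of them shared.
pred = np.zeros((4, 4), dtype=np.uint8)
pred[0, :] = 1
target = np.zeros((4, 4), dtype=np.uint8)
target[0:2, 2:] = 1
iou, f1 = iou_and_f1(pred, target)   # IoU = 2/6, F1 = 4/8
```

The two scores are monotonically related for a single image (F1 = 2·IoU / (1 + IoU)), so they rank individual predictions identically, although their means over a test set can differ.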

Defect Transfer
It can be seen from the above that MDGAN achieved defect image generation with complete retention of backgrounds. Therefore, we can employ an MDGAN trained on a source dataset to perform defect synthesis on a target dataset and use only the synthetic target defect images to train models to detect real defects in the target dataset, thus achieving defect transfer and zero-shot detection on the target dataset.
Figure 11 shows the related images and the procedure of defect transfer, where the first and second rows in the Source are from phone bands on the reverse side (phone band1) and curved glass, respectively, and those in the Target are from phone bands on the front side (phone band2) and phone cover glass, respectively. It can be seen that the Source and Target have different structures, but their defect contents are similar. Consequently, we trained MDGAN on phone band1 and curved glass and tested it on phone band2 and phone cover glass. The generated results are shown in the middle parts of Figure 11. It can be seen that, despite the differences in the backgrounds on the two sides of the defect transfer, MDGAN still generated defects in the annotated areas (Defect contents) and obtained defect images (GD) similar to the real ones. As shown in Res, MDGAN only modified the annotated regions and fully retained the target normal backgrounds after transfer.
To verify the effect of the transferred defect samples, we used only the transferred defect samples to train the segmentation networks and adopted the real images as the test set. The number of datasets and the test results are shown in Table 5. It can be seen that good results were obtained on the real test set with segmentation models trained only on transferred defect samples. The AUC (area under curve) was up to 0.971 (UNet) for phone band1 detection and 0.991 (sResNet) for phone cover glass detection. Based on the samples generated by an MDGAN trained on another dataset, we achieved defect detection on two real industrial surface defect datasets, showing that MDGAN could achieve defect transfer between datasets with similar defects but different backgrounds. Hence, when the type of product changes, we can train MDGAN on the existing source defect datasets and then synthesize a large number of defect samples on the new target normal backgrounds. As long as the defect textures are similar, MDGAN can greatly reduce the resource consumption for recollecting and labeling datasets, which is highly valuable for intelligent manufacturing.
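The AUC reported for the transfer experiments is the area under the ROC curve, which equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition; `roc_auc` is a hypothetical helper, and whether the scores are per pixel or per image is not specified in the text, so the inputs are simply treated as flat arrays of scores and binary labels.

```python
import numpy as np

def roc_auc(scores, labels):
    """Area under the ROC curve from the pairwise-ranking definition."""
    scores = np.asarray(scores, dtype=np.float64).ravel()
    labels = np.asarray(labels).astype(bool).ravel()
    pos, neg = scores[labels], scores[~labels]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("AUC needs at least one positive and one negative sample")
    # Count positive/negative pairs that are ranked correctly; ties count half.
    # This pairwise form is O(P * N) and is meant for small illustrations.
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((greater + 0.5 * ties) / (pos.size * neg.size))

# Toy example: three of the four positive/negative pairs are ordered correctly.
auc = roc_auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```

For large score maps, a rank-based or library implementation would be preferable to the quadratic pairwise comparison used here.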

Conclusions
This paper proposes MDGAN to tackle the problems of falsified backgrounds, a lack of pixel-level annotations, and insufficient attention to non-uniform complex structures in the current defect synthesis methods. Guided by binary masks, MDGAN modulates the real background into the generator using BRM and employs DDM to achieve discrimination of both local and global information. Due to the background-preserving effect of BRM and the quality constraint of DDM, MDGAN solves the problem of falsified backgrounds and enriches the diversity of datasets. Defect samples with accurate pixel-level annotations were synthesized using MDGAN on multiple datasets with complex textures. In addition, the qualitative and quantitative results showed that MDGAN obtained a better quality than the commonly used CycleGAN. The segmentation results demonstrated that the synthetic samples from MDGAN greatly improved the detection performance, with an improvement in IoU of up to 7.3% and in F1 of up to 6.2%. Furthermore, based on the excellent background retention capability of MDGAN, we successfully synthesized target defect images using MDGAN trained on a source dataset and achieved the defect detection of real target samples, based only on the synthesized samples.

Figure 1. An example of a defect sample with non-uniform structures. The top two boxes indicate a type of fabric defect. The bottom box indicates a type of teeth defect.

Figure 2. Architectures of MDGAN. Conv-3 means the convolution kernel size is 3 × 3, Upsample-2 means the feature map is upsampled two times. AvgPooling-k means the average pooling kernel size is k × k, which depends on the size of d. cat means the channel-level connection of inputs.

Figure 3. Non-uniform structure images in training sets. Defect Images (256 × 256) are cropped from the original real images. Pseudo NB (256 × 256) are the cropped constructed pseudo-normal backgrounds. Binary Masks (256 × 256) are the cropped binary annotation images. In particular, the sizes of the training images in Capsule and phone band were 320 × 320 and 128 × 128, respectively.

Figure 4. Example of generating a binary mask from Perlin noise. Set the values greater than 0.4 in the Original noise map (256 × 256) to 1 and other values to 0 to obtain a Binary mask (256 × 256).

Figure 5. Synthesis results for each item. Normal background indicates the real background image, Mask is the corresponding binary mask, and both were input into the generator to obtain the synthetic defect image Generated defect. Defect contents were segmented by the Mask from the Generated defect.

Machines 2022, 10, x FOR PEER REVIEW

Figure 6. Examples of one-to-many generation. NB and mask denote the normal background and the binary mask input into MDGAN, respectively. (a) Background variation. Color denotes the generated metal nut color defects. (b) Defect shape and texture variation. Spt, Rough, and Bt denote the three types of zipper defects generated from the same background. (c) Defect texture variation.

Figure 7. Comparison of with and without BRM. Background and Mask are the input normal backgrounds and binary masks, MDGAN is the synthetic result of MDGAN. Without BRM is the synthetic result after removing BRM.

Figure 8. Comparison of with and without DDM. Real is the real defect image. MDGAN and Without DDM are the generated defect samples with and without DDM, respectively. (a) One type of effect without DDM. Part is the enlarged view of the defect in the red box. (b) Another effect.

Figure 9. Generated results of MDGAN and CycleGAN with L1 loss based on the same normal backgrounds NB. MDGAN was generated from the NB and a binary mask by MDGAN, and CycleGAN was generated from the NB by CycleGAN. Defects are indicated by red boxes.

Figure 11. The first row shows the two types of phone bands, and the second shows the two types of glass. Red boxes indicate real defects. Source is the defect image used for training MDGAN. Target means the images to be transferred. NB is the normal background cut from Big NB. Mask is the input binary mask. GD is the generated target defect image. Real defect is the defect image of the target datasets to be tested.

Table 1. The number of datasets (original/cropped images). Training means the training sets of MDGAN. EL/AUG means the number of augmented samples from MDGAN (EL) and traditional methods (AUG). RAW and Seg-test are the real training and testing sets of segmentation, respectively.

Table 2. Quantitative results of the ablation experiments. Bold fonts indicate better results.

Table 3. FID between generated results and real images. Bold fonts indicate better results.

Table 4. Comparison of the mean test results for the three training sets. Bold fonts indicate the optimal IoU and F1 among the three results. R and A denote RAW and AUG, respectively.

Figure 10. Comparison of UNet-based segmentation results. Test and Ground Truth denote the test images and the corresponding ground truth (256 × 256). RAW, AUG, and EL denote the test results obtained by the three training sets (256 × 256).

Table 5. The numbers of transferred training sets and real test sets, and the test results of segmentation.