An Unsupervised Generative Adversarial Network-Based Method for Defect Inspection of Texture Surfaces

Abstract: Recently, deep learning-based defect inspection methods have begun to receive more attention from both researchers and the industrial community due to their powerful representation and learning capabilities. These methods, however, require a large number of samples and manual annotations to achieve an acceptable detection rate. In this paper, we propose an unsupervised method of detecting and locating defects on patterned texture surface images which, in the training phase, needs only a moderate number of defect-free samples. An extended deep convolutional generative adversarial network (DCGAN) is utilized to reconstruct input image patches; the resulting residual map can be used to realize an initial segmentation of defects. To further improve the accuracy of defect segmentation, a submodule termed "local difference analysis" (LDA) is embedded into the overall module to eliminate false positives. We conduct comparative experiments on a series of datasets and the final results verify the effectiveness of the proposed method.


Introduction
Surface defect detection is a hot topic in the field of industrial production, as even small flaws may destroy the appearance of a product [1,2]. As a result, a real-time online process where unqualified products are recognized and discarded or improved is necessary on the production lines of many factories. Traditional methods of achieving this are based mainly on the labor of experienced engineers; these are both time-consuming and costly, and open the process to human error. With the rapid development of computer vision technology, vision-based inspection approaches have, in recent years, gradually come to play an essential role in surface defect detection. As the most representative nondestructive strategy, computer vision-based inspection approaches substitute human eyes and decision-making abilities with optical lenses and intelligent inspection algorithms.
The textures of product surfaces such as textiles, ceramic tiles, steel, etc., can generally be categorized into two groups: patterned (homogeneous) and nonperiodic. Defects on patterned texture surfaces, resulting from machine faults or material problems, generally appear as irregularities, and their occurrence can harm product appearance and even performance. To address this, many computer vision-based defect inspection methods have emerged in the last few decades. Inspection remains a challenging task, however, for the following reasons: on the one hand, the imaging quality of texture surfaces is easily influenced by physical factors (e.g., illumination conditions, shooting angle, sensor errors); on the other, existing inspection algorithms have a narrow range of application because of the variety of emerging defect types (e.g., different scales, varying degrees of contrast with the texture background).
To address these problems, a patterned texture surface defect detection method based on background reconstruction and local difference analysis (LDA) is proposed in this paper.
The contributions of this paper are as follows:

Related Works
In this section, we review several representative existing texture surface defect detection methods. These can be divided into two groups: conventional texture analysis-based approaches and deep learning-based approaches.
"Texture analysis" is an umbrella term for a variety of methods based on handcrafted feature extraction and synthesis. It can be further divided into four categories: (i) statistical methods, (ii) structural methods, (iii) spectrum methods, and (iv) model-based methods. Among these, statistical methods and spectrum methods have, to date, proven to be the most practical.
Structural methods commonly regard a texture as a composition of texture primitives arranged according to certain spatial placement rules; the general pipeline of structural methods is the extraction of texture primitives followed by inference of the arrangement rules. Tolba and Raafat [10] proposed a multiscale structural similarity index measurement (MS-SSIM)-based method, achieving a 99.1% success rate. Spectrum methods utilize information from different domains, or synthesize spatial and transformed domains, to complete the detection task. The Fourier transform (FT) [5,11], wavelet transform (WT) [12,13] and Gabor transform [14,15] are the most frequently used approaches in practice. Sari-Sarraf and Goddard [16,17] proposed a method combining the discrete wavelet transform and edge fusion to segment defects in fabric texture images; they achieved an 89% success rate over 3700 images containing 26 different kinds of defects.
Model-based methods mostly concentrate on the understanding of a random field, which is essentially a stochastic model defined over an array of random variables. The best-known approaches include the autoregressive model [18][19][20], Markov random fields [21,22], fractal models, and others. Moradi and Zayed [23] used a hidden Markov model to implement a real-time defect detection system for sewer tunnels.
Although the aforementioned texture analysis-based methods have seen much success in past decades, they mainly depend on handcrafted features; this imposes a severe restriction on application range. It is additionally worth mentioning that these methods show extreme sensitivity to illumination consistency and noise influence.
With the rapid development of deep learning technology, computer vision has made breakthroughs in such application fields as object detection, semantic segmentation, target tracking, etc. Recently, research on deep learning-based surface defect detection has attracted a lot of attention, and a variety of relevant studies have been published [24][25][26][27]. Compared with traditional texture analysis methods, deep learning-based methods possess powerful feature representation learning ability and are skilled at solving complex problems. Generally speaking, defect detection methods based on deep learning technology can be categorized into three classes, according to their learning strategies: (i) supervised learning-based methods, (ii) transfer learning-based methods, and (iii) unsupervised learning-based methods.
Supervised learning-based methods are suitable for situations where a number of defect-free and defective samples with reasonable annotations are provided. In [28], the authors proposed a twofold joint detection convolutional neural network (CNN) to automatically extract powerful image features for defect detection on the DAGM (German Association for Pattern Recognition) dataset. In most cases, however, there are not enough samples for model training, and the appearance of faulty products is random; supervised learning-based methods thus impose many restrictive conditions.
Transfer learning can mitigate this problem to a certain extent; it pretrains a model on a large common dataset, then fine-tunes it using specific defective samples. Ren et al. [29] developed a dense prediction model based on a pretrained deep neural network to detect wood texture defects. However, transfer learning-based methods still require a certain amount of labeled sample data, and sometimes show poor success rates because of the mismatch between the source and target domains.
Unsupervised learning methods do not need any labeled samples at all, which is beneficial for the majority of industrial production scenarios. It has been shown that the autoencoder (AE) and its variants, such as the convolutional AE (CAE) and the convolutional denoising AE (CDAE), are the most successful models, owing to their distinctive encoding and decoding performance. Mei et al. [30] constructed a Gaussian pyramid-based CDAE architecture for unsupervised defect inspection, where the Gaussian pyramid is utilized to implement image patch reconstruction and synthesis at different resolutions. They achieved a fair detection success rate over patterned texture images.
The generative adversarial network (GAN), whose primary aim is to produce realistic images, is another kind of unsupervised model for tackling anomaly detection. Schlegl et al. [31] proposed a deep convolutional generative adversarial network named AnoGAN to model the manifold of the training data, together with a novel reconstruction scheme based on a mapping from image space to a latent space. Hu et al. [32] extended the standard deep convolutional generative adversarial network (DCGAN) by introducing an encoder-like component by which a given query image can be reconstructed; a residual map and a likelihood map can then be obtained to realize the detection and localization of surface defects.

Basic Principle and Theoretical Foundations
The proposed method is based on the following observation: given that a CNN such as a CAE, applied to reconstruct the input image, is trained only on defect-free images, its reconstruction quality in the defective region of an image will be inferior to that in the defect-free region.
In other words, the gray-level difference between the original and reconstructed images will be larger in the defective region than in the defect-free region. Figure 1a displays a defective fabric surface image and Figure 1b its reconstruction by the proposed CNN. Figure 1c shows the residual map between the two images and Figure 1d plots the residual map from a 3D perspective. It can be seen from Figure 1 that the discriminative sensitivity of the image reconstruction process can be applied to locate defective regions, and the proposed method follows exactly this principle.

In general, deep learning-based models employed for image reconstruction are based primarily on the CAE. The proposed model uses a similar structure to input and output image data; however, it adopts a different training method, namely adversarial learning, rather than the conventional one. More specifically, the proposed model is an extended version of DCGAN, which derives from GAN and addresses its training instability through some minor changes. To explain our method, it is therefore important to introduce the theoretical foundation of DCGAN.
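This principle can be sketched in a few lines of NumPy (a toy illustration, not the paper's model: the "reconstruction" here is simply the clean background, and the synthetic texture and defect values are invented for the example):

```python
import numpy as np

def residual_map(original: np.ndarray, reconstructed: np.ndarray) -> np.ndarray:
    """Absolute gray-level difference between original and reconstruction."""
    return np.abs(original.astype(np.float64) - reconstructed.astype(np.float64))

# Toy defect-free texture plus an injected bright defect blob.
rng = np.random.default_rng(0)
texture = rng.normal(128.0, 5.0, size=(64, 64))
defective = texture.copy()
defective[20:28, 30:38] += 80.0          # inject a defect

# Pretend the model reconstructs the clean background faithfully.
residual = residual_map(defective, texture)

# The residual is large only inside the defective region.
print(residual[24, 34] > residual[5, 5])  # True
```

Thresholding such a residual map gives the initial pixel-level segmentation described above.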
The primary goal of DCGAN is to generate realistic images. It consists of two CNNs: a generator (G) and a discriminator (D). The generator maps a latent vector z, which usually obeys a specified distribution, to a generated image x̂; the discriminator maps x̂ and the real image sample x to a scalar score in the range [0, 1] indicating how realistic the input image is. The training phase alternates between two processes: (i) the discriminator assigns a high score to x and a low one to x̂; (ii) the generator tries to generate a more realistic image x̂ to 'fool' the discriminator, so that x̂ gets a higher score. The two subnetworks compete with each other in each iteration until the Nash equilibrium of the cost function is reached. In the test phase, the well-trained generator can then be used to produce realistic images.
The proposed model additionally includes a special feature-based LDA module. Its goal is to cooperate with the residual analysis to obtain a better detection result in the testing phase. As is well known, a CNN has the ability of representation learning, and the hierarchical features it extracts are generalized and robust enough to solve certain vision problems. The proposed method uses this characteristic to locate defects at the region level. Specifically, one encoder is placed before the generator in the DCGAN and an identical encoder after it; these are used to extract the features of the input and reconstructed images, and a region-level prediction can then be drawn from a similarity analysis between the two feature sets.

Architecture of the Model and Training Pipeline
As shown in Figure 2a, the proposed model is composed of four basic CNN components, among which En1, En2 and D are similar in structure to an encoder and De is a decoder. These encoder-like and decoder-like networks are each a series of convolutional blocks. Specifically, for the encoder-like En1 and En2, every block consists of a convolutional layer, a batch normalization layer, and a leaky rectified linear unit (leaky-ReLU) layer, except that the first block omits the batch normalization layer and the last block has only a convolutional layer. The decoder-like De is a symmetric version of the two encoders; however, the convolutional layer in each block is replaced by a transposed convolutional layer and the leaky-ReLU layer is substituted by a ReLU layer, except that the last block is a tanh layer. The discriminator D is essentially the same as the encoders, but with a slight variation in the number of channels in the last block: the discriminator has just one output channel, followed by a sigmoid layer. In the testing phase, pixel-level and regional-level differences between I_org and I_rec are evaluated after image reconstruction to obtain the final segmentation result.

As mentioned above, the proposed model is trained on defect-free image samples. In practice, a training dataset is composed of many multiscale image patches which are randomly sampled from the different levels of an image pyramid, as shown in Figure 3. An image pyramid is able to adjust the image scale, and here a Gaussian pyramid is used to downsample the whole image. This is motivated by the fact that defects take on different sizes, e.g., small dots, thin lines, and even larger areas; therefore, to completely detect and locate these defects, the model should be trained on multiscale samples. Specifically, a whole image passes through the Gaussian pyramid to generate multiple images whose sizes decrease by powers of two; a fixed-size patch extraction is then carried out on these images to form the final training samples.

The objective function in the training phase plays an essential role in the final defect detection result; it comprises three optimized objectives, L_rec, L_fea and L_adv, in the proposed model, as shown in Figure 2a.
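The multiscale sampling step can be sketched as follows (a minimal stand-in: `downsample` uses 2×2 average pooling in place of a true Gaussian-blurred pyramid reduction such as OpenCV's `pyrDown`, and the level, patch, and count values are arbitrary examples):

```python
import numpy as np

def downsample(img: np.ndarray) -> np.ndarray:
    """Halve each dimension by 2x2 average pooling (a stand-in for a
    Gaussian pyramid reduction)."""
    h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2
    img = img[:h, :w]
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def pyramid_patches(img, levels=3, patch=32, per_level=4, seed=0):
    """Randomly crop fixed-size patches from each pyramid level."""
    rng = np.random.default_rng(seed)
    patches, level_img = [], img
    for _ in range(levels):
        for _ in range(per_level):
            y = rng.integers(0, level_img.shape[0] - patch + 1)
            x = rng.integers(0, level_img.shape[1] - patch + 1)
            patches.append(level_img[y:y + patch, x:x + patch])
        level_img = downsample(level_img)       # next, coarser scale
    return patches

image = np.random.default_rng(1).random((256, 256))
samples = pyramid_patches(image, levels=3, patch=32, per_level=4)
print(len(samples))   # 12 fixed-size patches drawn from three scales
```

Because the patch size is fixed while the image shrinks, patches from coarser levels cover proportionally larger regions of the original surface.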
L_rec is designed to improve the quality of image reconstruction through each iteration and can be written as:

L_rec = ||I_org − I_rec||_1, (1)

where I_org and I_rec represent the input and reconstructed images and ||·||_1 denotes the 1-norm.

L_fea is defined as the difference between the extracted features of I_org and I_rec, which can be expressed as:

L_fea = ||f_i − f_o||_1, (2)

where f_i and f_o denote the features extracted from I_org and I_rec by the two encoders. The introduction of this objective is meant to further improve image reconstruction quality; additionally, it is utilized to accomplish the LDA task in the testing phase.

L_adv is introduced to guarantee that the reconstructed image is more realistic, which is the primary goal of the original GAN. Unlike DCGAN, however, the proposed model uses the output of the penultimate CNN block of the discriminator to compare the degree of realism of its input image. The third objective can thus be defined as:

L_adv = ||f(I_org) − f(I_rec)||_1, (3)

where f(·) denotes the mapping from the input to the output of the penultimate block in the discriminator.

The final loss function is the weighted sum of the above three components:

L = w_rec · L_rec + w_fea · L_fea + w_adv · L_adv, (4)

where w_rec, w_fea and w_adv are the corresponding weights, summing to 1; they are hyperparameters that need to be adjusted for the specific dataset.
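A numerical sketch of the weighted objective described above (the toy arrays, the example weight values, and the averaged 1-norm are assumptions for illustration; they stand in for the images and feature maps of the real model):

```python
import numpy as np

def l1(a: np.ndarray, b: np.ndarray) -> float:
    """Mean absolute difference (1-norm, averaged for scale stability)."""
    return float(np.mean(np.abs(a - b)))

def total_loss(img, rec, feat_in, feat_rec, disc_in, disc_rec,
               w_rec=0.5, w_fea=0.3, w_adv=0.2):
    """Weighted sum of reconstruction, feature, and adversarial terms.
    The weights must sum to 1, as stated in the text."""
    assert abs(w_rec + w_fea + w_adv - 1.0) < 1e-9
    loss_rec = l1(img, rec)            # pixel-level term
    loss_fea = l1(feat_in, feat_rec)   # encoder-feature term
    loss_adv = l1(disc_in, disc_rec)   # discriminator-feature term
    return w_rec * loss_rec + w_fea * loss_fea + w_adv * loss_adv

# Toy tensors standing in for images / feature maps.
img  = np.ones((8, 8));  rec   = np.zeros((8, 8))   # l1 = 1.0
f_in = np.ones(16);      f_out = np.ones(16)        # l1 = 0.0
d_in = np.full(4, 2.0);  d_out = np.full(4, 1.0)    # l1 = 1.0
total = total_loss(img, rec, f_in, f_out, d_in, d_out)  # 0.5*1 + 0.3*0 + 0.2*1
print(total)
```

In the real model, the three terms would be computed on batches inside the training loop and backpropagated through the generator and encoders.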

Defect Detection and Localization
So long as the proposed model has been trained well, it can be used to determine whether a new image sample is defective and, if so, where the defect is located. It should be noted that every CNN component other than the discriminator D is utilized in the testing phase, as shown in Figure 2b.
Given an M × N pixel texture image, r × c pixel image patches are extracted in sequence, with a size similar to that used in the training phase. The well-trained model then reconstructs each image patch, and all patches are rearranged into a full-sized reconstructed image. Next, a pixel-level comparison is conducted on the input and reconstructed images, which can be expressed as:

I_residual = abs(I_org − I_rec), (5)

where I_residual denotes the result of the pixel-level difference, termed the residual image, and abs(·) represents the element-wise absolute value operation.
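The patch-wise reconstruction and residual computation can be sketched as follows (assuming M and N are divisible by the patch size; the identity "reconstruction" stands in for the trained generator, which is not available here):

```python
import numpy as np

def to_patches(img: np.ndarray, r: int, c: int) -> np.ndarray:
    """Split an M x N image into non-overlapping r x c patches."""
    M, N = img.shape
    return (img.reshape(M // r, r, N // c, c)
               .swapaxes(1, 2)
               .reshape(-1, r, c))

def from_patches(patches: np.ndarray, M: int, N: int) -> np.ndarray:
    """Rearrange patches back into a full-sized M x N image."""
    r, c = patches.shape[1:]
    return (patches.reshape(M // r, N // c, r, c)
                   .swapaxes(1, 2)
                   .reshape(M, N))

def residual_image(original: np.ndarray, reconstructed: np.ndarray) -> np.ndarray:
    """Pixel-level absolute difference, as in Equation (5)."""
    return np.abs(original - reconstructed)

img = np.arange(64.0).reshape(8, 8)
patches = to_patches(img, 4, 4)
# Stand-in for the trained generator: an identity "reconstruction".
rec = from_patches(patches, 8, 8)
res = residual_image(img, rec)
print(res.max())   # 0.0: an identity reconstruction leaves no residual
```

With a real generator, each patch would be passed through the network before reassembly, and the residual would peak in defective regions.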
Besides reconstruction, the proposed model is also employed to implement LDA. Figure 4b depicts the detection result derived directly from the residual image; it contains some sparkles. In fact, such interferences often appear at the borders of texture patterns, and noise points can also be misjudged as defects. Generally, the gray values of these areas are higher than those of the true defect regions. As a result, a relatively precise localization of defects cannot be realized by a pixel-level difference alone. Figure 4c displays the detection result with local difference analysis (LDA), where the previous sparkles are eliminated. The LDA module therefore plays an important role in the localization of defects.

In the testing phase, two feature vectors are obtained, which are powerful feature representations of the input and reconstructed image patches. The reconstructed version of a defective image patch shows lower sensitivity in abnormal regions, so there are apparent differences between their corresponding features; for a defect-free image patch, the opposite is true. A similarity analysis can thus be carried out between the two feature vectors, through which the probability of defect occurrence can be acquired. The above description can be formulated as follows:

I_mask^(r,c) = 1 if ||f_i^(r,c) − f_o^(r,c)||_1 > ε, and 0 otherwise, (6)

where r = 1, 2, ..., M/m and c = 1, 2, ..., N/n. (M, N) and (m, n) denote the sizes of the whole image and of the image patch, respectively. The superscript (r, c) indexes the corresponding image patch in the mask image I_mask. f_i^(r,c) denotes the feature vector of the original image I_org, and f_o^(r,c) that of the reconstructed image I_rec. ε is a threshold controlling the sensitivity with which a region is evaluated as defective. All mask patches I_mask^(r,c) are then rearranged to form a full-sized mask image I_mask.

The first row in Figure 5 displays four defective images with different texture backgrounds; the second row shows the four corresponding mask images, through which rough region-level defect localization can be accomplished.

Together with the residual image and the mask image, a fusion strategy is applied to synthesize them and obtain the final detection result. The process can be formulated as:

I_result = I_mask ⊙ (I_residual > µ + kσ), (7)

where ⊙ denotes the element-wise multiplication operation (Hadamard product), µ and σ represent the mean and standard deviation of the residual image I_residual, respectively, and k is a sensitivity coefficient.
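A sketch of the masking and fusion steps (the per-patch 1-norm feature distance and the µ + k·σ residual threshold are assumed concrete forms; the grids and values are toy data):

```python
import numpy as np

def mask_from_features(f_in: np.ndarray, f_out: np.ndarray, eps: float) -> np.ndarray:
    """Region-level mask: 1 where the per-patch feature distance exceeds eps.
    f_in / f_out are (rows, cols, feat_dim) grids of patch features."""
    dist = np.abs(f_in - f_out).sum(axis=-1)   # 1-norm per patch (assumed)
    return (dist > eps).astype(np.uint8)

def fuse(mask: np.ndarray, residual: np.ndarray, patch: int, k: float = 3.0) -> np.ndarray:
    """Keep residual detections only inside masked regions; the mu + k*sigma
    binarization of the residual is an assumed concrete threshold."""
    mu, sigma = residual.mean(), residual.std()
    binary = (residual > mu + k * sigma).astype(np.uint8)
    full_mask = np.kron(mask, np.ones((patch, patch), dtype=np.uint8))
    return full_mask * binary                  # Hadamard product

# 2 x 2 grid of patches with 4-dim features; only patch (0, 1) differs.
f_in = np.zeros((2, 2, 4)); f_out = np.zeros((2, 2, 4))
f_out[0, 1] = 1.0
mask = mask_from_features(f_in, f_out, eps=0.5)

residual = np.zeros((8, 8)); residual[1, 5] = 10.0   # strong pixel response
result = fuse(mask, residual, patch=4)
print(mask.tolist(), int(result[1, 5]))
```

Pixel responses outside the masked patch grid would be suppressed even if they crossed the residual threshold, which is exactly how the LDA module removes sparkles.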
The entire preceding discussion can be summarized in the following Algorithm 1:

Algorithm 1. Defect detection and localization.
1. Obtain the reconstructed image patches I_rec^(r,c) and the two corresponding feature vectors f_i^(r,c) and f_o^(r,c).
2. Rearrange I_rec^(r,c) into a full-sized reconstructed image I_rec and obtain the residual image I_residual through Equation (5).
3. Compare the similarity between f_i^(r,c) and f_o^(r,c) to generate the mask image patches I_mask^(r,c).
4. Rearrange I_mask^(r,c) and obtain a full-sized mask image I_mask with the threshold ε, as shown in Equation (6).
5. Synthesize I_residual and I_mask to obtain the final segmentation image I_result through Equation (7).

Experiments and Discussion
In this section, we detail several experiments conducted to evaluate the performance of the proposed defect detection method both qualitatively and quantitatively. First, the experimental data and parameter configuration are briefly described. Second, the effectiveness of the local difference analysis module is shown. Finally, the inspection performance of the proposed method is compared with that of several other unsupervised methods.

Dataset Description and Implementation Details
The dataset used in our experiments is composed of TILDA Textile Texture [33], Patterned Fabric [34,35] and MVTec Anomaly Detection [36]. The first contains six kinds of texture images, with 50 defect-free images in each category. The second includes three kinds of texture pattern, namely box, dot, and star, each with 25 to 30 defect-free and defective samples which can be utilized as training and testing datasets, respectively. The last contains over 5000 high-resolution images divided into fifteen different object and texture categories, each with a set of defect-free training images and a test set of images with various kinds of defects. In total, we selected ten kinds of texture background and constructed ten datasets accordingly: four from the first (c1, c2, c3 and c5), three from the second (box, dot and star), and three from the last (leather, tile and wood).
As described in Section 3.2, the proposed method is composed of four submodules: two encoders En 1 and En 2 , one decoder De, and a discriminator D. The detailed parameter setup is depicted in Tables 1 and 2. To further evaluate the performance of the proposed method, a quantitative analysis based on the following three indicators was undertaken.
Precision = TP/(TP + FP), Recall = TP/(TP + FN), F1 = 2 × Precision × Recall/(Precision + Recall),

where TP represents the number of correctly detected defect pixels, FP denotes the number of falsely detected defect pixels, and FN is the number of defect pixels falsely classified as background. The F1-measure is a more convincing indicator, taking both precision and recall into account. All three indicators range over [0, 1]; the higher the value, the better the performance. The proposed method was implemented on a desktop computer with a six-core CPU, 16 GB of memory, and an Nvidia GTX 1660 GPU. The code was written in Python 3.7 with the PyTorch, NumPy, and OpenCV packages.
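These indicators can be computed from binary prediction and ground-truth maps as follows (a straightforward sketch; the zero-division guards are a practical addition, not from the paper):

```python
import numpy as np

def prf1(pred: np.ndarray, truth: np.ndarray):
    """Pixel-level precision, recall, and F1 from binary maps."""
    tp = int(np.sum((pred == 1) & (truth == 1)))  # correctly detected defect pixels
    fp = int(np.sum((pred == 1) & (truth == 0)))  # falsely detected defect pixels
    fn = int(np.sum((pred == 0) & (truth == 1)))  # missed defect pixels
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

truth = np.array([[1, 1, 0, 0]])
pred  = np.array([[1, 0, 1, 0]])   # one hit, one miss, one false alarm
p, r, f = prf1(pred, truth)
print(p, r, f)   # 0.5 0.5 0.5
```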

Local Difference Analysis Module
As mentioned in Section 3.3, the LDA module can effectively improve the accuracy of defect segmentation by removing falsely detected background regions through feature analysis. Figure 6 illustrates this effect in detail, while Table 3 records the comparative quantitative results of the method with and without LDA.

Local Difference Analysis Module
As mentioned in Section 3.3, the LDA module can effectively improve the accuracy of defect segmentation by removing falsely detected background regions through feature analysis. Figure 6 illustrates its effect in detail, while Table 3 records the comparative quantitative results of the method with and without LDA. The first variant directly applies a morphological operation to the residual image; many falsely detected regions remain, resulting in low quantitative indicators. The second variant additionally applies LDA, which removes almost all of the interfering regions; accordingly, all three indicators increase to some degree.
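The exact LDA computation of Section 3.3 is not reproduced here, so the following is only a plausible sketch of a regional false-positive filter: extract each connected candidate region from the binary residual mask, compare its mean gray level against the surrounding background, and discard regions whose local difference is too small. The function name, the threshold `diff_thresh`, and the dilated-bounding-box background estimate are all illustrative assumptions, not the paper's definition.

```python
from collections import deque

def local_difference_filter(image, mask, diff_thresh=20.0, margin=2):
    """Sketch of local difference analysis: drop candidate regions whose
    mean gray level barely differs from the nearby background."""
    h, w = len(image), len(image[0])
    seen = [[False] * w for _ in range(h)]
    out = [[0] * w for _ in range(h)]
    for sy in range(h):
        for sx in range(w):
            if mask[sy][sx] and not seen[sy][sx]:
                # BFS to collect one 4-connected candidate region
                region, q = [], deque([(sy, sx)])
                seen[sy][sx] = True
                while q:
                    y, x = q.popleft()
                    region.append((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            q.append((ny, nx))
                # Background estimate: non-mask pixels in the dilated bounding box
                ys = [y for y, _ in region]
                xs = [x for _, x in region]
                y0, y1 = max(min(ys) - margin, 0), min(max(ys) + margin, h - 1)
                x0, x1 = max(min(xs) - margin, 0), min(max(xs) + margin, w - 1)
                bg = [image[y][x] for y in range(y0, y1 + 1)
                      for x in range(x0, x1 + 1) if not mask[y][x]]
                fg_mean = sum(image[y][x] for y, x in region) / len(region)
                bg_mean = sum(bg) / len(bg) if bg else fg_mean
                if abs(fg_mean - bg_mean) >= diff_thresh:  # keep plausible defects only
                    for y, x in region:
                        out[y][x] = 1
    return out
```

A region whose intensity is close to its local surroundings (a likely texture false positive) is suppressed, while regions with a pronounced local gray-level difference survive.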

Comparative Experiments of Different Methods
In order to evaluate the effectiveness of our method, we carried out comparative experiments between the proposed method and three other unsupervised methods: PHOT [37], AnoGAN [31] and MSCDAE [30]. All methods were trained and tested on the same datasets. The inspection results over eight different kinds of texture samples are illustrated in Figures 7-9.
The PHOT method is based on the observation that, in the Fourier representation of a signal, the spectral phase retains many important features of the original signal, whereas the spectral magnitude does not. Owing to its global reconstruction scheme, PHOT always introduces a high false alarm rate (Figures 7a,c, 8a and 9a). The second method, AnoGAN, applies a generative adversarial network to model the manifold of the defect-free samples; an inverse mapping from image space to latent space is then conducted to generate the reconstructed image for a given query sample, and a pixel-level comparison is used to segment the anomalous area. Because it does not consider the spatial dependencies among pixels, it yields a considerably high false positive rate, as shown in Figures 7c, 8b and 9a. The third method, MSCDAE, is another deep learning-based method, but its basic component is a convolutional denoising autoencoder (CDAE). Its reconstruction ability is realized through the CDAE, and a Gaussian pyramid is introduced to improve the robustness of the final inspection result. MSCDAE exhibits good performance on images where there is a great difference in gray level between defect and background, as shown in Figures 7a,b,d, 8a and 9c, but performs poorly in Figure 8b,c due to the imperfect reconstruction ability of the CDAE. In contrast to the above three methods, ours employs adversarial learning to enhance image reconstruction ability and introduces a regional analysis module to reduce the false positive rate. It showed superior performance on almost all images, as shown in the last rows of Figures 7-9.
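The reconstruction-plus-residual scheme that these methods share can be sketched in a few lines: subtract the model's reconstruction from the input pixel-wise and threshold the absolute residual into a binary candidate map. This is a minimal pure-Python illustration; the fixed threshold and the function name are assumptions, and the paper's actual thresholding rule may differ.

```python
def residual_segmentation(original, reconstructed, thresh=30):
    """Per-pixel absolute residual between an input image and its
    reconstruction, thresholded into a binary defect candidate map.
    Images are 2-D lists of gray values (illustrative representation)."""
    residual = [[abs(o - r) for o, r in zip(orow, rrow)]
                for orow, rrow in zip(original, reconstructed)]
    return [[1 if v > thresh else 0 for v in row] for row in residual]
```

A defect-free region is reconstructed faithfully and produces a small residual, while a defect the model cannot reproduce leaves a large residual and is flagged; the regional analysis step then prunes the remaining false positives.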

The quantitative comparison results are listed in Tables 4-6, where the bold data indicate better results. It can be seen that the three other methods fail in some cases. Specifically, the PHOT method has comparably low F1-measure values in Figures 7a-c and 8a; its global reconstruction strategy resulted in many falsely detected regions. The AnoGAN method has very low precision values in almost all cases, because it merely adopts a global reconstruction scheme and ignores spatial connections.
The MSCDAE method could hardly segment defects in Figure 8b,c because of its poor reconstruction ability over complex texture images.
The proposed method achieved superior segmentation performance in almost every test. The reasons can be summarized in two points: first, the adversarial learning-based training strategy improves the ability of image reconstruction; second, the regional analysis module removes the majority of falsely detected background regions.

Conclusions
In this paper, we propose an unsupervised learning-based method to detect and segment defects in texture images. A novel model based on a deep convolutional generative adversarial network (DCGAN) is utilized to reconstruct the input image; the obtained residual image is used to predict the positions of defects from a global view. An embedded local difference analysis module is presented to locate defects at the regional level, eliminating falsely detected areas. Finally, the pixel-level and regional-level predictions are synthesized to obtain the final result. A series of comparative experiments confirmed that the proposed method is effective and has a wide range of applications.
Our method showed superior performance on a wide range of texture surfaces. However, it still has difficulty detecting some confusing defects whose appearance is highly similar to the texture background. In the future, we would like to explore machine learning-based strategies to improve the distinctiveness between foreground and background areas.