Neural Subspace Learning for Surface Defect Detection

Abstract: Surface defect inspection is a key technique in industrial product assessments. Compared with other visual applications, industrial defect inspection suffers from a small sample problem and a lack of labeled data. Therefore, conventional deep-learning methods that depend on large numbers of labeled samples cannot be directly generalized to this task. To deal with the lack of labeled data, unsupervised subspace learning provides more clues for the task of defect inspection. However, conventional subspace learning methods focus on studying the linear subspace structure. In order to explore the nonlinear manifold structure, a novel neural subspace learning algorithm is proposed by substituting linear operators with nonlinear neural networks. The low-rank property of the latent space is approximated by limiting the dimensions of the encoded feature, and the sparse coding property is simulated by quantized autoencoding. To overcome the small sample problem, a novel data augmentation strategy called thin-plate-spline deformation is proposed. Compared with the rigid transformation methods used in previous literature, our strategy can generate more reliable training samples. Experiments on real-world datasets demonstrate that our method achieves state-of-the-art performance compared with unsupervised methods. More importantly, the proposed method is competitive and has a better generalization capability compared with supervised methods based on deep learning techniques.


Introduction
Visual inspection is a key step in surface-defect detection of industrial products for ensuring product quality. Compared with manual inspection, automated inspection systems based on computer vision are much more efficient and reliable. A vision-based automatic system can substantially reduce the manual labor a company needs for inspection. In this paper, we focus on the task of product surface defect detection, which is one of the most important steps in industrial manufacturing processes.
The majority of conventional defect detection methods can be summarized in four categories: statistical-based methods [1,2], structural-based methods [3], spectral-based methods [4] and model-based methods [5][6][7][8]. However, most conventional methods are built on hand-crafted visual features and concentrate on studying the linear subspace structure. They suffer from poor generalization and robustness, especially for cases in complex circumstances and variable illumination.
Recently, deep neural networks (DNNs) have demonstrated competitive performance in various fields. They have a powerful ability to extract high-level features. DNNs [9][10][11][12][13][14][15] have also achieved great improvements in the task of defect detection compared with traditional methods. Most existing DNN-based inspection methods are based on supervised learning, which implies that a large amount of manually annotated data is required. However, the collection of manually annotated data in industrial manufacturing processes is difficult and expensive. Although data augmentation technologies [16][17][18] based on generative adversarial networks (GANs) offer plausible solutions, their generative ability is limited by the number of samples, especially for industrial products, for which only several hundred or even a few dozen samples are available. Therefore, most existing DNN-based methods still suffer from the small sample problem and a lack of labeled data.
In this paper, a novel unsupervised defect detection method based on neural subspace learning is proposed. The proposed method is created by combining the clear mathematical theory of traditional subspace learning and the powerful learning ability of DNNs. The main assumption of the proposed method is that defect images can be decomposed into two main components: dominant content and sparse flaw regions. This hypothesis is reasonable and common for human-made products and is used in many vision tasks [19][20][21][22]. The dominant background component will be learned by the proposed neural subspace learning method, and the sparse defects will be computed by solving an $\ell_1$ variational regularization problem.
First, to deal with the lack of labeled data, an unsupervised neural subspace learning method is proposed. The proposed method strives to learn the dominant background component by exploring the latent manifold property of the data in an unsupervised way. Low-rank and sparse representation are two typical manifold structures in subspace learning. Compared with traditional linear subspace learning methods, the proposed method tries to explore the nonlinear subspace structure by integrating traditional low-rank representation theory and sparse coding into a deep autoencoder architecture. In detail, the low-rank property of the latent space is approximated by limiting the dimension of the encoded feature, and the sparse representation property is simulated by a quantized autoencoder.
Second, to deal with the small sample problem, a novel data augmentation strategy called thin-plate-spline deformation is proposed. Due to the fact that most defects to be detected have continuous contours, traditional rigid transformation based augmentation methods will introduce numerous samples in conflict with the ground truth data's distribution. In contrast, the proposed thin-plate-spline deformation method can generate more reliable training samples by non-rigid spline transformation.
In summary, the main contributions of the proposed method include:

1. A novel, non-rigid data augmentation method is proposed for surface defect detection. The proposed thin-plate-spline deformation method can generate more reliable training samples than rigid transformation based methods.
2. A novel, unsupervised neural subspace learning method is proposed by combining the clear mathematical theory of traditional subspace learning and the powerful learning ability of DNNs.
3. The proposed method achieves competitive performance and has better generalization than other methods.
This paper is organized as follows: Section 2 reviews the existing works of defect detection using supervised and unsupervised methods. In Section 3, we introduce the principle and algorithmic framework of the proposed method in detail. Section 4 contains the implementation details of the proposed algorithm and the experimental results. Conclusions are discussed in Section 5.

Related Work
Computer vision has been widely used in defect detection systems. In this section we will briefly review the related work about defect detection, including traditional methods based on hand-crafted features and data-driven deep learning methods.
The majority of traditional approaches can be divided into four categories: statistics-based approaches, structural-based approaches, filter-based approaches and model-based approaches [23]. We think these methods differ in their ability to extract high-level semantic features consistent with defects, especially the filter-based methods. Statistics-based methods detect defects by computing the spatial distribution of image pixels, such as histogram-of-oriented-gradient [24], co-occurrence matrix [1], and local-binary-pattern [2].
Structural-based approaches mainly focus on the spatial location of textural elements. The main methods are primitive measurement [25], edge features [26], skeleton representation [27] and morphological operations [3]. Filter-based methods employ filter banks to generate features that consist of filter responses. The main methods include spatial domain filtering [28], frequency domain analysis [29] and joint spatial/spatial-frequency analysis [4]. Model-based approaches try to obtain certain models with special distributions or attributes, which requires a high computational complexity. The main methods include the fractal model [5], random field model [6], texem model [7], auto-regressive model [8] and the modified PCA model [22]. In particular, the modified PCA model [22] demonstrated that defect-free regions usually have low-rank attributes.
Deep-learning techniques have demonstrated great advantages in the task of defect detection. These works can be classified into supervised learning and unsupervised methods according to whether the annotated data are required.
Supervised learning. Song et al. [9] proposed EDR-Net, which utilized the salient object detection method for strip surface defects. Zhang et al. [30] leveraged a real-time surface-defect segmentation network (FDSNet) based on a two-branch architecture to locate the flaw areas. Dong et al. [10] proposed a pyramid feature fusion network with global attention for surface-defect detection. Augustauskas et al. [31] proposed a pixel-level defect segmentation network by using the residual connection and attention gate. Although supervised learning methods obtain good results, they require a large amount of manually annotated data. The collection of manually annotated data in industrial manufacturing processes is difficult and expensive. Most existing DNN-based methods therefore suffer from the small sample problem and the lack of labeled data.
Unsupervised learning. The auto-encoder [32], an effective and powerful kind of artificial neural network, can learn high-level features through its encoder-decoder architecture. It has been studied for the task of defect detection. Chow et al. [33] trained an auto-encoder on defect-free images and argued that defective pixels would obtain a high reconstruction error. Mujeeb et al. [34] chose binary cross-entropy as a loss function, which measured the distribution error between the reconstructed result and the original data. Tian et al. [35] used a cross-patch similarity loss and iteratively chose the best latent variables. Bergmann et al. [36] proposed the use of a perceptual loss function for examining interdependencies among image regions. The loss function was defined to measure the structural similarity by taking into account luminance, contrast, and structural information. In addition, Mei et al. [11] proposed a convolutional denoising auto-encoder network by utilizing multiple Gaussian pyramid features. Similar ideas based on multi-scale features are also used in the literature [37,38]. Yan et al. [39] proposed an adversarial auto-encoder network to monitor defective regions in rolling production. The method combined the power of GAN and the variational auto-encoder, and could serve as a nonlinear dimension reduction technique.
Despite the powerful learning ability of the auto-encoder, most existing unsupervised methods do not take into account the prior manifold structure of the data space, such as low-rank and sparse representation. In this paper, a novel unsupervised defect detection method is proposed by integrating the structural priors of low-rank and sparse representation into the auto-encoder architecture.

Methodology
In order to deal with the small sample problem and the lack of labeled data in industrial defect inspection, a novel unsupervised neural subspace learning method is proposed in this section. The pipeline of the proposed method is shown in Figure 1. Firstly, a novel data augmentation method based on non-rigid thin-plate-spline transformation is proposed to deal with the small sample problem. Secondly, a novel auto-encoder regularized by low-rank and sparse representation priors is designed to learn the dominant feature of the image background. Finally, the defects can be detected by solving an $\ell_1$ variational problem.

Data Augmentation
Previous researchers usually use rigid transformations, such as flipping, rotating and cropping operations, to augment training datasets. However, because most defects in the training data have continuous contours, these rigid operations do not generate genuinely "new" samples. In this paper, a novel data augmentation method based on the thin-plate-spline (TPS) deformation method [40] is proposed.
We model the image as a function defined on a 2-dimensional grid $X$. In order to generate highly reliable samples similar to the true defect images, we first compute a shifted grid $Y$ by shifting each grid point of $X$ by a distance sampled randomly from a uniform distribution. The TPS deformation technique is then used to fit a smooth warping function between the original grid $X$ and $Y$ to obtain a more realistic twist. In detail, the warping transformation $f(\cdot)$ can be computed by solving the following optimization problem

$$\min_{f} \sum_{i} \left\| y_i - f(x_i) \right\|^2 + \beta \iint \left( f_{xx}^2 + 2 f_{xy}^2 + f_{yy}^2 \right) dx \, dy, \tag{1}$$

where $x_i \in X$ and $y_i \in Y$ are corresponding grid points, $f_{xx}$, $f_{xy}$ and $f_{yy}$ denote the second order partial derivatives of the warping function $f$, and $\beta$ is a regularization constant determining the smoothness of the warp. We find that satisfactory results can be obtained when $\beta$ is randomly generated from the interval $[-0.1, 0.1]$.
Equation (1) can be effectively solved by the method in [41]. The augmented training data set is generated through random TPS warping. Figure 2a shows some augmented results of the proposed augmentation method on different kinds of surfaces. The curve distortions produced by TPS do not deviate from the defect characteristics in the ground truth training data, so more reliable training samples can be generated by the non-rigid transformation.
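The random grid shift and TPS fit described above can be sketched as follows. This is a minimal illustration, not the solver of [41]: it assumes the standard radial basis $U(r) = r^2 \log r$, treats $\beta$ as a non-negative ridge term added to the kernel matrix (equivalent to the smoothing weight only up to a constant scaling), and uses hypothetical helper names (`fit_tps`, `apply_tps`, `random_tps_grid`). Resampling the image at the warped coordinates is omitted.

```python
import numpy as np

def tps_kernel(r):
    # Thin-plate-spline radial basis U(r) = r^2 * log(r), with U(0) = 0.
    out = np.zeros_like(r)
    mask = r > 0
    out[mask] = r[mask] ** 2 * np.log(r[mask])
    return out

def fit_tps(src, dst, beta=0.0):
    """Fit a 2-D TPS that maps control points src (k, 2) onto dst (k, 2)."""
    k = src.shape[0]
    dists = np.linalg.norm(src[:, None, :] - src[None, :, :], axis=-1)
    K = tps_kernel(dists) + beta * np.eye(k)   # beta > 0 relaxes exact interpolation
    P = np.hstack([np.ones((k, 1)), src])      # affine part [1, x, y]
    A = np.zeros((k + 3, k + 3))
    A[:k, :k] = K
    A[:k, k:] = P
    A[k:, :k] = P.T
    b = np.zeros((k + 3, 2))
    b[:k] = dst
    return np.linalg.solve(A, b)               # k kernel weights, then 3 affine rows

def apply_tps(params, src, pts):
    """Evaluate the fitted warp at arbitrary points pts (q, 2)."""
    k = src.shape[0]
    dists = np.linalg.norm(pts[:, None, :] - src[None, :, :], axis=-1)
    affine = np.hstack([np.ones((len(pts), 1)), pts])
    return tps_kernel(dists) @ params[:k] + affine @ params[k:]

def random_tps_grid(grid_size=4, max_shift=0.1, beta=0.0, seed=0):
    """Shift a regular grid on [0, 1]^2 by uniform noise and fit the warp."""
    rng = np.random.default_rng(seed)
    axis = np.linspace(0.0, 1.0, grid_size)
    src = np.stack(np.meshgrid(axis, axis), axis=-1).reshape(-1, 2)
    dst = src + rng.uniform(-max_shift, max_shift, size=src.shape)
    return src, dst, fit_tps(src, dst, beta)

src, dst, params = random_tps_grid()
warped = apply_tps(params, src, src)           # with beta = 0 this reproduces dst
```

With `beta=0.0` the warp interpolates the shifted grid exactly; a small positive `beta` trades that exactness for a smoother deformation.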

Neural-Subspace Learning with Low-Rank and Sparse Representation Priors
The proposed neural subspace learning method combines the nonlinear learning ability of neural networks and manifold structure priors whose theories and effects have been proved and evaluated in traditional subspace learning [42,43]. By utilizing the priors of manifold structure regularization, such as low-rank and sparse representation, the proposed method can accomplish defect detection in an unsupervised way.
Given the training data $X \in \mathbb{R}^{m \times n}$ composed of a collection of defect images $X = [X_1, X_2, \cdots, X_n]$ without any annotation, where $X_i \in \mathbb{R}^m$ is the $i$-th training sample, $m$ is the dimension of each sample and $n$ is the number of training samples, we assume that $X$ can be decomposed into two components: dominant content and sparse flaw regions. This hypothesis is reasonable and common in many vision tasks [19][20][21][22]. In order to explore the intrinsic manifold structure of the dominant component, low-rank representation is a popular choice [44][45][46] in traditional subspace learning, as shown in Equation (2),

$$\min_{Z, N} \|Z\|_* + \lambda \|N\|_1 \quad \text{s.t.} \quad X = XZ + N, \tag{2}$$

where $Z \in \mathbb{R}^{n \times n}$ is the self-representation coefficient matrix, $N$ represents the sparse flaw component, $\lambda$ is a constant regularization parameter for adjusting the degree of sparsity, $\|\cdot\|_*$ denotes the nuclear norm, which is a convex approximation of matrix rank, and $\|\cdot\|_1$ denotes the sparse elementwise $\ell_1$ norm. Although the nuclear norm is an optimal convex approximation of matrix rank, a complex and expensive singular value decomposition operation is required in each iteration. Therefore, another approximation based on the Frobenius norm is proposed [47],

$$\min_{U, V, N} \frac{1}{2}\left( \|U\|_F^2 + \|V\|_F^2 \right) + \lambda \|N\|_1 \quad \text{s.t.} \quad X = XUV + N, \tag{3}$$

where $U \in \mathbb{R}^{n \times d}$, $V \in \mathbb{R}^{d \times n}$, $d \ll n$ and $\|\cdot\|_F$ is the Frobenius norm. In addition to the reduction of optimization complexity, we can learn more from Equation (3). The matrices $U$ and $V$ can be seen as two transformations, i.e., data $X$ is first encoded to a low-dimensional space by $XU$, then another transformation $V$ maps the latent feature back to the original data space by $(XU)V$. It is interesting to find that the main idea of optimization problem (3) coincides with the popular autoencoder, i.e., $U$ and $V$ can be seen as encoder and decoder functions. One of the main differences between subspace learning model (3) and the deep autoencoder is that traditional subspace learning methods focus on studying the linear subspace structure, while the autoencoder is designed to learn the nonlinear manifold structure.
However, existing autoencoder-based methods for defect detection ignore the powerful manifold priors used in traditional subspace learning, and they suffer from the small sample problem.
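The Frobenius-norm approximation rests on a well-known identity: for any factorization $Z = UV$, $\|Z\|_* \le \frac{1}{2}(\|U\|_F^2 + \|V\|_F^2)$, with equality for the balanced factorization built from the singular value decomposition. The short numerical check below illustrates this on a random rank-$d$ matrix; the sizes and variable names are chosen only for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 2
Z = rng.standard_normal((n, d)) @ rng.standard_normal((d, n))  # rank-d matrix

# Balanced factorization from the SVD: U = A * sqrt(S), V = sqrt(S) * B^T.
A, s, Bt = np.linalg.svd(Z, full_matrices=False)
A, s, Bt = A[:, :d], s[:d], Bt[:d, :]      # keep the d nonzero singular values
U = A * np.sqrt(s)                         # n x d, plays the role of the encoder
V = np.sqrt(s)[:, None] * Bt               # d x n, plays the role of the decoder

nuclear_norm = np.linalg.norm(Z, ord="nuc")
frobenius_value = 0.5 * (np.linalg.norm(U, "fro") ** 2 + np.linalg.norm(V, "fro") ** 2)

# An unbalanced factorization of the same Z gives a larger Frobenius value.
unbalanced_value = 0.5 * (np.linalg.norm(3.0 * U, "fro") ** 2
                          + np.linalg.norm(V / 3.0, "fro") ** 2)
```

Here `frobenius_value` matches `nuclear_norm` up to floating-point error, while `unbalanced_value` is strictly larger, which is why minimizing the Frobenius term over factorizations can stand in for the nuclear norm without a singular value decomposition per iteration.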
In this section, a novel neural subspace learning framework is proposed by integrating the deep autoencoder into the traditional subspace learning method. The proposed neural subspace learning is formulated in Equation (4), where E(·) and D(·) are the encoder network and decoder network, respectively, and θ denotes the parameters of networks E(·) and D(·). Firstly, the linear transformations U and V in traditional subspace learning are substituted by deep neural networks E(·) and D(·). Benefiting from the nonlinear learning potential of neural networks, the proposed method can learn high-level feature representations compared with linear learning methods.
Secondly, in order to make full use of the manifold structure prior, we integrate the deep autoencoder into the traditional learning framework (3). On the one hand, this strategy improves the interpretability of the proposed algorithm. On the other hand, the proposed method can accomplish defect detection in an unsupervised way.
Two kinds of manifold structure priors are utilized in this paper, namely low-rank and sparse representation. As the dominant defect-free surfaces of industrial products are always simple and homologous, low-rank is a proper latent prior. Because the encoder and decoder do not have explicit matrix expressions, it is hard to optimize the Frobenius-norm term in Equation (3) directly, so we apply two tricks in the design of the neural network to accomplish the low-rank regularization. The first trick is to implicitly impose a rank constraint on the learned representation by limiting the dimensions of the encoded feature F = E(X) to a relatively small constant $d \ll n$. This implies that the rank of the encoded feature matrix E(X) is at most d. The second trick is to remove all nonlinear activations in the decoder to ensure that the rank cannot be magnified after a series of linear decoding transformations.
The low-rank prior provides a global regularization for the learned feature space. Sparse representation, formulated in Equation (5) as a minimization over P and α, can explore more clues about the local relationship of the image, where P is the overcomplete dictionary and α is the representation coefficient. In order to integrate the sparse representation prior into the autoencoder, we propose a neural sparse coding framework inspired by the idea of the vector quantized variational autoencoder (VQ-VAE) [48]. First, the dictionary P can be simulated by the optimal codebook in VQ-VAE. The learned dictionary P can be regarded as a set of basis vectors of the latent space extracted by the encoder network. Instead of choosing the nearest code as introduced in reference [48], we propose to learn a sparse representation coefficient vector $C_s$ by feeding the encoded feature into a shallow multilayer perceptron (MLP) φ as shown in Figure 3, where $C_s = φ(F)$. To keep the main component learned by VQ-VAE, a regularization loss based on the $\ell_1$ distance between the learned sparse coefficient $C_s$ and the one-hot vector $C_o$ learned by VQ-VAE is introduced. The encoded feature F can be reconstructed by the sparse representation $\hat{F} = P C_s$. Finally, $\hat{F}$ will be fed into the decoder network to produce the dominant content by $D(\hat{F})$.
The total loss function for the deep neural-subspace learning method is summarized in Equation (6).
In summary, the proposed model strives to make full use of the advantages of both subspace learning and the neural autoencoder. On the one hand, the method can learn high-level features by utilizing a deep neural network. On the other hand, the method can accomplish unsupervised defect detection by integrating low-rank and sparse representation priors into the neural network.
The proposed model can be trained by alternate optimization. First, given an initial N, the parameters of the convolutional auto-encoder network and the MLP φ can be optimized by minimizing Equation (7). Next, fixing the parameters of all neural networks, the flaw component can be optimized by minimizing Equation (8). This optimization problem can be solved analytically by the soft thresholding operator. Equations (7) and (8) are alternately optimized until the derivative of the overall objective is below a certain threshold or the maximum iteration number is reached. In the testing phase, given a defect image I, its flawless image can be directly obtained by D(E(I)) and the noise N can be calculated by Equation (8).
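The effect of the two tricks can be checked numerically. The sketch below is an assumption-laden toy, not the network used in the paper: it stores one sample per row, uses a single random `tanh` layer as a stand-in encoder with bottleneck width `d`, and compares a purely linear two-layer decoder against the same decoder with an activation inserted between the layers.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, d = 50, 30, 4                       # samples, input dimension, bottleneck width

X = rng.standard_normal((n, m))           # one sample per row (layout chosen for this sketch)
W_enc = rng.standard_normal((m, d))
F = np.tanh(X @ W_enc)                    # nonlinear encoder stand-in, output is n x d

W1 = rng.standard_normal((d, 16))         # two stacked decoder layers
W2 = rng.standard_normal((16, m))

X_hat_linear = F @ W1 @ W2                # no activation: rank stays bounded by d
X_hat_nonlinear = np.tanh(F @ W1) @ W2    # activation between layers can raise the rank

rank_linear = np.linalg.matrix_rank(X_hat_linear)
rank_nonlinear = np.linalg.matrix_rank(X_hat_nonlinear)
```

In this toy setting `rank_linear` never exceeds `d`, whereas `rank_nonlinear` does, which matches the motivation for removing the activations from the decoder.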
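The data flow of the neural sparse coding step can be sketched for a single feature vector as follows. Everything numerical here is a placeholder: the codebook is random rather than learned, the MLP φ is an untrained one-hidden-layer network with an assumed hidden width of 16, and the codebook size `K` is illustrative. Only the relationships between the codebook, the one-hot code, the coefficient vector and the reconstructed feature follow the description above.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K = 8, 32                                   # feature dimension, codebook size (illustrative)
P = rng.standard_normal((d, K))                # codebook / dictionary, one atom per column
F = rng.standard_normal(d)                     # encoded feature of one sample

# VQ-VAE style assignment: one-hot vector selecting the nearest atom.
nearest = int(np.argmin(np.linalg.norm(P - F[:, None], axis=0)))
C_o = np.zeros(K)
C_o[nearest] = 1.0

# Stand-in for the shallow MLP phi (random, untrained weights).
W1 = rng.standard_normal((16, d))
W2 = rng.standard_normal((K, 16))

def phi(f):
    return W2 @ np.maximum(W1 @ f, 0.0)        # ReLU hidden layer, linear output

C_s = phi(F)                                   # learned coefficient vector
l1_regularizer = np.abs(C_s - C_o).sum()       # pulls C_s toward the one-hot code
F_hat = P @ C_s                                # reconstructed feature passed to the decoder
```

During training the regularizer would be minimized together with the reconstruction loss, so that `C_s` stays close to the sparse one-hot code while still being free to mix a few atoms.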
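The closed-form update of the flaw component can be illustrated with the soft thresholding operator. The sketch assumes the subproblem has the usual form $\frac{1}{2}\|R - N\|_F^2 + \lambda \|N\|_1$, where $R$ stands for the residual between the input and the reconstructed dominant content; that exact form is an assumption made for this example, and the residual values are made up. The threshold 0.1 is the value of λ reported in the experiments.

```python
import numpy as np

def soft_threshold(R, lam):
    """Elementwise minimizer of 0.5 * ||R - N||_F^2 + lam * ||N||_1 over N."""
    return np.sign(R) * np.maximum(np.abs(R) - lam, 0.0)

# Hypothetical residual between an input patch and its reconstruction.
residual = np.array([[0.05, -0.30],
                     [0.80, -0.08]])

N = soft_threshold(residual, lam=0.1)
# Entries with magnitude below lam become zero; the others shrink toward zero by lam.
```

Small residuals, which are more likely reconstruction noise than defects, are suppressed entirely, so the resulting `N` is sparse.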

Experiments
In this section, we conduct experiments on two defect datasets [49,50] and compare the results with state-of-the-art methods: PCA [51], the structural similarity autoencoder (SSIMAE) [36], the feature augmented VAE (FAVAE) [38], the encoder-decoder residual network (EDR) [9] and the surface-defect segmentation network (FDSNet) [30]. All experiments are implemented using PyTorch and trained on a single Nvidia GeForce RTX 2080 Ti GPU on an Ubuntu 20 system. The learning rate of the Adam optimizer is 0.001 and the constant parameters λ and µ are set to 0.1 and 200, respectively. The number of training epochs is set to 100 with the batch size set to 64. We follow the rule of taking 80% of the samples for training and 20% for testing.

SD-Saliency-900
The SD-Saliency-900 dataset [49] concerns defect detection on strip steel. In this dataset, we randomly pick 480 images for training and 120 images for testing. For each training image, we use the TPS technique to generate two deformed images. Therefore, we obtain a total of 480 × 3 training images. Before training, we initialize the noise N as a zero matrix with the same size as the input image. Predicting the corresponding clean image and defect region for a given image takes only 6 ms.

CrackForest
The CrackForest dataset [50] contains 118 defect images of cracked roads. In this dataset, we randomly select 94 images for training and 24 images for testing. First, we transform the original color image into a grayscale image with the size of 128 × 128. Then, we generate five images for each training image by using the TPS deformation method. In this way, we add 5 × 94 = 470 deformed images to the original training set and obtain 564 training images. In this dataset, the initial noise N is also set to a zero matrix. It only takes 6 ms to predict the defect area of an image.

Evaluation
We use four quantitative metrics to evaluate the performance of different methods in this paper: accuracy, recall, precision and F1-score, which are defined by

$$\begin{aligned}
\text{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN}, \\
\text{Recall} &= \frac{TP}{TP + FN}, \\
\text{Precision} &= \frac{TP}{TP + FP}, \\
\text{F1-score} &= \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}},
\end{aligned}$$

where true positive (TP) refers to positives that are correctly identified, and true negative (TN) means negatives that are correctly identified. False positive (FP) indicates negatives that yield positive test results, while false negative (FN) is a test result that falsely indicates that a condition does not hold. Accuracy is the proportion of correct predictions among the total number. Recall measures the proportion of positives that are correctly identified, and Precision is the fraction of relevant instances among the retrieved instances. The F1-score combines the two as the harmonic mean of Precision and Recall.
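As a small worked example of these four metrics, the sketch below counts TP, TN, FP and FN for a toy pair of binary defect maps flattened to lists. The function name `pixel_metrics`, the toy labels, and the convention of returning 0.0 when a denominator is zero are choices made for this illustration.

```python
def pixel_metrics(pred, truth):
    """Compute accuracy, recall, precision and F1 from 0/1 labels (1 = defect)."""
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    tn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 0)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)

    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "recall": recall, "precision": precision, "f1": f1}

# Toy example: 8 pixels, with TP = 2, FP = 1, FN = 1, TN = 4.
pred = [1, 1, 0, 0, 1, 0, 0, 0]
truth = [1, 0, 0, 0, 1, 1, 0, 0]
scores = pixel_metrics(pred, truth)
```

For this toy input the accuracy is 6/8 = 0.75, while recall, precision and F1-score are all 2/3.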

Experimental Results
In this section, the method proposed in this paper is first compared with three unsupervised methods by qualitative and quantitative analysis. We also compare the proposed method with two state-of-the-art supervised methods to evaluate its performance and generalization. Finally, an ablation study is conducted to illustrate the advantages of the proposed data augmentation method based on the TPS deformation technique.
Table 1 displays the performance of different defect detection methods on the SD-Saliency-900 and CrackForest datasets. Our method achieves the best results in all four metrics on the SD-Saliency-900 dataset. On the CrackForest dataset, our method still obtains the best performance in three metrics. For example, compared with PCA, SSIMAE and FAVAE, the F1-score has increased by at least 20 points on the SD-Saliency-900 dataset and almost 30 points on the CrackForest dataset. Although the PCA-based method gets the highest score in the Precision metric, its Recall lags far behind the other methods. Some defect detection samples are displayed in Figures 4 and 5 for different methods on the test sets of SD-Saliency-900 and CrackForest. For each method, we first separate the original defect image into the clean image and the defect foreground image. Then the binary defect map is obtained by a threshold operation. We can find that the FAVAE method marks too many flawless areas as defects because it heavily depends on the heat map, the detection results of the SSIMAE method contain too much noise because they are obtained by comparing the similarity of local patches in the image, and the PCA method is very suitable for detecting small defects of the road surface. Compared with these methods, our approach achieves the best detection results.

Comparison with Supervised Method
In order to further demonstrate the effectiveness of our unsupervised method, the proposed approach is compared with state-of-the-art defect detection methods, EDR [9] and FDSNet [30]. Both are supervised learning methods. As FDSNet needs an additional data pre-processing step and its authors only provide results on the SD-Saliency-900 dataset, we only compare with it on that dataset. Table 2 shows the results of different methods. Although our method is weaker than the supervised methods on the SD-Saliency-900 dataset, the proposed method obtains satisfactory results on the CrackForest dataset, especially on the Recall and F1-score metrics, which are important for product quality. These results are acceptable because supervision helps the network to extract more discriminative features. The visualization results of some test samples are shown in Figures 6 and 7. It can be seen that the detection results of these three methods are very similar. However, supervised learning methods require a lot of labor to label the data. Our method uses an unsupervised strategy, which is more suitable for industrial production.
In addition, in order to compare the generalization of these methods on a new dataset, we use a cross-domain strategy for testing, namely, we train the model on dataset A and test it on dataset B. Table 3 gives the quantitative results of the EDR method, the FDSNet method and ours. It can be found that the supervised EDR and FDSNet methods have very poor performance on the new dataset. By comparison, our method performs better. Compared with the results without the cross-domain test in Table 1, the performance has not dropped much in the Precision metric. Figures 8 and 9 show some visualization results obtained with the cross-domain strategy. From these figures, we can find that the EDR method is completely ineffective in detecting defects, whereas our method still performs well.

Comparison among Different Data Augmentation Strategy
In order to verify the robustness and effectiveness of thin-plate-spline deformation for data augmentation, we conduct experiments on the SD-Saliency-900 dataset, training the same network on the dataset without augmentation, with rigid augmentation, with non-rigid augmentation, and with both rigid and non-rigid augmentation, respectively. On the same training dataset, we generate 960, 960 and 1920 additional images through TPS deformation, rigid transformation and the combination of the two operations, respectively. The performance on the same test dataset is shown in Table 4. We can find that the prediction performance of the network trained without data augmentation falls short. Non-rigid transformation brings larger improvements than rigid operations, for example in Precision and F1-score, while the benefit of combining the two operations is not obvious. Therefore, we do not adopt the combination strategy in this paper.

Conclusions
In this paper, a novel unsupervised defect detection method based on neural subspace learning is proposed. The proposed method combines the clear mathematical theory of traditional subspace learning and the powerful learning ability of DNNs. According to the basic assumption that defect images can be decomposed into dominant content and sparse flaw regions, two typical manifold priors, low-rank and sparse representation, are incorporated into the deep neural network. In addition, a new data augmentation method called thin-plate spline deformation is proposed in this paper. Experiments on two datasets demonstrate that the proposed method achieves competitive performance on defect detection tasks.

Conflicts of Interest:
The author declares no conflict of interest.