1. Introduction
As one of the applications of machine vision, surface-defect detection is more difficult than target or object detection because of the complex shapes of surface defects, the small amount of defect data, poor detection environments, etc. [1,2,3,4]. Traditional image processing methods, such as Sobel [5], Canny [6], Prewitt [7], and LBP [8], can quickly acquire some features of surface defects and use these features to match and recognize defects. However, these features are strongly influenced by noise, lighting and complex backgrounds [9], so the preconditions for good performance are too strict to meet in practice. In addition, classic machine learning methods (support vector machine, SVM [10,11,12], etc.), which need feature engineering, are difficult to use in defect detection, owing to the wide variety of defects, random defect shapes, unfixed defect positions and varying defect severity. In contrast, deep learning is suitable for defect detection since it is rarely affected by the environment, does not require feature engineering, and only needs raw images to complete the task end to end [13]. In surface-defect detection, a neural network can analyze complex image features and give an accurate and detailed multidimensional representation. Moreover, deep learning transfers well: detection of different defects can be achieved by fine-tuning with only a small amount of data [14,15].
Defect detection based on deep learning involves the following current research topics: (1) Defect detection with small datasets [13]: since defect data are usually limited, working with a small number of samples is almost a prerequisite for surface-defect detection. (2) Online real-time defect detection [16]: since defect detection at an industrial site is basically a pipeline operation, there is a certain speed requirement (samples per second when processing in real time). (3) Defect detection based on physical inference [13]: due to the lack of defect data, data-driven inference is difficult to improve further, and integrating physics-based inference is likely to be the key to improving detection accuracy.
PSIC-Net performs pixel-level segmentation of defects and image-level classification of images as defective or non-defective. The proposed network model is mainly aimed at surface defects of industrial products, such as scratches and depressions on the surface of metal products, and discoloration and stains on the surface of textured products. This end-to-end convolutional neural network model completes the two tasks of defect segmentation and defect classification through a three-stage network architecture (the feature extraction network, the inverse convolution network and the classification network). This three-stage architecture can acquire key features from a small number of defective training samples and achieves high segmentation and classification accuracy through an improved loss function and the design of the training mode. In addition, since the segmentation and classification networks share most of the convolutional layers, the inference time is reduced. According to the experimental results, the model can meet the real-time requirements of an industrial assembly line.
In this study, three datasets (KolektorSDD [17], KolektorSDD2 [18] and DAGM [19]) are used. They contain different defects: KolektorSDD mainly contains scratches on metal surfaces; KolektorSDD2 has several types of defects on metal surfaces; DAGM has artificial defects on textured surfaces. Some examples can be seen in Figure 1.
2. Related Work
Studies on deep learning–based surface defect detection are very extensive, and can be roughly divided into three categories, according to specific functions: defect classification, defect detection and defect segmentation.
Defect classification [20,21,22] feeds the raw image into a deep classification network, whose output is a binary judgment of whether the image contains defects. In the field of computer vision, this method is called image classification; its training requires a relatively small amount of data, and the data are not difficult to label. This method, however, cannot locate and segment a defect in the image, and cannot deal with an image containing several different types of defects.
Defect detection [23,24,25,26] is an improved version of defect classification in which image preprocessing is added. The raw image is first divided into several patches. Each patch is passed through a neural network, which outputs whether the patch contains a defect. Finally, defective patches are framed in the raw image to obtain a rough localization of the defects. Because this method divides the raw image only with sliding windows, it locates defects efficiently. However, the annotation difficulty of the training data also increases, and it is difficult to choose an appropriate window size for defects of different scales.
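The patch-based localization described above can be sketched as follows. This is a minimal illustration, not the implementation used in the cited works: the function names, the border-handling rule and the `is_defective` callback (a stand-in for a trained patch classifier) are all assumptions.

```python
def sliding_windows(height, width, patch, stride):
    """Yield (top, left, bottom, right) boxes that tile an image."""
    if patch > height or patch > width:
        raise ValueError("patch must fit inside the image")
    tops = list(range(0, height - patch + 1, stride))
    lefts = list(range(0, width - patch + 1, stride))
    # Assumed border rule: add a last window flush with the edge
    # so that no pixels are left uncovered.
    if tops[-1] != height - patch:
        tops.append(height - patch)
    if lefts[-1] != width - patch:
        lefts.append(width - patch)
    for top in tops:
        for left in lefts:
            yield (top, left, top + patch, left + patch)


def localize(boxes, is_defective):
    """Keep the boxes that a (hypothetical) patch classifier flags."""
    return [box for box in boxes if is_defective(box)]
```

The `patch` and `stride` arguments make the window-size difficulty concrete: a patch that is too small sees only part of a large defect, while one that is too large dilutes a small defect with background.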
Defect segmentation [17,27,28,29,30,31,32,33,34,35] usually refers to pixel-wise segmentation, that is, judging whether each pixel belongs to a defect and then segmenting the defective pixels from the raw image. This method locates defects as accurately as possible, at the cost of pixel-by-pixel annotation of the training data.
Since defect segmentation and defect classification are both applied in this paper, the following mainly introduces several representative studies similar to the method in this paper.
Tabernik et al. [17] detect cracks on the surface of industrial products and propose a two-stage network, which includes a segmentation network and a decision network. The first stage is a segmentation network that locates surface cracks at the pixel level. The second stage is a decision network, which infers whether the image contains defects. Its input is the output of the feature extraction network combined with the output of the segmentation network. The network is trained and tested on the KolektorSDD dataset. This method achieves satisfactory detection accuracy with a small dataset, obtaining 99.9% average precision with an inference time of about 10 ms. However, its shortcoming is that the precision of the segmentation part is not adequate, and the segmentation output is only 1/8 the size of the raw image.
The study in [36] optimizes the segmentation-based two-stage neural network model of [17]. It reduces the training time and improves the accuracy of surface-defect detection by introducing an end-to-end training mode. The average precision on DAGM and KolektorSDD almost reaches 100%. However, segmentation serves only as a means to improve the classification accuracy; the segmentation results themselves are neither measured nor optimized.
Bozic et al. [18] improve the model in [36] so that it can combine weakly supervised learning on image-level labels with strongly supervised learning on pixel-level labels. This hybrid supervised model can find a balance between annotation difficulty and classification accuracy, which is of great significance for practical industrial applications. The model uses a two-stage network to output segmentation results and classification results; the classification accuracy on three datasets almost reaches 100%. The disadvantage is that the study does not focus on the segmentation, which is neither measured nor optimized.
Tao et al. [34] propose an algorithm for defect segmentation and defect classification. The algorithm is divided into detection and classification modules. Specifically, the detection module uses a cascaded autoencoder (CASAE) to segment the defects, and the classification module uses a tiny CNN to classify the defects. This method uses 50 raw images containing defects and expands the training data to 3000 images through data augmentation. The problem of defect regions being too small to locate is solved by using the weighted cross-entropy loss function. The segmentation accuracy reaches 89.60% and the classification accuracy reaches 86.82%.
He and Liu [27] propose a general industrial defect detection framework based on regression and classification, which completes the tasks of defect segmentation and defect classification through a detection module and a classification module, respectively. The detection module is an improvement of ResNet18 [37], and its output layer is a linear regression unit. Since the classification module requires fewer computations than the detection module, it uses the more complex ResNet101 [37] to improve the classification accuracy. In this method, 38 images in AigleRN [38] and 1150 images in DAGM are adopted as experimental data. The final average F-measure values are 93.75% and 91.50%, respectively, and the mean IoU of segmentation is 84.50%.
Dong et al. [31] propose a pixel-level surface-defect detection network: PGA-NET. First, this network extracts multi-scale features from the backbone network and fuses features with different resolutions by pyramid feature fusion. Then, effective information is transferred from a low-resolution feature map to a high-resolution feature map by a global context attention mechanism. A boundary refinement module further improves the accuracy of the defect segmentation. The mean IoU of the segmentation results reaches 82.15% on NEU-SEG [39], 74.78% on DAGM, 71.31% on MT defect [40], and 79.54% on Road defect [41].
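For readers unfamiliar with the metric quoted above, the following is a small sketch of intersection over union (IoU) for binary masks. It assumes that "mean IoU" is the average of the per-class IoU over the defect and background classes, which is one common convention; the cited papers may define it differently.

```python
def iou(pred, truth, cls):
    """IoU of one class between two label masks (lists of rows)."""
    inter = union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            inter += (p == cls and t == cls)
            union += (p == cls or t == cls)
    # Assumed convention: a class absent from both masks scores 1.0.
    return inter / union if union else 1.0


def mean_iou(pred, truth, classes=(0, 1)):
    """Average the per-class IoU (0 = background, 1 = defect)."""
    return sum(iou(pred, truth, c) for c in classes) / len(classes)
```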
Liong et al. [33] propose an automatic detection system for leather defects. This system adopts a machine vision method based on a convolutional neural network architecture to identify the location of leather defects and then predicts each defect instance. In order to make the boundary segmentation more accurate, this study also acquires the boundary from the deduction of geometric graphics. The segmentation accuracy of this algorithm for test data reaches 70.35%.
Compared with related methods, PSIC-Net combines defect segmentation and defect classification, accounts for the difficulty of having only a small number of samples, and performs well in real-time detection. The network shares the convolutional layers of the feature extraction network, and the two subsequent parts handle defect segmentation and defect classification independently. This not only saves time, but also allows each task to be refined separately.
3. Methods
This paper proposes a convolutional neural network model suitable for surface-defect segmentation and classification: the pixel-wise segmentation and image-wise classification network (PSIC-Net). Built on a three-stage network architecture, this model can extract key features, spatial location information and semantic information, and complete the defect segmentation and image classification tasks. The model adopts a two-step training mode so that the parameters of the segmentation network and the classification network do not constrain each other, which avoids interference and non-convergence. Moreover, the model improves the loss function in the training process so that the parameters can converge quickly and accurately.
3.1. Network Framework
The framework of PSIC-Net is mainly divided into three parts, as shown in Figure 2. The first part is the feature extraction network, which consists of 10 convolutional layers and 3 maximum pooling layers. Each convolutional layer is followed by batch normalization (BN) and a rectified linear unit (ReLU). The second part is the inverse convolution network, which connects to the last layer of the feature extraction network through a 1 × 1 convolutional layer. It consists of 6 deconvolutional layers (three of which are used for 2× up-sampling), 2 element-wise addition layers and 2 convolutional layers. The third part is the classification network, which consists of 3 maximum pooling layers, 3 convolutional layers, 4 global pooling layers and 1 fully connected layer. The input of the classification network concatenates the last layer of the feature extraction network and the output layer of the inverse convolution network. Finally, it outputs the probability of defects.
3.1.1. Feature Extraction Network
The feature extraction network is composed of 10 convolutional layers and 3 maximum pooling layers. Each maximum pooling layer halves the resolution of the image, so the final feature map is 1/8 the size of the original image. The first nine convolutional layers use 5 × 5 kernels, and the tenth layer uses a 15 × 15 kernel. The first and second layers have 32 channels, the third to fifth layers 64 channels, the sixth to ninth layers 128 channels, and the tenth layer 1024 channels. Thus, the number of layers per stage and the number of channels both increase gradually with depth. This structure extracts semantic features in the deep layers while retaining good spatial location features in the shallow layers, which benefits both segmentation and classification. In addition, BN and ReLU follow each convolutional layer to speed up convergence during training, make the model more stable, and prevent over-fitting and vanishing gradients [17]. Dropout is not used in this network since the weight-sharing mechanism of the convolutional layers provides sufficient regularization. Moreover, because the number of defect samples is much smaller than the number of defect-free samples, omitting dropout prevents the few defect features and tiny defect features from being discarded.
Figure 3 shows the network structure. It should be noted that the image size in Figure 3 is just an example to show how the image size changes as the network deepens. The input images are not all the same size, but their sizes differ only slightly.
The feature extraction network is the key to the segmentation and classification of defects. Due to the scarcity of defect samples and the possibility that defects are small, we increase the receptive field of the convolutional layers in this part and retain as much feature detail as possible. Specifically, both the pooling operations and the large convolution kernel in the deep layer are designed to significantly increase the receptive field. The number of convolutional layers between successive maximum pooling layers increases, which increases the capacity of features with a large receptive field. Finally, maximum pooling is chosen over other down-sampling methods because it can retain small but important features [17].
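To make the effect of pooling and the large final kernel on the receptive field concrete, the sketch below traces the feature-map size and receptive field through the layers listed above. Kernel sizes and channel counts come from the text; the pooling positions (after the 2nd, 5th and 9th convolutions), the 2 × 2 pooling window with stride 2, and size-preserving ("same") padding are assumptions made for illustration.

```python
# Each entry is (kind, kernel, stride, out_channels).
# Assumed layout: 2 convs, pool, 3 convs, pool, 4 convs, pool, 1 conv.
LAYERS = (
    [("conv", 5, 1, 32)] * 2 + [("pool", 2, 2, None)]
    + [("conv", 5, 1, 64)] * 3 + [("pool", 2, 2, None)]
    + [("conv", 5, 1, 128)] * 4 + [("pool", 2, 2, None)]
    + [("conv", 15, 1, 1024)]
)


def trace(height, width, layers=LAYERS):
    """Return (height, width, channels, receptive_field) at the output."""
    channels, rf, jump = 1, 1, 1  # jump = input pixels per output step
    for kind, kernel, stride, out_channels in layers:
        rf += (kernel - 1) * jump  # standard receptive-field recurrence
        jump *= stride
        if kind == "pool":
            height, width = height // stride, width // stride
        else:
            channels = out_channels  # 'same' padding keeps height/width
    return height, width, channels, rf
```

Under these assumptions, a hypothetical 512 × 512 input yields a 64 × 64 map with 1024 channels, and each output position sees a 216 × 216 region of the input; the 15 × 15 kernel at 1/8 scale alone contributes 112 of those pixels.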
3.1.2. Inverse Convolution Network
The inverse convolution network consists of 6 deconvolutional layers, 2 element-wise addition layers and 2 convolutional layers, as shown in Figure 4. The first convolutional layer integrates the features of the last layer of the feature extraction network to obtain a heatmap. The first deconvolutional layer, with 64 channels, doubles the resolution of the heatmap to 1/4 of the raw image. Then, 2 deconvolutional layers with the same size and number of channels follow (these layers do not increase the resolution). After that, the skip-layer structure [42] is introduced: the feature map that has been down-sampled twice in the feature extraction network (1/4 of the raw image) is added to and fused with the heatmap of the same shape. This structure re-introduces shallow-layer features to ensure the accuracy of the spatial position and of the edge-region segmentation. The above structure is then repeated once, except that the number of deconvolutional layers is reduced to 2 (the first is again used for up-sampling, so the resolution of the heatmap is now 1/2 of the raw image) and the number of channels is reduced to 32. The feature map in the feature extraction network that has been pooled once is added to and fused with this heatmap. Finally, one up-sampling deconvolutional layer restores the raw image size, and a single-channel convolutional layer outputs the segmentation prediction map.
The inverse convolutional network is inspired by FCN [42] and DeconvNet [43], since they complete the segmentation task effectively, which is especially critical for a model trained on a small number of samples. The deconvolutional layers raise the resolution, and the element-wise addition layers fuse the feature map and the heatmap. Both contribute to the robustness and accuracy of the whole network.
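The resolution changes in this stage follow from the standard output-size formula for a transposed convolution (with dilation 1). The sketch below applies it to the three up-sampling steps. The kernel, stride and padding values are assumptions chosen so that each step exactly doubles the size; they are not stated in the text.

```python
def deconv_out(size, kernel, stride, padding=0, output_padding=0):
    """Output length of a transposed convolution with dilation 1."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding


def decoder_resolutions(height, width):
    """Trace the heatmap size from 1/8 scale back to full size."""
    h, w = height // 8, width // 8
    sizes = [(h, w)]
    for _ in range(3):  # the three up-sampling deconvolutions
        # Assumed hyperparameters: kernel 4, stride 2, padding 1.
        h = deconv_out(h, kernel=4, stride=2, padding=1)
        w = deconv_out(w, kernel=4, stride=2, padding=1)
        sizes.append((h, w))
    return sizes
```

With kernel 3, stride 1 and padding 1 the same formula returns the input size unchanged, which corresponds to the deconvolutional layers that do not increase the resolution. Matching sizes at 1/4 and 1/2 scale are what make the element-wise addition with the encoder feature maps possible.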
3.1.3. Classification Network
The design of the classification network follows the classification network in [17], as shown in Figure 5. The classification network consists of 3 convolutional layers, 3 maximum pooling layers and 4 global pooling layers. Its input concatenates the last layer of the feature extraction network and the output of the inverse convolution network. The number of channels of the convolutional layers increases as the image resolution decreases, which balances the computing cost of each layer. After three rounds of convolution and pooling, the network applies one global maximum pooling layer and one global average pooling layer to reduce the parameters and integrate features, obtaining two feature vectors. In order to further improve the accuracy of the classification results, the output of the inverse convolution network is also passed through one global maximum pooling layer and one global average pooling layer to obtain two more feature vectors. Because a global pooling layer outputs a single value per channel, it eliminates the dimension mismatch between the inverse convolutional network and the classification network. Finally, the feature vectors are concatenated and passed to a fully connected layer. The output is the probability that the image contains defects.
After taking the last layer of the feature extraction network as input, the classification network still carries out three rounds of convolution and down-sampling to ensure that the overall defect features are retained. The output of the inverse convolutional network is introduced to prevent the classification network from over-fitting. In the training process, the classification network and the inverse convolutional network constrain and complement each other, making the final classification result more accurate.
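The role of the four global pooling layers can be illustrated with a short sketch: each channel collapses to one maximum and one mean, so the concatenated vector has a fixed length regardless of the spatial size of its inputs. The function names and the nested-list representation of feature maps are illustrative assumptions, and the learned convolutions and the fully connected layer are omitted.

```python
def global_pool(feature_maps):
    """Collapse each channel (a list of rows) to its max and its mean."""
    maxima, means = [], []
    for channel in feature_maps:
        values = [v for row in channel for v in row]
        maxima.append(max(values))
        means.append(sum(values) / len(values))
    return maxima, means


def classifier_features(deep_maps, seg_map):
    """Concatenate the four pooled vectors fed to the final layer."""
    deep_max, deep_avg = global_pool(deep_maps)   # classification branch
    seg_max, seg_avg = global_pool([seg_map])     # segmentation output
    return deep_max + deep_avg + seg_max + seg_avg
```

The length of the result depends only on the channel counts, not on the height or width of either input, which is why the two branches can be combined even though they operate at different resolutions.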
3.2. Training
Since the whole network is composed of a relatively independent segmentation network (the feature extraction network and the inverse convolutional network are collectively called the segmentation network) and classification network, a two-step training mode is proposed, which allows the parameters of the two networks to be trained according to their different tasks (segmentation or classification). The two-step training mode minimizes the interference between the two networks. The end-to-end training mode is also considered. Bozic et al. [36] propose a total loss function as shown in Equation (1):

$$L_{total} = \lambda \cdot L_{seg} + \delta \cdot (1 - \lambda) \cdot L_{cls} \qquad (1)$$

where $L_{seg}$ and $L_{cls}$ represent the segmentation loss and classification loss, respectively, $\delta$ is an additional classification loss weight to prevent the classification loss from dominating the total loss, and $\lambda$ is a mixing factor determined by the epoch hyperparameter, which also balances the contribution of each network to the final loss.
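A minimal sketch of such an epoch-dependent combined loss is given below. It assumes a weighted sum in which the mixing factor scales the segmentation loss and its complement, together with the extra classification weight, scales the classification loss. The linear decay of the mixing factor over epochs and the default `delta` value are illustrative assumptions, not values taken from the text.

```python
def mixing_factor(epoch, total_epochs):
    """Assumed linear schedule: 1.0 at epoch 0, falling to 0.0 at the end."""
    return 1.0 - epoch / total_epochs


def total_loss(seg_loss, cls_loss, epoch, total_epochs, delta=0.01):
    """Blend the two task losses; delta damps the classification term."""
    lam = mixing_factor(epoch, total_epochs)
    return lam * seg_loss + delta * (1.0 - lam) * cls_loss
```

Under this schedule, early epochs are driven almost entirely by the segmentation loss and late epochs by the classification loss, so the two networks are coupled through a single objective throughout training.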
The experiment in this paper tests this training mode. The results show that the training time is indeed shortened, but the outputs, especially the segmentation results, are worse than those of the two-step training mode. Our preliminary analysis is that, in the end-to-end mode, the two networks constrain each other and the weight parameters of each part may be forced into a compromise, which not only makes convergence more difficult but also degrades both networks without any benefit. Therefore, the final training mode is to train the segmentation network first, then freeze the parameters of the feature extraction network and the inverse convolutional network, and finally train the classification network and perform fine-tuning. This training mode prevents the parameter weights from over-fitting to either the inverse convolutional network or the classification network, improving the accuracy of both segmentation and classification.
The problem of sample imbalance exists in both the segmentation and classification of PSIC-Net. There are few positive samples (defective samples) and many more negative samples (non-defective samples). If the positive and negative samples are multiplied by the same weight coefficient, positive samples are easily misclassified as negative. Therefore, this paper introduces the weighted cross-entropy loss function [44,45]. Assigning a larger penalty weight to the classification errors of positive samples and a smaller weight to the classification errors of negative samples can improve the accuracy of both segmentation and classification. In addition, since the segmentation network works by classifying every pixel, the weighted cross-entropy function can also improve the segmentation accuracy of small defects and of boundaries. In industry, false negatives are much more costly than false positives, which is another reason why we introduce the weighted cross-entropy loss. The weighted cross-entropy loss function adopted in this paper is shown in Equation (2):

$$L_{wce} = -\beta \sum_{j \in Y_{+}} \log P(y_j = 1 \mid X) - (1 - \beta) \sum_{j \in Y_{-}} \log P(y_j = 0 \mid X) \qquad (2)$$

where $L_{wce}$ denotes the weighted cross-entropy loss function. The loss function is computed over all pixels in the training image $X$. $\beta$ is the class-balancing factor on a per-pixel term basis, with $\beta = |Y_{-}| / |Y|$ and $1 - \beta = |Y_{+}| / |Y|$. $Y_{-}$ and $Y_{+}$ denote the defect-free and defect ground truth label pixels, respectively.
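A minimal sketch of a class-balanced cross-entropy consistent with this description follows. It assumes that the defect term is weighted by the share of defect-free pixels and the defect-free term by the share of defect pixels, so that the rarer class receives the larger weight. The function name, the flat-list inputs and the `eps` clamp are illustrative choices.

```python
import math


def weighted_cross_entropy(probs, labels, eps=1e-7):
    """Class-balanced binary cross-entropy over the pixels of one image.

    probs  -- predicted defect probability per pixel (flat list)
    labels -- ground truth per pixel: 1 = defect, 0 = defect-free
    """
    total = len(labels)
    n_pos = sum(labels)
    beta = (total - n_pos) / total  # share of defect-free pixels
    loss = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        if y == 1:
            loss -= beta * math.log(p)          # rare class, large weight
        else:
            loss -= (1.0 - beta) * math.log(1.0 - p)
    return loss
```

For an image with one defect pixel out of four, a miss on the defect pixel is weighted 0.75 while an error on each defect-free pixel is weighted 0.25. Note that with this definition an image containing no defect pixels yields a weight of zero on all its terms; whether such images are handled separately is not stated in the text.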
3.3. Inference
Once PSIC-Net is trained, images can be input for inference. The input image can be of any size: because the fully connected layer of the classification network follows global pooling, there is no dimension-matching problem. In order to verify that the network generalizes to various kinds of surface-defect data, the public defect datasets used for training and testing in this paper are KolektorSDD [17], KolektorSDD2 [18] and DAGM [19].
PSIC-Net produces two inference outputs: the defect segmentation and the image classification. The first output is the pixel-wise segmentation produced by the inverse convolutional network, a mask image derived from per-pixel probabilities. The size of the output image is the same as that of the raw image. The defect and the background are distinguished by different colors, as shown in Figure 6.
The second output is the probability, which represents whether there is a defect in the image inferred by the classification network.
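As an illustration of how the two outputs might be turned into final decisions, the sketch below thresholds the per-pixel probabilities into a binary mask and the image-level probability into a defective/non-defective flag. Both thresholds (0.5) and the function name are assumptions; the text does not specify how the probabilities are binarized.

```python
def postprocess(seg_probs, cls_prob, pixel_thr=0.5, image_thr=0.5):
    """Turn the two network outputs into a mask and a decision.

    seg_probs -- per-pixel defect probabilities (list of rows)
    cls_prob  -- image-level defect probability
    """
    mask = [[1 if p >= pixel_thr else 0 for p in row] for row in seg_probs]
    is_defective = cls_prob >= image_thr
    return mask, is_defective
```

In practice, `image_thr` could be lowered to reduce false negatives at the cost of more false alarms, in line with the preference stated for the loss design.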
5. Discussion
Extensive experiments show that PSIC-Net performs well on the classification and segmentation of surface defects. Through the design of the network, the selection of the training mode and the improvement of the loss function, PSIC-Net can acquire features from a small number of samples, and can classify and segment defects quickly and accurately. At the academic level, PSIC-Net achieves state-of-the-art classification and segmentation results. In the intelligent manufacturing scenario, PSIC-Net can also provide some practical ideas for defect detection.
Generally, in actual industrial production, it is only necessary to know whether the product is defective. However, defects need to be segmented when studying their causes or specific features. The two networks thus address different problems. Because the three sub-networks are relatively independent, the segmentation network and the classification network can operate independently. We propose the classification network for two reasons. First, in both training and inference, the classification network is much faster than the segmentation network, because defect segmentation essentially classifies every pixel, which takes a long time for a whole image. The segmentation output can therefore be disabled for tasks that require faster detection. Second, in terms of classification accuracy alone, the classification network performs much better than the segmentation network. The segmentation network classifies each pixel, which loses much of the spatial and global information about a defect and lowers the classification accuracy. In contrast, the classification network progressively refines the global features of the image, which allows it to better extract defect features and improve accuracy. This is why most existing defect detection applications use defect classification rather than defect segmentation, even though segmentation provides more information.
PSIC-Net still has some topics that need to be improved and studied in depth:
The network is sensitive to data, and the results may fluctuate slightly even if the data remain unchanged. Making the network more stable during training is needed.
The guidance of the segmentation network results to the classification network needs to be improved. In the experiment, it is found that a small number of defect data successfully segmented by the segmentation network are not successfully classified by the classification network. Strengthening the synergy of the two networks to improve the accuracy of the classification network also needs to be further explored.