AnomalySeg: Deep Learning-Based Fast Anomaly Segmentation Approach for Surface Defect Detection

Abstract: Product quality inspection is a crucial element of industrial manufacturing, yet flaws such as blemishes and stains frequently emerge after the product is completed. Most research has relied on detection models and avoided segmentation networks because of the imbalanced distribution of defect information. To overcome this challenge, this work presents a rapid segmentation-based technique for surface defect detection. The proposed model is based on a modified U-Net that introduces a hybrid residual module (SAFM) combining an improved spatial attention mechanism with a feed-forward neural network. This module replaces every downsampling layer in the encoder except the first and is also applied in the decoder. Dilated convolutions are incorporated in the decoder to obtain more spatial information about defect features and to mitigate the vanishing-gradient problem. An improved hybrid loss function combining Dice loss and focal loss is introduced to alleviate the small defect segmentation problem. Comparative experiments against different segmentation-based inspection methods show that the proposed approach achieves a higher Dice coefficient (DSC) than previous generic segmentation benchmarks on the KolektorSDD, KolektorSDD2, and RSDD datasets, with fewer parameters and FLOPs. Additionally, the detection network displays higher precision in recognizing the characteristics of minor flaws. This paper proposes a practical and effective technique for anomaly segmentation in surface defect identification, delivering considerable improvements over previous methods.


Introduction
Surface defect detection has always been essential for quality control and condition assessment. Effective quality inspection of industrial products can greatly decrease their defective rate. As the traditional inspection approach, manual visual inspection relies on subjective discrimination, and its accuracy depends on the experience that inspectors gain from extensive training. Manual inspection is therefore generally considered inefficient and laborious, and it lacks consistency and reliability. As technology has evolved, detection methods have come to include eddy current detection [1], leakage detection [2], capacitance detection [3], ultrasonic detection [4], etc. These methods vary in the size and types of detectable defects, the quality of detection, and their accuracy limitations. In industrial environments in particular, complex working conditions can introduce significant interference into traditional defect detection. The difficulty of acquiring industrial defect samples and the unknown irregularity of the defects also have an impact. For traditional defect detection methods, accurately identifying workpiece defects, reducing interference from external factors, and acquiring defect datasets can therefore be great challenges. In an attempt to address these problems, numerous research efforts have been devoted to the rapid and accurate location of surface defects in automated production.
The image processing-based method converts the detected image to the non-spatial domain and identifies defects by comparing the discriminative attributes obtained from defective textures and defect-free ones. Common image processing-based methods include the wavelet transform [19], Fourier transform [23], template matching [24], rank decomposition [25], and so on. Machine learning-based methods generally consist of two phases: feature extraction and pattern classification. In the feature extraction phase, the feature vector is obtained from the input image via feature descriptors that are manually designed by experts. These descriptors encode and represent the image information. Shallow neural networks are the primary architecture in the pattern classification phase. The feature vector is forward propagated through the pre-trained network to produce category probabilities, which are used to determine whether the input image contains defect areas and to which categories they belong. These handcrafted features include the local binary pattern (LBP) feature [15], the gray-level co-occurrence matrix (GLCM) [26], and other grayscale statistical features [9,16,24,27]. Although these two inspection methods follow different technical routes and have achieved successful results, they both have limitations. Firstly, the image processing-based method requires a significant amount of computing resources in most conditions and may encounter computational bottlenecks on small terminal devices. Moreover, the manually designed features are not sufficiently robust or discriminative to deal with complex on-site environments.
Given the development of defect detection technology, techniques based on deep convolutional neural networks have gradually alleviated the above problems and demonstrated excellent performance in different vision tasks [28]. Through end-to-end learning, deep neural network-based methods can automatically extract features, avoiding the difficulty of manually designing them. Inspired by these successful detection architectures, research on surface defect detection using deep neural networks has emerged. According to the type of neural network used in the defect detection task, the inspection methods can be categorized into three classes: classification-based, detection-based, and segmentation-based. The classification-based approach must be combined with a sliding window method to classify and locate defects. The detection-based method needs to locate and frame the defects while simultaneously identifying their categories [12,17,18,21,29-35]. Segmentation-based methods need to produce a pixel-wise probability map to determine the shape of defects, thereby measuring the actual area of anomalies. Ho et al. [36] designed an object-oriented defect detection method to address defects in complex contexts, employing a deep residual neural network (DRNN) that performs both feature extraction and classification. The addition of cascade layers increases the system's filtering depth. This network significantly improves the detection accuracy of the convolutional algorithm. Yang et al. [37] proposed a new defect detection network (NDDnet) to solve problems such as inadequate feature processing, and used an attention fusion block instead of the initial skip connection, which made the segmentation network emphasize the defect region and improved detection accuracy. In [34], an optimized MobileNet-SSD was applied to defect detection on the sealing surface, showing better accuracy and speed than lightweight network methods and traditional machine learning methods. Although these detection-based methods demonstrate better results in surface defect detection, they require larger volumes of labeled data than traditional machine learning methods. Surface defect datasets are often imbalanced, with anomalous data severely under-sampled due to their low occurrence frequency [14].
Unlike the three classes of methods above, which require massive amounts of labeled data, the reconstruction-based methods [33] among the segmentation-based methods can train anomaly detection models using only a large number of normal, defect-free samples. This technique is trained to reconstruct an image similar to the input image and identifies defects as regions whose reconstructed textures differ markedly from the normal data. Tao et al. [38] devised a novel cascaded network with autoencoders to segment metal surface defects and a compact CNN to classify their specific classes, which were experimentally demonstrated to have strong robustness and high accuracy in metallic defect detection. Mei et al. [39] proposed an unsupervised learning model that needs only defect-free samples to implement defect detection. A convolutional denoising autoencoder network is adopted to reconstruct image patches at different Gaussian pyramid levels, and the detection results from these levels are then synthesized. Li et al. [40] proposed an anomaly detection method based on double attention and consistency loss to solve surface defect detection problems, effectively enabling the network to separate defective images from defect-free images through channel attention and pixel attention, and achieving good performance in comparison with other detection methods. The aforementioned approaches require only normal data in the training process. Nevertheless, it is difficult for these reconstruction models to strike a trade-off between network depth and generalization ability. Focusing on these problems of anomaly detection on surface defects, we propose a fast anomaly segmentation model to deal with the above challenges. Our defect detection model is built on an improved U-Net architecture that introduces a hybrid residual module combining a spatial attention mechanism and feed-forward neural networks, and it outperforms existing general-purpose segmentation methods.
The remainder of this paper is organized as follows. The adopted methods of the proposed work are delineated in Section 2, including dataset preparation, network architecture, and loss function. More explicit parameter settings and other optimization operations are then introduced in Section 3. Furthermore, an assessment is carried out to validate the performance of the proposed anomaly detection approach in Section 4. A comparison is also made between the proposed approach and other excellent methods. Finally, our work is summarized in Section 5 with a discussion of the limitations of the proposed methodology and directions for future improvement.

Dataset Preparation
All images in this experiment were taken from the KolektorSDD, KolektorSDD2, and RSDD datasets, with labels containing only anomalies. Unlabeled data represent defect-free samples, and the labels do not distinguish between defect types. The Kolektor dataset was constructed from images of defective production items provided and annotated by the Kolektor Group d.o.o. [20,41], and taken in a real controlled industrial environment.
For the KolektorSDD image set, there are 50 defective images and 350 defect-free images. The declared resolution of each surface image is 1408 × 512 pixels, but it was verified that its height is actually between 1240 and 1270 pixels. For the KolektorSDD2 image set, there are 356 images with visible defects and 2979 images without defects, each with a resolution of 230 × 630 pixels. Therefore, before being fed into the network, these surface images were scaled to a fixed size using bilinear interpolation. Most of the defects in these two types of images are scratches, spots, notches, etc. Examples of such images are depicted in Figure 1. The rail surface defect detection (RSDD) dataset is a standard dataset for rail inspection. The RSDD dataset includes two types: the Type-I RSDD dataset, with 67 challenging images captured from fast rail, and the Type-II RSDD dataset, with 128 challenging images captured from transport rail. Surface defect images in the Type-I RSDD dataset have a fixed width of 160 pixels but vary in height between 1000 and 1282 pixels. In contrast, surface defect images in the Type-II RSDD dataset have a fixed resolution of 1250 × 55 pixels. For convenience in calculations, surface images from the Type-I RSDD dataset are scaled to a fixed size, similar to those in the Kolektor dataset. Examples of such images are exhibited in Figure 2. Meanwhile, in order to train our proposed model, we cropped the original images and performed image enhancement. This resulted in 67 defective images and 201 defect-free images for the RSDD Type-I dataset, and 128 defective images and 384 defect-free images for the RSDD Type-II dataset.
Based on the presentation of the three types of datasets mentioned above, we found that the indicated defects in them range in size from centimeters to millimeters. The centimeter-sized defects include cracks, grooves, etc., while the millimeter-sized defects include fine scratches, dots, etc. Thus, we divided the dataset into a train set, validation set, and test set in a ratio

Network Architecture
Surface defects in images acquired from real industrial environments are usually small compared to other common segmentation objects. They occur mainly due to material aging, mechanical damage, and environmental effects. Generalized segmentation networks cannot be directly used for anomaly detection because they apply multiple downsampling operations, which may lose the semantic information of small defects such as cracks and breakage points. Inspired by reference [42], in this study we eliminate all downsampling layers of the encoder in the U-Net network except the first, and propose a hybrid residual module (SAFM), combining a spatial attention mechanism and a feed-forward neural network, to replace the removed downsampling layers. The spatial attention mechanism in this module is structurally optimized, as demonstrated in Figure 3. AnomalySeg is an improved work based on AnatomyNet [42], which is a variant of the 3D U-Net for organs-at-risk (OARs) segmentation. For consistency, the network structure in this paper still broadly follows AnatomyNet, with the 3D convolutional layers converted to 2D convolutional layers. In contrast to [42], we introduce a hybrid residual module that replaces the traditional 3 × 3 convolution with 3 × 1 and 1 × 3 convolutions on the main path of the residual module, greatly reducing computational effort, as illustrated in Figure 4. An improved spatial attention module (SAFM) is included in its branch. First, it replaces the original 7 × 7 convolution of spatial attention with 3 × 3, 5 × 5, and 7 × 7 dilated convolutions to capture feature information at more scales. Secondly, with reference to the transformer structure, it incorporates a feed-forward neural network (FFN) to optimize the truth value of the activation function with the necessary normalization. Thus, the SAFM substitutes the original squeeze excitation (SE) module in each residual block to improve the extraction rate of spatial feature information, as shown in Figure 5. In the formulation of this structure, F is the input feature and F′ is the output feature; f denotes the convolution operation, which includes 3 × 3, 5 × 5, and 7 × 7 kernels; σ denotes the sigmoid function; GAP and GMP denote global average pooling and global max pooling, respectively; and R_1(x) and R_2(x) express the outputs of the two types of residual modules. We verified empirically that spatial attention is superior to channel-wise attention in emphasizing the meaningful features of small defects. The input to AnomalySeg is a cropped surface image, with or without small defects, in the form N, C, H, W, where N, C, H, and W denote the batch size, the number of channels, the height of the feature map, and the width of the feature map, respectively. Considering the trade-off between accuracy and inference speed, the initial number of channels for the entire network is set to 24. The number of channels doubles with the high-level feature maps and ultimately reaches eight times the number of initial channels. In the output block, we replace the transposed convolution with bilinear interpolation followed by a convolution layer to avoid the checkerboard artifacts of deconvolution. After experimental comparisons, the final layer of AnatomyNet is replaced with a convolutional layer with 1 × 1 kernels and a sigmoid function.
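The saving from the factorized convolution and the channel progression described in this section can be checked with a little arithmetic. The sketch below is illustrative only: the function names are hypothetical, biases are ignored, and the intermediate channel count of the 3 × 1 / 1 × 3 pair is assumed to equal the output channel count, which the text does not specify.

```python
def conv_weights(c_in, c_out, kh, kw):
    """Number of weights in a 2D convolution with a kh x kw kernel (bias ignored)."""
    return c_in * c_out * kh * kw


def factorized_weights(c_in, c_out):
    """Weights of a 3x1 convolution followed by a 1x3 convolution.

    Assumption: the intermediate feature map has c_out channels.
    """
    return conv_weights(c_in, c_out, 3, 1) + conv_weights(c_out, c_out, 1, 3)


def channel_widths(base=24, levels=4):
    """Channel count per level when the width doubles at each level."""
    return [base * 2 ** i for i in range(levels)]


if __name__ == "__main__":
    full = conv_weights(24, 24, 3, 3)
    split = factorized_weights(24, 24)
    print("3x3 weights:", full)
    print("3x1 + 1x3 weights:", split)
    print("ratio:", split / full)
    print("channel widths:", channel_widths())
```

With equal input and output channels, the factorized pair needs two thirds of the weights of a single 3 × 3 convolution, and four levels starting at 24 channels end at 192, that is, eight times the initial width.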

Loss Function
Defect detection is commonly regarded as a hard sample mining problem because defect areas account for only 1% of the entire surface image. From the learning perspective, the hard sample contributes less to the loss than the negative sample. Neural networks cannot learn effective features from such extremely unbalanced segmentation data to distinguish between small defect areas and defect-free backgrounds. As a result, defective areas are often missed or only partially identified.
A common strategy to alleviate the small defect segmentation problem is to adjust the loss distribution of samples, for example by using weighted cross-entropy, perceptual loss, or exponential logarithmic loss. However, these loss functions focus more on fine-grained classification and do not perform well in anomaly segmentation tasks. For instance, with an unbalanced dataset, the weighted cross-entropy loss function struggles to deal with category imbalance. In this work, we exploit a hybrid loss composed of focal loss [43] and Dice loss for small anomaly segmentation [42]. Dice loss guides the learning of model parameters from the perspective of sample similarity rather than re-weighting. The focal loss forces the model to learn how to discriminate hard samples from the background. We tested Dice loss, focal loss, and the combined Dice-focal loss on the Kolektor dataset using UNet and our proposed method; the results are shown in Table 1. The total loss combines L_dice and L′_focal through the balancing coefficient λ. In its definition, TP, FN, and FP respectively denote the true positives, false negatives, and false positives of defective areas; p_n is the predicted probability that pixel n is a defect; g_n is the ground truth for pixel n being a defect; λ is the balancing coefficient between L_dice and L′_focal, which is described in Section 3.2; L′_focal is the focal loss function, varied from Equation (4); α is the weighting factor that balances the loss distribution between positive and negative samples, set to 0.1 here; γ is the focusing parameter that adjusts the weight of easily classified examples, set to 2 here; and N is the total number of pixels in the surface images.
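To make the roles of the symbols defined above concrete, the following sketch computes one plausible form of the hybrid loss on flattened pixel lists. It is not the authors' exact formulation. Several points are assumptions: the soft counts TP = Σ p_n g_n, FN = Σ (1 − p_n) g_n, and FP = Σ p_n (1 − g_n); the Dice form 1 − 2TP / (2TP + FN + FP); applying α to the positive class and 1 − α to the negative class; and adding the focal term as L_dice + λ L′_focal. The function names and the small constant eps are hypothetical.

```python
import math


def dice_loss(p, g, eps=1e-7):
    """Soft Dice loss from soft TP/FN/FP counts (assumed form)."""
    tp = sum(pi * gi for pi, gi in zip(p, g))
    fn = sum((1 - pi) * gi for pi, gi in zip(p, g))
    fp = sum(pi * (1 - gi) for pi, gi in zip(p, g))
    return 1.0 - (2 * tp + eps) / (2 * tp + fn + fp + eps)


def focal_loss(p, g, alpha=0.1, gamma=2.0, eps=1e-7):
    """Binary focal loss averaged over N pixels.

    Assumption: alpha weights defect pixels, (1 - alpha) weights background.
    """
    total = 0.0
    for pi, gi in zip(p, g):
        pi = min(max(pi, eps), 1 - eps)  # clip to keep log() finite
        total -= alpha * gi * (1 - pi) ** gamma * math.log(pi)
        total -= (1 - alpha) * (1 - gi) * pi ** gamma * math.log(1 - pi)
    return total / len(p)


def hybrid_loss(p, g, lam=1.0):
    """Assumed combination: Dice loss plus lam times focal loss."""
    return dice_loss(p, g) + lam * focal_loss(p, g)


if __name__ == "__main__":
    truth = [1, 0, 0, 0]            # one defect pixel out of four
    uncertain = [0.5, 0.5, 0.5, 0.5]
    print("dice :", dice_loss(uncertain, truth))
    print("focal:", focal_loss(uncertain, truth))
    print("total:", hybrid_loss(uncertain, truth, lam=1.0))
```

Because the Dice term is a ratio of overlaps, it does not shrink when the defect occupies few pixels, while the focal term down-weights pixels that are already classified confidently.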

Hyperparameters
To alleviate the learning instability caused by randomly initialized weights, the learning rate is gradually increased from 0 to a fixed learning rate. The fixed learning rate is set to 0.1/M, where M represents the batch size. Next, the learning rate is annealed to a small value using a cosine function. Warming up and annealing the learning rate has empirically demonstrated better convergence.
The experiments for the proposed approach were performed on a personal computer with an RTX 2080Ti. The specific parameters of the hardware and software used are shown in Table 2. Based on the training task of our proposed network model and the analysis of the three types of datasets, the batch size is set to 8 for the KolektorSDD, KolektorSDD2, and RSDD datasets. The total number of training epochs is set to 15 for the Kolektor dataset and 20 for the RSDD dataset. The number of warm-up epochs is set in the range of 2 to 5, depending on the number of training epochs. The learning rate is warmed up to 0.1/M during the warm-up epochs and annealed to 0.0001 over the remaining epochs, where M denotes the batch size.
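The warm-up and cosine annealing described above can be sketched as a single function of the epoch index. The peak rate 0.1/M, the batch size of 8, the 15 training epochs, and the final rate 0.0001 come from the text. The linear shape of the warm-up, the per-epoch (rather than per-iteration) update, the choice of 3 warm-up epochs, and the function name are assumptions made for illustration.

```python
import math


def learning_rate(epoch, total_epochs=15, warmup_epochs=3,
                  batch_size=8, final_lr=1e-4):
    """Learning rate at a given epoch: linear warm-up, then cosine annealing."""
    peak_lr = 0.1 / batch_size  # 0.1 / M
    if epoch < warmup_epochs:
        # Assumed linear ramp from 0 up to the peak rate.
        return peak_lr * epoch / warmup_epochs
    # Fraction of the annealing phase that has elapsed, in [0, 1].
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for epoch in range(16):
        print(epoch, round(learning_rate(epoch), 6))
```

With these settings the rate starts at 0, reaches 0.0125 at the end of the warm-up, and decays smoothly to 0.0001 at the final epoch.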

Balancing the Loss Term
At the beginning of the training stage, the focal loss is significantly larger than the Dice loss to encourage the model to learn the data distribution of the background. After several training epochs, the focal loss converges to a small value, which is still larger than the Dice loss. Meanwhile, the Dice loss oscillates and is difficult to converge due to the uncertainty of the pixels. In order to avoid becoming trapped in the local minima of the loss, the balancing coefficient λ is progressively reduced from 1 to a small value to balance the loss term.
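The text states only that λ starts at 1 and is progressively reduced to a small value. It does not give the schedule. As one hypothetical possibility, the sketch below uses a geometric decay with a lower bound. The decay factor, the floor value, and the function name are all assumptions chosen for illustration.

```python
def balancing_coefficient(epoch, decay=0.7, floor=0.05):
    """Hypothetical schedule for lambda: geometric decay from 1, clipped at a floor."""
    return max(floor, decay ** epoch)


if __name__ == "__main__":
    for epoch in range(15):
        print(epoch, round(balancing_coefficient(epoch), 4))
```

Any monotonically decreasing schedule that starts at 1 would match the description equally well. The floor simply keeps the focal term from vanishing entirely.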

Choosing Network Structures
Before designing the network structure of the proposed approach, we compared different network structures based on specific metrics. The standard U-Net consists of convolution layers as the basic block. However, the 19-layer encoder-decoder network was empirically found to have poor performance in anomaly detection. To learn more effective feature information, we explored two other convolutional block structures: (a) the residual block, and (b) the attention mixed residual block. Altogether, we discuss the performance of the following four architectures:
1. SAFM Res UNet, the architecture implemented in AnomalySeg (Figure 4) with spatial attention feed-forward residual blocks.
2. Res UNet, which replaces the SAFM residual blocks in SAFM Res UNet with residual blocks.
3. SE Res UNet, which replaces the SAFM residual blocks in SAFM Res UNet with SE residual blocks.
These models were trained and validated on the same dataset using identical training strategies. The performances measured by the intersection over union (IoU) and Dice coefficient (DSC) on the validation dataset are summarized in Table 3. We observed some meaningful insights from this study. Firstly, spatial attention consistently shows better performance than channel-wise attention in anomaly detection. It appears that spatial attention aids in quickly and accurately locating anomaly features by filtering out a large amount of useless background information. Secondly, to capture multi-scale spatial information, dilated convolution is introduced into the spatial attention module, which performs 3 × 3, 5 × 5, and 7 × 7 convolution operations, and the result is then scaled by a feed-forward neural network. This approach can effectively extract multi-scale feature information. Furthermore, when combined with the residual structure of the spatial attention module, it exhibits excellent performance in anomaly detection.
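For reference, the two metrics used in this comparison can be computed from binary masks as in the sketch below. The definitions are the standard ones, IoU = TP / (TP + FP + FN) and DSC = 2TP / (2TP + FP + FN). The function name and the convention of returning 1.0 when both masks are empty are assumptions.

```python
def iou_and_dice(pred, target):
    """IoU and Dice coefficient for two flattened binary masks of equal length."""
    tp = sum(1 for p, t in zip(pred, target) if p and t)
    fp = sum(1 for p, t in zip(pred, target) if p and not t)
    fn = sum(1 for p, t in zip(pred, target) if not p and t)
    if tp + fp + fn == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0, 1.0
    iou = tp / (tp + fp + fn)
    dice = 2 * tp / (2 * tp + fp + fn)
    return iou, dice


if __name__ == "__main__":
    prediction = [1, 1, 0, 0]
    ground_truth = [1, 0, 1, 0]
    print(iou_and_dice(prediction, ground_truth))
```

The two metrics are monotonically related through DSC = 2 IoU / (1 + IoU), so they rank models identically on a single image but can diverge once averaged over a dataset.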

Comparing to Generic Segmentation Methods
After determining the network architecture and training strategy, we compared the performance with existing segmentation methods. Due to the lack of successful studies on small-defect semantic segmentation applications, we chose generic segmentation benchmarks. We first compared several commonly used segmentation methods with our method in terms of CPU inference time on the KolektorSDD, KolektorSDD2, and RSDD datasets. Our method performs faster inference than the other methods for a single process operation, as shown in Figure 6. Secondly, we compared the detection effectiveness of several commonly used segmentation methods with our proposed method on the same three datasets. The results are shown in Figures 7-9, which present the original image, the corresponding masked image, and the predicted image. The results demonstrate that our proposed approach achieves superior performance on all three types of datasets. For rigorous comparison, these models were all trained in the same way and experimentally compared on the KolektorSDD series and RSDD validation sets, as shown in Table 4. The comparison shows that the FCN method [44] has the most parameters and the DeepLab version 3 method [45] requires the largest number of FLOPs. Since AnatomyNet [42] is specifically designed for 3D segmentation, optimal performance on small defect segmentation cannot be expected. In contrast, the model proposed in this paper outperforms most algorithms in the segmentation of small defects and achieves better outcomes in terms of inference speed and segmentation accuracy. AnomalySeg achieves Dice coefficients of 95.2% and 92.9% on the KolektorSDD series of datasets, and 86.5% on the RSDD dataset, outperforming the base work, AnatomyNet, by about 9%, 13%, and 15%, respectively, in small defect segmentation.
Ultimately, the segmentation results show that the designed model performs well on crack-like defects but relatively poorly on dot-like defects. We examined this poor performance and concluded that comprehensive feature information for small-shaped defects may still be missed during feature extraction, leading to poor detection performance. It is also possible that the defective regions in the detected samples are similar to the non-defective background, which subsequently causes false and missed detections. Through the validation of our proposed model and approach, we found that the KolektorSDD series dataset has a higher defect detection rate. On the one hand, this dataset captures defect images of electronic commutators. Such products have stricter defect requirements in industrial manufacturing environments because reducing their defect rate will decrease the cost of product quality inspection and improve production efficiency. On the other hand, the RSDD series of datasets is mostly collected from objects working in external natural scenes, which have relatively low requirements for defects. However, in real track scenes, the detection difficulty increases. Therefore, our proposed method, in combination with automated equipment, is valuable for detecting rail defects.

Conclusions and Future Outlook
Product quality inspection is of great importance in automated industrial manufacturing. In this study, we propose a surface defect detection technique based on fast anomaly segmentation to address the problem of small defects on product surfaces. To tackle the challenge of small defect segmentation, our proposed model is built on an improved U-Net architecture that removes the downsampling layers from the encoder, retaining only the first. To improve the defect detection ability of the model, we propose a hybrid residual module (SAFM) that combines spatial attention and a feed-forward neural network and replaces the removed downsampling layers in the encoder to extract defect features. The spatial attention in the module is superior to channel attention: it filters out unwanted background information and helps to locate the defect information quickly and accurately. Secondly, dilated convolution is introduced in this module to capture multi-scale spatial feature information. To address the severe class imbalance in segmentation, a hybrid loss function combining Dice loss and focal loss was designed, enhancing the model's ability to discriminate between defective regions and defect-free backgrounds. Experimental results show that the model achieves significant results in surface defect recognition. Notably, our model is very lightweight compared to previous techniques, requiring only 2.37M parameters. Overall, this study presents a practical and successful defect detection method for product surface quality inspection.
Our future work can be summarized as follows.
(1) Our proposed model is applicable to most other common defect types, which we will continue to detect and validate in our later work. However, the detection model for micro-level defects needs further exploration and will be the focus of our work. (2) According to our investigation and analysis, tiny-level defects appear in many different products with varying quality requirements, posing a significant challenge for the collection and organization of our dataset. (3) Since the area of minor defects in products varies, determining how to obtain more comprehensive information on defect characteristics is also a worthy concern for us. This will be greatly helpful in dealing with minor defects.

Figure 1. Several samples from the KolektorSDD and KolektorSDD2 datasets, showcasing defect visual images, with their masks on the top and defect-free images on the bottom.
of 7:2:1 for model training, parameter optimization, and model evaluation, respectively. After dividing the dataset, the KolektorSDD dataset was split into 280 training samples, 80 validation samples, and 40 test samples. The KolektorSDD2 dataset was divided into 2330 training samples, 666 validation samples, and 338 test samples. The RSDD Type-I dataset comprises a total of 536 samples (364 training, 120 validation, and 52 test samples), and the RSDD Type-II dataset comprises a total of 1024 samples (712 training, 208 validation, and 104 test samples).
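The 7:2:1 partition can be illustrated with a short helper that turns a sample count into train, validation, and test sizes. The function name and the rounding rule (floor for the first two parts, remainder to the test set) are assumptions. The reported counts for the larger datasets deviate slightly from an exact 7:2:1 ratio, so the rule actually used may differ.

```python
def split_counts(n_samples, ratios=(7, 2, 1)):
    """Split a sample count into (train, validation, test) sizes by ratio."""
    total = sum(ratios)
    train = n_samples * ratios[0] // total
    val = n_samples * ratios[1] // total
    test = n_samples - train - val  # remainder goes to the test set (assumed)
    return train, val, test


if __name__ == "__main__":
    # KolektorSDD: 50 defective + 350 defect-free images.
    print(split_counts(400))
```

For the 400 KolektorSDD images this reproduces the 280 training, 80 validation, and 40 test samples stated above.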

Figure 2. Several samples of the RSDD dataset, including defect images and their masks.

Figure 3. Network architecture of the segmentation detector proposed for anomaly detection.

Figure 4. Structural framework diagram of two types of residual blocks: (a) the SAFM residual module with a dilation of 1; (b) the SAFM residual module with a dilation of 3.

Figure 5. Structural framework diagram of the spatial attention feed-forward module (SAFM).

Figure 6. Inference time of generic segmentation methods on CPU for three types of datasets.

Figure 9. Detection results of the six network models on the RSDD dataset. Columns from left to right: (a) Original; (b) Lraspp-mobilenetV3; (c) Deeplabv3-resnet50; (d) Fcn-resnet50; (e) U-Net; (f) AnatomyNet; (g) our proposed approach. Each method is shown with the original and masked images, and each row represents one sample image. Red denotes the predictions. Green represents the ground truths.

Table 1. The results on the Kolektor datasets using UNet and our proposed method.

Table 2. Hardware and software configuration information.

Table 3. Comparison of several metrics for different residual module network structures.

Table 4. Performance comparisons with generic segmentation methods, showing the DSC on validation sets.