Deep Metallic Surface Defect Detection: The New Benchmark and Detection Network

Metallic surface defect detection is an essential process for controlling the quality of industrial products. However, due to their limited data scale and defect categories, existing defect datasets are generally unsuitable for deploying a detection model. To address this problem, we contribute a new dataset called GC10-DET for large-scale metallic surface defect detection. The GC10-DET dataset is challenging in terms of defect categories, image number, and data scale. Besides, traditional detection approaches are poor in both efficiency and accuracy in complex real-world environments. Thus, we also propose a novel end-to-end defect detection network (EDDN) based on the Single Shot MultiBox Detector. The EDDN model can deal with defects of different scales. Furthermore, a hard negative mining method is designed to alleviate the problem of data imbalance, while several data augmentation methods are adopted to enrich the training data, since data collection is expensive. Finally, extensive experiments on two datasets demonstrate that the proposed method is robust and can meet the accuracy requirements for metallic defect detection.


Introduction
Surface defects have a strongly adverse effect on the quality of industrial products. Metallic defect detection has been exploited to satisfy predefined quality requirements in industry. Therefore, metallic surface defect detection has attracted increasing interest in recent years and has improved quality control in industrial applications [1]. However, metallic surface defect detection is easily influenced by many environmental factors such as illumination, light reflection, and metal material. These factors significantly increase the difficulty of surface defect detection.
Several defects captured in industry are shown in Figure 1. In the real-world environment, the defect types are varied and complex, including crazing, inclusion, patches, pitted surface, and scratches. However, most existing defect datasets are poor in data scale and defect richness, and are often limited to only a few categories. Specifically, the dataset size is generally limited to several hundred images, which may lead to a detection model with weak robustness and generalization under complex industrial scenarios. To solve this problem, it is necessary to introduce a new benchmark that is closer to realistic scenarios. Thus, we construct a new metallic surface defect dataset, named "GC10-DET".
In the real-world industrial environment, machine vision techniques are usually employed to detect metallic surface defects along the production line. Generally, these techniques comprise traditional image processing and deep learning, both of which aim to analyze and detect defects collected in factories. Although traditional image processing techniques have been successfully exploited to detect surface defects, deep learning-based approaches show great advantages in surface defect detection. The main idea of traditional image processing techniques is to describe surface defects via well-designed hand-crafted features. The commonly used hand-crafted features include LBP (local binary patterns), HOG (histogram of oriented gradients), GLCM (gray level co-occurrence matrix), and other statistical features. For an input metallic surface image, the crucial point is to select suitable features to represent the defect information. Based on this representation of the surface defects, a classifier is trained to recognize and classify the defects. These detection approaches have brought great improvements for various surface defect detection tasks. However, traditional image processing methods cannot be directly deployed in practice, since they usually need complex threshold settings for defect recognition, and these settings are sensitive to environmental factors such as lighting conditions and background. If the environmental factors change, the threshold settings must be carefully adjusted again; otherwise, the algorithm is not applicable to the new environment due to its lack of adaptability and robustness.
In this paper, we propose an end-to-end metallic surface defect network based on the Single Shot MultiBox Detector [6]. For each location on the feature map, the proposed model discretizes the output space of the defect bounding boxes into a set of default boxes with different aspect ratios and multiple scales. For prediction, the proposed network generates, for each default box, confidence scores that denote the probability of each object category. Besides, the proposed network adjusts the default box to better match the defect shape. In addition, due to the significant imbalance between the positive and negative examples, we introduce a hard negative mining method to alleviate the problem of data imbalance. Furthermore, since data collection is expensive, we also adopt several data augmentation methods to enrich the training data. In summary, the main contributions of this paper are as follows:
• We contribute a new dataset named "GC10-DET" that includes 10 defect types collected in real industry situations.
• We propose a novel end-to-end defect detection and classification network based on the Single Shot MultiBox Detector combined with a hard negative mining method and data augmentation methods.
• The extensive experiments on two datasets demonstrate the effectiveness of the proposed method and the superiority of our dataset.

Related Work
Numerous studies have been conducted on defect detection, and they are not limited to metallic surfaces. These approaches can be mainly divided into two categories: traditional methods, which are based on hand-crafted features or shallow learning techniques, and deep learning methods.

Traditional Method
Traditional methods mainly refer to traditional image processing techniques and shallow learning techniques (machine learning). Traditional image processing techniques extract hand-crafted features to describe and detect defects, and can be mainly divided into four categories: structural-based, threshold-based, spectral-based, and model-based methods [7]. In detail, the commonly used structural-based methods include skeleton-based [8], template matching [9], edge-based [10], and morphological operations [11]. The threshold-based methods mainly include the iterative optimal threshold [12], the Otsu method [13], the contrast adjustment threshold method [14], and the Kittler method [15]. The spectral-based methods commonly consist of the Fourier transform [16], wavelet transform [17], and Gabor transform [18], which are widely used in image processing. Finally, model-based methods include the low-rank matrix model [19] and the Gaussian mixture entropy model [20]. In general, shallow learning methods have two critical steps: feature extraction and classification. For an input surface image, hand-crafted methods are used to extract effective features for defect representation, and then a dedicated classifier is trained to judge whether the surface has defects. Local binary patterns (LBP) [21] and histograms of oriented gradients (HOG) [22] are the most used features. There are many other features, such as the gray level co-occurrence matrix (GLCM) [23] and some grayscale statistical features [24,25]. However, the above detection methods cannot be directly deployed on metallic surfaces, since traditional image processing techniques are very sensitive to illumination and background clutter. Multiple parameters need to be constantly adjusted when environmental factors change; sometimes the whole algorithm even needs to be redesigned. These approaches generally target only one specific environment and are difficult to deploy in the more challenging real world due to their lack of robustness and adaptability.

Deep Learning Method
Since the introduction of AlexNet [26], convolutional neural networks have been successfully deployed to detect surface defects. The authors in [27] outperformed classic computer vision approaches that combine hand-crafted features with support vector machines, which also demonstrates the superiority of deep learning in surface defect detection. However, this work was limited, as it did not use ReLU and batch normalization in the network. Similarly, the authors in [28] proposed a segmentation architecture for surface defect detection based on deep learning. In this work, ReLU was exploited as the activation function. In [29], the OverFeat network [30] was implemented to detect 5 different types of surface defects. The OverFeat network was trained on 1.2 million images of general objects from the ILSVRC2013 dataset. To compare deep networks with different numbers of layers for surface defect detection, Weimer et al. [31] evaluated networks ranging from 5 layers to 11 layers. However, their method is inefficient since it extracts small patches and classifies them separately. Recently, Racki et al. [32] followed a two-stage segmentation network, in which several changes were made to increase the size of the receptive field. Racki et al. [32] and Weimer et al. [31] proposed to apply their networks to real-world samples rather than synthetic ones. However, the dataset only consists of a small number of defect images. Furthermore, there are other datasets reaching the hundreds or thousands of images. Lin et al. [33] proposed LEDNet to exploit image annotation and large batch sizes. This method must choose the network carefully, since the number of training samples is an important factor that influences the performance of the detection system. The pre-trained models are often trained on the ImageNet [34] and MS COCO [35] datasets.

Overview of Our Industrial System
Our industrial system consists of four major stages in a sequential manner: host computer, production line, server and detection results. The pipeline of the system architecture is shown in Figure 2. The host computer is the core of the system and controls the operation of the entire system. The production line is where products are manufactured and defect images are collected. We deploy our detection model on the server for quality estimation. Finally, we obtain detection results as feedback for the production line. In detail, the goal of the detection model is to detect and classify defects. First, the original input image is transformed by several data augmentation methods. Second, these images are fed into the detection network for training. Our model can both detect and classify defects. Besides, a hard negative mining method is developed to speed up the convergence of the model. The entire system can be readily deployed in an actual industrial environment.
Figure 2. Overview of our industrial system. Our industrial system consists of host computers, production lines, servers, and detection results. The host computer controls the operation of the entire system, while the server hosts the defect detection model for the production line. Finally, detection results provide feedback for the production line.

Data Collection for Production Line
The data collection system consists of a set of linear array CCD cameras with a direct current (DC) light source, which avoids the stripes produced by an alternating current (AC) source. For some production lines, such as a hot-rolled strip production line, the running speed can reach 10 m/s. Thus, high-speed linear CCD cameras are used to improve the detection speed and the resolution of the captured images. For a wide-format steel plate, the images from several 4096-pixel line scan CCD cameras can be stitched to obtain a complete image. The steel plate images are captured in this way and then transmitted to the server. The server uses a large amount of computing resources to detect the corresponding defects. Finally, the results are output to the console for quality control.
To be rigorous, we list the brands, models and parameters of the equipment used for data collection as follows:
• Camera: The camera brand is Teledyne and the camera model is DALSA LA-CM-04K08A. The lens is a Moritex ML-3528-43F. The pixel size is 7.04 µm × 7.04 µm.
• Server: The memory is 32 GB, with NVIDIA RTX 2080 Ti GPU cards.

Detection Model
As shown in Figure 3, the detection model is based on the Single Shot MultiBox Detector, which requires only an input image and the ground truth object boxes during training.
In a convolutional fashion, the detection model evaluates a set of boxes with different aspect ratios at each position of several multi-scale feature maps. For each box, the network predicts both the offsets and the confidences for each category. During training, these boxes are matched with the ground truth boxes. The loss is a weighted sum of a Smooth L1 loss and a softmax loss. The base of the detection model is a feed-forward convolutional neural network consisting of two major modules: a VGG16 model and a non-maximum suppression procedure that outputs the final detection results. We then add extra structures to the network, including multi-scale feature maps and predictors for detection.
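The non-maximum suppression step mentioned above can be illustrated with a minimal sketch. It is not the exact procedure used in our system: the function names, the corner-format boxes (x_min, y_min, x_max, y_max), and the overlap threshold of 0.45 are assumptions made for illustration only.

```python
def jaccard(a, b):
    """Jaccard overlap (IoU) of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_maximum_suppression(boxes, scores, overlap_threshold=0.45):
    """Greedy NMS for one category; returns the indices of the kept boxes.

    overlap_threshold is a hypothetical value, not taken from the paper.
    """
    # Visit boxes from the highest to the lowest confidence score.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        # Keep a box only if it does not overlap too much with a kept box.
        if all(jaccard(boxes[i], boxes[j]) <= overlap_threshold for j in kept):
            kept.append(i)
    return kept
```

In this sketch the procedure is applied per category, so a lower-scoring box is discarded only when it overlaps a higher-scoring box of the same category.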

Multi-Scale Feature Maps
Several convolutional layers are added to the end of the base VGG16 network. The goal is to progressively decrease the feature size and detect at multiple scales.

Predictors for Detection
Each convolutional layer in our network can produce a fixed set of predictions using a set of convolutional filters. The basic element for predicting detections is a 3 × 3 × c convolution kernel, where c is the number of channels. The kernel is applied at each location of the feature map, where it outputs the confidence of a category for each box.

Defect Default Boxes
At the top of the network, a set of defect default bounding boxes is associated with each feature map cell, for each of the multi-scale feature maps. The default boxes tile the feature map in a convolutional manner; thus, the position of each box is fixed relative to its corresponding cell. In each feature map cell, we predict the offsets for the defect box and the confidence scores that indicate the probability of each category. Specifically, for each default box at a given location, we calculate c class scores and 4 offsets relative to the original default box shape. Thus, a total of (c + 4)k convolutional filters are needed at each location, where k is the number of default boxes per location. Therefore, an m × n feature map has (c + 4)kmn outputs.
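The counting argument above can be checked with a short worked example. The function names and the concrete values of c, k, m, and n below are hypothetical and chosen only to make the arithmetic concrete; they are not the settings of our network.

```python
def predictor_filters(num_classes, boxes_per_location):
    """Filters needed per location: c class scores plus 4 offsets, for each of k boxes."""
    return (num_classes + 4) * boxes_per_location


def feature_map_outputs(num_classes, boxes_per_location, rows, cols):
    """Total number of outputs of an m x n feature map: (c + 4) * k * m * n."""
    return predictor_filters(num_classes, boxes_per_location) * rows * cols


# Hypothetical example: c = 11 class scores, k = 6 default boxes, 10 x 10 feature map.
filters = predictor_filters(11, 6)            # (11 + 4) * 6 = 90
outputs = feature_map_outputs(11, 6, 10, 10)  # 90 * 10 * 10 = 9000
```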

Loss Function
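The training objective is inspired by the MultiBox objective [6] and handles multiple object classes. We use $x_{ij}^{p}$ to indicate whether the i-th default box matches the j-th ground truth box of category p. If matched, let $x_{ij}^{p} = 1$; otherwise, let $x_{ij}^{p} = 0$. Thus, we obtain $\sum_{i} x_{ij}^{p} \geq 1$. The overall loss function is a weighted sum of the localization loss and the confidence loss, which can be written as:

$$L(x, c, p, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha L_{loc}(x, p, g)\right)$$

where N is the number of matched default boxes, c denotes the class confidences, p is the predicted box and g is the ground truth box. Besides, if there are no matched boxes (N = 0), the loss is set to 0. Then, we regress to offsets for the center $(c_x, c_y)$, width (w), and height (h) of the default box (d). Thus, the localization loss is a Smooth L1 loss between the predicted box and the ground truth box $\hat{g}$ encoded as offsets relative to the default box:

$$L_{loc}(x, p, g) = \sum_{i \in Pos}^{N} \sum_{m \in \{c_x, c_y, w, h\}} x_{ij}^{k}\, \mathrm{smooth}_{L1}\left(p_{i}^{m} - \hat{g}_{j}^{m}\right)$$

where i indexes the positive samples, k is the category of the matched ground truth box, and $m \in \{c_x, c_y, w, h\}$. The confidence loss is a softmax loss over the multiple class confidences (c). It can be written as:

$$L_{conf}(x, c) = -\sum_{i \in Pos}^{N} x_{ij}^{p} \log\left(\hat{c}_{i}^{p}\right) - \sum_{i \in Neg} \log\left(\hat{c}_{i}^{0}\right), \qquad \hat{c}_{i}^{p} = \frac{\exp\left(c_{i}^{p}\right)}{\sum_{p} \exp\left(c_{i}^{p}\right)}$$

and the weight term α is set to 1 via cross validation.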
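As a minimal sketch of this objective, the following code combines a Smooth L1 localization term over (cx, cy, w, h) with a softmax confidence term, normalizes by the number of matched boxes N, and returns 0 when N = 0. It follows the SSD-style formulation of [6] under stated assumptions: class index 0 is taken to be the background, the box offsets are assumed to be already encoded relative to the default boxes, and all function and argument names are hypothetical.

```python
import math


def smooth_l1(x):
    """Smooth L1: quadratic near zero, linear elsewhere."""
    return 0.5 * x * x if abs(x) < 1.0 else abs(x) - 0.5


def softmax_loss(logits, label):
    """Negative log-softmax probability of `label` (numerically stable)."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_norm - logits[label]


def detection_loss(pos_logits, pos_labels, neg_logits,
                   pred_offsets, target_offsets, alpha=1.0):
    """Weighted sum of confidence and localization losses, divided by N.

    pos_logits / pos_labels: class scores and categories of matched boxes.
    neg_logits: class scores of the selected negative boxes (background = 0).
    pred_offsets / target_offsets: (cx, cy, w, h) offsets of matched boxes.
    """
    num_matched = len(pos_labels)
    if num_matched == 0:
        return 0.0  # no matched boxes: the loss is set to 0
    conf = sum(softmax_loss(z, y) for z, y in zip(pos_logits, pos_labels))
    conf += sum(softmax_loss(z, 0) for z in neg_logits)
    loc = sum(smooth_l1(p - g)
              for pred, target in zip(pred_offsets, target_offsets)
              for p, g in zip(pred, target))
    return (conf + alpha * loc) / num_matched
```

With alpha = 1.0, as selected in the text, the two terms contribute equally before normalization.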

Matching Strategy
During the training process, we need to determine which default boxes correspond to each ground truth box so that the loss can be computed. The default boxes to select from vary over different aspect ratios and scales. Inspired by [36], we match a default box to any ground truth box whose Jaccard overlap with it is higher than a threshold. This operation allows the network to output high prediction scores for multiple overlapping boxes rather than selecting only the one with maximum overlap.
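The matching rule can be sketched as follows. This is an illustrative implementation under assumptions rather than our exact code: boxes are in corner format (x_min, y_min, x_max, y_max), the threshold value of 0.5 is the one used in SSD [6] and is assumed here, and the second step, which assigns each ground truth box to its best-overlapping default box so that every ground truth has at least one match, also follows [6].

```python
def jaccard(a, b):
    """Jaccard overlap (IoU) of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_default_boxes(default_boxes, gt_boxes, threshold=0.5):
    """Return, for each default box, the index of its ground truth box or -1."""
    matches = [-1] * len(default_boxes)
    if not default_boxes:
        return matches
    # Step 1: match every default box whose overlap exceeds the threshold,
    # choosing the ground truth box with the largest overlap.
    for i, box in enumerate(default_boxes):
        best = threshold
        for j, gt in enumerate(gt_boxes):
            overlap = jaccard(box, gt)
            if overlap > best:
                matches[i], best = j, overlap
    # Step 2: make sure each ground truth box has at least one default box.
    for j, gt in enumerate(gt_boxes):
        best_i = max(range(len(default_boxes)),
                     key=lambda i: jaccard(default_boxes[i], gt))
        matches[best_i] = j
    return matches
```

Default boxes left at -1 are the negatives that the hard negative mining step later selects from.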

Hard Negative Mining
After matching, most of the default boxes are negatives, especially when the number of default boxes is large. This introduces a significant imbalance between the positive and negative training samples. To solve this problem, instead of using all the negative boxes, we sort them by their confidence loss and keep only those with the highest loss, so that the ratio between the negatives and positives is limited (at most 3:1). This leads to more stable and faster training.
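A minimal sketch of this selection step is given below, assuming that a confidence loss has already been computed for every default box. The function and argument names are hypothetical.

```python
def hard_negative_mining(conf_losses, is_positive, neg_pos_ratio=3):
    """Return indices of the negatives kept for training.

    conf_losses: confidence loss of each default box.
    is_positive: True for default boxes matched to a ground truth box.
    neg_pos_ratio: maximum number of negatives kept per positive (3:1 here).
    """
    num_positive = sum(1 for flag in is_positive if flag)
    negatives = [i for i, flag in enumerate(is_positive) if not flag]
    # Hardest negatives first: those with the highest confidence loss.
    negatives.sort(key=lambda i: conf_losses[i], reverse=True)
    return negatives[:neg_pos_ratio * num_positive]
```

Only the positives and the returned negatives would then contribute to the confidence loss.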

Data Augmentation
In order to obtain a model that is robust to objects of various shapes and sizes, we augment each training image as in [6] by one of the following options: (1) Use the entire original image. (2) Select a patch so that the minimum Jaccard overlap with the objects is 0.1, 0.3, 0.5, 0.7, or 0.9. (3) Randomly select a patch. The size of each selected patch is [0.1, 1] of the original size, and the aspect ratio is between 1/2 and 2.
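The patch selection above can be sketched as follows. Several details are assumptions made for illustration and are not stated in the text: the three options are chosen with equal probability, coordinates are normalized to [0, 1], the patch width and height are each drawn from [0.1, 1] of the image side, sampling is retried a limited number of times, and the full image is returned when no valid patch is found. All names are hypothetical.

```python
import random


def jaccard(a, b):
    """Jaccard overlap (IoU) of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def sample_patch(object_boxes, rng, max_trials=50):
    """Return a crop (x_min, y_min, x_max, y_max) in normalized coordinates."""
    full_image = (0.0, 0.0, 1.0, 1.0)
    mode = rng.choice(["original", "min_overlap", "random"])
    if mode == "original":
        return full_image
    min_overlap = None
    if mode == "min_overlap":
        min_overlap = rng.choice([0.1, 0.3, 0.5, 0.7, 0.9])
    for _ in range(max_trials):
        width = rng.uniform(0.1, 1.0)
        height = rng.uniform(0.1, 1.0)
        if not 0.5 <= width / height <= 2.0:
            continue  # aspect ratio must lie between 1/2 and 2
        x_min = rng.uniform(0.0, 1.0 - width)
        y_min = rng.uniform(0.0, 1.0 - height)
        patch = (x_min, y_min, x_min + width, y_min + height)
        if min_overlap is None:
            return patch
        overlaps = [jaccard(patch, box) for box in object_boxes]
        if overlaps and min(overlaps) >= min_overlap:
            return patch
    return full_image  # fall back to the entire original image
```

A seeded generator such as `random.Random(0)` can be passed as `rng` to make the sampled crops reproducible.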

Experiments
In this section, we conduct a series of experiments to evaluate the proposed method using real defect images of a metallic surface. First, we provide a brief introduction of the used datasets and experimental settings. Then, the experimental results are presented in both visual and quantitative analyses. Finally, we conclude the whole work and present future work.

Description of NEU-DET
NEU-DET [21] is the Northeastern University (NEU) surface defect dataset, which includes six types of surface defects, i.e., rolled-in scale (Rs), patches (Pa), crazing (Cr), pitted surface (Ps), inclusion (In) and scratches (Sc). The defects were collected on the surface of hot-rolled steel strip. The dataset includes 1800 gray-scale images, i.e., 300 samples for each class of surface defect. The detailed defects are as follows:
• Inclusion: Inclusion is a typical metal surface defect. Some inclusions are loose and fall off easily, while others are pressed into the plate.

Description of GC10-DET
The GC10-DET dataset is available on GitHub (Website: https://github.com/lvxiaoming2019/GC10-DET-Metallic-Surface-Defect-Datasets). GC10-DET is a surface defect dataset collected in a real industrial setting. It contains ten types of surface defects, i.e., punching (Pu), weld line (Wl), crescent gap (Cg), water spot (Ws), oil spot (Os), silk spot (Ss), inclusion (In), rolled pit (Rp), crease (Cr), and waist folding (Wf). The defects were collected on the surface of steel sheet. The dataset includes 3570 gray-scale images. Table 1 shows the comparison of the NEU-DET and GC10-DET datasets. The detailed defects are as follows:
• Punching: In the production line of the strip, the steel strip needs to be punched according to the product specifications; mechanical failure may lead to unwanted punching, resulting in punching defects.
• Weld line: When the strip is changed, it is necessary to weld the two coils of the strip, which produces a weld line. Strictly speaking, this is not a defect, but it needs to be automatically detected and tracked so that it can be avoided in subsequent cuts.
• Crescent gap: In the production of steel strip, cutting sometimes results in defects shaped like half a circle.
• Water spot: A water spot is produced by drying in production. The requirements for this defect differ across products and processes. However, because water spots generally have low contrast and are similar to other defects such as oil spots, they are often misdetected.
• Oil spot: An oil spot is usually caused by contamination with mechanical lubricant, which affects the appearance of the product.
• Silk spot: A local or continuous wave-like plaque on the strip surface that may appear on the upper and lower surfaces, with uneven density along the whole strip length. Generally, the main cause is uneven roller temperature and uneven pressure.
• Inclusion: Inclusion is a typical metal surface defect, usually appearing as small spots, fish-scale shapes, strip shapes, or irregularly distributed blocks on the upper and lower surfaces of the strip (globally or locally), and is often accompanied by rough pockmarked surfaces. Some inclusions are loose and fall off easily, while others are pressed into the plate.
• Rolled pit: Rolled pits are periodic bulges or pits on the surface of a steel plate that are punctate, flaky, or strip-like. They are distributed throughout the strip length or section, and are mainly caused by work roll or tension roll damage.
• Crease: A crease is a vertical transverse fold, with regular or irregular spacing across the strip or at the edge of the strip. The main cause is local yielding along the moving direction of the strip during the uncoiling process.
• Waist folding: There are obvious folds in the defective parts, somewhat like wrinkles, indicating that the local deformation is too large. The cause is related to low carbon content.

Performance Evaluation
We adopt Recall, Average Precision (AP), and mean Average Precision (mAP) for performance evaluation. Recall represents the ratio of correctly detected images to all testing images for each defect category. AP represents the average detection precision for each defect category; mAP is the mean of the average detection precision over all defect categories.
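These metrics can be sketched as follows. Recall and mAP follow the definitions above directly. The text does not specify how AP is computed from the ranked detections, so the all-point interpolated area under the precision-recall curve used below is an assumption, and all function names are hypothetical.

```python
def recall(num_correct, num_total):
    """Ratio of correctly detected images to all testing images of one category."""
    return num_correct / num_total if num_total else 0.0


def average_precision(scores, is_true_positive, num_ground_truth):
    """Area under the interpolated precision-recall curve of one category."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    precisions, recalls = [], []
    true_pos = false_pos = 0
    for i in order:
        if is_true_positive[i]:
            true_pos += 1
        else:
            false_pos += 1
        precisions.append(true_pos / (true_pos + false_pos))
        recalls.append(true_pos / num_ground_truth)
    # Interpolation: precision at a recall level is the best precision
    # achieved at that recall level or any higher one.
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    area, previous_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        area += (r - previous_recall) * p
        previous_recall = r
    return area


def mean_average_precision(ap_per_category):
    """Mean of the per-category AP values (dictionary: category -> AP)."""
    return sum(ap_per_category.values()) / len(ap_per_category)
```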
To be rigorous, we describe the parameter tuning process in this section. The compared deep methods adopt models pre-trained on ImageNet, which helps to extract basic image features such as edges and textures. Specifically, the SSD method utilizes VGG16 as the pre-trained model, YOLO-V2 uses the Darknet19 model, YOLO-V3 uses the Darknet53 model, and Faster R-CNN adopts the ResNet50 model. We tune the parameters for each of them as follows:
• Learning Rate: In the classical back propagation algorithm, the learning rate is determined by training experience. A larger learning rate means larger weight updates, which can accelerate the convergence of the model, but if the learning rate is too large, it may cause the training to oscillate. Conversely, a smaller learning rate may lead to slow convergence of the training process. Thus, we adjust it as follows: (1) A large learning rate is used at the start of training, and the learning rate decreases as the training iterations increase. (2) The initial learning rate is varied from 0.1 to 0.00001, and the best one is selected through experiments. The resulting best learning rates are: SSD (0.0005), Faster R-CNN (0.01), YOLO-V2 (0.0005), and YOLO-V3 (0.0005).

Table 2 shows the detailed comparison results of Recall on the NEU-DET dataset. Some detection results on NEU-DET are shown in Figure 4. The proposed method obtains the best results on the defects of Cr, In, Pa, Ps, and Rs, while SSD300 is slightly higher than the proposed method on Sc (0.990 vs. 0.981). Table 3 shows the detailed comparison results of AP and mAP on the NEU-DET dataset. The proposed method obtains the best results on the defects of Cr, Pa, Ps, and Sc, while SSD300 is slightly higher than the proposed method on In (0.796 vs. 0.763) and Rs (0.621 vs. 0.581). As shown in Tables 2 and 3, the YOLO methods have difficulty distinguishing the six types of defects. The reason may be that the defects on the surface are generally small in scale, which cannot be handled well by YOLO-V2 and YOLO-V3 with fixed-scale detection. In contrast, the proposed method adopts multi-scale cells to better distinguish multi-scale defects, and its mAP reaches 0.724. Although Faster R-CNN exploits anchor boxes to overcome this problem, its results are still lower than those of the proposed method.

Table 4 shows the detailed comparison results of Recall on the GC10-DET dataset. Some detection results on GC10-DET are shown in Figure 5. The proposed method obtains the best results on the defects of Pu, Wl, Cg, Ws, Os, Ss, In and Wf, while SSD300 is higher than the proposed method on Rp (0.667 vs. 0.333) and Faster R-CNN is higher than the proposed method on Cr (1 vs. 0.857). Table 5 shows the detailed comparison results of AP and mAP on the GC10-DET dataset. The proposed method obtains the best results on the defects of Pu, In, and Rp, while SSD300 is slightly higher than the proposed method on Wl (0.

Table 6 shows the detailed comparison results of precision on the NEU-DET dataset. The proposed method obtains the best results on the defects of Cr, In, Pa, Ps, Rs, and Sc, while the traditional methods have worse results. Note that different hand-crafted features give different results, because the representation power of a single hand-crafted feature is limited. Although we carefully tuned the parameters of the traditional methods, such as the thresholds, they still performed worse than the proposed method, which uses a deep convolutional network.

Computational Time Comparisons
As shown in Table 7, the proposed method runs at a relatively fast speed. On NEU-DET, the proposed method takes a similar time to SSD to process one image (27 ms vs. 29 ms), and 6 s vs. 7 s for the whole testing set. On GC10-DET, the proposed method is the second fastest at processing one image, i.e., 33 ms vs. 29 ms (SSD), while it ranks third for the whole testing set with 8 s vs. 4.49 s (YOLO-V2). Although the computational time of YOLO-V2 is smaller than that of the proposed method, the accuracy of the proposed method is higher. In addition, as shown in Table 8, the traditional methods generally cannot meet real-time requirements.

Conclusions
In this paper, we contribute a new dataset called GC10-DET for metallic surface defect detection. The GC10-DET dataset has various challenges regarding defect types, defect images, and dataset scales.
Besides, we propose an end-to-end defect detection and classification network based on the Single Shot MultiBox Detector. To solve the significant imbalance between the positive and negative examples, we present a hard negative mining method to effectively train our network. Furthermore, to enrich the training data, we also introduce some data augmentation methods into our training. Finally, extensive experiments demonstrate that the proposed method is robust for metallic defect detection.