Real-Time Steel Surface Defect Detection with Improved Multi-Scale YOLO-v5

Abstract: Steel surface defect detection is an important issue when producing high-quality steel materials. Traditional defect detection methods are time-consuming and uneconomical and require manually designed prior information or extra supervisors. Surface defects have different representations and features at different scales, which makes it challenging to automatically detect their locations and types. This paper proposes a real-time steel surface defect detection technology based on the YOLO-v5 detection network. To effectively explore the multi-scale information of the surface defects, a multi-scale block is developed in the detection network to improve the detection performance. Furthermore, a spatial attention mechanism is developed to focus more on the defect information. Experimental results show that the proposed network can accurately detect steel surface defects with approximately 72% mAP and satisfies the real-time speed requirement.


Introduction
Steel surface defect detection is an important topic in material science research [1]. As one of the most important fundamental materials, steel contributes to numerous industrial products, such as airplanes, automobiles and high-speed railways. Among the various steel products, flat steel is the dominant one and contributes the most to industrial applications. As such, the quality of flat steel is vital for daily life.
Unfortunately, defects commonly occur on flat steel surfaces. There are six typical defects on steel surfaces: crazing, inclusion, patches, pitted surface, rolled-in scale and scratches [2]. Figure 1 shows typical examples of these defect types from the Northeastern University detection (NEU-DET) dataset. The defects degrade the quality of the flat steel, making it challenging to manufacture high-quality industrial products.
In traditional factories, steel surface defect detection relies on human supervision, which is time-consuming and uneconomical [3]. On the one hand, extra supervisors require more resources than automatic detection does. On the other hand, human-dependent detection cannot ensure quality supervision around the clock. With the development of industrial vision, computers have become a powerful tool for detecting surface defects. Previous works rely on hand-crafted feature extractors and machine learning-based classifiers to localize the defects. Artificial neural networks (ANNs), support vector machines (SVMs), k-nearest neighbors (KNN) and other machine learning technologies have been widely applied in different steel surface defect detection methods [4][5][6]. However, these works suffer from lower precision and cannot satisfy the real-time speed requirement. To boost the accuracy and improve the speed, convolutional neural networks (CNNs) have been developed specifically for defect detection. Mustafa et al. used different image classification methods to recognize the diverse steel surface defects [7]. He et al. utilized a multi-level feature fusion network and classified the different kinds of defects [8]. These works demonstrate good performance with two-stage object detection networks, but separating the localization and classification steps is time-consuming.
In contrast to two-stage networks, you-only-look-once (YOLO) series methods utilize one-stage object detection technology and achieve real-time speed [9][10][11][12][13]. In particular, the fifth version, YOLO-v5, achieves state-of-the-art detection performance and has been widely utilized in various situations, such as letter recognition [13], circuit defect detection [14] and fabric detection [15]. However, the traditional YOLO-v5 method cannot effectively explore steel surface defects. On the one hand, as shown in Figure 1, the defects have different representations, making it difficult to accurately localize and classify the defect areas. On the other hand, the defects vary widely in scale. Extremely small or large defects are challenging to explore and detect.
This paper proposes an improved multi-scale YOLO-v5 technology for real-time steel surface defect detection. In particular, we develop a multi-scale block to effectively explore the defects. Convolutions with different filter sizes process the input images and generate the multi-scale information. The multi-scale image features are aggregated by one convolutional layer for information fusion, which boosts the representation capacity of the network. Furthermore, a spatial attention mechanism is developed to concentrate more on the defect areas and improve the detection accuracy. The experimental results show that the improved multi-scale YOLO-v5 method detects steel surface defects more accurately than the original version while still satisfying the real-time speed requirement.
Our contributions can be summarized as follows:
• We propose an improved multi-scale YOLO-v5 network for effective steel surface defect detection, which achieves a high detection accuracy and demonstrates robust performance.
• We develop the multi-scale block and spatial attention mechanism to process the steel surface images, which effectively explore the defect information and improve the accuracy of the network.
• Experimental results show that the improved network has a higher prediction accuracy than the vanilla YOLO-v5 method while satisfying the real-time speed requirement.

Steel Surface Defect Detection
Steel surface defect detection is one of the most important tasks in the industrial vision research area. The task is to automatically find the defects of the steel and guarantee the quality of industrial products. In previous works, different computer vision and machine learning-based methods have been used to detect defects. Some researchers used a probability model to describe defect-free steel surfaces and regarded the outliers as examples with bad quality [16]. To this end, a dynamic threshold technology was developed to detect the outliers. Wang et al. used the histogram of image features to model the difference between the defect-free examples and bad cases [17]. However, this method performs the detection in gray scale, which cannot effectively explore the color information of the images and limits the accuracy. Moreover, the defects have different scales and representations. The statistics-based methods cannot effectively distinguish the diverse defects from each other and suffer from poor accuracy.
Other works treat the task in the spatial domain and utilize filter-based methods to detect and localize the defects. A Gabor filter was used to explore hole-like defects and achieved a good detection performance on different scales [18]. The Hough transform was also utilized to model the different kinds of defects and improve the robustness of the detection [19]. Edge information was also considered in the defect detection procedure [20]. Yang et al. utilized a convolution operation to explore the contour and edge information of the steel image and modeled the complex steel defects.
Recently, CNNs have demonstrated strong performance on the object detection task [21][22][23]. Most of the works are built on the YOLO [9] series, the fast RCNN [24] series or other object detection methodologies. Shi et al. developed an improved faster RCNN method for accurate steel surface defect detection [25]. Zhan proposed a bilaterally symmetric UNet for detection [26]. Yang and Guo also modified the YOLO network for detecting the defects [27]. Recently, generative adversarial network (GAN)-based methods have also been proposed for defect detection [28][29][30].

Deep Learning for Classification and Object Detection
Deep learning has demonstrated great effectiveness in the object classification task, and classification networks usually serve as the backbone of object detection methods. LeNet [31] was the first CNN-based method for image classification, which proved to be a success on handwritten digit recognition. In 2012, AlexNet demonstrated its performance in the ImageNet [32] competition with 62.5% accuracy and won first place by a large margin over the runner-up. After that, numerous classification networks with well-designed architectures and good classification performance were proposed. VGGNet [33] utilized a very deep network to improve the classification accuracy to 74.0%. ResNet [34] introduced the residual connection into image classification and achieved better performance than previous works with 78.4% accuracy. The residual connection has since been widely adopted in various network designs. DenseNet [35] provided a dense connection to build the network for better information transmission with 79.2% accuracy. Recently, there have been different modifications of ResNet and DenseNet to improve the classification performance. Zhang et al. proposed a multi-level residual network design and improved the network representation capacity [36]. Gao et al. developed a multi-scale backbone for ResNet and proposed Res2Net for image classification [37]. Xie et al. introduced aggregated residual transformations into ResNet and named their new network ResNeXt [38].
Based on the classification backbones, numerous object detection networks achieve good performance. R-CNN [39] is the first CNN-based method for object detection, which utilized a CNN to explore the features and used an SVM for classification. After that, fast R-CNN [40] modified the structure of R-CNN and made it an end-to-end technology with better performance and faster speed. Faster R-CNN [41] further improved fast R-CNN with higher speed and accuracy. Mask R-CNN [42] combined object detection and semantic segmentation.
The R-CNN series methods are two-stage detection technologies, which separate the localization and classification steps and are therefore time-consuming. To boost the speed of detection, the YOLO series methods were proposed to meet the real-time requirement. YOLO-v1 [9] is the first YOLO-series method, with restricted parameters and computational cost. To boost the precision, YOLO-v2 [10] used a new backbone and introduced other modifications while running faster. After that, YOLO-v3, v4 and v5 successively improved the performance with well-designed network components.
Besides R-CNN and YOLO, there are also other network architectures for object detection. SSD [43] added multi-scale feature exploration to YOLO and achieved better performance. RetinaNet [44] used focal loss and adopted ResNet as the backbone to boost the accuracy. CenterNet [45] performed object detection in an anchor-free style and jointly improved the speed and the accuracy. FCOS [46] also developed a fully convolutional network for one-stage object detection.

Deep Learning for Defect Detection
With the development of deep learning technology, numerous CNN-based methods have been designed specifically for defect detection. The deep learning-based defect detection works usually concentrate on effective network design and utilize well-established architectures to improve the network representation capacity and boost the detection accuracy [47]. Wu et al. applied an SSD-based detection network for accurate PCB defect detection [48]. An et al. developed an improved faster R-CNN network for fabric defect detection, which utilizes the VGG-16 backbone for feature extraction and builds a multi-scale feature pyramid model for the RPN network [49]. Luo et al. developed a decoupled two-stage network for FPCB surface defect detection [50]. In their work, the localization and classification operations are decoupled into two different modules. Additionally, they established the multi-hierarchical aggregation block and the locally non-local block for boosting the network performance. With these elaborate network designs, their network achieves state-of-the-art performance with 91.45% mAP on FPCB defect detection. Guan et al. also developed an improved YOLOv5 network to detect ceramic ring defects [51]. In their work, an attention mechanism was added to the YOLOv5 backbone and improved the detection performance. Their method achieves 89.9% mAP on ceramic ring defects, a state-of-the-art performance. Mo et al. proposed a weighted double-low-rank decomposition technology for fabric defect detection [52]. In contrast to other YOLO-based and R-CNN-based networks, their work regards defect detection as an optimization problem and utilizes the alternating direction method of multipliers (ADMM) to solve the task. With the new perspective and the novel methodology, their work achieves higher accuracy and better detection performance than other fabric defect detection methods. In contrast to the objective-oriented detection methods, Zeng et al. proposed a reference-based defect detection network for all defect detection tasks [53]. This method uses a well-aligned template reference to estimate the potential defects of the input images.
Importantly, several deep learning-based works concentrate on steel defect detection. YOLO and R-CNN are the two most popular baselines for developing the detection network. Su et al. developed an improved YOLO-v4 network for steel surface defect detection [54]. In their method, a channel attention mechanism was developed to capture the global information of the image feature. After that, an ICIoU loss function was introduced to replace the CIoU, which can more effectively solve the data imbalance issue. Their method achieves 78.63% mAP on the steel surface defect detection dataset and proves to be one of the state-of-the-art methods. Xie et al. also developed an improved faster R-CNN method for fast and accurate surface defect detection [55]. They modified the backbone of the faster R-CNN to better explore the image feature and achieve more accurate detection results. Beyond the YOLO and R-CNN series, there are also different technologies devised specifically for steel surface defect detection. Tian et al. proposed a complementary adversarial network-driven surface defect detection method for different types of defects [56]. In their work, an encoding-decoding architecture was developed for image segmentation and the discriminator loss was considered for its better performance. Additionally, dilated convolution and edge detection are also used in the network to effectively explore the image feature. Zhan also developed a bilaterally symmetric U-shaped network, dubbed BSU-Net, for effective surface defect detection [26]. In BSU-Net, an enhanced U-Net and a feature expanding network are combined to classify whether the image has defects. Cheng and Yu used RetinaNet as the backbone and embedded the channel attention mechanism and adaptive spatial feature fusion into the detection procedure to boost the accuracy [57]. Guan et al. devised a U-shaped architecture to detect the defects, which used VGG-19 as a feature extractor. Furthermore, the structural similarity and the decision tree are utilized to evaluate the image quality [58]. Han et al. developed a two-stage edge reuse network embedding the saliency information into defect detection [59]. In their method, an edge-aware foreground-background integration module was devised to explore the saliency and further concentrate on the defect information.

Attention Mechanism
The attention mechanism has proved to be an effective component for CNNs to boost the representation performance and improve the prediction results. In general, the attention mechanism can be separated into three different kinds: the channel-wise attention mechanism [60], the spatial attention mechanism [61] and the non-local attention mechanism [62]. The channel-wise attention mechanism embeds the image features into a vector and gives different weights to different feature channels. In contrast to the channel-wise attention, the spatial attention mechanism finds the weights for every pixel of the image features. The non-local attention mechanism calculates the global relationship of the image feature and utilizes the matrix multiplication operation to conduct the attention procedure. The attention mechanism has been widely used in different computer vision and image processing tasks, such as image super-resolution [60], image dehazing [63], object detection [61] and image segmentation [62].
The attention mechanism is also widely used in defect detection. Wang et al. used the spatial attention mechanism to detect subway tunnel defects [64]. Li et al. devised a dynamic attention graph convolution mechanism for point cloud defect detection [65]. Wu and Lu combined the spatial attention, channel attention and non-local attention mechanisms for fabric defect detection and achieved a 91.6% mAP performance [66]. Chen et al. used deformable convolution and the channel attention mechanism to build a strip steel surface defect detection network [67]. Peng et al. also developed a fabric defect detection network with both the spatial and channel attention mechanisms.

Method
In this section, we first introduce the design of the proposed network. Then, the multi-scale block and the spatial attention mechanism are described. Finally, we give the detailed implementation of the proposed network.

Network Design
Figure 2 shows the network design of the proposed improved multi-scale YOLO-v5 method. This network is composed of three different components: the bottleneck, the head and the detector. The input image is first processed by the bottleneck to explore the multi-scale features. Then, the extracted features are aggregated and further processed by the head. Finally, the multi-scale features are sent to the detector for classification and localization. As shown in the figure, the bottleneck is composed of the combination of convolution, batch normalization and SiLU activation (CBS), the multi-scale sequence (MS) and the spatial pyramid pooling fusion (SPPF). There are five CBSs, fifteen MSs and one SPPF in the bottleneck. In each MS, there are three multi-scale blocks (MBs) and one CBS for multi-scale feature fusion. The SPPF is composed of two CBSs and three max pooling (MaxPool) operations. The MaxPool operations explore the image feature in the spatial pyramid pooling fashion. Then, a CBS combines and fuses the multi-scale features.
The head of the network is composed of four CBSs, twelve MSs and several bicubic upsampling operations to maintain the image resolution. The head combines features from the different stages of the bottleneck and uses CBSs and MSs for better multi-scale feature fusion. Finally, the multi-scale features of the head are sent to the detector for object detection and localization.
The detector follows the vanilla YOLO-v5 design [13], which regresses the offsets of different anchors and localizes the objects. The detector contains three scales to effectively explore the small and large objects. For each scale, there are three anchors to localize the defects.
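The pooling stage of the SPPF can be illustrated with a minimal pure-Python sketch on a single-channel feature map. The text only states that the SPPF has three MaxPool operations framed by two CBSs; the 5 × 5 kernel, the stride of 1, the chaining of the three poolings and the concatenation of the input with the three pooled maps are assumptions borrowed from the vanilla YOLO-v5 SPPF, and the helper names are hypothetical.

```python
def max_pool_same(fmap, k=5):
    """k x k max pooling with stride 1; border windows are clipped so the size is kept.

    The kernel size k=5 is an assumption taken from the vanilla YOLO-v5 SPPF.
    """
    h, w = len(fmap), len(fmap[0])
    r = k // 2
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            window = [fmap[y][x]
                      for y in range(max(0, i - r), min(h, i + r + 1))
                      for x in range(max(0, j - r), min(w, j + r + 1))]
            row.append(max(window))
        out.append(row)
    return out


def sppf_branches(fmap, k=5):
    """Return the input and three chained poolings (assumed to be concatenated later).

    Chaining three k x k poolings enlarges the receptive field step by step,
    which is how the pyramid of scales arises.
    """
    p1 = max_pool_same(fmap, k)
    p2 = max_pool_same(p1, k)
    p3 = max_pool_same(p2, k)
    return [fmap, p1, p2, p3]


# A single active pixel spreads over a growing neighborhood at each pooling step.
fmap = [[0.0] * 7 for _ in range(7)]
fmap[3][3] = 1.0
branches = sppf_branches(fmap)
print([sum(map(sum, b)) for b in branches])
```

In a full network the four maps would be stacked along the channel axis and fused by the second CBS; that fusion is omitted here.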
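To make the three-scale, three-anchor layout concrete, the following sketch computes the size of the detector outputs for one input image. The six classes and the three anchors per scale come from the text; the strides of 8, 16 and 32 and the layout of four box values plus one confidence value plus the class scores per anchor are assumptions based on the vanilla YOLO-v5 detector, and the function names are hypothetical.

```python
NUM_CLASSES = 6          # six NEU-DET defect types (from the text)
ANCHORS_PER_SCALE = 3    # three anchors per scale (from the text)
STRIDES = (8, 16, 32)    # assumption: vanilla YOLO-v5 strides, not stated in the text


def detector_output_shapes(image_size=640):
    """Return (grid_h, grid_w, anchors, values_per_anchor) for each scale."""
    # Assumed per-anchor layout: 4 box offsets + 1 confidence + class scores.
    per_anchor = 4 + 1 + NUM_CLASSES
    return [(image_size // s, image_size // s, ANCHORS_PER_SCALE, per_anchor)
            for s in STRIDES]


def total_candidates(image_size=640):
    """Number of candidate boxes produced before any filtering."""
    return sum(h * w * a for h, w, a, _ in detector_output_shapes(image_size))


for shape in detector_output_shapes(640):
    print(shape)
print(total_candidates(640))
```

Under these assumptions, the finest grid handles small defects and the coarsest grid handles large ones, which matches the role of the three scales described above.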

Design of the Multi-Scale Block and Spatial Attention
Figure 3a shows the design of the MB. Two multi-scale convolutions (MSConv) explore the hierarchical image information. After that, one CBS with a skip connection builds the residual structure for better gradient transmission. Figure 3b shows the design of MSConv. In the MSConv, two 1 × 1 and two 3 × 3 convolutions process the image feature in a crossed fashion and explore the multi-scale information. After that, one 1 × 1 convolution combines the resulting features for information fusion and keeps the number of channels unchanged. A spatial attention (SA) mechanism then concentrates further on the defect information and improves the detection performance. Finally, a skip connection is introduced for better gradient transmission. Figure 4 shows the design of the SA, which has two convolutions, one ReLU activation and one sigmoid activation. The convolutions decrease and then increase the channel number symmetrically. The ReLU activation introduces non-linearity into the attention computation. The spatial attention mechanism thus follows an encoder-decoder design, which can effectively explore the spatial correlation of the input image feature. The sigmoid activation makes the attention weights non-negative and gives higher weights to the detected areas, which helps boost the network representation capacity and improve the detection accuracy.
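The SA computation described above can be sketched in pure Python on a tiny feature map stored as `feat[channel][row][col]`. Only the sequence of two convolutions, one ReLU and one sigmoid is taken from the text; the 1 × 1 kernel size, the absence of biases, the channel reduction ratio and the element-wise multiplication of the attention map with the input are assumptions made for illustration, and all names are hypothetical.

```python
import math


def conv1x1(feat, weight):
    """1 x 1 convolution: mix channels independently at every pixel (bias omitted)."""
    c_in, h, w = len(feat), len(feat[0]), len(feat[0][0])
    return [[[sum(weight[o][c] * feat[c][i][j] for c in range(c_in))
              for j in range(w)]
             for i in range(h)]
            for o in range(len(weight))]


def apply_elementwise(feat, fn):
    """Apply a scalar function to every value of a C x H x W feature."""
    return [[[fn(v) for v in row] for row in ch] for ch in feat]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def spatial_attention(feat, w_reduce, w_expand):
    """Encoder-decoder attention: reduce channels, ReLU, restore channels, sigmoid.

    Returns the reweighted feature and the attention map. Multiplying the
    attention map with the input is an assumption, not stated in the text.
    """
    hidden = apply_elementwise(conv1x1(feat, w_reduce), lambda v: max(v, 0.0))
    attn = apply_elementwise(conv1x1(hidden, w_expand), sigmoid)
    out = [[[f * a for f, a in zip(f_row, a_row)]
            for f_row, a_row in zip(f_ch, a_ch)]
           for f_ch, a_ch in zip(feat, attn)]
    return out, attn


# Two channels, one row, two pixels; channels are reduced 2 -> 1 and restored 1 -> 2.
feat = [[[1.0, 0.0]], [[2.0, 0.0]]]
w_reduce = [[1.0, 1.0]]
w_expand = [[1.0], [-1.0]]
out, attn = spatial_attention(feat, w_reduce, w_expand)
print(attn)
print(out)
```

Because the sigmoid output lies strictly between 0 and 1, every position receives a positive weight, and positions with a stronger response are attenuated less.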

Implementation Details
Table 1 shows the parameter settings of the proposed improved multi-scale YOLO-v5 network. The component index follows the order in Figure 2. For a CBS, the scale s means that the resolution of the image feature is decreased by a factor of s; for a bicubic operation, it means that the resolution is increased by a factor of s. During the training phase, we used the same loss functions as YOLO-v5, including the coordinate loss, the target confidence loss and the target classification loss. The weights and the implementation are entirely the same as in the vanilla YOLO-v5 design for a fair comparison.

Settings
We chose the NEU-DET [2] dataset to train and test our model. NEU-DET contains 1800 steel surface defect images with six typical defects: pitted surface, rolled-in scale, scratches, crazing, inclusion and patches. Among the images, we randomly chose 60% for training, 20% for validation and 20% for testing. We trained the network on one NVIDIA RTX 3080-Ti GPU with a batch size of 16 for 100 epochs. The optimizer was Adam with a learning rate of 10⁻³.
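As an illustration of the data protocol, the sketch below performs a random 60%/20%/20% split with the Python standard library. The seed, the use of integer identifiers in place of image files and the function name are hypothetical, and the text does not say whether the split was stratified by defect type, so none is applied here.

```python
import random


def split_dataset(items, train=0.6, val=0.2, seed=0):
    """Shuffle the items and split them into train/validation/test subsets.

    The seed is a hypothetical choice for reproducibility; the remaining
    fraction (here 20%) becomes the test set.
    """
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = round(len(items) * train)
    n_val = round(len(items) * val)
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])


# 1800 NEU-DET images, represented here by integer identifiers.
train_ids, val_ids, test_ids = split_dataset(range(1800))
print(len(train_ids), len(val_ids), len(test_ids))
```

For 1800 images this yields 1080 training, 360 validation and 360 test images.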
The performance metrics are precision, recall and mean average precision (mAP). The precision and the recall are defined as

$$P = \frac{TP}{TP + FP}$$

and

$$R = \frac{TP}{TP + FN},$$

where TP, FP and FN are the numbers of true positive, false positive and false negative samples, respectively. P is the precision and R is the recall.
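A minimal sketch of these metrics follows. Precision and recall are computed from the TP, FP and FN counts as defined above; the intersection-over-union (IoU) helper shows the usual way a prediction is matched to a ground-truth box before it is counted as a true positive (for example, at an IoU threshold of 0.5 for mAP50). The corner-based box format, the function names and the F1 helper are illustrative assumptions rather than the authors' evaluation code.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall(tp, fp, fn):
    """P = TP / (TP + FP) and R = TP / (TP + FN); 0.0 when a denominator is zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


p, r = precision_recall(tp=8, fp=2, fn=4)
print(p, r, f1_score(p, r))
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
```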

Results
To demonstrate the effectiveness of our method, we mainly compared the improved version with two vanilla YOLO-v5 network settings: YOLOv5-s and YOLOv5-m. We first compared the computational complexity of the different methods. Table 2 shows the parameters, GFLOPs and time costs of the different methods. The GFLOPs are calculated by processing one 640 × 640 image. The table shows that our method satisfies the real-time speed requirement and can process more than 190 images per second.
We also compared our method with YOLO-v7-tiny [68], one of the state-of-the-art object detection methods. Table 3 shows the precision, recall, mAP50 and mAP50-95 comparisons among the different object detection methods. Our network achieves the highest scores on all testing indicators, which indicates that it can effectively detect the defects of the steel surfaces. Figure 5 shows the PR curves of the different methods. Our method has a larger area under the curve (AUC), which denotes a better performance than the other methods. To further investigate the effectiveness of our method, we also plotted the precision, recall and F1 curves. Figure 6 shows the results of the different indicators. Our method performs well on the different kinds of defects. Finally, we visualized the results of the steel surface defect detection. Figure 7 shows the comparison between the ground truth and our prediction results. Our method predicts most of the defects on the steel and is robust to defects with different scales.
The performance gain comes from the well-designed network architecture. In Table 2, our method has similar parameters, GFLOPs and time costs to YOLO-v5m, yet its performance is superior to that of YOLO-v5m. It should be noted that YOLO-v5m is a larger version of YOLO-v5s whose performance improvement is limited. From this point of view, the performance gain comes from the new architecture rather than from the larger network. It should also be noted that the best mAP50 in Table 3 is approximately 0.72, which is lower than the values reported elsewhere. This is because we used an entirely different data organization protocol from other papers. In our work, the NEU-DET dataset is split into 60%, 20% and 20% for training, validation and testing, respectively. The amount of training data is therefore much smaller than in other works, which we chose in order to assess the generalization performance. To fairly compare the effectiveness of the different methods, we re-trained them under the same protocol, so the results are reliable for measuring the performance.
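The relation between the PR curve and the mAP values can be illustrated with a short sketch that computes average precision (AP) as the area under a monotone precision envelope and then averages it over classes. This is the common all-point interpolation; the exact interpolation used by the evaluation code behind Table 3 is not stated in the text, so the scheme, the function names and the example values below are assumptions for illustration only.

```python
def average_precision(recalls, precisions):
    """Area under the PR curve for one class; inputs sorted by increasing recall."""
    # Sentinel points at recall 0 and 1 close the curve.
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Replace each precision by the maximum precision at any higher recall,
    # which removes the zig-zag shape of the raw curve.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum the rectangles between consecutive recall values.
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))


def mean_average_precision(per_class_ap):
    """mAP is the mean of the per-class AP values."""
    return sum(per_class_ap) / len(per_class_ap)


# Hypothetical PR points for two classes, not values from the paper.
ap_a = average_precision([0.5, 1.0], [1.0, 0.5])
ap_b = average_precision([1.0], [1.0])
print(ap_a, ap_b, mean_average_precision([ap_a, ap_b]))
```

A larger area under the PR curve of a class therefore translates directly into a higher AP for that class, and the mAP summarizes this over the six defect types.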

Conclusions
In this paper, we proposed an improved multi-scale YOLO-v5 network for steel surface defect detection. To focus on diverse defects at different scales, we developed a multi-scale block to effectively explore the defects with different resolutions. To further improve the network performance and concentrate more on the defect areas, we developed a spatial attention mechanism to give higher weights to abnormal information. The experimental results show that the improved multi-scale YOLO-v5 network can effectively detect different kinds and scales of defects and satisfies the real-time speed requirement.

Figure 2. Network design of the proposed improved multi-scale YOLO-v5 method.

Figure 4. Design of the spatial attention (SA) mechanism.

Figure 5. PR curve comparisons among different methods.

Figure 6. Precision, recall and F1 curves of our method.

Figure 7. Visualized results of the steel surface defect detection: (a-c) Ground truth. (d-f) Prediction results. Zoom in for a better view.

Table 1. Parameter settings of the improved multi-scale YOLO-v5 network.

Table 2. Computational complexity comparisons among different methods.