Steel Surface Defect Detection Algorithm Based on YOLOv8

Abstract: To improve the accuracy of steel surface defect detection, this study proposes an improved model based on the YOLOv8 algorithm, optimized in several directions. First, we modify the CSP Bottleneck with two convolutions (C2F) module in YOLOv8 by introducing deformable convolution (DCN) to enhance the learning and representation of defect features with complex textures and irregular shapes. Second, the Bidirectional Feature Pyramid Network (BiFPN) structure is adopted to learn a weight distribution over input features of different scales in the feature fusion stage, allowing multi-level feature information to be integrated more effectively. Third, the BiFormer attention mechanism is embedded in the backbone network, allowing the model to allocate attention adaptively based on target features, for example by skipping non-critical areas and focusing on potentially defective parts. Finally, we change the loss function from Complete-Intersection over Union (CIoU) to Wise-IoUv3 (WIoUv3) and use its dynamic non-monotonic focusing property to reduce overfitting to low-quality target bounding boxes. The experimental results show that the mean Average Precision (mAP) of the improved model on the steel surface defect detection task reaches 84.8%, an improvement of 6.9 percentage points over the original YOLOv8 model. The improved model can quickly and accurately locate and classify many kinds of steel surface defects in practical applications and meets the needs of steel defect detection in industrial production.


Introduction
Steel, as a crucial foundational material, plays a key role in the development of the national economy [1]. It is extensively used in various sectors, including construction [2], manufacturing, transportation, and energy. Steel quality directly affects engineering safety and economic efficiency. However, surface defects produced during steel production (such as crazing, inclusions, patches, pitted surfaces, rolled-in scale, and scratches [3]) are likely to cause safety hazards. Therefore, accurate and efficient detection of steel surface defects is crucial. Detection methods fall into three main categories: manual detection, traditional photoelectric detection, and advanced machine vision detection [4]. Although manual detection is intuitive, it is limited by high labor costs and significant differences in subjective judgment, resulting in low accuracy and efficiency. Traditional photoelectric detection methods include eddy current testing [5], magnetic leakage testing [6], infrared testing [7], and laser scanning detection [8]. However, due to their high cost, these methods have not been widely adopted.
In recent years, with the advancement of technologies such as machine vision, the detection of surface defects in steel has evolved towards automation and artificial intelligence. Using high-resolution cameras, advanced image processing algorithms, and deep learning models, surface defects in steel can be detected and classified automatically. These technologies enhance the accuracy and efficiency of detection while reducing the influence of human factors. Numerous researchers have explored both traditional machine learning and deep learning techniques for steel defect detection. Traditional machine learning work often focuses on feature extraction methods. Park et al. [9] proposed a two-stage approach that uses statistical methods for feature processing, followed by classification with Support Vector Machines (SVM) to identify defects, and demonstrated the detection of most defects. Xu et al. [10] employed multi-scale geometric analysis to decompose images into various detail levels, calculating statistical features for each sub-band and combining them into high-dimensional feature vectors. A graph embedding algorithm was then used to reduce the dimensionality of these vectors and extract key features, followed by SVM for steel defect classification. Yun et al. [11] used the discrete wavelet transform for image processing, optimizing the extracted features and locating defects with dynamic programming. Hu et al. [12] proposed a method using BP neural networks and SVM for steel surface defect detection. Images of defective steel were binarized to extract features and determine defect types; experimental comparisons showed higher classification accuracy for the SVM model, while the BP neural network exhibited faster recognition speeds. Liu et al. [13] introduced a defect classification method using an ensemble of Extreme Learning Machines (ELM) based on local binary patterns, which capture local texture and structural information for feature extraction. The ELMs were trained as individual models, and their outputs were aggregated to determine the final classification decision. Ashour et al. [14] proposed a feature extraction method combining the Discrete Shearlet Transform (DST) with a Gray-Level Co-occurrence Matrix (GLCM). First, the Shearlet transform is employed to capture texture and edge information from the image. GLCM is then applied to the transformed sub-bands, followed by principal component analysis to reduce the dimensionality of the resulting high-dimensional features. Defect classification is performed using a supervised SVM.
In recent years, deep learning has become increasingly popular because, unlike traditional machine learning methods, it can learn more complex patterns in large datasets. Deep learning has been used to solve various problems in different fields, such as Unmanned Aerial Vehicle data processing, climate change prediction, and environmental analysis [15][16][17][18]. As a result, many researchers now use deep learning for steel defect detection. Soukup et al. [19] proposed training Convolutional Neural Networks (CNN) under a fully supervised strategy to improve the detection of steel surface defects, and further improved performance through regularization. Yi et al. [20] proposed a method combining symmetrical surround saliency maps with CNN. The saliency maps segment defect areas, and a deep CNN then recognizes the defects. This end-to-end approach avoids the separation of feature extraction and image classification present in traditional methods, thereby enhancing detection efficiency and accuracy. Damacharla et al. [21] used a transfer learning strategy and applied ResNet and DenseNet encoders on top of the U-Net model to improve the accuracy of steel defect detection. He et al. [22] introduced an improved Fast R-CNN network model that integrates multi-scale feature maps from different network layers, minimizing feature loss and improving steel defect detection. Uraon et al. [23] presented an FPN+ResNet network model for multi-defect detection in complex backgrounds. Bouguettaya et al. [24] proposed a network model combining MobileNet-V2 and Xception with transfer learning. Deep ensemble learning was employed to retain the advantage of fast execution while addressing potential issues related to model size in traditional deep learning methods. Akhyar et al. [25] integrated deformable convolution and deformable RoI pooling into the cascaded R-CNN architecture, enhancing the model's adaptability to changes in target shape. A guided anchor proposal strategy directed the network's attention towards regions that might contain targets, and random scaling and ultimate scaling techniques helped the model process targets more accurately. Lan et al. [26] improved the CasMVS Net using a three-dimensional reconstruction network and introduced multi-scale feature enhancement. Extracting features at various scales and fusing them effectively improved accuracy, and point cloud processing techniques were combined to locate and identify surface defects on steel plates more precisely. Xia et al. [27] proposed an improved YOLOv5s model. They designed a reparameterizable large-kernel C3 module, which enhances the model's ability to perceive and extract features in complex texture environments, and adopted a training strategy in which convolution kernels of different sizes correspond to feature maps of different scales in order to adapt to defects of different shapes. Raj et al. [28] proposed the YOLOv7-CSF model, which introduced a lightweight, low-cost coordinate attention mechanism into the head structure of YOLOv7 and adopted the SCYLLA-Intersection over Union loss function to improve detection efficiency. Huang et al. [29] proposed the WFE-YOLOv8s model based on YOLOv8s, replacing the original C2F module with a new CFN structure to reduce the number of network parameters and GFLOPs, and improving accuracy through an EMA attention mechanism; mAP increased by 4.7 percentage points compared with the original model.
The steel defect detection algorithms mentioned above, although innovative, have limitations that cannot be ignored. First, when faced with a variety of steel surface defects, traditional machine learning algorithms rely on manual feature extraction, which generalizes relatively poorly and responds inflexibly to variable defect types. Second, although CNNs perform well in many tasks, they may fall short in global feature extraction, which limits their ability to identify small, detail-rich defects. Furthermore, in industrial environments with limited computing resources, existing complex network architectures may be too slow for real-time monitoring and online feedback, which affects the operating efficiency of the entire production line and product quality control. To solve these problems, this paper proposes an improved YOLOv8 algorithm, which aims to improve the accuracy of steel surface defect detection while preserving real-time performance. The primary contributions of this paper are as follows:
1. Adding deformable convolutions to the backbone network to enhance adaptability to target shapes and local structures. This improved mAP by about 3.2 percentage points.
2. Replacing the original model's feature fusion structure with BiFPN to capture feature information from different scales more effectively, thereby improving detection accuracy. Building on the previous step, this increased mAP by approximately 1.6 percentage points.
3. Integrating BiFormer into the backbone network for adaptive attention to targets and allocation of computational resources, leading to improved detection accuracy. mAP grew by about one further percentage point.
4. Implementing WIoUv3 to enhance the accuracy and generalization of bounding box prediction. mAP gained another 1.1 percentage points.

YOLOv8 Algorithm Introduction
Faster R-CNN, RetinaNet, SSD, and the YOLO series are all classic object detection algorithms. Faster R-CNN is a two-stage detector with high precision but limited speed. RetinaNet introduces Focal Loss to improve small target detection; it is fast but has room for improvement in accuracy. SSD achieves real-time, efficient detection with single-stage detection and multi-scale feature mapping, but its recognition accuracy for small targets is poor. YOLOv8 inherits the features of the YOLO series and, compared with previous versions, optimizes the network structure and improves the loss function, the sample matching strategy, and the training strategy. With these optimizations, YOLOv8 greatly improves inference speed and also improves detection accuracy to some extent. YOLOv8 is currently available as YOLOv8n, YOLOv8s, YOLOv8m, YOLOv8l, and YOLOv8x, listed in order of increasing model size; the versions differ mainly in model size, number of parameters, computing resource requirements, and detection accuracy. These releases are designed to meet the needs of different tasks, hardware, and performance targets. Embedded devices and edge computing platforms in industrial environments often have strict limits on computing resources, while the YOLOv8n model is small, requires relatively little memory and computing power, and is suitable for deployment on hardware with limited resources. Although the performance of YOLOv8 largely depends on the quality and diversity of the training data, it is still a relatively good choice for steel defect detection, so this paper builds on YOLOv8n. The network architecture of YOLOv8 is depicted in Figure 1.
The YOLOv8 algorithm consists of four main components: Input, Backbone, Neck, and Head. Mosaic is used for data augmentation at the input side, which increases the diversity of the training set, reduces overfitting, and improves the generalization ability of the model in new scenarios. To prevent excessive data augmentation, this operation is disabled in the last 10 epochs of training. The Backbone module, responsible for feature extraction, includes modules such as C2F and Spatial Pyramid Pooling Fusion (SPPF). C2F comprises convolutional layers, split operations, and multiple Bottleneck units. The design of the C2F module is inspired by the Efficient Layer Aggregation Network (ELAN), an architecture designed to improve the computational efficiency and performance of networks in computer vision and deep learning. The C2F module also draws on the design of C3 and collects multi-level feature information through connections between multiple branches, thereby obtaining more accurate and richer gradient signals. This design works well for small target detection because it alleviates the weak features and difficult localization of small targets, improving the detection accuracy and stability of the model. SPPF applies successive max pooling operations to the feature map, keeping feature sizes consistent and enlarging the receptive field. It also enables the fusion of local and global features, helping the network learn multiple levels of semantic information.
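The pooling stage of SPPF can be illustrated with a small pure-Python sketch. It assumes the layout commonly used in YOLOv5/YOLOv8 implementations (three chained 5×5 max pools with stride 1 and same padding, whose outputs are concatenated with the input); the function names `max_pool2d` and `sppf_branches` are hypothetical, and the surrounding convolutions are omitted. The sketch also shows why chaining enlarges the receptive field: two chained 5×5 pools cover the same window as one 9×9 pool, and three cover a 13×13 window.

```python
import random

def max_pool2d(grid, k):
    """Stride-1 max pooling with 'same' padding (cells outside the map are ignored)."""
    h, w = len(grid), len(grid[0])
    r = k // 2
    return [
        [
            max(
                grid[a][b]
                for a in range(max(0, i - r), min(h, i + r + 1))
                for b in range(max(0, j - r), min(w, j + r + 1))
            )
            for j in range(w)
        ]
        for i in range(h)
    ]

def sppf_branches(grid, k=5):
    """Return the four same-sized maps that an SPPF-style block would concatenate."""
    p1 = max_pool2d(grid, k)   # effective window 5x5
    p2 = max_pool2d(p1, k)     # effective window 9x9
    p3 = max_pool2d(p2, k)     # effective window 13x13
    return [grid, p1, p2, p3]

random.seed(0)
feature = [[random.random() for _ in range(12)] for _ in range(12)]
branches = sppf_branches(feature)

# Chained small pools reproduce the larger pools exactly.
print(branches[2] == max_pool2d(feature, 9))
print(branches[3] == max_pool2d(feature, 13))
```

Because every branch keeps the spatial size of the input, the maps can be concatenated along the channel axis regardless of the input resolution.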
The Neck module uses the Path Aggregation Feature Pyramid Network (PAFPN) structure for feature fusion. This network combines the Feature Pyramid Network (FPN) [30] and the Path Aggregation Network (PANet) [31]. FPN employs a multi-level feature pyramid in a top-down manner to fuse low-level and high-level image features, producing feature maps at different scales. PANet introduces an additional bottom-up pathway after extracting features from the original input image, significantly reducing the computational cost of feature propagation. By combining FPN and PANet, the feature maps are adaptively fused, addressing scale differences to improve detection of targets of various sizes and enhancing the network's feature representation capability. The Head module adopts the decoupled head paradigm [32], removing the previous objectness branch and retaining separate classification and regression branches for predicting class labels and bounding boxes, respectively. This approach enables better focus on category and boundary information.

Improved YOLOv8 Algorithm
The improved network structure is shown in Figure 2. First, C2F_DCNv2, which incorporates deformable convolution, replaces the C2F module in the original network, enhancing the model's ability to capture complex shapes and irregular target features. Second, PAFPN is replaced with BiFPN, which improves the accuracy of the model in dealing with defects of various sizes, especially small or complex surface defects. Third, the BiFormer attention mechanism is introduced to enhance the model's focus on defect details within images. Finally, the original CIoU is replaced with WIoUv3 to improve the accuracy of bounding box regression. Together, these four changes improve the robustness and accuracy of the model with respect to changes in defect type, size, and location.

Introducing C2F_DCNv2
Deformable convolution differs from traditional convolution in that the sampling positions are not fixed and can be adjusted adaptively [33]. It introduces learnable offsets on top of traditional convolution. Since the offsets can be fractional, bilinear interpolation is used to compute the feature values at the offset positions. The output formula for deformable convolution is as follows:

$$y(p_0) = \sum_{p_n \in R} w(p_n) \cdot x(p_0 + p_n + \Delta p_n)$$

where $R$ is the regular sampling grid of the kernel, $w(p_n)$ represents the weight of the convolutional kernel at position $p_n$, $x(p_0 + p_n + \Delta p_n)$ represents the feature value of the feature map at position $p_0 + p_n + \Delta p_n$, and $\Delta p_n$ represents the offset added on top of the position $p_0 + p_n$. Deformable convolution allows objects at different scales to be covered, enhancing the model's feature representation capability and improving detection performance. However, the offsets may cause the kernel to cover irrelevant regions, thereby disturbing feature extraction and degrading overall performance. To address this issue, a team from the University of Science and Technology of China proposed an upgraded version of deformable convolution known as DCNv2 [34]. DCNv2 introduces a modulation mechanism by incorporating a modulation parameter $\Delta m_k \in [0, 1]$, which learns the weights of the sampling points. For uninteresting regions, the weight coefficient $\Delta m_k$ is assigned a small value. Although additional learnable parameters are required, this technique is worth introducing due to the improvements in model generalization and performance. The output formula for DCNv2, written with sampling index $k = 1, \dots, K$, is as follows:

$$y(p_0) = \sum_{k=1}^{K} w(p_k) \cdot x(p_0 + p_k + \Delta p_k) \cdot \Delta m_k$$

In this paper, the idea of DCNv2 is integrated into the C2F module of YOLOv8, resulting in the proposed C2F_DCNv2 layer. The backbone network is primarily responsible for extracting low-level features and global information from the original image, and these features are crucial for comprehending the overall context of the image. Therefore, the last three C2F modules in the backbone network are replaced with C2F_DCNv2 modules. The introduction of DCNv2 enlarges the effective receptive field and allows effective locations to be sampled. The Conv operation in the Bottleneck of the C2F module is replaced with DCNv2. With DCNv2, the model can more accurately capture the boundary details and complex shapes of target objects, which is especially helpful for detecting objects with variable morphology. The structures of DCNv2 and the modified Bottleneck are illustrated in Figure 3 and Figure 4, respectively.
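The modulated sampling described above can be made concrete with a minimal single-channel sketch that evaluates one output location of a 3×3 kernel. It is an illustration only, not the authors' C2F_DCNv2 layer: the helper names `bilinear` and `dcnv2_at` are hypothetical, and the offsets and modulation values are passed in by hand, whereas in DCNv2 they are predicted by an extra convolution.

```python
import math

def bilinear(feat, y, x):
    """Bilinearly interpolate feat at fractional (y, x); outside the map counts as 0."""
    h, w = len(feat), len(feat[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:
                weight = (1 - abs(y - yy)) * (1 - abs(x - xx))
                value += weight * feat[yy][xx]
    return value

def dcnv2_at(feat, kernel, p0, offsets, masks):
    """Modulated deformable 3x3 response at centre p0 = (row, col).

    kernel  : 9 weights w(p_k)
    offsets : 9 (dy, dx) pairs, the learned offsets (given here as inputs)
    masks   : 9 modulation scalars in [0, 1]
    """
    grid = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
    out = 0.0
    for k, (dy, dx) in enumerate(grid):
        off_y, off_x = offsets[k]
        sample = bilinear(feat, p0[0] + dy + off_y, p0[1] + dx + off_x)
        out += kernel[k] * sample * masks[k]
    return out

# Toy 5x5 feature map with value 5*row + col, and an averaging kernel.
feat = [[float(5 * r + c) for c in range(5)] for r in range(5)]
kernel = [1.0 / 9.0] * 9
zero = [(0.0, 0.0)] * 9
shifted = [(0.5, 0.5)] * 9
ones = [1.0] * 9

plain = dcnv2_at(feat, kernel, (2, 2), zero, ones)        # ordinary convolution
deformed = dcnv2_at(feat, kernel, (2, 2), shifted, ones)  # every tap moved by half a pixel
muted = dcnv2_at(feat, kernel, (2, 2), shifted, [0.0] * 9)
```

With zero offsets and unit modulation the function reduces to an ordinary 3×3 convolution; setting a modulation value to 0 removes that sampling point from the sum, which is how uninteresting regions are suppressed.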

Bi-Directional Feature Pyramid Network
In order to better fuse multi-scale features, the PAFPN in YOLOv8 is replaced with BiFPN [35]. Compared to other structures, BiFPN can effectively fuse features without increasing computational cost. The main idea behind BiFPN is to construct a feature pyramid using information flow in both the bottom-up and top-down directions while applying repeated weighted fusion at each pyramid level. By leveraging information flow from both directions, BiFPN can fuse features at various levels to better accommodate objects of assorted sizes. Through the repeated weighted fusion process, BiFPN enhances the accuracy and generalization capability of the model, thus improving object detection performance. Compared to the PAFPN feature fusion network, BiFPN has the following differences:
① Unidirectional input nodes are removed. These nodes do not participate in cross-level feature fusion, and their impact on overall network performance is relatively small, so removing them simplifies the network structure.
② The original input node and the output node of the same layer are connected, so that the feature map of the layer is better retained and used during feature fusion. This enhances information transmission and fusion within the same layer and improves the perception and recognition of targets.
③ By repeating the process, the network gradually fuses features from more levels, resulting in a more comprehensive and semantically expressive final feature representation.
④ Instead of simple feature map stacking or addition, as in traditional fusion methods, BiFPN uses weighted feature fusion. Since features may have different semantic information and resolutions, they are fused with distinct weights to ensure a more comprehensive and accurate feature representation.
Due to the complex connection patterns of BiFPN, an accurate training strategy needs to be designed. BiFPN is incorporated through BiFPN_Concat2 and BiFPN_Concat3 modules, which set learnable weights for the different branches and apply Concat operations to feature maps in two-branch and three-branch configurations, respectively. The structural diagrams of PAFPN and BiFPN are shown in Figure 5 and Figure 6, respectively.
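A minimal sketch of weighted feature fusion is given below. It assumes the fast normalized fusion rule from the BiFPN paper [35], in which each learnable weight is clipped at zero and the weights are normalized by their sum plus a small epsilon; whether the BiFPN_Concat2 and BiFPN_Concat3 modules use exactly this rule is an assumption. The function name `fast_normalized_fusion` is hypothetical, and plain nested lists stand in for feature maps that have already been resized to the same shape.

```python
def fast_normalized_fusion(features, weights, eps=1e-4):
    """Fuse same-shaped 2D maps as sum_i(w_i * I_i) / (eps + sum_j w_j)."""
    clipped = [max(0.0, w) for w in weights]   # ReLU keeps every weight non-negative
    total = eps + sum(clipped)
    rows, cols = len(features[0]), len(features[0][0])
    return [
        [
            sum(w * f[r][c] for w, f in zip(clipped, features)) / total
            for c in range(cols)
        ]
        for r in range(rows)
    ]

# Two-branch case (as in a Concat2-style node): the second input is trusted three times as much.
shallow = [[1.0, 1.0], [1.0, 1.0]]
deep = [[3.0, 3.0], [3.0, 3.0]]
fused = fast_normalized_fusion([shallow, deep], [1.0, 3.0])

# A negative raw weight is clipped to zero, so that branch drops out of the fusion.
only_deep = fast_normalized_fusion([shallow, deep], [-1.0, 2.0])
```

During training the raw weights would be learnable parameters, one per input branch, so the network can learn how much each scale contributes at each fusion node.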

BiFormer Attention Mechanism
To improve the feature extraction capability of YOLOv8 for small objects, the BiFormer attention mechanism [36] is introduced at the end of the Backbone. In the YOLOv8 model, the backbone serves as the feature extraction component and the neck serves as the feature fusion component; therefore, adding the BiFormer attention mechanism at the end of the backbone helps extract feature information from the input images. BiFormer uses sparse sampling to preserve fine-grained feature information, enabling better feature representation on smaller feature maps and addressing the low accuracy of small object detection. The core component of BiFormer is the BiFormerBlock, which consists of three parts: region partition and input projection, region-to-region routing with a directed graph, and token-to-token attention. First, the input feature map is linearly projected to obtain the query (Q), key (K), and value (V). The calculation formulas for Q, K, and V are as follows:

$$Q = X^{r} W^{q}, \quad K = X^{r} W^{k}, \quad V = X^{r} W^{v}$$

In the above formula, $X \in \mathbb{R}^{H \times W \times C}$ is the input feature map. It is divided into $S \times S$ regions, with each region containing $HW/S^{2}$ feature vectors. $X$ is reshaped into $X^{r} \in \mathbb{R}^{S^{2} \times \frac{HW}{S^{2}} \times C}$, and $W^{q}$, $W^{k}$, and $W^{v}$ represent the projection weights for query, key, and value, respectively.
Electronics 2024, 13, x FOR PEER REVIEW

Then, a directed graph is constructed using an adjacency matrix to determine which key-value pairs should be involved in the attention computation, i.e., which regions a given region should attend to. Next, the region-to-region routing index matrix is used to keep only the most relevant candidate regions, referred to as routing regions. Finally, fine-grained token-to-token attention is applied to the tokens in the routing regions, producing the attention outputs. BiFormer is thus a bi-level routing dynamic sparse attention mechanism. Compared to traditional global attention mechanisms, it filters at the coarse-grained region level and applies fine-grained token-to-token attention only within the routing regions. It allocates computation flexibly and improves computational efficiency by attending selectively to relevant tokens in a query-adaptive manner and skipping irrelevant regions. BiFormer adopts a four-level pyramid structure with a downsampling factor of 32. Specifically, the first stage uses overlapping patch embedding, and the second to fourth stages use patch merging modules to reduce the input spatial resolution and increase the number of channels. Although BiFormer optimizes computational efficiency, introducing this attention slightly increases the computational overhead of the overall algorithm. The network structure of BiFormer is shown in Figure 7.
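The two routing levels can be sketched in a few lines of pure Python. The sketch assumes, following the BiFormer paper [36], that region-level queries and keys are the per-region means of the token queries and keys, that the routing index keeps the top-k regions by region-to-region affinity, and that scaled dot-product attention is then applied over tokens gathered from the routed regions. The function names are hypothetical, and the local context enhancement term of the original design is omitted.

```python
import math

def mean_vec(vectors):
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_regions(q_regions, k_regions, topk):
    """Coarse level: for each region, keep the topk regions with the highest affinity."""
    q_r = [mean_vec(r) for r in q_regions]
    k_r = [mean_vec(r) for r in k_regions]
    routes = []
    for qi in q_r:
        scores = [dot(qi, kj) for kj in k_r]
        order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        routes.append(order[:topk])
    return routes

def bi_level_attention(q_regions, k_regions, v_regions, topk):
    """Fine level: token-to-token attention restricted to the routed regions."""
    routes = route_regions(q_regions, k_regions, topk)
    scale = 1.0 / math.sqrt(len(q_regions[0][0]))
    outputs = []
    for i, q_tokens in enumerate(q_regions):
        keys = [k for j in routes[i] for k in k_regions[j]]
        vals = [v for j in routes[i] for v in v_regions[j]]
        region_out = []
        for q in q_tokens:
            attn = softmax([dot(q, k) * scale for k in keys])
            region_out.append(
                [sum(a * v[d] for a, v in zip(attn, vals)) for d in range(len(vals[0]))]
            )
        outputs.append(region_out)
    return routes, outputs

# Three regions with two 2-dimensional tokens each; Q = K = V for simplicity.
tokens = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[0.0, 1.0], [0.0, 1.0]],
    [[-1.0, 0.0], [-1.0, 0.0]],
]
routes, outputs = bi_level_attention(tokens, tokens, tokens, topk=1)
```

With `topk=1` each region in this toy input routes only to itself, so tokens in the other two regions are never touched, which is the source of the savings relative to global attention.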

Wise-IoU
In object detection tasks, bounding boxes are commonly used to represent the position and size of targets. Bounding boxes are usually represented by four coordinate values, i.e., the coordinates of the top-left and bottom-right corners. The traditional IoU loss function evaluates the overlap between two boxes by computing the ratio of their intersection to their union. However, the IoU loss function has some limitations. For example, it may inaccurately assess the overlap between large and small objects, it may be unfair to boxes of different scales, and it cannot distinguish cases where the overlap is high but the target is not properly localized. In YOLOv8, the CIoU loss [37] is used as the bounding box regression loss. CIoU is an improved loss function that considers position offset, scale difference, and aspect ratio, providing a more accurate assessment of the similarity between predicted and ground truth boxes. The loss function is defined as follows:

$$L_{CIoU} = 1 - IoU + \frac{\rho^{2}(b, b^{gt})}{c^{2}} + \alpha V$$

In the CIoU loss function, $b$ represents the center coordinates of the predicted box, $b^{gt}$ represents the center coordinates of the ground truth box, and $\rho$ is the Euclidean distance between the two center points. $c$ represents the diagonal length of the minimum enclosing box that contains both the predicted box and the ground truth box. $\alpha$ is a positive weight parameter, and $V$ measures the similarity of aspect ratios between the predicted box and the ground truth box; it penalizes cases where the two aspect ratios differ significantly.
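As a worked illustration, the sketch below computes the CIoU loss for a single pair of axis-aligned boxes in (x1, y1, x2, y2) format. It assumes the standard definitions from the CIoU paper [37] for the aspect-ratio term, V = (4/π²)(arctan(w_gt/h_gt) − arctan(w/h))², and the trade-off weight, α = V / ((1 − IoU) + V); the function name `ciou_loss` and the small `eps` stabilizer are illustrative choices.

```python
import math

def ciou_loss(pred, gt, eps=1e-9):
    """CIoU loss for two boxes given as (x1, y1, x2, y2) with positive width and height."""
    px1, py1, px2, py2 = pred
    gx1, gy1, gx2, gy2 = gt
    pw, ph = px2 - px1, py2 - py1
    gw, gh = gx2 - gx1, gy2 - gy1

    # Intersection over union.
    inter_w = max(0.0, min(px2, gx2) - max(px1, gx1))
    inter_h = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = inter_w * inter_h
    iou = inter / (pw * ph + gw * gh - inter + eps)

    # Squared distance between the box centres (rho^2).
    rho2 = ((px1 + px2) - (gx1 + gx2)) ** 2 / 4 + ((py1 + py2) - (gy1 + gy2)) ** 2 / 4

    # Squared diagonal of the smallest enclosing box (c^2).
    cw = max(px2, gx2) - min(px1, gx1)
    ch = max(py2, gy2) - min(py1, gy1)
    c2 = cw ** 2 + ch ** 2 + eps

    # Aspect-ratio consistency term V and its weight alpha.
    v = (4 / math.pi ** 2) * (math.atan(gw / gh) - math.atan(pw / ph)) ** 2
    alpha = v / ((1 - iou) + v + eps)

    return 1 - iou + rho2 / c2 + alpha * v

same = ciou_loss((0.0, 0.0, 10.0, 10.0), (0.0, 0.0, 10.0, 10.0))
shifted = ciou_loss((0.0, 0.0, 2.0, 2.0), (1.0, 0.0, 3.0, 2.0))
```

For the shifted pair the boxes have the same aspect ratio, so V = 0 and the loss reduces to 1 − IoU plus the normalized centre distance, here 1 − 1/3 + 1/13.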
Although the CIoU loss outperforms the traditional IoU loss in handling issues such as bounding box offset and aspect ratio imbalance, it mainly strengthens the fitting capability of bounding box regression. When the dataset contains low-quality examples, however, overemphasizing bounding box regression on these examples can lead to overfitting and reduce the detection performance of the model. In this paper, Wise-IoUv3 [38] replaces CIoU and addresses this problem through a dynamic non-monotonic focusing mechanism.
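WIoUv1 is an attention-based bounding box loss that constructs distance attention from a distance metric. The formula is as follows:

$$L_{WIoUv1} = R_{WIoU} \, L_{IoU}, \qquad L_{IoU} = 1 - IoU$$

$$R_{WIoU} = \exp\left(\frac{(x - x_{gt})^{2} + (y - y_{gt})^{2}}{\left(W_{g}^{2} + H_{g}^{2}\right)^{*}}\right)$$

In the formula, $(x, y)$ and $(x_{gt}, y_{gt})$ represent the center coordinates of the predicted box and the ground truth box, respectively. $W_{g}$ and $H_{g}$ represent the width and height of the minimum enclosing region that contains both the ground truth and predicted boxes. The superscript $*$ indicates that $W_{g}$ and $H_{g}$ are detached from the computational graph, so that they do not produce gradients.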
Building on WIoUv1, WIoUv3 uses a non-monotonic dynamic focusing coefficient to allocate the gradient gain of each anchor box, so that the model pays more attention to samples that are difficult to match accurately to the target, thereby enhancing detection accuracy and robustness. The computational formula for WIoUv3 is as follows:

$$L_{WIoUv3} = r \, L_{WIoUv1}, \qquad r = \frac{\beta}{\delta \alpha^{\beta - \delta}}, \qquad \beta = \frac{L_{IoU}^{*}}{\overline{L_{IoU}}} \in [0, +\infty)$$

where $\overline{L_{IoU}}$ is the running (sliding) average of $L_{IoU}$ maintained with momentum $m$, $\beta$ is the outlier degree of an anchor box, and $\alpha$ and $\delta$ are adjustable hyperparameters (this $\alpha$ is unrelated to the CIoU weight); when $\beta = \delta$, $r = 1$. A small outlier degree $\beta$ usually means that the anchor box matches the real target well, so its quality is relatively good. Assigning a smaller gradient gain to such high-quality anchor boxes during training lets the optimization process pay more attention to anchor boxes of ordinary quality, making the bounding box regression more accurate. Anchor boxes with a large outlier degree usually match the ground truth boxes poorly, which indicates low quality. These are also assigned a small gradient gain, which prevents low-quality examples from producing excessive harmful gradients during backpropagation and degrading model convergence and optimization efficiency. In principle, therefore, replacing CIoU with Wise-IoUv3 in YOLOv8 reduces harmful gradients while still attending to these samples, which benefits model learning and improves localization performance.
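The non-monotonic behavior of the focusing coefficient can be checked numerically with the sketch below. The values `alpha=1.9`, `delta=3.0`, and `momentum=0.01` are assumed defaults chosen for illustration, not settings reported in this paper, and the function names are hypothetical. Plain Python has no gradients, so the detaching of $W_g$, $H_g$, and $L_{IoU}$ is not modeled here.

```python
import math

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2) corner coordinates."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union

def wiou_v1(pred, gt):
    """Return (L_WIoUv1, L_IoU): the IoU loss scaled by distance attention."""
    l_iou = 1.0 - iou(pred, gt)
    x, y = (pred[0] + pred[2]) / 2, (pred[1] + pred[3]) / 2
    x_gt, y_gt = (gt[0] + gt[2]) / 2, (gt[1] + gt[3]) / 2
    # width and height of the minimum enclosing region
    w_g = max(pred[2], gt[2]) - min(pred[0], gt[0])
    h_g = max(pred[3], gt[3]) - min(pred[1], gt[1])
    r_wiou = math.exp(((x - x_gt) ** 2 + (y - y_gt) ** 2) / (w_g ** 2 + h_g ** 2))
    return r_wiou * l_iou, l_iou

def focusing_coefficient(beta, alpha=1.9, delta=3.0):
    """Gradient gain r = beta / (delta * alpha ** (beta - delta))."""
    return beta / (delta * alpha ** (beta - delta))

def wiou_v3(pred, gt, mean_l_iou, momentum=0.01):
    """Return (L_WIoUv3, updated running mean of L_IoU)."""
    l_v1, l_iou = wiou_v1(pred, gt)
    beta = l_iou / mean_l_iou  # outlier degree of this anchor box
    loss = focusing_coefficient(beta) * l_v1
    new_mean = (1 - momentum) * mean_l_iou + momentum * l_iou
    return loss, new_mean
```

With these assumed hyperparameters, the gain is small for both very small and very large outlier degrees and peaks in between, which matches the description above: high-quality and low-quality anchor boxes are both down-weighted in favor of ordinary-quality ones.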

Experimental Environment and Dataset
The experiments in this study were run on the Windows 10 operating system with 32 GB of memory. The computer's CPU was an i5-12400F with a clock frequency of 2.50 GHz, and the GPU was an NVIDIA GeForce RTX 3090. Python 3.8.16 was used, along with the PyTorch 2.0.0 deep learning framework and CUDA 11.7 for accelerated computation.
The dataset used in the experiment is NEU-DET, a steel surface defect detection dataset created by the team led by Professor Ke-Chen Song from Northeastern University. This dataset consists of six different types of defects on steel plates: Crazing (fine cracks or fractures on the surface of the steel plate), Inclusion (impurities or foreign substances present on the surface of the steel plate), Patches (large patches or uneven areas on the surface of the steel plate), Pitted_surface (small pits or corrosion spots on the surface of the steel plate), Rolled-in_scale (oxidation or rolled-in scale on the surface of the steel plate), and Scratches (scratches or scrapes on the surface of the steel plate). The dataset contains a total of 1800 images of steel plate surface defects, with 300 samples for each type of defect. Each image is a 200 × 200 pixel grayscale image. The defect areas in each image are annotated to facilitate the training and evaluation of defect detection. The NEU-DET dataset is divided into a training set, a test set, and a validation set at a ratio of 8:1:1. The training set contains 1440 images, the test set contains 180 images, and the validation set contains 180 images. This partition preserves the diversity and representativeness of the dataset while providing a sufficient sample size for training, testing, and validating the algorithm.
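The 8:1:1 partition can be reproduced with a short sketch such as the following. The file names are hypothetical placeholders, and a seeded random shuffle is an assumption; the paper does not state whether the split was random or stratified by defect class.

```python
import random

def split_dataset(items, ratios=(8, 1, 1), seed=0):
    """Shuffle items and split them into (train, test, val) by integer ratios."""
    items = list(items)
    random.Random(seed).shuffle(items)  # seeded for reproducibility
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_test = len(items) * ratios[1] // total
    train = items[:n_train]
    test = items[n_train:n_train + n_test]
    val = items[n_train + n_test:]  # remainder goes to validation
    return train, test, val

# Hypothetical file names standing in for the 1800 NEU-DET images.
names = [f"img_{i:04d}.jpg" for i in range(1800)]
train, test, val = split_dataset(names)
print(len(train), len(test), len(val))
```

A stratified variant would apply the same function to each of the six defect classes separately (300 images each), which keeps the class balance identical across the three subsets.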

Evaluation Metrics
In this paper, precision (P), recall (R), mAP@0.5, GFLOPs, Params, and FPS are used as evaluation metrics. Precision P is the proportion of samples predicted as positive that are actually positive. Recall R is the proportion of all actual positive samples that the model successfully predicts as positive. mAP stands for mean average precision, and mAP@0.5 means that the IoU threshold is set to 0.5 when computing mAP; that is, a detection box is counted as correct only when its IoU with the ground truth box is greater than 0.5 (a common threshold setting for object detection tasks). GFLOPs refers to the number of floating-point operations, in billions, that the model requires for one forward pass, and reflects its computational cost. Params refers to the number of parameters in the model and reflects its complexity and scale. FPS stands for frames per second and represents the number of images on which the model can run inference per unit of time. Precision, recall, and mAP are calculated as follows:

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}$$

$$AP = \int_{0}^{1} P(R)\, dR, \qquad mAP = \frac{1}{N}\sum_{i=1}^{N} AP_{i}$$

In the above formulas, TP is the number of true positives, FP is the number of false positives, FN is the number of false negatives, and N is the number of defect categories.
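As a worked illustration of these metrics, the sketch below computes precision, recall, and AP from a list of scored detections. It assumes each detection has already been marked as a true or false positive at the 0.5 IoU threshold, and it integrates the P-R curve with all-point interpolation; detection toolkits may use a different interpolation scheme, so the numbers are indicative only.

```python
def precision_recall(tp, fp, fn):
    """P = TP / (TP + FP), R = TP / (TP + FN); 0.0 when a denominator is empty."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r

def average_precision(scores, is_tp, n_gt):
    """Area under the P-R curve for one class (all-point interpolation)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for i in order:  # sweep the confidence threshold from high to low
        if is_tp[i]:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / n_gt)
        precisions.append(tp / (tp + fp))
    # make precision non-increasing as recall grows (envelope of the curve)
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_r) * p
        prev_r = r
    return ap

def mean_average_precision(ap_per_class):
    """mAP: the mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)
```

For example, three detections with scores 0.9, 0.8, and 0.7, of which the first and third are true positives for two ground truth objects, give an AP of 5/6.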

Ablation Experiments
To evaluate and validate the effectiveness of the proposed improvements, five ablation experiments were conducted. Under consistent environments and training parameters, the experimental and control groups were trained and tested, and the corresponding results were recorded. The results of the ablation experiments are presented in Table 1. The first group is the original, unmodified YOLOv8 algorithm, which achieves a mAP of 0.779 on the steel surface defect detection task. In the second group, the C2F_DCNv2 module replaces the original C2F structure. This change makes the model more flexible in handling complex shapes and irregular target features, and the mAP rises to 0.812. On this basis, the third group replaces the PAFPN structure with the more efficient BiFPN for feature fusion. This change helps the model integrate multi-scale feature information, which is especially advantageous when identifying small or complex surface defects, and the mAP increases further to 0.828. The fourth group adds the BiFormer attention mechanism to strengthen the model's understanding of how the global image structure relates to local details. Although this increases Params and GFLOPs, detection performance improves and the mAP reaches 0.837. Finally, the fifth group applies the WIoUv3 loss function to the optimized network structure to guide bounding box regression more accurately, and the mAP increases to 0.848. Each improvement reduces the FPS somewhat, but even the fifth group, which has the lowest FPS, still runs at 142.8 frames per second, which is within an acceptable range and meets the real-time needs of industrial inspection. The ablation experiments thus show the positive contribution of each improvement to the overall detection performance and demonstrate its effectiveness.

The PR curves in Figures 9 and 10 illustrate the experimental results of YOLOv8 and the improved YOLOv8 under the same conditions. The figures display the mAP@0.5 values for each category as well as the overall mAP@0.5. The improved algorithm increases the mAP from 77.9% to 84.8%, an improvement of 6.9 percentage points. It is worth noting that, with the original YOLOv8, the mAP value for the "rolled-in_scale" category was only 0.465, whereas after the improvement it reached 0.716, a substantial gain over the baseline model for this category.
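The per-step gains implied by the mAP values quoted above can be tabulated with a few lines of arithmetic. The sketch uses only the mAP and FPS figures stated in the text; the stage labels are shorthand, and the latency is simply the reciprocal of the reported FPS.

```python
# mAP@0.5 after each cumulative modification, as quoted in the text.
map_by_stage = [
    ("YOLOv8 baseline", 0.779),
    ("+ C2F_DCNv2", 0.812),
    ("+ BiFPN", 0.828),
    ("+ BiFormer", 0.837),
    ("+ WIoUv3", 0.848),
]

# Gain of each stage over the previous one, in percentage points.
gains = [
    (name, round((cur - prev) * 100, 1))
    for (_, prev), (name, cur) in zip(map_by_stage, map_by_stage[1:])
]
total_gain = round((map_by_stage[-1][1] - map_by_stage[0][1]) * 100, 1)

# Per-image latency implied by the reported 142.8 FPS of the final model.
latency_ms = 1000 / 142.8

for name, gain in gains:
    print(f"{name}: +{gain} points")
print(f"total: +{total_gain} points, about {latency_ms:.1f} ms per image")
```

The individual gains are 3.3, 1.6, 0.9, and 1.1 percentage points, which sum to the overall 6.9-point improvement, and 142.8 FPS corresponds to roughly 7 ms per image.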

Comparative Experiments
Figure 11 shows the prediction results of the YOLOv8 model before and after the improvement. Comparing the two sets of results in Figure 11, it is evident that the improved algorithm localizes objects more accurately and yields somewhat higher confidence scores.
To validate the effectiveness of the proposed algorithm, it is compared with mainstream object detection models, including SSD, Fast R-CNN, DETR, YOLOv5s, YOLOv7, and YOLOv8n, on the NEU-DET dataset. The experimental results are presented in Table 2.
Table 2 reports the mAP@0.5 and FPS of the different models across the defect types. From Table 2, it can be observed that the detection accuracies of SSD and Fast R-CNN on steel surface defects differ only slightly. Although Fast R-CNN may perform slightly better, as a two-stage detection algorithm it has a relatively large number of parameters and a noticeably slower detection speed. DETR introduced the transformer architecture to object detection. When dealing with fine-grained steel surface defects, its global self-attention mechanism may cause high memory consumption and slow convergence, so the detection performance of DETR is poor. The detection performance of YOLOv5s and YOLOv7 is also unsatisfactory when defects are small and densely distributed. YOLOv8n is the fastest of the compared algorithms and has relatively good detection performance. The improved YOLOv8 algorithm improves detection performance overall while only slightly decreasing detection speed, and its number of parameters remains at a satisfactory level. Furthermore, it shows significant improvement on certain challenging defect types, such as crazing and rolled-in scale. Based on these findings, the proposed algorithm outperforms the other algorithms on the task of steel surface defect detection.

Conclusions and Future Outlook
To address the problem of steel surface defect detection, this paper proposes an improved YOLOv8 algorithm. The algorithm replaces the convolutions in the Bottleneck with DCNv2 to enlarge the receptive field and enhance feature extraction. PAFPN is replaced with BiFPN for improved feature fusion, which raises detection accuracy across the defect classes. The BiFormer attention mechanism introduced in the backbone network strengthens feature representation and improves detection results. Moreover, CIoU is replaced with WIoUv3, whose dynamic non-monotonic focusing mechanism concentrates on anchor boxes of ordinary quality, thereby enhancing overall detection performance. The mAP of the improved model reaches 84.8% on the NEU-DET dataset. Compared with the WFE-Yolov8S algorithm proposed by Huang et al., the mAP is about five percentage points higher. The effectiveness of the improved model was verified by ablation and comparison experiments.
In this study, we focused on the specific challenge of steel surface defect detection, designing and implementing a series of optimizations for this task to improve the model's accuracy in identifying small targets. Although our core work concentrates on steel surface defects, the proposed improvement strategies have broad prospects for cross-field applications. These strategies perform well on small and complex defects on steel surfaces, and the core ideas and technical means behind them are also applicable to other scenarios involving small target detection (for example, electronic component defect detection and subtle lesion detection in medical image diagnosis).
Although the work in this paper has made some progress in improving mAP, the study is conducted in a supervised manner, and obtaining adequate labeled steel defect data in practical applications would incur high labor costs. Future research can focus on applying semi-supervised learning to steel defect detection. Because steel surface defects vary widely in type and shape, mining and learning effective feature representations in the absence of labels is a challenge. Despite this challenge, semi-supervised learning also offers clear advantages. On the one hand, it reduces the dependence on large-scale annotated data, thus saving the high cost of manual annotation; on the other hand, by combining labeled and unlabeled data for training, the model may generalize better to unseen or less common defect types, enhancing its adaptability in real scenarios. This has research significance for industrial settings with limited label resources and ever-changing defect categories.


In this paper, the idea of DCNv2 is integrated into the C2F module of YOLOv8, resulting in the proposed C2F_DCNv2 layer. The backbone network is primarily responsible for extracting low-level features and global information from the original image, and these feature details are crucial for comprehending the overall context of the image. Therefore, the last three C2F modules in the backbone network are replaced with C2F_DCNv2 modules. The introduction of DCNv2 allows for an increased receptive field and the sampling of effective locations. The Conv operation in the Bottleneck of the C2F module is replaced with DCNv2. After the introduction of DCNv2, the model can more accurately capture the detailed information of the boundary and complex shape of the target object, especially for the detection of objects with changeable morphology. The structures of DCNv2 and the modified Bottleneck are illustrated in Figure 3 and Figure 4, respectively.

Figure 3. Network structure of the DCNv2.

Figure 4. The Bottleneck network structure of this paper.

$W^{q}$, $W^{k}$, and $W^{v}$ represent the mapping weights for query, key, and value, respectively.

Figure 8 shows the confusion matrix for the improved model. The horizontal axis represents the true value, and the vertical axis represents the predicted value. It can be seen that most of the predicted values correspond to the true values, so the model has good prediction performance.

Figure 8. Confusion matrix graph of the improved YOLOv8 algorithm.

Figure 10. P-R curve of the improved YOLOv8 algorithm.

Figure 11. Comparison of detection results. (a) Detection effect of the YOLOv8 model; (b) detection effect of the improved YOLOv8 model.

Table 2. Comparison of detection performance of different algorithms.