DBCW-YOLO: A Modified YOLOv5 for the Detection of Steel Surface Defects

In steel production, defect detection is crucial for preventing safety risks, yet improving the accuracy of steel defect detection in industrial environments remains challenging because of the variety of defect types, cluttered backgrounds, low contrast, and noise interference. Therefore, this paper introduces a steel surface defect detection model, DBCW-YOLO, based on YOLOv5. Firstly, a new feature fusion strategy is proposed that uses the BiFPN method to fuse information at multiple scales, and CARAFE up-sampling is introduced to expand the receptive field of the network and make more effective use of the surrounding information. Secondly, WIoU, which uses a dynamic non-monotonic focusing mechanism, is introduced in the loss function to address the accuracy degradation caused by sample inhomogeneity. This approach improves the learning ability for small-target steel defects and accelerates network convergence. Finally, dynamic heads are used in the network prediction phase, which improves the scale-aware, spatial-aware, and task-aware performance of the algorithm. Experimental results on the NEU-DET dataset show that the mean average precision (mAP) is 81.1%, about 6% higher than the original YOLOv5 model, while still satisfying real-time detection. Therefore, DBCW-YOLO has good overall performance in the steel surface defect detection task.


Introduction
Steel is an important raw material that plays an important role in industrial manufacturing. Therefore, ensuring steel quality is a crucial and demanding task. During the steel manufacturing process, the production environment and processing equipment limitations can result in various surface defects on the product, such as cracks, scratches, plaques, punches, indentations, and other imperfections. These defects can affect both the aesthetics and quality of steel [1]. The detection of defects on the surface of steel is, therefore, an essential part of industrial production.
The earliest method of defect detection was manual visual inspection. However, manual visual inspection suffers from high subjectivity and varies with the inspector's experience, which limits the reliability of the results. In addition, manual inspection is inefficient and costly, which limits its further development. As machine learning has advanced, defect detection methods based on machine learning have gradually replaced manual detection methods. Xu et al. [2] proposed an adaptive method for detecting steel surface defects that exploits the Haar wavelet transform. Ai et al. [3] used features statistically derived from the magnitude spectrum obtained by the Fourier transform for crack detection on the steel plate surface. In another approach, Medina et al. [4] used Gabor filters for spatial and frequency domain defect detection in steel coils. These methods have made great strides compared to manual inspection. However, machine learning methods require different analytical processing for specific images, resulting in poor robustness and suboptimal detection accuracy.
Over the past few years, deep learning has achieved considerable advances in flaw detection thanks to its powerful learning capabilities [5,6]. Deep learning-based object detection methods are mainly classified into one- and two-stage methods. A two-stage model, such as the R-CNN series [7][8][9], first generates candidate regions and then refines their positions and classifies them. Two-stage detection methods perform well in terms of detection error rate and missed detection rate. However, their detection speed is relatively slow, and they cannot meet the requirements of real-time detection. Therefore, one-stage detection methods have emerged. The widely used one-stage object detection models at this stage include SSD [10], the YOLO series [11][12][13][14], and RetinaNet [15]. One-stage target detection algorithms have gained popularity in target detection applications that require efficient and real-time performance due to their fast detection speed, end-to-end training, fewer hyperparameters, applicability to multiple tasks, and excellent small target detection capability. The YOLO series algorithms are among the most representative one-stage object detection algorithms, but the accuracy of one-stage target detection algorithms is still insufficient.
While target detection algorithms have shown high accuracy in detecting defects with small-scale variations, they still perform poorly in detecting targets with large-scale variations. Most current target detection algorithms rely heavily on predictions from feature maps that provide limited information about multi-scale targets. Therefore, detailed image information needs to be utilized wisely. For example, deep networks can capture more comprehensive semantic information. However, they may be less suitable for detecting defects in ground resolution. Therefore, the performance of feature extraction networks in capturing multi-scale features is of particular interest. Moreover, to address the loss of target features, improving the detection head is essential.
YOLO still fails to detect complex defects with sufficient accuracy and needs to be optimized further. Thus, this research aims to design a steel defect detection model that ensures high detection accuracy at a reasonable detection speed.
Based on these characteristics, to enhance the accuracy of defect detection, this paper introduces a new one-stage inspection model, DBCW-YOLO, on the basis of YOLOv5. The algorithm uses YOLOv5 as the baseline model. For the up-sampling part, the up-sampling module CARAFE [16] is adopted to enlarge the receptive field and obtain richer semantic information. For the YOLOv5 head, a dynamic detection head (DyHead) [17] is added to enhance the detection ability of the original model. For the YOLOv5 loss function, WIoU [18] is used to improve the training stability and training efficiency of the baseline model.
The main contributions are listed as follows:
1. The feature fusion capability is enhanced by using cross-scale connectivity and by embedding the lightweight up-sampling module CARAFE into the YOLOv5 network. This copes with steel defects that vary widely in scale and enlarges the receptive field while keeping the network lightweight.
2. The CIoU bounding box loss function of the original model is replaced with WIoU, which uses a dynamic non-monotonic focusing mechanism. This enhances the competitiveness of medium-quality anchor boxes and simultaneously reduces the harmful gradients generated by low-quality examples.
3. The attention-based dynamic detection head (DyHead) is embedded into the YOLOv5 detection stage to enhance the detection ability of the model.
Our model is designed around the characteristics of steel defects. First, BiFPN and the up-sampling module CARAFE are used to enhance the algorithm's focus on multi-scale information for steel surface defects with large-scale variations. Second, to address the inadequacy of the aspect-ratio term of CIoU in the original loss function, we introduce WIoU to strengthen the bounding box loss function. Third, to address the weak detection ability of the model, we embed a dynamic detection head (DyHead). In addition, ablation experiments are designed to validate the effectiveness of the model and its individual modules. The experimental results indicate that DBCW-YOLO maintains high detection accuracy while retaining real-time detection capability: it achieves an mAP of 81.1% at 33.8 FPS (frames per second), an mAP improvement of approximately 6% over the YOLOv5 model. It can provide a solution to the problem of low defect detection accuracy on steel surfaces caused by large changes in defect size and strong background interference in industrial scenarios.

Traditional Method
There are two main steps in machine learning-based methods. Firstly, feature extraction rules are designed according to the different types of defects, and then the extracted features are input into a classifier to realize defect classification. Luo et al. [19] proposed a framework for extracting features of steel surface defects hidden in non-uniform patterns. By introducing the generalized complete local binary pattern (GCLBP), an improvement of the complete noise invariant local structure pattern (ICNLP) was made, and defect classification was achieved using the nearest neighbor classifier. Liu et al. [20] improved the contour transform and kernel spectral regression for metal surface defect detection by enhancing feature extraction in a multiscale subspace. Wang et al. [21] used a guide template to detect strip surface defects. They sorted the image by grayscale and subtracted the sorted test image from the guide template to segment strip surface defects. The guide template suggests that the accuracy of defect detection can be improved by increasing the focus on the defective region. Cardellicchio et al. [22] adopted high-throughput data acquisition based on laser profilometry and proposed a lightweight machine learning algorithm for defect detection, which is capable of high-precision real-time monitoring. Since traditional machine learning methods rely on manually designed feature extraction rules, they generalize poorly and are easily affected by interference and noise, which reduces detection accuracy.
In fact, because traditional methods are closely related, it may be possible to use several of them at the same time to jointly detect defects. In general, conventional methods have strong limitations and require the feature extraction rules to be reanalyzed and redesigned for different types of defects. They work best under restricted conditions, for example, when defect dimensions do not vary much or when defect contours are sharp, with low noise and high contrast, under specified lighting environments. Machine learning possesses some robustness. However, manually designed features have the disadvantage of weak characterization and poor adaptability. It is difficult to meet industrial needs using machine learning methods.

Deep Learning Method
Owing to their accuracy and speed, deep learning target recognition algorithms are widely used in industry. Deep learning-based target detection can be categorized into two-stage detection and one-stage detection. Two-stage detection methods generally have high localization and target recognition accuracies but slower detection speeds. Wang et al. [23] proposed a combination of ResNet50 and an improved Faster R-CNN algorithm for detecting steel surface defects. They proposed three improvements to Faster R-CNN, including enhanced feature pyramid networks (FPNs), spatial pyramid pooling (SPP), and matrix NMS algorithms, to obtain better performance. Li et al. [24] proposed a method for pre-processing tunnel surface images to improve their quality and avoid repeated detection. They also offered a multilayer feature fusion network combined with Faster R-CNN to detect defects on the tunnel surface, achieving high detection precision. However, the detection speed of two-stage algorithms is significantly lower than that of one-stage algorithms.
Conversely, one-stage methods have faster detection speeds but may have lower accuracy. Yu et al. [25] introduced a lightweight and powerful PCB defect detection network (led-net) and built a new backbone and neck network that can efficiently fuse multi-scale features. A loss function with adaptive localization is used to calculate the localization loss and increase detection accuracy. Cheng [26] proposed a new channel attention module based on the optimized RetinaNet model to enable the model to acquire more essential channel characteristics. The Adaptive Spatial Feature Fusion (ASFF) [27] module is embedded into the model, enabling the model to improve its use of spatial features. Cardellicchio et al. [28] created a bridge defect dataset and used YOLOv5 to detect bridge defects, contributing to the monitoring of bridge condition and safety. Li et al. [29] utilized a convolutional encoder-decoder module with residual blocks in YOLOv4 to enhance the model's feature extraction ability and improve learning representation. Additionally, they designed a feature alignment module using the attention mechanism. Finally, they employed three decoupled heads for separate outputs. Lu et al. [30] used a simplified BiFPN combined with YOLO to detect citrus defects with 98.7% accuracy. Guo et al. [31] merged the TRANS module in Transformer with the YOLOv5 backbone. The resulting features, combined with global information, improve the model's capability to dynamically adjust to objects at different scales, so that YOLOv5 achieves a better detection effect.

Basic Model
Considering the computational resources and algorithm detection effect in industrial scenarios, after comparing the YOLO series of algorithms, we chose the lighter YOLOv5m 6.0 as the baseline model to be improved. Its main network structure is illustrated in Figure 1. The network contains four main parts: input, backbone, neck, and head. In the input part, several key data enhancement and processing technologies are adopted. Among them, mosaic data enhancement increases the variety and complexity of the training samples by stitching together four randomly selected images to create a single large image. Adaptive anchor box calculation dynamically adjusts the anchor box size and position according to the target size and position. Adaptive image sizing is used to dynamically adjust the input to satisfy the requirements of the backbone. After pre-processing and image enhancement are completed, the images are input to the backbone module, which extracts features from them. The neck module then fuses the acquired features and generates feature information at three scales: large, medium, and small. Finally, the extracted and fused feature information is input into the head module, and the final result is output after detection.
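The mosaic step described above can be illustrated with a deliberately simplified sketch. It assumes four equally sized images stored as nested lists and stitches them into a fixed 2 × 2 grid; the function name `mosaic_2x2` is hypothetical, and the random mosaic centre, rescaling, and label remapping used in practical YOLOv5 pipelines are omitted.

```python
import random


def mosaic_2x2(images, rng=None):
    """Stitch four equally sized H x W images (nested lists) into one 2H x 2W image.

    Simplified illustration only: tiles are placed on a fixed 2 x 2 grid in random order.
    """
    if len(images) != 4:
        raise ValueError("mosaic needs exactly four images")
    rng = rng or random.Random()
    # Randomly decide which image goes to which quadrant.
    top_left, top_right, bottom_left, bottom_right = rng.sample(images, 4)
    # Concatenate rows horizontally, then stack the two halves vertically.
    top = [left + right for left, right in zip(top_left, top_right)]
    bottom = [left + right for left, right in zip(bottom_left, bottom_right)]
    return top + bottom


if __name__ == "__main__":
    # Four 2 x 2 "images", each filled with its own index.
    tiles = [[[i] * 2 for _ in range(2)] for i in range(4)]
    for row in mosaic_2x2(tiles, random.Random(0)):
        print(row)
```

Each quadrant of the result comes from a different source image, which is what increases the variety of object context seen in a single training sample.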

DBCW-YOLO
The large-scale variation of steel surface defects and the strong background disturbance lead to low discriminability of semantic information and poor detection of small targets. To enhance semantic discriminability, it is essential to obtain the scene information of neighboring regions for information correlation and to acquire a thorough understanding of the correlation among different categories of imperfections. The DBCW-YOLO algorithm is an enhancement of the YOLOv5 algorithm, and its network structure is illustrated in Figure 2. Firstly, to achieve higher detection accuracy, the feature extraction of the model is enhanced by the BiFPN strategy of cross-scale connectivity, and the up-sampling algorithm of the YOLOv5 neck is optimized using the CARAFE module, which enhances the expression of features and improves the model's ability to capture contextual information. Secondly, to address the large variation in sample quality and the resulting poor detection, the WIoU loss function is introduced to reduce the impact of sample quality and improve the efficiency of the model. Finally, to improve the representation of the model head, the DyHead module is introduced in the head to improve steel defect detection performance.

Improved Feature Fusion
Feature fusion plays an essential role in target detection tasks. In YOLOv5, the PANet architecture is utilized for feature fusion. PANet concatenates feature maps only after they have been transformed to the same size, so features of different sizes are not fully exploited, which limits detection accuracy. Moreover, the original up-sampling algorithm of the YOLOv5 neck is nearest neighbor interpolation. Relying only on the nearest neighbor pixel values does not capture the subtle information of the image; the receptive field is relatively small, and edges in the image show an obvious jagged effect. To improve the detection ability of the model, this paper proposes an enhanced feature fusion network, which introduces the idea of BiFPN [32] multi-scale feature fusion in the YOLOv5 neck. Moreover, this paper introduces a lightweight up-sampling module called CARAFE to improve the up-sampling algorithm for feature fusion in YOLOv5 without incurring additional computational costs.
BiFPN uses bidirectional cross-scale connectivity and weighted feature map fusion to optimize the model. Bidirectional fusion constructs top-down and bottom-up channels to fuse information from different scales of the backbone network. Features are up-sampled or down-sampled to a common resolution before fusion, and lateral connections are added between the input and output nodes of the same level to fuse as many features as possible without increasing the cost. In this study, the BiFPN strategy is used to establish forward and backward cross-layer feature transfer paths at different layers through bidirectional connectivity, which enhances semantic representation and differentiation. The use of shallow features and the fusion of multi-scale information are improved to enhance the model's ability to recognize targets at different scales.
The structure of PANet is shown in Figure 3a.BiFPN's structure is shown in Figure 3b.
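As a minimal sketch of the weighted fusion mentioned above, the following assumes the fast normalized fusion rule from the BiFPN paper [32], in which each input receives a non-negative learnable weight and the weights are normalized by their sum. The function name, the flattened-list representation of feature maps, and the value of `eps` are illustrative assumptions, not details taken from this paper.

```python
def fast_normalized_fusion(features, weights, eps=1e-4):
    """Fuse same-shaped feature maps (flattened to lists) with normalized weights.

    Computes O = sum_i (w_i / (eps + sum_j w_j)) * I_i with w_i clipped at zero,
    so every input contributes a non-negative share that sums to roughly one.
    """
    clipped = [max(w, 0.0) for w in weights]      # ReLU keeps the weights non-negative
    total = eps + sum(clipped)                    # eps avoids division by zero
    shares = [w / total for w in clipped]
    length = len(features[0])
    return [sum(s * f[i] for s, f in zip(shares, features)) for i in range(length)]


if __name__ == "__main__":
    p4_top_down = [1.0, 1.0, 1.0]   # hypothetical feature values from one path
    p4_lateral = [3.0, 3.0, 3.0]    # hypothetical feature values from another path
    print(fast_normalized_fusion([p4_top_down, p4_lateral], [1.0, 1.0]))
```

With equal weights the output is close to the element-wise mean of the inputs; during training the weights would be learned so that more informative scales dominate.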
To address the information lost during up-sampling, this paper adopts a lightweight up-sampling module, CARAFE, to enhance the up-sampling algorithm of YOLOv5. CARAFE captures the semantic information in steel defect images more fully and enhances the feature mapping capability without requiring more computational cost.
CARAFE is an up-sampling operator that utilizes feature adaptation and feature reorganization. It is mainly composed of two parts: a kernel prediction module and a content-aware reorganization module. It maps an input feature map of shape H × W × C to an output feature map of shape δH × δW × C (δ denotes the up-sampling ratio) through up-sampling kernel prediction and feature reorganization. Moreover, the newly generated feature map includes more semantic information. The network comparison of the original network and the improved network is illustrated in Figure 4.
The function of the kernel prediction module is to generate the reorganization kernels. The input feature map is first compressed by a 1 × 1 convolution operation to reduce the computational effort. Next, an encoder predicts the up-sampling kernels from the compressed feature map, and the channel dimension is expanded into the spatial dimension to obtain an up-sampling kernel of shape δH × δW × k_up × k_up. Finally, each up-sampling kernel is normalized so that its convolution weights sum to 1.

The content-aware reorganization module maps each location in the output feature map back to the input feature map. Then, the k_up × k_up region centered at that location is taken out, and its dot product with the up-sampling kernel predicted for that point gives the output value. The same up-sampling kernel is used for all channels at the same position.
All calculation parameters is 2
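The reorganization step of CARAFE can be sketched as follows. This is an illustrative, pure-Python version that assumes the kernel prediction module has already produced one vector of k_up × k_up logits per output location; the function names, the nested-list layout, the use of softmax for normalization, and zero padding at the borders are assumptions made for the example.

```python
import math


def softmax(values):
    """Normalize logits so that the resulting kernel weights sum to 1."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def carafe_reassemble(feat, kernel_logits, delta, k_up):
    """Up-sample feat (H x W x C) to (delta*H) x (delta*W) x C.

    kernel_logits has shape (delta*H) x (delta*W) x (k_up*k_up): one predicted
    kernel per output location, shared by all channels at that location.
    """
    height, width, channels = len(feat), len(feat[0]), len(feat[0][0])
    radius = k_up // 2
    out = []
    for oy in range(delta * height):
        row = []
        for ox in range(delta * width):
            iy, ix = oy // delta, ox // delta          # source location in the input map
            kernel = softmax(kernel_logits[oy][ox])
            pixel = [0.0] * channels
            for n, weight in enumerate(kernel):
                y = iy + n // k_up - radius
                x = ix + n % k_up - radius
                if 0 <= y < height and 0 <= x < width:  # zero padding outside the map
                    for c in range(channels):
                        pixel[c] += weight * feat[y][x][c]
            row.append(pixel)
        out.append(row)
    return out


if __name__ == "__main__":
    feat = [[[5.0, 5.0] for _ in range(3)] for _ in range(3)]           # 3 x 3 x 2 input
    logits = [[[0.0] * 9 for _ in range(6)] for _ in range(6)]          # uniform 3 x 3 kernels
    up = carafe_reassemble(feat, logits, delta=2, k_up=3)
    print(len(up), len(up[0]), up[2][2])
```

Unlike nearest neighbor interpolation, each output value is a weighted sum over a k_up × k_up neighborhood, and the weights differ per location because they are predicted from the content of the feature map.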

DyHead
Because the scale of steel flaws varies greatly, the network head needs to be able to detect steel flaws at different scales. However, the YOLOv5 model contains only three detection heads, which may cause missed detections when dealing with small targets. At present, many researchers increase the number of detection layers of the original model to four so that the fused shallower feature maps carry more powerful semantic information and more accurate location information. This improves the sensitivity of the model to small targets, enables more comprehensive and accurate detection of steel defects, and provides more reliable support for industrial inspection.
In YOLOv5, the backbone network outputs a three-dimensional tensor with dimensions of level × space × channel. Because targets differ in scale, type, and spatial position, the head needs to better integrate features across scales together with the positional relationships they contain. This paper introduces the dynamic head block (DyHead) in the neck section. DyHead applies scale-aware, spatial-aware, and task-aware attention simultaneously. That is, an attention function is applied to each specific dimension of the feature tensor. Given the three-dimensional feature tensor F ∈ R^(L×S×C) on the detection layer, the attention function is calculated as in Equation (1), where W represents the attention function, L stands for the number of levels of the feature map, S stands for the product of the height and width of the feature map, and C stands for the number of channels of the feature map. π_L(·), π_S(·), and π_C(·) are three attention functions applied to dimensions L, S, and C, respectively. These three attention functions are applied to the detection head and can be stacked multiple times. In this paper, two groups of π_L(·), π_S(·), and π_C(·) modules are stacked successively to enhance the representation of the detection head and to improve the detection ability of the model for small flaws. Only two groups are added to limit the computational cost of the model. The structure of a single DyHead block is shown in Figure 6.
The computational procedure for each of the three attention modules is given in Equations (2)-(4). In Equation (2), f is a linear function approximated by convolution operations to achieve feature dimensionality reduction, and δ(x) is the activation function, which is a hard sigmoid. In Equation (3), K stands for the number of sparse sampling locations, p_j + ∆p_j is a shifted position determined by a self-learned spatial offset ∆p_j that focuses on discriminative positions, and ∆m_j is a self-learned importance scalar at position p_j; both are learned from the input features at the intermediate level of F. In Equation (4), F_C is the feature slice of channel C, and θ(·) is a hyper function that learns to control the activation thresholds. Its implementation is the same as that of dynamic ReLU, where α and β are learnable parameters through which different channels are activated differently to achieve the attention operation. These three attention mechanisms are applied sequentially in the model and can be stacked together several times to form the desired DyHead block.
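For reference, Equations (1)-(4) can be written as follows, assuming the formulation of the original DyHead paper [17] and keeping this section's symbols (δ for the hard sigmoid, j for the sampling index). The per-level weight w_{l,j} in Equation (3) and the split of θ(·) into the outputs α¹, α², β¹, β² in Equation (4) are taken from that paper rather than from the text above, so they should be checked against the published equations.

```latex
\begin{align}
W(F) &= \pi_C\bigl(\pi_S\bigl(\pi_L(F)\cdot F\bigr)\cdot F\bigr)\cdot F \tag{1}\\
\pi_L(F)\cdot F &= \delta\Bigl(f\Bigl(\frac{1}{SC}\sum_{S,C} F\Bigr)\Bigr)\cdot F \tag{2}\\
\pi_S(F)\cdot F &= \frac{1}{L}\sum_{l=1}^{L}\sum_{j=1}^{K}
  w_{l,j}\cdot F\bigl(l;\, p_j+\Delta p_j;\, c\bigr)\cdot \Delta m_j \tag{3}\\
\pi_C(F)\cdot F &= \max\bigl(\alpha^{1}(F)\cdot F_C+\beta^{1}(F),\;
  \alpha^{2}(F)\cdot F_C+\beta^{2}(F)\bigr) \tag{4}
\end{align}
% Assumed from [17]: \delta(x) = \max\bigl(0, \min\bigl(1, \tfrac{x+1}{2}\bigr)\bigr)
% and [\alpha^{1}, \alpha^{2}, \beta^{1}, \beta^{2}]^{T} = \theta(\cdot).
```

Read from the inside out, Equation (1) applies scale-aware attention first, then spatial-aware attention, then task-aware attention, which matches the sequential order described in the text.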

Wise-IoU
The loss function for bounding box regression (BBR) is a key part of target detection, and the quality of detection depends largely on how the loss function is designed. As an essential part of the detector, an accurately defined bounding box loss function can significantly enhance detection quality. Therefore, choosing a more appropriate loss function becomes a primary task of target detection. YOLOv5 uses the CIoU loss.
The CIoU loss function adds the calculation of aspect ratios but does not account for the balance of sample quality in the dataset itself. Annotating steel data perfectly is difficult, and low-quality samples may exist due to their specific features. The CIoU loss has no dynamic measure of the quality of such samples. To improve the detection accuracy, a dynamic measure of the quality of the anchor box is needed to overcome this shortcoming of the loss function. This article therefore optimizes the bounding box loss function using WIoU. The WIoU BBR loss function distinguishes the quality of anchor boxes using an outlier degree, which measures how abnormal an anchor box is. A smaller outlier degree is assigned to high-quality anchor boxes and a larger outlier degree is assigned to low-quality anchor boxes. As a result, the loss focuses on the larger number of medium-quality anchor boxes, which improves the overall capability of the detector. The parameter diagram of Wise-IoU is shown in Figure 7. If the anchor box achieves a high match with the target box, a competent loss function should mitigate the effects of geometric factors, and less intervention during model training means that the model is likely to achieve a higher generalization capacity. On this basis, a distance-attention mechanism was constructed, and WIoUv1 with a two-layer attention mechanism was obtained.
R_WIoU ∈ [1, e) will enhance the L_IoU of the medium-quality candidate box.
L_IoU ∈ [0, 1] significantly decreases the R_WIoU of the high-quality candidate box, and it focuses on the distance between the centroids of the prediction box and the candidate box when the intersection over union (IoU) is large.
Here, W_g and H_g are the width and height of the smallest enclosing box (Figure 7). To prevent R_WIoU from creating gradients that impede convergence, W_g and H_g are detached from the computational graph (the superscript * denotes this operation). Because this removes the factor that impedes convergence, there is no need to introduce new metrics.
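A minimal numerical sketch of WIoUv1 is given below. It assumes the definition from the WIoU paper [18], L_WIoUv1 = R_WIoU · L_IoU with L_IoU = 1 − IoU and R_WIoU = exp(((x − x_gt)² + (y − y_gt)²) / (W_g² + H_g²)*), where (x, y) and (x_gt, y_gt) are the box centers. The function name and the corner-format boxes are illustrative, and because plain Python has no computational graph, the detachment of W_g and H_g is only noted in a comment.

```python
import math


def wiou_v1(pred, target):
    """Return (loss, r_wiou, l_iou) for two boxes given as (x1, y1, x2, y2).

    Assumes both boxes have positive area so that the union is non-zero.
    """
    px1, py1, px2, py2 = pred
    tx1, ty1, tx2, ty2 = target

    # IoU term: L_IoU = 1 - IoU, which lies in [0, 1].
    inter_w = max(0.0, min(px2, tx2) - max(px1, tx1))
    inter_h = max(0.0, min(py2, ty2) - max(py1, ty1))
    inter = inter_w * inter_h
    union = (px2 - px1) * (py2 - py1) + (tx2 - tx1) * (ty2 - ty1) - inter
    l_iou = 1.0 - inter / union

    # Distance attention: squared center distance over the squared size of the
    # smallest enclosing box. In a training framework W_g and H_g would be
    # detached from the graph here (the * in the text).
    dx = (px1 + px2) / 2 - (tx1 + tx2) / 2
    dy = (py1 + py2) / 2 - (ty1 + ty2) / 2
    w_g = max(px2, tx2) - min(px1, tx1)
    h_g = max(py2, ty2) - min(py1, ty1)
    r_wiou = math.exp((dx ** 2 + dy ** 2) / (w_g ** 2 + h_g ** 2))

    return r_wiou * l_iou, r_wiou, l_iou


if __name__ == "__main__":
    print(wiou_v1((0, 0, 2, 2), (1, 1, 3, 3)))
    print(wiou_v1((0, 0, 2, 2), (0, 0, 2, 2)))
```

For a perfectly matched box both factors reach their minimum (R_WIoU = 1, L_IoU = 0), so the loss vanishes; for partially overlapping boxes R_WIoU stays in [1, e) and scales up the IoU loss according to the center distance.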

Dataset
In this article, the real-world benchmark dataset, NEU-DET, is selected for the experiments. The dataset includes six categories of defects, with 300 images in each category. The six categories are Crazing (Cr), Inclusion (In), Patches (Pa), Pitted Surface (PS), Rolled-in Scale (RS), and Scratches (Sc). The detection image is displayed in Figure 8.

Index of Evaluation
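To comprehensively evaluate the improvements in the algorithm's performance and to compare it with other algorithms, several assessment indicators are used in this paper, including precision (P), recall (R), average precision (AP) for a single defect type, mean average precision (mAP) over all defect types, and frames per second (FPS) for detection speed. Because FPS depends on the hardware, all experimental validation in this paper was carried out using the same equipment. Calculations of P, R, AP, and mAP are displayed in Equations (7)-(10) as follows:
$$P = \frac{TP}{TP + FP} \tag{7}$$
$$R = \frac{TP}{TP + FN} \tag{8}$$
$$AP = \int_0^1 P(R)\,dR \tag{9}$$
$$mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i \tag{10}$$
where TP, FP, and FN denote the numbers of true positives, false positives, and false negatives, and N is the number of defect categories.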
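The indicators named above can be computed as in the following sketch. It is illustrative only: the function names are hypothetical, the AP is taken as the area under the precision-recall curve with all-point interpolation (one common convention, assumed here because the evaluation code is not given), and the recall values are assumed to be sorted in ascending order. F1, which is reported later alongside P, is included for completeness.

```python
def precision(tp, fp):
    """P = TP / (TP + FP); returns 0.0 when nothing was predicted."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """R = TP / (TP + FN); returns 0.0 when there are no ground truths."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


def average_precision(recalls, precisions):
    """Area under the P-R curve (all-point interpolation).

    `recalls` must be sorted in ascending order and paired with `precisions`.
    """
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Make the precision envelope non-increasing from right to left.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))


def mean_average_precision(ap_per_class):
    """mAP is the unweighted mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)


if __name__ == "__main__":
    p, r = precision(tp=8, fp=2), recall(tp=8, fn=8)
    print(p, r, f1_score(p, r))
    print(average_precision([0.5, 1.0], [1.0, 0.5]))
```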

Experimental Environment
The environment and relevant parameters of the experiment are displayed in Table 1.
To demonstrate the advantage of DBCW-YOLO, several mainstream algorithms are compared on the NEU-DET dataset. In industrial applications, the detection accuracy must be guaranteed first. Second, considering the production speed, the algorithm must also have a decent detection speed. Therefore, an accuracy metric (mAP) and a detection speed metric (FPS) are selected, and the experimental results are displayed in Table 2. From the data in Table 2, we can conclude that the algorithm in this paper has the highest accuracy in the table, reaching 81.1%, and that it achieves the best result for four kinds of defects. Among these algorithms, YOLOv7 has the fastest detection speed, but its accuracy for each class is not very high, and its overall performance is average. DBCW-YOLO also detects all types of defects more accurately than the newer YOLOv8. Although YOLOv3 and YOLOv5l achieve good detection results for some defects, their overall average accuracy still lags behind that of our proposed method. This is because DBCW-YOLO extracts features better and takes into account the large variation in steel defect scales. In summary, our proposed DBCW-YOLO achieves high detection accuracy and good FPS.
The results in Table 3 show that our method surpasses the original method in most of the P and F1 values across the test items and in AP for all of them, which verifies the validity of our method. A comparison of DBCW-YOLO and YOLOv5 for each defect type is given in Figure 9. The figure illustrates how the detection results for the different types of defects improve from the original model to the DBCW-YOLO model. Cr and RS are smaller targets that are more difficult to detect, and DBCW-YOLO greatly improves the AP values of these two defects. In Figure 9, the AP of Cr in the improved YOLOv5 has increased by more than 10% compared with other algorithms, and the AP of RS has also increased by 5.8% compared with YOLOv5, a marked gain over the other algorithms. This suggests that DBCW-YOLO acquires deeper features and improves results significantly for small targets. The AP values of the other four defects also compare well with those of other algorithms. The overall mAP was 81.1%. In Table 2, DBCW-YOLO outperforms the other methods for most of the defects detected, and the effect is substantially improved. From Figure 9, we can conclude that the overall defect detection capability of the method proposed in this paper is significantly improved and that it can meet the needs of real-time detection in industry.
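Because the real-time claim rests on FPS, it helps to be explicit about how such a figure is typically obtained. The sketch below is a generic timing harness and not the measurement procedure of this paper: `measure_fps`, its `warmup` parameter, and the stand-in `dummy_infer` function are all hypothetical, and a real measurement would call the detector on images from the test set on the fixed equipment.

```python
import time


def measure_fps(infer, inputs, warmup=5):
    """Return frames per second of `infer` over `inputs`.

    A few warm-up calls are made first so that one-time start-up costs
    do not distort the timing.
    """
    for x in inputs[:warmup]:
        infer(x)

    start = time.perf_counter()
    for x in inputs:
        infer(x)
    elapsed = time.perf_counter() - start
    return len(inputs) / elapsed


def dummy_infer(image):
    """Stand-in for a detector: sums the pixel values of a flat image."""
    return sum(image)


if __name__ == "__main__":
    frames = [list(range(10_000)) for _ in range(100)]
    print(f"{measure_fps(dummy_infer, frames):.1f} FPS")
```

Running every model through the same harness on the same machine is what makes the FPS values comparable.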

Ablation Experiment
According to Table 4, our improvements are effective. The mAP value of YOLOv5m is 75.3%, and the mAP value of DBCW-YOLO is 81.1%, with improved results for all six types of defects: Cr increased by 11.2%, In by 7.3%, Pa by 1.4%, PS by 2.3%, RS by 11.4%, and Sc by 1.1%. To examine the function of each module, ablation experiments were conducted on each module separately. The mAP value was significantly improved by each module, and the increase in the mAP value from superposing modules was still 4% and 4.3%. Therefore, our experiments proved the usefulness of every module. For the two types of defects that are most difficult for the benchmark model, Cr and RS, the AP increased by more than 10%. Compared to YOLOv5m, the overall mAP value of DBCW-YOLO increased by 5.8 percentage points, which verified the detection capability of DBCW-YOLO. Experiments indicate that, compared with the base model and other models, our method improves the accuracy of defect detection in steel structures, which further proves the superiority of the DBCW-YOLO algorithm.
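The figures quoted in this paragraph can be cross-checked with a few lines of arithmetic. Since mAP is the unweighted mean of the per-class AP values, the gain in mAP should equal the mean of the per-class gains. The sketch below uses only the numbers stated in the text; the variable names are illustrative.

```python
baseline_map = 75.3   # YOLOv5m, in percent
improved_map = 81.1   # DBCW-YOLO, in percent

# Per-class AP gains in percentage points, as stated in the text.
per_class_gain = {
    "Cr": 11.2, "In": 7.3, "Pa": 1.4, "PS": 2.3, "RS": 11.4, "Sc": 1.1,
}

overall_gain = improved_map - baseline_map
mean_class_gain = sum(per_class_gain.values()) / len(per_class_gain)
relative_gain = overall_gain / baseline_map * 100

print(f"absolute mAP gain: {overall_gain:.1f} points")
print(f"mean per-class gain: {mean_class_gain:.1f} points")
print(f"relative mAP gain: {relative_gain:.1f}%")
```

The absolute gain and the mean per-class gain both round to 5.8 points, so the per-class figures are consistent with the overall mAP; relative to the baseline, the gain is about 7.7%.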

Conclusions
In this paper, the DBCW-YOLO model is presented to address the difficulty of detecting small- and medium-sized defects in images of steel structures. In DBCW-YOLO, we introduce a lightweight up-sampling method, namely, CARAFE, to enhance the baseline model. To address the model's insufficient ability to learn from defect samples, a feature fusion method combining the BiFPN strategy and the lightweight up-sampling method, CARAFE, is presented. Furthermore, we introduce the WIoU to enhance the model's ability to learn weight information from feature maps. In the prediction phase, we employ a dynamic head (DyHead) to further improve the detection performance. Experimental results illustrate that our model outperforms the compared models, reaching an mAP of 81.1% on NEU-DET.
It is worth noting that the type of steel structure selected for this experiment is relatively homogeneous, so the applicability of DBCW-YOLO could still be improved. Therefore, future research will include extending the dataset to cover more types of metal defects to improve the overall capability and adaptability of the model.
H W C , and the feature map with shape H W C   ( denotes the up-sampling ratio) is output by up-sampling kernel prediction and feature reorganization.Moreover, the newly generated feature map includes more semantic information.The network comparison of the original network and the improved network is illustrated in Figure 4.

Figure 4. Original network (left) and improved network (right). In Figure 4, Module 1 stands for the kernel prediction module and Module 2 stands for the content-aware reassembly module, whose structures are illustrated in Figure 5. The parameters are explained as follows: $N$: batch size, $C$: input channel of the feature mapping, $H$: image height, $W$: image width, $C_m$: compression channel, $k_{en}^2$: encoder size, $\delta$: up-sample ratio, and $k_{up}^2$: recombination core size.
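To make the roles of these parameters concrete, the following sketch traces the tensor shapes through the two CARAFE modules. It only computes shapes and performs no convolution. The function name `carafe_shapes` is hypothetical, and the default values ($C_m = 64$, encoder kernel 3, $\delta = 2$, reassembly kernel 5) are assumed from the commonly cited CARAFE settings, not taken from this paper.

```python
def carafe_shapes(n, c, h, w, c_m=64, k_en=3, delta=2, k_up=5):
    """Trace (N, C, H, W) shapes through kernel prediction and reassembly.

    Default values are assumptions, not settings reported in this paper.
    """
    return {
        # Input feature map.
        "input": (n, c, h, w),
        # Module 1a: 1x1 convolution compresses the channels to C_m.
        "compressed": (n, c_m, h, w),
        # Module 1b: k_en x k_en convolution predicts delta^2 * k_up^2
        # kernel weights for every input location.
        "encoded": (n, delta ** 2 * k_up ** 2, h, w),
        # Module 1c: pixel shuffle and softmax give one normalized
        # k_up x k_up kernel for every output location.
        "kernels": (n, k_up ** 2, delta * h, delta * w),
        # Module 2: each output pixel is the kernel-weighted sum of a
        # k_up x k_up neighborhood of the input, so channels are kept.
        "output": (n, c, delta * h, delta * w),
    }


if __name__ == "__main__":
    for name, shape in carafe_shapes(n=1, c=256, h=20, w=20).items():
        print(f"{name:>10}: {shape}")
```

With these assumed defaults, a 20 × 20 feature map with 256 channels is up-sampled to 40 × 40 with the channel count unchanged.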

Figure 5. The structure of CARAFE.
These three attention sequences are applied to the detection head and can be stacked multiple times. In this paper, two groups are applied successively to enhance the representation ability of the detection head and to improve the detection ability of the model for small flaws. Only two groups are added in order to limit the computational cost of the model. The single DyHead structure is shown in Figure 6.

Figure 6. A specific structure of the DyHead.
The computational procedure for each of the three attention modules is as follows:

Figure 9. AP comparison of various types.

Table 1. Experimental environment and parameters.

Table 2. Detection result comparison. In Table 2, the best result for each defect is in bold.

Table 3. Comparison of detection results on NEU-DET.

In Table 4, W is the WIoU, BC is CARAFE and BiFPN, D is DyHead, and DW, BCW, and DBCW are their combinations.