SDD-YOLO: A Lightweight, High-Generalization Methodology for Real-Time Detection of Strip Surface Defects

Abstract: Flat-rolled steel sheets are one of the major products of the metal industry, and the production quality of strip steel is crucial to both economic performance and safety. To address the difficulty of identifying strip steel surface defects in real production environments and the low efficiency of existing detection, this study presents an approach for strip defect detection based on YOLOv5s, termed SDD-YOLO. First, this study designs the Convolution-GhostNet Hybrid module (CGH) and the Multi-Convolution Feature Fusion block (MCFF), effectively reducing computational complexity and enhancing feature extraction efficiency. Subsequently, CARAFE is employed to replace bilinear interpolation upsampling to improve image feature utilization; finally, the Bidirectional Feature Pyramid Network (BiFPN) is introduced to enhance the model's adaptability to targets of different scales. Experimental results demonstrate that, compared to the baseline YOLOv5s, this method achieves a 6.3% increase in mAP50, reaching 76.1% on the Northeastern University Surface Defect Database for Detection (NEU-DET), with parameters and FLOPs of only 3.4 MB and 6.4 G, respectively, and an FPS of 121, effectively identifying six types of defects such as Crazing and Inclusion. Furthermore, under strong exposure, insufficient brightness, and added Gaussian noise, the model's mAP50 still exceeds 70%, demonstrating strong robustness. In conclusion, the proposed SDD-YOLO combines high accuracy, efficiency, and a lightweight design, making it applicable in actual production to enhance strip steel production quality and efficiency.


Introduction
Strip steel, as a crucial core product in the metal industry, plays an indispensable role in various fields such as construction, automotives, machinery manufacturing, aerospace, and beyond [1]. With the flourishing development of high-end industries like precision metal manufacturing, the metal industry has increasingly stringent requirements for the quality of strip steel products. Throughout the production process of steel materials, various factors such as raw material quality, manufacturing equipment, and the production environment influence the occurrence of diverse types of defects on the product surface, including cracks, voids, and scratches, among others [1]. These defects cause significant economic losses to the metal manufacturing industry and pose serious safety hazards to society. In recent years, the field of metal surface defect detection has garnered increasing attention, with notable improvements in the effectiveness and efficiency of detection technology [2]. However, the detection of metal surface defects is prone to interference from various factors during the production process, such as light reflection, variations in light intensity, and material properties, thereby augmenting the challenge of metal surface defect detection [3]. Therefore, enhancing the capability of strip steel surface defect detection is of paramount importance for improving product quality and manufacturing efficiency [4].
Metals 2024, 14, 650
Since the late 20th century, scholars in the metal industry have been dedicated to researching the detection and classification of metal surface defects. Initially, detection methods primarily comprised eddy current testing [5], infrared detection [6], magnetic particle testing [7], and visual inspection; however, these methods are costly and inefficient. In recent years, with the continuous advancement of computer vision technology, object detection-based metal surface defect detection technology has been widely applied in industrial production and gradually replaced traditional detection methods [8,9].
As a significant research direction in computer vision, object detection can be categorized into two types based on feature extraction methods: traditional object detection methods and deep learning methods. Traditional object detection methods can be broadly divided into three categories. Firstly, methods such as Local Binary Patterns (LBP) [10], Histogram of Oriented Gradients (HOG) [11], and Gray Level Co-occurrence Matrix (GLCM) [12] extract the features of surface defects by manually designing parameters [13]. The second category comprises techniques based on statistical and spectral methods, such as Fourier Transform [14], Wavelet Transform [15], and Gabor Filters [16]. The last category comprises methods based on machine learning models, such as autoregressive models [17] and Markov Random Field models [18]. While these methods have made certain advancements in the field of metal surface defect detection, they are limited by the sensitivity of images to lighting conditions and backgrounds, and the inability of shallowly extracted manually designed features to effectively represent images with complex backgrounds. Therefore, despite the development of various traditional machine learning-based metal surface defect detection models, these models still fail to be effectively applied in practical production [19].
With the continuous development of artificial intelligence technology and the improvement of GPU performance, deep learning technology has shown unique application potential in metal surface defect detection [20]. Convolutional Neural Networks (CNNs) have been highly acclaimed for their powerful feature extraction capabilities, and many scholars have applied deep learning technology to the research of metal surface defect detection [21]. For instance, Lin et al. [22] proposed a multi-model cascaded CNN based on MobileNet, aiming to reduce false positives in industrial optical defect detection without considering detection speed. Li et al. [23] developed a parameter-complex integrated framework for industrial railway defect detection, aiming to improve the detection performance for railway defects. Zhou et al. [24] combined attention mechanism modules with the YOLOv5s model, improving detection performance while reducing detection efficiency. Zhang et al. [25] combined the lightweight convolutional layer GSConv with YOLOv5s to increase the detection rate of strip steel surface defects at the cost of reducing detection accuracy. Lv et al. [26] proposed a high-precision strip steel surface defect detection model based on the improved YOLOv7. Li et al. [27] improved on YOLOv7, maintaining high defect detection capabilities while slightly reducing model complexity. Although these studies have made certain contributions to the field of metal surface defect detection, these models have not yet achieved a good balance between detection accuracy and efficiency. Furthermore, since the performance of these detection models is mainly evaluated on ideal environment datasets, the detection of metal surface defects in practical applications, especially strip steel surface defect detection, is influenced by factors such as overexposure and uneven brightness. Therefore, current metal surface defect detection models still cannot overcome these challenges and achieve both detection accuracy and speed in practical strip steel surface defect detection applications.
To address the current issues in strip steel surface defect detection applications, this study proposes a lightweight, highly generalized real-time strip defect detection method named SDD-YOLO. While maintaining excellent detection performance and efficiency, this method features simpler parameters and a lighter model, meeting the demands of the metal forging industry. The main contributions of this study are as follows:
The remainder of this study is organized as follows: Section 2 introduces related research on the original YOLOv5s network, multiscale feature fusion, and lightweight networks. Then, Section 3 describes the dataset and the proposed SDD-YOLO method. The detailed analysis of the experimental process is presented in Section 4. Finally, Section 5 summarizes the work of this study.

YOLOv5
The You Only Look Once (YOLO) series is regarded as one of the classics in object detection technology [28]. Its fifth generation (YOLOv5) was introduced in 2020 [29] and is considered one of the cutting-edge object detection algorithms in the field of deep learning. It divides the entire image into a grid and predicts, for each grid cell, the presence of objects together with their positions, sizes, and categories. YOLOv5 was developed by testing several common deep learning techniques and adopting those that proved effective. On a Tesla V100, YOLOv5 achieves real-time detection speeds of 156 FPS on the COCO2017 dataset with an accuracy of 56.8% AP. In recent years, YOLOv5 has been widely applied in fields such as industry [30,31] and agriculture [32,33]. The structure of YOLOv5 mainly consists of four parts. The first part is the Input, which covers data augmentation by image concatenation, the setting of three initial anchors, and adaptive scaling of the image size. The structures of the remaining three parts are illustrated in Figure 1.
Backbone: The backbone network of YOLOv5 consists of three parts: CBS, C3, and SPPF, which convert the original image into multi-layer feature maps and extract key features for subsequent object detection. The CBS module is the basic building block, combining Conv [34], BatchNorm [35], and SiLU [36] to extract local spatial information from images and apply a nonlinear transformation. The C3 module improves the computational efficiency of the network while preserving accuracy, enabling faster object detection. For feature extraction, Cross-Stage Partial Networks (CSP) [37] and Spatial Pyramid Pooling-Fast (SPPF) [38] are employed to extract feature maps of different sizes from input images. The CSP design reduces the computational burden and improves inference speed. Meanwhile, the SPPF module, built on spatial pyramid pooling, can handle images of different resolutions and expands the receptive field while reducing the computational burden, thus refining the overall features of the targets.
Neck: The Neck of YOLOv5 adopts a fusion structure of FPNs [39] and PANs [40], combining top-down FPN layers with a bottom-up Path Aggregation Network (PAN) and integrating the extracted semantic features with positional features. Simultaneously, the fusion of backbone network layers with detection layers injects richer feature information into the model. These two structures complement each other, enhancing features extracted from different network layers and further improving detection accuracy and capability.
Head: The Head is mainly used to predict targets of different sizes on feature maps. YOLOv5 inherits the multiscale prediction Head of YOLOv4 and integrates three layers of feature mapping to enhance the detection performance of targets of different sizes. The Head of YOLOv5 employs three detection Heads responsible for detecting target objects and predicting their categories and positions. These three Heads correspond to feature maps of 20 × 20, 40 × 40, and 80 × 80, accurately outputting targets of different sizes [41].
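The three feature-map sizes quoted above can be reproduced with a short calculation. The sketch below is only illustrative: it assumes a 640 × 640 network input and detection strides of 8, 16, and 32, neither of which is stated in the text, and the helper names are hypothetical.

```python
# Assumed values (not stated in the text): a 640 x 640 network input and
# detection strides of 8, 16 and 32 for the three Heads.
INPUT_SIZE = 640
STRIDES = (8, 16, 32)


def head_grid_sizes(input_size, strides):
    """Return the (rows, cols) grid handled by each detection Head."""
    return [(input_size // s, input_size // s) for s in strides]


def responsible_cell(cx, cy, stride):
    """Grid cell (col, row) whose area contains the box centre (cx, cy), in pixels."""
    return int(cx // stride), int(cy // stride)


grids = head_grid_sizes(INPUT_SIZE, STRIDES)
print(grids)  # [(80, 80), (40, 40), (20, 20)]
print(responsible_cell(330.0, 100.0, 32))  # (10, 3)
```

Under these assumptions, the 80 × 80 grid has the finest cells and is the natural choice for small defects, while the 20 × 20 grid covers large ones.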
In the YOLOv5 model series, there are four models, YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x, which are divided according to different depth and width parameters. As model size increases, performance gradually improves, but the network structure becomes more complex and detection slows down. In this study, however, the steel strip surface defect detection model needs to display surface defects in real time and minimize memory consumption. Therefore, this study selected YOLOv5s, with its relatively simple network structure and fast detection speed, as the baseline model.

Multi-Scale Feature Fusion
Multi-scale feature fusion is a powerful tool for improving model performance, especially in the field of object detection [42]. Traditional neural networks often use fixed-size filters or pooling operations to process input images, which often leads to the loss of low-level details or high-level semantic information. Multiscale feature fusion addresses this issue. There are various ways to achieve it. A common approach is to concatenate or overlay feature maps of different scales, enabling the network to integrate information from various scales for decision-making. Another approach is to generate feature maps of different levels using a pyramid structure and then fuse them to capture details and semantic information at different scales. Through multiscale feature fusion, the model can better adapt to objects or scenes of different scales and sizes, enhancing model robustness and performance in complex scenarios. However, traditional top-down Feature Pyramid Networks (FPNs) often fail to fully utilize features of different scales due to the limitation of unidirectional information flow, thus requiring more effective methods to address this issue [43]. The YOLOv5 algorithm adopts the Path Aggregation Network (PANet) for feature fusion, which adds a bottom-up path to the FPN, realizing bidirectional information flow [44]. However, PANet requires more parameters and computational resources, resulting in slower speeds, making it unsuitable for real-time object detection. If low-level feature information is insufficient or some information is lost, the PANet method may also lead to decreased detection accuracy. The BiFPN was proposed as a network structure for multiscale feature fusion. Compared to traditional unidirectional FPNs, the BiFPN improves fusion accuracy and efficiency by utilizing bidirectional connections and feature node fusion in the Feature Pyramid Network, solving the problem of unidirectional FPNs' inability to fully utilize feature information at different scales [45][46][47]. Therefore, in this study, the original FPN and PANet structures in YOLOv5 are replaced with the BiFPN to achieve more efficient multiscale feature fusion.
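The text does not spell out how a BiFPN node fuses its inputs. As a hedged illustration, the sketch below uses the fast normalized fusion rule from the original BiFPN design (each input gets a learnable non-negative weight, and the weights are normalized by their sum); whether SDD-YOLO uses exactly this rule is an assumption. Feature maps are flattened to short lists, and the function name is hypothetical.

```python
def fast_normalized_fusion(features, weights, eps=1e-4):
    """Fuse same-shaped feature vectors with learnable non-negative weights.

    out = sum_i(w_i * f_i) / (eps + sum_j w_j), where each w_i is first
    clamped at zero (the ReLU step of fast normalized fusion).
    """
    clamped = [max(w, 0.0) for w in weights]
    norm = eps + sum(clamped)
    length = len(features[0])
    return [
        sum(w * f[i] for w, f in zip(clamped, features)) / norm
        for i in range(length)
    ]


# Two toy "feature maps", flattened to 1-D lists for illustration.
top_down = [1.0, 2.0, 3.0]
lateral = [3.0, 2.0, 1.0]
fused = fast_normalized_fusion([top_down, lateral], weights=[1.0, 3.0])
print([round(v, 3) for v in fused])  # [2.5, 2.0, 1.5]
```

Because the weights are learned, the network can decide per node how much each scale contributes, which plain addition or concatenation cannot express.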

Lightweight Network
The lightweight network is a neural network model designed for scenarios with limited computational resources. Its design aims to maintain good performance while minimizing the number of model parameters and computational complexity as much as possible. Typically, lightweight networks adopt various optimization strategies such as simplifying network structures, reducing parameter count, and lowering network layer complexity to efficiently operate in resource-constrained environments. This type of network has wide applications in scenarios such as mobile devices, embedded systems, and edge computing, meeting the demands of limited computational and storage resources while achieving fast, accurate inference and processing tasks.
To reduce computational costs while maintaining model detection efficiency, researchers have proposed various methods. Some methods focus on reducing the precision of weights to make the model more compact [48]. Additionally, a series of pruning methods aim to remove the less important parameters of trained models. For example, MADNet [49] is a dense lightweight network designed to achieve stronger multiscale feature expression and feature correlation learning. In terms of feature extraction, Liu et al. [50] constructed a network with expanded convolutions and attention modules, using pooling operations of different sizes to encode surrounding semantic information.
However, these methods typically compress pre-trained networks or directly train small-scale networks, and they tend to focus on model size while neglecting overall performance. Taking overall performance into account, the SDD-YOLO network proposed in this study effectively reduces computational complexity and model size, achieving lightweight and efficient characteristics.

Dataset
To validate the effectiveness of the proposed model, this study selected the Northeastern University Surface Defect Database for Detection (NEU-DET) [51] to evaluate the performance of SDD-YOLO and other models. The NEU-DET dataset consists of six types of defects: Crazing (Cr), Inclusions (In), Patches (Pa), Pitted Surface (Ps), Rolled-in Scale (Rs), and Scratches (Sc). Each type of defect contains 300 grayscale images with a resolution of 200 × 200 pixels, totaling 1800 images. In this study, the NEU-DET dataset was divided into training, validation, and test sets in proportions of 80%, 10%, and 10%, respectively. The training set was used to optimize network parameters to minimize the loss function, the validation set was used to validate the performance of the model during training, and the test set was used to evaluate the accuracy of the trained network in surface defect recognition. Figure 2 shows samples of the six typical surface defects in NEU-DET.
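A minimal sketch of the 80%/10%/10% split follows. The text does not say how the images were assigned to the three sets, so the seeded random shuffle, the file names, and the `split_dataset` helper are all assumptions made for illustration.

```python
import random

# Hypothetical class labels used only to build placeholder file names.
CLASSES = ["crazing", "inclusion", "patches",
           "pitted_surface", "rolled-in_scale", "scratches"]


def split_dataset(items, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle items reproducibly and cut them into train/val/test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * ratios[0])
    n_val = round(len(shuffled) * ratios[1])
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test


# 300 images per class, 1800 in total (file names are placeholders).
images = [f"{name}_{i}.jpg" for name in CLASSES for i in range(1, 301)]
train, val, test = split_dataset(images)
print(len(train), len(val), len(test))  # 1440 180 180
```

A per-class (stratified) split would keep exactly 240/30/30 images of each defect type in the three sets; the plain shuffle above only approximates that balance.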

Methods
YOLOv5s, recognized as a lightweight neural network for object detection, has relatively low inference and training costs. To strike a balance between detection speed and accuracy, this study uses YOLOv5s as the foundation for the improved detection network. Building upon the YOLOv5 network with C3 as its backbone, which maintains a relatively fast speed while enhancing detection performance, this study introduces the CGH module based on the C3 structure and establishes the novel network SDD-YOLO. Furthermore, this study introduces a novel feature fusion method termed MCFF. This method employs convolutions with larger receptive fields to extract richer feature scales and adaptively extract features, thereby augmenting the network's multiscale recognition capability for surface defects. To enhance the receptive field of convolutional neural networks and the resolution of feature maps, CARAFE replaces traditional upsampling methods in this study. Finally, the incorporation of BiFPNs enables SDD-YOLO to combine more feature information while conserving computational resources, thereby enhancing the network's information extraction capability.
Building upon the YOLOv5s network structure, this study introduces CARAFE and BiFPNs, combined with the CGH module and MCFF module, proposing an SDD-YOLO network for strip steel surface defect detection. The network structure is illustrated in Figure 3. The following subsections provide a detailed explanation of the proposed CGH and MCFF modules.

CGH Module
YOLOv5 is a classic object detection model, with its C3 module structure comprising three standard convolutional layers and a bottleneck layer. Although the C3 module adopts the CSP (Cross Stage Partial) structure to reduce computational complexity, its complexity remains relatively high compared to traditional single convolutional structures. This may increase the time cost of training and inference. To address this issue, this study proposes the CGH module. The CGH module replaces the bottleneck layer with GhostConv based on C3 and also replaces the conventional convolutional layers on the branches with GhostConv. Additionally, to address issues such as overfitting, gradient vanishing, and gradient explosion caused by excessive network depth, this study utilizes residual connections to replace Concat operations and removes the last conventional convolutional layer. The structure of the CGH module is illustrated in Figure 4.

Deep neural networks often generate many similar redundant feature maps during feature extraction, consuming a large amount of computational resources. Although these feature maps are crucial for network understanding of data features, their generation process is costly. Inspired by GhostNet [52], which was designed to validate the effectiveness of GhostConv, this study introduces the GhostConv technique to reduce memory consumption during feature space expansion. GhostConv generates more feature maps in a lower-cost manner, thereby reducing memory consumption during intermediate expansion. Additionally, to ensure effective feature extraction and enhance network stability, this study introduces residual connections in the CGH module. The structure of GhostConv is depicted in Figure 5.
Residual connections address problems that may arise from increasing network depth, such as overfitting, gradient vanishing, and gradient explosion. In this study, residual connections are utilized in the CGH module to mitigate overfitting issues, significantly improving network stability. Equation (1) for computing residual connections is as follows:
$$\mathrm{Output} = F(\mathrm{Input}, \{W_i\}) + \mathrm{Input} \qquad (1)$$
The input and output of the layer are defined as Input and Output, respectively, with the nonlinear transformation defined as $F(\mathrm{Input}, \{W_i\})$, including nonlinear activation functions, etc. The introduction of the GhostConv module and residual structure in the CGH module can greatly reduce computational complexity, obtain sufficient feature maps, and ensure network stability.
The CGH module improves the YOLOv5 C3 module by replacing the bottleneck layer and conventional convolutional layers on the branches with GhostConv, effectively reducing computational complexity. Furthermore, the introduction of residual connections instead of Concat operations addresses issues such as overfitting, gradient vanishing, and gradient explosion that deep networks may encounter, significantly enhancing network stability. This module combines the characteristics of low-cost feature map generation with GhostConv and residual connections to reduce memory consumption and improve feature extraction efficiency.
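To make the cost argument concrete, the following sketch compares the weight count of a standard convolution with that of a Ghost convolution and shows a residual connection on a toy vector. It is a rough estimate under stated assumptions, not the authors' implementation: half of the output maps are assumed to come from the primary convolution (`ratio=2`), the cheap operation is assumed to be a 5 × 5 depthwise convolution (`cheap_k=5`), biases and BatchNorm parameters are ignored, and all function names are hypothetical.

```python
def conv_params(c_in, c_out, k):
    """Weights of a standard k x k convolution (bias ignored)."""
    return c_in * c_out * k * k


def ghost_conv_params(c_in, c_out, k, ratio=2, cheap_k=5):
    """Weights of a Ghost convolution (bias ignored).

    A primary k x k convolution produces c_out // ratio intrinsic maps;
    depthwise cheap_k x cheap_k convolutions derive the remaining maps.
    """
    intrinsic = c_out // ratio
    primary = c_in * intrinsic * k * k
    cheap = (ratio - 1) * intrinsic * cheap_k * cheap_k
    return primary + cheap


def residual(x, transform):
    """Residual connection: Output = F(Input) + Input, elementwise."""
    return [f + xi for f, xi in zip(transform(x), x)]


standard = conv_params(128, 128, 3)
ghost = ghost_conv_params(128, 128, 3)
print(standard, ghost)               # 147456 75328
print(f"{standard / ghost:.2f}x")    # 1.96x
print(residual([1.0, 2.0], lambda v: [0.5 * a for a in v]))  # [1.5, 3.0]
```

With these settings the Ghost variant needs roughly half the weights, which is where the reduced cost of the CGH module comes from; the residual sum adds no parameters at all.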

Multi-Convolution Feature Fusion Block
For enhanced multiscale feature extraction, this study introduces a Multi-Convolution Feature Fusion (MCFF) block. The schematic diagram of the MCFF block is illustrated in Figure 6. $M_{in} \in \mathbb{R}^{C \times H \times W}$ represents the input of the MCFF block, from which feature maps $M_1$, $M_2$, and $M_3$ are extracted using 3 × 3, 5 × 5, and 7 × 7 convolutional kernels, respectively. Leveraging convolutions with larger receptive fields enables the extraction of richer feature scales. Additionally, the proposed block employs Global Average Pooling (GAP) to extract features from the different resolutions of $M_2$ and $M_3$. Subsequently, adaptive feature extraction is performed using one-dimensional convolution. Through Sigmoid transformation, channel attention $S_2 \in \mathbb{R}^{1 \times 1 \times C}$ and $S_3 \in \mathbb{R}^{1 \times 1 \times C}$ are obtained and utilized in conjunction with function CBR to fuse $M_1$, $S_2$, and $S_3$, resulting in the final output feature $M_{out} \in \mathbb{R}^{C \times H \times W}$, as depicted in Equation (2).
$$M_{out} = \mathrm{CBR}(M_1 \otimes S_2 \otimes S_3) = \sigma(\mathrm{BatchNorm}(\mathrm{Conv}_{3 \times 3}(M_1 \otimes S_2 \otimes S_3))) \qquad (2)$$
where CBR represents the combination of three layers: a 3 × 3 convolutional layer, a Batch Normalization layer, and a non-linear activation function ReLU. $\otimes$ denotes elementwise multiplication, $\sigma$ signifies the ReLU layer, and $\mathrm{Conv}_{3 \times 3}$ is a 3 × 3 convolutional layer used for fusing feature maps of different resolutions and channel attention, mitigating feature misalignment issues caused by simple multiplication. BatchNorm is a normalization technique addressing inconsistent input data distributions, highlighting their relative differences, thereby accelerating training speed. The ReLU layer introduces non-linear relationships to feature layers, preventing gradient vanishing and overfitting, and ensuring the network's capability to accomplish complex detection tasks.
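The channel-attention path of the MCFF block (GAP, one-dimensional convolution, Sigmoid, then channel-wise scaling) can be sketched in plain Python. This is an illustration under assumptions rather than the authors' code: the 3 × 3, 5 × 5, and 7 × 7 convolutions and the final CBR layer are omitted, the fusion is assumed to be the elementwise product of $M_1$ with both attention vectors, the 1-D kernel is a hand-picked placeholder for learned weights, and all function names are hypothetical.

```python
import math


def global_average_pool(feature):
    """feature: [C][H][W] nested lists -> list of C channel means."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature
    ]


def conv1d_same(values, kernel):
    """1-D cross-correlation across channels with zero padding (odd kernel)."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(values) + [0.0] * pad
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(values))
    ]


def channel_attention(feature, kernel):
    """GAP -> 1-D convolution -> Sigmoid, giving one weight per channel."""
    pooled = global_average_pool(feature)
    return [1.0 / (1.0 + math.exp(-v)) for v in conv1d_same(pooled, kernel)]


def scale_channels(feature, *attentions):
    """Multiply every element of channel c by each attention weight for c."""
    out = []
    for c, channel in enumerate(feature):
        w = 1.0
        for att in attentions:
            w *= att[c]
        out.append([[w * v for v in row] for row in channel])
    return out


# Toy 2-channel, 2 x 2 feature maps standing in for M1, M2 and M3.
m1 = [[[1.0, 1.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]]]
m2 = [[[0.0, 0.0], [0.0, 0.0]], [[2.0, 2.0], [2.0, 2.0]]]
m3 = [[[0.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]]]
kernel = [0.0, 1.0, 0.0]  # placeholder for learned 1-D convolution weights

s2 = channel_attention(m2, kernel)
s3 = channel_attention(m3, kernel)
fused = scale_channels(m1, s2, s3)  # input to the (omitted) CBR layer
print([round(v, 3) for v in s2], [round(v, 3) for v in s3])
```

Each attention vector holds one value in (0, 1) per channel, so the scaling keeps the C × H × W shape of $M_1$ while re-weighting its channels according to what the larger-kernel branches respond to.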
In the domain of metal surface defect detection, the application of MCFF blocks offers several advantages. Firstly, the MCFF block is a crucial component designed specifically for multiscale feature extraction. By employing convolutional kernels of varying sizes and channel attention mechanisms, the MCFF block effectively enhances the model's feature expression capability and robustness. This implies that the model can better capture various scales and shapes of metal surface defects, thereby improving the accuracy and comprehensiveness of defect detection. Secondly, the introduction of MCFF blocks aids in mitigating gradient vanishing issues and accelerates the model's training process. Metal surface defect detection tasks often entail handling large volumes of data and complex features. Through optimized feature extraction, MCFF blocks enable neural networks to learn and adapt to different surface defect patterns more efficiently, thereby enhancing the model's convergence speed and training efficiency. Most importantly, MCFF blocks ensure that neural networks can tackle complex strip steel surface defect detection tasks. Surface defects on metal surfaces may exhibit different shapes, sizes, and textures, thus requiring models with strong feature expression and generalization capabilities. The introduction of MCFF blocks enables the model to better understand and distinguish between different types of defects, thereby enhancing the robustness and reliability of the detection system.
In summary, the application of MCFF blocks in strip steel surface defect detection tasks effectively enhances the model's performance and strengthens its ability to detect defects of different scales and shapes, while also improving training efficiency and generalization capability.This provides the metal manufacturing industry with a more reliable and efficient quality control solution.))), ( ( ( where CBR represents the combination of three layers: a 3 × 3 convolutional layer, a Batch Normalization layer, and a non-linear activation function ReLU.⊗ denotes element- wise multiplication, σ signifies the ReLU layer, Conv 3×3 is a 3 × 3 convolutional layer used for fusing feature maps of different resolutions and channel attention, mitigating feature misalignment issues caused by simple multiplication.BatchNorm is a normalization technique addressing inconsistent input data distributions, highlighting their relative differences, thereby accelerating training speed.The ReLU layer introduces non-linear relationships to feature layers, preventing gradient vanishing and overfitting, and ensuring the network's capability to accomplish complex detection tasks.

CARAFE
Feature upsampling is an essential operation in image processing and a key operation in modern convolutional network architectures. It converts low-resolution feature maps into high-resolution ones, thereby enhancing the model's ability to capture details and local information. Currently, there are two mainstream upsampling approaches. The first is interpolation, including nearest neighbor and bilinear interpolation, which considers only a sub-pixel neighborhood and therefore cannot fully capture semantic information, which may lead to feature loss. The second is deconvolution, which enlarges the feature map through convolutional operations. However, deconvolution applies the same kernel across the entire feature map, which limits its sensitivity to local variations, makes it difficult to capture local details effectively, and increases the model's parameter count.
Wang et al. proposed the CARAFE (Content-Aware ReAssembly of FEatures) upsampling operator [53]. The CARAFE upsampler uses the content information at each position to predict a reassembly kernel and then reassembles the features within a predefined neighborhood. Compared to traditional methods, CARAFE achieves significant gains with only a small number of additional parameters and a small amount of additional computation. Because it can adapt the reassembly kernel to the content at each position, it outperforms mainstream upsampling operators such as interpolation and deconvolution. The network structure of CARAFE is depicted in Figure 7.
In the context of metal surface defect detection, CARAFE brings two main benefits. Firstly, by replacing traditional upsampling methods, CARAFE enlarges the receptive field and improves the resolution of the feature maps in the convolutional neural network. This means the network can better capture subtle features and details in images, improving its ability to identify defects on metal surfaces. Secondly, CARAFE makes the upsampling process more effective. Traditional upsampling methods may introduce blur or distortion, especially for metal surface images containing a large amount of detail. CARAFE reconstructs feature maps more accurately by adapting to the content of the input image. As a result, the detected defect positions and shapes are more accurate, enhancing the robustness and accuracy of the detection system.
In summary, the application of CARAFE technology in metal surface defect detection can effectively improve the performance of the detection system and enhance the ability to identify defects, while maintaining the clarity and accuracy of image features, thus providing strong support for quality control in practical production.
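The reassembly step described above can be illustrated with a minimal, single-channel sketch in plain Python. It is not the authors' implementation: in CARAFE the kernel scores are predicted from the feature map by a small convolutional branch, whereas here they are simply passed in as `kernel_logits`, and the names `carafe_reassemble`, `scale`, and `k_up` (set to 3 for brevity) are assumptions made for illustration.

```python
import math


def softmax(logits):
    """Normalize raw kernel scores so that they sum to 1."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def carafe_reassemble(feature, kernel_logits, scale=2, k_up=3):
    """Content-aware reassembly for one channel (illustrative sketch).

    feature:       H x W nested list of floats.
    kernel_logits: (H*scale) x (W*scale) x (k_up*k_up) raw scores, one
                   kernel per output location (assumed to be given here;
                   CARAFE predicts them from the feature content).
    """
    h, w = len(feature), len(feature[0])
    r = k_up // 2
    out = [[0.0] * (w * scale) for _ in range(h * scale)]
    for oi in range(h * scale):
        for oj in range(w * scale):
            # Each output location maps back to one source location.
            si, sj = oi // scale, oj // scale
            kernel = softmax(kernel_logits[oi][oj])
            acc, idx = 0.0, 0
            for di in range(-r, r + 1):
                for dj in range(-r, r + 1):
                    ni, nj = si + di, sj + dj
                    # Positions outside the map are treated as zeros.
                    if 0 <= ni < h and 0 <= nj < w:
                        acc += kernel[idx] * feature[ni][nj]
                    idx += 1
            out[oi][oj] = acc
    return out


# Kernels that put nearly all weight on the centre reduce to
# nearest-neighbour upsampling, which makes the behaviour easy to check.
feature = [[1.0, 2.0], [3.0, 4.0]]
center_peaked = [0.0] * 9
center_peaked[4] = 50.0
logits = [[list(center_peaked) for _ in range(4)] for _ in range(4)]
upsampled = carafe_reassemble(feature, logits)
```

With content-dependent kernels, each output location can instead weight its neighbourhood differently, which is the property the section attributes to CARAFE.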

BiFPN
In the YOLOv5 algorithm, a Feature Pyramid Network (FPN) combined with a Path Aggregation Network (PAN) is employed in the Neck, achieving favorable results in multiscale fusion. However, because of its computational complexity, and because the task images are susceptible to environmental factors and contain defects of diverse scales, structural features are insufficiently extracted and utilized, which results in substantial loss errors. To address this issue, the Weighted Bi-directional Feature Pyramid Network (BiFPN), proposed by Tan et al. at Google [54], is introduced. It allows fast and straightforward multiscale feature fusion. The BiFPN module employs a weighted feature fusion mechanism to learn the importance of input features at different resolutions, as shown in Equations (3) and (4). It also adopts a fast normalization method, as shown in Equation (5). The BiFPN structure is therefore integrated into the Neck of the network. The structures of the FPN, PAN, and BiFPN are depicted in Figure 8.
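For reference, the weighted fusion and fast normalization in the original BiFPN formulation are commonly written as below. This is a sketch of that published form, instantiated at pyramid level 6, and not necessarily the exact notation of Equations (3) to (5) in this study; the level indices, the primed weights, and the small constant $\epsilon$ (added for numerical stability) are assumptions relative to this text.

```latex
\begin{align}
P_6^{td}  &= \mathrm{Conv}\!\left(\frac{\omega_1 \cdot P_6^{in} + \omega_2 \cdot \mathrm{Resize}\!\left(P_7^{in}\right)}{\omega_1 + \omega_2 + \epsilon}\right) \\
P_6^{out} &= \mathrm{Conv}\!\left(\frac{\omega_1' \cdot P_6^{in} + \omega_2' \cdot P_6^{td} + \omega_3' \cdot \mathrm{Resize}\!\left(P_5^{out}\right)}{\omega_1' + \omega_2' + \omega_3' + \epsilon}\right) \\
O &= \sum_i \frac{\omega_i}{\epsilon + \sum_j \omega_j} \cdot I_i
\end{align}
```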
In Equations (3) and (4), $P_i^{in}$ represents the input feature of the node at the $i$-th level, $P_i^{td}$ denotes the intermediate feature on the top-down path at the $i$-th level, $P_i^{out}$ denotes the output feature on the bottom-up path at the $i$-th level, Conv indicates a convolution operation, and Resize represents an upsampling or downsampling operation.
In Equation (5), $O$ represents the output value, $I_i$ denotes the input value of the node, $j$ indexes all input nodes in the summation, and $\omega_i$ represents the weight of the $i$-th input node. To ensure that each weight satisfies $\omega_i \geq 0$, the Rectified Linear Unit (ReLU) activation function is applied to each weight. The BiFPN is constructed on the basis of the PAN. Compared to the original Neck structure, the BiFPN removes nodes that do not contribute to feature fusion in order to save resources. It introduces new connections between input and output nodes at the same level to integrate feature information more fully. In addition, a cross-scale connection method is used: additional edges directly fuse features from the feature extraction network with features of the corresponding size in the bottom-up path, retaining more shallow semantic information while minimizing the loss of deep semantic information. The introduction of the BiFPN enables SDD-YOLO to save computational resources while incorporating more feature information, enhancing the network's information extraction capabilities. It combines low-level position information with high-level semantic information, further improving the network's performance in object detection tasks.
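The fast normalized fusion described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the code used in this study: features are flattened to lists of floats, the function name `fast_normalized_fusion` is hypothetical, and `eps` is a small stabilizing constant that the text does not specify.

```python
def fast_normalized_fusion(inputs, weights, eps=1e-4):
    """Fuse equally shaped feature vectors with normalized, non-negative weights.

    inputs:  list of feature vectors (lists of floats of equal length).
    weights: one learnable scalar per input; ReLU keeps each weight >= 0.
    eps:     assumed small constant that avoids division by zero.
    """
    if len(inputs) != len(weights):
        raise ValueError("one weight is required per input")
    relu_weights = [max(0.0, w) for w in weights]
    total = eps + sum(relu_weights)
    length = len(inputs[0])
    return [
        sum(w * x[k] for w, x in zip(relu_weights, inputs)) / total
        for k in range(length)
    ]


# Two inputs with equal weights fuse to (almost exactly) their mean.
fused = fast_normalized_fusion([[1.0, 2.0], [3.0, 4.0]], [1.0, 1.0])

# A negative weight is clipped to zero, so that input is ignored.
clipped = fast_normalized_fusion([[1.0], [100.0]], [1.0, -5.0])
```

Unlike a softmax over the weights, this normalization needs no exponentials, which is why it is described as a fast alternative.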

Experimental Parameter Settings
In this experiment, an NVIDIA GeForce RTX 3090 graphics card with 24 GB of memory and an Intel(R) Xeon(R) Silver 4210 CPU @ 2.20 GHz with 32 GB of memory were used. Training and testing were conducted with the PyTorch deep learning framework in a Windows 10 environment. The network was trained for 160 epochs. This study employed the stochastic gradient descent (SGD) optimizer with a batch size of 8 and a linearly decaying learning rate schedule, with an initial learning rate of 0.01 and a final learning rate of 0.0001. The momentum parameter was set to 0.941, and the weight decay was set to 0.0005. Input images were uniformly resized to 640 × 640 and normalized. Specific training parameter settings are outlined in Table 1.
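As a minimal sketch of the schedule described above, the following plain-Python snippet collects the stated hyperparameters and computes a linearly decaying learning rate. The dictionary keys and the function `linear_lr` are hypothetical names, and decaying once per epoch so that the final value is reached in the last epoch is an assumption; the training code may interpolate differently (for example, per iteration).

```python
# Hyperparameters as stated in the text; the key names are illustrative.
TRAIN_CONFIG = {
    "epochs": 160,
    "batch_size": 8,
    "optimizer": "SGD",
    "lr_initial": 0.01,
    "lr_final": 0.0001,
    "momentum": 0.941,
    "weight_decay": 0.0005,
    "image_size": (640, 640),
}


def linear_lr(epoch, epochs=160, lr_initial=0.01, lr_final=0.0001):
    """Linearly interpolate the learning rate from lr_initial to lr_final.

    Assumes epochs are numbered 0 .. epochs - 1 and that the final
    learning rate is reached in the last epoch.
    """
    if epochs < 2:
        return lr_initial
    fraction = epoch / (epochs - 1)
    return lr_initial + (lr_final - lr_initial) * fraction


schedule = [linear_lr(e) for e in range(TRAIN_CONFIG["epochs"])]
```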

Model Evaluation Metrics
This study comprehensively evaluates the proposed network using average precision (AP), mean average precision (mAP), precision, recall, floating-point operations (FLOPs), parameter count (Params), and frames per second (FPS). In strip steel surface defect detection, Intersection over Union (IOU) is used to determine whether a detection result corresponds to a genuine defect. If the IOU between a predicted box and a ground-truth box exceeds a predefined threshold, the prediction is considered a positive sample; otherwise, it is deemed a negative sample. In object detection tasks, precision and recall are crucial metrics for evaluating the recognition performance of the network.
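The IOU test described above can be sketched as follows. The box format `(x1, y1, x2, y2)` with corner coordinates, the function names, and the default threshold of 0.5 are assumptions chosen for illustration.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_positive(pred_box, gt_box, threshold=0.5):
    """A prediction counts as positive when its IOU exceeds the threshold."""
    return iou(pred_box, gt_box) > threshold


overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))  # intersection 1, union 7
```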
Precision (P) is defined as the ratio of the number of correctly classified positive samples to the total number of samples classified as positive by the classifier (Equation (6)).
Recall (R) is defined as the ratio of the number of correctly classified positive samples to the total number of actual positive samples (Equation (7)).
In Equations (6) and (7), TP denotes the number of samples correctly predicted as positive by the model, FP denotes the number of samples incorrectly predicted as positive, and FN denotes the number of samples incorrectly predicted as negative.
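The two definitions above translate directly into code. The sketch below uses hypothetical counts purely to show the arithmetic; they are not results from this study.

```python
def precision(tp, fp):
    """P = TP / (TP + FP): share of predicted positives that are correct."""
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0


def recall(tp, fn):
    """R = TP / (TP + FN): share of actual positives that are found."""
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0


# Hypothetical counts: 8 correct detections, 2 false alarms, 4 missed defects.
p = precision(tp=8, fp=2)
r = recall(tp=8, fn=4)
```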
AP is a metric that summarizes the precision-recall curve and measures the precision at various recall levels for a specific class. mAP is the mean of the AP scores over all classes, providing a single metric that evaluates the overall performance of a model across multiple classes. Their formulas are shown in Equations (8) and (9).
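One common way to compute AP from sampled precision-recall points is all-point interpolation, sketched below. This is an assumption about the evaluation procedure, since the text does not state which interpolation is used; the function names and the example points are illustrative only.

```python
def average_precision(recalls, precisions):
    """Area under the precision-recall curve (all-point interpolation).

    recalls and precisions are parallel lists sorted by increasing recall.
    """
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Replace each precision by the maximum precision at any higher recall,
    # which makes the curve monotonically non-increasing.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum the rectangles between consecutive recall values.
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))


def mean_average_precision(ap_per_class):
    """mAP is the arithmetic mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)


# Two hypothetical operating points: (R=0.5, P=1.0) and (R=1.0, P=0.5).
ap_example = average_precision([0.5, 1.0], [1.0, 0.5])
map_example = mean_average_precision([ap_example, 0.25])
```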
In Equations (8) and (9), AP approximates the area enclosed by the precision-recall curve, and mAP is its mean over all classes. FLOPs are commonly used to measure model complexity, with lower values generally indicating faster model execution, as shown in Equations (10) and (11).
In Equations (10) and (11), $C_{in}$ and $C_{out}$ represent the numbers of input and output channels, respectively, and $K$, $H_{out}$, and $W_{out}$ represent the kernel size, the output feature map height, and the output feature map width, respectively.
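For a single convolutional layer, a common way to count parameters and operations from these symbols is sketched below. Whether a multiply-accumulate is counted as one or two operations, and whether the bias is included, varies between papers and tools; both are exposed as flags here because the text does not state the convention behind Equations (10) and (11). The function names and the example layer are illustrative.

```python
def conv_params(c_in, c_out, k, bias=False):
    """Parameter count of a k x k convolution: K^2 * C_in * C_out (+ bias)."""
    return k * k * c_in * c_out + (c_out if bias else 0)


def conv_flops(c_in, c_out, k, h_out, w_out, count_mac_as_two=True):
    """Operation count of a k x k convolution over an H_out x W_out output.

    Each output element needs K^2 * C_in multiply-accumulates (MACs).
    """
    macs = k * k * c_in * c_out * h_out * w_out
    return 2 * macs if count_mac_as_two else macs


# Hypothetical layer: 3 -> 32 channels, 3 x 3 kernel, 320 x 320 output.
layer_params = conv_params(3, 32, 3)
layer_macs = conv_flops(3, 32, 3, 320, 320, count_mac_as_two=False)
```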
To evaluate the models in this study, mAP50:95 and mAP50 were both employed. mAP50 is the mAP at an IOU threshold of 0.5, and mAP50:95 is the average mAP across IOU thresholds ranging from 0.5 to 0.95 with a step size of 0.05, providing a more comprehensive reflection of the model's performance. During the testing phase, FPS was used to indicate inference speed, with results averaged over 180 test images. To compare the computational complexity of different networks, this study used FLOPs to represent time complexity and parameter count (Params) to represent space complexity.
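The averaging behind mAP50:95 can be sketched as follows. The function names are hypothetical, and the per-threshold mAP values are supplied by the caller; the constant values used in the example are placeholders, not results from this study.

```python
def iou_thresholds(start=0.5, stop=0.95, step=0.05):
    """IOU thresholds 0.50, 0.55, ..., 0.95 (ten values)."""
    count = int(round((stop - start) / step)) + 1
    return [round(start + i * step, 2) for i in range(count)]


def map_50_95(map_by_threshold):
    """Average the mAP values measured at each IOU threshold.

    map_by_threshold maps every threshold from iou_thresholds() to the
    mAP measured at that threshold.
    """
    thresholds = iou_thresholds()
    return sum(map_by_threshold[t] for t in thresholds) / len(thresholds)


thresholds = iou_thresholds()
# Placeholder input: the same mAP at every threshold.
placeholder = {t: 0.4 for t in thresholds}
averaged = map_50_95(placeholder)
```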

Experimental Results and Discussion
To demonstrate the performance of SDD-YOLO in strip steel surface defect detection, this section presents the experimental results and analysis. This study first compares the performance of SDD-YOLO with the baseline YOLOv5s. Subsequently, ablation experiments are conducted to validate the contributions of the CGH, GhostConv, MCFF, CARAFE, and BiFPN modules, and the specific contributions of the Feature Pyramid Networks FPN, NAS-FPN, and BiFPN are verified through experiments. The effectiveness of the proposed method is then validated by comparing it with other classical object detection methods applied to strip steel surface defect detection. Finally, three image interference methods (increasing brightness, decreasing brightness, and adding Gaussian noise) are used to analyze the robustness of the proposed SDD-YOLO method, demonstrating the model's strong generalization ability.

Performance Evaluation
The SDD-YOLO model was validated using the NEU-DET dataset. The experimental results in Table 2 show that the proposed SDD-YOLO model achieved improvements of 4.93%, 3.28%, 6.3%, and 4.3% in accuracy, recall, mAP50, and mAP50:95, respectively, while reducing the parameter count by 51.4%.
A comparison of the detection performance of the baseline YOLOv5s and SDD-YOLO models on the six surface defect categories is illustrated in Figure 9. The SDD-YOLO proposed in this study not only handles various types and conditions of strip steel surface defect images but also exhibits significantly better detection performance than YOLOv5s.

Ablation Experiment
This study conducted ablation experiments to validate the advantages of the CGH, GhostConv, MCFF, CARAFE, and BiFPN modules in the SDD-YOLO network. The experimental results in Table 3 demonstrate that these modules can improve detection speed and accuracy and reduce parameter count and computational complexity. However, the modules trade off detection accuracy, parameter count, computational complexity, and detection speed against one another. Experiment 16, which incorporates all five modules, achieved a 6.3% improvement in detection accuracy compared to Experiment 1 while reducing parameter count and computational complexity by 51% and 59%, respectively, and increasing detection speed by 25 frames per second. The results of Experiment 16 significantly outperformed those of the other experiments. By reducing computational complexity and model size while maintaining performance, this configuration achieved lightweight and efficient detection of strip steel surface defects. These findings suggest that the combination used in Experiment 16 is the most suitable for real-time and accurate detection of strip steel surface defects.
Three Feature Pyramid Networks are commonly used in the literature: FPN, NAS-FPN, and BiFPN. As shown in Table 4, after integrating each of them into SDD-YOLO, the BiFPN achieved the highest mAP50 and mAP50:95 of the three. Because this study emphasizes accuracy, the BiFPN was chosen as the Feature Pyramid Network in SDD-YOLO.

Comparison of Different Modules
To validate the performance of the proposed SDD-YOLO method for strip steel surface defect detection, this study compared it with classical models, including YOLOv3, YOLOv5s, YOLOv7-tiny, and YOLOv8s, among others. Additionally, the default backbone network of YOLOv5s was replaced with lightweight backbone networks such as ShuffleNetv2, MobileNetv3, and GhostNet. Table 5 and Figure 10 present the comprehensive performance of each method on the NEU-DET dataset. The proposed SDD-YOLO achieved a mAP50:95 of 40.3%, surpassing all other classical methods, while significantly reducing complexity compared to all other network models, with Params and FLOPs of only 3.4M and 6.4G, respectively. Although YOLOv3-tiny achieved the highest FPS, its detection performance was poor, with a mAP50:95 of only 22.4%. SDD-YOLO achieved the best results in terms of detection accuracy, parameter count, and computational complexity, outperforming all lightweight networks and most mainstream networks. Compared to the baseline YOLOv5s, SDD-YOLO reduced parameter count and computational complexity by 51.4% and 59.5%, respectively, while increasing speed by 2.1 times. Moreover, the detection performance of ShuffleNetv2-YOLOv5, MobileNetv3-YOLOv5, and GhostNet-YOLOv5, in which the backbone network was replaced, was lower than that of SDD-YOLO. Figure 10 summarizes the comprehensive performance of the proposed SDD-YOLO model, which achieves the highest mean average precision (mAP) while maintaining a low parameter count and a lightweight design, highlighting its superior performance in strip steel surface defect detection compared to the other eight models.
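As a quick consistency check on the reported reductions, the baseline values implied by the stated percentages can be back-computed. The snippet below is only arithmetic on figures quoted in the text (3.4M parameters, 6.4G FLOPs, reductions of 51.4% and 59.5%); the function name is hypothetical, and the resulting baseline figures are implied values, not numbers read from Table 5.

```python
def implied_baseline(value_after, reduction_percent):
    """Baseline value implied by a relative reduction.

    value_after = baseline * (1 - reduction_percent / 100)
    """
    return value_after / (1.0 - reduction_percent / 100.0)


# Reported SDD-YOLO figures and reductions relative to YOLOv5s.
baseline_params_m = implied_baseline(3.4, 51.4)  # millions of parameters
baseline_flops_g = implied_baseline(6.4, 59.5)   # GFLOPs
```

Under these figures the implied baseline is roughly 7.0M parameters and 15.8G FLOPs.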
To evaluate the performance of the proposed SDD-YOLO method in detecting different defect types, this study compared it with multiple other models. Table 6 and Figure 11 compare the average precision (AP) of SDD-YOLO with that of other classical methods. Table 6 shows that the SDD-YOLO model's defect detection capability surpasses that of all classical algorithms in each category, demonstrating that the proposed CGH and MCFF modules effectively enhance the model's ability in strip steel surface defect detection. Figure 11 illustrates the performance comparison between SDD-YOLO and classical object detection methods. Visualizing the centroids of the AP values of the eight models shows that SDD-YOLO significantly outperforms the other methods in overall detection performance. Moreover, the per-category AP values highlight the proposed method's capability in detecting specific surface defect categories: SDD-YOLO outperforms all other classical methods, especially in the Cr and Rs categories, where the AP values increased by 8% and 7.1%, respectively.

Robustness Testing
When metal surface images are acquired under real-world conditions, environmental factors such as overexposure, dim lighting, and image blur often degrade image quality. To assess the adaptability of the SDD-YOLO model to these scenarios, this study applied three types of image interference to the dataset's images: high exposure, low brightness, and added Gaussian noise. Sample images before and after interference processing are shown in Figure 12.
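The three kinds of interference can be sketched on a grayscale image stored as a nested list of 8-bit values. The gains (1.5 and 0.5), the noise standard deviation (10), and the function names are hypothetical choices for illustration; the text does not state the parameters actually used.

```python
import random


def clip_pixel(value):
    """Round to the nearest integer and clamp to the 8-bit range [0, 255]."""
    return max(0, min(255, int(round(value))))


def adjust_brightness(image, gain):
    """Scale every pixel; gain > 1 brightens, gain < 1 darkens."""
    return [[clip_pixel(p * gain) for p in row] for row in image]


def add_gaussian_noise(image, sigma, seed=0):
    """Add zero-mean Gaussian noise with standard deviation sigma."""
    rng = random.Random(seed)
    return [[clip_pixel(p + rng.gauss(0.0, sigma)) for p in row] for row in image]


# Tiny hypothetical 2 x 2 grayscale image.
image = [[0, 100], [200, 254]]
high_exposure = adjust_brightness(image, gain=1.5)
low_brightness = adjust_brightness(image, gain=0.5)
noisy = add_gaussian_noise(image, sigma=10.0)
```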
Table 7. Comparative analysis of strip steel surface defect detection performance between SDD-YOLO and YOLOv5s on dataset before and after image interference processing.

To compare the robustness and generalization capabilities of the proposed method and the benchmark YOLOv5s, both models were applied to the dataset after image interference processing, yielding the results shown in Table 7. On the processed dataset, both YOLOv5s and SDD-YOLO exhibit a decline in performance. Specifically, YOLOv5s experiences reductions of 15.35% and 23.61% in mAP50 and mAP50:95, respectively, while SDD-YOLO experiences reductions of 6.57% and 12.66%, respectively. Although SDD-YOLO remains susceptible to missed and false detections when handling blurred images, the model overall maintains stable detection capability, achieving a final mAP50 of 71.1%. Compared to the benchmark YOLOv5s model, SDD-YOLO achieves a 12% improvement in detecting interfered images, even surpassing the accuracy of YOLOv5s on undisturbed images. Taking all factors into consideration, the improved SDD-YOLO model presented in this study exhibits stronger robustness and generalization capabilities than the benchmark YOLOv5s model.
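The figures above can be cross-checked with simple arithmetic. The sketch below assumes the quoted reductions are relative to each model's mAP50 on the original data, and derives the YOLOv5s baseline from the reported 76.1% for SDD-YOLO and its 6.3% improvement over the baseline; the variable and function names are illustrative, and the derived values are implied, not read from Table 7.

```python
def relative_drop(before, after):
    """Relative reduction in percent: (before - after) / before * 100."""
    return (before - after) / before * 100.0


# mAP50 values (%) quoted in the text.
sdd_clean = 76.1
sdd_interfered = 71.1

# Implied YOLOv5s values, assuming a 6.3-point gap on clean data and a
# 15.35% relative reduction under interference.
yolo_clean = sdd_clean - 6.3
yolo_interfered = yolo_clean * (1.0 - 15.35 / 100.0)

sdd_drop = relative_drop(sdd_clean, sdd_interfered)
gap_interfered = sdd_interfered - yolo_interfered
```

Under these assumptions the SDD-YOLO drop comes out at about 6.57%, the implied YOLOv5s mAP50 on interfered images is about 59.1%, and the gap between the two models is about 12 points, which matches the figures reported in the paragraph.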

Figure 1. The network architecture diagram of YOLOv5.

(Rs), and Scratches (Sc). Each type of defect contains 300 grayscale images with a resolution of 200 × 200 pixels, totaling 1800 images. In this study, the NEU-DET dataset was divided into training, validation, and test sets in proportions of 80%, 10%, and 10%, respectively. The training set was used to optimize network parameters to minimize the loss function, the validation set was used to validate the performance of the model during training, and the test set was used to evaluate the accuracy of the trained network in surface defect recognition.
Figure 2 shows samples of the six typical surface defects in NEU-DET.

Figure 2. Six sample images from the NEU-DET dataset.

network for strip steel surface defect detection. The network structure is illustrated in Figure 3. Subsequently, this study provides a detailed explanation of the proposed CGH module and MCFF module.

Figure 3. The network architecture diagram of SDD-YOLO.

CGH Module
YOLOv5 is a classic object detection model, with its C3 module structure comprising three standard convolutional layers and a bottleneck layer. Although the C3 module adopts the CSP (Cross Stage Partial) structure to reduce computational complexity, its complexity remains relatively high compared to traditional single convolutional struc-

Figure 4. The structural diagram of the CGH module.

(GAP) to extract features from different resolutions of M2 and M3. Subsequently, adaptive feature extraction is performed using one-dimensional convolution. Through Sigmoid transformation, channel attention is utilized in conjunction with the function CBR to fuse M1, S2, and S3, resulting in the final output feature

Figure 7. The structural diagram of the CARAFE upsampling operator.

Figure 9. The comparison of defect detection results between YOLOv5s and SDD-YOLO on the NEU-DET dataset.

Figure 10. The performance comparison of different detection algorithms.

Figure 11. Comparative analysis of AP performance between SDD-YOLO and classical object detection methods.

Figure 12. Partial dataset images before and after image interference processing.

(1) The proposal of the CGH module, which improves the C3 module of YOLOv5. GhostConv is used to replace conventional convolution layers on bottleneck layers and branches, effectively reducing computational complexity. Meanwhile, residual connections are introduced to replace Concat operations, significantly enhancing the stability of the network. This module combines the low-cost feature map generation of GhostConv with the characteristics of residual connections to reduce memory consumption and improve feature extraction efficiency.
(2) The proposal of the MCFF module. By employing convolution kernels of different sizes and channel attention mechanisms, the feature expression capability and robustness of the model are effectively enhanced, avoiding gradient disappearance, accelerating the training process, and ensuring that the neural network can accomplish more complex tasks.
(3) By simultaneously introducing CARAFE and the BiFPN, the receptive field of the convolutional neural network is enlarged and the resolution of feature maps is improved, resulting in a more effective upsampling process. Furthermore, incorporating the BiFPN into the Neck layer of YOLOv5, in place of the traditional FPN and PANet structures, enhances the model's capability for deep feature fusion.
(4) This study considers disruptive factors in strip steel production and applies three data interference methods (high exposure, low brightness, and Gaussian noise) to the original data to test the robustness of the SDD-YOLO method. The results show that SDD-YOLO has advanced generalization performance, making it suitable for real production environments and effectively improving the quality and efficiency of strip steel production.

Table 2. Comparison of results between YOLOv5s model and SDD-YOLO model.

Table 4. Comparison of different Feature Pyramid Networks in SDD-YOLO.

Table 5. Comparison of SDD-YOLO with state-of-the-art methods.

Table 6. Comparative analysis of AP performance between SDD-YOLO and classical object detection methods.
