DIDW-YOLOv11: The Steel Surface Defect Detection Method Based on Improved YOLOv11 Network

Jiang, Jiajun; Zhang, Yaodan; Xue, Ziyang; Wang, Chuzheng

doi:10.3390/electronics15122593


by Jiajun Jiang, Yaodan Zhang, Ziyang Xue and Chuzheng Wang *

School of Computer and Mathematics, Central South University of Forestry and Technology, Changsha 410004, China

* Author to whom correspondence should be addressed.

Electronics 2026, 15(12), 2593; https://doi.org/10.3390/electronics15122593

Submission received: 2 May 2026 / Revised: 22 May 2026 / Accepted: 3 June 2026 / Published: 12 June 2026

(This article belongs to the Special Issue Advanced Technologies and Applications for Computer Vision and Recognition Systems)


Abstract

Steel surface defect detection is crucial for steel quality and usage safety. High computational cost and low detection accuracy remain the main issues in current steel defect detection models. To address these issues, this paper proposes a new steel surface defect detection model named DIDW-YOLOv11. In DIDW-YOLOv11, the YOLOv11 C3k2 module is first replaced by C3k2-DIMB, which integrates C3k2 and DIMB by introducing DynamicInceptionDWConv2d (DIDW) to strengthen detailed feature extraction for tiny and weak-texture defects and to improve the match between receptive fields and multi-scale defects. The YOLOv11 SPPF module is then replaced by the IDWFSPPF module, which combines average pooling and max pooling to optimize the fusion of local and global information and enhance the model’s multi-scale feature fusion capability. Finally, an auxiliary detection head (ADH) with an additional coarse loss function is proposed, which applies extra supervision to shallow features to suppress background noise and reduce false detections. Experimental results on the NEU-DET and GC10-DET datasets show that DIDW-YOLOv11 improves mAP@0.5 by 4.9% and 3.8%, respectively, over the baseline model YOLOv11s. These results indicate that DIDW-YOLOv11 exhibits stronger recognition ability and robustness in complex and diverse defect detection, providing an effective solution for steel defect detection in industrial production.

Keywords:

surface defect detection; YOLOv11 network; dynamic mixed convolution; auxiliary head; object detection

1. Introduction

Steel is an indispensable raw material in modern industrial production, which is widely used in manufacturing, transportation, infrastructure construction, etc. [1]. Steel surface defect detection plays a critical quality control role, as defects can reduce steel toughness or strength and weaken the corrosion resistance of steel. Therefore, timely detection and handling of steel surface defects are beneficial for steel quality and usage safety [2].

Existing methods for steel surface defect detection struggle to balance accuracy and efficiency. In the early stage, steel surface defect detection (SSDD) mainly relied on manual inspection, in which inspectors examined steel surfaces by sight and touch [3]. Although flexible, manual inspection is inefficient and easily influenced by subjective factors [3]. With advances in image processing techniques, edge detection [4,5] and threshold segmentation [6,7] have been applied to steel surface defect detection. The edge detection technique HY-LBP, proposed by Navdeep et al. [5], utilizes the properties of ultra-smooth functions to effectively handle noise interference. The global adaptive percentile gradient image threshold segmentation method proposed by Nirbhar Neogi et al. [7] dynamically adjusts the percentile value for threshold segmentation based on the number of pixels. However, these methods adapt poorly and cannot meet the detection requirements of complex industrial scenarios. Machine learning-based defect detection methods have since been gradually applied to steel surface defects. Support vector machines (SVM) [8], neural networks [9], and Bayesian networks [10] are traditional machine learning-based steel surface defect detection methods, which map input features to defect categories by learning how defective and normal images differ in feature expression.

In recent years, deep learning-based defect detection technology has been widely applied and has gradually become the core trend in the field of steel surface defect detection [11]. According to differences in the detection procedure, deep learning-based surface defect detection technology can be divided into one-stage and two-stage algorithms. Two-stage algorithms, represented by Faster R-CNN [12] and its extension Mask R-CNN [13], first generate candidate regions via region proposal networks, then classify the target categories and regress the bounding boxes of the candidate regions, with further refinement to enhance localization precision. Two-stage object detection algorithms are characterized by high detection accuracy but slow detection speed [14]. In contrast, one-stage object detection algorithms, represented by SSD [15,16], RetinaNet [17,18], and the YOLO series [19,20,21,22,23,24,25,26,27,28,29], skip the candidate region generation step and directly predict target categories and positions simultaneously at different image locations and scales, which achieves high detection speed while maintaining considerable accuracy.

As stated above, the YOLO series, as single-stage algorithms characterized by real-time performance and strong robustness, has exhibited strong adaptability in industrial defect detection [30,31]. YOLO series models have also become the mainstream choice in the field of SSDD [32,33,34,35,36]. Wang et al. [32] proposed an improved multi-scale YOLO-v5 model for real-time and accurate steel surface defect detection. The model develops a multi-scale explore block to capture defect information at different scales and introduces a spatial attention mechanism to enhance the focus on defect regions. However, the improved multi-scale YOLO-v5 still employs fixed convolutions and single pooling operations, which limits adaptive feature extraction for complex and variable steel surface defects. In addition, its multi-scale feature fusion strategy is relatively simple, which makes it difficult to fully address the interference caused by complex backgrounds and weak textures. Song et al. [33] proposed RSTD-YOLOv7, an optimized YOLOv7 model tailored for steel surface defect detection. The RSTD-YOLOv7 model integrates the RFBVGG module in the backbone to expand the receptive field and enhance small defect feature extraction. However, RSTD-YOLOv7 still relies on manually designed multi-branch structures and Transformer components, resulting in increased computational complexity and insufficient adaptability to extremely small, weak-contrast defects under complex backgrounds. Meanwhile, the multi-scale feature fusion mechanism remains relatively conventional, resulting in a limited balance between detection accuracy, model lightweight level, and inference speed for industrial application scenarios. Song et al. [34] also proposed an improved YOLOv8 model for high-accuracy steel surface defect detection while maintaining a lightweight parameter count and an acceptable real-time inference speed. Zheng et al. [35] proposed CCSS-YOLO, a lightweight steel surface defect detection model based on YOLOv9, which enhances complex feature extraction by integrating the multi-branch channel attention mechanism of SENetV2 into the RepNCSPELAN4 module. Hu et al. [36] proposed EAD-YOLOv10, a lightweight steel surface defect detection model based on YOLOv10, which replaces traditional convolutional layers with the Adaptive Downsampling structure to reduce computational complexity while enhancing semantic information retention. Meanwhile, the model introduces the Dynamic Upsample (DySample) module into the neck network to adaptively adjust sampling weights and improve the recognition accuracy of minority defect samples. Furthermore, the novel C2f_EMSCP module is integrated to enhance multi-scale feature fusion and sensitivity to targets of various sizes. Nevertheless, this method still struggles to achieve stable detection for extremely tiny defects under strong noise and uneven illumination, and the lightweight design brings certain challenges to the robustness of feature representation in complex industrial scenarios.

YOLO11 introduces new functions on top of previous YOLO versions to further improve performance and flexibility [28], and it has been applied to steel surface defect detection (SSDD). Yang et al. [37] proposed CTC-YOLO, an improved YOLOv11-based model for steel surface defect detection, which enhances performance via the CSPPTB backbone, TBAM, and CCFF module. However, its ability to detect small and low-contrast defects is insufficient, limiting the model’s effectiveness in identifying subtle defect features. Dang et al. [38] proposed the FD-YOLO11 model, which integrates self-calibrated convolution into C3k2 to form SC-C3k2, designs FSPPF, and adopts DySample, thereby improving accuracy in steel surface defect detection at the cost of a complex network structure. Huang et al. [39] proposed the DM-YOLOv11 model, which integrates the Dynamic Weight Reparameterization (DWR) module into C3k2 to form C3k2_DWR. The DM-YOLOv11 model incorporates the Multi-path Coordinate Attention (MPCA) mechanism into the backbone and adopts the Wise-IoUv2 loss function, thereby improving the detection accuracy of multi-scale steel surface defects while maintaining favorable computational efficiency. Liang et al. [40] proposed the YOLO-GH model, which replaces the C3k2 module in the backbone with the C3k2_HFERB module to enhance the high-frequency feature capture capability.

Although the YOLO-based methods above can improve detection accuracy in steel surface defect detection (SSDD), they still suffer from high computational complexity with excessive parameters and still cannot effectively detect complex and diverse defects on steel surfaces. To address these challenges, this paper proposes the DIDW-YOLOv11 steel surface defect detection model based on YOLOv11s, in which the main contributions are as follows:

The backbone network creatively adopts the C3k2-DIMB module as the core feature extraction component, in which the DIMB module employs two different scales of DynamicInceptionDWConv2d, which enhance the adaptive extraction capability of complex defect features;
The IDWFSPPF module is designed to replace the traditional SPPF module, which uses average pooling to assist max pooling and enhances the model’s ability to fuse local and global feature information;
An auxiliary detection head (ADH) is used to optimize the original detection head. The ADH is composed of an Anchor Free and an Aux Head. By utilizing the auxiliary head to detect shallow feature information in advance and combining the shallow features with the deep features, the auxiliary detection head is very helpful in effectively mitigating noise interference and the risk of overfitting.

2. Related Works

Owing to environmental interference and material properties in the steel production process, multi-scale and irregular defects with heavy noise are inevitable in steel products, including large patches, pitting, tiny cracks, and inclusions. High computational complexity and low detection accuracy are still the key challenges to be addressed in steel defect detection.

As stated above, YOLOv11 has better feature fusion capabilities due to more efficient multi-scale feature extraction [28], but YOLO-based methods are still unable to meet the detection requirements of complex SSDD industrial scenarios, in which defect images often exhibit intricate surface textures, varying illumination conditions, and significant variations in defect morphology. In YOLOv11, the C3k2 module of the backbone fails to sufficiently emphasize feature extraction depth when pursuing multi-task compatibility, which leads to small-target detail loss and reduces sensitivity to subtle defects under complex backgrounds. In addition, the SPPF module of the neck in YOLOv11 still has a weaker ability to capture global features and fails to stably capture features of defects with significant size variations. Finally, the original detection head relies excessively on deep semantic features while failing to sufficiently exploit shallow detailed features, making it susceptible to noise interference. Consequently, the model exhibits weak generalization capability and is prone to overfitting.

This paper proposes the DIDW-YOLOv11 framework shown in Figure 1 for improving the issues above, which implements an optimized architecture built upon YOLOv11. In DIDW-YOLOv11, the C3k2 modules in the backbone are replaced with C3k2-DIMB integrated with DynamicInceptionDWConv2d (DIDW) to enhance the model’s adaptive feature extraction capability, thereby effectively suppressing the feature loss of small defects and cracks while improving the matching degree between receptive fields and multi-scale defects. The SPPF module in the neck is upgraded to IDWFSPPF which incorporates mixed pooling and upsampling to enhance the fusion of feature information for both large and small targets in steel defects. In order to enhance the model’s anti-interference ability against local noise and its dependence on redundant features, an auxiliary detection head (ADH) is used to improve the original detection head. Through a multi-task learning mechanism, the network focuses on more comprehensive information, thereby improving the model’s detection accuracy and generalization ability in complex scenes. The improvements above collectively enhance the model’s ability to handle multi-scale defects and complex backgrounds.

3. Proposed Model

3.1. DIDW-YOLOv11 Architecture Overview

Our proposed DIDW-YOLOv11, as illustrated in Figure 1, consists of three components: a backbone, a neck, and a head.

Backbone. The backbone performs multi-scale feature extraction, which creatively adopts C3k2-DIMB as the primary feature extraction module. We propose the C3k2-DIMB module by combining the Dynamic Inception Mixer Block (DIMB) constructed with DynamicInceptionDWConv2d and the traditional C3k2 module, which allows the model to perform dynamically adaptive depthwise convolution operations based on the distribution characteristics of input features, while efficiently integrating multi-scale and multi-directional defect feature information. The DynamicInceptionDWConv2d module combines depthwise separable convolutions with dynamic kernel weight adjustment mechanisms, which can reduce the number of parameters and improve computational efficiency compared to traditional full-channel convolutions;
Neck. The neck adopts a Path Aggregation Network structure, incorporating four C3k2-DIMBs for cross-scale feature fusion. The C3k2-DIMB integrates DynamicInceptionDWConv2d equipped with three parallel depthwise kernels and a dynamic weight adjustment mechanism, addressing the insufficient feature extraction depth of the original C3k2 module. Additionally, the backbone embeds the IDWFSPPF module with mixed pooling strategy to improve the traditional SPPF, enriching multi-scale feature representation while retaining fine-grained information of small targets;
Head. An auxiliary detection head (ADH) is introduced with multi-loss supervision in order to enhance the model’s anti-interference ability against local noise and its dependence on redundant features, which integrates Anchor Free for deep semantic features, and Aux Head for shallow detail features. The total loss of ADH is computed as a weighted combination of coarse loss and fine loss, optimizing the way the head processes feature maps.

3.2. The C3k2_DIMB Module

The C3k2 module for efficient feature mapping is the core component in the backbone of YOLO11, whose structure is shown in Figure 2a,b. As shown in Figure 2a,b, C3k2 has two configuration forms in YOLO11, which essentially differ in whether residual connections are enabled or disabled. If shortcut is True, C3k2 adopts a multi-branch residual structure for effectively alleviating the vanishing gradient problem in deep network training, while retaining key information such as edges and textures of defects through feature reuse. However, the residual connections increase the number of network parameters and computational complexity. If shortcut is False, the C3k2 module completes feature extraction by stacking 3 × 3 convolution structures, removing residual connections and branch convolutions, and directly outputs the feature map after channel restoration. The removal of residual connections and simplified structures reduces the floating-point operation count and memory overhead, but removing residual connections can decrease the model’s recognition accuracy. In a word, the original C3k2 module struggles to balance accuracy and efficiency. On the other hand, the C3k2 module adopts traditional single-size convolution kernels in C3k, which struggle to adapt to the diverse shape characteristics of defects, resulting in a failure to effectively capture the detailed features of steel surface defects characterized by complex textures, irregular shapes and significant scale variations.

To address the issues in C3k2, inspired by the Dynamic Inception Mixed Block (DIMB), we propose the C3k2-DIMB module by integrating the Dynamic Inception Mixed Block (DIMB) into the original C3k2 structure for enhancing adaptive feature extraction capability while balancing computational efficiency. The overall architecture of the C3k2-DIMB module is illustrated in Figure 2d.

3.2.1. The Structure of C3k2_DIMB

Our proposed C3k2-DIMB module, as an improved version of the C3k2 module, adopts a multi-branch residual structure, in which the core feature extraction is performed by a stack of C3k-DIMBs shown in Figure 2c. As shown in Figure 2d, the input feature map is first processed by a standard convolution and then split into two branches. One branch is fed into the C3k-DIMB sub-module, in which the multi-scale convolution kernels are employed for efficient and adaptive feature extraction. The other branch undergoes a standard convolution operation, and its output is concatenated with the output of the C3k-DIMB branch via a residual connection to generate the final output feature map. Each C3k-DIMB shown in Figure 2c is constructed by replacing the standard convolutional layers within the original bottleneck structure with Dynamic Inception Mixed Block (DIMB).

Specifically, the C3k2-DIMB module is constructed by replacing the ordinary convolution layers in the bottleneck layer of the C3k structure within the original C3k2 module with DIMB modules while keeping the remainder of the structure unchanged.

3.2.2. The DIMB Module

In C3k2-DIMB, Dynamic Inception Mixed Block (DIMB) modules are nested in the C3k structure to form the integrated C3k-DIMB architecture shown in Figure 2c. Through DynamicInceptionDWConv2D with kernels of different scales, this architecture introduces multiple convolution kernels of different sizes and a dynamic weight adjustment mechanism for cross-scale feature fusion and information sharing. Cross-scale feature extraction helps capture detailed information at multiple scales rather than relying solely on a fixed-size kernel, thereby improving the model’s adaptability to complex scenarios.

The structure of the DIMB used in C3k2-DIMB is shown in Figure 3. As depicted in Figure 3, the input feature map is first split into two parts with equal channel numbers along the channel dimension to support parallel multi-scale feature learning and enrich feature diversity. The two branches are then processed by DynamicInceptionDWConv2D with different kernel sizes (3 × 3 and 5 × 5), in which the 3 × 3 kernel focuses on extracting local detailed features of small defects, and the 5 × 5 kernel captures global structural features of large defects, forming a typical multi-scale feature extraction mechanism. After concatenating the outputs of the two branches, a 1 × 1 convolution is adopted to restore the channel number to the original size. In this way, the DIMB module realizes multi-scale feature fusion by integrating receptive fields of different sizes, which enhances the model’s adaptability to defects with large-scale variations and ensures compatibility with the residual connection structure.

During feature processing, the activation function CGLU shown in Figure 3 plays a key role in gradient transmission and feature interaction. Compared with ReLU, CGLU enables adaptive feature filtering via a dynamic gating mechanism. The formula of the CGLU activation function is as follows:

$$\mathrm{CGLU}(x) = x \odot \sigma(W_{g} x + b_{g}) \tag{1}$$

where $\odot$ denotes element-wise multiplication, $\sigma$ represents the sigmoid activation function, $x$ is the input vector, and $W_{g}$ and $b_{g}$ are the weight matrix and bias vector of the gating layer, respectively.
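As a minimal sketch of Eq. (1), assuming a plain fully connected gating layer acting on a vector (the helper names `sigmoid` and `cglu` are illustrative and do not come from the paper, where CGLU operates on feature maps), the gate can be computed as follows:

```python
import math


def sigmoid(z):
    """Logistic function sigma(z)."""
    return 1.0 / (1.0 + math.exp(-z))


def cglu(x, w_g, b_g):
    """Gate each element of x by sigma(W_g x + b_g), as in Eq. (1).

    x   : list of n floats (input vector)
    w_g : n x n list of lists (gating weight matrix, illustrative)
    b_g : list of n floats (gating bias)
    """
    gate = [
        sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w_g, b_g)
    ]
    # Element-wise product of the input and its gate.
    return [xi * g for xi, g in zip(x, gate)]


# With zero weights and zero bias every gate equals sigma(0) = 0.5,
# so the output is half the input.
x = [2.0, -4.0]
out = cglu(x, [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0])
```

Unlike ReLU, which zeroes negative inputs regardless of context, the gate here depends on the whole input vector through `w_g`, so the same value can be passed or suppressed depending on its neighbours.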

In DIMB, the DynamicInceptionDWConv2d is the key component for adaptive multi-dimensional processing by fusing multi-scale depthwise convolution and dynamic weight mechanism [41]. The structure of DynamicInceptionDWConv2d is shown in Figure 4.

As shown in Figure 4, the DynamicInceptionDWConv2d employs three parallel depthwise kernels. The kernels respectively capture general local features, horizontal long-range dependencies, and vertical long-range dependencies, which are particularly helpful for extracting features of defects with different directional characteristics (such as scratches, crazing, and patches) in steel surface defect detection. A dynamic weight mechanism is introduced for further enhancing the model’s adaptability, which first processes the input feature map through an adaptive average pooling layer to dynamically adjust the contribution ratio of the three kernels according to the input content, adaptively handling multi-scale and multi-directional feature requirements within a single module. The DynamicInceptionDWConv2d is expressed as follows:

$$\mathrm{DynamicInceptionDWConv2d}(x) = \mathrm{SiLU}\left(\mathrm{BatchNorm2d}\left(\sum_{i=0}^{2} \omega_{i} \cdot \mathrm{Conv2d}_{i}(x)\right)\right) \tag{2}$$

where $\mathrm{Conv2d}_{i}$ denotes the convolution operation of the $i$-th parallel kernel, $\omega_{i}$ represents the dynamically adjusted weight parameters, $\mathrm{BatchNorm2d}$ denotes the batch normalization process, and $\mathrm{SiLU}$ refers to the Sigmoid Linear Unit activation function.
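The weighted sum inside Eq. (2) can be sketched on a single channel in plain Python. This is only an illustration under stated assumptions: batch normalization is omitted, the kernel sizes (3 × 3, 1 × 5, 5 × 1) are placeholders, and the gate that turns the pooled value into the weights ω (one linear unit per branch followed by softmax) is hypothetical, since the text only says that the weights are derived from adaptive average pooling.

```python
import math


def conv2d_same(img, kernel):
    """Zero-padded 'same' cross-correlation of a 2D list with a 2D kernel."""
    h, w = len(img), len(img[0])
    kh, kw = len(kernel), len(kernel[0])
    ph, pw = kh // 2, kw // 2
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    rr, cc = r + i - ph, c + j - pw
                    if 0 <= rr < h and 0 <= cc < w:
                        acc += kernel[i][j] * img[rr][cc]
            out[r][c] = acc
    return out


def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def silu(z):
    """SiLU(z) = z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))


def dynamic_mix(img, kernels, gate_w, gate_b):
    """Weighted sum of parallel branches followed by SiLU (BatchNorm omitted)."""
    h, w = len(img), len(img[0])
    # Global average pooling summarises the channel as one scalar.
    pooled = sum(sum(row) for row in img) / (h * w)
    # Hypothetical gate: one linear unit per branch, normalised by softmax.
    omega = softmax([gw * pooled + gb for gw, gb in zip(gate_w, gate_b)])
    branches = [conv2d_same(img, k) for k in kernels]
    mixed = [
        [silu(sum(o * br[r][c] for o, br in zip(omega, branches))) for c in range(w)]
        for r in range(h)
    ]
    return mixed, omega


# Square, horizontal band, and vertical band kernels (sizes are assumptions).
square = [[1.0 / 9.0] * 3 for _ in range(3)]
horizontal = [[1.0 / 5.0] * 5]
vertical = [[1.0 / 5.0] for _ in range(5)]
image = [[1.0] * 6 for _ in range(6)]
mixed, omega = dynamic_mix(image, [square, horizontal, vertical], [0.0] * 3, [0.0] * 3)
```

With a zero gate the three branches contribute equally; a trained gate would shift weight toward the band kernel whose orientation matches the defect, for example the horizontal band for a horizontal scratch.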

As stated above, the DynamicInceptionDWConv2d module combines depthwise separable convolutions with a dynamic kernel weight adjustment mechanism, which reduces the number of parameters and improves computational efficiency compared to traditional full-channel convolutions. The multi-scale feature fusion strategy is well suited to steel surface defect detection tasks with diverse defect shapes, enhancing detection accuracy while maintaining lightweight characteristics. Compared with the traditional fixed kernel design, the dynamic adjustment of the convolution strategy not only maintains the computational efficiency advantage of depthwise convolution but also enhances the flexibility of feature extraction.
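The parameter saving claimed above follows from simple counting. The sketch below compares one full-channel 3 × 3 convolution with three parallel depthwise kernels; the channel width of 256 and the band length of 11 are illustrative assumptions, not values reported in the paper.

```python
def standard_conv_params(c_in, c_out, k_h, k_w):
    """Weights of a full-channel convolution (bias ignored)."""
    return c_in * c_out * k_h * k_w


def depthwise_conv_params(channels, k_h, k_w):
    """Weights of a depthwise convolution: one k_h x k_w filter per channel."""
    return channels * k_h * k_w


channels = 256  # illustrative channel width, not taken from the paper
full = standard_conv_params(channels, channels, 3, 3)
# Three parallel depthwise kernels; the band length 11 is an assumption.
dw = (
    depthwise_conv_params(channels, 3, 3)
    + depthwise_conv_params(channels, 1, 11)
    + depthwise_conv_params(channels, 11, 1)
)
print(full, dw)  # 589824 7936
```

Depthwise kernels do not mix information across channels, which is why a 1 × 1 convolution usually follows them; at this width that adds 256 × 256 = 65,536 weights, still far fewer than the full-channel layer.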

3.3. IDWFSPPF Module

In the backbone of the YOLO11 network, the Spatial Pyramid Pooling Fast (SPPF) module applies pooling operations to further extract multi-scale spatial features, but because it relies solely on simple max-pooling operations it struggles to adapt to detection targets of varying sizes. With only a single type of pooling, SPPF lacks sufficient ability to capture global features and therefore captures correlations across different scales inadequately. In our model, the Integrated Dynamic Weighted Feature Spatial Pyramid Pooling (IDWFSPPF) module, shown in Figure 5b, is proposed on the basis of the SPPF framework [38].

As shown in Figure 5b, the input feature map $x$ is first processed by a $1 \times 1$ IDWConv2D to reduce channel redundancy and prepare for the subsequent pooling operations:

$$x_{1} = \mathrm{IDWConv2d}_{1 \times 1}(x) \tag{3}$$

where $x_{1}$ denotes the output feature map of $x$ after the IDWConv2d operation.

Average pooling is then applied to $x_{1}$ to capture smooth global context and suppress noise. The pooled feature is further refined by another $1 \times 1$ IDWConv2D and upsampled to restore the original spatial resolution, giving the base feature $x_{avg}$:

$$x_{p} = \mathrm{AvgPool}(x_{1}) \tag{4}$$

$$x_{avg} = \mathrm{UpSample}(\mathrm{IDWConv2d}_{1 \times 1}(x_{p})) \tag{5}$$

where $x_{p}$ denotes the output of $x_{1}$ after the average pooling operation.

Two consecutive max pooling operations are applied to extract multi-scale salient features. Each pooled feature is then upsampled to the original size to maintain spatial consistency:

$$x_{m1} = \mathrm{UpSample}(\mathrm{MaxPool}(x_{p})) \tag{6}$$

$$x_{m2} = \mathrm{UpSample}(\mathrm{MaxPool}(x_{m1})) \tag{7}$$

The average-pooled feature $x_{avg}$ and the two-level max-pooled features $x_{m1}$ and $x_{m2}$ are concatenated along the channel dimension to obtain $x_{fusion}$, which fuses global context and multi-scale details:

$$x_{fusion} = \mathrm{Concat}(x_{avg}, x_{m1}, x_{m2}) \tag{8}$$

The concatenated feature $x_{fusion}$ is processed by a $1 \times 1$ IDWConv2D, followed by batch normalization (BN) and ReLU activation, to generate the output feature $x_{out}$:

$$x_{out} = \mathrm{ReLU}(\mathrm{BN}(\mathrm{IDWConv2d}_{1 \times 1}(x_{fusion}))) \tag{9}$$
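The data flow of Eqs. (3) to (8) can be traced on a single channel with a short sketch. Several simplifications are assumptions rather than details from the paper: the 1 × 1 IDWConv2d layers are replaced by the identity, pooling uses a 2 × 2 window with stride 2, upsampling is nearest-neighbour, and the final projection of Eq. (9) is left out.

```python
def pool2x2(x, reduce_fn):
    """2x2, stride-2 pooling of a 2D list (even height and width assumed)."""
    return [
        [
            reduce_fn([x[r][c], x[r][c + 1], x[r + 1][c], x[r + 1][c + 1]])
            for c in range(0, len(x[0]), 2)
        ]
        for r in range(0, len(x), 2)
    ]


def mean(values):
    return sum(values) / len(values)


def upsample_nearest(x, out_h, out_w):
    """Nearest-neighbour resize of a 2D list to out_h x out_w."""
    in_h, in_w = len(x), len(x[0])
    return [
        [x[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def idwfsppf_flow(x):
    """Follow Eqs. (3)-(8) on one channel; height and width must be multiples of 4."""
    h, w = len(x), len(x[0])
    x1 = x                                             # Eq. (3), identity stand-in
    xp = pool2x2(x1, mean)                             # Eq. (4)
    x_avg = upsample_nearest(xp, h, w)                 # Eq. (5), conv stubbed out
    x_m1 = upsample_nearest(pool2x2(xp, max), h, w)    # Eq. (6)
    x_m2 = upsample_nearest(pool2x2(x_m1, max), h, w)  # Eq. (7)
    return [x_avg, x_m1, x_m2]                         # Eq. (8): stacked as channels


feature = [[float(4 * r + c + 1) for c in range(4)] for r in range(4)]
fused = idwfsppf_flow(feature)
```

The three returned maps share the input resolution, which is what allows the channel-wise concatenation in Eq. (8); the averaged branch keeps smooth context while the two max branches keep the strongest responses at coarser scales.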

As shown in Figure 5b, three targeted improvements are implemented in IDWFSPPF module. Firstly, a mixed pooling strategy that combines average pooling (AvgPool) and max pooling (MaxPool) is introduced, in which AvgPool supplements global statistical features to complement the salient local features captured by MaxPool, thereby enhancing feature diversity and enriching the semantic defect features. An upsampling layer is then added to restore the low-resolution features after AvgPool pooling and integrated into the concatenation process for strengthening the retention of small-target features and mitigating the issue of small-target information loss. Finally, the fused features are processed a second time by a structure consisting of InceptionDWConv2d, normalization, and the ReLU activation function, so as to explore the correlations among features, shown in Figure 5b. This step can substantially reduce computational costs by combining the efficiency of InceptionDWConv2d with the nonlinear characteristic of ReLU.

3.4. Auxiliary Detection Head Module

In YOLO11, the detection head relies excessively on deep semantic features, which leads to inadequate learning of shallow detail features. It is also susceptible to gradient vanishing during deep network training, which can result in overfitting.

Inspired by the auxiliary head from YOLOv7 [24], an improved auxiliary detection head (ADH) is proposed, whose structure is illustrated in Figure 6. It adopts the Task Alignment Learning (TAL) label assignment strategy to avoid the excessive computational cost caused by complex iterative optimization.

As shown in Figure 6, the improved auxiliary detection head ADH integrates Anchor Free, the original detection head in YOLO11 with λ = 1 for deep semantic features, and Aux Head, the original detection head in YOLO11 with λ = 0.25 for shallow detail features. A coarse-to-fine hierarchical detection mechanism is built in ADH, in which the Aux Head with λ = 0.25 acts on the penultimate layer of the network to perform early coarse-grained detection, while Anchor Free with λ = 1 conducts fine-grained detection on the final feature layer. The Aux Head introduces additional supervision signals to optimize feature learning in shallow layers, which not only enables the network to learn effective defect features in advance but also further refines the final feature map.

The total loss of the detection head is a weighted combination of the coarse loss (from the Aux Head) and the fine loss (from Anchor Free) with a weight λ:

$$\mathrm{Loss} = \mathrm{Loss}_{fine} + \lambda \cdot \mathrm{Loss}_{coarse} \tag{10}$$

where $\mathrm{Loss}_{fine}$ denotes the fine-grained loss from the main detection head Anchor Free, $\mathrm{Loss}_{coarse}$ represents the coarse-grained loss from the Aux Head, and $\lambda$ denotes the proportion of the Aux Head’s loss function in the total loss (in this study, $\lambda = 0.25$).
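Eq. (10) amounts to a single weighted sum. A minimal sketch, with the function name `total_loss` and the example loss values chosen purely for illustration:

```python
AUX_WEIGHT = 0.25  # lambda reported in the paper


def total_loss(loss_fine, loss_coarse, lam=AUX_WEIGHT):
    """Eq. (10): fine loss plus lambda-weighted coarse loss."""
    return loss_fine + lam * loss_coarse


# Illustrative values only, not measured losses.
example = total_loss(loss_fine=2.0, loss_coarse=4.0)  # 2.0 + 0.25 * 4.0 = 3.0
```

Because λ is below 1, gradients from the Aux Head act as a secondary signal: they shape the shallow features without outweighing the supervision of the main head.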

In Formula (10), $\mathrm{Loss}_{fine}$ and $\mathrm{Loss}_{coarse}$ each consist of three loss components: classification loss, bounding box regression loss, and distribution focal loss:

$$\mathrm{Loss} = \mathrm{Loss}_{cls} + \mathrm{Loss}_{box} + \mathrm{Loss}_{dfl} \tag{11}$$

$$\mathrm{Loss}_{cls} = \frac{1}{N_{pos}} \sum \mathrm{BCEWithLogits}(p_{pred}, p_{gt}) \tag{12}$$

$$\mathrm{Loss}_{box} = \frac{1}{N_{pos}} \sum \left(1 - \mathrm{CIoU}(b_{pred}, b_{gt})\right) \cdot \omega_{score} \tag{13}$$

$$\mathrm{Loss}_{dfl} = \frac{1}{N_{pos}} \sum \left[\omega_{l} \cdot \mathrm{CE}(d_{pred}, t_{l}) + \omega_{r} \cdot \mathrm{CE}(d_{pred}, t_{r})\right] \tag{14}$$

$\mathrm{Loss}_{cls}$ is the binary cross-entropy (BCE) loss, where $N_{pos}$ denotes the number of positive samples, $p_{pred}$ represents the category prediction score, and $p_{gt}$ denotes the target score. $\mathrm{Loss}_{box}$ is the bounding box regression loss, where $\mathrm{CIoU}$ represents the Complete IoU between two boxes, $b_{pred}$ denotes the predicted box, $b_{gt}$ is the ground truth box, and $\omega_{score}$ denotes the target score weight. $\mathrm{Loss}_{dfl}$ is the distribution focal loss, where $\mathrm{CE}$ represents the cross-entropy loss, $d_{pred}$ denotes the predicted offset distribution, $t_{l}$ and $t_{r}$ are the left and right discrete values of the target offset, and $\omega_{l}$ and $\omega_{r}$ are the corresponding weights, determined by the difference between the ground truth offset of the target bounding box and these discrete values.
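To make the bracketed term of Eq. (14) concrete, the sketch below computes it for one box side. It follows the standard distribution focal loss formulation, which is assumed here rather than confirmed by the text: the weight of each neighbouring discrete value is the distance from the target to the opposite neighbour. The helper names are illustrative.

```python
import math


def cross_entropy(logits, target_index):
    """Cross-entropy of a softmax over logits against one integer class."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum - logits[target_index]


def dfl_single(logits, target):
    """Distribution focal loss term for one box side.

    logits : scores over the discrete offsets 0, 1, ..., len(logits) - 1
    target : continuous ground truth offset, 0 <= target < len(logits) - 1
    """
    t_l = int(math.floor(target))  # left discrete value
    t_r = t_l + 1                  # right discrete value
    w_l = t_r - target             # grows as the target nears t_l
    w_r = target - t_l             # grows as the target nears t_r
    return w_l * cross_entropy(logits, t_l) + w_r * cross_entropy(logits, t_r)


# A uniform prediction over 4 bins costs log(4) wherever the target lies.
uniform_cost = dfl_single([0.0, 0.0, 0.0, 0.0], 1.5)
```

The two weights sum to 1, so a target lying exactly on a discrete value reduces the term to an ordinary cross-entropy against that value.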

In a normal detection head, different label assignment strategies directly affect the computational complexity and inference efficiency in the detection model. The Optimal Transport Assignment (OTA) strategy adopted in YOLOv7 achieves globally optimal matching by solving the optimal transport problem, but the OTA strategy relies on complex iterative optimization, resulting in excessive computational overhead [24]. To tackle this problem, the Task Alignment Learning (TAL) label assignment strategy is employed to simplify the matching logic and reduce computational burden in our ADH. The proposed detection head further enhances robustness and generalization ability in complex industrial environments with varying illumination, uneven surface textures, and diverse defect shapes, while balancing detection accuracy and inference efficiency.

4. Experiments

4.1. Data Collection and Data Preprocessing

The NEU-DET dataset, the surface defect detection dataset released by Northeastern University, is mainly used in our study [39]. NEU-DET, with an image resolution of 200 × 200 pixels, focuses on hot-rolled steel strip surfaces and covers six common types of surface defects: crazing, patches, pitted surface, scratches, inclusions, and rolled-in scales. The detailed information of NEU-DET is presented in Table 1 and Figure 7. To verify the generalization performance of the model, the GC10-DET steel defect dataset is used in the comparative experiments [41].

In the NEU dataset, defects belonging to the same category exhibit notable appearance variations. Additionally, due to the influence of illumination conditions and material properties, defect images within the same category undergo substantial grayscale fluctuations.

The GC10-DET contains a total of 10 types of surface defects with the resolution of 2048 × 1000, which is collected during the steel plate manufacturing process and captured by grayscale cameras.

To enhance the model’s generalization ability and reduce the risk of overfitting, the model performs data augmentation, including random cropping, rotation, scaling, and application of mosaic transformation to augment data.

In this study, we used 10-fold cross-validation and independent tests for the final evaluation and comparison. The datasets are initially segmented through stratified sampling, with 80% allocated to training and validation, and the remaining 20% retained as an independent test set.
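The split described above can be sketched as follows. This is a minimal illustration, assuming that each NEU-DET image is stratified by a single defect class (300 images per class, as in Table 1) and that the 10 folds are drawn by plain shuffling of the 80% portion; the function names and the random seed are hypothetical, and the authors' actual splitting code may differ.

```python
import random
from collections import defaultdict
from typing import Dict, Iterator, List, Sequence, Tuple


def stratified_split(labels: Sequence[str], test_frac: float = 0.2,
                     seed: int = 0) -> Tuple[List[int], List[int]]:
    """Split sample indices so that every class keeps the same test fraction."""
    rng = random.Random(seed)
    by_class: Dict[str, List[int]] = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    dev, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_frac)
        test.extend(indices[:n_test])
        dev.extend(indices[n_test:])
    return sorted(dev), sorted(test)


def k_fold(indices: Sequence[int], k: int = 10,
           seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, val) index lists for k-fold cross-validation."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, folds[i]


# Six classes with 300 images each, as listed in Table 1.
classes = ["crazing", "patches", "inclusion", "pitted_surface",
           "rolled-in_scale", "scratches"]
labels = [c for c in classes for _ in range(300)]
dev_idx, test_idx = stratified_split(labels)
folds = list(k_fold(dev_idx))
```

Under these assumptions, the 1800 images yield 1440 images for training and validation and 360 for the independent test set (60 per class), and each of the 10 folds validates on 144 images.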

4.2. Implementation Details

All experiments in this study are conducted under the Windows 11 operating system. The GPU used is NVIDIA GeForce RTX 4060 Ti with 16 GB video memory. The CPU is a 10-core Intel(R) Core(TM) i5-12600KF processor. The deep learning framework is PyTorch 2.4.1, the programming language is Python 3.8.20, and the CUDA version is 11.6.

The optimizer is AdamW, the initial learning rate is set to 0.001, and the final learning rate is 0.0001. The momentum is set to 0.937. The input size is fixed at 640 × 640, and the batch size is set to 32. Our data augmentation strategies include random scaling, translation, and Mosaic.
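The reported hyperparameters can be collected into a small configuration sketch. The key names below mirror Ultralytics-style training arguments, which is an assumption about how training was configured rather than something the paper states. The linear decay between the initial and final learning rates and the epoch count are likewise hypothetical, since the schedule shape and number of epochs are not given here.

```python
# Values reported in this section. The key names mirror Ultralytics-style
# training arguments; that mapping is an assumption, not stated in the paper.
train_cfg = {
    "optimizer": "AdamW",
    "lr0": 0.001,       # initial learning rate
    "momentum": 0.937,
    "imgsz": 640,       # 640 x 640 input
    "batch": 32,
}
FINAL_LR = 0.0001
# Final learning rate expressed as a fraction of the initial one.
train_cfg["lrf"] = FINAL_LR / train_cfg["lr0"]


def linear_lr(epoch: int, epochs: int, lr0: float, lrf: float) -> float:
    """Linearly interpolate from lr0 at epoch 0 to lr0 * lrf at the last epoch."""
    frac = max(1.0 - epoch / epochs, 0.0)
    return lr0 * (frac * (1.0 - lrf) + lrf)


EPOCHS = 100  # hypothetical; the epoch count is not given in this section
schedule = [linear_lr(e, EPOCHS, train_cfg["lr0"], train_cfg["lrf"])
            for e in range(EPOCHS + 1)]
```

Expressing the final rate as the ratio 0.0001 / 0.001 = 0.1 keeps the two reported learning rates tied together if the initial rate is changed later.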

4.3. Evaluation Metric

In this study, precision (P), recall (R), and mean Average Precision (mAP) are used as evaluation metrics to measure the accuracy of the model, and FPS is used to measure the detection efficiency.

Precision is defined as the proportion of correct predictions among all positive predictions made by the model. It is calculated as follows:

Precision = \frac{TP}{TP + FP}

(15)

Recall is defined as the proportion of correctly predicted samples among all actual positive samples. It is calculated as follows:

Recall = \frac{TP}{TP + FN}

(16)

AP measures the average precision for a single category, while mean Average Precision (mAP) is the average of the AP values over all categories in the dataset and evaluates the overall performance of the model on the entire task. It is calculated as follows:

mAP = \frac{1}{K} \sum_{j=1}^{K} AP(j)

(17)

where AP(j) represents the average precision of the j-th category and K represents the total number of categories.
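As a worked example of the mAP definition, the sketch below averages the per-class AP@0.5 values listed in Table 3 for the baseline and for the full DIDW-YOLOv11 model. The function name is hypothetical; the numbers are copied from Table 3, in the order Cr, In, Pa, Ps, Rs, Sc.

```python
from typing import Sequence


def mean_average_precision(ap_per_class: Sequence[float]) -> float:
    """Average the per-class AP values over the K classes."""
    if not ap_per_class:
        raise ValueError("at least one class is required")
    return sum(ap_per_class) / len(ap_per_class)


# Per-class AP@0.5 (%) from Table 3: Cr, In, Pa, Ps, Rs, Sc.
baseline_ap = [40.5, 85.9, 91.0, 80.3, 67.1, 94.8]
proposed_ap = [44.2, 87.2, 97.6, 93.9, 70.2, 95.7]

baseline_map = round(mean_average_precision(baseline_ap), 1)
proposed_map = round(mean_average_precision(proposed_ap), 1)
```

With K = 6, the averages round to 76.6% and 81.5%, which agrees with the mAP@0.5 values reported for Model 1 and Model 8 in Table 2.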

FPS represents the number of images that the model can process per second, reflecting the real-time performance of the model. It is calculated as follows:

FPS = \frac{\text{Total number of processed frames}}{\text{Total time}}

(18)
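A minimal way to measure FPS as defined above is to count the processed frames and divide by the elapsed wall-clock time. The sketch below is an illustration only: `measure_fps` and its `clock` parameter are hypothetical names, and the demo uses a fake clock so that the result is deterministic.

```python
import time
from typing import Callable, Iterable


def measure_fps(infer: Callable[[object], object], frames: Iterable[object],
                clock: Callable[[], float] = time.perf_counter) -> float:
    """Return the number of processed frames divided by the total elapsed time."""
    start = clock()
    count = 0
    for frame in frames:
        infer(frame)
        count += 1
    elapsed = clock() - start
    if count == 0 or elapsed <= 0:
        raise ValueError("need at least one frame and a positive elapsed time")
    return count / elapsed


# Deterministic demo: the fake clock reports 2.0 s elapsed for 236 frames.
ticks = iter([0.0, 2.0])
demo_fps = measure_fps(lambda frame: None, range(236), clock=lambda: next(ticks))
```

In a real GPU measurement it is usual to run a few warm-up iterations first and to synchronize the device before reading the clock, so that asynchronous kernel launches are not counted as finished work.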

In Equations (15) and (16), TP (true positive) and TN (true negative) denote samples that are correctly predicted as positive and negative, respectively, while FP (false positive) and FN (false negative) denote samples that are incorrectly predicted as positive and negative, respectively.

mAP@0.5, i.e., the mAP computed at an IoU threshold of 0.5, is used as the core metric to ensure fairness and consistency in performance comparison.
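For object detection, whether a prediction counts as a TP or an FP depends on its overlap with a ground-truth box. The sketch below illustrates this for a single class at an IoU threshold of 0.5, using a simplified greedy matching (highest confidence first, each ground truth matched at most once). The function names and example boxes are hypothetical, and standard evaluation toolkits differ in some details.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def count_matches(preds: Sequence[Tuple[Box, float]], gts: Sequence[Box],
                  iou_thr: float = 0.5) -> Tuple[int, int, int]:
    """Return (TP, FP, FN) after greedily matching predictions to ground truths."""
    used = set()
    tp = fp = 0
    for box, _score in sorted(preds, key=lambda p: -p[1]):
        best_iou, best_j = 0.0, -1
        for j, gt in enumerate(gts):
            if j in used:
                continue
            overlap = iou(box, gt)
            if overlap > best_iou:
                best_iou, best_j = overlap, j
        if best_iou >= iou_thr:
            used.add(best_j)
            tp += 1
        else:
            fp += 1
    return tp, fp, len(gts) - len(used)


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical example: three ground-truth defects and three predictions.
gts = [(0, 0, 10, 10), (20, 20, 30, 30), (50, 50, 60, 60)]
preds = [((0, 0, 10, 10), 0.9), ((21, 21, 31, 31), 0.8), ((100, 100, 110, 110), 0.7)]
tp, fp, fn = count_matches(preds, gts)
precision, recall = precision_recall(tp, fp, fn)
```

Here two predictions overlap a ground-truth box with IoU of at least 0.5, one prediction matches nothing, and one defect is missed, so both precision and recall equal 2/3.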

4.4. Ablation Experiments

To verify the effectiveness of the improved components in DIDW-YOLOv11, we conducted ablation experiments on the NEU-DET dataset with YOLOv11s as the baseline model. First, we evaluated the performance of the baseline model (YOLOv11s, Model 1). Subsequently, we incorporated the improved modules into the baseline model one at a time and then combined them in pairs to evaluate their synergistic effect. The results of the ablation experiments are presented in Table 2.

As can be seen from Table 2, when the C3k2-DIMB module is integrated into the baseline (Model 2), precision increases by 5.1 percentage points and mAP@0.5 by 1.3 percentage points, while recall changes only marginally (70.7% to 71.1%). The improvement in precision and mAP is attributed to the dynamic mixed convolution mechanism of the DIMB module, which adaptively adjusts the receptive field according to the morphological characteristics of steel surface defects and thereby enhances the network’s ability to extract multi-scale and irregular defect features. The cost is a drop in FPS from 124 to 93, caused by the increased computational complexity of dynamic convolution.

After replacing the original SPPF module with the IDWFSPPF module (Model 3), the recall increases from 70.7% to 73.2% and the mAP@0.5 improves from 76.6% to 78.3%, with a slight improvement in FPS (124 to 127) and a slight decrease in precision (71.5% to 70.6%). The improvement arises because the IDWFSPPF module adopts a hybrid pooling structure of average pooling and max pooling, which better balances the fusion of global features of large targets and detailed features of small targets, thus capturing small target defects such as cracks and rolled-in scales more effectively while optimizing the computational efficiency of the feature fusion process.

After the auxiliary head is introduced (Model 4), all indexes improve, with the precision, recall, and mAP@0.5 rising by 9.0, 0.4, and 2.9 percentage points, respectively. The most notable changes are that the precision reaches 80.5% and the FPS rises to 151. These gains stem from the strategy of the auxiliary detection head, which introduces multi-loss supervision to optimize shallow feature learning, reducing the network’s dependence on local noise and redundant features in complex industrial backgrounds and suppressing false detections, while the lightweight auxiliary supervision improves inference efficiency.

As shown in Table 2, the multi-module combination experiments demonstrate the complementary effects of the integrated modules. When C3k2-DIMB and IDWFSPPF are combined in Model 5, recall and mAP@0.5 improve over the baseline (70.7% to 73.4% and 76.6% to 78.5%, respectively), while the FPS stays at 93, the same as with C3k2-DIMB alone. Precision does not improve (70.6% versus 71.5% for the baseline), but the higher recall indicates that the combination enhances the model’s ability to capture more defect samples, especially small and irregular defects. This improvement in Model 5 is attributed to the joint effect of the adaptive receptive field adjustment of C3k2-DIMB and the multi-scale feature fusion of IDWFSPPF, which together strengthen the extraction of detailed features while balancing global context information.

In Model 6, when C3k2-DIMB is combined with the auxiliary detection head (ADH), all accuracy metrics improve markedly, with precision reaching 78.3%, recall reaching 74.8%, and mAP@0.5 reaching 79.9%, although the FPS drops to 83. The recall improvement is particularly notable, increasing by 4.1 percentage points compared to the baseline. These improvements show that the auxiliary detection head provides multi-task supervision, which optimizes shallow feature learning and reduces the model’s sensitivity to noise, while the C3k2-DIMB module enhances the adaptive feature extraction capability.

In Model 7, where the IDWFSPPF is integrated with the ADH, the mAP@0.5 improves moderately over the baseline to 78.0%, and the FPS increases to 130. As shown in Table 2, the recall remains at 73.2%, the same as with IDWFSPPF alone, while the precision increases to 73.1%. This indicates that the hybrid pooling strategy of IDWFSPPF and the multi-loss supervision of ADH complement each other, improving the model’s anti-interference ability and feature fusion efficiency. The auxiliary head helps reduce the dependence on redundant features, while IDWFSPPF ensures that multi-scale features are effectively integrated, resulting in a balanced improvement in detection performance and inference speed.

Finally, when all three modules are combined to construct the proposed DIDW-YOLOv11, the key indicators improve across the board except for a slight decrease in FPS (124 to 118): the precision reaches 77.7%, the recall increases to 75.6%, and the mAP@0.5 improves to 81.5%. Table 2 shows that the three modules provide a synergistic gain, with C3k2-DIMB enhancing defect feature extraction, IDWFSPPF optimizing multi-scale feature fusion, and the auxiliary head suppressing noise interference. Together they overcome the shortcomings of the baseline model in steel surface defect detection and achieve a good balance between detection accuracy and inference speed, which supports the effectiveness of the proposed improvements. Figure 8 illustrates the precision-recall (P-R) curves of YOLOv11 and the proposed DIDW-YOLOv11, and shows that the performance of the improved model is substantially enhanced. Figure 9 presents the detection outcomes of YOLOv11 and DIDW-YOLOv11, from which it can be observed that DIDW-YOLOv11 produces better detection results.
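The gains quoted in this subsection can be re-derived directly from Table 2. The short sketch below computes, for each model, the change in each metric relative to the baseline (Model 1); the dictionary layout and function name are illustrative choices, and the numbers are copied from Table 2.

```python
# Rows of Table 2: model id -> (P %, R %, mAP@0.5 %, FPS).
table2 = {
    1: (71.5, 70.7, 76.6, 124),
    2: (76.6, 71.1, 77.9, 93),
    3: (70.6, 73.2, 78.3, 127),
    4: (80.5, 71.1, 79.5, 151),
    5: (70.6, 73.4, 78.5, 93),
    6: (78.3, 74.8, 79.9, 83),
    7: (73.1, 73.2, 78.0, 130),
    8: (77.7, 75.6, 81.5, 118),
}


def delta_vs_baseline(model_id: int, baseline_id: int = 1) -> tuple:
    """Return (dP, dR, dmAP, dFPS) of a model relative to the baseline row."""
    base = table2[baseline_id]
    row = table2[model_id]
    return tuple(round(r - b, 1) for r, b in zip(row, base))


deltas = {m: delta_vs_baseline(m) for m in table2 if m != 1}
```

For instance, `deltas[4]` gives changes of +9.0, +0.4, and +2.9 percentage points in precision, recall, and mAP@0.5 for the auxiliary head alone, and `deltas[8]` gives +6.2, +4.9, and +4.9 for the full model together with a drop of 6 FPS.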

Table 3 illustrates the detection performance of the DIDW-YOLOv11 on six types of defects (Cr, In, Pa, Ps, Rs, and Sc represent crazing, inclusions, patches, pitted surfaces, rolled-in scales, and scratches, respectively), covering multi-scale targets and defects under complex background interference.

As shown in Table 3, when all three modules are integrated, the proposed DIDW-YOLOv11 improves on the baseline in all six defect categories, with crazing increasing from 40.5% to 44.2% and rolled-in scales from 67.1% to 70.2%.

Table 3 also shows that detection performance differs considerably across the six defect categories. These differences stem mainly from the inherent characteristics of the steel surface defects, as well as from the structure and function of the proposed modules in handling different defect types. The NEU-DET dataset includes defects with large variations in size, shape, contrast, and complexity. For example, crazing (Cr) is typically thin, low-contrast, and easily confused with background noise. It is the most challenging category: its baseline mAP@0.5 is only 40.5%, and even with all modules enabled it reaches just 44.2%. In contrast, inclusions (In), patches (Pa), and scratches (Sc) have clearer boundaries, higher contrast, and more distinct textures, resulting in naturally higher detection performance across all models.

The results in Table 3 indicate that the proposed modules address different challenges. The C3k2-DIMB module enhances multi-scale feature extraction, improving detection of large and irregular defects such as patches (Pa) and pitted surfaces (Ps), whose mAP@0.5 rises from 91.0% to 97.4% and from 80.3% to 95.4%, respectively. The IDWFSPPF module strengthens small-target feature fusion, which is effective for rolled-in scales (Rs), boosting its mAP@0.5 from 67.1% to 69.4%. The ADH optimizes shallow feature learning and reduces background interference; used alone, it raises inclusions (In) slightly from 85.9% to 86.1% and leaves crazing (Cr) unchanged at 40.5%, so its benefit on low-contrast defects appears mainly in combination with the other modules. When the three modules are combined, the model achieves the best overall ability, improving the detection of the most challenging categories (Cr, Rs) while maintaining high performance on easier ones. It attains the highest mAP@0.5 for three of the six defect types (Cr, Pa, and Rs) and values close to the best for the remaining three.

In summary, the performance differences among the six defect categories reflect the inherent difficulty of each category and the complementary functions of our modules, rather than a flaw in the model design.

4.5. Comparative Experiments

To further validate the proposed DIDW-YOLOv11 model, we compared it with mainstream detection algorithms and several YOLO-series models on the NEU-DET and GC10-DET [38] datasets. The experimental results are shown in Table 4 and Table 5, respectively.

As shown in Table 4, on the NEU-DET dataset the proposed DIDW-YOLOv11 model achieves a higher mAP@0.5 than all nine compared algorithms, with improvements of 5.5 and 7.4 percentage points over the Faster-RCNN and SSD models, respectively. Within the YOLO series, it improves by 7.7, 5.0, and 7.3 percentage points over the YOLOv5s, YOLOv7, and YOLOv8s models, respectively, and by 6.0, 5.6, and 4.9 percentage points over the newer YOLOv9s, YOLOv10s, and YOLOv11s models. The inference speed of our model reaches 118 FPS, which is second only to YOLOv11s at 124 FPS, so the model combines high accuracy with strong inference speed. Compared with the more recent FD-YOLO11 [38], our model achieves a slightly higher mAP@0.5 (81.5% versus 81.1%) and a higher FPS (118 versus 110), although its recall is lower (75.6% versus 77.7%).
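The accuracy-speed comparison above can be checked with a small sketch that computes the mAP@0.5 gap to each model in Table 4 and lists the models that no other model beats on both mAP@0.5 and FPS (the Pareto front). The helper names are hypothetical; the numbers are copied from Table 4.

```python
from typing import Dict, List, Tuple

# Table 4 (NEU-DET): model -> (mAP@0.5 %, FPS).
table4: Dict[str, Tuple[float, int]] = {
    "Faster-RCNN": (76.0, 18), "SSD": (74.1, 89), "YOLOv5s": (73.8, 116),
    "YOLOv7": (76.5, 64), "YOLOv8": (74.2, 113), "YOLOv9": (75.5, 109),
    "YOLOv10": (75.9, 102), "YOLOv11": (76.6, 124), "FD-YOLO11": (81.1, 110),
    "DIDW-YOLOv11": (81.5, 118),
}


def dominates(a: Tuple[float, int], b: Tuple[float, int]) -> bool:
    """True if a is at least as good as b on both axes and strictly better on one."""
    return a[0] >= b[0] and a[1] >= b[1] and (a[0] > b[0] or a[1] > b[1])


def pareto_front(results: Dict[str, Tuple[float, int]]) -> List[str]:
    """Names of the models that are not dominated by any other model."""
    return sorted(
        name for name, score in results.items()
        if not any(dominates(other, score)
                   for other_name, other in results.items() if other_name != name)
    )


ours = table4["DIDW-YOLOv11"]
map_gain = {name: round(ours[0] - score[0], 1)
            for name, score in table4.items() if name != "DIDW-YOLOv11"}
front = pareto_front(table4)
```

On these numbers only DIDW-YOLOv11 and YOLOv11 lie on the front: YOLOv11 is faster but less accurate, and every other model is both slower and less accurate than DIDW-YOLOv11.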

As shown in Table 5, on the GC10-DET dataset the proposed DIDW-YOLOv11 model achieves the highest recall (69.2%) and mAP@0.5 (72.0%) among the compared methods, together with a competitive precision of 68.0%. In summary, the proposed DIDW-YOLOv11 model presents effective detection capability and stable generalization performance for steel surface defect detection.

5. Conclusions

In this paper, the DIDW-YOLOv11 model is proposed based on the YOLOv11s baseline as a comprehensive and targeted improvement scheme for efficient steel surface defect detection. First, the C3k2-DIMB module uses multi-scale dynamic depthwise convolutions to adaptively extract features of irregular, tiny, and multi-scale defects without introducing excessive parameters. Second, the IDWFSPPF module adopts a mixed pooling strategy to fuse global and local information, which strengthens multi-scale feature representation while maintaining efficiency. Third, the auxiliary detection head uses extra supervision for shallow features to suppress background noise and reduce false detections, which alleviates gradient vanishing in deep network training and facilitates more effective shallow feature learning. Experimental results on the NEU-DET and GC10-DET datasets demonstrate that DIDW-YOLOv11 outperforms mainstream comparative models, achieving mAP@0.5 scores of 81.5% and 72.0%, respectively, representing 4.9% and 3.8% improvements over the baseline, while maintaining high inference speeds of 118 FPS and 159 FPS. These results demonstrate the detection accuracy and generalization ability of the DIDW-YOLOv11 model. Although the proposed model performs well on the steel surface defect types studied here, it still exhibits certain limitations in other defect detection tasks. In future work, we will further explore more robust and generalizable surface defect detection models to improve adaptability in diverse industrial scenarios.

Author Contributions

Conceptualization, J.J. and Z.X.; methodology, J.J.; software, Y.Z.; validation, J.J., Y.Z. and Z.X.; formal analysis, Y.Z.; investigation, J.J.; resources, C.W.; data curation, J.J.; writing—original draft preparation, J.J.; writing—review and editing, Z.X.; visualization, Y.Z.; supervision, C.W.; project administration, J.J.; funding acquisition, C.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research is funded by the National Natural Science Foundation of China (grant number 62202505) and the College Students’ Innovation and Entrepreneurship Training Program of China (grant number S202410538030X).

Data Availability Statement

The experiments in this article used the publicly available NEU-DET and GC10-DET datasets. The code is available at https://github.com/jiangjiajun789/DIDW-YOLOv11ss.git, accessed on 1 May 2026.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

1. Wang, X.; Wang, Z.; Guo, C.; Han, Y.; Zhao, J.; Lu, N.; Tang, H. Application and Prospect of New Steel Corrugated Plate Technology in Infrastructure Fields. IOP Conf. Ser. Mater. Sci. Eng. 2020, 741, 012099.
2. Xiong, Z.; Li, Q.; Mao, Q.; Zou, Q. A 3D Laser Profiling System for Rail Surface Defect Detection. Sensors 2017, 17, 1791.
3. Amin, D.; Akhter, S. Deep learning-based defect detection system in steel sheet surfaces. In 2020 IEEE Region 10 Symposium (TENSYMP); IEEE: New York, NY, USA, 2020; pp. 444–448.
4. Luo, S.; Hou, J.; Zheng, B.; Zhong, X.; Liu, P. Research on edge detection algorithm of work piece defect in machine vision detection system. In 2022 IEEE 6th Information Technology and Mechatronics Engineering Conference (ITOEC); IEEE: New York, NY, USA, 2022; pp. 1231–1235.
5. Gaidhane, V.H.; Rani, A.; Singh, V. An improved edge detection approach and its application in defect detection. IOP Conf. Ser. Mater. Sci. Eng. 2017, 244, 012017.
6. Cao, G.; Ruan, S.; Peng, Y.; Huang, S.; Kwok, N. Large-complex-surface defect detection by hybrid gradient threshold segmentation and image registration. IEEE Access 2018, 6, 36235–36246.
7. Neogi, N.; Mohanta, D.K.; Dutta, P.K. Defect detection of steel surfaces with global adaptive percentile thresholding of gradient image. J. Inst. Eng. Ser. B 2017, 98, 557–565.
8. Wang, H.; Gu, J.; Wang, S. An effective intrusion detection framework based on SVM with feature augmentation. Knowl.-Based Syst. 2017, 136, 130–139.
9. Zhiqiang, W.; Jun, L. A review of object detection based on convolutional neural network. In 2017 36th Chinese Control Conference (CCC); IEEE: New York, NY, USA, 2017; pp. 11104–11109.
10. Pernkopf, F. Detection of surface defects on raw steel blocks using Bayesian network classifiers. Pattern Anal. Appl. 2004, 7, 333–342.
11. Zhao, C.; Shu, X.; Yan, X.; Zuo, X.; Zhu, F. RDD-YOLO: A modified YOLO for detection of steel surface defects. Measurement 2023, 214, 112776.
12. Girshick, R. Fast r-cnn. In 2015 IEEE International Conference on Computer Vision (ICCV); IEEE: New York, NY, USA, 2015; pp. 1440–1448.
13. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In 2017 IEEE International Conference on Computer Vision (ICCV); IEEE: New York, NY, USA, 2017; pp. 2961–2969.
14. Akhyar, F.; Lin, C.Y.; Muchtar, K.; Wu, T.Y.; Ng, H.F. High efficient single-stage steel surface defect detection. In 2019 16th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS); IEEE: New York, NY, USA, 2019; pp. 1–4.
15. Feng, X.; Gao, X.; Luo, L. X-SDD: A new benchmark for hot rolled steel strip surface defects detection. Symmetry 2021, 13, 706.
16. Liu, X.; Gao, J. Surface defect detection method of hot rolling strip based on improved SSD model. In Database Systems for Advanced Applications. DASFAA 2021 International Workshops; Springer International Publishing: Cham, Switzerland, 2021; pp. 209–222.
17. Yang, Z.; Liu, Y. A steel surface defect detection method based on improved RetinaNet. Sci. Rep. 2025, 15, 6045.
18. Sharma, M.; Lim, J.-T.; Chae, Y.-G. Steel Surface Defect Detection Using the RetinaNet Detection Model. Int. J. Internet Broadcast. Commun. 2022, 14, 136–146.
19. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2016; pp. 779–788.
20. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767.
21. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934.
22. Li, M.; Wang, H.; Wan, Z. Surface defect detection of steel strips based on improved YOLOv4. Comput. Electr. Eng. 2022, 102, 108208.
23. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W. YOLOv6: A single-stage object detection framework for industrial applications. arXiv 2022, arXiv:2209.02976.
24. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2023; pp. 7464–7475.
25. Reis, D.; Kupec, J.; Hong, J.; Daoudi, A. Real-time flying object detection with YOLOv8. arXiv 2023, arXiv:2305.09972.
26. Wang, C.Y.; Yeh, I.H.; Mark Liao, H.Y. Yolov9: Learning what you want to learn using programmable gradient information. In Computer Vision – ECCV 2024; Springer Nature: Cham, Switzerland, 2024; pp. 1–21.
27. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. Yolov10: Real-time end-to-end object detection. Adv. Neural Inf. Process. Syst. 2024, 37, 107984–108011.
28. Khanam, R.; Hussain, M. Yolov11: An overview of the key architectural enhancements. arXiv 2024, arXiv:2410.17725.
29. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. Yolox: Exceeding yolo series in 2021. arXiv 2021, arXiv:2107.08430.
30. Wen, X.; Shan, J.; He, Y.; Song, K. Steel surface defect recognition: A survey. Coatings 2022, 13, 17.
31. Zhao, W.; Chen, F.; Huang, H.; Li, D.; Cheng, W. A new steel defect detection algorithm based on deep learning. Comput. Intell. Neurosci. 2021, 2021, 5592878.
32. Wang, L.; Liu, X.; Ma, J.; Su, W.; Li, H. Real-time steel surface defect detection with improved multi-scale YOLO-v5. Processes 2023, 11, 1357.
33. Song, H. RSTD-YOLOv7: A steel surface defect detection based on improved YOLOv7. Sci. Rep. 2025, 15, 19649.
34. Song, X.; Cao, S.; Zhang, J.; Hou, Z. Steel surface defect detection algorithm based on YOLOv8. Electronics 2024, 13, 988.
35. Zheng, T.; Yu, L.; Shi, Y.; Niu, F. A lightweight steel surface defect detection network based on YOLOv9. AIP Adv. 2025, 15, 055317.
36. Haoyan, H.; Jinwu, T.; Haibin, W.; Xinyun, L. Ead-yolov10: Lightweight steel surface defect detection algorithm research based on yolov10 improvement. IEEE Access 2025, 13, 55382–55397.
37. Yang, L.; Li, Z.; Hu, X.; Shao, M.; Zhao, Y.; Zhou, C. CTC-YOLO: An improved YOLOv11 algorithm for steel surface defect detection. Eng. Res. Express 2025, 7, 035265.
38. Dang, Z.; Wang, X. FD-YOLO11: A Feature-Enhanced Deep Learning Model for Steel Surface Defect Detection. IEEE Access 2025, 13, 63981–63993.
39. Huang, B.; Wang, M. Steel Surface Defect Detection Algorithm Based on DM-YOLOv11. In 2025 5th International Conference on Artificial Intelligence and Industrial Technology Applications (AIITA); IEEE: New York, NY, USA, 2025; pp. 848–853.
40. Liang, E. Steel surface defect detection algorithm based on improved YOLO11. In ISCCN '25: Proceedings of the 2025 4th International Conference on Intelligent Systems, Communications and Computer Networks; Association for Computing Machinery: New York, NY, USA, 2025; pp. 185–191.
41. Zong, G.; Shan, W. Mcfd: A Multi-Backbone Dynamic Convolution Fusion and Focus Diffusion Pyramid Network for Infrared Small Target Detection. Available online: https://ssrn.com/abstract=5266815 (accessed on 2 June 2026).

Figure 1. DIDW-YOLOv11 network structure.

Figure 2. Structures of C3k2-DIMB. (a) C3k2, True. (b) C3k2, False. (c) In the C3k-DIMB module, the ordinary convolutional layers in the initial bottleneck block are replaced with DIMB modules. (d) The C3k2-DIMB module consists of multiple C3k-DIMBs, each of which integrates the Dynamic Inception Mixed Block (DIMB).

Figure 3. Structure of the DIMB.

Figure 4. The structure of DynamicInceptionDWConv2d.

Figure 5. Comparison of IDWFSPPF and SPPF. (a) SPPF, (b) IDWFSPPF.

Figure 6. The auxiliary detection head module. The feature map F1 on the left is used to compute the coarse loss, Loss_{coarse}, while the feature map F2 on the right is used to compute the fine loss, Loss_{fine}.

Figure 7. NEU-DET dataset.

Figure 8. P-R curve of the YOLOv11 and DIDW-YOLOv11.

Figure 9. Visualization comparison of the detection outcomes.

Table 1. Details of the NEU-DET dataset.

Defect Class	Images	Defects
Crazing	300	689
Patches	300	881
Inclusion	300	1011
Pitted surface	300	432
Rolled-in scale	300	628
Scratches	300	548
Total	1800	4189

Table 2. The results of ablation experiments conducted on the NEU-DET dataset.

Model	YOLO11s	C3k2-DIMB	IDWFSPPF	ADH	P (%)	R (%)	mAP@0.5 (%)	FPS
1	√				71.5	70.7	76.6	124
2	√	√			76.6	71.1	77.9	93
3	√		√		70.6	73.2	78.3	127
4	√			√	80.5	71.1	79.5	151
5	√	√	√		70.6	73.4	78.5	93
6	√	√		√	78.3	74.8	79.9	83
7	√		√	√	73.1	73.2	78.0	130
8	√	√	√	√	77.7	75.6	81.5	118

√ denotes the corresponding module is embedded. Bold formatting denotes the optimal experimental value for each evaluation metric.

Table 3. mAP@0.5 values (%) of the six defect categories.

YOLO11s	C3k2-DIMB	IDWFSPPF	ADH	Cr	In	Pa	Ps	Rs	Sc
√				40.5	85.9	91.0	80.3	67.1	94.8
√	√			33.9	87.3	97.4	95.4	61.5	92.2
√		√		41.6	80.7	94.0	88.9	69.4	95.3
√			√	40.5	86.1	96.3	93.2	64.3	96.4
√	√	√	√	44.2	87.2	97.6	93.9	70.2	95.7

√ denotes the corresponding module is embedded. Bold formatting denotes the optimal experimental value for each evaluation metric.

Table 4. Comparison result of different models on the NEU-DET.

Experiments	P (%)	R (%)	mAP@0.5 (%)	FPS
Faster-RCNN	70.9	71.6	76.0	18
SSD	75.6	66.3	74.1	89
YOLOv5s	66.8	71.3	73.8	116
YOLOv7	71.7	70.7	76.5	64
YOLOv8	67.7	68.8	74.2	113
YOLOv9	68.1	73.1	75.5	109
YOLOv10	70.2	70.5	75.9	102
YOLOv11	71.8	70.5	76.6	124
FD-YOLO11	74.0	77.7	81.1	110
DIDW-YOLOv11 (ours)	77.7	75.6	81.5	118

Bold formatting denotes the optimal experimental value for each evaluation metric.

Table 5. Comparison result of different models on the GC10-DET.

Experiments	P (%)	R (%)	mAP@0.5 (%)	FPS
Faster-RCNN	60.2	60.6	64.8	32
SSD	56.0	59.6	57.1	138
YOLOv5s	62.1	58.0	68.6	179
YOLOv7	66.6	42.5	63.6	86
YOLOv8s	61.6	59.4	68.2	185
YOLOv9s	70.9	54.3	67.4	177
YOLOv10s	61.0	59.1	64.6	165
YOLOv11s	69.4	53.7	67.2	183
FD-YOLOv11	68.9	68.7	71.3	161
DIDW-YOLOv11 (ours)	68.0	69.2	72.0	159

Bold formatting denotes the optimal experimental value for each evaluation metric.

