Article

Efficient Steel Surface Defect Detection via a Lightweight YOLO Framework with Task-Specific Knowledge-Guided Optimization

School of Computer Science and Technology, Zhejiang Sci-Tech University, Hangzhou 310018, China
* Authors to whom correspondence should be addressed.
Electronics 2025, 14(10), 2029; https://doi.org/10.3390/electronics14102029
Submission received: 8 April 2025 / Revised: 6 May 2025 / Accepted: 13 May 2025 / Published: 16 May 2025

Abstract:
Defect detection is a critical task in industrial manufacturing, playing a vital role in achieving automation, improving product quality, and ensuring operational safety. Traditional methods, however, face considerable limitations in terms of accuracy and efficiency. To address these challenges, we propose DCA-YOLO, a lightweight model for steel surface defect detection optimized based on task-specific knowledge. Specifically, our model incorporates a Dynamic Snake Convolution (DSConv) module to capture subtle linear features in challenging defect categories, a context-guided module to leverage contextual information for detecting clustered defects, and an Adaptive Spatial Feature Fusion (ASFF) mechanism to efficiently merge features across scales. The experimental results demonstrate that even with a nanoscale architecture (4.3 million parameters and 9.4 GFLOPs), the enhanced model exhibits marked improvements in detection accuracy and robustness, with mAP50 increasing by 4.6% and mAP50-95 by 7.7%. These findings not only offer a better solution for steel surface defect detection, but also provide new theoretical insights and practical experience for the advancement of industrial inspection technologies. In the future, DCA-YOLO is expected to be applied across a wider range of industrial detection scenarios, further driving progress in the field.

1. Introduction

Surface defect detection is a critical process for ensuring product quality and safe operation in industrial manufacturing and infrastructure maintenance. In recent years, the rapid development of deep learning technologies has led to significant advances in the field of defect detection, with researchers achieving numerous important results. For example, steel surface defect recognition on the Severstal dataset [1], product surface defect detection [2], PCB defect detection [3], and infrastructure defect detection [4,5] have all demonstrated the considerable potential of modern deep learning approaches.
Existing methods for steel surface defect detection still have many shortcomings and do not satisfy the rigorous demands of practical industrial applications. First, traditional approaches based on image processing and hand-crafted feature extraction often lack the robustness and adaptability needed to handle complex defect morphologies and diverse background interference, resulting in low detection accuracy and high false alarm rates. Second, although deep learning-based methods have somewhat alleviated these issues, most current models still exhibit significant limitations in multi-scale feature fusion and in capturing linear features, making it difficult to accurately identify clustered, low-contrast, and subtle defects. Moreover, some models incorporate numerous computationally intensive modules in an effort to achieve higher accuracy, such as attention modules and squeeze-and-excitation (SE) modules [6,7], whose added complexity and inference latency render them unsuitable for real-time inspection environments.
This study introduces a lightweight YOLO-based steel surface defect detection model referred to as DCA-YOLO. To overcome the shortcomings of traditional deep learning models in multi-scale feature extraction and in capturing fine linear features, we adopted a modular design in our experimental methodology. First, we incorporate a DSConv module to optimize the feature extraction stage [8]; second, we integrate a Context-Guided module to improve the ability to capture local features under complex backgrounds by leveraging surrounding contextual information [9]; finally, in the detection head, we employ an ASFF mechanism to perform adaptive weighted fusion of feature maps from different scales, thereby filtering out spatially conflicting information and ensuring the scale invariance of features [10].
Building upon these experimental methods, the main contributions of the DCA-YOLO are as follows:
  • For steel surface defects exhibiting linear characteristics (e.g., scratches), we optimized a dedicated convolution module (DSConv) that employs a dynamic convolution strategy for continuous sampling and linear feature extraction, thereby significantly enhancing the detection performance for these difficult-to-detect defects by up to 15%.
  • In complex environments where different defect types may occur simultaneously and often cluster in the same region, the introduced Context-Guided module effectively improves local feature extraction by leveraging contextual information, resulting in enhanced identification of densely distributed defects.
  • In the detection head, the use of the ASFF mechanism ensures scale invariance and enables efficient adaptive weighted fusion of multi-scale features, further boosting overall detection performance.
  • By employing a nanoscale YOLO model as the baseline, our modular optimizations not only reduce computational complexity but also achieve detection performance comparable to the large-scale baseline model, thereby validating the efficiency of combining lightweight design with task-specific optimization.
The structure of this paper is outlined as follows: Section 2 reviews related research on defect detection, emphasizing advancements in both conventional approaches and deep learning techniques for industrial defect identification. Section 3 elaborates on the proposed methodology, covering module architecture, theoretical foundations, and implementation specifics. Section 4 presents the experimental findings derived from comparative analyses and ablation studies, confirming the effectiveness of our approach for detecting steel surface defects. Section 5 examines the limitations of existing techniques and provides recommendations for future enhancements. Finally, Section 6 summarizes the main contributions of this work and outlines the potential applications of our research in industrial automated inspection.

2. Related Work

2.1. CNNs

In recent decades, Convolutional Neural Networks (CNNs) have undergone a revolutionary development in the field of computer vision, evolving from a nascent concept to a widely adopted technology [11]. As early as the 1980s, LeCun et al. integrated convolution operations with neural networks to demonstrate their potential in tasks such as handwritten digit recognition [12]. However, due to limitations in hardware and data availability, CNNs did not gain widespread popularity in their early stages. It was not until 2012, when AlexNet attained remarkable success in the ILSVRC (ImageNet Large-Scale Visual Recognition Challenge) [13], that the significant potential of CNNs was widely recognized. Subsequently, a series of improved architectures such as VGG, GoogLeNet, and ResNet [14,15,16] were proposed, continuously pushing the performance boundaries in visual recognition, including object identification, detection, and various other computer vision applications. The core idea behind CNNs is to extract local spatial features using learnable convolutional kernels and to progressively abstract and fuse these features across layers, endowing the network with powerful representation capabilities for handling complex visual scenes. Today, CNNs are among the most widely used models in deep learning, providing a solid technical foundation for numerous downstream tasks.

2.2. YOLO Model

YOLO models, owing to their lightweight network architecture and excellent detection performance, have been widely adopted in industry [17]. In recent years, YOLO models have achieved remarkable success across various detection scenarios. For instance, Zhukov et al. attained superior results in railway defect detection by incorporating an attention mechanism [18], while Yuan et al. effectively improved the accuracy of PCB defect detection using the LW-YOLO model [19]. Moreover, LGR-Net was introduced to identify defects in the pressure plates of elevator guide rails [20]. These examples demonstrate that the YOLO series and its enhanced versions possess extensive applicability and outstanding performance in industrial defect detection, and that further performance improvements can be realized through the optimization of modules and architectural refinements.
Some existing studies have sought to enhance model performance by incorporating computationally intensive new modules, such as Dynamic Head [21], Dysample [22], and CSWinTransformer [23]. Although these modules increase computational complexity to enhance the model’s adaptability to non-linear features, they often overlook the specific characteristics of the target tasks and focus excessively on optimizing the model itself. In contrast, this study adopts a nanoscale YOLO model as the baseline and concentrates on the in-depth exploitation of the inherent features in the recognition task. Targeted modules are then designed to optimize the model’s performance. This approach not only effectively limits the consumption of computational resources but also achieves a breakthrough in performance by fully leveraging task-specific features, ultimately outperforming traditional large-scale YOLO models.

2.3. Surface Defect Detection

Defect detection is an important process in industrial manufacturing, as automation not only significantly reduces labor costs but also minimizes the potential safety risks associated with manual inspections. In particular, detecting defects on steel surfaces is critically important, since the timely and accurate identification of defects can effectively improve product quality and prevent subsequent safety issues and economic losses. Traditional surface defect detection methods are often based on image analysis and machine learning techniques, such as template matching [24] and edge-based methods [25], to describe and classify defects in the original images. Although these approaches can achieve acceptable results in specific scenarios, their limited robustness and adaptability often prevent them from coping with variable industrial environments. Nowadays, the use of deep learning techniques with enhanced generalization capabilities has become a major focus in defect detection [26]. For example, end-to-end defect detection networks (EDDN) [27,28] have been proposed; however, challenges such as insufficient accuracy still persist [29].
To further enhance the performance of steel surface defect detection, this paper addresses the shortcomings of the classical YOLO model in multi-scale feature fusion and fine linear feature recognition by designing and incorporating several modules—namely, the Context-Guided module, DSConv, and Detect_ASFF. The improved model achieves significant performance gains in steel surface defect detection tasks, and it is expected to provide a more precise and efficient technical solution for industrial automated inspection.

3. Method

This work adopts YOLOv8n as the foundational model, refining it to develop DCA-YOLO. To improve multi-scale feature extraction and fusion, we integrate three task-specific modules (DSConv, the Context-Guided module, and Detect_ASFF), strengthening the model's representation ability across different scales.
As illustrated in Figure 1, DCA-YOLO introduces modifications to the Backbone, Neck, and Head of the YOLOv8n network architecture. First, the Backbone integrates the Context-Guided module to incorporate surrounding contextual information, thereby enhancing the detection of group-based defects. Next, DSConv is employed in the Neck to improve the capture of local complex defects, particularly fine linear features. Finally, the detection head incorporates Detect_ASFF for adaptive multi-scale feature fusion, ensuring optimal detection performance across different resolutions. Overall, these enhancements retain the efficiency of the YOLO framework while specifically optimizing multi-scale feature extraction and fusion.

3.1. Context-Guided Module

Guided by domain knowledge regarding the distribution patterns of steel surface defects, we observe that defects often exhibit group-based and regionally clustered characteristics. Influenced by environmental factors such as humidity, temperature fluctuations, and chemical corrosion, defects of the same type frequently concentrate in localized areas. Relying solely on convolutional features makes it challenging to comprehensively capture the morphology and spatial distribution of these defects. To address this, DCA-YOLO incorporates the Context-Guided module into its backbone, optimizing feature extraction and recognition by leveraging surrounding contextual information.
On one hand, the Context-Guided module utilizes dilated convolution to expand its receptive field, allowing it to perceive the clustered distribution of group-based defects within local regions. On the other hand, it integrates multi-scale feature information, enabling more precise localization and differentiation of recurrent and densely distributed defects during detection.
As depicted in Figure 2, the Context-Guided module first processes the input features through a 1 × 1 Conv layer to tune the channel configuration and reduce redundant features. The feature map is subsequently separated into two parallel paths: the first branch employs standard convolution to extract local features, capturing the fine structure of defects, while the second branch utilizes dilated convolution to achieve a larger receptive field [30], thereby incorporating surrounding contextual information. On this basis, the results from each branch are fused through feature concatenation (Concat) and batch normalization (BN), forming a comprehensive feature representation denoted $F_{joi}$, which integrates both local details and contextual cues. Finally, the fused feature map undergoes global average pooling and several fully connected layers to further refine the context-enhanced representation. Through this branched feature extraction and fusion strategy, the Context-Guided module not only preserves local defect characteristics but also learns surrounding defect information, effectively integrating contextual features from the current region. This strengthens DCA-YOLO's ability to capture multi-scale features.
The local feature extraction function, $f_{loc}$, is implemented using a stack of 3 × 3 standard convolution kernels to capture fine-grained features from the input image. Given an input feature map $X \in \mathbb{R}^{H \times W}$, the local feature at position $(i, j)$ is computed as follows:
$$F_{loc}(i, j) = \sum_{p=-k}^{k} \sum_{q=-k}^{k} X(i+p,\, j+q) \cdot W_{loc}(p, q) + b_{loc},$$
where $W_{loc}$ represents the local convolution kernel, $b_{loc}$ is the bias term, and $k$ is half the kernel size.
In contrast, surrounding feature extraction, denoted $f_{sur}$, is achieved using dilated convolution. Dilated convolution introduces a dilation rate $r$, which selectively skips certain pixels during convolution, allowing a broader receptive field while maintaining the parameter count. The computation is expressed as follows:
$$F_{sur}(i, j) = \sum_{p=-k}^{k} \sum_{q=-k}^{k} X(i + r \cdot p,\, j + r \cdot q) \cdot W_{sur}(p, q) + b_{sur},$$
where $W_{sur}$ denotes the contextual convolution kernel, $b_{sur}$ is the corresponding bias term, and $r$ is the dilation rate, which enables sparse sampling of the surrounding steel surface features. These features are then concatenated and processed through additional operations to assist in identifying defects in the current region.
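To make the branched design concrete, the following PyTorch sketch assembles a 1 × 1 channel-reduction layer, the local and surrounding branches, and a global-context refinement into one block. It is a minimal illustration, assuming depthwise 3 × 3 convolutions, a dilation rate of 2, and a squeeze-style gate in place of the fully connected refinement; the channel sizes and layer names are our own assumptions rather than the exact DCA-YOLO configuration.

```python
import torch
import torch.nn as nn

class ContextGuidedBlock(nn.Module):
    """Sketch of the Context-Guided module: a local branch (3x3 conv) and a
    surrounding branch (3x3 dilated conv) are concatenated into F_joi, then
    refined by a global-context gate. Channel counts are illustrative."""

    def __init__(self, in_ch: int, out_ch: int, dilation: int = 2, reduction: int = 16):
        super().__init__()
        half = out_ch // 2
        self.reduce = nn.Conv2d(in_ch, half, kernel_size=1)            # channel tuning
        self.f_loc = nn.Conv2d(half, half, 3, padding=1, groups=half)  # local features
        self.f_sur = nn.Conv2d(half, half, 3, padding=dilation,
                               dilation=dilation, groups=half)         # surrounding context
        self.bn_act = nn.Sequential(nn.BatchNorm2d(out_ch), nn.PReLU(out_ch))
        # Global average pooling + 1x1 convs stand in for the FC refinement.
        self.glo = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(out_ch, out_ch // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch // reduction, out_ch, 1), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.reduce(x)
        f_joi = self.bn_act(torch.cat([self.f_loc(x), self.f_sur(x)], dim=1))
        return f_joi * self.glo(f_joi)  # context-enhanced representation
```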

3.2. DSConv

In the experimental outcomes of the baseline and comparative models, we observed that among the six defect types, the recognition accuracy for crazing was the lowest, with an mAP50 of only 0.488, roughly a third lower than the overall mAP50 of 0.724. In contrast, patches, which exhibited the best recognition performance, achieved a high mAP50 of 0.916. Motivated by domain knowledge that pipeline-like defects such as crazing and scratches exhibit strong linear and elongated structural characteristics, we recognize that existing models, while effective at extracting cluster-like features, often fail to capture and represent such linear patterns accurately. To address this limitation, we introduced the DSConv module, a specialized deformable convolution technique [31] explicitly designed to enhance the modeling of linear features. DSConv dynamically adjusts the sampling positions of the convolution kernels while maintaining continuity, enabling the convolution operation to adaptively capture the irregular, pipeline-like shapes of defects. This significantly improves the model's capacity to perceive and represent such features.
Traditional deformable convolution uses a simple convolution layer to compute the offset $\Delta p_n$ that controls the convolution operation on feature maps, expressed as follows:
$$y(p_0) = \sum_{p_n \in \mathcal{R}} w(p_n) \cdot x(p_0 + p_n + \Delta p_n),$$
where $p_0$ represents the coordinate of a specific position in the output feature map, $\mathcal{R}$ is the predefined convolution sampling region, $w(p_n)$ denotes the convolution kernel weights, and $\Delta p_n$ is the dynamically learned offset.
As shown in Figure 3, traditional DConv exhibits a relatively scattered and discontinuous distribution when learning offsets. This “fragmented” sampling approach struggles to effectively capture continuous pipeline-shaped defects and is susceptible to background noise and texture interference, leading to suboptimal extraction of fine linear features. In contrast, DSConv introduces a targeted design for linear and snake-like structures in its offset mechanism, ensuring that the convolution kernel’s sampling locations are more concentrated and continuously distributed along the defect body. This enables DSConv to accurately capture the edges and orientations of pipeline-shaped features. Furthermore, this offset strategy demonstrates strong robustness even in environments with complex noise, effectively reducing background interference during feature extraction. As a result, DSConv enhances detection precision and stability in steel surface defect detection, particularly for linear defects such as scratches.
In DSConv, the offsets are learned in an "incremental" manner along the linear direction. Specifically, the position of the $(i+c)$-th grid point depends on the previous grid point and is refined accordingly:
$$K_{i+c} = \left( x_i + \sum_{n=1}^{c} \delta_n^x,\; y_{i+c} \right), \quad \delta_n^x \in \{-1, 0, 1\},$$
where $K_{i+c}$ denotes a coordinate pair, with the first and second elements representing the x-axis and y-axis positions, respectively. The dynamic offset $\delta_n^x$ at each step $n$ is an integer constrained to $-1$, $0$, or $1$, ensuring smooth spatial variation without producing discrete or irregular distributions. The summation of the $\delta_n^x$ terms along the x-axis accumulates the deviations over each incremental step, allowing the kernel to maintain a quasi-linear structure while flexibly adapting to the shape of pipeline-shaped defects. (The example in this article refers to the y-axis, but a similar process applies when adapting along the x-axis.)
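The incremental rule can be illustrated with a short sketch that turns per-step offsets into kernel sampling coordinates. The helper below is hypothetical (offset prediction and the bilinear sampling of the feature map are omitted); it only shows how the cumulative sum keeps consecutive sampling points connected.

```python
import torch

def snake_kernel_coords(x0: int, y0: int, deltas: torch.Tensor) -> torch.Tensor:
    """Sketch of DSConv's incremental offset rule: each grid point advances one
    row along the y-axis while shifting at most one pixel in x relative to its
    predecessor, so the sampled path stays continuous. `deltas` holds learned
    per-step offsets, assumed already constrained to {-1, 0, 1}."""
    xs = x0 + torch.cumsum(deltas, dim=0)       # x_i + sum_{n=1..c} delta_n^x
    ys = y0 + torch.arange(1, len(deltas) + 1)  # y advances one row per step
    return torch.stack([xs, ys], dim=1)         # the K_{i+c} coordinate pairs

# Toy example: a 4-step kernel that drifts right, holds, then drifts left.
coords = snake_kernel_coords(5, 5, torch.tensor([1, 0, -1, -1]))
print(coords)  # tensor([[6, 6], [6, 7], [5, 8], [4, 9]])
```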
As depicted in Figure 4, the DSConv module operates based on a structured principle. This module generates multiple morphological convolutional kernel templates to analyze the structural features of the target from various directions. During this process, the input features are divided into several sub-regions, with each one undergoing DSConv operations independently.
Following each convolution operation, the module applies random feature dropout to mitigate the risk of overfitting. Additionally, BN is employed to balance feature distribution and stabilize the training process. Finally, convolution results from multiple directions are integrated to achieve feature fusion by summarizing key standard features. This multi-directional and multi-morphological convolutional kernel template strategy effectively captures critical characteristics of steel surface defects across different scales and orientations, thereby enhancing overall detection performance.

3.3. Detect_ASFF

Traditional YOLO models employ pyramid feature representations to handle scale variations in object detection [32]. However, in practical applications, inconsistencies among features of different scales often lead to suboptimal recognition performance. To address this issue, the DCA-YOLO model incorporates the ASFF mechanism into an improved detection head. The newly introduced Detect_ASFF module suppresses these inconsistencies by filtering out spatially conflicting information across scales, ensuring the scale invariance of the features. More importantly, this module enhances detection accuracy at a low computational cost, preserving the overall efficiency of DCA-YOLO.
As illustrated in Figure 5, the Detect_ASFF module achieves effective filtering of conflicting information through adaptive spatial feature fusion across feature maps of different resolutions, thereby ensuring scale invariance of the features. Specifically, multi-scale feature maps are first unified to the same resolution, and then, at each spatial location, features from different layers are adaptively weighted. Features that contain conflicting information are suppressed at that location, while those with higher discriminability or distinctiveness are assigned higher weights. This efficient approach allows for spatial filtering of multi-scale features without significantly increasing the computational cost.
In the resizing stage, the module adopts different approaches according to the sampling scale. Specifically, for 1/2 downsampling, a 3 × 3 convolutional layer with a stride of 2 is used, which not only reduces the spatial dimensions but also captures local details via the convolution kernel, thereby preserving effective features to the greatest extent. For 1/4 downsampling, due to the need for a substantial reduction in the feature map size, a single convolution may lead to significant information loss. Consequently, after a 3 × 3 convolution with a stride of 2, an additional max pooling operation with a stride of 2 is applied to further enhance salient features and suppress noise. For the upsampling process, a 1 × 1 convolution with a stride of 1 is used for channel adjustment, followed by bilinear interpolation to achieve the target resolution. The flexible combination of these scaling techniques allows the network to perform fine-grained scale adjustments at different levels, thereby resulting in a refined, enriched collection of features for the ensuing multi-scale fusion.
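The three rescaling paths can be written down directly. The sketch below assumes plain PyTorch layers and omits normalization and activations for brevity; it is an illustration of the strategy, not the exact Detect_ASFF implementation.

```python
import torch.nn as nn
import torch.nn.functional as F

class Downsample2x(nn.Module):
    """1/2 downsampling: a stride-2 3x3 conv reduces resolution while its
    kernel still captures local detail."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1)
    def forward(self, x):
        return self.conv(x)

class Downsample4x(nn.Module):
    """1/4 downsampling: a stride-2 conv followed by stride-2 max pooling,
    so salient features survive the aggressive size reduction."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
    def forward(self, x):
        return self.pool(self.conv(x))

class Upsample2x(nn.Module):
    """Upsampling: a 1x1 conv adjusts channels, then bilinear interpolation
    restores the target resolution."""
    def __init__(self, in_ch, out_ch, scale=2):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 1)
        self.scale = scale
    def forward(self, x):
        return F.interpolate(self.conv(x), scale_factor=self.scale,
                             mode="bilinear", align_corners=False)
```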
Regarding adaptive feature fusion, the following equation is used:
$$y_{ij}^{l} = \alpha_{ij}^{l} \cdot x_{ij}^{1 \to l} + \beta_{ij}^{l} \cdot x_{ij}^{2 \to l} + \gamma_{ij}^{l} \cdot x_{ij}^{3 \to l}$$
Here, $y_{ij}^{l}$ denotes the feature vector at spatial position $(i, j)$ of the $l$-th output feature map after fusion; $x_{ij}^{1 \to l}$, $x_{ij}^{2 \to l}$, and $x_{ij}^{3 \to l}$ represent the feature vectors at the same position $(i, j)$ from the three input feature maps, resized from their respective scales to level $l$. The adaptive weights $\alpha_{ij}^{l}$, $\beta_{ij}^{l}$, and $\gamma_{ij}^{l}$ are learned to measure the spatial importance of each feature at that location. These weights are obtained via a conventional convolutional network followed by a softmax activation function, ensuring that $\alpha_{ij}^{l} + \beta_{ij}^{l} + \gamma_{ij}^{l} = 1$:
$$\alpha_{ij}^{l} = \frac{e^{\lambda_{\alpha,ij}^{l}}}{e^{\lambda_{\alpha,ij}^{l}} + e^{\lambda_{\beta,ij}^{l}} + e^{\lambda_{\gamma,ij}^{l}}}$$
Unlike direct application of softmax to the input features, the softmax in our method operates on the learnable λ parameters generated via convolution, allowing the model to flexibly adjust the degree of emphasis or suppression for each feature map based on the task requirements. This design ensures that the fusion behavior is not rigid but dynamically optimized during training through backpropagation, enhancing robustness to feature variations across different scales.
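A compact sketch of this weighting scheme is shown below, assuming the three inputs have already been resized to a common shape and channel count. The 1 × 1 convolutions that produce the λ maps follow the design of the original ASFF paper [10] and are assumptions, not necessarily the exact layers in Detect_ASFF.

```python
import torch
import torch.nn as nn

class ASFFFusion(nn.Module):
    """Sketch of adaptive spatial feature fusion for one output level: per-level
    lambda maps are predicted by 1x1 convs, normalized by a softmax across
    levels, and used as spatial weights alpha, beta, gamma."""

    def __init__(self, channels: int, compress: int = 16):
        super().__init__()
        self.weight_convs = nn.ModuleList(
            [nn.Conv2d(channels, compress, 1) for _ in range(3)])  # lambda maps
        self.weight_levels = nn.Conv2d(compress * 3, 3, 1)

    def forward(self, x1, x2, x3):
        lam = torch.cat([conv(x) for conv, x in
                         zip(self.weight_convs, (x1, x2, x3))], dim=1)
        # Softmax over the level dimension yields alpha, beta, gamma with
        # alpha + beta + gamma = 1 at every spatial position (i, j).
        w = torch.softmax(self.weight_levels(lam), dim=1)
        a, b, g = w[:, 0:1], w[:, 1:2], w[:, 2:3]
        return a * x1 + b * x2 + g * x3
```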

4. Experiment

4.1. Experiment Details

In this experiment, we used 200 × 200 pixel images as the dataset format. The model was trained with the Adam optimizer using a learning rate of 0.01 and a weight decay of 0.0005. Training was performed for 200 epochs with a batch size of 16. The training computations were performed on an NVIDIA GeForce RTX 3050 Mobile, with the operating environment configured with CUDA 12.2 and torch 2.6.0. Moreover, we employed Mosaic data augmentation during training, which stitches four images into one [33].
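For reference, a hypothetical training call reproducing these settings is sketched below, assuming the Ultralytics YOLO interface; the model and dataset configuration files are placeholders, not files shipped with this paper.

```python
from ultralytics import YOLO

# Hypothetical reproduction of the reported training settings.
model = YOLO("yolov8n.yaml")   # nanoscale baseline to be modified into DCA-YOLO
model.train(
    data="neu-det.yaml",       # placeholder dataset config
    imgsz=200,                 # 200 x 200 inputs; Ultralytics may round imgsz
                               # up to a multiple of the model stride
    epochs=200,
    batch=16,
    optimizer="Adam",
    lr0=0.01,                  # initial learning rate
    weight_decay=0.0005,
    mosaic=1.0,                # Mosaic augmentation merges four images
)
```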
To further verify the reliability and robustness of our model, we also evaluated it on the GC10-DET dataset [28], which consists of high-resolution images (2048 × 1000) covering 10 typical types of surface defects in steel manufacturing. This cross-dataset evaluation highlights the strong generalization ability of our model across different industrial scenarios.

4.2. Dataset

NEU-DET is a dataset of images of six typical defects on steel surfaces: crazing, inclusion, patches, pitted_surface, rolled-in_scale, and scratches [34,35,36]. It comprises 1800 grayscale images, 300 per defect type, each originally sized at 200 × 200 pixels. Bounding boxes are used for annotation, indicating the types and locations of the defects present in each image. The dataset is randomly divided into training and testing sets in a 4:1 ratio. The distribution of the dataset is illustrated in Figure 6.
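A 4:1 random split of this kind can be realized with a few lines; the directory layout and file extension below are assumptions for illustration.

```python
import random
from pathlib import Path

# Illustrative 4:1 random split of the 1800 NEU-DET images.
images = sorted(Path("NEU-DET/images").glob("*.jpg"))  # assumed layout
random.seed(0)
random.shuffle(images)
cut = int(0.8 * len(images))
train_set, test_set = images[:cut], images[cut:]
print(len(train_set), len(test_set))  # 1440 360 for the full dataset
```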

4.3. Evaluation Metrics

The performance of the model is primarily evaluated using the mAP50 and mAP50-95 metrics.
$$AP = \int_{0}^{1} p(r)\, dr,$$
where $p(r)$ denotes the precision at recall $r$. This integral quantifies the overall precision across all levels of recall, providing a balanced measure of a model's ability to identify true positives while minimizing false positives. For all classes, the mAP is calculated as follows:
$$mAP = \frac{1}{N_c} \sum_{i=1}^{N_c} AP_i,$$
Here, $N_c$ refers to the total class count, while $AP_i$ indicates the average precision of the $i$-th class. The mAP50 metric is computed at an Intersection over Union (IoU) threshold of 0.50, which is relatively lenient and suitable for evaluating the model's performance under lower overlap requirements. In contrast, mAP50-95 is derived by averaging mAP values over IoU thresholds spanning 0.50 to 0.95, with intervals of 0.05. Its calculation is given by Equation (9):
$$mAP_{50\text{-}95} = \frac{1}{10} \sum_{k=0}^{9} AP_{0.50 + 0.05k},$$
The mAP50-95 metric more comprehensively reflects the detection performance of the model at varying degrees of overlap, imposing higher requirements on the model’s robustness and its ability to precisely detect fine details.
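As a quick illustration of Equation (9), the snippet below averages AP values over the ten IoU thresholds; the AP numbers are invented for the example.

```python
def map50_95(ap_at_iou):
    """Average AP over IoU thresholds 0.50:0.05:0.95 (Equation (9)).
    `ap_at_iou` maps an IoU threshold to the AP computed at that threshold."""
    thresholds = [round(0.50 + 0.05 * k, 2) for k in range(10)]
    return sum(ap_at_iou[t] for t in thresholds) / 10

# Toy example with made-up APs that decay as the IoU requirement tightens.
aps = {round(0.50 + 0.05 * k, 2): 0.75 - 0.04 * k for k in range(10)}
print(round(map50_95(aps), 3))  # 0.57
```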

4.4. Experimental Results

Table 1 summarizes the performance metrics of the EfficientDet, YOLO-BA, DDN, YOLOv8, YOLOv11, and proposed DCA-YOLO models on the steel surface defect detection task. The results indicate that DCA-YOLO attains an overall mAP50 of 77.0, significantly higher than YOLOv8n (72.4) and YOLOv8l (73.5). Additionally, DCA-YOLO reaches 88.8 in the Inclusion category, demonstrating its effective capability to capture local defect features.
Notably, in the detection of Scratches, DCA-YOLO, through targeted optimization, is the first to exceed 93 in per-category mAP50, reaching 93.5. This clearly reflects the model's significant advantage in capturing fine linear features. Although DCA-YOLO trails some models in the Crazing and Rolled-in scale categories, overall, the integration of the targeted modules (DSConv, the Context-Guided module, and Detect_ASFF) markedly enhances the detection performance for critical defect types, confirming the efficacy of the proposed enhancements for detecting steel surface defects.
In addition to overall detection performance, we highlight the model's efficiency under resource-constrained conditions. DCA-YOLO, despite operating at the nanoscale with only 4.3 million parameters and 9.4 GFLOPs, not only surpasses other lightweight models such as YOLOv8n and YOLOv11n but also outperforms larger models such as YOLOv8l and YOLOv11l. Specifically, DCA-YOLO achieves an mAP50 of 77.0, exceeding the 73.5 of YOLOv8l and the 75.2 of YOLOv11l, both of which are significantly larger (YOLOv8l: 43.6 M parameters, 165.4 GFLOPs; YOLOv11l: 25.3 M parameters, 86.6 GFLOPs). Moreover, in critical defect categories such as Inclusion and Scratches, DCA-YOLO consistently achieves superior accuracy, demonstrating better feature extraction capabilities than these large-scale baselines. Remarkably, the weight file of DCA-YOLO remains around 8 MB, compared to approximately 80 MB for YOLOv8l, enabling efficient deployment on edge devices and in real-time settings where computational and storage resources are limited. These results strongly validate that, through targeted architectural enhancements, DCA-YOLO achieves a balance between model compactness and detection performance.
Although DCA-YOLO introduces targeted optimizations for linear feature extraction, a slight drop in performance on the Crazing category is observed. This can be attributed to the inherent characteristics of crazing defects, which often exhibit extremely subtle, fine-grained, and fragmented patterns with high intra-class variability. Such irregularities make crazing inherently more challenging to detect compared to more distinct linear defects like scratches. Moreover, the limited diversity and potential bias within the NEU-DET crazing samples may lead to overfitting risks during model training, reducing generalization capability.
To further validate the generalization ability of our proposed model, we conducted comparative experiments on the GC10-DET dataset, a benchmark for steel surface defect detection. As shown in Table 2, DCA-YOLO achieves the highest overall mAP of 64.4%, surpassing all other methods, including the widely adopted Libra Faster R-CNN (58.8%), FCOS (61.2%), and recent YOLO variants such as YOLOv8n (62.2%) and YOLOv11n (62.6%). Notably, DCA-YOLO demonstrates superior performance across multiple challenging defect categories, such as Welding Line (Wl), Crescent Gap (Cg), and Inclusion (In), indicating its robustness in handling diverse and subtle defect types. These results strongly support the reliability and general applicability of DCA-YOLO in real-world industrial defect detection tasks.
Figure 7 presents the model's prediction results. The results demonstrate that DCA-YOLO not only correctly identifies the bounding boxes for most defect categories, but also generally produces higher confidence scores. Compared with the baseline model (YOLOv8n), DCA-YOLO performs particularly well in detecting rolled-in_scale defects: whereas the baseline model generates multiple overlapping erroneous predictions, DCA-YOLO accurately localizes the target defects, avoiding redundancy and misclassification. Moreover, for defects characterized by fine linear features such as scratches, DCA-YOLO shows significantly improved detection accuracy by capturing subtle defect regions that the baseline model tends to miss, thereby demonstrating a stronger ability to represent linear targets. Overall, DCA-YOLO not only maintains high overall detection precision but also more effectively suppresses false positives and missed detections.
Nevertheless, it is important to note that for the Crazing defect category, DCA-YOLO does not achieve as strong a performance as for other defect types. As shown in Figure 7, the model occasionally misses a portion of the fine-grained Crazing regions, resulting in incomplete detection. This limitation is mainly due to the subtle, fragmented, and less continuous nature of Crazing defects, which makes them harder for the model's current linear feature enhancement mechanism to capture fully.
This study employs Precision–Recall (PR) curves as an essential metric for evaluating the detection performance of the models. Figure 8 illustrates the PR curve of the baseline YOLOv8n model, while Figure 9 displays that of the improved DCA-YOLO model. YOLOv8n achieves a high detection performance in some defect types (e.g., patches and pitted_surface), yet its accuracy for detecting defects such as rolled-in_scale and scratches remains suboptimal, resulting in an overall mAP50 of 0.724. In contrast, the PR curves of the DCA-YOLO model demonstrate enhanced detection performance across most categories—particularly for scratches and other linear defects—ultimately increasing the mAP50 to 0.770. In summary, DCA-YOLO exhibits superior multi-scale feature fusion and local feature extraction capabilities, affirming that the integration of the three optimized modules (DSConv, Context-Guided module, and Detect_ASFF) significantly improves the performance of steel surface defect detection.
Figure 10 and Figure 11, respectively, illustrate the normalized confusion matrices of the YOLOv8n and the DCA-YOLO for steel surface defect detection. The confusion matrix visually presents the classification results across various categories and the extent of inter-class misclassifications, thereby facilitating an assessment of the models’ recognition capabilities and limitations. Notably, the baseline model demonstrates inadequate recognition for defects characterized by fine linear features, such as the rolled-in_scale category, and some defect types are erroneously classified as background. In contrast, the confusion matrix of the DCA-YOLO model reveals a higher overall classification accuracy, with a significant improvement in the challenging rolled-in_scale category and enhanced recognition performance across most categories. Overall, compared to the baseline model, DCA-YOLO exhibits superior feature extraction ability and is more effective in suppressing false positives and mitigating background interference.
Figure 12 presents the attention distribution visualized using the Grad-CAM technique [41]. Grad-CAM tracks gradient information in convolutional layers and visualizes the activation intensity of the network for specific predictions in the form of a heatmap. This helps assess whether the model focuses on image regions that correspond to actual defect locations during object detection [42,43].
The experimental results indicate that for the Scratches defect category, DCA-YOLO exhibits a more linear and elongated attention distribution, closely adhering to the actual shape of the defect. In contrast, YOLOv8n’s attention regions tend to be more circular or diffusely distributed, making it difficult to accurately capture the extension direction of linear defects. For the rolled-in_scale defect, the baseline model not only generates multiple overlapping recognition areas but also exhibits invalid high-activation regions around the image borders. Conversely, DCA-YOLO focuses more effectively on the defect itself, minimizing redundant and erroneous attention spread. Additionally, for defects such as inclusion and patches, DCA-YOLO produces more concentrated and intense red activation regions, indicating a more precise focus on the target areas. DCA-YOLO outperforms the baseline model in feature extraction and localization across different defect categories, effectively concentrating network attention on actual defect regions. This further verifies the efficiency of the DCA-YOLO in steel surface defect detection.
However, it is worth noting that for certain defect types such as Crazing and Pitted-surface, DCA-YOLO exhibits limitations in attention concentration. As shown in Figure 12, the Grad-CAM heatmaps for these categories reveal smaller and lighter red activation regions compared to other defects, suggesting that the model’s focus on defect areas is less intense. This insufficient attention may contribute to missed or incomplete defect localization, highlighting a performance gap that requires further optimization in handling fine-grained or low-contrast surface features.

4.5. The Result of Ablation Study

To assess the performance improvements each module contributed to YOLOv8n, we executed ablation tests on the NEU-DET dataset. The results are presented in Table 3. In this ablation study, we adopted a stepwise incremental approach, introducing only a single optimization module at a time on YOLOv8n. This allowed us to independently assess the contribution of each module to the overall detection performance enhancement.
The experimental results demonstrate that incorporating the Context-Guided module significantly enhances model precision, which increases from 0.671 to 0.747. Meanwhile, mAP50 improves from 0.724 to 0.761, and mAP50-95 increases from 0.377 to 0.452. Although the recall slightly decreases from 0.685 to 0.678, the overall performance is improved. After integrating the DSConv module, the model achieves an mAP50 of 0.763, an approximately 4% improvement over the baseline model (0.724). Additionally, the recall rises from 0.685 to 0.733. This indicates that the DSConv module plays a vital role in capturing fine linear features of steel surface defects and greatly improves the model's capability to recognize these characteristics, validating the effectiveness of DSConv for this specific task. Furthermore, incorporating the Detect_ASFF module yields mAP50 and mAP50-95 values of 0.759 and 0.454, respectively, with recall increasing to 0.749, confirming the module's effectiveness in multi-scale feature fusion.
In summary, the ablation study demonstrates that each module has its unique focus: the Context-Guided module leverages contextual information to enhance detection accuracy; the DSConv module improves the capture of fine linear features through a dynamic convolution strategy; and the Detect_ASFF module optimizes multi-scale feature fusion within the detection head. Ultimately, by enhancing specific detection capabilities through individual modules and effectively integrating these advantages, the constructed DCA-YOLO model achieves significant improvements across all evaluation metrics, showcasing outstanding detection performance.

5. Discussion

In this study, the DCA-YOLO model demonstrated superior performance in steel surface defect detection by leveraging knowledge-guided architectural optimization. By integrating domain knowledge about defect morphology and distribution patterns into the network design, DCA-YOLO effectively handles variations in illumination and material properties and accurately distinguishes between defects with significant intra-class appearance differences and those with similar inter-class characteristics. However, the small-scale version of the model (e.g., the nanoscale variant), while performing well overall, exposes certain limitations in deep feature extraction. Scaling up the model to further enhance deep feature extraction would inevitably increase the number of parameters and the associated computational cost, which may become a limiting factor in practical applications, particularly in real-time detection scenarios. Moreover, although DCA-YOLO has been effectively optimized for steel surface defect detection, its generalization capability on defect detection in other materials has not yet been fully validated. Surface defects in different materials may exhibit diverse morphologies and distribution patterns, and the model might require additional optimization measures to address these new challenges.
To tackle the above-mentioned challenges, future work will focus on refining the model’s deep feature extraction efficiency while reducing the computational burden through model quantization and the design of efficient convolution operations. Additionally, we will investigate the applicability of DCA-YOLO to defect detection tasks on other industrial materials. By optimizing data preprocessing, feature extraction, and multi-scale fusion strategies for different materials and defect types, we aim to enhance the model’s robustness and precision across domains. These advancements are expected to provide robust support for achieving more widespread automation in defect detection.

6. Conclusions

This study addresses the challenges of multi-scale feature extraction and fusion in steel surface defect detection by introducing the DCA-YOLO model. By integrating the Context-Guided module, DSConv module, and Detect_ASFF module into the YOLOv8n model, the proposed approach efficiently captures and fuses group-based, locally clustered, and fine linear features. Empirical evidence indicates that the optimized model considerably boosts detection precision, robustness, and multi-scale feature representation, with the mAP50 metric increasing by 4.6 percentage points. These results thoroughly validate the effectiveness of the multi-scale feature fusion strategy in steel surface defect detection.

Author Contributions

Conceptualization, H.X. and Y.C.; methodology, H.X.; software, H.X.; validation, H.X., Z.Z., and J.S.; formal analysis, Z.Z.; investigation, H.X.; resources, H.X.; data curation, H.X.; writing—original draft preparation, H.X.; writing—review and editing, H.X., H.Y., J.S., and Y.C.; visualization, Z.Z. and Y.C.; supervision, H.Y. and J.S.; project administration, H.Y. and J.S.; funding acquisition, H.Y. and J.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are available on request from the corresponding authors.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Grishin, A.; BorisV; iBardintsev; Inversion; Oleg. Severstal: Steel Defect Detection. Kaggle, 2019. Available online: https://kaggle.com/competitions/severstal-steel-defect-detection (accessed on 12 May 2025).
  2. Tabernik, D.; Šela, S.; Skvarč, J.; Skočaj, D. Segmentation-Based Deep-Learning Approach for Surface-Defect Detection. J. Intell. Manuf. 2019, 31, 759–776. [Google Scholar] [CrossRef]
  3. Tang, S.; He, F.; Huang, X.; Yang, J. Online PCB defect detector on a new PCB defect dataset. arXiv 2019, arXiv:1902.06197. [Google Scholar]
  4. Shi, Y.; Cui, L.; Qi, Z.; Meng, F.; Chen, Z. Automatic road crack detection using random structured forests. IEEE Trans. Intell. Transp. Syst. 2016, 17, 3434–3445. [Google Scholar] [CrossRef]
  5. Cui, L.; Qi, Z.; Chen, Z.; Meng, F.; Shi, Y. Pavement Distress Detection Using Random Decision Forests. In Data Science, Proceedings of the Second International Conference, Sydney, Australia, 8–9 August 2015; Springer: Cham, Switzerland, 2015; pp. 95–102. [Google Scholar]
  6. Zhou, X.; Fang, H.; Liu, Z.; Zheng, B.; Sun, Y.; Zhang, J.; Yan, C. Dense attention-guided cascaded network for salient object detection of strip steel surface defects. IEEE Trans. Instrum. Meas. 2021, 71, 1–14. [Google Scholar] [CrossRef]
  7. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
  8. Qi, Y.; He, Y.; Qi, X.; Zhang, Y.; Yang, G. Dynamic snake convolution based on topological geometric constraints for tubular structure segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 6070–6079. [Google Scholar]
  9. Wu, T.; Tang, S.; Zhang, R.; Cao, J.; Zhang, Y. Cgnet: A light-weight context guided network for semantic segmentation. IEEE Trans. Image Process. 2020, 30, 1169–1179. [Google Scholar] [CrossRef]
  10. Liu, S.; Huang, D.; Wang, Y. Learning spatial fusion for single-shot object detection. arXiv 2019, arXiv:1911.09516. [Google Scholar]
  11. O’shea, K.; Nash, R. An introduction to convolutional neural networks. arXiv 2015, arXiv:1511.08458. [Google Scholar]
  12. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef]
  13. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems 25 (NIPS 2012), Lake Tahoe, NV, USA, 3–6 December 2012. [Google Scholar]
  14. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  15. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar]
  16. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  17. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  18. Zhukov, A.; Rivero, A.; Benois-Pineau, J.; Zemmari, A.; Mosbah, M. A hybrid system for defect detection on rail lines through the fusion of object and context information. Sensors 2024, 24, 1171. [Google Scholar] [CrossRef]
  19. Yuan, Z.; Tang, X.; Ning, H.; Yang, Z. Lw-yolo: Lightweight deep learning model for fast and precise defect detection in printed circuit boards. Symmetry 2024, 16, 418. [Google Scholar] [CrossRef]
  20. Gao, R.; Chen, M.; Pan, Y.; Zhang, J.; Zhang, H.; Zhao, Z. LGR-Net: A Lightweight Defect Detection Network Aimed at Elevator Guide Rail Pressure Plates. Sensors 2025, 25, 1702. [Google Scholar] [CrossRef] [PubMed]
  21. Dai, X.; Chen, Y.; Xiao, B.; Chen, D.; Liu, M.; Yuan, L.; Zhang, L. Dynamic head: Unifying object detection heads with attentions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 7373–7382. [Google Scholar]
  22. Liu, W.; Lu, H.; Fu, H.; Cao, Z. Learning to upsample by learning to sample. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 6027–6037. [Google Scholar]
  23. Dong, X.; Bao, J.; Chen, D.; Zhang, W.; Yu, N.; Yuan, L.; Chen, D.; Guo, B. Cswin transformer: A general vision transformer backbone with cross-shaped windows. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 12124–12134. [Google Scholar]
  24. Goshtasby, A. Template matching in rotated images. IEEE Trans. Pattern Anal. Mach. Intell. 1985, PAMI-7, 338–344. [Google Scholar]
  25. Mikolajczyk, K.; Zisserman, A.; Schmid, C. Shape recognition with edge-based features. In Proceedings of the British Machine Vision Conference (BMVC’03), Norwich, UK, 9–11 September 2003; Volume 2, pp. 779–788. [Google Scholar]
  26. Luo, Q.; Fang, X.; Liu, L.; Yang, C.; Sun, Y. Automated visual defect detection for flat steel surface: A survey. IEEE Trans. Instrum. Meas. 2020, 69, 626–644. [Google Scholar] [CrossRef]
  27. Wen, X.; Shan, J.; He, Y.; Song, K. Steel surface defect recognition: A survey. Coatings 2022, 13, 17. [Google Scholar] [CrossRef]
  28. Lv, X.; Duan, F.; Jiang, J.J.; Fu, X.; Gan, L. Deep metallic surface defect detection: The new benchmark and detection network. Sensors 2020, 20, 1562. [Google Scholar] [CrossRef]
  29. Yung, N.D.T.; Wong, W.; Juwono, F.H.; Sim, Z.A. Safety helmet detection using deep learning: Implementation and comparative study using YOLOv5, YOLOv6, and YOLOv7. In Proceedings of the 2022 International Conference on Green Energy, Computing and Sustainable Technology (GECOST), Miri Sarawak, Malaysia, 26–28 October 2022; pp. 164–170. [Google Scholar]
  30. Yu, F.; Koltun, V. Multi-scale context aggregation by dilated convolutions. arXiv 2015, arXiv:1511.07122. [Google Scholar]
  31. Chen, Y.; Dai, X.; Liu, M.; Chen, D.; Yuan, L.; Liu, Z. Dynamic convolution: Attention over convolution kernels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11030–11039. [Google Scholar]
  32. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  33. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  34. Bao, Y.; Song, K.; Liu, J.; Wang, Y.; Yan, Y.; Yu, H.; Li, X. Triplet-graph reasoning network for few-shot metal generic surface defect segmentation. IEEE Trans. Instrum. Meas. 2021, 70, 1–11. [Google Scholar] [CrossRef]
  35. Song, K.; Yan, Y. A noise robust method based on completed local binary patterns for hot-rolled steel strip surface defects. Appl. Surf. Sci. 2013, 285, 858–864. [Google Scholar] [CrossRef]
  36. He, Y.; Song, K.; Meng, Q.; Yan, Y. An end-to-end steel surface defect detection approach via fusing multiple hierarchical features. IEEE Trans. Instrum. Meas. 2019, 69, 1493–1504. [Google Scholar] [CrossRef]
  37. Tan, M.; Pang, R.; Le, Q.V. Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10781–10790. [Google Scholar]
  38. Ma, X.; Deng, X.; Kuang, H.; Liu, X. YOLOv7-BA: A Metal Surface Defect Detection Model Based On Dynamic Sparse Sampling And Adaptive Spatial Feature Fusion. In Proceedings of the 2024 IEEE 6th Advanced Information Management, Communicates, Electronic and Automation Control Conference (IMCEC), Chongqing, China, 24–26 May 2024; Volume 6, pp. 292–296. [Google Scholar]
  39. Pang, J.; Chen, K.; Shi, J.; Feng, H.; Ouyang, W.; Lin, D. Libra r-cnn: Towards balanced learning for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 821–830. [Google Scholar]
  40. Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: Fully convolutional one-stage object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9626–9635. [Google Scholar]
  41. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 618–626. [Google Scholar]
  42. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  43. Lipton, Z.C. The mythos of model interpretability: In machine learning, the concept of interpretability is both important and slippery. Queue 2018, 16, 31–57. [Google Scholar] [CrossRef]
Figure 1. Model architecture diagram.
Figure 2. Context-Guided module. Arrows indicate the sequential order of operations, and boxes enclose components belonging to the same computational flow.
Figure 3. Comparison between DSConv and DConv. Arrows represent the displacement in dynamic convolution, and each box denotes a single convolution unit.
Figure 4. DSConv.
Figure 5. Structure of the ASFF module. Here, $X^{(1 \to 3)}$, $X^{(2 \to 3)}$, and $X^{(3 \to 3)}$ denote feature maps originally from scale 1, scale 2, and scale 3, respectively, that have been resized and aligned to the spatial resolution of the third scale for fusion. Each input feature undergoes scale transformation to match the target resolution before adaptive weighting and combination.
Figure 6. Dataset distribution.
Figure 7. Comparison of prediction results for different defect types. (a) Original images; (b) YOLOv8n predictions; (c) DCA-YOLO predictions. From top to bottom, the rows correspond to (1) Crazing defects, (2) Inclusion defects, (3) Patches defects, (4) Pitted-surface defects, (5) Rolled-in scale defects, and (6) Scratches defects.
Figure 8. Precision–recall curve of the baseline YOLOv8n model. This figure illustrates the detection performance of YOLOv8n across different confidence thresholds, reflecting its ability to balance precision and recall for steel surface defect detection.
Figure 9. Precision–recall curve of the proposed DCA-YOLO model. Compared to the baseline, this figure demonstrates improved precision and recall trade-offs.
Figure 10. Confusion matrix of the baseline YOLOv8n model for steel surface defect classification.
Figure 11. Confusion matrix of the proposed DCA-YOLO model showing improved classification accuracy across defect types.
Figure 12. Comparison of Grad-CAM visualizations for different defect types. (a) Original images; (b) YOLOv8n Grad-CAM heatmaps; (c) DCA-YOLO Grad-CAM heatmaps. From top to bottom: (1) Crazing defects, (2) Pitted-surface defects, (3) Scratches defects, (4) Rolled-in scale defects, (5) Patches defects, and (6) Inclusion defects.
Table 1. Evaluation of various models using the NEU-DET dataset. Abbreviations: CRA (Crazing), INC (Inclusion), PAT (Patches), PIT (Pitted surface), RIS (Rolled-in scale), SCR (Scratches). Bold values indicate the best performance achieved for each metric.

| Method | mAP50 | mAP50-95 | CRA | INC | PAT | PIT | RIS | SCR |
|---|---|---|---|---|---|---|---|---|
| EfficientDet [37] | 70.1 | - | 45.9 | 62.0 | 83.5 | 85.5 | **70.7** | 73.1 |
| YOLO-BA [38] | 74.8 | 38.8 | 36.3 | 67.8 | 91.0 | **96.6** | 70.6 | 86.4 |
| DDN [36] | 76.6 | - | 50.8 | 71.2 | 90.7 | 88.5 | 69.0 | 89.3 |
| YOLOv8n | 72.4 | 37.7 | 48.8 | 78.2 | 91.6 | 83.2 | 55.2 | 77.5 |
| YOLOv11n | 75.4 | 44.0 | 37.5 | 85.6 | 92.5 | 81.5 | 62.2 | 92.9 |
| YOLOv8l | 73.5 | 37.7 | **54.0** | 73.9 | **93.3** | 80.1 | 51.4 | 88.6 |
| YOLOv11l | 75.2 | **45.4** | 38.1 | 87.8 | 90.9 | 82.6 | 59.0 | 92.7 |
| DCA-YOLO | **77.0** | **45.4** | 39.5 | **88.8** | 92.8 | 83.5 | 64.0 | **93.5** |
Table 2. Comparison of different models on the GC10-DET defect dataset. Abbreviations: Pu (Punching), Wl (Weld line), Cg (Crescent gap), Ws (Water spot), Os (Oil spot), Ss (Scratches), In (Inclusion), Rp (Rolled pit), Cr (Crease), Wf (Wrinkle), mAP (mean average precision). Bold values indicate the best performance achieved for each metric.

| Method | Pu | Wl | Cg | Ws | Os | Ss | In | Rp | Cr | Wf | mAP |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Libra Faster R-CNN [39] | **99.5** | 42.9 | 94.9 | 72.8 | **72.1** | **62.8** | 18.8 | **37.4** | 17.6 | 69.3 | 58.8 |
| FCOS [40] | 96.7 | 57.3 | 93.0 | 73.6 | 61.8 | 61.5 | 21.3 | 35.7 | 25.1 | **84.2** | 61.2 |
| YOLOv8n | 95.0 | 86.9 | 88.2 | 79.1 | 62.2 | 58.4 | 30.6 | 5.3 | **39.9** | 76.1 | 62.2 |
| YOLOv11n | 98.5 | 89.5 | 91.6 | 79.9 | 65.1 | 57.1 | 30.8 | 9.8 | 34.7 | 68.9 | 62.6 |
| DCA-YOLO | 98.6 | **93.4** | **96.0** | **80.7** | 63.7 | 56.5 | **34.6** | 15.2 | 31.5 | 74.1 | **64.4** |
Table 3. Ablation study results on the NEU-DET dataset, showing the performance impact of each proposed module when added to the YOLOv8n baseline. Bold values indicate the best performance achieved for each metric.

| Model | P | R | mAP50 | mAP50-95 |
|---|---|---|---|---|
| YOLOv8n | 0.671 | 0.685 | 0.724 | 0.377 |
| +ContextGuide | **0.747** | 0.678 | 0.761 | 0.452 |
| +DSConv | 0.674 | 0.733 | **0.763** | 0.445 |
| +Detect_ASFF | 0.664 | **0.749** | 0.759 | **0.454** |