Article

ELS-YOLO: Efficient Lightweight YOLO for Steel Surface Defect Detection

School of Artificial Intelligence and Information Engineering, East China University of Technology, Nanchang 330000, China
* Authors to whom correspondence should be addressed.
Electronics 2025, 14(19), 3877; https://doi.org/10.3390/electronics14193877
Submission received: 14 July 2025 / Revised: 13 August 2025 / Accepted: 15 August 2025 / Published: 29 September 2025
(This article belongs to the Section Artificial Intelligence)

Abstract

Detecting surface defects in steel products is essential for maintaining manufacturing quality. However, existing methods struggle with significant challenges, including substantial defect size variations, diverse defect types, and complex backgrounds, leading to suboptimal detection accuracy. This work introduces ELS-YOLO, an advanced YOLOv11n-based algorithm designed to tackle these limitations. We first introduce a C3k2_THK module that combines partial convolution, a heterogeneous kernel selection protocol, and the SCSA attention mechanism to improve feature extraction while reducing computational overhead. Additionally, we develop a Staged-Slim-Neck module that employs dual and dilated convolutions at different stages while integrating GMLCA attention to enhance feature representation and reduce computational complexity. Furthermore, an MSDetect detection head is designed to boost multi-scale detection performance. Experimental validation shows that ELS-YOLO outperforms YOLOv11n in detection accuracy while achieving 8.5% and 11.1% reductions in the number of parameters and computational cost, respectively, demonstrating strong potential for real-world industrial applications.

1. Introduction

Steel is widely used in critical fields like manufacturing and construction due to its high strength, durability, and plasticity. During production, various factors, such as temperature fluctuations, equipment precision, and operational errors, can cause surface defects like cracks and scratches. Defective steel may lead to serious accidents, including building collapses and mechanical failures, threatening safety and incurring high costs for repairs and recalls. Hence, the development of precise and effective steel surface defect detection methods becomes essential for quality assurance, efficiency enhancement, and intelligent advancement in steel manufacturing.
Traditional steel defect detection methods include manual inspection, photoelectric detection, and machine learning-based approaches. Manual inspection relies on human visual judgment with low efficiency and accuracy. Photoelectric detection uses optical and electromagnetic sensors for rapid, non-contact detection but is sensitive to environmental factors like lighting and temperature. Machine learning techniques for steel defect identification typically fall into two distinct approaches [1]: methods leveraging textural features [2,3] and methods exploiting geometric characteristics [4,5]. While textural feature-based methods perform effectively on texture-abundant surfaces, they suffer from computational complexity and noise vulnerability. Shape-based methods work well for clear-contour defects but struggle with irregular defects with complex shapes and blurred boundaries. Traditional methodologies exhibit inherent deficiencies regarding detection reliability, adaptability to varying conditions, and scalability across different scenarios, consequently falling short of satisfying modern manufacturing standards for precise and efficient fault detection.
The emergence of deep learning has revolutionized feature extraction processes, enabling neural architectures to automatically discover discriminative patterns within large-scale data repositories without explicit feature specification, thereby delivering enhanced performance for steel surface defect identification tasks. Contemporary deep learning-based detection systems encompass single-stage and two-stage architectures, wherein two-stage models (RCNN [6], Fast RCNN [7], Faster RCNN [8], and Mask RCNN [9]) execute detection through sequential region proposal refinement and classification stages, achieving superior accuracy levels. For railway steel rail defect detection, Choi et al. [10] constructed multi-source defect datasets based on Fast R-CNN and optimized network training, significantly improving detection accuracy and efficiency. Liu et al. [11] addressed low accuracy in automotive welding point detection by optimizing anchor-box selection and the loss function, effectively improving the detection accuracy of Faster R-CNN. However, two-stage methods require sequential candidate-region generation and precise target localization classification, resulting in large computational requirements and relatively low computational efficiency.
To establish an effective compromise between accuracy performance and computational demands, researchers have introduced single-stage detection architectures (exemplified by YOLO [12], SSD [13], and DETR [14]), which facilitate simultaneous localization and classification through a unified network pass, thereby reducing processing complexity while accelerating inference operations. Zhao et al. [15] proposed the SSD-CA method to effectively suppress complex background noise interference. Xiao et al. [16] effectively compressed model computational complexity while maintaining detection accuracy through depthwise separable convolution and loss function optimization. Zheng et al. [17] improved the Re-DETR method to enhance railway target detection accuracy in low-light environments. Single-stage detection methods offer improved efficiency but struggle with complex backgrounds, resulting in higher miss and false-detection rates compared to two-stage methods.
In addition to the inherent limitations of existing detection paradigms, steel defect detection tasks present additional domain-specific challenges that further complicate the detection process. Specifically, steel defect detection faces three core challenges: First is the multi-scale detection problem, where steel defects span multiple scale ranges, from small-scale cracks to large-area oxide scales, and the regular texture of steel surfaces forms a strong contrast with the irregular shapes of defects. Previous detection models’ standard convolutional kernels cannot simultaneously capture the feature differences between regular backgrounds and irregular defects. Second is the precision assurance problem in complex industrial environments, where steel defects often exhibit low-contrast characteristics and existing models’ attention mechanisms cannot effectively focus on weak defect signals, leading to low detection accuracy. Finally, there is the performance optimization challenge under limited computational resources, where existing detection models’ deep network structures and complex feature fusion processes bring high computational overhead, making them unsuitable for deployment on resource-constrained edge devices.
To address these challenges, this paper selects YOLOv11n (as shown in Figure 1) as the base model, which maintains high detection accuracy while having fewer parameters and a faster inference speed, providing a good lightweight foundation for subsequent improvements. Based on this, we propose ELS-YOLO (as shown in Figure 2), specifically optimized for the special requirements of steel defect detection.
  • To enhance multi-scale feature extraction capabilities, we introduce the C3k2_THK module. By integrating T-shaped convolution [18] and the Heterogeneous Kernel Selection Protocol [19], redundant features are eliminated. The SiLU [20] activation function replaces ReLU [21] to enhance the complex feature expression capability. The SCSA [22] attention mechanism is incorporated into bottleneck blocks to further improve spatial and channel feature extraction.
  • To achieve computational efficiency without compromising detection precision, we develop the Staged-Slim-Neck architecture. Dual group shuffle convolution and dilated group shuffle convolution replace GSConv [23] at different levels to effectively reduce computational overhead and increase the receptive field. The GMLCA attention mechanism is added at the highest level to maximize utilization of the channel features preserved by the slim neck [23].
  • A multi-scale feature extraction detection head module, MSDetect, is proposed. This module designs MRFB-L and MRFB modules to replace the depthwise separable convolution and standard convolution in the classification and regression branches, respectively, effectively enhancing the feature capture capability of the detection head, improving detection accuracy, and reducing computational overhead.
This work proceeds as follows: related work (Section 2), methodology (Section 3), experimental evaluation (Section 4), and conclusions (Section 5).

2. Related Work

To improve steel surface defect detection accuracy, researchers have proposed various improvements targeting insufficient feature extraction and fusion. Wang et al. [24] introduced multi-scale blocks in YOLOv5, utilizing different kernel sizes for multi-scale feature extraction. Li et al. [25] presented a novel Efficient Feature Fusion (EFF) architecture that enhances the efficiency of feature propagation through concatenation and convolutional fusion mechanisms. Zhang et al. [26] developed the GDM-YOLO model, leveraging large convolution kernels and reparameterization techniques for enhanced feature extraction. Ren et al. [27] introduced deformable convolution to expand receptive fields, combined with an ECA attention mechanism to improve detection accuracy. You et al. [28] enhanced YOLOv8 by improving the SPPF module to expand the receptive field, thereby enhancing detection accuracy.
Dang et al. [29] proposed the FD-YOLO11 model, balancing accuracy and efficiency through self-calibrated convolution, feature fusion spatial pyramid pooling, and dynamic sampling mechanisms. Zhang et al. [30] proposed Multi-Scale Feature Fusion (MSF) to integrate multi-level features, improving complex defect detection. Su et al. [31] enhanced defect detection by incorporating long short-term self-attention (LSTA) with local convolution and a prior-modulated cross attention (PMCA) mechanism, both Transformer-based attention mechanisms, for improved detection accuracy. Chen et al. [32] proposed the HCT-Det model, which combines CNN and Transformer advantages through a window-based self-attention residual (WSA-R) block structure to improve steel surface defect detection accuracy.
Zhou et al. [33] utilized Ghost modules to reduce parameters while combining EIoU loss and CA attention for enhanced accuracy. Zhang et al. [34] balanced YOLOv5’s accuracy and efficiency through group convolution, shuffle structures, and SimAM attention. Fan et al. [35] introduced CAM modules and depth-wise separable convolution for improved speed and accuracy. Lu et al. [36] replaced YOLOv7’s backbone with lightweight MobileNetv3 and integrated a D-SimSPPF module with a SimAM attention mechanism to reduce computational complexity while improving detection accuracy. Ma et al. [37] enhanced YOLOv8 performance using an MPCA attention mechanism. Liao et al. [38] strengthened feature extraction and reduced computational complexity via DualConv and SlimFusionCSP modules. Yuan et al. [39] achieved a lightweight YOLOv8n implementation through deformable convolution and a Faster block. Zhang et al. [40] designed a C2f_DWR module that fuses the extended residual module (DWR) to effectively extract multiscale contextual information and enhance feature extraction capability. Zhao et al. [41] proposed the MSAF-YOLOv8n algorithm, which achieves dual improvement in accuracy and efficiency by incorporating ghost convolution to reduce parameters while enhancing feature representation capability.
Although previous studies have achieved certain advances in steel surface flaw identification, key challenges persist. Current approaches face difficulties in managing substantial scale variations of defects, demonstrate limited feature learning abilities across different anomaly categories, and cannot establish an ideal trade-off between recognition precision and computational efficiency.

3. Method

3.1. C3k2_THK

To address computational resource waste caused by the sparse distribution of steel defects and feature confusion from complex background textures, this paper proposes an improved feature extraction module named C3k2_THK. The module integrates T-shaped convolution from the FasterNet block to reduce redundant computation and introduces an SCSA attention mechanism to highlight critical feature characteristics, with the detailed architecture presented in Figure 2.
T-shaped convolution consists of partial convolution (PConv), pointwise convolution, ReLU activation, and batch normalization, reducing redundant computation by performing convolution on only 1/4 of the input channels while keeping the remaining 3/4 unchanged. The processed feature maps are concatenated with the unprocessed parts, and pointwise convolution fuses cross-channel feature information. However, using solely 3 × 3 convolution has limited adaptability to multi-scale targets, and ReLU activation suffers from neuron death and non-zero-centered output issues.
$$\mathrm{ReLU}(x) = \max(0, x) \tag{1}$$
$$\mathrm{SiLU}(x) = \frac{x}{1 + e^{-x}} \tag{2}$$
Therefore, this paper proposes THK convolution to replace the standard convolution in T-shaped convolution, as shown in Figure 3. The module adopts the HKS strategy, setting convolution kernel sizes hierarchically as 3, 5, 7, and 9 to improve the model’s responsiveness to varying feature scales. Additionally, the activation function is changed to SiLU, which alleviates the dying-neuron problem and the non-smooth gradient of ReLU at zero (compare Equations (1) and (2)).
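For concreteness, the following is a minimal PyTorch sketch of how a T-shaped convolution built on partial convolution, together with an HKS-style kernel schedule, could be realized. The class names, the placement of batch normalization, and the stacked demonstration at the end are illustrative assumptions, not the authors’ released code.

```python
import torch
import torch.nn as nn

class PConv(nn.Module):
    """Partial convolution: convolve only the first 1/4 of the channels and
    pass the remaining 3/4 through untouched (FasterNet-style)."""
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        self.conv_ch = channels // 4
        self.conv = nn.Conv2d(self.conv_ch, self.conv_ch, kernel_size,
                              padding=kernel_size // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = torch.split(x, [self.conv_ch, x.size(1) - self.conv_ch], dim=1)
        return torch.cat([self.conv(x1), x2], dim=1)  # concat processed + untouched

class TShapedConv(nn.Module):
    """PConv followed by a pointwise (1x1) convolution that fuses cross-channel
    information; the paper swaps ReLU for SiLU (Equations (1)-(2))."""
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        self.pconv = PConv(channels, kernel_size)
        self.pwconv = nn.Conv2d(channels, channels, 1, bias=False)
        self.bn = nn.BatchNorm2d(channels)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.bn(self.pwconv(self.pconv(x))))

# HKS-style schedule: kernel size grows hierarchically (3, 5, 7, 9). In the
# paper this is applied across stages; stacking here only checks the shapes.
stages = nn.Sequential(*[TShapedConv(64, k) for k in (3, 5, 7, 9)])
print(stages(torch.randn(1, 64, 40, 40)).shape)  # torch.Size([1, 64, 40, 40])
```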
CPCA [42] and other hybrid attention mechanisms mainly focus on single-dimensional feature enhancement, failing to fully exploit the synergistic relationship between spatial and channel dimensions. To tackle this limitation, this paper introduces the SCSA (Spatial and Channel Synergistic Attention) mechanism, as illustrated in Figure 4, which synergistically optimizes feature enhancement across both spatial and channel dimensions to maximize their interdependent benefits.

3.2. Staged-Slim-Neck

In YOLOv11’s neck network, standard convolutions ensure accuracy but have large parameter counts, and the C3k2 module’s splitting strategy does not process all neck features, failing to fully utilize backbone-network features. Therefore, this paper proposes a new feature fusion network called Staged-Slim-Neck, as shown in Figure 2. The network employs differentiated designs for different neck levels: dual convolution-improved GSConv is used in the lower neck layers (first three layers), while dilated convolution-improved GSConv is applied in the higher neck layer (last layer), as illustrated in Figure 5. This design aims to reduce computational overhead while expanding the module’s receptive fields. Additionally, a GMLCA attention mechanism based on improved MLCA [43] is integrated into the bottleneck blocks to enhance spatial information capture capability, comprehensively improving detection accuracy.
GSConv, as shown in Figure 5a, first uses a 3 × 3 convolution to reduce the channel count, thereby lowering subsequent computational overhead. Next, the feature map is processed through 5 × 5 grouped convolution to further reduce computational overhead. Subsequently, the feature map that only underwent 3 × 3 convolution is concatenated with the feature map that underwent two convolution operations. Finally, the concatenated feature map undergoes a shuffle operation to promote information exchange between channels. The specific parameter count and computational complexity calculation formulas are given in Equations (3) and (4), where Parameters, FLOPs, Input, Output, K, G, h, and w represent the parameter count, computational complexity, input channels, output channels, kernel size, number of groups, and feature map height and width, respectively.
$$\text{Parameters} = \text{Input} \times \text{Output} \times K^2 / G \tag{3}$$
$$\text{FLOPs} = \text{Parameters} \times h \times w \tag{4}$$
Given that feature maps maintain uniform dimensions within a bottleneck block, computational cost remains proportional to parameter count. Therefore, only parameter counts are listed subsequently, as they also indicate the relative computational complexities. According to Equations (3) and (4), the parameter count of GSConv is expressed as follows:
$$C_1 \times \frac{C_2}{2} \times 3^2 / 1 + \frac{C_2}{2} \times \frac{C_2}{2} \times 5^2 / \frac{C_2}{2} = \frac{9}{2}C_1 C_2 + \frac{25}{2}C_2 \tag{5}$$
The parameter count of the corresponding standard convolution is expressed as follows:
$$C_1 \times C_2 \times 3^2 = 9 C_1 C_2 \tag{6}$$
where C1 and C2 represent the input and output channels, respectively. According to Equations (5) and (6), when C1 > 25/9, standard convolution exhibits a higher parameter count and computational overhead than GSConv. Since C1 is generally greater than 4, GSConv can safely be taken to have a lower parameter count and computational complexity than standard convolution.
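As a sanity check on the derivation above, the GSConv structure can be sketched in PyTorch and its parameter count compared against Equation (5); the class and helper names are ours, not from the Slim-Neck reference implementation.

```python
import torch
import torch.nn as nn

def channel_shuffle(x: torch.Tensor, groups: int = 2) -> torch.Tensor:
    """Interleave the concatenated halves so information mixes across channels."""
    b, c, h, w = x.shape
    return x.view(b, groups, c // groups, h, w).transpose(1, 2).reshape(b, c, h, w)

class GSConv(nn.Module):
    def __init__(self, c1: int, c2: int):
        super().__init__()
        # 3x3 convolution that halves the output channel count
        self.conv = nn.Conv2d(c1, c2 // 2, 3, padding=1, bias=False)
        # 5x5 grouped (here depthwise, groups = c2 // 2) convolution
        self.dwconv = nn.Conv2d(c2 // 2, c2 // 2, 5, padding=2,
                                groups=c2 // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.conv(x)
        return channel_shuffle(torch.cat([y, self.dwconv(y)], dim=1))

# Parameter count matches Eq. (5): (9/2)*C1*C2 + (25/2)*C2
c1, c2 = 64, 128
assert sum(p.numel() for p in GSConv(c1, c2).parameters()) \
    == 9 * c1 * c2 // 2 + 25 * c2 // 2
```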
In Figure 5a, the feature map after a 3 × 3 standard convolution operation not only serves as input for the subsequent 5 × 5 grouped convolution but also as part of the subsequent feature map concatenation. Therefore, this convolution has the greatest impact on GSConv’s computational complexity and performance. This paper integrates DualConv into this convolution and proposes DGSConv-L, as shown in Figure 5b. By introducing a pointwise convolution branch, it enhances inter-group information interaction in grouped convolution, improves the performance of grouped convolution, and effectively reduces the parameter count, with its structure shown in the upper part of Figure 5d. According to Equations (3) and (4), the parameter count of dual grouped shuffle convolution is expressed as follows:
$$C_1 \times \frac{C_2}{2} \times 3^2 / G + C_1 \times \frac{C_2}{2} \times 1^2 / G + \frac{C_2}{2} \times \frac{C_2}{2} \times 5^2 / \frac{C_2}{2} = \frac{5 C_1 C_2}{G} + \frac{25}{2} C_2 \tag{7}$$
According to Equations (5) and (7), when the number of groups satisfies G > 10/9, the parameter count of dual grouped shuffle convolution is lower than that of GSConv; therefore, any G ≥ 2 yields an effective reduction in parameters and computational overhead. However, although DualConv’s pointwise convolution promotes cross-group information exchange, excessive grouping significantly reduces individual group representation capability, for which pointwise convolution cannot fully compensate. Therefore, this paper applies DualConv only to the lower layers of the neck. For the higher layers of the neck, to increase the receptive field, this paper integrates dilated convolution into grouped shuffle convolution to enhance the model’s expressive capability without increasing the parameter count and computational complexity, as shown in Figure 5c. Since dilated convolution has discontinuous sampling points, applying it to the lower layers of the neck, which contain rich features, would inevitably cause feature loss. Additionally, stacking multiple convolutions with the same dilation rate can easily lead to the gridding effect [44], reducing detection accuracy. Therefore, dilated convolution is applied only to the higher layers of the neck, with dilation rates set to 2 and 3. As depicted in the lower section of Figure 5d, the 3 × 3 dilated convolution with rate 2 enlarges the receptive field without adding extra sampling points.
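Continuing the GSConv sketch above, the two staged variants might look as follows. The DualConv branch layout follows Equation (7) (both branches grouped with G), and the dilation rate shown is 2; which convolution is replaced, and the default group count, are assumptions based on the description.

```python
class DualConv(nn.Module):
    """Parallel grouped 3x3 and grouped 1x1 branches, summed; the text credits
    the pointwise branch with improving inter-group information interaction."""
    def __init__(self, c1: int, c2: int, groups: int = 2):
        super().__init__()
        self.gconv = nn.Conv2d(c1, c2, 3, padding=1, groups=groups, bias=False)
        self.pconv = nn.Conv2d(c1, c2, 1, groups=groups, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.gconv(x) + self.pconv(x)

class DGSConvL(GSConv):
    """Lower-neck variant: DualConv replaces GSConv's first 3x3 convolution."""
    def __init__(self, c1: int, c2: int, groups: int = 2):
        super().__init__(c1, c2)
        self.conv = DualConv(c1, c2 // 2, groups)

class DGSConvH(GSConv):
    """Higher-neck variant: the first convolution becomes dilated; the paper
    alternates rates 2 and 3 to avoid the gridding effect."""
    def __init__(self, c1: int, c2: int, dilation: int = 2):
        super().__init__(c1, c2)
        self.conv = nn.Conv2d(c1, c2 // 2, 3, padding=dilation,
                              dilation=dilation, bias=False)
```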
Although dilated convolution expands the receptive field, its non-continuous sampling causes feature discontinuity and information loss. To address this, we propose the Gated Mixed Local Channel Attention (GMLCA) mechanism (Figure 6) to enhance feature representation through adaptive channel weighting. GMLCA is applied only to higher neck layers to balance efficiency and performance.
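The paper does not spell out GMLCA’s internals beyond “gated mixed local channel attention”, so the following is a speculative sketch under stated assumptions: MLCA-style local and global channel descriptors are each mixed by a shared 1D convolution and then blended by a learnable gate before reweighting the channels.

```python
import torch
import torch.nn as nn

class GMLCA(nn.Module):
    """Hedged sketch of GMLCA; the gate design is our assumption, based only on
    the description that it adaptively balances local and global channel
    information."""
    def __init__(self, channels: int, local_size: int = 5, k: int = 3):
        super().__init__()
        self.local_pool = nn.AdaptiveAvgPool2d(local_size)  # local statistics
        self.global_pool = nn.AdaptiveAvgPool2d(1)          # global statistics
        self.conv = nn.Conv1d(1, 1, k, padding=k // 2, bias=False)  # channel mixing
        self.gate = nn.Parameter(torch.zeros(1))            # learnable blend factor

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        g = self.conv(self.global_pool(x).view(b, 1, c)).view(b, c, 1, 1)
        l = self.conv(self.local_pool(x).mean(dim=(2, 3)).view(b, 1, c)).view(b, c, 1, 1)
        a = torch.sigmoid(self.gate)                        # gate in (0, 1)
        return x * torch.sigmoid(a * l + (1 - a) * g)       # channel reweighting
```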

3.3. MSDetect

YOLOv11’s detection head has a limited receptive field when processing multi-scale targets, making it difficult to capture cross-scale feature information. The RFB [45] module utilizes a multi-branch structure where each branch adopts convolutions with different dilation rates for multi-scale feature extraction but has two problems: first, the gridding effect of dilated convolution causes discontinuous information sampling, leading to spatial context loss; second, extensive 1 × 1 convolutions introduce excessive computational overhead.
To address the above issues and considering the different requirements of classification and regression branches in the detection head for multi-scale features, this paper designs a Multi-scale Receptive Field Block (MRFB) and applies it to the detection head, as shown in Figure 7. MRFB adopts a channel-splitting strategy to evenly divide the input feature map into four subsets, reducing frequent channel transformations. The first branch uses identity transformation to preserve original features and provide residual connections, while the remaining branches use grouped convolution kernels of different sizes to obtain continuous differentiated receptive fields, reducing computational complexity. Finally, point-wise convolution is used to fuse cross-channel feature information, achieving high-quality multi-scale feature representation.
To better meet the requirements of different detection branches, this paper designs a lightweight version of MRFB called MRFB-L. MRFB-L sets the number of groups equal to the input-channel count, creating depthwise separable convolutions, and targets the classification branch, which prioritizes computational efficiency; meanwhile, standard MRFB serves the regression branch, which requires enhanced feature representation capability, establishing an optimal balance between computational efficiency and feature quality.
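A hedged sketch of MRFB with the best-performing (0, 3, 5, 7) configuration from Section 4.2.6 follows; the group count of the standard variant is an assumption, and setting depthwise=True approximates MRFB-L by making each convolution branch depthwise.

```python
import torch
import torch.nn as nn

class MRFB(nn.Module):
    def __init__(self, channels: int, groups: int = 4, depthwise: bool = False):
        super().__init__()
        assert channels % 4 == 0
        c = channels // 4                    # four equal channel subsets
        g = c if depthwise else groups       # depthwise=True -> MRFB-L-style branches
        # identity branch keeps the first subset; 3x3/5x5/7x7 grouped convs the rest
        self.branches = nn.ModuleList(
            nn.Conv2d(c, c, k, padding=k // 2, groups=g, bias=False)
            for k in (3, 5, 7))
        self.fuse = nn.Conv2d(channels, channels, 1, bias=False)  # pointwise fusion

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        xs = torch.chunk(x, 4, dim=1)
        out = [xs[0]] + [conv(xi) for conv, xi in zip(self.branches, xs[1:])]
        return self.fuse(torch.cat(out, dim=1))

reg_branch = MRFB(128)                  # regression branch: standard MRFB
cls_branch = MRFB(128, depthwise=True)  # classification branch: MRFB-L
```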

4. Experiments

4.1. Experimental Settings

4.1.1. Experimental Environment

Experimental environment: Windows 11 64-bit system, NVIDIA GeForce RTX 4070Ti super GPU (16 GB VRAM), 32 GB RAM, CUDA 12.1, Python 3.9.19, and PyTorch 2.5.1. Specific parameters are shown in Table 1.

4.1.2. Dataset

The datasets used in this paper are sourced from the publicly available NEU-DET (Northeastern University Detection) dataset from Northeastern University, the GC10-DET dataset from Ultralytics, and the Severstal-Steel-Defect dataset provided by Severstal Steel Company of Russia. The NEU-DET dataset contains 1800 images with dimensions of 200 × 200 pixels, featuring evenly distributed data across six categories of steel surface defects: rolled-in scale (RS), patches (Pa), crazing (Cr), pitted surface (PS), inclusion (In), and scratches (Sc). The GC10-DET dataset comprises 2294 images with dimensions of 2048 × 1000 pixels, encompassing ten types of steel surface defects: punching hole, welding line, crescent gap, water spot, oil spot, silk spot, inclusion, rolled pit, crease, and waist folding. The Severstal-Steel-Defect dataset includes 6666 images of 800 × 128 pixels, covering four types of steel strip surface defects: scratches, slag inclusion, scale, and oxidation. All datasets were randomly partitioned into training, validation, and test sets in a 7:2:1 ratio.
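The 7:2:1 partition can be reproduced with a simple shuffled index split; a minimal sketch follows, where the image directory and extension are placeholders rather than the datasets’ actual layout.

```python
import random
from pathlib import Path

random.seed(0)                                   # fixed seed for a repeatable split
images = sorted(Path("images").glob("*.jpg"))    # placeholder directory/extension
random.shuffle(images)

n = len(images)
train = images[: int(0.7 * n)]                   # 70% training
val = images[int(0.7 * n): int(0.9 * n)]         # 20% validation
test = images[int(0.9 * n):]                     # 10% test
print(len(train), len(val), len(test))
```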

4.1.3. Evaluation Metrics

To evaluate the detection performance of the ELS-YOLO model on the NEU-DET dataset, this study uses precision (P), recall (R), average precision (AP), and mean average precision (mAP) as evaluation metrics for steel surface defect detection. The relevant calculation formulas are expressed as follows:
$$P = \frac{TP}{TP + FP} \tag{8}$$
$$R = \frac{TP}{TP + FN} \tag{9}$$
$$AP = \int_0^1 P(R)\,\mathrm{d}R \tag{10}$$
$$\mathrm{mAP} = \frac{1}{k}\sum_{i=1}^{k} AP_i \tag{11}$$
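These metrics are straightforward to compute; the sketch below evaluates Equations (8), (9), and (11) directly and approximates the integral in Equation (10) with the commonly used monotone-interpolated area under the P(R) curve. The counts and per-class AP values are made-up illustrations.

```python
import numpy as np

def average_precision(recall: np.ndarray, precision: np.ndarray) -> float:
    """Area under the P(R) curve (Eq. (10)), with precision made monotonically
    non-increasing before numerical integration over recall."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]
    return float(np.sum(np.diff(r) * p[1:]))

tp, fp, fn = 80, 10, 20                      # illustrative detection counts
P = tp / (tp + fp)                           # Eq. (8)
R = tp / (tp + fn)                           # Eq. (9)
ap_per_class = [0.81, 0.74, 0.77]            # illustrative per-class APs
mAP = sum(ap_per_class) / len(ap_per_class)  # Eq. (11)
print(P, R, mAP)
```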

4.2. Comparative Studies

4.2.1. Comparative Studies of T-Shaped Convolution in C3k2_THK Module

To identify the most accurate lightweight convolution method, comparative experiments were conducted between T-shaped convolution and various lightweight approaches, including depthwise separable (DS) convolution [46], asymmetric convolution [47], and Ghost convolution [48], with standard convolution as the baseline. The results in Table 2 demonstrate that T-shaped convolution achieves the highest mAP@50 of 76.6% with relatively low parameters (2.50 M) and FLOPs (6.1 G). Specifically, T-shaped convolution outperforms DS convolution by 1.0% (76.6% vs. 75.6%), asymmetric convolution by 0.4% (76.6% vs. 76.2%), and Ghost convolution by 0.9% (76.6% vs. 75.7%). Notably, it even slightly surpasses standard convolution in accuracy (76.6% vs. 76.4%) while maintaining lower computational complexity. This validates T-shaped convolution as the optimal choice for the foundation of our network architecture.

4.2.2. Comparative Studies of Attention Mechanisms in C3k2_THK Module

To validate the efficacy of the SCSA mechanism within C3k2_THK, experiments were designed to compare SE [49], ECA [50], CBAM [51], EMA [52], and no attention mechanism against SCSA mechanisms with different numbers of heads, under the condition of only adding THK convolution. Table 3 presents the experimental outcomes. These findings reveal that, in contrast with traditional channel approaches (SE and ECA) and hybrid attention mechanisms (CBAM, etc.), C3k2_THK with an eight-head SCSA attention mechanism achieves an mAP@50 of 78.5% with approximately equal computational cost and parameter count.
Furthermore, in the SCSA attention mechanism, multi-head attention groups the input features by the number of heads for parallel processing, with different heads acquiring distinct feature representations. The head count therefore exerts considerable influence on model effectiveness. Accordingly, SCSA attention mechanisms with different numbers of heads were compared (numbers in parentheses indicate head settings), and the findings reveal that eight-head SCSA achieves optimal performance on the mAP@50 metric.

4.2.3. Comparative Studies of Neck Architectures in ELS-YOLO

To validate the effectiveness of Slim-Neck, we compared it with T-shaped convolution, BiFPN [53], and CARAFE [54], using the original neck as the baseline. Table 4 shows that both Slim-Neck and T-shaped convolution achieve the highest mAP@50 of 76.8%, but Slim-Neck accomplishes this performance with fewer parameters (2.46 M). Specifically, Slim-Neck improves accuracy by 0.4% over the original neck while reducing parameters by 4.7%. Compared to other methods, it matches T-shaped convolution in accuracy but with fewer parameters and outperforms BiFPN by 0.7% and CARAFE by 2.2%. This validates the superior accuracy–efficiency trade-off of Slim-Neck.

4.2.4. Staged-Slim-Neck Convolution Comparison Studies

To demonstrate the effectiveness of using different convolutions at different stages, comparative studies were performed based on Slim-Neck, as presented in Table 5. When DGSConv-L and DGSConv-H are deployed in the lower and upper neck layers, respectively, the approach achieves 77.9% mAP@50 with 2.42 M parameters and 5.9 G FLOPs, outperforming the baseline GSConv combination (76.8% mAP@50, 2.46 M parameters, 6.1 G FLOPs) with 1.1% higher accuracy and reduced computational overhead. In contrast, using DGSConv-L exclusively yields lower accuracy (77.8% mAP@50) despite fewer parameters (2.36 M), while DGSConv-H alone achieves only 77.5% mAP@50. These results validate that the stage-wise deployment of different convolutions effectively enhances detection performance while reducing model complexity.

4.2.5. Comparative Studies of Different Attention Mechanisms in Staged-Slim-Neck

To validate GMLCA’s effectiveness, five attention mechanisms (SE, ECA, SCSA, MLCA, and GMLCA) were compared on the Staged-Slim-Neck architecture, with results shown in Table 6. The experiments demonstrate that GMLCA significantly outperforms the other methods on the mAP@50 metric. Compared to SCSA, GMLCA achieves a 1.5% accuracy improvement with equivalent parameters and only 0.1 G FLOPs of additional computational cost. Compared to MLCA, GMLCA improves accuracy by 1.1%, benefiting from the gating mechanism’s ability to adaptively balance local and global channel information. The results demonstrate GMLCA’s superiority and efficiency in object detection tasks.

4.2.6. MRFB Comparison Studies with Different Convolutional Kernel Sizes

To verify various convolutional kernel arrangements in the MRFB module, comparative experiments were conducted in YOLOv11n’s detection head, with results shown in Table 7. The experiments show that the four-convolution-branch scheme (MRFB (3, 5, 7, 9)) achieves 76.4% mAP@50, lower than the three-convolution-branch scheme (MRFB (0, 3, 5, 7)) at 77.6%, while increasing parameters and computation from 2.54 M/6.1 G FLOPs to 2.61 M/6.5 G FLOPs. Classification-branch results are similar, with MRFB-L (3, 5, 7, 9) at 77.0% versus MRFB-L (0, 3, 5, 7) at 77.8%. The results demonstrate that the three-branch scheme, by incorporating an identity mapping branch, maintains the integrity of original features and preserves key details while utilizing multi-scale information.
To determine the optimal convolutional kernel combination, hierarchical comparative experiments were designed. First, comparing MRFB (0, 3, 5, 7) with MRFB (0, 5, 7, 9), the former achieved 77.6% mAP@50 (2.54 M parameters, 6.1 G FLOPs), outperforming the latter’s 77.0% (2.60 M parameters, 6.4 G FLOPs) and validating the effectiveness of starting with 3 and 5 kernels. A further comparison between MRFB (0, 3, 5, 7) and MRFB (0, 3, 5, 9) showed the former’s 77.6% to be superior to the latter’s 77.0% (2.57 M parameters, 6.3 G FLOPs). Classification-branch experiments showed MRFB-L (0, 3, 5, 7) at 77.8%, outperforming MRFB-L (0, 5, 7, 9) at 77.4% and MRFB-L (0, 3, 5, 9) at 76.0%. The results demonstrate that using a kernel size of 7 in the fourth branch provides better feature extraction than a kernel size of 9, with the (0, 3, 5, 7) combination reaching the best trade-off between precision and computational cost.

4.3. Ablation Studies

To verify the performance of our approach in steel flaw identification applications, ablation analyses are performed on individual enhanced components using NEU-DET data under identical testing conditions.

4.3.1. C3k2_THK Ablation Studies

The ablation experiments on the C3k2_THK module demonstrate distinct performance contributions from each improvement component, with results shown in Table 8. When used individually, the SCSA attention mechanism achieves the best performance (an mAP@50 improvement of 1.1%, to 77.5%), the T-shaped convolution offers lightweight advantages (an improvement of 0.2%, to 76.6%), and the SiLU activation function maintains parity with the baseline model (76.4%), while the HKS strategy degrades performance when used alone (dropping to 74.9%). This degradation occurs because HKS alone introduces redundant information that interferes with detection, reducing accuracy.
Among pairwise combinations, the synergy between T-shaped convolution and SiLU proves most effective, achieving an mAP@50 of 77.3% (a 0.9% improvement) with parameters reduced to 2.50 M, balancing precision and model compactness. The SiLU and SCSA combination exhibits outstanding precision, reaching 77.5% (a 1.1% improvement). However, combinations involving HKS perform poorly: T-convolution + HKS drops to 75.9%, while SiLU + HKS decreases to 74.9%, with parameters increasing to 3.26 M. These results confirm that HKS on its own is detrimental, validating the conclusions from the individual component analysis: the HKS strategy generates parameter redundancy when used independently and requires coordination with lightweight components to be effective.
Three-component combinations significantly enhance performance. T-convolution + SiLU + SCSA achieves the best accuracy–compactness balance, with an mAP@50 of 78.1% (a 1.7% improvement) and only 2.50 M parameters; T-convolution + HKS + SCSA achieves the highest accuracy, at 78.2% (a 1.8% improvement); and T-convolution + SiLU + HKS reaches 77.9% (a 1.5% improvement). Conversely, the SiLU + HKS + SCSA combination, which lacks T-convolution, degrades to 76.1%, with parameters increasing to 3.26 M, validating the critical role of T-convolution in eliminating parameter redundancy.
The synergistic relationship between T-convolution and SiLU proves essential for HKS effectiveness. When deployed independently, HKS suffers performance degradation (74.9% vs. 76.6% for T-convolution alone) due to redundant information introduced by enlarged convolution kernels. T-convolution mitigates this issue by restructuring the convolution operation to reduce parameter redundancy, while SiLU functions as an intelligent feature selector through its smooth nonlinear characteristics and self-gating mechanism. SiLU effectively suppresses redundant feature activations while enhancing meaningful feature expression, thereby compensating for the feature extraction limitations inherent in the HKS approach. This complementary optimization (structural efficiency via T-convolution and feature refinement through SiLU) enables HKS to harness the benefits of enlarged receptive fields while minimizing the detrimental effects of information redundancy, ultimately delivering superior detection performance.
The complete four-component combination achieves optimal performance: an mAP@50 of 78.5% (a 2.1% improvement), 2.54 M parameters, and 6.2 G FLOPs. The experiments validate the synergistic effects: T-convolution reduces parameter redundancy, SiLU enhances feature representation, HKS enables dynamic kernel selection in coordination with T-convolution, and SCSA strengthens key feature weighting, forming a comprehensive optimization framework that reaches an ideal trade-off between recognition precision and computational cost.

4.3.2. Staged-Slim-Neck Ablation Studies

To verify the individual impact of every module, separate experiments were performed on DGSConv-L, DGSConv-H, and GMLCA. As shown in Table 9, DGSConv-L improves mAP@50 from the baseline 76.4% to 77.0% while reducing parameters to 2.42 M and FLOPs to 5.9 G. DGSConv-H demonstrates more significant performance gains, achieving 77.6% mAP@50 with 2.46 M parameters and 6.1 G FLOPs. GMLCA also exhibits excellent performance, with a 77.4% mAP@50, 2.46 M parameters, and 6.1 G FLOPs. These findings confirm the efficacy of every module structure.
In pairwise combination experiments, the DGSConv-L and DGSConv-H combination achieves a 77.9% mAP@50 with 2.42 M parameters and 5.9 G FLOPs, demonstrating strong complementarity between low-level and high-level feature processing. The DGSConv-H and GMLCA combination performs optimally, reaching a 78.1% mAP@50 with 2.46 M parameters and 6.1 G FLOPs. Notably, the DGSConv-L and GMLCA combination yields a 77.0% mAP@50, showing a slight decline compared to using GMLCA alone. This indicates that without high-level feature enhancement, low-level feature optimization may introduce feature imbalance, constraining the effectiveness of the GMLCA attention mechanism.
When all components are integrated, the complete architecture achieves optimal performance, with a 78.5% mAP@50, a 2.1% improvement over the original model, while keeping 2.42 M parameters and 6.0 G FLOPs. The results confirm the synergistic effects among the three components, demonstrating the efficacy of our approach in terms of both precision and efficiency.

4.3.3. MSDetect Ablation Studies

Table 10 presents the ablation study results for each improvement in MSDetect. The experimental data show that introducing the lightweight MRFB-L module solely in the classification branch improved mAP@50 from YOLOv11n’s 76.4% to 77.8% (a 1.4% increase) while maintaining nearly unchanged parameters (2.58 M) and computation (6.3 G FLOPs). Applying the MRFB module solely in the regression branch improved mAP@50 to 77.6% while reducing parameters from 2.58 M to 2.54 M and computation from 6.3 G to 6.1 G FLOPs. When simultaneously employing MRFB in the regression branch and MRFB-L in the classification branch (the configuration described in Section 3.3), mAP@50 reached 78.1%, a 1.7% improvement over YOLOv11n, with only 2.55 M parameters and 6.1 G FLOPs. Our findings indicate that the MSDetect module reaches an optimal balance between accuracy and efficiency.

4.3.4. ELS-YOLO Ablation Studies

As shown in Table 11, the ELS-YOLO ablation study results fully validate the effectiveness of each improved module. In single-module experiments, C3k2_THK, Staged-Slim-Neck, and MRFB achieved mAP@50 values of 78.5%, 78.5%, and 78.1%, respectively, all significantly outperforming the baseline of 76.4%.
However, the dual-module combination experiments revealed complex interaction mechanisms between modules. Although the combination of C3k2_THK and Staged-Slim-Neck demonstrates favorable synergistic effects, achieving 79.2% mAP@50, other combinations reveal significant compatibility issues. The combination of C3k2_THK and MRFB achieved only 76.6% mAP@50, showing a significant 1.9% performance drop compared to using C3k2_THK alone, which reveals the fundamental conflict between lightweight design and multi-scale feature fusion. Specifically, while C3k2_THK achieves computational efficiency by reducing convolution operations, this directly weakens the semantic expression capability and feature diversity of feature maps. MRFB, as a multi-scale fusion module, heavily relies on the richness of input features. When receiving simplified features from C3k2_THK, its multiple fusion branches cannot obtain sufficient effective information for complementary fusion and may amplify noise components in the features. Combined with the lack of specialized feature adaptation mechanisms between the two modules, this creates negative interactions in what should be a synergistic combination. In contrast, the combination of Staged-Slim-Neck and MRFB achieved 78.2% mAP@50 with only a slight 0.3% drop because Staged-Slim-Neck maintains relatively complete feature expression capability through staged processing, providing MRFB with sufficient multi-scale information. However, while preserving rich features, it inevitably contains redundant features and noise information. When these mixed features are input to MRFB, the multi-scale fusion process struggles to fully distinguish and suppress redundant components, ultimately transmitting fused features containing slight noise to the detection head.
The three-module complete combination demonstrated excellent synergistic optimization effects, with the complete ELS-YOLO architecture achieving optimal performance of 79.5% mAP@50 with only 2.36 M parameters and 5.6 G FLOPs, fully proving ELS-YOLO’s significant advantages in balancing accuracy and efficiency. The three modules work synergistically to achieve a lightweight design while enhancing accuracy: C3k2_THK reduces computational redundancy at the backbone level while maintaining essential feature extraction capabilities, providing a solid foundation for subsequent processing; Staged-Slim-Neck acts as an intelligent feature adapter, compensating for the simplified features from the backbone through staged fusion and delivering enriched multi-scale representations to the detection head; and MRFB leverages these well-prepared features to perform efficient multi-scale detection, maximizing detection accuracy without introducing excessive computational overhead. This collaborative mechanism ensures that each module operates within its optimal efficiency zone.
Additionally, Figure 8 demonstrates that the model exhibits stable decline in all loss functions, continuous improvement in performance metrics, and consistent alignment between training and validation curves over 400 training epochs, confirming the effectiveness of the training process and the rationality of the model architecture.

4.4. Comparison with State-of-the-Art Algorithms

To thoroughly verify ELS-YOLO’s advantages in both detection precision and computational overhead, this section conducts systematic experiments comparing ELS-YOLO with ten state-of-the-art object detection methods on the NEU-DET dataset: YOLOv5, YOLOv8n, YOLOv11n, RT-DETR-l [55], ECA-SimSPPF-SIoU-Yolov5 [27] (denoted as ESS-YOLO in tables), GDM-YOLO [26], SwinYOLO [56], CE-DETR [57], MobileViT-YOLOv8 [58], and RTCN [59]. Detailed outcomes are presented in Table 12.
Analyzing the mAP@50 performance, our ELS-YOLO achieves the highest detection accuracy, at 79.5%. Among YOLO-series methods, ELS-YOLO comprehensively outperforms all variants, including YOLOv5 (74.6%), YOLOv8n (75.4%), YOLOv11n (76.4%), and ESS-YOLO (78.8%). Among Transformer-based approaches, ELS-YOLO demonstrates superior performance across all methods, surpassing SwinYOLO (74.9%), CE-DETR (78.6%), and RT-DETR-l (68.7%).
In terms of computational efficiency, ELS-YOLO exhibits remarkable parameter efficiency, with only 2.36 M parameters and 5.6 GFLOPs. Compared to YOLO-series methods, it achieves superior accuracy with fewer parameters and lower computational cost than most variants. When compared to Transformer-based methods, ELS-YOLO is substantially more efficient than RT-DETR-l (103.5 G, 31.99 M) and CE-DETR (67.87 G), demonstrating exceptional balance between detection accuracy and computational efficiency.
Regarding mAP@50-95, our ELS-YOLO achieves 43.2%, matching YOLOv8n’s performance and trailing YOLOv11n by only 0.9 percentage points (44.1%). Notably, ELS-YOLO achieves this competitive accuracy across varying IoU thresholds (0.5–0.95) with fewer parameters (2.36 M vs. YOLOv11n’s 2.58 M), demonstrating superior efficiency in balancing detection precision and computational cost. Relative to existing approaches, our ELS-YOLO framework exhibits superior capabilities regarding computational complexity, parameter count, and accuracy. Detection outcome comparisons across all implemented approaches for this benchmark are presented in Figure 9.

4.5. Generalization Studies

To comprehensively validate the effectiveness of this algorithm in different scenarios, this paper selects the GC10-DET dataset and the Severstal-Steel-Defect dataset to conduct systematic validation work on the ELS-YOLO model. By leveraging the characteristics of multi-scenario datasets, we thoroughly assess the framework’s effectiveness for real-world steel flaw identification applications. Since computation cost and parameter quantity stay constant for identical architectures, Table 13 and Table 14 only display accuracy metrics.
The experimental results on the GC10-DET dataset are shown in Table 13. ELS-YOLO achieved significant improvements in mAP@50 compared to mainstream detection models such as YOLOv5, YOLOv8n, and YOLOv10n, with an improvement of 2.4% over the baseline YOLOv11n model. During validation tests using the Severstal-Steel-Defect dataset, our ELS-YOLO approach achieved a 49.5% mAP@50, demonstrating a 0.4% enhancement compared to the YOLOv11n baseline, according to Table 14. These comparative findings fully confirm that ELS-YOLO possesses excellent cross-dataset generalization capability. The comparison of detection performance across different models on the GC10-DET and Severstal-Steel-Defect datasets is presented in Figure 10 and Figure 11, respectively.
Under the more stringent mAP@50–95 metric, ELS-YOLO demonstrates distinct performance characteristics. On the GC10-DET dataset, ELS-YOLO achieves an mAP@50–95 of 26.7%, which, although lower than the baseline YOLOv11n’s 30.4%, remains within a reasonable range. On the Severstal-Steel-Defect dataset, ELS-YOLO attains an mAP@50–95 of 22.4%, essentially on par with YOLOv11n’s 22.7%, demonstrating good cross-dataset stability. Since mAP@50–95 requires precise target localization across IoU thresholds ranging from 0.5 to 0.95, these results indicate that ELS-YOLO still has room for improvement in boundary localization accuracy at high IoU thresholds, particularly in handling small-scale defects and achieving precise pixel-level localization, which points to directions for future research.

5. Conclusions

To address multi-scale and multi-morphology defect identification challenges in steel surface inspection, this study presents an enhanced YOLOv11n algorithm with innovative architectural modifications. The methodology integrates advanced feature extraction, multi-scale fusion strategies, and optimized detection heads. Through systematic optimization, the algorithm achieves a 3.1% accuracy improvement on the NEU-DET dataset, with validation on the GC10-DET and Severstal-Steel-Defect datasets confirming broad industrial applicability.
Despite promising performance, three limitations remain: insufficient small-scale defect detection accuracy, limited few-shot learning capability under data imbalance, and weak generalization in complex industrial environments with diverse backgrounds and defect morphologies.
Future research should prioritize enhancing small target detection through advanced feature pyramid networks and attention mechanisms; rigorous large-scale industrial validation for improved robustness; and the investigation of adaptability under varying lighting, surface textures, and noise conditions. These efforts aim to deliver robust solutions for real-world industrial inspection systems, advancing automated quality control technologies and facilitating widespread adoption.

Author Contributions

Conceptualization, Z.Z. and P.D.; methodology, P.D., J.H. and G.Z.; writing—original draft, G.Z., J.Z. and C.Z.; writing—review and editing, Z.Z., P.D. and G.Z.; supervision, J.Z. and J.H. All authors have read and agreed to the published version of the manuscript.

Funding

This project was supported by the Jiangxi Provincial Natural Science Foundation (20252BAC200197), the Doctoral Startup Fund Project of East China University of Technology (DHBK2024026), the National Natural Science Foundation of China (62162002), the National Natural Science Foundation of China Regional Science Foundation Project (62566002) and the Youth Science and Technology Foundation of the Department of Education of Jiangxi Province (GJJ2400604).

Data Availability Statement

Data are contained within this article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Wen, X.; Shan, J.; He, Y.; Song, K. Steel surface defect recognition: A survey. Coatings 2022, 13, 17. [Google Scholar] [CrossRef]
  2. Liu, X.; Xu, K.; Zhou, D.; Zhou, P. Improved contourlet transform construction and its application to surface defect recognition of metals. Multidimens. Syst. Signal Process. 2020, 31, 951–964. [Google Scholar] [CrossRef]
  3. Wang, J.; Li, Q.; Gan, J.; Yu, H.; Yang, X. Surface defect detection via entity sparsity pursuit with intrinsic priors. IEEE Trans. Ind. Inform. 2019, 16, 141–150. [Google Scholar] [CrossRef]
  4. Wang, J.; Fu, P.; Gao, R.X. Machine vision intelligence for product defect inspection based on deep learning and Hough transform. J. Manuf. Syst. 2019, 51, 52–60. [Google Scholar] [CrossRef]
  5. Hwang, Y.I.; Seo, M.K.; Oh, H.G.; Choi, N.; Kim, G.; Kim, K.B. Detection and classification of artificial defects on stainless steel plate for a liquefied hydrogen storage vessel using short-time fourier transform of ultrasonic guided waves and linear discriminant analysis. Appl. Sci. 2022, 12, 6502. [Google Scholar] [CrossRef]
  6. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  7. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  8. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems; Neural Information Processing Systems: La Jolla, CA, USA, 2015; Volume 28. [Google Scholar]
  9. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
  10. Choi, J.Y.; Han, J.M. Deep learning (Fast R-CNN)-based evaluation of rail surface defects. Appl. Sci. 2024, 14, 1874. [Google Scholar]
  11. Liu, W.; Hu, J.; Qi, J. Resistance Spot Welding Defect Detection Based on Visual Inspection: Improved Faster R-CNN Model. Machines 2025, 13, 33. [Google Scholar] [CrossRef]
  12. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  13. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I 14. Springer: Berlin/Heidelberg, Germany, 2016; pp. 21–37. [Google Scholar]
  14. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 213–229. [Google Scholar]
  15. Zhao, J.; Zhang, R.; Chen, S.; Duan, Y.; Wang, Z.; Li, Q. Enhanced Infrared Defect Detection for UAVs Using Wavelet-Based Image Processing and Channel Attention-Integrated SSD Model. IEEE Access 2024, 12, 188787–188796. [Google Scholar] [CrossRef]
  16. Xiao, G.; Hou, S.; Zhou, H. PCB defect detection algorithm based on CDI-YOLO. Sci. Rep. 2024, 14, 7351. [Google Scholar] [CrossRef]
  17. Zheng, Z.; Ji, T.; Ju, J.; Qing, G.; Zou, S.; Zhang, Q.; Zhou, Q.; He, Y. Re-DETR: Research on Fast Detection Technology for Railway Engineering Targets in the Dark Time Domain. IEEE Access 2024, 12, 175501–175510. [Google Scholar] [CrossRef]
  18. Chen, J.; Kao, S.h.; He, H.; Zhuo, W.; Wen, S.; Lee, C.H.; Chan, S.H.G. Run, don’t walk: Chasing higher FLOPS for faster neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 12021–12031. [Google Scholar]
  19. Chen, Y.; Yuan, X.; Wang, J.; Wu, R.; Li, X.; Hou, Q.; Cheng, M.M. YOLO-MS: Rethinking multi-scale representation learning for real-time object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 4240–4252. [Google Scholar]
  20. Park, T.; Liu, M.Y.; Wang, T.C.; Zhu, J.Y. Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2337–2346. [Google Scholar]
  21. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef]
  22. Si, Y.; Xu, H.; Zhu, X.; Zhang, W.; Dong, Y.; Chen, Y.; Li, H. SCSA: Exploring the synergistic effects between spatial and channel attention. Neurocomputing 2025, 634, 129866. [Google Scholar] [CrossRef]
  23. Li, H.; Li, J.; Wei, H.; Liu, Z.; Zhan, Z.; Ren, Q. Slim-neck by GSConv: A lightweight-design for real-time detector architectures. J.-Real-Time Image Process. 2024, 21, 62. [Google Scholar] [CrossRef]
  24. Wang, L.; Liu, X.; Ma, J.; Su, W.; Li, H. Real-time steel surface defect detection with improved multi-scale YOLO-v5. Processes 2023, 11, 1357. [Google Scholar] [CrossRef]
  25. Li, Z.; Wei, X.; Hassaballah, M.; Li, Y.; Jiang, X. A deep learning model for steel surface defect detection. Complex Intell. Syst. 2024, 10, 885–897. [Google Scholar] [CrossRef]
  26. Zhang, T.; Pang, H.; Jiang, C. GDM-YOLO: A Model for Steel Surface Defect Detection Based on YOLOv8s. IEEE Access 2024, 12, 148817–148825. [Google Scholar] [CrossRef]
  27. Ren, F.; Fei, J.; Li, H.; Doma, B.T. Steel surface defect detection using improved deep learning algorithm: ECA-SimSPPF-SIoU-Yolov5. IEEE Access 2024, 12, 32545–32553. [Google Scholar] [CrossRef]
  28. You, C.; Kong, H. Improved steel surface defect detection algorithm based on YOLOv8. IEEE Access 2024, 12, 99570–99577. [Google Scholar] [CrossRef]
  29. Dang, Z.; Wang, X. FD-YOLO11: A Feature-Enhanced Deep Learning Model for Steel Surface Defect Detection. IEEE Access 2025, 13, 63981–63993. [Google Scholar]
  30. Zhang, H.; Li, S.; Miao, Q.; Fang, R.; Xue, S.; Hu, Q.; Hu, J.; Chan, S. Surface defect detection of hot rolled steel based on multi-scale feature fusion and attention mechanism residual block. Sci. Rep. 2024, 14, 7671. [Google Scholar] [CrossRef]
  31. Su, J.; Luo, Q.; Yang, C.; Gui, W.; Silvén, O.; Liu, L. PMSA-DyTr: Prior-modulated and semantic-aligned dynamic transformer for strip steel defect detection. IEEE Trans. Ind. Inform. 2024, 20, 6684–6695. [Google Scholar] [CrossRef]
  32. Chen, X.; Zhang, X.; Shi, Y.; Pang, J. HCT-Det: A High-Accuracy End-to-End Model for Steel Defect Detection Based on Hierarchical CNN–Transformer Features. Sensors 2025, 25, 1333. [Google Scholar] [CrossRef]
  33. Zhou, S.; Zeng, Y.; Li, S.; Zhu, H.; Liu, X.; Zhang, X. Surface defect detection of rolled steel based on lightweight model. Appl. Sci. 2022, 12, 8905. [Google Scholar] [CrossRef]
  34. Zhang, Y.; Shen, S.; Xu, S. Strip steel surface defect detection based on lightweight YOLOv5. Front. Neurorobot. 2023, 17, 1263739. [Google Scholar] [CrossRef]
  35. Fan, J.; Wang, M.; Li, B.; Liu, M.; Shen, D. ACD-YOLO: Improved YOLOv5-based method for steel surface defects detection. IET Image Process. 2024, 18, 761–771. [Google Scholar] [CrossRef]
  36. Lu, J.; Yu, M.; Liu, J. Lightweight strip steel defect detection algorithm based on improved YOLOv7. Sci. Rep. 2024, 14, 13267. [Google Scholar] [CrossRef] [PubMed]
  37. Ma, S.; Zhao, X.; Wan, L.; Zhang, Y.; Gao, H. A lightweight algorithm for steel surface defect detection using improved YOLOv8. Sci. Rep. 2025, 15, 8966. [Google Scholar] [CrossRef] [PubMed]
  38. Liao, L.; Song, C.; Wu, S.; Fu, J. A Novel YOLOv10-Based Algorithm for Accurate Steel Surface Defect Detection. Sensors 2025, 25, 769. [Google Scholar] [CrossRef] [PubMed]
  39. Yuan, Z.; Ning, H.; Tang, X.; Yang, Z. GDCP-YOLO: Enhancing steel surface defect detection using lightweight machine learning approach. Electronics 2024, 13, 1388. [Google Scholar] [CrossRef]
  40. Zhang, T.; Pan, P.; Zhang, J.; Zhang, X. Steel surface defect detection algorithm based on improved YOLOv8n. Appl. Sci. 2024, 14, 5325. [Google Scholar] [CrossRef]
  41. Zhao, B.; Chen, Y.; Jia, X.; Ma, T. Steel surface defect detection algorithm in complex background scenarios. Measurement 2024, 237, 115189. [Google Scholar] [CrossRef]
  42. Huang, H.; Chen, Z.; Zou, Y.; Lu, M.; Chen, C.; Song, Y.; Zhang, H.; Yan, F. Channel prior convolutional attention for medical image segmentation. Comput. Biol. Med. 2024, 178, 108784. [Google Scholar] [CrossRef]
  43. Wan, D.; Lu, R.; Shen, S.; Xu, T.; Lang, X.; Ren, Z. Mixed local channel attention for object detection. Eng. Appl. Artif. Intell. 2023, 123, 106442. [Google Scholar] [CrossRef]
  44. Wang, P.; Chen, P.; Yuan, Y.; Liu, D.; Huang, Z.; Hou, X.; Cottrell, G. Understanding convolution for semantic segmentation. In Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA, 12–15 March 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 1451–1460. [Google Scholar]
  45. Liu, S.; Huang, D.; Wang, Y. Receptive field block net for accurate and fast object detection. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 385–400. [Google Scholar]
  46. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar] [CrossRef]
  47. Tian, C.; Xu, Y.; Zuo, W.; Lin, C.W.; Zhang, D. Asymmetric CNN for image superresolution. IEEE Trans. Syst. Man Cybern. Syst. 2021, 52, 3718–3730. [Google Scholar] [CrossRef]
48. Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C.; Xu, C. GhostNet: More features from cheap operations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 1580–1589. [Google Scholar]
  49. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  50. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11534–11542. [Google Scholar]
51. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  52. Ouyang, D.; He, S.; Zhang, G.; Luo, M.; Guo, H.; Zhan, J.; Huang, Z. Efficient multi-scale attention module with cross-spatial learning. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–5. [Google Scholar]
53. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10781–10790. [Google Scholar]
54. Wang, J.; Chen, K.; Xu, R.; Liu, Z.; Loy, C.C.; Lin, D. CARAFE: Content-aware reassembly of features. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3007–3016. [Google Scholar]
55. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs beat YOLOs on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 16965–16974. [Google Scholar]
  56. Wang, Q.; Dong, H.; Huang, H. Swin-Transformer-YOLOv5 for lightweight hot-rolled steel strips surface defect detection algorithm. PLoS ONE 2024, 19, e0292082. [Google Scholar]
  57. Wang, Y.; Cheng, H.; Shi, J.; Du, X.; Wang, P. Co-training encoder-based transformer method for steel surface defect detection. J. Electron. Imaging 2025, 34, 023058. [Google Scholar] [CrossRef]
  58. Lv, Z.; Zhao, Z.; Xia, K.; Gu, G.; Liu, K.; Chen, X. Steel surface defect detection based on MobileViTv2 and YOLOv8. J. Supercomput. 2024, 80, 18919–18941. [Google Scholar] [CrossRef]
  59. Hou, S.; He, H.; Peng, K.; Qiao, S. Improved Swin Transformer-Based Model for Hot-Rolled Strip Defect Detecting. Comput. Inform. 2024, 43, 1352–1371. [Google Scholar] [CrossRef]
Figure 1. Architecture of the YOLOv11 Model. The model comprises three main components: the backbone (feature extraction with C3k2, SPPF, and C2PSA modules), neck (multi-scale feature fusion), and head (detection output).
Figure 2. Improved YOLOv11 model architecture. The backbone integrates C3k2_THK modules with varying convolution kernel sizes across different network depths. The neck features a VOVDGSCSP architecture where shallow layers use DGSConv-L while deeper layers employ DGSConv-H. The detection module employs MSDetect, utilizing lightweight MRFB-L for classification and conventional MRFB for regression operations.
Figure 3. Structure of T-shaped convolution and THK convolution. (a) The different convolution kernel sizes utilized across various hierarchical levels; (b) the structural design of the original T-shaped convolution; (c) the structural design of the proposed THK convolution.
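To pin down the structure described in panels (b) and (c), the following is a minimal PyTorch sketch of a partial (T-shaped) convolution combined with stage-wise heterogeneous kernels. The 1/4 channel split and the 64-channel stem are illustrative assumptions, not the paper's verified settings.

import torch
import torch.nn as nn

class TShapedConv(nn.Module):
    # Partial ("T-shaped") convolution: a k x k kernel touches only a slice
    # of the channels; the rest pass through unchanged, which is what saves
    # parameters and FLOPs relative to a standard convolution.
    def __init__(self, channels: int, kernel_size: int = 3, part_ratio: float = 0.25):
        super().__init__()
        self.conv_ch = max(1, int(channels * part_ratio))  # convolved slice
        self.conv = nn.Conv2d(self.conv_ch, self.conv_ch, kernel_size,
                              padding=kernel_size // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = x[:, :self.conv_ch], x[:, self.conv_ch:]
        return torch.cat((self.conv(x1), x2), dim=1)

# Heterogeneous kernel selection (HKS): deeper stages receive larger kernels,
# mirroring the [3, 5, 7, 9] progression of panel (a).
stages = [TShapedConv(64 * 2 ** i, k) for i, k in enumerate((3, 5, 7, 9))]

Because the large kernels only ever touch the convolved slice, widening them at deeper stages stays cheap, which is consistent with the small parameter deltas reported in Table 8 below.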
Figure 4. Structure of spatial and channel synergistic attention. SCSA integrates SMSA for hierarchical spatial feature extraction and PCSA for adaptive channel refinement, facilitating holistic feature optimization across dual-dimensional spaces, where GroupNorm-N represents group normalization with N groups.
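As a rough functional reference, the sketch below pairs a multi-scale depthwise spatial gate (standing in for SMSA) with channel self-attention over a pooled descriptor (standing in for PCSA). The (3, 5, 7, 9) kernels and the 7 × 7 pooled descriptor are assumptions made for illustration; consult the SCSA reference for the exact formulation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SCSASketch(nn.Module):
    # Schematic SCSA: not the published implementation, only its two stages.
    def __init__(self, channels: int, kernels=(3, 5, 7, 9)):
        super().__init__()
        assert channels % len(kernels) == 0
        g = channels // len(kernels)
        # SMSA stand-in: per-group depthwise 1-D convs along H and along W
        self.h_convs = nn.ModuleList(
            nn.Conv1d(g, g, k, padding=k // 2, groups=g) for k in kernels)
        self.w_convs = nn.ModuleList(
            nn.Conv1d(g, g, k, padding=k // 2, groups=g) for k in kernels)
        self.norm = nn.GroupNorm(4, channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        g = c // len(self.h_convs)
        xh, xw = x.mean(dim=3), x.mean(dim=2)  # (b, c, h) and (b, c, w)
        gh = torch.cat([m(t) for m, t in zip(self.h_convs, xh.split(g, 1))], 1)
        gw = torch.cat([m(t) for m, t in zip(self.w_convs, xw.split(g, 1))], 1)
        x = self.norm(x * torch.sigmoid(gh)[..., None] * torch.sigmoid(gw)[:, :, None, :])
        # PCSA stand-in: channel self-attention over a 7 x 7 pooled descriptor
        p = F.adaptive_avg_pool2d(x, 7).flatten(2)  # (b, c, 49)
        attn = torch.softmax(p @ p.transpose(1, 2) / p.size(-1) ** 0.5, dim=-1)
        return x * torch.sigmoid((attn @ p).mean(-1))[..., None, None]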
Figure 5. Hierarchical design of DGSconv modules. (a) Baseline GSconv. (b) DGSconv-L with dual convolution for low-level features. (c) DGSconv-H with dilated convolution for high-level features. (d) Structural details of dual and dilated convolution operations.
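The sketch below shows the baseline GSConv of panel (a) and where the two variants would diverge. The dual-branch and dilation choices are schematic stand-ins for panels (b)–(d), assuming a DualConv-style k × k plus 1 × 1 pair for DGSconv-L and a dilation rate of 2 for DGSconv-H; neither value is confirmed by the paper.

import torch
import torch.nn as nn

class DualConvSketch(nn.Module):
    # Illustrative dual convolution: parallel k x k and 1 x 1 kernels, summed.
    def __init__(self, c_in: int, c_out: int, k: int = 3):
        super().__init__()
        self.kxk = nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False)
        self.pw = nn.Conv2d(c_in, c_out, 1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.kxk(x) + self.pw(x)

class GSConvSketch(nn.Module):
    # GSConv: a dense conv makes half the output channels, a cheap 5x5
    # depthwise conv derives the other half, and a shuffle mixes the halves.
    def __init__(self, c_in: int, c_out: int, k: int = 3, mode: str = "base"):
        super().__init__()
        c_ = c_out // 2
        if mode == "low":      # DGSconv-L: dual-kernel dense branch (assumed)
            self.dense = DualConvSketch(c_in, c_, k)
        elif mode == "high":   # DGSconv-H: dilated dense branch widens the RF
            self.dense = nn.Conv2d(c_in, c_, k, padding=(k // 2) * 2,
                                   dilation=2, bias=False)
        else:                  # baseline GSconv dense branch
            self.dense = nn.Conv2d(c_in, c_, k, padding=k // 2, bias=False)
        self.cheap = nn.Conv2d(c_, c_, 5, padding=2, groups=c_, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y1 = self.dense(x)
        y = torch.cat((y1, self.cheap(y1)), dim=1)
        b, c, h, w = y.shape  # channel shuffle across the two halves
        return y.view(b, 2, c // 2, h, w).transpose(1, 2).reshape(b, c, h, w)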
Figure 6. GMLCA attention mechanism architecture. The mechanism splits input features into global (GAP) and local (LAP/UNAP) branches, then fuses them through sigmoid gating with complementary weights, unifying channel and spatial modeling while reducing computational cost.
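Since GMLCA is introduced in this paper, the sketch below is only a plausible reading of the figure, assuming a 5 × 5 local pooling grid, k = 3 channel convolutions, and a single learnable gate; none of these values are confirmed by the source.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GMLCASketch(nn.Module):
    # Gated mixed local channel attention, as read from the figure: a global
    # branch (GAP) and a local branch (LAP, later un-pooled, i.e., UNAP) are
    # each refined by a cheap 1-D conv across channels and fused through a
    # sigmoid gate with complementary weights g and (1 - g).
    def __init__(self, channels: int, local_size: int = 5, k: int = 3):
        super().__init__()
        self.local_size = local_size
        self.conv_g = nn.Conv1d(1, 1, k, padding=k // 2, bias=False)
        self.conv_l = nn.Conv1d(1, 1, k, padding=k // 2, bias=False)
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        s = self.local_size
        # global branch: GAP -> 1-D conv over the channel axis -> (b, c, 1, 1)
        g = torch.sigmoid(self.conv_g(x.mean((2, 3)).unsqueeze(1)).squeeze(1))
        g = g[..., None, None]
        # local branch: LAP to an s x s grid, per-cell channel conv, then UNAP
        l = F.adaptive_avg_pool2d(x, s)
        l = self.conv_l(l.permute(0, 2, 3, 1).reshape(-1, 1, c))
        l = l.reshape(b, s, s, c).permute(0, 3, 1, 2)
        l = torch.sigmoid(F.interpolate(l, size=(h, w), mode="nearest"))
        gamma = torch.sigmoid(self.gate)  # complementary fusion weight
        return x * (gamma * g + (1 - gamma) * l)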
Figure 7. MRFB module and MSDetect head architecture. The figure illustrates (a) the multi-scale receptive field block (MRFB) that captures multi-scale features through parallel convolutions of different kernel sizes and (b) a comparison between YOLOv11’s detection head and the proposed MSDetect head, which integrates MRFB and MRFB-L modules for enhanced performance.
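As a structural reference for panel (a), here is a minimal sketch of a multi-scale receptive field block using the best (0, 3, 5, 7) kernel set from Table 7 below. The concatenation-plus-1×1 fusion and the depthwise reading of MRFB-L are assumptions for illustration.

import torch
import torch.nn as nn

class MRFBSketch(nn.Module):
    # Parallel branches with growing kernel sizes capture multi-scale context;
    # kernel 0 is the identity branch. A 1x1 conv fuses the concatenated maps.
    def __init__(self, channels: int, kernels=(0, 3, 5, 7), lightweight: bool = False):
        super().__init__()
        groups = channels if lightweight else 1  # MRFB-L as depthwise (assumed)
        self.branches = nn.ModuleList(
            nn.Identity() if k == 0 else
            nn.Conv2d(channels, channels, k, padding=k // 2, groups=groups, bias=False)
            for k in kernels)
        self.fuse = nn.Conv2d(channels * len(kernels), channels, 1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fuse(torch.cat([branch(x) for branch in self.branches], dim=1))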
Figure 8. Comparison of detection results of multiple models on the NEU-DET dataset.
Figure 9. Comparison of detection results of multiple models on the NEU-DET dataset.
Figure 10. Comparison of detection results of multiple models on GC10-DET.
Figure 11. Comparison of detection results of multiple models on Severstal-Steel-Defect.
Table 1. Training parameter settings.

Parameter | Value
Epochs | 400
Batch size | 16
Optimizer | AdamW
Input size | 640 × 640
Close mosaic | 10
Learning rate | 0.001
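For readers who want to reproduce these settings, the snippet below maps Table 1 onto the standard Ultralytics training interface. The model and dataset YAML file names are placeholders, since the paper's ELS-YOLO configuration files are not published here.

from ultralytics import YOLO

# Hypothetical reproduction of the Table 1 hyperparameters; "els-yolo.yaml"
# and "neu-det.yaml" are placeholder config files, not released artifacts.
model = YOLO("els-yolo.yaml")
model.train(
    data="neu-det.yaml",  # dataset definition (placeholder)
    epochs=400,           # Epochs
    batch=16,             # Batch size
    optimizer="AdamW",    # Optimizer
    imgsz=640,            # Input size 640 x 640
    close_mosaic=10,      # disable mosaic augmentation for the final 10 epochs
    lr0=0.001,            # initial learning rate
)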
Table 2. Comparative experiments between T-shaped and other convolution methods. DS Conv represents depthwise separable convolution.

Method | mAP@50 (%) | Param (M) | FLOPs (G)
Standard Conv | 76.4 | 2.58 | 6.3
T-shaped Conv | 76.6 | 2.50 | 6.1
DS Conv | 75.6 | 2.48 | 6.1
Asymmetric Conv | 76.2 | 2.54 | 6.2
Ghost Conv | 75.7 | 2.49 | 6.1
Table 3. Comparative studies on model accuracy under different attention mechanisms. SCSA (x) denotes the SCSA attention mechanism with x heads, where x is set to 4, 8, or 16.

Method | mAP@50 (%) | Param (M) | FLOPs (G)
THK | 77.9 | 2.54 | 6.2
+SE | 78.1 | 2.54 | 6.2
+ECA | 76.9 | 2.54 | 6.2
+CBAM | 76.8 | 2.54 | 6.2
+EMA | 74.9 | 2.54 | 6.2
+SCSA (4) | 76.6 | 2.54 | 6.2
+SCSA (8) | 78.5 (+0.6) | 2.54 | 6.2
+SCSA (16) | 77.7 | 2.54 | 6.2
Table 4. Comparative experiments between Slim-Neck and other neck architectures.

Method | mAP@50 (%) | Param (M) | FLOPs (G)
Original Neck | 76.4 | 2.58 | 6.3
Slim-Neck | 76.8 | 2.46 | 6.1
T-shape Conv | 76.8 | 2.48 | 6.1
BiFPN | 76.1 | 2.58 | 6.3
CARAFE | 74.6 | 2.72 | 6.6
Table 5. Comparative studies on model accuracy with different convolutions at different stages.

Low-Level | High-Level | mAP@50 (%) | Param (M) | FLOPs (G)
GSconv | GSconv | 76.8 | 2.46 | 6.1
DGSconv-L | GSconv | 77.0 | 2.42 | 5.9
DGSconv-L | DGSconv-L | 77.8 | 2.36 | 5.9
GSconv | DGSconv-H | 77.6 | 2.46 | 6.1
DGSconv-H | DGSconv-H | 77.5 | 2.46 | 6.1
DGSconv-L | DGSconv-H | 77.9 (+1.1) | 2.42 | 5.9
Table 6. Comparative studies on model accuracy under different attention mechanisms applied to the Staged-Slim-Neck.

Method | mAP@50 (%) | Param (M) | FLOPs (G)
Staged-Slim-Neck | 77.9 | 2.42 | 5.9
+SE | 77.2 | 2.42 | 5.9
+ECA | 76.6 | 2.42 | 5.9
+SCSA | 77.0 | 2.42 | 5.9
+MLCA | 77.4 | 2.42 | 6.0
+GMLCA | 78.5 (+0.6) | 2.42 | 6.0
Table 7. MRFB performance under different kernel combinations. Numbers in parentheses represent the kernel sizes of each branch, where 0 denotes the identity mapping branch. MRFB is applied only to the regression branch, while MRFB-L is applied only to the classification branch.

Method | mAP@50 (%) | Param (M) | FLOPs (G)
MRFB (3, 5, 7, 9) | 76.4 | 2.58 | 6.3
MRFB (0, 3, 5, 7) | 77.6 (+1.2) | 2.54 | 6.1
MRFB (0, 5, 7, 9) | 77.0 | 2.60 | 6.4
MRFB (0, 3, 5, 9) | 77.0 | 2.57 | 6.3
MRFB-L (3, 5, 7, 9) | 77.0 | 2.58 | 6.4
MRFB-L (0, 3, 5, 7) | 77.8 (+1.4) | 2.58 | 6.3
MRFB-L (0, 5, 7, 9) | 77.4 | 2.58 | 6.4
MRFB-L (0, 3, 5, 9) | 76.0 | 2.58 | 6.3
Table 8. Ablation experiments on the C3k2_THK improvement points. Throughout all tables in this paper, check marks (✓) indicate the adoption of the corresponding structures/components.

T-shaped Conv | SiLU | HKS | SCSA (8) | mAP@50 (%) | Param (M) | FLOPs (G)
– | – | – | – | 76.4 | 2.58 | 6.3
✓ | – | – | – | 76.6 | 2.50 | 6.1
– | ✓ | – | – | 76.4 | 2.58 | 6.3
– | – | ✓ | – | 74.9 | 3.26 | 7.2
– | – | – | ✓ | 77.5 | 2.58 | 6.3
✓ | ✓ | – | – | 77.3 | 2.50 | 6.1
✓ | – | ✓ | – | 75.9 | 2.54 | 6.2
✓ | – | – | ✓ | 76.9 | 2.50 | 6.1
– | ✓ | ✓ | – | 74.9 | 3.26 | 7.2
– | ✓ | – | ✓ | 77.5 | 2.58 | 6.3
– | – | ✓ | ✓ | 76.9 | 3.26 | 7.2
✓ | ✓ | ✓ | – | 77.9 | 2.54 | 6.2
✓ | ✓ | – | ✓ | 78.1 | 2.50 | 6.1
✓ | – | ✓ | ✓ | 78.2 | 2.54 | 6.2
– | ✓ | ✓ | ✓ | 76.1 | 3.26 | 7.2
✓ | ✓ | ✓ | ✓ | 78.5 (+2.1) | 2.54 | 6.2
Table 9. Ablation experiments on the Staged-Slim-Neck improvement points. DGSConv-L, DGSConv-H, and GMLCA represent the use of DGSConv-L in low stages, the use of DGSConv-H in high stages, and the use of the GMLCA attention mechanism in high stages, respectively.

Slim-Neck | DGSConv-L | DGSConv-H | GMLCA | mAP@50 (%) | Param (M) | FLOPs (G)
– | – | – | – | 76.4 | 2.58 | 6.3
✓ | – | – | – | 76.8 | 2.46 | 6.1
✓ | ✓ | – | – | 77.0 | 2.42 | 5.9
✓ | – | ✓ | – | 77.6 | 2.46 | 6.1
✓ | – | – | ✓ | 77.4 | 2.46 | 6.1
✓ | ✓ | ✓ | – | 77.9 | 2.42 | 5.9
✓ | – | ✓ | ✓ | 78.1 | 2.46 | 6.1
✓ | ✓ | – | ✓ | 77.0 | 2.42 | 6.0
✓ | ✓ | ✓ | ✓ | 78.5 (+2.1) | 2.42 | 6.0
Table 10. Ablation study on MSDetect improvements. The baseline YOLOv11n detection head uses standard convolutions (Conv) for regression and depthwise separable convolutions (DSC) for classification; following Figure 2 and Table 7, MRFB replaces the regression convolutions and MRFB-L replaces the classification convolutions.

Regression Branch | Classification Branch | mAP@50 (%) | Param (M) | FLOPs (G)
Conv | DSC | 76.4 | 2.58 | 6.3
Conv | MRFB-L | 77.8 | 2.58 | 6.3
MRFB | DSC | 77.6 | 2.54 | 6.1
MRFB | MRFB-L | 78.1 (+1.7) | 2.55 | 6.1
Table 11. Ablation study results for ELS-YOLO. The first row shows the YOLOv11n baseline results.

C3k2_THK | Staged-Slim-Neck | MRFB | mAP@50 (%) | Param (M) | FLOPs (G)
– | – | – | 76.4 | 2.58 | 6.3
✓ | – | – | 78.5 | 2.54 | 6.2
– | ✓ | – | 78.5 | 2.42 | 6.0
– | – | ✓ | 78.1 | 2.55 | 6.1
✓ | ✓ | – | 79.2 | 2.39 | 5.8
✓ | – | ✓ | 76.6 | 2.51 | 6.0
– | ✓ | ✓ | 78.2 | 2.39 | 5.8
✓ | ✓ | ✓ | 79.5 (+3.1) | 2.36 | 5.6
Table 12. Experimental results of mainstream object detection methods on the NEU-DET dataset. CR, IN, PA, PS, RS, and SC denote the per-class AP@50 (%) for crazing, inclusion, patches, pitted surface, rolled-in scale, and scratches, respectively.

Model | CR | IN | PA | PS | RS | SC | mAP@50 (%) | mAP@50–95 (%) | Param (M) | FLOPs (G)
YOLOv5 | 38.1 | 82.2 | 91.2 | 83.6 | 62.3 | 90.0 | 74.6 | 42.2 | 2.50 | 7.1
YOLOv8n | 39.4 | 83.7 | 90.6 | 83.3 | 66.7 | 88.9 | 75.4 | 43.2 | 3.00 | 8.1
YOLOv10n | 36.4 | 77.4 | 91.2 | 85.2 | 57.5 | 86.2 | 72.3 | 41.2 | 2.69 | 8.2
YOLOv11n | 40.4 | 83.9 | 91.2 | 86.8 | 67.6 | 88.8 | 76.4 | 44.1 | 2.58 | 6.3
RT-DETR-l [55] | 29.8 | 74.9 | 86.3 | 77.1 | 58.8 | 85.5 | 68.7 | 35.5 | 31.99 | 103.5
ESS-Yolov5 [27] | 59.1 | 83.7 | 94.9 | 89.1 | 53.5 | 92.2 | 78.8 | – | 7.07 | –
GDM-YOLO [26] | – | – | – | – | – | – | 79.3 | – | 9.02 | 8.1
SwinYOLO [56] | 40.3 | 82.5 | 91.2 | 85.8 | 60.4 | 89.1 | 74.9 | – | 4.49 | 9.9
CE-DETR [57] | 44.2 | 75.1 | 91.9 | 91.6 | 75.3 | 93.3 | 78.6 | – | 67.87 | –
MobileViT-YOLOv8 [58] | 48.6 | 77.9 | 92.6 | 78.4 | 64.3 | 82.8 | 74.1 | – | 27.5 | 34.9
RTCN [59] | 41.0 | 69.7 | 87.8 | 72.4 | 55.1 | 79.5 | 67.6 | – | – | –
Ours | 47.7 | 84.8 | 92.4 | 89.3 | 70.3 | 90.2 | 79.5 | 43.2 | 2.36 | 5.6
Table 13. Experimental results of ELS-YOLO and mainstream detectors on the GC10-DET dataset. PH, WL, CG, WS, OS, SS, IC, RP, CR, and WF denote the per-class AP@50 (%) for punching hole, weld line, crescent gap, water spot, oil spot, silk spot, inclusion, rolled pit, crease, and waist folding, respectively.

Model | PH | WL | CG | WS | OS | SS | IC | RP | CR | WF | mAP@50 (%) | mAP@50–95 (%)
YOLOv5 | 70.2 | 54.8 | 57.5 | 69.0 | 38.8 | 49.5 | 18.1 | 19.0 | 30.6 | 99.5 | 50.7 | 29.8
YOLOv8n | 75.6 | 47.8 | 55.4 | 69.9 | 39.8 | 44.6 | 13.7 | 12.1 | 31.3 | 99.5 | 49.0 | 27.8
YOLOv10n | 79.2 | 48.7 | 50.3 | 64.6 | 39.1 | 45.2 | 13.8 | 3.38 | 23.9 | 99.5 | 46.8 | 24.8
YOLOv11n | 71.7 | 59.6 | 59.8 | 67.5 | 34.1 | 51.3 | 16.6 | 12.8 | 43.3 | 99.5 | 51.6 | 30.4
RT-DETR-l | 72.0 | 70.2 | 47.3 | 63.1 | 39.9 | 46.9 | 13.9 | 13.2 | 43.6 | 2.93 | 41.3 | 19.4
ELS-YOLO | 78.9 | 53.7 | 59.3 | 73.1 | 37.2 | 51.3 | 18.9 | 31.1 | 37.4 | 99.5 | 54.0 | 26.7
Table 14. Experimental results of ELS-YOLO and mainstream detectors on the Severstal-Steel-Defect dataset. Per-class columns report AP@50 (%).

Model | Scratch | Inclusion | Scale | Rust | mAP@50 (%) | mAP@50–95 (%)
YOLOv5 | 49.3 | 27.3 | 62.7 | 52.6 | 48.0 | 22.4
YOLOv8n | 50.9 | 29.5 | 62.4 | 52.5 | 48.8 | 22.6
YOLOv10n | 48.6 | 26.6 | 62.9 | 48.6 | 46.7 | 21.7
YOLOv11n | 52.1 | 28.6 | 63.6 | 52.3 | 49.1 | 22.7
RT-DETR-l | 52.5 | 20.7 | 62.5 | 51.4 | 46.8 | 21.5
ELS-YOLO | 48.7 | 34.8 | 61.4 | 53.2 | 49.5 | 22.4