Article

YOLOv11-LLR: An Enhanced Framework for Steel Surface Defect Detection in Industrial Settings

1
Xinjiang Tianye (Group) Co., Ltd., Shihezi 832003, China
2
College of Information Science and Technology, Shihezi University, Shihezi 832000, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2026, 16(10), 4609; https://doi.org/10.3390/app16104609
Submission received: 16 March 2026 / Revised: 27 April 2026 / Accepted: 30 April 2026 / Published: 7 May 2026
(This article belongs to the Special Issue AI in Object Detection)

Abstract

Steel surface defects in manufacturing are typically tiny, low-contrast, and boundary-ambiguous, especially under complex textures (e.g., rolling marks, crazing), poor illumination, and high noise. These characteristics cause frequent missed detections and localization errors, particularly for defects with large-scale variations. Existing detectors, including YOLOv11, lack sufficient local spatial modeling for deformed or blurred boundaries and suffer from limited cross-scale feature interaction, leading to suboptimal performance on industrial benchmarks. To overcome these limitations, we propose YOLOv11-LLR—a YOLOv11-based framework that jointly enhances multi-scale feature modeling and inference efficiency. YOLOv11-LLR synergistically integrates three modules: Deformable Large Kernel Attention (DLKA) for adaptive local spatial perception, Lightweight Group-wise Attention (LWGA) for cross-scale interaction, and Re-parameterized Convolution (RepConv) for deployment-friendly speed. We evaluate on two representative datasets: NEU-DET (six defect types on hot-rolled steel strips) and GC10-DET (ten defect types with higher background complexity). Compared to baseline YOLOv11, YOLOv11-LLR achieves +3.5% mAP@0.5 (80.2%→83.7%) and +2.4% mAP@0.5:0.95 (48.7%→51.1%) on NEU-DET, and larger gains of +9.8% (61.0%→70.8%) and +3.4% (33.4%→36.8%) on the more challenging GC10-DET. These results demonstrate that YOLOv11-LLR provides an effective, robust, and industrially deployable solution for steel surface defect detection under complex textures, noise, and multi-scale variations.

1. Introduction

Under the paradigm of Industry 4.0 [1], intelligent manufacturing systems demand real-time, accurate, and deployable visual inspection solutions. Steel surface defect detection plays a critical role in ensuring product quality and production safety, especially in high-risk industries such as steel manufacturing. However, steel defects are often tiny, low-contrast, and boundary-ambiguous under complex textures (e.g., rolling marks, crazing), challenging illumination, and noisy production environments. These characteristics commonly lead to missed detections and inaccurate localization, particularly when defects exhibit large-scale variations.
Traditional defect detection methods rely on handcrafted features (grayscale, edges, texture) [2] or shallow machine learning (e.g., K-means, Fuzzy C-means) [3,4]. These approaches are vulnerable to complex backgrounds and low contrast, and their performance is limited by the poor generalization of handcrafted features. With the advent of deep learning, Convolutional Neural Networks (CNNs) have demonstrated significant advantages in feature extraction and representation learning [5,6]. More recently, Transformer-based detectors such as DETR [7] and RT-DETR [8] have achieved remarkable progress in general object detection, but their high computational cost restricts their adoption in industrial real-time scenarios. Consequently, the YOLO (You Only Look Once) series [9] has become the mainstream choice for industrial vision tasks due to its favorable trade-off between accuracy and inference efficiency. Iterations from YOLOv5 to YOLOv11 [10,11,12,13,14,15,16,17,18,19] have continuously improved feature pyramids, attention mechanisms, and structural re-parameterization; nevertheless, challenges persist for steel surface defect detection.
In recent years, many YOLO-based improvements have been proposed for steel surface defect detection, focusing on lightweight design [20,21,22,23], attention enhancement [24,25,26,27], and multi-scale feature fusion [28,29,30,31,32,33,34]. Despite these efforts, existing YOLO-based methods still suffer from three major limitations. First, they provide insufficient local spatial modeling for deformed or blurred defect boundaries, which are common on steel surfaces. Second, their cross-scale feature interaction during multi-level fusion remains limited, especially under complex and low-contrast backgrounds. Third, structural lightweighting often discards fine-grained details, which reduces detection accuracy for small defects [35]. To the best of our knowledge, no existing YOLO-based method simultaneously addresses local spatial adaptability, cross-scale feature interaction, and inference efficiency in a unified framework for steel surface defect detection. The following scientific challenges therefore remain unresolved: how to adaptively model spatial variations of defects at different scales without increasing computational overhead; how to effectively integrate multi-scale features across layers while preserving low-level detail; and how to decouple training-time representational capacity from inference-time complexity for real-time industrial deployment.
To tackle these challenges, we propose YOLOv11-LLR, an enhanced detection framework built upon YOLOv11. YOLOv11-LLR integrates three complementary modules: Deformable Large Kernel Attention (DLKA) [36] for adaptive spatial modeling of deformed and blurred defects; Lightweight Group-wise Attention (LWGA) [37] for efficient cross-scale feature interaction; and Re-parameterized Convolution (RepConv) [38] to decouple training and inference, preserving representational power while enabling fast deployment. We evaluate YOLOv11-LLR on two representative industrial datasets: NEU-DET (six defect types) and GC10-DET (ten defect types). Experimental results demonstrate consistent and substantial improvements over baseline YOLOv11 and other state-of-the-art YOLO variants, confirming the effectiveness and robustness of the proposed approach. The remainder of this paper is organized as follows. Section 2 presents the YOLOv11-LLR framework in detail. Section 3 describes the datasets and experimental setup, then reports and discusses the results, including ablation studies. Section 4 concludes the paper and outlines future work.

2. Materials and Methods

In this study, YOLOv11 was chosen as the base detector for steel surface defect detection because it balances accuracy, inference speed, and computational cost, which suits deployment-oriented industrial applications. Compared with earlier versions in the YOLO series, YOLOv11 improves multi-scale feature fusion and uses an anchor-free prediction mechanism, so it can process complex images quickly while maintaining high accuracy for targets of varying scales. Although it performs well on high-contrast targets and complex backgrounds, YOLOv11 still has limitations for this task: insufficient spatial modeling capability, weak cross-layer information fusion, and difficulty in handling blurred boundaries or deformed targets. To address these issues, this paper introduces a set of targeted optimizations, constructing an enhanced detection framework named YOLOv11-LLR (YOLOv11 with DLKA, LWGA, and RepConv). The framework integrates three key modules, Deformable Large Kernel Attention (DLKA), Lightweight Group-wise Attention (LWGA), and Re-parameterized Convolution (RepConv), to enhance YOLOv11’s spatial modeling, cross-scale feature interaction, and inference efficiency. The overall architecture of YOLOv11 is shown in Figure 1.
The introduction of these modules allows YOLOv11-LLR to overcome the limitations of YOLOv11 in multi-scale feature transmission and cross-layer information fusion. Specifically, DLKA enhances the model’s ability to model deformed structures and blurred boundaries commonly found in steel surface defects; LWGA strengthens the interaction of multi-scale features, enabling the model to handle defects of varying scales and textures more effectively; and RepConv optimizes inference efficiency, making the model more suitable for large-scale industrial applications. These optimizations collectively improve YOLOv11-LLR’s performance in complex industrial defect detection tasks and ensure its efficiency in real-world deployment.

2.1. Overall Optimization Strategy

To overcome the limitations of YOLOv11, we introduce three complementary optimizations, each detailed in a subsequent subsection. First, local spatial modeling is enhanced via Deformable Large Kernel Attention (DLKA, Section 2.2), which is designed to capture deformed and blurred defect boundaries. Second, cross-scale feature interaction is achieved through Lightweight Group-wise Attention (LWGA, Section 2.3), enabling effective multi-scale information exchange at low computational cost. Third, inference efficiency is improved by Re-parameterized Convolution (RepConv, Section 2.4), which decouples training-time representational capacity from inference-time complexity. These modules are seamlessly integrated into YOLOv11’s backbone, neck, and convolutional layers respectively, as shown in Figure 2.

2.2. Deformable LKA Module

To enhance the model’s ability to represent deformed structures and blurred boundaries, we adopt the Deformable Large Kernel Attention (DLKA) module [36]. DLKA leverages deformable convolution to adaptively capture local spatial variations, which is beneficial for modeling complex textures and irregular defect shapes. Let $F \in \mathbb{R}^{C \times H \times W}$ denote the input feature map, where $C$, $H$, and $W$ are the channel number, height, and width.
First, a 1 × 1 convolution compresses the feature dimension while preserving spatial resolution:
$$F_1 = \mathrm{Conv}_{1 \times 1}(F) \in \mathbb{R}^{C \times H \times W}$$
Then a GELU activation introduces non-linearity:
$$F_2 = \mathrm{GELU}(F_1)$$
The core spatial modeling is performed using deformable convolution. For an output location $p$, the deformable convolution is defined as:
$$F_3(p) = \sum_{k=1}^{K} W_k \cdot F_2(p + p_k + \Delta p_k)$$
where $K$ is the number of sampling points (e.g., $K = 9$ for a 3 × 3 kernel), $p_k$ are the pre-defined regular offsets (e.g., $\{(-1,-1), (-1,0), \ldots, (1,1)\}$), $\Delta p_k$ are learnable offset fields, and $W_k$ are the corresponding convolutional weights. The offset fields $\Delta p_k$ are predicted from the input feature map via a separate convolutional layer; bilinear interpolation is used to make the sampling process differentiable.
To further enlarge the receptive field and capture long-range dependencies, a dilated deformable convolution is applied:
$$F_4(p) = \sum_{k=1}^{K} W_k \cdot F_3(p + r \cdot p_k + \Delta p_k)$$
Here, $r$ denotes the dilation rate (set to $r = 2$ in our experiments). This formulation integrates dilated convolution and deformable convolution by multiplying the fixed offsets $p_k$ by the dilation rate while keeping the learnable offsets $\Delta p_k$ unchanged. It allows the model to sample from a wider context without increasing the number of parameters.
An attention weight map is then generated by a 1 × 1 convolution followed by a sigmoid activation, and is used to re-weight the features:
$$A = \sigma(\mathrm{Conv}_{1 \times 1}(F_4)), \quad F_a = A \otimes F_4$$
where $\sigma(\cdot)$ is the sigmoid function, $A \in \mathbb{R}^{C \times H \times W}$ is the attention map, and $\otimes$ denotes element-wise multiplication.
Finally, a residual connection adds the original input to the attended feature:
$$F_{output} = F + F_a$$
The entire DLKA structure is illustrated in Figure 3. This residual design preserves the original information while enhancing defect-relevant spatial responses.
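As a rough illustration of the (dilated) deformable sampling sum used in DLKA, the sketch below evaluates it at a single output location of a single-channel feature map, using bilinear interpolation for fractional positions. It is not the authors' implementation: the toy feature map, the averaging kernel, and the offset values are hypothetical, and the 1 × 1 convolutions, the offset-prediction layer, and the sigmoid gating are left out.

```python
import math

def bilinear(feat, y, x):
    """Sample a 2D list-of-lists feature map at fractional (y, x); zero outside."""
    h, w = len(feat), len(feat[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:
                weight = (1 - abs(y - yy)) * (1 - abs(x - xx))
                value += weight * feat[yy][xx]
    return value

# Regular 3x3 grid offsets p_k in row-major order.
GRID = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]

def deform_sample(feat, p, weights, offsets, r=1):
    """Evaluate sum_k W_k * F(p + r * p_k + dp_k) at one output location p."""
    py, px = p
    total = 0.0
    for w_k, (gy, gx), (oy, ox) in zip(weights, GRID, offsets):
        total += w_k * bilinear(feat, py + r * gy + oy, px + r * gx + ox)
    return total

# Toy 5x5 feature map with value 5*i + j (hypothetical data).
feat = [[float(5 * i + j) for j in range(5)] for i in range(5)]
weights = [1.0 / 9] * 9          # hypothetical averaging kernel W_k
zero = [(0.0, 0.0)] * 9          # zero offsets: reduces to a regular convolution
shift = [(0.5, 0.0)] * 9         # hypothetical learned offsets: half a pixel down

plain = deform_sample(feat, (2, 2), weights, zero)          # regular 3x3 grid
dilated = deform_sample(feat, (2, 2), weights, zero, r=2)   # grid stretched by r = 2
shifted = deform_sample(feat, (2, 2), weights, shift)       # grid displaced by offsets
print(plain, dilated, shifted)
```

With zero offsets and r = 1 the function reduces to an ordinary 3 × 3 convolution at that location. Setting r = 2 widens the sampled neighbourhood with the same nine weights, which is the sense in which the dilated variant adds context without adding parameters.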

2.3. LWGA Lightweight Multi-Granularity Attention Module

The Lightweight Grouped Attention (LWGA) module [37] enhances multi-scale defect detection by splitting the input feature into four channel subgroups and processing each with a different attention branch: Point Attention (PA), Local Attention (LA), Medium-range Attention (MRA), and Global Attention (GA). Given $X \in \mathbb{R}^{C \times H \times W}$, the channels are split equally:
$$X_i = \mathrm{Split}(X), \quad i = 1, 2, 3, 4, \quad X_i \in \mathbb{R}^{\frac{C}{4} \times H \times W}$$
For the Point Attention branch, two successive 1 × 1 convolutions generate point-wise attention responses:
$$R_1 = \mathrm{PA}(X_1) = \mathrm{Conv}_{1 \times 1}(\mathrm{Conv}_{1 \times 1}(X_1))$$
The other three branches (LA, MRA, GA) perform local, medium-range, and global dependency modeling respectively, using different kernel sizes or pooling strategies. After obtaining the outputs of all four branches, they are concatenated along the channel dimension:
$$R = \mathrm{Concat}(R_1, R_2, R_3, R_4)$$
A Multi-Layer Perceptron (MLP) fuses the concatenated feature and enhances cross-channel interaction:
$$R' = \mathrm{MLP}(R)$$
Finally, a residual connection yields the output:
$$X_{out} = R' + X$$
Figure 4 illustrates the LWGA module. This design ensures robust multi-scale feature representation while maintaining low computational overhead.
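The split, branch, concatenate, fuse, and residual data flow of LWGA can be sketched in plain Python as follows. This only shows the wiring under simplifying assumptions. The text does not specify the internals of the LA, MRA, and GA branches, so they default to a hypothetical `identity` placeholder. The MLP is reduced to a single 1 × 1 channel-mixing layer. All weights are made up for the example.

```python
def split_channels(x, groups=4):
    """Split a list of channels into `groups` equal consecutive chunks."""
    n = len(x)
    assert n % groups == 0, "channel count must be divisible by the group count"
    step = n // groups
    return [x[g * step:(g + 1) * step] for g in range(groups)]

def conv1x1(x, weight):
    """1x1 convolution: mix channels independently at every pixel.
    x is a list of C_in channels (each H x W); weight is a C_out x C_in matrix."""
    h, w = len(x[0]), len(x[0][0])
    return [[[sum(weight[o][i] * x[i][r][c] for i in range(len(x)))
              for c in range(w)] for r in range(h)]
            for o in range(len(weight))]

def identity(x):
    """Placeholder for branches whose internals are not specified here."""
    return x

def lwga(x, pa_w1, pa_w2, mlp_w, la=identity, mra=identity, ga=identity):
    x1, x2, x3, x4 = split_channels(x, 4)
    r1 = conv1x1(conv1x1(x1, pa_w1), pa_w2)   # Point Attention: two 1x1 convs
    r = r1 + la(x2) + mra(x3) + ga(x4)        # list concat == channel concat
    r_fused = conv1x1(r, mlp_w)               # MLP simplified to one 1x1 layer
    h, w = len(x[0]), len(x[0][0])
    # Residual connection: X_out = R' + X
    return [[[r_fused[ch][i][j] + x[ch][i][j] for j in range(w)]
             for i in range(h)] for ch in range(len(x))]

# Toy input with C = 4, H = W = 2 (hypothetical values).
x = [[[float(ch * 4 + i * 2 + j) for j in range(2)] for i in range(2)]
     for ch in range(4)]
eye = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
out = lwga(x, pa_w1=[[2.0]], pa_w2=[[0.5]], mlp_w=eye)
print(len(out), len(out[0]), len(out[0][0]))
```

Because each branch only sees C/4 channels, the per-branch cost scales with a quarter of the channel width. That is the property the grouped design relies on to keep overhead low.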

2.4. Re-Parameterizable Convolution Structure (RepConv)

In industrial defect detection, the detection model is required to achieve both strong feature extraction capability and efficient inference speed. To address this, we introduce the Re-parameterizable Convolution (RepConv) module [38]. RepConv adopts a multi-branch structure during training to enhance representation capacity, and merges the branches into a single convolution during inference, thereby reducing computational overhead and improving deployment efficiency.
During training, RepConv consists of three parallel branches: a 3 × 3 convolution branch, a 1 × 1 convolution branch, and an identity mapping branch. The outputs of these branches are summed and activated as follows:
$$Y_{train} = \mathrm{SiLU}(\mathrm{Conv}_{3 \times 3}(X) + \mathrm{Conv}_{1 \times 1}(X) + X)$$
where $X$ denotes the input feature map, $Y_{train}$ the output during training, and $\mathrm{SiLU}$ the activation function. The identity branch helps preserve original feature information and improves training stability.
In the inference phase, the multi-branch structure is re-parameterized into a single 3 × 3 convolution followed by the same activation, enabling faster computation:
$$Y_{infer} = \mathrm{SiLU}(\mathrm{Conv}^{*}_{3 \times 3}(X))$$
where $\mathrm{Conv}^{*}_{3 \times 3}$ denotes the equivalent convolution after branch fusion.
Specifically, the weights of the three branches are fused into a single kernel:
$$W^{*} = \mathrm{Pad}(W_{3 \times 3}) + \mathrm{Pad}(W_{1 \times 1}) + \mathrm{Pad}(W_{id}), \quad b^{*} = b_{3 \times 3} + b_{1 \times 1} + b_{id}$$
where $W_{3 \times 3}$, $W_{1 \times 1}$, and $W_{id}$ represent the convolution kernels of the 3 × 3, 1 × 1, and identity branches, respectively; $b$ denotes bias terms. $\mathrm{Pad}(\cdot)$ denotes zero-padding used to align the kernel size to 3 × 3, such that all branch parameters can be merged into an equivalent convolution kernel. By applying this re-parameterization strategy, RepConv preserves training-time representational power while significantly reducing inference complexity, as illustrated in Figure 5.
In our implementation, RepConv replaces the standard 3 × 3 convolutional layers inside the C3k2 bottleneck blocks of YOLOv11. The modified block, termed C3k2_RepVGG, retains the original residual connections but substitutes each Conv-BN-SiLU sequence with a RepConv module, allowing the model to learn richer representations during training while executing as a plain VGG-style convolution during deployment.
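The branch fusion can be checked numerically on a toy case. The sketch below is a single-channel, plain-Python illustration with hypothetical weights, not the authors' code. Batch normalization is omitted, so the identity branch contributes no bias here. In a real RepConv each branch's BN statistics would first be folded into its kernel and bias.

```python
import math

def conv3x3(x, kernel, bias=0.0):
    """Single-channel 3x3 convolution (cross-correlation) with zero padding 1."""
    h, w = len(x), len(x[0])
    out = [[bias] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < h and 0 <= jj < w:
                        out[i][j] += kernel[di + 1][dj + 1] * x[ii][jj]
    return out

def silu(v):
    return v / (1.0 + math.exp(-v))

def pad_to_3x3(center):
    """Place a 1x1 weight at the centre of an otherwise zero 3x3 kernel."""
    return [[0.0, 0.0, 0.0], [0.0, center, 0.0], [0.0, 0.0, 0.0]]

def fuse(w3, b3, w1, b1):
    """W* = W_3x3 + Pad(W_1x1) + Pad(W_id); b* = b_3x3 + b_1x1 (b_id = 0 without BN)."""
    w1p, wid = pad_to_3x3(w1), pad_to_3x3(1.0)   # identity == 1x1 kernel with weight 1
    w_star = [[w3[i][j] + w1p[i][j] + wid[i][j] for j in range(3)] for i in range(3)]
    return w_star, b3 + b1

# Hypothetical branch parameters and a toy 4x4 input.
w3 = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.1], [0.2, 0.1, -0.3]]
b3, w1, b1 = 0.05, 0.7, -0.02
x = [[float((3 * i + 2 * j) % 5) for j in range(4)] for i in range(4)]

# Training-time form: three branches summed, then SiLU.
y3 = conv3x3(x, w3, b3)
y_train = [[silu(y3[i][j] + (w1 * x[i][j] + b1) + x[i][j]) for j in range(4)]
           for i in range(4)]

# Inference-time form: one fused 3x3 convolution, then SiLU.
w_star, b_star = fuse(w3, b3, w1, b1)
y_infer = [[silu(v) for v in row] for row in conv3x3(x, w_star, b_star)]

max_diff = max(abs(a - b) for ra, rb in zip(y_train, y_infer) for a, b in zip(ra, rb))
print("max |train - infer| =", max_diff)
```

The two forms agree up to floating-point rounding because convolution is linear in its kernel, so summing branch outputs is the same as convolving once with the summed kernels. The activation must sit outside the sum for this to hold.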

2.5. YOLOv11-LLR: Enhanced Object Detection Method Based on Multi-Module Optimization

To address the challenges of multi-scale defects, complex backgrounds, and blurred boundaries in steel surface defect detection, we propose the YOLOv11-LLR framework. Building upon YOLOv11, it integrates three innovative modules: Deformable Large Kernel Attention (DLKA), Lightweight Grouped Attention (LWGA), and Re-parameterized Convolution (RepConv).
Specifically, the DLKA module utilizes deformable convolutions to adapt to local spatial variations at different scales, enhancing the model’s ability to perceive complex defect structures and handle deformed or blurred boundaries. This boosts spatial adaptability and robustness for irregular defects. The LWGA module enhances multi-scale feature fusion by dividing input features into subgroups processed through four attention branches (PA, LA, MRA, GA). This ensures effective integration across scales, improving stability and adaptability for small or low-contrast defects. The RepConv module employs a multi-branch structure during training and merges into a single convolution at inference, enabling a “complex training, simple inference” strategy that preserves detection accuracy while reducing computational overhead—ideal for latency-sensitive industrial deployment.
Together, these three modules provide a comprehensive solution: DLKA enhances spatial modeling, LWGA strengthens multi-scale fusion, and RepConv optimizes inference efficiency. YOLOv11-LLR thus offers a more robust, precise, and efficient solution for steel surface defect detection while maintaining the efficiency of the original YOLOv11 architecture.

2.6. Rationale for Module Selection

The three modules are not arbitrarily chosen; they directly target the identified limitations of YOLOv11. DLKA is selected over standard large-kernel attention (LKA) or self-attention because deformable convolution adaptively samples along defect boundaries, which are often irregular and blurred. LWGA is preferred over conventional multi-head attention (e.g., MHSA) because its grouped design reduces computational cost while preserving multi-scale receptive fields, a property that is critical when small and large defects coexist. RepConv is adopted to decouple training and inference; unlike other lightweighting methods (depth-wise convolution, pruning), it maintains full representational power during training and merges branches at inference, achieving both accuracy and speed.

3. Results

3.1. Dataset

In this experiment, we adopted the NEU-DET dataset, widely used for steel surface defect detection and classification tasks, to train and test the proposed model.
The dataset was collected and organized by the Surface Inspection Laboratory of Northeastern University, covering six typical types of hot-rolled steel strip surface defects, including Crazing, Patches, Inclusion, Pitted Surface, Rolled-in Scale, and Scratches. The dataset contains a total of 1800 grayscale images, 300 for each defect category, with a uniform image size of 200 × 200 pixels. This dataset is representative in characterizing common defect types in industrial steel strips and has been widely used for evaluating the performance of defect detection algorithms. Detailed information of the dataset is shown in Table 1. The image samples cover a variety of typical surface defect types, with significant differences in the shape and direction (tilted, horizontal, vertical, etc.) of defects across categories, providing sufficient diversity for feature learning and generalization of the model. Although each category contains 300 images, the actual number of defect targets varies significantly, increasing the learning difficulty for the model on small-sample categories. To ensure training stability and evaluation fairness, the dataset was randomly split into training, validation, and test sets in an 8:1:1 ratio.
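A random 8:1:1 split of the 1800 NEU-DET images could be produced along the following lines. This is a minimal sketch: the file names are placeholders, and the seed used for splitting is an assumption, since the text does not state how the split was generated.

```python
import random

def split_dataset(items, ratios=(8, 1, 1), seed=0):
    """Shuffle items and split them into train/val/test by integer ratios."""
    items = list(items)
    rng = random.Random(seed)          # local RNG so the split is repeatable
    rng.shuffle(items)
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]     # remainder goes to the test set
    return train, val, test

# Hypothetical file names standing in for the 1800 NEU-DET images.
image_names = [f"img_{i:04d}.jpg" for i in range(1800)]
train, val, test = split_dataset(image_names)
print(len(train), len(val), len(test))  # 1440 180 180
```

A purely random split does not guarantee that each of the six categories keeps the same 8:1:1 proportion. A per-category (stratified) split is a common alternative when the class balance matters.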
In addition to NEU-DET [39], we also use GC10-DET [40] to test generalization in a more challenging setting. GC10-DET contains 2312 grayscale images covering 10 typical surface defect categories, and multiple defect types may appear in a single image, leading to higher background noise and cross-class ambiguity.
To evaluate the performance of the detection model, we selected four core metrics: Precision (P), Recall (R), mean Average Precision at an IoU threshold of 0.5 (mAP@0.5), and mean Average Precision averaged over IoU thresholds from 0.5 to 0.95 in steps of 0.05 (mAP@0.5:0.95). Precision is defined as $P = TP / (TP + FP)$ and Recall as $R = TP / (TP + FN)$, where $TP$ is the number of true positive samples correctly detected, $FP$ is the number of negative samples incorrectly identified as positive, and $FN$ is the number of positive samples that were not detected. In object detection tasks, P and R typically trade off against each other, so both must be considered when evaluating model performance. Average Precision (AP) is therefore introduced as a comprehensive metric that measures the overall detection capability of the model across recall levels. It is defined as the area under the Precision–Recall curve:
$$AP = \int_{0}^{1} P(R) \, dR$$
This definition follows the standard Pascal VOC evaluation protocol [41]. mAP@0.5 is the average of the APs of all categories at an IoU threshold of 0.5 and is commonly used to evaluate the overall accuracy of an object detection model. It is calculated as follows:
$$mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i$$
where $N$ represents the total number of categories and $AP_i$ represents the average precision for the i-th category. To obtain more stringent and comprehensive evaluation results, this paper also reports mAP@0.5:0.95, which averages the AP values obtained at IoU thresholds from 0.5 to 0.95 (step size 0.05). This metric evaluates the detection performance of the model more comprehensively under different localization precision requirements. The accuracy of a detection box is measured by the Intersection over Union (IoU), defined as the ratio of the area of intersection to the area of union between the predicted box and the ground truth box. When the IoU between a predicted box and a ground truth box exceeds a set threshold (e.g., 0.5), the prediction is judged as a true positive; otherwise, it is considered a false positive.
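The metric definitions above can be made concrete with a small sketch. The helper names are illustrative. The AP routine uses the monotone precision envelope (all-point interpolation), which is one common convention; the exact interpolation used by the evaluation code in this paper is not stated, so treat that choice as an assumption.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def precision_recall(tp, fp, fn):
    """P = TP / (TP + FP), R = TP / (TP + FN); 0.0 when a denominator is zero."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r

def average_precision(recalls, precisions):
    """Area under the P-R curve; recalls must be sorted in increasing order."""
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    for i in range(len(p) - 2, -1, -1):   # make precision non-increasing in recall
        p[i] = max(p[i], p[i + 1])
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))

def mean_ap(ap_per_class):
    """mAP = (1 / N) * sum_i AP_i over the N categories."""
    return sum(ap_per_class) / len(ap_per_class)

# IoU thresholds used for mAP@0.5:0.95 (0.50, 0.55, ..., 0.95).
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]

# Toy values for illustration only.
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))           # overlap 50, union 150
print(precision_recall(tp=8, fp=2, fn=2))
print(average_precision([0.5, 1.0], [1.0, 0.5]))
print(mean_ap([0.75, 0.25]))
```

For mAP@0.5:0.95, the same per-class AP computation is repeated with true positives re-assigned at each threshold in `IOU_THRESHOLDS`, and the resulting mAP values are averaged.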

3.2. Implementation Details

Our implementation is based on the official Ultralytics YOLOv11 repository using PyTorch 1.12.1 with CUDA 11.3. All models are trained from scratch (no pre-trained weights) on a single NVIDIA RTX 4060 GPU (8 GB VRAM). Input images are resized to 640 × 640. The batch size is set to 32, and training runs for 300 epochs. The AdamW optimizer is used with an initial learning rate of 0.001 and weight decay of 0. The learning rate follows the default cosine annealing schedule of YOLOv11. Data augmentation includes the default YOLOv11 pipeline: random horizontal flipping (probability 0.5), mosaic augmentation (probability 0.5), and HSV color jitter (hue = 0.015, saturation = 0.7, value = 0.4). Input normalization scales pixel values to [0, 1]. Mixed precision training (AMP) is enabled to accelerate training and reduce GPU memory usage. The data loader uses four worker threads. The random seed is fixed to the YOLO default (0) to ensure full reproducibility.
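For readers who want to reproduce this setup, the settings listed above map onto Ultralytics training arguments roughly as sketched below. The argument names are those of the Ultralytics trainer. The model and dataset file names in the commented usage are hypothetical, since the text does not give them, and the model scale is likewise left unspecified.

```python
# Training settings from this section, expressed as Ultralytics keyword arguments.
train_args = dict(
    imgsz=640,            # input resolution 640 x 640
    batch=32,
    epochs=300,
    optimizer="AdamW",
    lr0=0.001,            # initial learning rate
    weight_decay=0.0,
    cos_lr=True,          # cosine annealing schedule
    pretrained=False,     # train from scratch
    amp=True,             # mixed precision training
    workers=4,            # data loader workers
    seed=0,
    fliplr=0.5,           # random horizontal flip probability
    mosaic=0.5,           # mosaic augmentation probability
    hsv_h=0.015, hsv_s=0.7, hsv_v=0.4,   # HSV color jitter
)

# Usage sketch (requires the `ultralytics` package; file names are hypothetical):
# from ultralytics import YOLO
# model = YOLO("yolo11-llr.yaml")                 # hypothetical model definition
# model.train(data="neu-det.yaml", **train_args)  # hypothetical dataset config

for key, value in sorted(train_args.items()):
    print(f"{key} = {value}")
```

Passing every setting explicitly, rather than relying on library defaults, makes the run less sensitive to default values changing between library versions.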

3.3. Comparison with Widely Used Detectors on NEU-DET and GC10-DET

To comprehensively evaluate the effectiveness of the proposed YOLOv11-LLR model, comparative experiments were conducted on two representative steel surface defect detection datasets, namely NEU-DET and GC10-DET. These datasets differ significantly in defect distribution, background complexity, and category diversity, allowing for a thorough evaluation of the model’s robustness and generalization capability under diverse industrial scenarios.
The quantitative comparison results on the NEU-DET dataset are summarized in Table 2. As shown, YOLOv11-LLR achieves a Precision of 84.8%, Recall of 74.9%, mAP@0.5 of 83.7%, and mAP@0.5:0.95 of 51.1%, outperforming both traditional detectors and mainstream YOLO-based methods. Compared with the baseline YOLOv11, the proposed model improves mAP@0.5 by 3.5 percentage points and mAP@0.5:0.95 by 2.4 percentage points, indicating a significant enhancement in both coarse and fine-grained detection accuracy. Furthermore, YOLOv11-LLR consistently surpasses improved variants such as GFIF-YOLO, FD-YOLO11, and GDM-YOLO, demonstrating the effectiveness of the proposed multi-module optimization strategy.
The comparison results on the GC10-DET dataset are presented in Table 3. Despite the increased detection difficulty caused by multi-category overlap and low-contrast backgrounds, YOLOv11-LLR maintains stable performance, achieving 70.8% mAP@0.5 and 36.8% mAP@0.5:0.95, which are higher than those of most competing methods. Notably, Recall reaches 62.5%, reflecting the model’s strong ability to reduce missed detections in complex scenarios. These results indicate that YOLOv11-LLR exhibits superior robustness and adaptability when handling diverse and challenging industrial defect detection tasks.
In addition to quantitative evaluation, qualitative visualization results are shown in Figure 6 and Figure 7, which present the ground truth bounding boxes together with the detection outputs of YOLOv11 and YOLOv11-LLR on the NEU-DET and GC10-DET datasets, respectively. The proposed model produces detection boxes with higher confidence and more accurate localization, especially for small-scale defects and defects with blurred boundaries, further validating its practical applicability and reliability in real-world industrial environments.

3.4. Ablation Study

After demonstrating the overall superiority of YOLOv11-LLR over mainstream detectors, ablation experiments were conducted to further analyze the contribution of each proposed module and to explain the performance improvements achieved by the model. All ablation experiments were carried out on the NEU-DET dataset using YOLOv11 as the baseline model.
The experimental results of the ablation study are summarized in Table 4. As shown in Table 4, introducing DLKA alone leads to a noticeable improvement in Recall and mAP@0.5, indicating its effectiveness in enhancing spatial modeling and capturing deformed defect structures. Similarly, the LWGA module improves multi-scale feature fusion capability, resulting in a moderate increase in detection accuracy, particularly in Recall. However, when these two attention-based modules are used independently, the overall performance gain remains limited, and Precision shows a slight downward trend.
In contrast, the RepConv module demonstrates the most significant individual contribution. When RepConv is introduced alone, the model achieves a substantial increase in mAP@0.5, along with a marked improvement in Recall, highlighting its critical role in strengthening feature representation and improving detection robustness. This result confirms that Re-parameterized Convolution effectively enhances the model’s learning capacity and supports a simplified inference-time structure.
When DLKA, LWGA, and RepConv are jointly integrated, the model achieves the best overall performance. The full YOLOv11-LLR configuration attains 83.7% mAP@0.5 and 51.1% mAP@0.5:0.95, representing the highest values among all ablation settings. This configuration not only significantly improves Precision but also maintains a high Recall, demonstrating the strong complementarity and synergy among the three modules.
To further analyze the impact of each module at the category level, Table 5 presents the mAP@0.5 results for six defect categories. The results show that DLKA and LWGA are particularly effective in improving the detection performance of low-precision categories, such as Crazing and Rolled-in Scale, by enhancing spatial perception and cross-scale information interaction. Meanwhile, RepConv consistently boosts performance across high-precision categories, further confirming its decisive role in overall detection enhancement.
In summary, the ablation study verifies that each proposed module contributes positively to the performance of YOLOv11-LLR, while their combination yields the most significant improvement. These results provide a clear explanation for the superior performance observed in the comparative experiments and demonstrate the effectiveness of the proposed multi-module optimization strategy.

3.5. Inference Speed Evaluation

To evaluate the practical deployability of the proposed YOLOv11-LLR in industrial scenarios, we measured the inference latency of all model variants on a single NVIDIA RTX 4060 GPU with an input resolution of 640 × 640. Preprocessing and postprocessing times are excluded. As reported in Table 6, the baseline YOLOv11 achieves an inference time of 6.2 ms (161 FPS). Adding RepConv alone introduces negligible overhead (6.4 ms, 156 FPS), while LWGA increases the latency to 7.2 ms (139 FPS). The DLKA module, due to its deformable convolution operations, requires 8.2 ms (121 FPS). The full YOLOv11-LLR model, which integrates all three modules, runs at 9.4 ms (106 FPS). Although this is the slowest among the variants, it still comfortably exceeds the typical real-time requirement for industrial steel surface inspection (≥30 FPS). Therefore, YOLOv11-LLR offers a favorable trade-off between detection accuracy and inference speed, making it suitable for deployment in real-world production lines.
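The relation between the reported latencies and frame rates is simply FPS = 1000 / latency in milliseconds. The short sketch below recomputes it from the latencies quoted above and checks each variant against the 30 FPS requirement. The dictionary labels are informal names chosen for this example.

```python
# Per-image inference latency in milliseconds, as quoted in the text.
latency_ms = {
    "YOLOv11 (baseline)": 6.2,
    "+ RepConv": 6.4,
    "+ LWGA": 7.2,
    "+ DLKA": 8.2,
    "YOLOv11-LLR (full)": 9.4,
}

REAL_TIME_FPS = 30.0   # real-time threshold stated in the text

def fps(ms):
    """Frames per second for a given per-image latency in milliseconds."""
    return 1000.0 / ms

for name, ms in latency_ms.items():
    rate = fps(ms)
    print(f"{name}: {ms} ms -> {rate:.1f} FPS, real-time: {rate >= REAL_TIME_FPS}")

# Relative slowdown of the full model with respect to the baseline.
slowdown = latency_ms["YOLOv11-LLR (full)"] / latency_ms["YOLOv11 (baseline)"]
print(f"full model latency is {slowdown:.2f}x the baseline")
```

The recomputed values match the reported FPS figures to within rounding. Because these latencies exclude preprocessing and postprocessing, end-to-end throughput on a production line would be somewhat lower.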

3.6. Statistical Significance Analysis

To verify that the observed performance improvements are not due to random fluctuations, we conducted five independent training runs on the NEU-DET dataset using different random seeds. All other hyperparameters remained identical to those described in Section 3.2. Table 7 reports the mean and standard deviation of mAP@0.5 and mAP@0.5:0.95 across the five runs. The baseline YOLOv11 achieves 80.1% ± 0.3% in mAP@0.5 and 48.6% ± 0.2% in mAP@0.5:0.95, while YOLOv11-LLR achieves 83.6% ± 0.4% and 51.0% ± 0.3%, respectively. A paired two-tailed t-test performed between the baseline and YOLOv11-LLR yields p-values of 0.0023 for mAP@0.5 and 0.0018 for mAP@0.5:0.95, both well below the 0.01 significance level. These results confirm that the improvements introduced by YOLOv11-LLR are highly statistically significant, further demonstrating the reliability and reproducibility of our proposed method.
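For reference, the paired t statistic used in such a comparison can be computed as sketched below. The per-run scores in the example are placeholder numbers chosen for easy arithmetic, not the per-run results behind Table 7, which the text does not list. Obtaining a p-value additionally requires the t-distribution with n − 1 degrees of freedom, for example through a library routine such as SciPy's `ttest_rel`.

```python
import math

def paired_t(a, b):
    """Return (t statistic, degrees of freedom) for paired samples a and b."""
    assert len(a) == len(b) and len(a) > 1
    d = [y - x for x, y in zip(a, b)]                # per-pair differences
    n = len(d)
    mean = sum(d) / n
    var = sum((v - mean) ** 2 for v in d) / (n - 1)  # sample variance of differences
    if var == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean / math.sqrt(var / n), n - 1

# Placeholder scores for five paired runs (hypothetical, for illustration only).
baseline = [1.0, 2.0, 3.0, 4.0, 5.0]
improved = [2.0, 2.5, 4.5, 5.0, 6.5]

t_stat, dof = paired_t(baseline, improved)
print(f"t = {t_stat:.4f}, degrees of freedom = {dof}")

# Two-tailed critical value of the t-distribution for alpha = 0.01 and 4 degrees
# of freedom, taken from a standard t-table.
T_CRIT_001_DF4 = 4.604
print("significant at 0.01:", abs(t_stat) > T_CRIT_001_DF4)
```

Pairing is meaningful when run i of both models shares the same conditions, such as the same seed and data split. With only five pairs the test has four degrees of freedom, so it is sensitive to a single atypical run.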

4. Conclusions

This paper proposes YOLOv11-LLR for steel surface defect detection, addressing the limitations of existing YOLO-series detectors in handling multi-scale defects, complex textures, and low-contrast boundaries. Unlike prior efforts that primarily emphasize lightweight or single attention mechanisms, our work introduces a synergistic combination of three complementary modules: DLKA for adaptive spatial modeling of deformed defects, LWGA for efficient cross-scale feature interaction, and RepConv for decoupling training and inference. This design distinguishes itself from other YOLO variants (e.g., GFIF-YOLO, FD-YOLO11) by jointly enhancing local spatial adaptability and global multi-scale fusion within a unified framework, while preserving deployment efficiency.
Experimental results on NEU-DET and GC10-DET yield two key insights. First, the proposed model delivers consistent and substantial improvements over YOLOv11—+3.5% mAP@0.5 on NEU-DET and +9.8% on the more challenging GC10-DET—demonstrating robustness against background complexity and defect variability. Second, ablation studies (Table 4) show that RepConv contributes the largest individual gain, whereas the full combination achieves the best overall performance, indicating strong synergy among the three modules. Category-wise analysis (Table 5) further reveals that DLKA and LWGA are particularly beneficial for low-precision categories (e.g., Crazing, Rolled-in Scale), suggesting that spatial adaptability and cross-scale interaction are critical for difficult defect types.
Regarding Table 5, we acknowledge that deeper error analysis would further strengthen these conclusions. Performance variations across categories likely arise from three sources: (i) class imbalance in the dataset (e.g., Inclusion has 1011 defects versus 689 for Crazing), (ii) inherent ambiguity between visually similar defects (e.g., Patches vs. Rolled-in Scale), and (iii) model sensitivity to extremely small or highly elongated defects. Future work will incorporate per-class confidence calibration and confusion matrix analysis to systematically quantify these error sources.
In summary, YOLOv11-LLR provides an effective, robust, and industrially deployable solution for steel surface defect detection. Future directions include model weight reduction for edge devices, adaptive multi-scale fusion, and cross-domain transfer learning to broaden its applicability.

Author Contributions

J.L. (Jin Li): Conceptualization, methodology design, model implementation, experiment execution, data processing, result analysis, and preparation of the original manuscript draft. Y.Y.: Refinement of the research design, experimental validation, interpretation of results, and manuscript revision. R.G.: Data organization, experimental support, resource coordination, and investigation. Y.C.: Project supervision, methodological review, manuscript revision, project coordination, funding acquisition, and correspondence. Y.J.: Experimental analysis, result validation, and manuscript revision. K.W.: Software and experimental support, visualization preparation, and literature review. J.L. (Jinhuan Lu): Resource coordination, language polishing, and administrative and research support. All authors have read and agreed to the published version of the manuscript.

Funding

This study was funded by the Science and Technology Planning Projects of the Xinjiang Production and Construction Corps (Grant No. 2023AB020). The work was partially supported by XPCC Projects (Grant No. BTBKXM-2025-Y33), Shihezi University Projects (Grant No. RCZK2018C09), the Ideological and Political Special Projects of Shihezi University (Grant No. SZZX201906), the Collaborative Education Projects of the Ministry of Education of China (Grant No. 221001141130337), and the Supply-Demand Docking Employment and Education Projects of the Ministry of Education of China (Grant Nos. 2023122950976 and 2023122958628).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were analyzed in this study. These data can be found here: https://ieee-dataport.org/documents/neu-det (accessed on 20 January 2026).

Acknowledgments

The authors would like to thank the anonymous reviewers for their valuable comments and constructive suggestions, which helped in improving the quality of the paper.

Conflicts of Interest

Authors Jin Li and Yaohui Chang are employed by the company Xinjiang Tianye (Group) Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Winter, R.; Baskerville, R. Science of Business & Information Systems Engineering. Bus. Inf. Syst. Eng. 2010, 2, 269–270.
  2. Liu, J.; Tang, Q.; Wang, Y.; Lu, Y.; Zhang, Z. Defects’ geometric feature recognition based on infrared image edge detection. Infrared Phys. Technol. 2014, 67, 387–390.
  3. Dong, G.; Wang, Y.; Liu, S.; Wu, N.; Kong, X.; Chen, X.; Wang, Z. A New Method for Rapid Detection of Surface Defects on Complex Textured Tiles. J. Nondestruct. Eval. 2025, 44, 3.
  4. Zhang, X.; Han, X.; Fu, C. Comparison of Object Region Segmentation Algorithms of PCB Defect Detection. Trait. Signal 2023, 40, 797.
  5. Krizhevsky, A.; Sutskever, I.; Hinton, G. ImageNet Classification with Deep Convolutional Neural Networks; Curran Associates Inc.: Red Hook, NY, USA, 2012.
  6. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016.
  7. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers; Springer: Cham, Switzerland, 2020.
  8. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs Beat YOLOs on Real-time Object Detection. arXiv 2023, arXiv:2304.08069.
  9. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016.
  10. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6517–6525.
  11. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767.
  12. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934.
  13. Jocher, G. Ultralytics YOLOv5. 2020. Available online: https://github.com/ultralytics/yolov5 (accessed on 20 January 2026).
  14. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Li, Y.; et al. YOLOv6: A single-stage object detection framework for industrial applications. arXiv 2022, arXiv:2209.02976.
  15. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv 2022, arXiv:2207.02696.
  16. Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics YOLOv8. 2023. Available online: https://github.com/ultralytics/ultralytics (accessed on 21 January 2026).
  17. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YOLOX: Exceeding YOLO Series in 2021. arXiv 2021, arXiv:2107.08430.
  18. Wang, C.Y.; Yeh, I.H.; Liao, H.Y.M. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information; Springer Nature: Cham, Switzerland, 2024.
  19. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-Time End-to-End Object Detection. Adv. Neural Inf. Process. Syst. 2024, 37, 107984–108011.
  20. Xie, W.; Ma, W.; Sun, X. An efficient re-parameterization feature pyramid network on YOLOv8 to the detection of steel surface defect. Neurocomputing 2025, 614, 128775.
  21. Chu, Y.; Yu, X.; Rong, X. A Lightweight Strip Steel Surface Defect Detection Network Based on Improved YOLOv8. Sensors 2024, 24, 6495.
  22. Lu, J.; Zhu, M.; Qin, K.; Ma, X. YOLO-LFPD: A Lightweight Method for Strip Surface Defect Detection. Biomimetics 2024, 9, 607.
  23. Tie, J.; Zhu, C.; Zheng, L.; Wang, H.; Ruan, C.; Wu, M.; Xu, K.; Liu, J. LSKA-YOLOv8: A lightweight steel surface defect detection algorithm based on YOLOv8 improvement. Alex. Eng. J. 2024, 109, 201–212.
  24. Li, X.; Xu, C.; Li, J.; Zhou, X.; Li, Y. Multi-scale sensing and multi-dimensional feature enhancement for surface defect detection of hot-rolled steel strip. Nondestruct. Test. Eval. 2024, 40, 3669–3692.
  25. Zhou, Y.; Zhao, Z. MPA-YOLO: Steel Surface Defect Detection Based on Improved YOLOv8 Framework. Pattern Recognit. 2025, 168, 111897.
  26. Wei, Y.; Wang, R.; Zhang, M.; Wang, Y.; Zhou, F.; Bian, X. Ade-yolo: Deployment-oriented steel surface flaw recognition through enhanced adaptive attention and dilated convolution fusion. Signal Image Video Process. 2025, 19, 457.
  27. Xu, H.; Zhang, Z.; Ye, H.; Song, J.; Chen, Y. Efficient Steel Surface Defect Detection via a Lightweight YOLO Framework with Task-Specific Knowledge-Guided Optimization. Electronics 2025, 14, 2029.
  28. Li, L.; Zhang, R.; Xie, T.; He, Y.; Zhou, H.; Zhang, Y. Experimental design of steel surface defect detection based on MSFE-YOLO—An improved YOLOV5 algorithm with multi-scale feature extraction. Electronics 2024, 13, 3783.
  29. Yang, S.; Xie, Y.; Wu, J.; Huang, W.; Yan, H.; Wang, J.; Wang, B.; Yu, X.; Wu, Q.; Xie, F. CFE-YOLOv8s: Improved YOLOv8s for Steel Surface Defect Detection. Electronics 2024, 13, 2771.
  30. Gui, Z.; Geng, J. YOLO-ADS: An Improved YOLOv8 Algorithm for Metal Surface Defect Detection. Electronics 2024, 13, 3129.
  31. Xiang, Z.; Jia, J.; Zhou, K.; Qian, M.; Wu, W. Block-wise feature fusion for high-precision industrial surface defect detection. Vis. Comput. 2025, 41, 9277–9295.
  32. Ma, R.; Chen, J.; Feng, Y.; Zhou, Z.; Xie, J. ELA-YOLO: An efficient method with linear attention for steel surface defect detection during manufacturing. Adv. Eng. Inform. 2025, 65, 103377.
  33. He, L.; Zheng, L.; Xiong, J. FMV-YOLO: A Steel Surface Defect Detection Algorithm for Real-World Scenarios. Electronics 2025, 14, 1143.
  34. Li, H.; Liu, M.; Yin, Y.; Sun, W. Steel surface defect detection based on multi-layer fusion networks. Sci. Rep. 2025, 15, 10371.
  35. Jocher, G.; Qiu, J. Ultralytics YOLO11. 2024. Available online: https://github.com/ultralytics/ultralytics (accessed on 22 January 2026).
  36. Azad, R.; Niggemeier, L.; Hüttemann, M.; Kazerouni, A.; Aghdam, E.K.; Velichko, Y.; Bagci, U.; Merhof, D. Beyond self-attention: Deformable large kernel attention for medical image segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024; pp. 1287–1297.
  37. Lu, W.; Chen, S.B.; Ding, C.H.; Tang, J.; Luo, B. LWGANet: A Lightweight Group Attention Backbone for Remote Sensing Visual Tasks. arXiv 2025, arXiv:2501.10040.
  38. Kang, M.; Ting, C.M.; Ting, F.F.; Phan, R.C. RCS-YOLO: A fast and high-accuracy object detector for brain tumor detection. In International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer Nature: Cham, Switzerland, 2023; pp. 600–610.
  39. He, Y.; Song, K.; Meng, Q.; Yan, Y. An End-to-end Steel Surface Defect Detection Approach via Fusing Multiple Hierarchical Features. IEEE Trans. Instrum. Meas. 2019, 69, 1493–1504.
  40. Lv, X.; Duan, F.; Jiang, J.J.; Fu, X.; Gan, L. Deep Metallic Surface Defect Detection: The New Benchmark and Detection Network. Sensors 2020, 20, 1562.
  41. Everingham, M.; Van Gool, L.; Williams, C.K.; Winn, J.; Zisserman, A. The Pascal Visual Object Classes (VOC) Challenge. Int. J. Comput. Vis. 2010, 88, 303–338.
  42. Du, Y.; Wang, Q.; He, Y.; Zhang, X.; Li, G. Steel Surface Defect Detection Model Based on GFIF-YOLO. J. Hubei Minzu Univ. (Nat. Sci. Ed.) 2025, 43, 388–395.
  43. Dang, Z.; Wang, X. FD-YOLO11: A Feature-Enhanced Deep Learning Model for Steel Surface Defect Detection. IEEE Access 2025, 13, 63981–63993.
  44. Zhang, T. Research on Metal Surface Defect Detection Based on Deep Learning. Master’s Thesis, Changchun University of Technology, Changchun, China, 2025.

Figure 1. The architecture of the YOLOv11 model.

Figure 2. Structure of the proposed YOLOv11-LLR model. The model follows the standard YOLOv11 pipeline, consisting of a backbone, a neck, and a detection head. The DLKA module is embedded in the backbone to enhance spatial modeling capability. The LWGA module is introduced in the neck for multi-scale feature fusion. RepConv blocks are applied in convolutional layers to improve feature representation while maintaining efficient inference.

Figure 3. Deformable Large Kernel Attention module.

Figure 4. Lightweight Grouped Attention module.

Figure 5. Re-parameterizable Convolution Structure module.

Figure 6. Visual comparison on the NEU-DET dataset. Top row: YOLOv11 detections with confidence scores. Middle row: YOLOv11-LLR detections with confidence scores. Bottom row: Ground truth bounding boxes.

Figure 7. Visual comparison on the GC10-DET dataset. Top row: YOLOv11 detections with confidence scores. Middle row: YOLOv11-LLR detections with confidence scores. Bottom row: Ground truth bounding boxes.

Table 1. Detailed information of the NEU-DET dataset.

| Defect Types | Images | Defects |
|---|---|---|
| Crazing | 300 | 689 |
| Patches | 300 | 881 |
| Inclusion | 300 | 1011 |
| Pitted surface | 300 | 432 |
| Rolled-in Scale | 300 | 628 |
| Scratches | 300 | 548 |
| Total | 1800 | 4189 |

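The class imbalance mentioned in the conclusions can be quantified directly from the counts in Table 1. The short sketch below computes each class's share of all annotated defects, the average number of defects per image, and the ratio between the most and least frequent classes. Variable names are illustrative only.

```python
# Annotated defect counts per class, copied from Table 1 (NEU-DET).
defects = {
    "Crazing": 689,
    "Patches": 881,
    "Inclusion": 1011,
    "Pitted surface": 432,
    "Rolled-in Scale": 628,
    "Scratches": 548,
}
IMAGES_PER_CLASS = 300  # every class has 300 images in Table 1

total = sum(defects.values())
for name, count in defects.items():
    share = 100.0 * count / total
    per_image = count / IMAGES_PER_CLASS
    print(f"{name:16s} {count:5d}  {share:5.1f}% of defects  {per_image:.2f} per image")

# Ratio between the most and least frequent classes.
most = max(defects, key=defects.get)
least = min(defects, key=defects.get)
imbalance_ratio = defects[most] / defects[least]
print(f"{most} / {least} = {imbalance_ratio:.2f}")
```

Although every class has the same number of images, the number of annotated instances differs by a factor of about 2.3 between Inclusion and Pitted surface.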
Table 2. Comparison of our model with traditional and mainstream models on the NEU-DET dataset.

| Experiments | P (%) | R (%) | mAP@0.5 (%) | mAP@0.5:0.95 (%) |
|---|---|---|---|---|
| RE-DETR | 69.5 | 66.8 | 62.3 | 36.7 |
| Faster-RCNN | 73.5 | 71.9 | 70.8 | 42.9 |
| YOLOv5 | 74.9 | 74.4 | 78.7 | 47.9 |
| YOLOv6 | 77.3 | 73.0 | 81.6 | 48.9 |
| YOLOv8s | 76.4 | 79.8 | 80.0 | 47.5 |
| YOLOv9t | 78.9 | 74.0 | 80.9 | 49.1 |
| YOLOv10n | 74.7 | 70.6 | 75.6 | 45.8 |
| YOLOv11 | 80.0 | 74.5 | 80.2 | 48.7 |
| GFIF-YOLO [42] | 77.7 | 75.9 | 82.0 | 48.9 |
| FD-YOLO11 [43] | 74.0 | 77.7 | 81.1 | 48.8 |
| GDM-YOLO [44] | 72.3 | 71.2 | 79.3 | 46.3 |
| Our model | 84.8 | 74.9 | 83.7 | 51.1 |

Table 3. Comparison of our model with traditional and mainstream models on the GC10-DET dataset.

| Experiments | P (%) | R (%) | mAP@0.5 (%) | mAP@0.5:0.95 (%) |
|---|---|---|---|---|
| RE-DETR | 66.5 | 62.0 | 57.0 | 28.5 |
| Faster-RCNN | 66.1 | 68.5 | 65.0 | 33.5 |
| YOLOv5 | 65.6 | 57.5 | 63.2 | 34.2 |
| YOLOv6 | 66.0 | 57.0 | 60.2 | 31.7 |
| YOLOv8s | 60.5 | 59.6 | 61.6 | 31.5 |
| YOLOv9t | 67.2 | 62.4 | 65.8 | 34.6 |
| YOLOv10n | 69.8 | 62.0 | 66.6 | 37.9 |
| YOLOv11 | 72.0 | 54.7 | 61.0 | 33.4 |
| GFIF-YOLO [42] | 65.4 | 67.5 | 66.7 | 36.0 |
| FD-YOLO11 [43] | 68.9 | 68.7 | 70.3 | 36.4 |
| GDM-YOLO [44] | 63.8 | 60.5 | 65.5 | 32.8 |
| Our model | 76.4 | 62.5 | 70.8 | 36.8 |

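Table 4. Results of the ablation study conducted on the NEU-DET dataset.

| YOLO11 | DLKA | LWGA | RepConv | P (%) | R (%) | mAP@0.5 (%) | mAP@0.5:0.95 (%) |
|---|---|---|---|---|---|---|---|
| | | | | 80.0 | 74.5 | 80.2 | 48.7 |
| | | | | 73.0 | 77.6 | 81.7 | 49.0 |
| | | | | 80.4 | 73.6 | 81.8 | 49.0 |
| | | | | 81.2 | 76.6 | 82.5 | 49.3 |
| | | | | 78.7 | 79.5 | 83.4 | 49.7 |
| | | | | 75.0 | 78.6 | 82.5 | 49.2 |
| | | | | 76.8 | 73.8 | 81.6 | 48.3 |
| | | | | 84.8 | 74.9 | 83.7 | 51.1 |

[The check marks in the YOLO11, DLKA, LWGA, and RepConv columns are not recoverable from the extracted text.]
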
Table 5. mAP@0.5 values (%) produced for the six defect categories. Cr, Pa, In, Ps, Rs, and Sc represent Crazing, Patches, Inclusion, Pitted Surface, Rolled-in Scale, and Scratches, respectively.

| Config | YOLO11 | DLKA | LWGA | RepConv | Cr | Pa | In | Ps | Rs | Sc |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | | | | | 55.8 | 80.6 | 91.5 | 88.9 | 70.1 | 94.3 |
| 2 | | | | | 56.4 | 84.2 | 89.7 | 90.9 | 73.4 | 95.4 |
| 3 | | | | | 56.5 | 80.5 | 93.5 | 92.2 | 72.3 | 95.5 |
| 4 | | | | | 48.5 | 85.4 | 96.5 | 95.1 | 74.0 | 95.8 |
| 5 | | | | | 56.7 | 82.8 | 91.6 | 93.3 | 82.1 | 95.4 |

[The check marks in the YOLO11, DLKA, LWGA, and RepConv columns are not recoverable from the extracted text.]

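Because mAP@0.5 is conventionally the unweighted mean of the per-class AP@0.5 values, the category-wise numbers in Table 5 can be cross-checked against the overall mAP@0.5 values in Table 4. The sketch below performs this check. It assumes, as an inference from the matching values and not from anything stated in the text, that configurations 1 to 4 correspond to the first four rows of Table 4 and configuration 5 to its last row.

```python
# Per-class AP@0.5 (%) from Table 5, in the order Cr, Pa, In, Ps, Rs, Sc.
per_class_ap = {
    1: [55.8, 80.6, 91.5, 88.9, 70.1, 94.3],
    2: [56.4, 84.2, 89.7, 90.9, 73.4, 95.4],
    3: [56.5, 80.5, 93.5, 92.2, 72.3, 95.5],
    4: [48.5, 85.4, 96.5, 95.1, 74.0, 95.8],
    5: [56.7, 82.8, 91.6, 93.3, 82.1, 95.4],
}

# Overall mAP@0.5 (%) from Table 4; the config-to-row mapping is an assumption.
reported_map50 = {1: 80.2, 2: 81.7, 3: 81.8, 4: 82.5, 5: 83.7}

computed_map50 = {}
for config, aps in per_class_ap.items():
    # mAP is the plain average over classes, with no weighting by class size.
    computed_map50[config] = sum(aps) / len(aps)
    gap = computed_map50[config] - reported_map50[config]
    print(f"config {config}: mean AP = {computed_map50[config]:.2f}, "
          f"reported = {reported_map50[config]:.1f}, gap = {gap:+.2f}")
```

Under this mapping, each computed mean agrees with the reported mAP@0.5 to within the rounding of the per-class values.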
Table 6. Inference speed comparison on NVIDIA RTX 4060.

| Model | Inference Time (ms) | FPS | Params (M) |
|---|---|---|---|
| YOLOv11 | 6.2 | 161 | 2.59 |
| YOLOv11 + RepConv | 6.4 | 156 | 2.59 |
| YOLOv11 + LWGA | 7.2 | 139 | 2.79 |
| YOLOv11 + DLKA | 8.2 | 121 | 3.36 |
| YOLOv11-LLR | 9.4 | 106 | 3.88 |

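The relationship between the latency and FPS columns of Table 6, and the relative cost of the full model, can be worked out with a few lines of arithmetic. The sketch below assumes single-image inference, so that FPS is approximately 1000 divided by the per-image latency in milliseconds; the batch size is not stated in this section, so this is an assumption.

```python
# (model, inference time in ms, reported FPS, parameters in millions) from Table 6.
rows = [
    ("YOLOv11", 6.2, 161, 2.59),
    ("YOLOv11 + RepConv", 6.4, 156, 2.59),
    ("YOLOv11 + LWGA", 7.2, 139, 2.79),
    ("YOLOv11 + DLKA", 8.2, 121, 3.36),
    ("YOLOv11-LLR", 9.4, 106, 3.88),
]

_, base_ms, _, base_params = rows[0]
derived_fps = {}
for name, ms, fps, params in rows:
    derived_fps[name] = 1000.0 / ms  # frames per second implied by the latency
    latency_overhead = 100.0 * (ms - base_ms) / base_ms
    param_overhead = 100.0 * (params - base_params) / base_params
    print(f"{name:18s} derived FPS {derived_fps[name]:6.1f} (reported {fps}), "
          f"latency +{latency_overhead:.1f}%, params +{param_overhead:.1f}%")

# Relative cost of the full model with respect to the baseline.
llr_latency_overhead = 100.0 * (rows[-1][1] - base_ms) / base_ms
llr_param_overhead = 100.0 * (rows[-1][3] - base_params) / base_params
```

The derived values agree with the reported FPS to within about one frame per second. Relative to the baseline, the full model adds roughly 52% latency and 50% parameters while remaining above 100 FPS on this GPU.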
Table 7. Repeated experiment results (mean ± std) on NEU-DET.

| Model | mAP@0.5 (%) | mAP@0.5:0.95 (%) |
|---|---|---|
| YOLOv11 | 80.1 ± 0.3 | 48.6 ± 0.2 |
| YOLOv11-LLR | 83.6 ± 0.4 | 51.0 ± 0.4 |
| Mean improvement | +3.5 ± 0.5 | +2.4 ± 0.3 |
| p-value | 0.0023 | 0.0018 |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Li, J.; Yang, Y.; Geng, R.; Chang, Y.; Jiang, Y.; Wu, K.; Lu, J. YOLOv11-LLR: An Enhanced Framework for Steel Surface Defect Detection in Industrial Settings. Appl. Sci. 2026, 16, 4609. https://doi.org/10.3390/app16104609

Chicago/Turabian Style

Li, Jin, Yingjian Yang, Runhua Geng, Yaohui Chang, Yuan Jiang, Kaiwen Wu, and Jinhuan Lu. 2026. "YOLOv11-LLR: An Enhanced Framework for Steel Surface Defect Detection in Industrial Settings" Applied Sciences 16, no. 10: 4609. https://doi.org/10.3390/app16104609

