Figure 1.
Architecture of the YOLOv11 Model. The model comprises three main components: the backbone (feature extraction with C3k2, SPPF, and C2PSA modules), neck (multi-scale feature fusion), and head (detection output).
Figure 2.
Improved YOLOv11 model architecture. The backbone integrates C3k2_THK modules with varying convolution kernel sizes across different network depths. The neck features a VOVDGSCSP architecture where shallow layers use DGSConv-L while deeper layers employ DGSConv-H. The detection module employs MSDetect, utilizing lightweight MRFB-L for classification and conventional MRFB for regression operations.
Figure 3.
Structure of T-shaped convolution and THK convolution. (a) The different convolution kernel sizes utilized across various hierarchical levels; (b) the structural design of the original T-shaped convolution; (c) the structural design of the proposed THK convolution.
Figure 4.
Structure of spatial and channel synergistic attention. SCSA integrates SMSA for hierarchical spatial feature extraction and PCSA for adaptive channel refinement, facilitating holistic feature optimization across dual-dimensional spaces, where GroupNorm-N represents group normalization with N groups.
Figure 5.
Hierarchical design of DGSconv modules. (a) Baseline GSconv. (b) DGSconv-L with dual convolution for low-level features. (c) DGSconv-H with dilated convolution for high-level features. (d) Structural details of dual and dilated convolution operations.
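As a point of reference for the dilated branch in (c)/(d): dilation enlarges the receptive field without adding parameters. The standard effective-kernel relation (not specific to this paper) is

```latex
k_{\text{eff}} = k + (k - 1)(d - 1)
```

so, for example, a 3×3 kernel with dilation rate d = 2 covers a 5×5 region at the cost of a plain 3×3 convolution.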
Figure 6.
GMLCA attention mechanism architecture. The mechanism splits input features into global (GAP) and local (LAP/UNAP) branches, then fuses them through sigmoid gating with complementary weights, unifying channel and spatial modeling while reducing computational cost.
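The fusion described above can be sketched numerically. The following is a minimal NumPy illustration of the assumed gating scheme only; the pooling granularity `k`, the weight `alpha`, and the exact branch layouts are placeholders, not the paper's implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gmlca_gate_sketch(x, k=4, alpha=0.5):
    """Sketch of the assumed GMLCA fusion: a global branch (GAP) and a
    local branch (k x k local average pooling, LAP, followed by nearest
    un-pooling, UNAP) are combined by complementary sigmoid gates.
    Assumes H and W are divisible by k; `k` and `alpha` are placeholders.
    """
    c, h, w = x.shape
    # Global branch: global average pooling, one descriptor per channel.
    g = x.mean(axis=(1, 2), keepdims=True)                    # (C, 1, 1)
    # Local branch: average-pool each channel down to a k x k grid (LAP) ...
    hk, wk = h // k, w // k
    local = x.reshape(c, k, hk, k, wk).mean(axis=(2, 4))      # (C, k, k)
    # ... then un-pool back to full resolution (UNAP, nearest neighbour).
    local_up = np.repeat(np.repeat(local, hk, axis=1), wk, axis=2)
    # Complementary sigmoid gating: weight alpha on the global gate and
    # (1 - alpha) on the local gate, then re-weight the input features.
    attn = alpha * sigmoid(g) + (1.0 - alpha) * sigmoid(local_up)
    return x * attn

x = np.random.rand(8, 16, 16)   # toy feature map (C, H, W)
y = gmlca_gate_sketch(x)
print(y.shape)                   # (8, 16, 16)
```

Since both gates lie in (0, 1), the fused weights attenuate rather than amplify features, which is the usual behaviour of a sigmoid attention map.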
Figure 7.
MRFB module and MSDetect head architecture. The figure illustrates (a) the multi-scale receptive field block (MRFB) that captures multi-scale features through parallel convolutions of different kernel sizes and (b) a comparison between YOLOv11’s detection head and the proposed MSDetect head, which integrates MRFB and MRFB-L modules for enhanced performance.
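The parallel-branch idea in (a) can be sketched as follows. This is a single-channel NumPy illustration under stated assumptions: uniform averaging kernels stand in for learned ones, and the branch set (0, 3, 5, 7) follows Table 7, where 0 denotes the identity mapping:

```python
import numpy as np

def conv2d_same(x, k):
    """Single-channel 'same' convolution with a uniform averaging k x k
    kernel (a stand-in for a learned kernel); padding of k // 2 on each
    side keeps the H x W resolution unchanged."""
    p = k // 2
    xp = np.pad(x, p)                        # zero padding on both axes
    out = np.zeros_like(x)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = xp[i:i + k, j:j + k].mean()
    return out

def mrfb_sketch(x, kernels=(0, 3, 5, 7)):
    """Assumed MRFB skeleton: parallel branches with growing receptive
    fields; kernel size 0 denotes the identity-mapping branch. Branch
    outputs are stacked here; the real block would fuse them, e.g. with
    a learned 1 x 1 convolution."""
    branches = [x if k == 0 else conv2d_same(x, k) for k in kernels]
    return np.stack(branches)                # (num_branches, H, W)

x = np.random.rand(12, 12)
y = mrfb_sketch(x)
print(y.shape)                               # (4, 12, 12)
```

The `k // 2` padding is what lets branches with different kernel sizes produce spatially aligned outputs that can be fused directly.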
Figure 8.
Comparison of detection results of multiple models on the NEU-DET dataset.
Figure 9.
Comparison of detection results of multiple models on the NEU-DET dataset.
Figure 10.
Comparison of detection results of multiple models on the GC10-DET dataset.
Figure 11.
Comparison of detection results of multiple models on the Severstal-Steel-Defect dataset.
Table 1.
Training parameter settings.
| Parameter | Value |
|---|---|
| Epoch | 400 |
| Batch size | 16 |
| Optimizer | AdamW |
| Input size | |
| Close mosaic | 10 |
| Learning rate | 0.001 |
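The settings in Table 1 map naturally onto a training call such as the one below. This assumes the Ultralytics API (YOLOv11's reference framework); the weights file and dataset config path are illustrative placeholders, not taken from the paper:

```python
# Sketch of how the Table 1 hyperparameters could be passed to an
# Ultralytics training run (assumed tooling; paths are hypothetical).
from ultralytics import YOLO

model = YOLO("yolo11n.pt")          # YOLOv11n baseline weights
model.train(
    data="neu-det.yaml",            # hypothetical dataset config
    epochs=400,                     # Epoch
    batch=16,                       # Batch size
    optimizer="AdamW",              # Optimizer
    close_mosaic=10,                # disable mosaic for the last 10 epochs
    lr0=0.001,                      # initial learning rate
)
```

Note that `close_mosaic=10` turns mosaic augmentation off for the final 10 epochs, matching the "Close mosaic" entry in the table.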
Table 2.
Comparative experiments between T-shaped convolution and other convolution methods. DS Conv denotes depthwise separable convolution.
| Method | mAP@50 (%) | Param (M) | FLOPs (G) |
|---|---|---|---|
| Standard Conv | 76.4 | 2.58 | 6.3 |
| T-shaped Conv | 76.6 | 2.50 | 6.1 |
| DS Conv | 75.6 | 2.48 | 6.1 |
| Asymmetric Conv | 76.2 | 2.54 | 6.2 |
| Ghost Conv | 75.7 | 2.49 | 6.1 |
Table 3.
Comparative studies on model accuracy under different attention mechanisms. SCSA(x) denotes the SCSA attention mechanism with x heads, where x is set to 4, 8, or 16.
| Method | mAP@50 (%) | Param (M) | FLOPs (G) |
|---|---|---|---|
| THK | 77.9 | 2.54 | 6.2 |
| +SE | 78.1 | 2.54 | 6.2 |
| +ECA | 76.9 | 2.54 | 6.2 |
| +CBAM | 76.8 | 2.54 | 6.2 |
| +EMA | 74.9 | 2.54 | 6.2 |
| +SCSA (4) | 76.6 | 2.54 | 6.2 |
| +SCSA (8) | 78.5 (+0.6) | 2.54 | 6.2 |
| +SCSA (16) | 77.7 | 2.54 | 6.2 |
Table 4.
Comparative experiments between Slim-Neck and other neck architectures.
| Method | mAP@50 (%) | Param (M) | FLOPs (G) |
|---|---|---|---|
| Original Neck | 76.4 | 2.58 | 6.3 |
| Slim-Neck | 76.8 | 2.46 | 6.1 |
| T-shape Conv | 76.8 | 2.48 | 6.1 |
| BiFPN | 76.1 | 2.58 | 6.3 |
| CARAFE | 74.6 | 2.72 | 6.6 |
Table 5.
Comparative studies on model accuracy under different convolutions at different stages.
| Low-Level | High-Level | mAP@50 (%) | Param (M) | FLOPs (G) |
|---|---|---|---|---|
| GSconv | GSconv | 76.8 | 2.46 | 6.1 |
| DGSconv-L | GSconv | 77.0 | 2.42 | 5.9 |
| DGSconv-L | DGSconv-L | 77.8 | 2.36 | 5.9 |
| GSconv | DGSconv-H | 77.6 | 2.46 | 6.1 |
| DGSconv-H | DGSconv-H | 77.5 | 2.46 | 6.1 |
| DGSconv-L | DGSconv-H | 77.9 (+1.1) | 2.42 | 5.9 |
Table 6.
Comparative studies on model accuracy under different attention mechanisms.
| Method | mAP@50 (%) | Param (M) | FLOPs (G) |
|---|---|---|---|
| Staged-Slim-Neck | 77.9 | 2.42 | 5.9 |
| +SE | 77.2 | 2.42 | 5.9 |
| +ECA | 76.6 | 2.42 | 5.9 |
| +SCSA | 77.0 | 2.42 | 5.9 |
| +MLCA | 77.4 | 2.42 | 6.0 |
| +GMLCA | 78.5 (+0.6) | 2.42 | 6.0 |
Table 7.
MRFB performance under different kernel combinations. Numbers in parentheses represent kernel sizes of each branch, where 0 denotes the identity mapping branch. MRFB is applied only to the regression branch, while MRFB-L is applied only to the classification branch.
| Method | mAP@50 (%) | Param (M) | FLOPs (G) |
|---|---|---|---|
| MRFB (3, 5, 7, 9) | 76.4 | 2.58 | 6.3 |
| MRFB (0, 3, 5, 7) | 77.6 (+1.2) | 2.54 | 6.1 |
| MRFB (0, 5, 7, 9) | 77.0 | 2.60 | 6.4 |
| MRFB (0, 3, 5, 9) | 77.0 | 2.57 | 6.3 |
| MRFB-L (3, 5, 7, 9) | 77.0 | 2.58 | 6.4 |
| MRFB-L (0, 3, 5, 7) | 77.8 (+1.4) | 2.58 | 6.3 |
| MRFB-L (0, 5, 7, 9) | 77.4 | 2.58 | 6.4 |
| MRFB-L (0, 3, 5, 9) | 76.0 | 2.58 | 6.3 |
Table 8.
Ablation experiments of C3k2_THK improvement points. Throughout all tables in this paper, check marks (✓) indicate the adoption of the corresponding structures/components.
| T-shaped Conv | SiLU | HKS | SCSA (8) | mAP@50 (%) | Param (M) | FLOPs (G) |
|---|---|---|---|---|---|---|
| | | | | 76.4 | 2.58 | 6.3 |
| ✓ | | | | 76.6 | 2.50 | 6.1 |
| | ✓ | | | 76.4 | 2.58 | 6.3 |
| | | ✓ | | 74.9 | 3.26 | 7.2 |
| | | | ✓ | 77.5 | 2.58 | 6.3 |
| ✓ | ✓ | | | 77.3 | 2.50 | 6.1 |
| ✓ | | ✓ | | 75.9 | 2.54 | 6.2 |
| ✓ | | | ✓ | 76.9 | 2.50 | 6.1 |
| | ✓ | ✓ | | 74.9 | 3.26 | 7.2 |
| | ✓ | | ✓ | 77.5 | 2.58 | 6.3 |
| | | ✓ | ✓ | 76.9 | 3.26 | 7.2 |
| ✓ | ✓ | ✓ | | 77.9 | 2.54 | 6.2 |
| ✓ | ✓ | | ✓ | 78.1 | 2.50 | 6.1 |
| ✓ | | ✓ | ✓ | 78.2 | 2.54 | 6.2 |
| | ✓ | ✓ | ✓ | 76.1 | 3.26 | 7.2 |
| ✓ | ✓ | ✓ | ✓ | 78.5 (+2.1) | 2.54 | 6.2 |
Table 9.
Ablation experiments of Staged-Slim-Neck improvement points. DGSConv-L, DGSConv-H, and GMLCA represent the use of DGSConv-L in low stages, the use of DGSConv-H in high stages, and the use of the GMLCA attention mechanism in high stages, respectively.
| Slim-Neck | DGSConv-L | DGSConv-H | GMLCA | mAP@50 (%) | Param (M) | FLOPs (G) |
|---|---|---|---|---|---|---|
| | | | | 76.4 | 2.58 | 6.3 |
| ✓ | | | | 76.8 | 2.46 | 6.1 |
| ✓ | ✓ | | | 77.0 | 2.42 | 5.9 |
| ✓ | | ✓ | | 77.6 | 2.46 | 6.1 |
| ✓ | | | ✓ | 77.4 | 2.46 | 6.1 |
| ✓ | ✓ | ✓ | | 77.9 | 2.42 | 5.9 |
| ✓ | | ✓ | ✓ | 78.1 | 2.46 | 6.1 |
| ✓ | ✓ | | ✓ | 77.0 | 2.42 | 6.0 |
| ✓ | ✓ | ✓ | ✓ | 78.5 (+2.1) | 2.42 | 6.0 |
Table 10.
Ablation study on MSDetect improvements. The baseline YOLOv11n detection head uses standard convolutions for regression and DSC (depthwise separable convolution) for classification.
| Regression Branch | Classification Branch | mAP@50 (%) | Param (M) | FLOPs (G) |
|---|---|---|---|---|
| Conv | DSC | 76.4 | 2.58 | 6.3 |
| MRFB-L | DSC | 77.8 | 2.58 | 6.3 |
| Conv | MRFB | 77.6 | 2.54 | 6.1 |
| MRFB-L | MRFB | 78.1 (+1.7) | 2.55 | 6.1 |
Table 11.
Ablation study results for ELS-YOLO. The first row shows the YOLOv11n experimental results.
| C3k2_THK | Staged-Slim-Neck | MRFB | mAP@50 (%) | Param (M) | FLOPs (G) |
|---|---|---|---|---|---|
| | | | 76.4 | 2.58 | 6.3 |
| ✓ | | | 78.5 | 2.54 | 6.2 |
| | ✓ | | 78.5 | 2.42 | 6.0 |
| | | ✓ | 78.1 | 2.55 | 6.1 |
| ✓ | ✓ | | 79.2 | 2.39 | 5.8 |
| ✓ | | ✓ | 76.6 | 2.51 | 6.0 |
| | ✓ | ✓ | 78.2 | 2.39 | 5.8 |
| ✓ | ✓ | ✓ | 79.5 (+3.1) | 2.36 | 5.6 |
Table 12.
Experimental results of mainstream object detection methods on the NEU-DET dataset.
| Model | CR | IN | PA | PS | RS | SC | mAP@50 (%) | mAP@50–95 (%) | Param (M) | FLOPs (G) |
|---|---|---|---|---|---|---|---|---|---|---|
| YOLOv5 | 38.1 | 82.2 | 91.2 | 83.6 | 62.3 | 90.0 | 74.6 | 42.2 | 2.50 | 7.1 |
| YOLOv8n | 39.4 | 83.7 | 90.6 | 83.3 | 66.7 | 88.9 | 75.4 | 43.2 | 3.00 | 8.1 |
| YOLOv10n | 36.4 | 77.4 | 91.2 | 85.2 | 57.5 | 86.2 | 72.3 | 41.2 | 2.69 | 8.2 |
| YOLOv11n | 40.4 | 83.9 | 91.2 | 86.8 | 67.6 | 88.8 | 76.4 | 44.1 | 2.58 | 6.3 |
| RT-DETR-l [55] | 29.8 | 74.9 | 86.3 | 77.1 | 58.8 | 85.5 | 68.7 | 35.5 | 31.99 | 103.5 |
| ESS-Yolov5 [27] | 59.1 | 83.7 | 94.9 | 89.1 | 53.5 | 92.2 | 78.8 | - | 7.07 | - |
| GDM-YOLO [26] | - | - | - | - | - | - | 79.3 | - | 9.0 | 28.1 |
| SwinYOLO [56] | 40.3 | 82.5 | 91.2 | 85.8 | 60.4 | 89.1 | 74.9 | - | 4.49 | 9.9 |
| CE-DETR [57] | 44.2 | 75.1 | 91.9 | 91.6 | 75.3 | 93.3 | 78.6 | - | 67.87 | - |
| MobileViT-YOLOv8 [58] | 48.6 | 77.9 | 92.6 | 78.4 | 64.3 | 82.8 | 74.1 | - | 27.5 | 34.9 |
| RTCN [59] | 41.0 | 69.7 | 87.8 | 72.4 | 55.1 | 79.5 | 67.6 | - | - | - |
| Ours | 47.7 | 84.8 | 92.4 | 89.3 | 70.3 | 90.2 | 79.5 | 43.2 | 2.36 | 5.6 |
Table 13.
Experimental results of ELS-YOLO and YOLOv11n on the GC10-DET dataset.
| Model | PH | WL | CG | WS | OS | SS | IC | RP | CR | WF | mAP@50 (%) | mAP@50–95 (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| YOLOv5 | 70.2 | 54.8 | 57.5 | 69.0 | 38.8 | 49.5 | 18.1 | 19.0 | 30.6 | 99.5 | 50.7 | 29.8 |
| YOLOv8n | 75.6 | 47.8 | 55.4 | 69.9 | 39.8 | 44.6 | 13.7 | 12.1 | 31.3 | 99.5 | 49.0 | 27.8 |
| YOLOv10n | 79.2 | 48.7 | 50.3 | 64.6 | 39.1 | 45.2 | 13.8 | 3.38 | 23.9 | 99.5 | 46.8 | 24.8 |
| YOLOv11n | 71.7 | 59.6 | 59.8 | 67.5 | 34.1 | 51.3 | 16.6 | 12.8 | 43.3 | 99.5 | 51.6 | 30.4 |
| RT-DETR-l | 72.0 | 70.2 | 47.3 | 63.1 | 39.9 | 46.9 | 13.9 | 13.2 | 43.6 | 2.93 | 41.3 | 19.4 |
| ELS-YOLO | 78.9 | 53.7 | 59.3 | 73.1 | 37.2 | 51.3 | 18.9 | 31.1 | 37.4 | 99.5 | 54.0 | 26.7 |
Table 14.
Experimental results of ELS-YOLO and YOLOv11n on the Severstal-Steel-Defect dataset.
| Model | Scratch | Inclusion | Scale | Rust | mAP@50 (%) | mAP@50–95 (%) |
|---|---|---|---|---|---|---|
| YOLOv5 | 49.3 | 27.3 | 62.7 | 52.6 | 48.0 | 22.4 |
| YOLOv8n | 50.9 | 29.5 | 62.4 | 52.5 | 48.8 | 22.6 |
| YOLOv10n | 48.6 | 26.6 | 62.9 | 48.6 | 46.7 | 21.7 |
| YOLOv11n | 52.1 | 28.6 | 63.6 | 52.3 | 49.1 | 22.7 |
| RT-DETR-l | 52.5 | 20.7 | 62.5 | 51.4 | 46.8 | 21.5 |
| ELS-YOLO | 48.7 | 34.8 | 61.4 | 53.2 | 49.5 | 22.4 |