Article

IMTS-YOLO: A Steel Surface Defect Detection Model Integrating Multi-Scale Perception and Progressive Attention

1 School of Engineering, Hangzhou Normal University, Hangzhou 311121, China
2 State Key Laboratory of Chemical Engineering and Low-Carbon Technology, Zhejiang University, Hangzhou 310058, China
* Author to whom correspondence should be addressed.
Coatings 2026, 16(1), 51; https://doi.org/10.3390/coatings16010051
Submission received: 14 November 2025 / Revised: 17 December 2025 / Accepted: 30 December 2025 / Published: 2 January 2026
(This article belongs to the Special Issue Solid Surfaces, Defects and Detection, 2nd Edition)

Abstract

In recent years, steel surface defect detection has emerged as a significant area of focus within intelligent manufacturing research. Existing approaches often exhibit insufficient accuracy and limited generalization capability, constraining their practical implementation in industrial environments. To overcome these shortcomings, this study presents IMTS-YOLO, an enhanced detection model based on the YOLOv11n architecture, incorporating several technical innovations designed to improve detection performance. The proposed framework introduces four key enhancements. First, an Intelligent Guidance Mechanism (IGM) refines the feature extraction process to address semantic ambiguity and enhance cross-scenario adaptability, particularly for detecting complex defect patterns. Second, a multi-scale convolution module (MulBk) captures and integrates defect features across varying receptive fields, thereby improving the characterization of intricate surface textures. Third, a triple-head adaptive feature fusion (TASFF) structure enables more effective detection of irregularly shaped defects while maintaining computational efficiency. Finally, a specialized bounding box regression loss function (Shape-IoU) optimizes localization precision and training stability. The model achieved a 5.0% improvement in mAP50 and a 3.2% improvement in mAP50-95 on the NEU-DET dataset, while also achieving a 4.4% improvement in mAP50 and a 3.1% improvement in mAP50-95 in the cross-dataset GC10-DET validation. These results confirm the model’s practical value for real-time industrial defect inspection applications.

Graphical Abstract

1. Introduction

Steel is a foundational material in modern industries such as mechanical manufacturing, automotive, and energy equipment, and its surface quality directly determines product integrity and longevity. Imperfections arising from manufacturing processes, equipment variability, and handling can severely compromise not only aesthetics but also critical mechanical properties, including fatigue strength, wear resistance, and corrosion resistance, thus posing significant risks to product safety and performance. Consequently, the development of high-precision, efficient automated detection systems for steel surface defects is paramount for quality assurance and aligns with the strategic objectives of intelligent manufacturing under the Industry 4.0 paradigm [1].
Traditional detection, predominantly reliant on manual visual assessment aided by optical tools, is fraught with limitations. Its labor-intensive nature, susceptibility to inspector subjectivity and fatigue, and inherent inefficiency often result in high rates of missed detections and inconsistent outcomes, ultimately leading to compromised product quality and elevated production costs [2]. While machine learning-based approaches (e.g., those utilizing SIFT, HOG, and SVM) marked a transition towards data-driven feature learning and offered improved accuracy, their representational capacity remains constrained, struggling with the vast morphological diversity, complex backgrounds, and scale variations characteristic of real-world industrial environments [3].
The advent of deep learning has ushered in transformative solutions for this domain. Object detectors based on deep learning play a pivotal role in intelligent industrial inspection by learning representative features in an end-to-end manner [4]. A primary division within these models separates them into two-stage and single-stage paradigms. The two-stage family, exemplified by Faster R-CNN [5], utilizes a region proposal mechanism to attain superior detection accuracy at the cost of significant computational overhead, which limits their use in time-sensitive scenarios. In contrast, single-stage detectors (e.g., YOLO [6], SSD [7]) frame detection as a unified regression problem, offering high efficiency.
Regarding feature fusion and model architecture design, Zhao et al. [8] conducted in-depth improvements and optimizations of the YOLOv5 model by introducing the Res2Net module into its backbone network. This module employs a hierarchically cascaded residual structure that splits input features into multiple sub-features for parallel processing, thereby enhancing the network's ability to extract multi-scale information; this aids in capturing subtle features of steel surface defects and increases the accuracy and robustness of defect detection. Liang et al. [9] proposed a context enhancement module (CEM), which effectively supplements the contextual information of small targets and performs multi-scale fusion to strengthen semantic representation; they also designed a feature enhancement module (FEM) at the end of the backbone network to refine feature information, effectively capturing both global and local features for steel surface defect detection. Gui et al. [10] proposed a novel cross-stage partial network (CSPNet) incorporating an average spatial pyramid pooling fast module (ASPPFCSPC) to enhance the model's ability to fuse and represent local features and global contextual information. Ma et al. [11] proposed a split-feature fusion model for steel defect detection (ST-YOLO). This model adopts a split-feature network architecture trained with a self-correcting transmission assignment method, breaking away from the single-mode feature processing of traditional networks; by separately processing features of different levels and types, the network enables efficient parallel extraction and analysis of diverse defect characteristics. Liu et al. [12] proposed a spatial feature fusion method applicable to YOLO detection heads. By dynamically learning the spatial fusion weights of multi-scale features, it effectively alleviates the cross-scale inconsistency problem in the feature pyramids of single-stage detectors, significantly improving the performance of YOLOv3 in complex multi-scale object detection and achieving the best speed-accuracy trade-off on the MS COCO dataset with low computational overhead. Zhang and Zhang [13] proposed a bounding box regression loss function that incorporates target structural features on top of IoU, making the regression process better aligned with the actual geometry of the target and thereby significantly improving the performance of advanced detectors such as YOLOv7 and YOLOv8 on datasets including PASCAL VOC and VisDrone.
In terms of attention mechanisms and lightweight design, Feng et al. [14] adopted the RepVGG algorithm, whose structural re-parameterization design equivalently transforms a complex multi-branch architecture into a simple single-branch structure, making it suitable for scenarios with high demands on detection efficiency. They further combined RepVGG with the Spatial Attention (SA) mechanism, which enables the network to focus on key regions within an image and enhances its sensitivity to target features. Cheng et al. [15] focused on optimizing the backbone network by introducing an efficient channel attention bottleneck (EB) module, whose computation is specifically optimized for extracting features of small and slender defects, significantly enhancing the backbone's perception of such defects and improving the efficiency and accuracy of defect feature extraction. Lu et al. [16] employed a dynamic non-monotonic focusing mechanism based on the WIoU loss, shifting the focus to anchor boxes of ordinary quality and thereby improving the overall performance of the detector. They also designed a C2f-DSC module based on dynamic snake convolution, enabling the model to adaptively adjust its receptive field, and introduced the GSConv and VoV-GSCSP modules into the neck network, reducing computational complexity and parameter count while preserving accuracy. Liu et al. [17] proposed a surface defect detection method that combines an attention mechanism with a multi-feature fusion network; using the traditional single-shot multibox detector (SSD) as the base framework, they selected a knowledge-distilled residual network as the feature extractor to fuse and complement low-level and high-level features, thereby improving detection accuracy. Xie et al. [18] addressed the challenges of small defect scales and high complexity by designing a lightweight multi-scale mixed convolution (LMSMC) module based on the YOLOv8 benchmark model; this module was fused with C2f to create C2f_LMSMC, achieving network lightweighting. They also incorporated an efficient global attention mechanism, adopted three independent decoupled heads for regression and classification, and replaced CIoU with NWD as the regression loss to enhance detection of small-scale defects. Si et al. [19] proposed an SCSA module that, by decoupling the spatial and channel dimensions and introducing lightweight multi-semantic spatial guidance and semantic difference mitigation mechanisms, significantly enhances the model's ability to extract and fuse multi-scale, multi-semantic features in complex scenes; it outperforms existing attention methods on multiple benchmarks, including ImageNet-1K, MS COCO, and ADE20K, demonstrating excellent generalization. Yi et al. [20] proposed the YOLOv7-SiamFF framework, which integrates three feature enhancement modules, and validated its effectiveness on a specialized industrial defect detection dataset. Wang et al. [21] constructed a multi-scale attention network (MAN) that significantly enhances the modeling of global and local information in image super-resolution reconstruction, effectively avoiding the block artifacts that traditional large-kernel dilated convolutions can introduce; it achieved performance comparable to or better than SwinIR on multiple benchmark datasets while maintaining low model complexity.
Despite advances in specific applications, existing defect detection methods face several persistent challenges. A primary issue is imbalanced feature extraction for multi-label defects with disparate characteristics, which frequently compromises the detection of small targets. Moreover, the prevalent architectural design in object detection employs separate branches for classification and regression, creating a feature divergence that hinders precise localization amidst industrial noise. Finally, the pursuit of high accuracy in some models often incurs a significant computational burden, manifesting as a sharp increase in key metrics such as model size, parameter count, and computational complexity. For example, high-precision detection typically requires denser anchor box designs or more refined region proposal generation (e.g., as in two-stage detectors), which substantially elevates the computational cost of post-processing steps such as non-maximum suppression (NMS). Additionally, to effectively capture small objects or fine-grained features, models often depend on high-resolution feature maps or supplementary feature pyramid structures (e.g., FPN, PANet), significantly escalating memory consumption and computational requirements. Collectively, these factors complicate the achievement of an optimal balance between speed and accuracy, which is essential for practical deployment.
To address these limitations, we propose IMTS-YOLO, an enhanced model built upon the YOLOv11n [22] framework. The key contributions of this work are summarized as follows:
  • We propose a novel C2PSA-IGM module incorporating an Intelligent Guidance Mechanism (IGM) that significantly enhances the model’s capacity for extracting and integrating heterogeneous semantic features. Through the implementation of multi-semantic spatial guidance and progressive channel interaction mechanisms, our approach effectively addresses semantic ambiguity and feature conflicts while preserving discriminative spatial structures. Unlike conventional hybrid attention methods such as CBAM that process spatial and channel responses separately, our IGM employs grouped spatial modeling combined with channel self-attention, demonstrating superior performance in semantic consistency, cross-scene generalization, and comprehension of complex visual patterns.
  • We develop the MulC3k2 module by augmenting the C3k2 structure with a multi-scale attention component (MulBk). This innovative design couples MLKA with a GSAU, enabling more effective modeling of long-range dependencies while enhancing local feature representation. In contrast to traditional methods like RCAN that primarily depend on single-scale attention mechanisms, our MulBk module provides multi-scale receptive fields and a more adaptable feature fusion framework, which proves particularly valuable for reconstructing high-frequency image details and processing complex defect texture.
  • We introduce a TASFF module to enhance the detection head performance. The TASFF mechanism effectively mitigates cross-scale inconsistencies in feature pyramids by dynamically learning spatial fusion weights across multi-scale features. Distinguished from conventional element-wise addition or concatenation methods typically employed in single-stage detectors, our TASFF module adaptively filters conflicting information while preserving discriminative features, thereby substantially improving detection consistency for multi-scale targets. This approach maintains high detection efficiency while demonstrating enhanced robustness for targets of varying sizes and low contrast in complex scenarios.
The paper is structured into five sections. Section 2 details the datasets, theoretical foundations, and methodological approach. Section 3 presents a comprehensive evaluation of the IMTS-YOLO model through benchmark comparisons, instance visualizations, and an ablation study. Section 4 delves into the key findings and the constraints identified throughout this investigation. Finally, Section 5 concludes by summarizing the principal contributions and outlining emerging research directions in defect detection.

2. Materials and Methods

2.1. Datasets

To ensure a comprehensive evaluation, this study utilizes two publicly available datasets: the NEU-DET steel surface defect dataset from Northeastern University (NEU) and the GC10-DET dataset collected from real industrial environments.

2.1.1. NEU-DET [23,24,25]

This dataset comprises six common types of steel surface defects, namely Scratches (Sc), Crazing (Cr), Pitted Surface (Ps), Inclusions (In), Patches (Pa), and Rolled-in Scale (Rs), with approximately 1800 grayscale images in total. The samples of the NEU-DET dataset and their label distribution are shown in Figure 1.

2.1.2. GC10-DET [26]

The dataset features approximately 3600 images across ten metal surface defect categories: Punching (Pu), Welding Line (Wl), Crescent Gap (Cg), Water Spot (Ws), Oil Spot (Os), Silk Spot (Ss), Inclusion (In), Rolled Pit (Rp), Crease (Cr), and Waist Folding (Wf). The samples of the GC10-DET dataset and their label distribution are shown in Figure 2.

2.1.3. Dataset Analysis

Across the two datasets, surface defects belonging to the same category often exhibit significant visual variations. Taking scratch-type defects as an example, their morphology can appear in multiple orientations, such as horizontal, vertical, or inclined. Additionally, influenced by imaging lighting conditions and the intrinsic material properties of the steel, defects within the same category may display noticeable differences in grayscale distribution. Furthermore, defects of different categories frequently share similar visual characteristics; for instance, cracks, rolled-in scale, and surface pits often resemble one another in texture and shape. Based on the analysis of the two dataset distribution charts presented earlier, the sample distribution across defect categories is imbalanced, with visually similar categories accounting for a relatively small proportion. Moreover, a significant number of defect regions exhibit low contrast with the background, leading to blurred defect boundaries and a high degree of feature intermixing with background information, which poses challenges for accurate identification.
Defect size represents another critical attribute in these datasets. The scale of defects varies widely across the two datasets, ranging from subtle traces occupying only a minimal portion of the image to large defects that cover almost the entire image. Notably, small-scale defects dominate the datasets, making the detection of fine-grained defects—which are particularly susceptible to background interference—a challenging issue.
Building upon the systematic analysis of the steel surface defect datasets outlined above, and in response to the limited generalization capability and robustness of existing detection models when confronted with challenges such as high intra-class variation, inter-class similarity, low defect-background contrast, and multi-scale defect sizes, this study aims to develop a novel methodology capable of effectively addressing these difficulties.

2.2. Baseline: YOLOv11 Model

The YOLOv11 model, released by Ultralytics in late 2024, serves as the foundation for our work. It incorporates several architectural and training optimizations that strike a balance between detection speed and accuracy, making it suitable for diverse vision tasks. The model family (n, s, m, l, x) shares a common module composition but varies in depth, width, and channel count, leading to different computational complexities. The architecture consists of three primary components:
  • Backbone: primarily built with Conv, C3k2, and SPPF modules for feature extraction from input images. A C2PSA module is added post-SPPF to enhance feature selection.
  • Neck: composed of Upsample, Concat, and C3k2 modules to facilitate multi-scale feature fusion between shallow and deep layers.
  • Head: employs Conv, DWConv (depthwise convolution), and Conv2d modules for the final object classification and localization predictions.
While the overall structure is similar to YOLOv8, key enhancements include the use of the C3k2 module for feature selection and the integration of depthwise separable convolutions in the head to reduce computational redundancy. For this study, the YOLOv11n variant was selected as the baseline to optimize the trade-off between detection performance and computational resource consumption. Figure 3 presents an overview of the YOLOv11 network architecture.

2.3. IMTS-YOLO Model

Building upon the YOLOv11n baseline, the IMTS-YOLO model introduces targeted optimizations across three key dimensions: multi-scale perception, feature fusion, and localization accuracy. The overall network architecture and detailed designs of the core modules are illustrated in Figure 4 and Figure 5, respectively. The improvements are elaborated in the following subsections.

2.3.1. C2PSA-IGM

While YOLOv11 represents an advancement, its C2PSA module exhibits limitations in the collaborative modeling of channel and spatial features, restricting its ability to capture fine-grained defect characteristics. At its core, this limitation refers to the module’s inability to sufficiently and efficiently integrate feature information across different dimensions, which results in the model’s insufficient capability to perceive and represent complex defects (such as subtle, irregular, or low-contrast defects). This is particularly problematic in steel defect detection, where categories like inclusions and pitting can suffer from feature confusion due to complex backgrounds and scale variations.
To overcome this, we designed an efficient and lightweight Intelligent Guidance Mechanism (IGM). The overall architecture of this module is shown in Figure 6. The IGM is constructed from two complementary components:
Multi-Semantic Selective Attention (MSSA): To optimize computational efficiency while maintaining performance, the MSSA module adopts a decomposition strategy inspired by Transformer [27] architectures for processing 1D sequences. The input feature tensor $X \in \mathbb{R}^{B \times C \times H \times W}$ is decomposed along its spatial dimensions through global average pooling, yielding two independent 1D sequence representations $X_H \in \mathbb{R}^{B \times C \times W}$ for the height dimension and $X_W \in \mathbb{R}^{B \times C \times H}$ for the width dimension. To capture diverse spatial characteristics, each sequence is further partitioned into $K$ distinct feature groups $X_H^i$ and $X_W^i$, with $K = 4$ as the default configuration. The decomposition process is formally defined as follows:
$$X_W^i = X_W\left[:, \frac{(i-1) \times C}{K} : \frac{i \times C}{K}, :\right]$$
$$X_H^i = X_H\left[:, \frac{(i-1) \times C}{K} : \frac{i \times C}{K}, :\right]$$
where $X_H^i$ and $X_W^i$ are the $i$-th sub-features along the height and width dimensions, respectively, and $i \in [1, K]$. By maintaining the independence of each sub-feature, the model can efficiently capture spatial information across multiple semantic levels.
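The decomposition in the two equations above maps directly onto array slicing; a minimal NumPy sketch, using hypothetical shapes (B = 2, C = 16, H = W = 8) and the default K = 4:

```python
import numpy as np

# Hypothetical shapes: B=2 batch, C=16 channels, 8x8 spatial, K=4 groups.
B, C, H, W, K = 2, 16, 8, 8, 4
X = np.random.rand(B, C, H, W)

# Global average pooling along each spatial axis yields two 1D sequences.
X_H = X.mean(axis=2)  # pooled over height -> (B, C, W)
X_W = X.mean(axis=3)  # pooled over width  -> (B, C, H)

# Partition each sequence into K channel groups of C/K channels each,
# mirroring the slicing in the decomposition equations.
groups_H = [X_H[:, (i - 1) * C // K : i * C // K, :] for i in range(1, K + 1)]
groups_W = [X_W[:, (i - 1) * C // K : i * C // K, :] for i in range(1, K + 1)]

assert all(g.shape == (B, C // K, W) for g in groups_H)
assert all(g.shape == (B, C // K, H) for g in groups_W)
```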
Following research on reducing feature redundancy [28], and to enhance feature diversity while mitigating redundancy, we implement depthwise separable 1D convolutions with varying kernel sizes (3, 5, 7, 9) across the different feature groups. This multi-scale approach captures spatial patterns at different granularities. Subsequently, a lightweight 1D squeeze-and-excitation (SE) module is incorporated, which employs global average pooling for spatial compression, followed by two fully-connected layers to model channel interdependencies. A Sigmoid activation generates channel-wise attention weights, emphasizing informative feature channels. To address potential receptive field limitations from the dimensional decomposition, lightweight shared convolutions are applied for cross-dimensional feature alignment, implicitly modeling inter-dimensional dependencies. The enhanced feature representations are obtained as follows:
$$\tilde{X}_H^i = \mathrm{SE1D}\!\left(\mathrm{DWConv1d}_{k_i}^{C/K \to C/K}\!\left(X_H^i\right)\right)$$
$$\tilde{X}_W^i = \mathrm{SE1D}\!\left(\mathrm{DWConv1d}_{k_i}^{C/K \to C/K}\!\left(X_W^i\right)\right)$$
In these equations, $\tilde{X}_H^i$ and $\tilde{X}_W^i$ correspond to the enhanced spatial information in the height and width dimensions for the $i$-th sub-feature after lightweight convolution and channel weighting, while $k_i$ is the convolution kernel size specific to that sub-feature.
The spatial attention mechanism is constructed by concatenating processed sub-features along the channel dimension, followed by group normalization (GN) [29]. Compared to batch normalization [30], GN demonstrates superior performance in preserving semantic distinctions between feature groups while avoiding batch statistics noise. Finally, spatial attention maps are generated through Sigmoid activation, emphasizing relevant regions while suppressing noise. The complete MSSA computation is formalized as
$$\mathrm{Attn}_H = \delta\!\left(\mathrm{GN}_H^K\!\left(\mathrm{Concat}\!\left(\tilde{X}_H^1, \tilde{X}_H^2, \ldots, \tilde{X}_H^K\right)\right)\right)$$
$$\mathrm{Attn}_W = \delta\!\left(\mathrm{GN}_W^K\!\left(\mathrm{Concat}\!\left(\tilde{X}_W^1, \tilde{X}_W^2, \ldots, \tilde{X}_W^K\right)\right)\right)$$
$$\mathrm{MSSA}(X) = X_s = \mathrm{Attn}_H \times \mathrm{Attn}_W \times X$$
where $\delta(\cdot)$ denotes Sigmoid normalization, and $\mathrm{GN}_H^K(\cdot)$ and $\mathrm{GN}_W^K(\cdot)$ denote $K$-group group normalization along the height and width dimensions, respectively.
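A minimal NumPy sketch of the MSSA data flow may help fix ideas. The learned per-group depthwise 1D convolutions and the SE1D block are replaced here by a simple 3-tap moving average, so this illustrates only the structure (pool each axis, process K groups, gate the input with the two Sigmoid attention maps), not the trained module:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mssa_sketch(X, K=4):
    """Structural sketch of MSSA: pool each spatial axis, process K channel
    groups independently (a moving average stands in for the per-group
    DWConv1d + SE1D), then gate the input with the product of the two
    Sigmoid-normalized attention maps."""
    B, C, H, W = X.shape
    X_H = X.mean(axis=2)  # (B, C, W): 1D sequence for the height branch
    X_W = X.mean(axis=3)  # (B, C, H): 1D sequence for the width branch

    def axis_attention(seq):
        groups = np.split(seq, K, axis=1)
        # Stand-in for DWConv1d + SE1D: a 3-tap moving average per group.
        smoothed = [np.apply_along_axis(
            lambda v: np.convolve(v, np.ones(3) / 3, mode="same"), -1, g)
            for g in groups]
        return sigmoid(np.concatenate(smoothed, axis=1))

    attn_H = axis_attention(X_H)[:, :, None, :]   # (B, C, 1, W)
    attn_W = axis_attention(X_W)[:, :, :, None]   # (B, C, H, 1)
    return attn_H * attn_W * X                    # broadcasts to (B, C, H, W)

X = np.random.rand(2, 16, 8, 8)
Xs = mssa_sketch(X)
assert Xs.shape == X.shape
assert (Xs <= X).all()  # Sigmoid gates lie in (0, 1), so the input is attenuated
```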
Progressive Compressed Spatial Semantic Attention (PCSSA): Traditional channel attention methods based on convolutional operations face limitations in effectively capturing inter-channel relationships [31]. Inspired by the success of Vision Transformers [32] in modeling spatial token similarities, our approach integrates single-head self-attention (SHSA) with the spatial priors provided by MSSA to compute cross-channel similarity measures. To maintain computational efficiency while leveraging multi-semantic information, we employ a strip pooling-based compression strategy for multi-scale semantic guidance. The PCSSA implementation proceeds as follows:
$$X_p = \mathrm{StrPool}_{7,7}^{(H,W) \to (H',W')}(X_s)$$
$$F_{proj} = \mathrm{DWConv1d}_{1,1}^{C \to C}$$
$$K = F_{proj}^{K}(X_p), \quad Q = F_{proj}^{Q}(X_p), \quad V = F_{proj}^{V}(X_p)$$
$$X_{attn} = \mathrm{Attn}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{QK^{T}}{\sqrt{C}}\right)V$$
$$\mathrm{PCSSA}(X_s) = X_c = X_s \times \delta\!\left(\mathrm{AvgPool}_{(H',W') \to (1,1)}(X_{attn})\right)$$
where $\mathrm{StrPool}_{k,k}^{(H,W) \to (H',W')}(\cdot)$ denotes the strip pooling operation with kernel size $k \times k$ that rescales the resolution from $(H, W)$ to $(H', W')$, $\mathrm{AvgPool}_{(H',W') \to (1,1)}(\cdot)$ performs global average pooling on the compressed feature map, and $F_{proj}(\cdot)$ implements the linear projections for the Query, Key, and Value.
Notably, our CA-SHSA configuration differs fundamentally from the standard MHSA in Vision Transformers. While ViT employs $Q, K, V \in \mathbb{R}^{B \times N \times C}$ with $N = HW$, our approach uses $Q, K, V \in \mathbb{R}^{B \times C \times N'}$, where $N' = H'W'$ represents the number of compressed spatial positions. This design, coupled with single-head attention, ensures efficient interaction with the MSSA-derived sub-features [33].
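The transposed token layout can be sketched in NumPy as follows. The Q/K/V projections are taken as identity maps purely for illustration, so only the channel-attends-to-channel pattern described above is shown:

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def channel_self_attention(Xp):
    """Single-head self-attention over the channel axis with tokens laid out
    as (B, C, N'): each channel attends to every other channel across the N'
    compressed spatial positions. Projections are identity maps (sketch)."""
    B, C, N = Xp.shape
    Q, K, V = Xp, Xp, Xp                             # identity projections
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(C)   # (B, C, C) channel similarity
    return softmax(scores, axis=-1) @ V              # (B, C, N')

Xp = np.random.rand(2, 16, 49)   # e.g. a 7x7 strip-pooled map flattened to N'=49
out = channel_self_attention(Xp)
assert out.shape == (2, 16, 49)
```

The key difference from ViT-style attention is that the similarity matrix is C-by-C rather than N-by-N, so its cost is governed by the channel count, not the spatial resolution.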
A novel contribution of our work, the IGM module, synergistically combines concepts from CBAM [34] and CPCA [35] to guide channel attention through spatial context. MSSA provides precise spatial priors by extracting multi-semantic spatial information, while PCSSA refines the semantic understanding of local features using global context, effectively mitigating semantic discrepancies from multi-scale convolutions. Crucially, IGM preserves channel dimensionality to prevent information loss. The integrated module is defined as
$$\mathrm{IGM}(X) = \mathrm{PCSSA}\!\left(\mathrm{MSSA}(X)\right)$$
Experimental validation confirms that this architecture significantly enhances detection capability for subtle and ambiguous defects while improving robustness across complex steel surface inspection scenarios.

2.3.2. Structural Enhancement: The MulC3k2 Module

Inspired by MetaFormer [36], we introduce the enhanced MulC3k2 architecture, building upon the foundational C3k2 module, which comprises a 2D convolutional layer and a bottleneck structure for dimensional reduction and computational efficiency. The core innovation involves replacing the standard bottleneck component with our novel multi-scale attention module (MulBk), as illustrated in Figure 7 and Figure 8.
The MulBk module integrates two complementary components: multi-scale large kernel attention (MLKA) and gated spatial attention unit (GSAU). The processing pipeline for an input feature X proceeds as follows:
$$N = \mathrm{LN}(X)$$
$$X = X + \alpha_1\, f_3\!\left(\mathrm{MLKA}\!\left(f_1(N)\right) \odot f_2(N)\right)$$
$$N = \mathrm{LN}(X)$$
$$X = X + \alpha_2\, f_6\!\left(\mathrm{GSAU}\!\left(f_4(N), f_5(N)\right)\right)$$
Here, $\mathrm{LN}(\cdot)$ represents layer normalization, $\alpha_1$ and $\alpha_2$ are learnable scaling parameters, $\odot$ indicates element-wise multiplication, and $f_i(\cdot)$ corresponds to dimension-preserving pointwise convolutional operations. The adoption of layer normalization, rather than batch normalization or its absence, preserves instance-specific characteristics while promoting accelerated convergence.
Multi-scale Large Kernel Attention (MLKA): Conventional attention mechanisms [37] in object detection, including channel attention and self-attention, have demonstrated limitations in concurrently capturing local features and long-range dependencies while maintaining fixed receptive fields. Our MLKA module addresses these constraints through an integrated framework combining large kernel decomposition with multi-scale learning, comprising three fundamental operations:
Large Kernel Attention: For an input feature map $X \in \mathbb{R}^{B \times C \times H \times W}$, LKA decomposes a $K \times K$ convolution into three sequential components:
$$\mathrm{LKA}(X) = f_{PW}\!\left(f_{DWD}\!\left(f_{DW}(X)\right)\right)$$
This employs a $(2d-1) \times (2d-1)$ depthwise convolution, followed by a $\lceil K/d \rceil \times \lceil K/d \rceil$ depthwise dilated convolution with dilation $d$, and concludes with a $1 \times 1$ pointwise convolution.
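As a worked example of why this decomposition is economical, the parameter counts for the {21, 3} variant can be tallied directly (the channel count C = 64 is a hypothetical choice for illustration; bias terms are ignored):

```python
# Worked parameter count for the LKA decomposition of a K x K convolution.
# Assumed example: K = 21, d = 3, hypothetical channel count C = 64.
import math

K, d, C = 21, 3, 64

full_conv = K * K * C * C                   # standard KxK convolution: K^2 * C^2
dw = (2 * d - 1) ** 2 * C                   # (2d-1)x(2d-1) depthwise: 5x5 here
dw_dilated = math.ceil(K / d) ** 2 * C      # ceil(K/d) x ceil(K/d) dilated DW: 7x7 here
pw = C * C                                  # 1x1 pointwise
decomposed = dw + dw_dilated + pw

assert full_conv == 1_806_336
assert decomposed == 25 * 64 + 49 * 64 + 4096 == 8832
assert decomposed < full_conv // 200        # over 200x fewer parameters
```

The 5-7-1 kernel sizes recovered here match the combination quoted for the {21, 3} variant in the text.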
Multi-Scale Mechanism: The input feature map undergoes channel-wise partitioning into $n$ distinct groups $X_1, X_2, \ldots, X_n$, each containing $C/n$ channels. Each group $X_i$ is processed by an LKA configuration $\{K_i, d_i\}$, generating specialized attention maps. Our implementation utilizes three LKA variants, $\{7, 2\}$, $\{21, 3\}$, and $\{35, 4\}$, corresponding to convolutional combinations of 3-5-1, 5-7-1, and 7-9-1, respectively.
Gated Aggregation: To mitigate artifacts from dilation and partitioning operations while enhancing local feature preservation, we incorporate a spatial gating mechanism:
$$\mathrm{MLKA}_i(X_i) = G_i(X_i) \odot \mathrm{LKA}_i(X_i)$$
Here, $G_i(\cdot)$ represents a gating function generated through an $a_i \times a_i$ depthwise convolution.
Gated Spatial Attention Unit (GSAU): Addressing the computational inefficiency of traditional feedforward networks in Transformers, GSAU incorporates simple spatial attention with gated linear units to establish an adaptive gating framework with reduced parameter overhead. The fundamental operation is defined as
$$\mathrm{GSAU}(X, Y) = f_{DW}(X) \odot Y$$
In this formulation, $f_{DW}(\cdot)$ and $\odot$ denote depthwise convolution and element-wise multiplication, respectively. This design allows the gated spatial attention unit (GSAU) to effectively capture local continuity through a spatial gate, which simplifies the architecture by reducing the need for excessive nonlinear layers.
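A minimal NumPy sketch of this gating, with a uniform 3x3 box filter standing in for the learned depthwise convolution weights (the real module learns one filter per channel):

```python
import numpy as np

def gsau_sketch(X, Y):
    """Structural sketch of GSAU(X, Y) = DWConv(X) * Y: a spatial gate is
    computed from X with a 3x3 depthwise filter (here a uniform box filter)
    and multiplied element-wise into Y."""
    B, C, H, W = X.shape
    gate = np.zeros_like(X)
    pad = np.pad(X, ((0, 0), (0, 0), (1, 1), (1, 1)), mode="edge")
    for dy in range(3):          # accumulate the 3x3 neighborhood per channel
        for dx in range(3):
            gate += pad[:, :, dy:dy + H, dx:dx + W]
    gate /= 9.0                  # box filter: depthwise 3x3 with uniform weights
    return gate * Y              # element-wise spatial gating

X = np.random.rand(2, 8, 6, 6)
Y = np.random.rand(2, 8, 6, 6)
out = gsau_sketch(X, Y)
assert out.shape == Y.shape
```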
Integrated Advantages: The synergistic operation of MLKA and GSAU within the MulBk framework enables comprehensive feature representation across multiple granularities. MLKA facilitates adaptive integration of local details with global contextual information, proving particularly effective for diverse defect types ranging from micro-cracks in NEU-DET to extensive stains in GC10-DET. Concurrently, GSAU’s dynamic spatial weighting mechanism suppresses background interference and enhances defect saliency, significantly improving detection robustness in challenging industrial environments.

2.3.3. TASFF Head

The TASFF framework is designed to selectively integrate complementary features while filtering conflicting information across different scales through dynamically learned spatial weighting. Unlike conventional feature pyramid approaches that often exhibit inconsistencies in multi-scale feature integration, our method enables position-aware adaptive selection of the most informative feature contributions from varying scales. This capability is particularly crucial in industrial inspection scenarios where defect sizes exhibit significant variation.
Leveraging this advantage, we integrate TASFF into YOLOv11 to form a specialized head. This integration endows the model with an enhanced ability to fuse multi-scale features—combining semantic abstraction with spatial precision—which collectively boost performance in detecting targets at various scales and estimating accurate bounding boxes.
As illustrated in Figure 9, the network’s neck component generates three distinct feature maps (Level 0, Level 1, and Level 2) with varying spatial resolutions and abstraction levels. To enable seamless fusion while maintaining dimensional consistency, we apply appropriate rescaling operations, including 1/2 downsampling, 1/4 downsampling, and upsampling.
The adaptive fusion process operates independently at each hierarchy level through dedicated TASFF-0, TASFF-1, and TASFF-2 modules. Each module performs intelligent feature weighting using learnable coefficients (α, β, γ) that dynamically determine the relative importance of different scale contributions. Using TASFF-2 as an illustrative example, the fusion mechanism at spatial coordinates ( i , j ) follows this formulation:
M_{ij}^{l} = \alpha_{ij}^{l} \cdot y_{ij}^{0 \to l} + \beta_{ij}^{l} \cdot y_{ij}^{1 \to l} + \gamma_{ij}^{l} \cdot y_{ij}^{2 \to l}
Here, $M_{ij}^{l}$ represents the synthesized feature vector at level $l$ and position $(i, j)$; $y_{ij}^{k \to l}$ denotes the feature vector rescaled from source level $k$ to target level $l$; and $\alpha_{ij}^{l}$, $\beta_{ij}^{l}$, and $\gamma_{ij}^{l}$ are adaptive weights in $[0, 1]$ that satisfy the following constraints:
\alpha_{ij}^{l} + \beta_{ij}^{l} + \gamma_{ij}^{l} = 1

\alpha_{ij}^{l} = \frac{e^{\delta_{\alpha_{ij}}^{l}}}{e^{\delta_{\alpha_{ij}}^{l}} + e^{\delta_{\beta_{ij}}^{l}} + e^{\delta_{\gamma_{ij}}^{l}}}

\beta_{ij}^{l} = \frac{e^{\delta_{\beta_{ij}}^{l}}}{e^{\delta_{\alpha_{ij}}^{l}} + e^{\delta_{\beta_{ij}}^{l}} + e^{\delta_{\gamma_{ij}}^{l}}}

\gamma_{ij}^{l} = \frac{e^{\delta_{\gamma_{ij}}^{l}}}{e^{\delta_{\alpha_{ij}}^{l}} + e^{\delta_{\beta_{ij}}^{l}} + e^{\delta_{\gamma_{ij}}^{l}}}
The control parameters $\delta_{\alpha_{ij}}^{l}$, $\delta_{\beta_{ij}}^{l}$, and $\delta_{\gamma_{ij}}^{l}$ are derived from the rescaled feature maps via 1 × 1 convolutional operations.
This dynamic weighting strategy ensures optimal utilization of multi-scale information at each spatial location, producing highly discriminative feature representations. The resulting architecture not only enhances spatial precision and semantic richness but also demonstrates superior detection performance in complex multi-scale industrial inspection environments.
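The per-pixel weighting scheme above can be sketched in PyTorch. The `AdaptiveFusion` module, its hidden width, and the toy tensor shapes are illustrative assumptions; the softmax over 1 × 1-convolution logits mirrors the construction of the adaptive weights, while the rescaling convolutions of the full TASFF head are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveFusion(nn.Module):
    """ASFF-style adaptive fusion at one pyramid level: per-pixel weights
    alpha/beta/gamma come from 1x1 convolutions followed by a softmax and
    blend three already-rescaled feature maps. A minimal sketch, not the
    authors' exact TASFF implementation."""

    def __init__(self, channels: int, hidden: int = 8):
        super().__init__()
        # one 1x1 conv per input level produces the control logits (delta)
        self.weight_convs = nn.ModuleList(
            [nn.Conv2d(channels, hidden, kernel_size=1) for _ in range(3)])
        self.weight_out = nn.Conv2d(hidden * 3, 3, kernel_size=1)

    def forward(self, y0, y1, y2):
        logits = self.weight_out(torch.cat(
            [conv(y) for conv, y in zip(self.weight_convs, (y0, y1, y2))],
            dim=1))
        w = F.softmax(logits, dim=1)            # alpha + beta + gamma = 1 per pixel
        a, b, g = w[:, 0:1], w[:, 1:2], w[:, 2:3]
        return a * y0 + b * y1 + g * y2         # M_ij = a*y0 + b*y1 + g*y2

feats = [torch.randn(1, 64, 20, 20) for _ in range(3)]
fused = AdaptiveFusion(64)(*feats)
print(tuple(fused.shape))  # (1, 64, 20, 20)
```

The softmax guarantees the non-negativity and sum-to-one constraints at every spatial location, which is what makes the fusion position-aware rather than a fixed global blend.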

2.3.4. Shape-IoU

The efficacy of object detection models is critically dependent on the bounding box regression loss, which governs localization precision. Although existing losses (e.g., IoU, GIoU, CIoU, and SIoU) have incorporated an expanding set of geometric metrics—from overlap and center distance to aspect ratio and angle—their performance remains limited by a common failure to consider how the fundamental geometry of a bounding box influences the regression process.
To address this limitation, we use Shape-IoU, an innovative loss function that incorporates a shape- and scale-aware weighting mechanism to better guide model optimization. Our approach stems from the observation that bounding boxes of different geometries exhibit varying sensitivity to positional deviations: non-square boxes demonstrate heightened sensitivity along their shorter dimension, while small-scale targets show greater susceptibility to shape variations. The formulation of Shape-IoU follows the geometry illustrated in Figure 10.
The fundamental concept involves explicit modeling of how ground truth (GT) box dimensions affect loss weighting. Specifically, we formulate direction-sensitive coefficients derived from the GT box, with width $w^{gt}$ and height $h^{gt}$:
hh = \frac{2 \times (h^{gt})^{scale}}{(w^{gt})^{scale} + (h^{gt})^{scale}}

ww = \frac{2 \times (w^{gt})^{scale}}{(w^{gt})^{scale} + (h^{gt})^{scale}}
Here, the scale parameter adapts to target size distributions across different datasets. These weights are subsequently integrated into both distance and shape constraint components:
d^{shape} = hh \cdot \frac{(x_c - x_c^{gt})^{2}}{c^{2}} + ww \cdot \frac{(y_c - y_c^{gt})^{2}}{c^{2}}

\omega_h = ww \cdot \frac{\lvert h - h^{gt} \rvert}{\max(h, h^{gt})}

\omega_w = hh \cdot \frac{\lvert w - w^{gt} \rvert}{\max(w, w^{gt})}

\Omega^{shape} = \sum_{t \in \{w, h\}} \left(1 - e^{-\omega_t}\right)^{4}
The complete loss formulation combines these elements:
L_{\text{Shape-IoU}} = 1 - \mathrm{IoU} + d^{shape} + \frac{1}{2} \cdot \Omega^{shape}
In this configuration, $d^{shape}$ represents the directionally weighted center-distance penalty, $\omega_w$ and $\omega_h$ denote the weighted dimensional deviations, and $\Omega^{shape}$ constitutes the aspect-ratio-based shape penalty term.
By incorporating shape-aware distance weighting and geometric similarity assessment, Shape-IoU substantially improves adaptation to bounding box geometry while amplifying deviation penalties along the more sensitive shorter dimension. This approach maintains the foundational IoU structure while achieving enhanced sensitivity to bounding box proportions throughout the regression process, ultimately leading to more precise localization across diverse target geometries.
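The loss can be sketched directly from the formulas above for corner-format boxes. This is a minimal PyTorch implementation under stated assumptions (the function name, `eps` stabilizer, and toy boxes are illustrative; it is not the authors' released code).

```python
import torch

def shape_iou_loss(pred, target, scale: float = 1.0, eps: float = 1e-7):
    """Shape-IoU loss for (x1, y1, x2, y2) boxes; `scale` is the
    dataset-dependent size factor. Sketch following the paper's formulas."""
    w1, h1 = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    w2, h2 = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    cx1, cy1 = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    cx2, cy2 = (target[:, 0] + target[:, 2]) / 2, (target[:, 1] + target[:, 3]) / 2

    # plain IoU
    iw = (torch.min(pred[:, 2], target[:, 2]) - torch.max(pred[:, 0], target[:, 0])).clamp(0)
    ih = (torch.min(pred[:, 3], target[:, 3]) - torch.max(pred[:, 1], target[:, 1])).clamp(0)
    inter = iw * ih
    iou = inter / (w1 * h1 + w2 * h2 - inter + eps)

    # squared diagonal of the smallest enclosing box (c^2)
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    c2 = cw ** 2 + ch ** 2 + eps

    # direction-sensitive coefficients from the GT box
    ww = 2 * w2 ** scale / (w2 ** scale + h2 ** scale)
    hh = 2 * h2 ** scale / (w2 ** scale + h2 ** scale)

    # directionally weighted center-distance penalty
    d_shape = hh * (cx1 - cx2) ** 2 / c2 + ww * (cy1 - cy2) ** 2 / c2

    # aspect-ratio (shape) penalty
    omega_w = hh * (w1 - w2).abs() / torch.max(w1, w2)
    omega_h = ww * (h1 - h2).abs() / torch.max(h1, h2)
    omega = (1 - torch.exp(-omega_w)) ** 4 + (1 - torch.exp(-omega_h)) ** 4

    return 1 - iou + d_shape + 0.5 * omega

pred = torch.tensor([[10.0, 10.0, 50.0, 30.0]])
gt = torch.tensor([[12.0, 11.0, 52.0, 29.0]])
loss = shape_iou_loss(pred, gt)
print(float(loss))  # small positive value; zero only for identical boxes
```

Note that for a wide, short GT box, `hh` shrinks and `ww` grows, so vertical center offsets and width deviations are penalized more heavily, matching the short-dimension sensitivity argument above.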

3. Results

3.1. Experimental Setup and Training Parameters

All experiments were conducted on a remote server with the following environment configuration: Ubuntu 22.04.1 operating system, an NVIDIA GeForce RTX 4090 GPU, an AMD EPYC 7352 processor, a 100 GB hard drive, 124 GB of RAM, the CUDA 12.1 parallel computing architecture, and the PyTorch 2.3.0 deep learning framework. Key training parameters are summarized in Table 1.

3.2. Experimental Metrics

Precision (P): P reflects the model’s correctness when it makes a positive prediction. It is computed as
P = \frac{TP}{TP + FP}
where TP and FP represent true and false positives, respectively.
Recall (R): R assesses the model’s effectiveness in identifying all relevant positive instances, given by
R = \frac{TP}{TP + FN}
where FN signifies false negatives.
Precision–Recall (P-R) Curve: The precision–recall (P-R) curve, which plots precision against recall on the vertical and horizontal axes, respectively, is a fundamental tool for evaluating model performance. The area under this curve, where p(r) denotes precision as a function of recall r, defines the average precision (AP). This metric quantifies the model’s detection capability, with a higher AP value corresponding to superior performance. The AP is mathematically formulated as follows:
AP = \int_{0}^{1} p(r)\, dr
mAP50: This metric represents the mean average precision (AP) calculated at an intersection over union (IoU) threshold of 0.5. AP corresponds to the area under the precision–recall curve for a given class; mAP50 is derived by averaging the AP values across all object categories, as defined below. It serves as a fundamental benchmark in object detection, indicating the algorithm’s overall performance under a moderate localization requirement. The formula is as follows:
mAP = \frac{1}{N} \sum_{i=1}^{N} AP_{i}
mAP50-95: This value denotes the average mAP computed over multiple IoU thresholds, ranging from 0.5 to 0.95 in increments of 0.05. By evaluating detection consistency across varying localization strictness, it provides a comprehensive measure of model robustness and precision.
Giga Floating-Point Operations (GFLOPs): GFLOPs quantify the computational cost of a single forward pass through the model, expressed in billions of floating-point operations. For a single convolutional layer, the cost is given by
\mathrm{GFLOPs} = \frac{\left(2 \times C_{in} \times K^{2} - 1\right) \times W_{out} \times H_{out} \times C_{out}}{10^{9}}
Here, C i n and C o u t represent the input and output channel counts, respectively; K is the convolutional kernel size; and W o u t and H o u t denote the spatial dimensions of the output feature map. Higher GFLOPs indicate increased computational complexity, which in turn demands greater hardware processing capacity.
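The formula can be evaluated directly for a single convolutional layer; the helper function and the example layer dimensions below are illustrative, not values taken from the paper. Each output element costs $C_{in} \times K^2$ multiplications plus $C_{in} \times K^2 - 1$ additions (bias terms ignored).

```python
def conv_gflops(c_in: int, c_out: int, k: int, w_out: int, h_out: int) -> float:
    """GFLOPs of one convolutional layer per the formula above."""
    flops = (2 * c_in * k ** 2 - 1) * w_out * h_out * c_out
    return flops / 1e9

# e.g. a 3x3 conv, 64 -> 128 channels, on a 160x160 output feature map
print(round(conv_gflops(64, 128, 3, 160, 160), 3))  # 3.772
```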
Parameters (Params): Params denotes the total number of trainable parameters in the model, reported in millions.
Frames Per Second (FPS): FPS measures the throughput of a model in terms of the number of image frames processed within one second. This metric serves as a direct indicator of real-time inference capability, where a higher FPS corresponds to reduced processing latency. FPS was measured under standardized conditions: batch size of 1, input resolution of 640 × 640, FP32 precision, 100 warm-up iterations prior to timing, and no post-processing acceleration (e.g., TensorRT).
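As a concrete illustration of the AP integral defined above, it can be approximated from a finite set of precision–recall points. The sketch below uses the common monotone-envelope (all-point interpolation) variant; the toy PR values are illustrative and this is not the exact COCO protocol used for mAP50-95.

```python
def average_precision(recalls, precisions):
    """All-point interpolated AP: area under the P-R curve after making
    precision monotonically non-increasing. Approximates AP = ∫ p(r) dr."""
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # enforce the monotone precision envelope, right to left
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # sum rectangle areas where recall increases
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))

# toy PR points sorted by ascending recall
ap = average_precision([0.2, 0.4, 0.4, 0.8], [1.0, 0.9, 0.7, 0.6])
print(round(ap, 3))  # 0.62
```

Averaging such per-class AP values over all categories yields mAP; repeating the computation over IoU thresholds from 0.5 to 0.95 yields mAP50-95.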

3.3. Ablation Study

The individual and synergistic contributions of the proposed components were evaluated through an ablation study on the NEU-DET dataset, with results detailed in Table 2 and Figure 11. The experiments followed two strategic pathways: one isolating each module’s impact, and another combining modules to examine their interactive effects.
Group 1 (Baseline): The baseline YOLOv11n model established a performance benchmark with a 75.3% mAP50 and 41.5% mAP50-95.
Group 2 (M1): The addition of the C2PSA-IGM module, which integrates multi-semantic guidance, raised mAP50 to 77.5% and improved precision, indicating a stronger ability to discriminate visually similar defects and suppress background noise.
Group 3 (M2): Introducing the MulC3k2 module increased mAP50 to 76.6%, underscoring its efficacy in handling multi-scale and low-contrast defects through enhanced feature representation.
Group 4 (M3): Employing the TASFF detection head led to an mAP50 of 76.5%, confirming that adaptive multi-scale feature fusion mitigates cross-scale inconsistencies.
Group 5 (M4): Utilizing the Shape-IoU loss resulted in an mAP50 of 75.8% and the highest single-module mAP50-95 (42.8%), highlighting its role in refining localization accuracy and training stability for small or overlapping defects.
Group 6 (M1 + M2 + M3): The combination of MulC3k2, C2PSA-IGM, and TASFF achieved an mAP50 of 79.9%, demonstrating that multi-scale feature extraction, attentive feature refinement, and adaptive fusion work in concert.
Group 7 (M1 + M2 + M4): This configuration, focusing on feature enhancement and refined localization, achieved a competitive mAP50 of 79.5% with a low parameter count (2.5 M). The absence of TASFF, however, slightly reduced its adaptability to irregular shapes.
Group 8 (M1 + M3 + M4): This combination formed a compact, high-performance model (79.4% mAP50, 44.3% mAP50-95), effectively balancing complexity and accuracy.
Group 9 (M2 + M3 + M4): While excelling in noise suppression and boundary refinement (78.7% mAP50), the lack of C2PSA-IGM’s progressive guidance capabilities slightly weakened overall defect handling.
Group 10 (Full IMTS-YOLO): The integration of all four modules yielded the best overall performance, with mAP50 reaching 80.3% and mAP50-95 reaching 44.7%. This result confirms that each module provides unique, non-overlapping benefits, and their full integration achieves optimal synergistic performance.
This paper compares the impact of different loss functions on detection performance, as shown in Table 3. In terms of evaluation metrics, GIoU achieves the highest precision (80.9%) but performs relatively poorly in recall (67.8%) and mAP50 (78.3%). DIoU yields a lower precision (76.4%). SIoU attains the best result in mAP50-95 (44.8%). In contrast, Shape-IoU demonstrates the highest mAP50 (80.3%) and exhibits a more balanced overall performance, making it the most suitable choice of loss function.

3.4. Performance Evaluation: Precision–Recall Analysis and Visual Assessment

The detection capabilities of the baseline YOLOv11n and our proposed IMTS-YOLO framework are quantitatively compared through P, R, and PR curves generated on the NEU-DET test set, as visualized in Figure 12 and Figure 13, respectively. These curves provide comprehensive insights into the per-class detection performance while revealing overall trends through mean average precision (mAP) metrics.
To provide an intuitive comparison between the YOLOv11n and IMTS-YOLO models in terms of recognition performance, this study conducted a visual analysis of their outputs on the NEU-DET dataset. The visualization covers both prediction bounding boxes and feature heatmaps, as illustrated in Figure 13 and Figure 14. The heatmaps were generated by visualizing the activation maps from the final convolutional layer of the backbone network. We employed Grad-CAM++ to compute gradient-weighted class activation, followed by normalization to the range [0, 1] and overlaying onto the original image using a jet colormap. This approach highlights regions that most contribute to the detection decision, providing interpretable evidence for model behavior. Overall, the visual results demonstrate that the IMTS-YOLO model not only maintains high detection accuracy but also exhibits superior defect localization capability and a lower false negative rate. Its predictions show significant advantages in terms of bounding box alignment precision and confidence calibration.
Notably, IMTS-YOLO shows enhanced capability in handling challenging cases where defect characteristics exhibit minimal contrast with background elements. The improved performance is consistently observed across various defect typologies, confirming the model’s robustness in addressing the complex challenges inherent in steel surface defect inspection. The visual evidence corroborates the quantitative metrics, establishing IMTS-YOLO as a superior solution for industrial surface quality assessment.

3.5. Comparison with Mainstream Object Detection Algorithms

We benchmarked the proposed IMTS-YOLO model against a range of state-of-the-art detectors on the NEU-DET dataset, as shown in Table 4 and Figure 15. The comparison includes two-stage models like Faster R-CNN, which exhibit high precision but substantial computational loads, and various generations of single-stage models, including the YOLO series.
The results demonstrate that IMTS-YOLO achieves the highest mAP50 (80.3%) and mAP50-95 (44.7%) among the compared models. Notably, it surpasses its baseline, YOLOv11n, by a significant margin (5.0% in mAP50) and also outperforms the lightweight YOLOv8n. This is achieved while maintaining real-time inference speed and a parameter count that is orders of magnitude lower than that of two-stage detectors. This balance between high accuracy and computational efficiency confirms the model’s strong suitability for practical industrial defect detection applications.

3.6. Cross-Dataset Generalization Assessment

To validate the generalization capacity of the proposed IMTS-YOLO architecture, a comprehensive comparative analysis was performed on the GC10-DET benchmark. The precision–recall characteristics of both YOLOv11n and our enhanced model are depicted in Figure 16 and Figure 17, respectively, while Table 5 presents systematic comparisons with contemporary detection algorithms.
Experimental results demonstrate IMTS-YOLO’s superior cross-dataset performance, achieving leading metrics across multiple dimensions: 75.4% precision, 69.2% recall, and peak mAP scores of 73.9% (mAP50) and 40.4% (mAP50-95). When compared to the second-best performer RT-DETR-R18, our method delivers a 2.3% mAP50 enhancement while reducing parameter requirements by 80.4%. The architecture also shows significant improvements over its baseline, with 4.4% and 3.1% gains in mAP50 and mAP50-95, respectively.
Importantly, all improvements are realized within a computationally lightweight architecture. The proposed solution successfully bridges the gap between detection precision and operational economy, delivering superior performance while preserving low computational demands. This effective synergy between accuracy and efficiency strongly supports the model’s practical viability in real-world settings.
The consistent performance across diverse datasets underscores IMTS-YOLO’s exceptional generalization capabilities in steel surface inspection tasks. The architecture demonstrates particular strength in adapting to varying defect morphologies and imaging conditions, establishing its suitability for real-world quality control applications where environmental factors and defect characteristics may vary substantially.

4. Discussion

Although the model proposed in this study demonstrates potential for specific tasks, existing technologies still possess significant room for optimization at the level of engineering deployment to meet the stringent time-sensitive requirements of industrial applications. To facilitate the transition from theoretical research to mature industrial practice, future investigations can be pursued along the following critical directions:
Firstly, the effectiveness of any detection framework is fundamentally governed by the quality and quantity of its training data. In deep learning-based surface defect identification, model performance depends heavily on the richness and fidelity of the visual examples provided for learning. When datasets are small, traditional data augmentation strategies, which typically apply basic geometric or photometric transformations to existing images, often fall short: they may fail to diversify the data’s underlying feature distribution, thereby constraining the model’s learning potential. A promising avenue for future inquiry is to harness the synthetic data generation capabilities of state-of-the-art generative models, such as generative adversarial networks (GANs), which can produce a wide spectrum of photorealistic and varied defect imagery to enrich the training corpus. In parallel, a foundational strategy for performance enhancement is to elevate the quality of the source imagery itself, by prioritizing high-resolution capture at the point of acquisition and integrating advanced pre-processing techniques, such as noise reduction algorithms, to preserve critical information and furnish the model with superior input data.
Secondly, regarding the refinement and enhancement of model performance, the current architecture still offers considerable scope for exploration. While the model presented in this study achieves a preliminary trade-off between detection speed and accuracy, the practical application of steel defect detection is often situated in industrial environments with constrained computational resources. Such settings impose exceptionally high standards for a system’s real-time responsiveness and long-term operational stability, and the performance of the existing model under certain extreme operating conditions may not yet fully meet production-grade requirements. Consequently, subsequent research can incorporate model compression techniques, such as network pruning and knowledge distillation, to perform a deep optimization of the existing model architecture. This aims to more effectively balance the dual demands for efficiency and precision inherent in industrial inspection scenarios.
Finally, the issue of the model’s deployment adaptability in industrial contexts requires urgent attention. On actual production lines, systems for steel surface defect detection typically need to operate on edge devices—such as the NVIDIA Jetson Nano series—where both computational power and memory resources are limited. To overcome these hardware bottlenecks, future work will focus on deployment experiments tailored for edge computing, employing techniques like model quantization and acceleration frameworks such as TensorRT to boost operational efficiency. Furthermore, given the variability across different production line processes and imaging conditions, the model’s cross-scenario generalization capability becomes paramount. Subsequent efforts will be dedicated to exploring domain adaptation strategies, including but not limited to feature distribution alignment, adversarial training, and self-supervised domain-invariant representation learning. The goal is to enhance the model’s adaptability to unseen environments and reduce its reliance on extensive labeled data from target domains. These research directions will effectively bridge the gap between laboratory-level performance metrics and industrial-grade robustness.

5. Conclusions

The experimental results substantiate that our proposed IMTS-YOLO framework achieves remarkable performance in steel surface defect inspection while maintaining robust generalization capabilities. Through the systematic integration of the MulC3k2, C2PSA-IGM, and TASFF architectural components, coupled with the novel Shape-IoU loss function, our framework effectively enhances feature extraction and representation capabilities, particularly for fine-grained and morphologically variable defect patterns. Comprehensive ablation studies conducted on the NEU-DET benchmark provide compelling evidence for the individual and synergistic contributions of each proposed enhancement.
Our methodology demonstrates consistent improvements across multiple performance dimensions, including detection precision, confidence calibration, false positive reduction, and robustness under varying imaging conditions. Notably, the cross-dataset validation on the more challenging GC10-DET dataset further confirms the model’s strong adaptability and generalization ability, achieving a 4.4% improvement in mAP50 over the baseline while maintaining competitive computational efficiency. This balanced trade-off between high accuracy and efficient resource utilization positions IMTS-YOLO as a practical, scalable, and deployable solution for real-world industrial quality control applications, where both reliability and speed are critical.

Author Contributions

Conceptualization, P.F.; Methodology, P.F.; Software, P.F.; Validation, P.F.; Formal analysis, J.H.; Investigation, J.H. and N.X.; Resources, B.W.; Data curation, B.W.; Writing—original draft, P.F.; Writing—review and editing, H.Y.; Supervision, H.Y.; Project administration, H.Y. and Y.G.; Funding acquisition, H.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by Scientific Research Fund of Zhejiang Provincial Education Department (Grant No. Y202353517).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available in [NEU-DET] at [http://faculty.neu.edu.cn/songkechen/zh_CN/zdylm/263270/list/index.htm] accessed on 1 January 2026 and in [GC10-DET] at [https://aistudio.baidu.com/datasetdetail/90446] accessed on 1 January 2026.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Zhang, Y.; Zhang, H.; Huang, Q.; Han, Y.; Zhao, M. DsP-YOLO: An anchor-free network with DsPAN for small object detection of multiscale defects. Expert Syst. Appl. 2024, 241, 122669. [Google Scholar] [CrossRef]
  2. Zhang, D.; Hao, X.; Wang, D.; Qin, C.; Zhao, B.; Liang, L.; Liu, W. An efficient lightweight convolutional neural network for industrial surface defect detection. Artif. Intell. Rev. 2023, 56, 10651–10677. [Google Scholar] [CrossRef]
  3. Huang, X.; Zhu, J.; Huo, Y. SSA-YOLO: An Improved YOLO for Hot-Rolled Strip Steel Surface Defect Detection. IEEE Trans. Instrum. Meas. 2024, 73, 1–17. [Google Scholar] [CrossRef]
  4. Zhang, T.; Ma, C.; Liu, Z.; ur Rehman, S.; Li, Y.; Saraee, M. Gas pipeline defect detection based on improved deep learning approach. Expert Syst. Appl. 2025, 267, 126212. [Google Scholar] [CrossRef]
  5. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
  6. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  7. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Computer Vision—ECCV 2016, Proceedings of the 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Cham, Switzerland, 2016; pp. 21–37. [Google Scholar]
  8. Zhao, C.; Shu, X.; Yan, X.; Zuo, X.; Zhu, F. RDD-YOLO: A modified YOLO for detection of steel surface defects. Measurement 2023, 214, 112776. [Google Scholar] [CrossRef]
  9. Liang, C.; Wang, Z.Z.; Liu, X.L.; Zhang, P.; Tian, Z.W.; Qian, R.L. SDD-Net: A Steel Surface Defect Detection Method Based on Contextual Enhancement and Multiscale Feature Fusion. IEEE Access 2024, 12, 185740–185756. [Google Scholar] [CrossRef]
  10. Gui, Z.; Geng, J. YOLO-ADS: An Improved YOLOv8 Algorithm for Metal Surface Defect Detection. Electronics 2024, 13, 3129. [Google Scholar] [CrossRef]
  11. Ma, H.; Zhang, Z.; Zhao, J. A Novel ST-YOLO Network for Steel-Surface-Defect Detection. Sensors 2023, 23, 9152. [Google Scholar] [CrossRef]
  12. Liu, S.; Huang, D.; Wang, Y. Learning Spatial Fusion for Single-Shot Object Detection. arXiv 2019, arXiv:1911.09516. [Google Scholar] [CrossRef]
  13. Zhang, H.; Zhang, S. Shape-IoU: More Accurate Metric considering Bounding Box Shape and Scale. arXiv 2023, arXiv:2312.17663. [Google Scholar] [CrossRef]
  14. Feng, X.; Gao, X.; Luo, L. X-SDD: A New Benchmark for Hot Rolled Steel Strip Surface Defects Detection. Symmetry 2021, 13, 706. [Google Scholar] [CrossRef]
  15. Cheng, Z.; Gao, L.; Wang, Y.; Deng, Z.; Tao, Y. EC-YOLO: Effectual Detection Model for Steel Strip Surface Defects Based on YOLO-V5. IEEE Access 2024, 12, 62765–62778. [Google Scholar] [CrossRef]
  16. Lu, M.; Sheng, W.; Zou, Y.; Chen, Y.; Chen, Z. WSS-YOLO: An improved industrial defect detection network for steel surface defects. Measurement 2024, 236, 115060. [Google Scholar] [CrossRef]
  17. Liu, X.; Gao, J. Surface Defect Detection Method of Hot Rolling Strip Based on Improved SSD Model. In Database Systems for Advanced Applications, Proceedings of the DASFAA 2021 International Workshops, Taipei, Taiwan, 11–14 April 2021; Springer: Cham, Switzerland, 2021; pp. 209–222. [Google Scholar]
  18. Xie, W.; Sun, X.; Ma, W. A light weight multi-scale feature fusion steel surface defect detection model based on YOLOv8. Meas. Sci. Technol. 2024, 35, 55017. [Google Scholar] [CrossRef]
  19. Si, Y.; Xu, H.; Zhu, X.; Zhang, W.; Dong, Y.; Chen, Y.; Li, H. SCSA: Exploring the synergistic effects between spatial and channel attention. Neurocomputing 2025, 634, 129866. [Google Scholar] [CrossRef]
  20. Yi, F.; Zhang, H.; Yang, J.; He, L.; Mohamed, A.S.A.; Gao, S. YOLOv7-SiamFF: Industrial defect detection algorithm based on improved YOLOv7. Comput. Electr. Eng. 2024, 114, 109090. [Google Scholar] [CrossRef]
  21. Wang, Y.; Li, Y.; Wang, G.; Liu, X. Multi-scale Attention Network for Single Image Super-Resolution. arXiv 2022, arXiv:2209.14145. [Google Scholar] [CrossRef]
  22. Khanam, R.; Hussain, M. YOLOv11: An Overview of the Key Architectural Enhancements. arXiv 2024, arXiv:2410.17725. [Google Scholar] [CrossRef]
  23. Bao, Y.; Song, K.; Liu, J.; Wang, Y.; Yan, Y.; Yu, H.; Li, X. Triplet-Graph Reasoning Network for Few-Shot Metal Generic Surface Defect Segmentation. IEEE Trans. Instrum. Meas. 2021, 70, 1–11. [Google Scholar] [CrossRef]
  24. Song, K.; Yan, Y. A noise robust method based on completed local binary patterns for hot-rolled steel strip surface defects. Appl. Surf. Sci. 2013, 285, 858–864. [Google Scholar] [CrossRef]
  25. He, Y.; Song, K.; Meng, Q.; Yan, Y. An End-to-End Steel Surface Defect Detection Approach via Fusing Multiple Hierarchical Features. IEEE Trans. Instrum. Meas. 2020, 69, 1493–1504. [Google Scholar] [CrossRef]
  26. Lv, X.; Duan, F.; Jiang, J.-j.; Fu, X.; Gan, L. Deep Metallic Surface Defect Detection: The New Benchmark and Detection Network. Sensors 2020, 20, 1562. [Google Scholar] [CrossRef]
  27. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2017, arXiv:1706.03762. [Google Scholar]
  28. Ma, N.; Zhang, X.; Zheng, H.-T.; Sun, J. ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design. In Proceedings of the Computer Vision—ECCV 2018, Munich, Germany, 8–14 September 2018; Springer: Cham, Switzerland, 2018; pp. 122–138. [Google Scholar]
  29. Wu, Y.; He, K. Group Normalization. In Proceedings of the Computer Vision—ECCV 2018, Munich, Germany, 8–14 September 2018; Springer: Cham, Switzerland, 2018; pp. 3–19. [Google Scholar]
  30. Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv 2015, arXiv:1502.03167. [Google Scholar] [CrossRef]
  31. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  32. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  33. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6848–6856. [Google Scholar]
  34. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the Computer Vision—ECCV 2018, Munich, Germany, 8–14 September 2018; Springer: Cham, Switzerland, 2018; pp. 3–19. [Google Scholar]
  35. Huang, H.; Chen, Z.; Zou, Y.; Lu, M.; Chen, C.; Song, Y.; Zhang, H.; Yan, F. Channel prior convolutional attention for medical image segmentation. Comput. Biol. Med. 2024, 178, 108784. [Google Scholar] [CrossRef]
  36. Yu, W.; Si, C.; Zhou, P.; Luo, M.; Zhou, Y.; Feng, J.; Yan, S.; Wang, X. MetaFormer Baselines for Vision. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 896–912. [Google Scholar] [CrossRef]
  37. Guo, M.-H.; Lu, C.-Z.; Liu, Z.-N.; Cheng, M.-M.; Hu, S.-M. Visual attention network. Comput. Vis. Media 2023, 9, 733–752. [Google Scholar] [CrossRef]
  38. Xie, W.; Ma, W.; Sun, X. An efficient re-parameterization feature pyramid network on YOLOv8 to the detection of steel surface defect. Neurocomputing 2025, 614, 128775. [Google Scholar] [CrossRef]
  39. Wang, F.; Jiang, X.; Han, Y.; Wu, L. YOLO-LSDI: An Enhanced Algorithm for Steel Surface Defect Detection Using a YOLOv11 Network. Electronics 2025, 14, 2576. [Google Scholar] [CrossRef]
  40. Liu, P.; Yuan, X.; Han, Q.; Xing, B.; Hu, X.; Zhang, J. Micro-defect Varifocal Network: Channel attention and spatial feature fusion for turbine blade surface micro-defect detection. Eng. Appl. Artif. Intell. 2024, 133, 108075. [Google Scholar] [CrossRef]
  41. Zhou, H.; Zou, H.; Hu, G. An efficient and lightweight algorithm for detecting surface defects of steel based on SCCI-YOLO. Sci. Rep. 2025, 15, 36276. [Google Scholar] [CrossRef] [PubMed]
  42. Song, H. RSTD-YOLOv7: A steel surface defect detection based on improved YOLOv7. Sci. Rep. 2025, 15, 19649. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Samples of the NEU-DET dataset and their label distribution. Specific content can be seen in the explanations in the text.
Figure 2. Samples of the GC10-DET dataset and their label distribution. Specific content can be seen in the explanations in the text.
Figure 3. YOLOv11 model [6].
Figure 4. IMTS-YOLO model [6].
Figure 5. Details of IMTS-YOLO modules [6].
Figure 6. The structure of the IGM, with its constituent PCSSA and MSSA modules detailed in the accompanying diagrams. The symbolic conventions used are defined as follows: Let B denote the batch size, C the channel count, H and W the spatial dimensions of the feature map, and n the number of partitions for sub-features.
Figure 7. MulBk structure.
Figure 8. Architecture of MulBk module.
Figure 9. TASFF head details.
Figure 10. Shape-IoU.
Figure 11. Final results of the ablation study on the NEU-DET dataset, presented as a line graph of four metrics: precision, recall, mAP50, and mAP50-95.
Figure 12. Comparison of precision, recall, and PR curves between the baseline and our model on the NEU-DET dataset.
Figure 13. Detection result comparison between baseline and our model: (a) original image; (b) YOLOv11n results; (c) IMTS-YOLO (ours) results.
Figure 14. Heatmap visualization comparison between baseline and our model: (a) original image; (b) YOLOv11n results; (c) IMTS-YOLO (ours) results.
Figure 15. Scatter plot of our model versus mainstream object detection algorithms on the NEU-DET dataset, with GFLOPs on the x-axis and mAP50 on the y-axis.
Figure 16. Comparison of precision, recall, and PR curves between the baseline and our model on the GC10-DET dataset.
Figure 17. Scatter plot of our model versus mainstream object detection algorithms on the GC10-DET dataset, with GFLOPs on the x-axis and mAP50 on the y-axis.
Table 1. Model training parameters.

| Parameter | Value |
|---|---|
| Epochs | 300 |
| Momentum | 0.937 |
| Initial Learning Rate | 0.01 |
| Optimizer | SGD |
| Batch Size | 32 |
| Weight Decay | 0.0005 |
| Mosaic | 1.0 |
| Mixup | 0.0 |
| Hsv_h | 0.015 |
| Hsv_s | 0.7 |
| Hsv_v | 0.4 |
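The settings in Table 1 correspond to standard Ultralytics-style training hyperparameters. A minimal sketch of how they would be passed to a trainer is shown below; the dataset YAML name is a placeholder, and the argument names assume the Ultralytics API rather than the authors' exact training script.

```python
# Training hyperparameters from Table 1, using Ultralytics-style argument names.
TRAIN_CFG = dict(
    epochs=300,          # training epochs
    momentum=0.937,      # SGD momentum
    lr0=0.01,            # initial learning rate
    optimizer="SGD",
    batch=32,
    weight_decay=0.0005,
    mosaic=1.0,          # mosaic augmentation probability
    mixup=0.0,           # mixup disabled
    hsv_h=0.015, hsv_s=0.7, hsv_v=0.4,  # HSV color-jitter gains
)

# Typical invocation (requires the `ultralytics` package and a dataset YAML;
# "NEU-DET.yaml" is a placeholder path):
# from ultralytics import YOLO
# YOLO("yolo11n.pt").train(data="NEU-DET.yaml", **TRAIN_CFG)
```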
Table 2. Ablation study data on NEU-DET.

| Group | C2PSA-IGM (M1) | MulC3k2 (M2) | TASFF (M3) | Shape-IoU (M4) | P (%) | R (%) | Params (M) | GFLOPs | mAP50 (%) | mAP50-95 (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | – | – | – | – | 70.3 | 69.9 | 2.5 | 6.4 | 75.3 | 41.5 |
| 2 | ✓ | – | – | – | 73.4 | 71.8 | 2.5 | 6.4 | 77.5 | 42.6 |
| 3 | – | ✓ | – | – | 70.6 | 70.1 | 2.6 | 6.7 | 76.6 | 41.1 |
| 4 | – | – | ✓ | – | 71.9 | 70.8 | 3.9 | 8.6 | 76.5 | 41.6 |
| 5 | – | – | – | ✓ | 70.5 | 70.0 | 2.5 | 6.4 | 75.8 | 42.8 |
| 6 | ✓ | ✓ | ✓ | – | 77.1 | 73.0 | 3.9 | 8.8 | 79.9 | 43.4 |
| 7 | ✓ | ✓ | – | ✓ | 75.9 | 73.2 | 2.6 | 6.7 | 79.5 | 44.1 |
| 8 | ✓ | – | ✓ | ✓ | 77.0 | 73.3 | 3.9 | 8.6 | 79.4 | 44.3 |
| 9 | – | ✓ | ✓ | ✓ | 74.7 | 72.1 | 3.9 | 8.8 | 78.7 | 43.8 |
| 10 | ✓ | ✓ | ✓ | ✓ | 78.0 | 73.8 | 3.9 | 8.8 | 80.3 | 44.7 |
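As a quick sanity check, the headline gains quoted in the abstract follow directly from the baseline and full-model rows of the table:

```python
# Baseline YOLOv11n (group 1) vs. full IMTS-YOLO (group 10) on NEU-DET, Table 2.
baseline = {"mAP50": 75.3, "mAP50_95": 41.5}
full     = {"mAP50": 80.3, "mAP50_95": 44.7}

gain_map50 = round(full["mAP50"] - baseline["mAP50"], 1)          # +5.0 points
gain_map50_95 = round(full["mAP50_95"] - baseline["mAP50_95"], 1)  # +3.2 points
print(gain_map50, gain_map50_95)
```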
Table 3. Comparison of detection indicators for different bounding box regression loss functions.

| Loss Function | P (%) | R (%) | mAP50 (%) | mAP50-95 (%) |
|---|---|---|---|---|
| Base (CIoU) | 77.1 | 73.0 | 79.9 | 43.4 |
| GIoU | 80.9 | 67.8 | 78.3 | 42.4 |
| DIoU | 76.4 | 73.9 | 79.4 | 43.5 |
| EIoU | 78.7 | 71.6 | 80.1 | 43.5 |
| SIoU | 76.9 | 71.7 | 79.0 | 44.8 |
| Shape-IoU | 78.0 | 73.8 | 80.3 | 44.7 |
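All of the regression losses compared above (CIoU, GIoU, DIoU, EIoU, SIoU, Shape-IoU) start from the same IoU overlap term and differ in the penalty terms they add; Shape-IoU, roughly speaking, additionally weights those penalties by the scale and shape of the ground-truth box. For reference, a minimal sketch of plain axis-aligned IoU (not the paper's Shape-IoU implementation):

```python
def box_iou(a, b):
    """Plain IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])   # intersection top-left
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])   # intersection bottom-right
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# An IoU-based regression loss is then 1 - IoU plus method-specific penalties.
print(box_iou((0, 0, 2, 2), (1, 0, 3, 2)))  # overlap 2, union 6 -> ~0.333
```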
Table 4. Comparative experiment of mainstream object detection algorithms on NEU-DET.

| Model | P (%) | R (%) | Params (M) | GFLOPs | mAP50 (%) | mAP50-95 (%) | FPS |
|---|---|---|---|---|---|---|---|
| Faster R-CNN [16] | 33.0 | 91.1 | 138.4 | 368.2 | 73.6 | 33.0 | 36.0 |
| SSD [38] | – | – | 25.1 | 88.2 | 70.8 | – | 37.7 |
| RT-DETR [38] | – | – | 28.5 | 100.6 | 73.5 | – | 66.1 |
| Deformable DETR [39] | – | – | 34.2 | 78.0 | 71.6 | 40.1 | 118.7 |
| YOLOv3 [16] | 76.3 | 71.2 | 103.7 | 282.2 | 76.8 | 42.5 | 67.0 |
| YOLOv5s [16] | 74.7 | 74.7 | 7.0 | 15.8 | 76.8 | 42.4 | 220.0 |
| VF-Net [40] | 38.1 | 59.95 | 99.11 | 40.97 | 70.6 | – | 12.2 |
| YOLOv7-Tiny [16] | 73.4 | 66.4 | 6.0 | 13.1 | 74.0 | 37.1 | 165.0 |
| YOLOv8n | 69.2 | 77.4 | 3.0 | 8.1 | 75.2 | 40.9 | 212.5 |
| YOLOv10n [39] | – | – | 2.7 | 8.2 | 73.7 | 41.8 | 220.7 |
| YOLOv11n | 70.3 | 69.9 | 2.5 | 6.4 | 75.3 | 41.5 | 235.3 |
| RDD-YOLO [8] | – | – | – | – | 81.1 | – | 57.8 |
| SCCI-YOLO [41] | – | – | 1.7 | – | 78.6 | 45.3 | 270.2 |
| IMTS-YOLO (Ours) | 78.0 | 73.8 | 3.9 | 8.8 | 80.3 | 44.7 | 268.2 |
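The FPS column converts directly into per-image latency (latency in ms = 1000 / FPS); a quick conversion for the baseline and our model:

```python
# Per-image latency implied by the FPS figures in Table 4.
fps = {"YOLOv11n": 235.3, "IMTS-YOLO": 268.2}
latency_ms = {name: round(1000.0 / f, 2) for name, f in fps.items()}
print(latency_ms)  # YOLOv11n ~4.25 ms/image, IMTS-YOLO ~3.73 ms/image
```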
Table 5. Comparative experiment of mainstream object detection algorithms on GC10-DET.

| Model | P (%) | R (%) | Params (M) | GFLOPs | mAP50 (%) | mAP50-95 (%) | FPS |
|---|---|---|---|---|---|---|---|
| Faster R-CNN [16] | 38.2 | 59.4 | 138.4 | 368.2 | 56.9 | 20.4 | 35.0 |
| SSD [38] | – | – | 25.7 | 88.8 | 68.3 | – | 37.5 |
| RT-DETR-R18 [16] | 72.5 | 69.5 | 19.9 | 55.4 | 71.6 | 36.8 | 137.0 |
| YOLOv3 [16] | 62.4 | 62.7 | 103.7 | 282.2 | 62.2 | 32.5 | 106.0 |
| YOLOv5s [16] | 72.4 | 66.5 | 7.0 | 15.8 | 69.4 | 35.6 | 239.0 |
| VF-Net [42] | – | – | – | – | 64.5 | – | – |
| YOLOv7-Tiny [16] | 76.2 | 58.9 | 6.0 | 13.1 | 68.1 | 33.8 | 208.0 |
| YOLOv8n [16] | 65.0 | 67.2 | 3.0 | 8.1 | 68.7 | 36.1 | 222.0 |
| YOLOv11n | 70.7 | 66.2 | 2.5 | 6.4 | 69.5 | 37.3 | 224.3 |
| RDD-YOLO [8] | – | – | – | – | 75.2 | – | 57.5 |
| SCCI-YOLO [41] | – | – | 1.7 | – | 67.3 | 33.4 | – |
| IMTS-YOLO (Ours) | 75.4 | 69.2 | 3.9 | 8.8 | 73.9 | 40.4 | 238.4 |

Share and Cite

MDPI and ACS Style

Fu, P.; Yuan, H.; He, J.; Wu, B.; Xu, N.; Gu, Y. IMTS-YOLO: A Steel Surface Defect Detection Model Integrating Multi-Scale Perception and Progressive Attention. Coatings 2026, 16, 51. https://doi.org/10.3390/coatings16010051