Article

Detection of Surface Defects in Steel Based on Dual-Backbone Network: MBDNet-Attention-YOLO

1 School of Materials Science and Engineering, Shanghai University of Engineering Science, Shanghai 201620, China
2 School of Arts and Sciences, Northeast Agricultural University, Harbin 150030, China
3 School of Lilac, Harbin Institute of Technology (Weihai), Weihai 264209, China
4 Shanghai Collaborative Innovation Center of Laser Advanced Manufacturing Technology, Shanghai University of Engineering Science, Shanghai 201620, China
* Author to whom correspondence should be addressed.
Sensors 2025, 25(15), 4817; https://doi.org/10.3390/s25154817
Submission received: 12 June 2025 / Revised: 28 July 2025 / Accepted: 30 July 2025 / Published: 5 August 2025
(This article belongs to the Section Sensing and Imaging)

Abstract

Automated surface defect detection in steel manufacturing is pivotal for ensuring product quality, yet it remains an open challenge owing to the extreme heterogeneity of defect morphologies—ranging from hairline cracks and microscopic pores to elongated scratches and shallow dents. Existing approaches, whether classical vision pipelines or recent deep-learning paradigms, struggle to simultaneously satisfy the stringent demands of industrial scenarios: high accuracy on sub-millimeter flaws, insensitivity to texture-rich backgrounds, and real-time throughput on resource-constrained hardware. Although contemporary detectors have narrowed the gap, they still exhibit pronounced sensitivity–robustness trade-offs, particularly in the presence of scale-varying defects and cluttered surfaces. To address these limitations, we introduce MBY (MBDNet-Attention-YOLO), a lightweight yet powerful framework that synergistically couples the MBDNet backbone with the YOLO detection head. Specifically, the backbone embeds three novel components: (1) HGStem, a hierarchical stem block that enriches low-level representations while suppressing redundant activations; (2) Dynamic Align Fusion (DAF), an adaptive cross-scale fusion mechanism that dynamically re-weights feature contributions according to defect saliency; and (3) C2f-DWR, a depth-wise residual variant that progressively expands receptive fields without incurring prohibitive computational costs. Building upon this enriched feature hierarchy, the neck employs our proposed MultiSEAM module—a cascaded squeeze-and-excitation attention mechanism operating at multiple granularities—to harmonize fine-grained and semantic cues, thereby amplifying weak defect signals against complex textures. Finally, we integrate the Inner-SIoU loss, which refines the geometric alignment between predicted and ground-truth boxes by jointly optimizing center distance, aspect ratio consistency, and IoU overlap, leading to faster convergence and tighter localization. Extensive experiments on two publicly available steel-defect benchmarks—NEU-DET and PVEL-AD—demonstrate the superiority of MBY. Without bells and whistles, our model achieves 85.8% mAP@0.5 on NEU-DET and 75.9% mAP@0.5 on PVEL-AD, surpassing the best-reported results by significant margins while maintaining real-time inference on an NVIDIA Jetson Xavier. Ablation studies corroborate the complementary roles of each component, underscoring MBY’s robustness across defect scales and surface conditions. These results suggest that MBY strikes an appealing balance between accuracy, efficiency, and deployability, offering a pragmatic solution for next-generation industrial quality-control systems.

1. Introduction

Driven by progressive industrial automation, the steel sector now stands as a linchpin of worldwide infrastructure; its quality directly governs structural safety and operational longevity [1]. Central to this quality assurance pipeline is surface defect detection [2], since imperfections—pores, cracks, scratches, pits, and decarburization—frequently originate from raw-material contaminants, process deviations, or equipment malfunctions [3]. Beyond cosmetic degradation, these anomalies erode mechanical integrity and corrosion resistance: micro-cracks and pores act as fracture nuclei during subsequent forming or service, whereas scratches and pits undermine coating adhesion and surface-treatment efficacy. From a metallurgical standpoint, many surface defects originate from microstructural mechanisms such as grain boundary decohesion, inclusion-induced stress concentration, or localized plastic deformation. These processes lead to macroscopic manifestations like cracks, pores, and patches—patterns that are directly captured in visual inspection and form the basis of defect recognition in computer vision models.
Surface defect detection is critical in both hot-rolled and cold-rolled steel production, as each involves distinct defect characteristics and detection challenges. Hot-rolled steel typically suffers from scale, cracks, and inclusions due to exposure to high temperatures and intense deformation forces, while cold-rolled steel is more susceptible to scratches, dents, and irregularities caused by roll wear or lubrication inconsistencies. Effective detection systems must be tailored to these domain-specific conditions. In this study, we focus specifically on hot-rolled steel to address its characteristic defect patterns and industrial requirements.
Steel surface anomaly detection has crystallized into two paradigms: traditional image-processing pipelines and deep-learning-based intelligent systems [4]. Classical techniques hinge on operator expertise and hand-crafted descriptors, thereby failing to reconcile the modern production mandate for simultaneous throughput and sub-millimeter precision [5]. With the advent of Machine Learning and Cognitive Computing—most notably deep-learning—data-driven detectors have become the prevailing direction. Current detectors fall into two camps: two-stage models like R-CNN [6] give high accuracy but run slowly because they first generate region proposals and then refine them, while single-stage models like YOLO [7,8,9] and SSD [10] skip the proposal step for speed but lose a bit of accuracy. Since the original YOLOv1 was introduced in 2016, the YOLO series has undergone continual improvement. YOLOv3 enhanced small-object detection via multi-scale predictions. YOLOv4 introduced CSP modules and Mish activation to boost accuracy and speed. YOLOv5 streamlined training pipelines and deployment. YOLOv7 integrated E-ELAN and re-parameterization for better real-time accuracy. Most recently, YOLOv8 refined architecture design and loss functions for improved robustness and performance. These developments have solidified YOLO’s dominance in efficient, high-precision object detection, especially in industrial scenarios. However, static Feature Pyramid Networks (FPNs), as widely used in prior detection frameworks, rely on fixed top-down fusion pathways and uniform weighting schemes, which limits their ability to adapt to complex defect distributions. In industrial scenarios where defect sizes vary drastically and backgrounds exhibit strong interference, such rigid structures often fail to capture informative scale-specific features. This limitation motivates the use of Dynamic Align Fusion (DAF), which adaptively aligns and aggregates multi-scale features based on spatial and semantic cues. Moreover, steel defects in real production are irregular and often sub-millimeter, making precise localization difficult. Reliable detectors must therefore (1) extract hierarchical features that capture subtle cues and (2) remain stable across varying backgrounds, textures, and lighting; both demands place significant constraints on robustness.
Consequently, CNN-based object detection has emerged as the dominant frontier in steel surface anomaly identification. To evaluate detection performance, we adopt standard mean Average Precision (mAP) metrics at IoU thresholds of 0.5 (mAP@0.5) and the averaged range from 0.5 to 0.95 with 0.05 intervals (mAP@0.5–0.95), which are widely used in defect detection to assess both detection sensitivity and localization accuracy. Zhao et al. [11] propose RDD-YOLO, an augmented YOLOv5 variant that enlarges the effective receptive field via Res2Net blocks, reinforces feature reuse through a Dual Feature Pyramid Network (DFPN), and disentangles classification and regression by adopting a decoupled detection head. On NEU-DET and GC10-DET, RDD-YOLO improves mAP by 4.3% and 5.8%, respectively, without compromising inference speed. Complementarily, Sunkara et al. [12] introduce YOGA, a lightweight architecture that marries efficient feature learning with multi-scale attention. A CSPGhostNet backbone and AFF-PANet fusion layer jointly condense parameters and computation, while a two-stage refinement pipeline and grouped convolutions further curb FLOPs. Crucially, a localized attention mechanism in the neck selectively amplifies small-defect cues. On the COCO validation set, YOGA elevates AP by 15% relative to YOLOv5 while trimming computational cost by 29% and parameter count by 23%, underscoring its suitability for resource-constrained edge deployment.
In addition to CNN-based approaches, transformer-based architectures and learning-based methods have recently gained increasing attention in the field of surface defect detection. Transformer-based models are particularly effective in capturing long-range dependencies and complex contextual features. For instance, Zhou et al. [13] proposed the GDALR model, which integrates a dual-branch transformer structure to simultaneously capture both local and global defect features. The model leverages a pure transformer branch for short-range dependency modeling through token detail aggregation, while utilizing pixel shuffle operations to enhance salient global representations across feature scales.
Similarly, Sun et al. [14] introduced SDD-DETR, which is the first work to apply the Detection Transformer (DETR) framework to aero-engine blade surface defect detection. The model incorporates two lightweight modules—a progressive feature input multi-scale deformable attention module (PFI-MSDA) and a lightweight feedforward network (LW-FFN)—to reduce computational costs while maintaining detection accuracy. Specifically, PFI-MSDA hierarchically reduces token inputs to the self-attention layers, and LW-FFN streamlines the network’s multilayer perceptron, making the model more efficient without sacrificing detection performance.
Huang et al. [15] developed ACViT, an adaptive cross transformer integrated with contrastive learning, to improve the detection of small-scale defects. ACViT adopts a meta-learning framework to enhance generalization across varying defect detection tasks. The model also incorporates self-supervised contrastive learning to improve feature distinctiveness, thereby increasing robustness against diverse defect types [16].
Beyond transformer-based approaches, self-supervised learning and few-shot learning methods have also been explored to address challenges arising from limited labeled data. These methods enable models to learn generalized feature representations that can be adapted to novel categories or small sample scenarios. For example, Su et al. proposed a surface defect detection method based on few-shot learning [17]. Their model enhances the traditional Faster R-CNN framework by incorporating deformable convolutions into the ResNet101 and FPN backbone, followed by object pyramid construction to enrich feature scales of small samples. Furthermore, the model applies contrastive learning on Region of Interest (RoI) features to improve compactness and reduce misclassification, thereby improving detection performance on rare defect categories.
While these advanced methods exhibit strong detection performance, their high computational costs and complex architectures often limit their applicability in real-time industrial scenarios. Consequently, YOLO-based detectors remain preferable in many practical applications due to their fast inference and relatively low computational complexity. Nevertheless, most existing YOLO-based methods still face challenges in detecting small defects and maintaining robustness under complex backgrounds, which motivates our work in this paper.
YOLO algorithms have become the predominant choice for industrial defect detection due to their ability to deliver real-time detection combined with strong classification accuracy. For instance, recent research applying YOLO to product defect detection has demonstrated that YOLO-based methods can achieve efficient inference while maintaining high detection accuracy, greatly enhancing quality inspection throughput and reliability [18]. Meanwhile, although transformer-based detectors typically achieve superior precision, their higher computational cost and longer inference latency often limit their feasibility in industrial real-time applications. Therefore, striking an effective balance between detection accuracy and computational efficiency remains a critical challenge. Our proposed MBY method builds upon YOLO’s strengths by integrating an efficient backbone and attention mechanisms to improve sensitivity and robustness without sacrificing inference speed.
Shi et al. [19] refine Faster R-CNN by grafting a ConvNeXt backbone and plugging a CBAM block to foreground defect cues while suppressing background clutter; anchor priors are further optimized via k-means clustering. The resulting model attains 80.78% mAP on NEU-DET, surpassing YOLOv5 by 1.5% and vanilla Faster R-CNN by 8.4%, while sustaining 26 FPS inference. Meng et al. [20] surgically replace the SPPF module in YOLOv8n, yielding consistent accuracy and latency gains on NEU-DET, thereby corroborating its efficacy for defect detection. Liu et al. [21] propose EZS-YOLOv10, a zero-shot extension of YOLOv10 tailored for micro-defects. Adaptive multi-scale fusion and a refined anchor generator are coupled with a Region Feature Synthesis network for unseen-class generalization. On COCO, EZS-YOLOv10 registers +1.5% mAP overall and +2.7% on small objects; analogous gains (+2.7% mAP) are observed on VisDrone. Guo et al. [22] embed the Transformer-inspired TRANS module into YOLOv5, yielding MSFT-YOLO. Multi-scale features are harmonized via BiFPN, empowering the detector to penetrate background noise and resolve minute flaws. The model elevates NEU-DET mAP by 7%. Tang et al. [23] present a YOLOv5n variant that interlaces an attention mechanism within the backbone and refines training dynamics via a redesigned loss. Consequently, recall rises by 8.3% and AP by 3.6% on steel-strip data, while training time is slashed by 44%. Jiang et al. [24] craft YOLOv8n-SDEC, substituting SPPF with SPPCSPC for richer multi-scale context, integrating DCNv2 and CARAFE for deformable receptive fields and content-aware upsampling, and adopting EIoU loss for tighter regression. The model records a 3.3% mAP boost on NEU-DET. Zhong et al. [25] augment YOLOv5 with an OSA-C3 module for deeper feature transformation, a BiC-PAN for cross-scale fusion, and Coordinate Attention for precise localization. The enhanced detector reaches 79.1% mAP on NEU-DET (+2.6% vs. baseline) without sacrificing speed. Zhao et al. [26] introduce AMFF, an attention-guided multi-scale fusion plug-in composed of SEAM (self-enhanced dilated-attention) and CEAM (cross-layer attention). When inserted into FPN/PAFPN, AMFF consistently elevates both accuracy and real-time performance across defect detection benchmarks.
To address the aforementioned challenges, we present MBY (MBDNet-Attention-YOLO), a purpose-built steel surface defect detector that synergizes a lightweight dual-backbone architecture with targeted attention mechanisms. Concretely, we first devise MBDNet, an efficient feature-extraction backbone that interleaves HGStem, Dynamic Align Fusion, and C2f-DWR blocks to harvest and refine multi-scale representations while curbing computational overhead. Second, the neck is augmented with the MultiSEAM module, which jointly re-calibrates spatial and channel dependencies across scales, thereby sharpening sensitivity to microscopic flaws embedded in textured backgrounds. Third, we adopt an improved Inner-SIoU loss that tightens the geometric alignment between predicted and ground-truth boxes, accelerating convergence and elevating localization precision. Extensive evaluations on the public NEU-DET and PVEL-AD benchmarks confirm that MBY surpasses contemporary baselines in both accuracy and recall without sacrificing real-time capability.
The principal contributions are summarized below:
  • We introduce MBY, a novel steel-defect detector that fuses the MBDNet backbone with the YOLO framework, attaining state-of-the-art accuracy while preserving inference efficiency.
  • We create the MBDNet architecture, a compact yet powerful backbone that leverages HGStem, Dynamic Align Fusion, and C2f-DWR to enrich multi-scale feature extraction.
  • We integrate the MultiSEAM module within the detection head to fuse cross-scale spatial-channel information, bolstering robustness against cluttered backgrounds and minute defects.
  • We devise Inner-SIoU loss to enhance bounding box regression accuracy and training stability, further improving overall model robustness.

2. Proposed Method

2.1. Backbone Network: MBDNet

In this study, we propose a novel backbone architecture named MBDNet, characterized by a dual-backbone framework designed to improve both feature extraction and integration. The comprehensive structure of the proposed network is depicted in Figure 1. Conventional convolutional neural networks (CNNs) frequently encounter difficulties in efficiently capturing multi-scale features, especially when addressing complex defect recognition challenges, which consequently hinders the comprehensive efficacy of the model [27].
To address this, we design two specialized modules: HGStem for low-level feature extraction and C2f for high-level feature fusion, both significantly improving the network’s representational capacity.
To further boost computational efficiency, we integrate the Depthwise Separable Convolution (DWConv) module [28]. DWConv decomposes standard convolutions into depthwise and pointwise operations, drastically reducing computational cost while preserving the ability to extract rich multi-scale feature information [29]. Additionally, to strengthen feature fusion capabilities, we incorporate the Dynamic Align Fusion module, which aligns and merges feature maps across different scales, thereby enhancing the model’s multi-scale perception [30]. Finally, the Spatial Pyramid Pooling Fast (SPPF) module is integrated to enlarge the receptive field, thereby boosting the network’s ability to detect objects of diverse sizes. To avoid conflicting gradient updates, the dual backbones share a unified loss through the fused features in the Dynamic Align Fusion module. This joint supervision ensures consistent gradient flow to both branches, harmonizing their learning despite focusing on different feature levels [31]. Consequently, the model achieves stable and coordinated optimization across the dual-backbone structure.
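For illustration, a minimal PyTorch sketch of an SPPF-style block, as commonly used in YOLO backbones, is given below; the channel widths and pooling kernel size are illustrative assumptions rather than the exact MBDNet configuration.

```python
import torch
import torch.nn as nn

class SPPF(nn.Module):
    """Spatial Pyramid Pooling - Fast: three stacked 5x5 max-pools whose
    outputs are concatenated, approximating parallel 5/9/13 poolings."""
    def __init__(self, c_in: int, c_out: int, k: int = 5):
        super().__init__()
        c_mid = c_in // 2                       # channel reduction before pooling
        self.cv1 = nn.Conv2d(c_in, c_mid, 1, 1)
        self.cv2 = nn.Conv2d(c_mid * 4, c_out, 1, 1)
        self.pool = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.cv1(x)
        y1 = self.pool(x)
        y2 = self.pool(y1)
        y3 = self.pool(y2)                      # equivalent receptive field of a 13x13 pool
        return self.cv2(torch.cat([x, y1, y2, y3], dim=1))

# quick shape check:
# out = SPPF(256, 256)(torch.randn(1, 256, 20, 20))   # -> (1, 256, 20, 20)
```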
In summary, the core components of MBDNet include HGBlock, DWConv, Dynamic Align Fusion, and SPPF, collectively enabling efficient and robust multi-scale feature extraction and fusion.
In tasks involving small-object detection, standard convolutions, shown in Figure 2a, capture a large amount of redundant detail around small objects, inflating the computational cost. We therefore replace them with Depthwise Separable Convolution (DWConv). The DWConv process, illustrated in Figure 2b, decomposes convolution into two distinct stages. Initially, a depthwise convolution is applied independently to each input channel, producing a reduced feature map, denoted as X a .
Subsequently, a simpler linear operation is conducted: a group convolution is performed channel-wise, instead of the cross-channel convolution used in standard methods, resulting in a complementary (redundant) feature map, denoted as X b .
Finally, these two feature maps, X a and X b , are concatenated along the channel dimension to produce the final output feature map, which can be expressed as
$X_a = X \ast W_1$
$X_b = X \ast W_2$
$Y = X_a \oplus X_b$
where $X$ refers to the feature map at the input layer, $W_1$ denotes the standard convolution kernel and $W_2$ denotes the group convolution kernel; $\ast$ stands for the convolution operation, and $\oplus$ indicates concatenation along the channel dimension [32].
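A minimal PyTorch sketch following the spirit of the equations above is given below; here the cheap channel-wise branch operates on $X_a$ (a GhostConv-style assumption about the exact wiring), and the channel split is illustrative.

```python
import torch
import torch.nn as nn

class DWConvBlock(nn.Module):
    """Two-stage convolution: a primary convolution yields X_a, a cheap
    channel-wise (grouped) convolution yields X_b, and the output is the
    channel-wise concatenation of the two maps."""
    def __init__(self, c_in: int, c_out: int, k: int = 3):
        super().__init__()
        c_half = c_out // 2                                        # assumes c_out is even
        self.primary = nn.Conv2d(c_in, c_half, k, padding=k // 2, bias=False)   # X_a = X * W1
        self.cheap = nn.Conv2d(c_half, c_half, k, padding=k // 2,
                               groups=c_half, bias=False)                        # channel-wise X_b

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        xa = self.primary(x)
        xb = self.cheap(xa)
        return torch.cat([xa, xb], dim=1)                           # Y = [X_a, X_b]
```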

2.2. Dynamic Align Fusion (DAF)

To reduce feature misalignment between backbones, we propose Dynamic Align Fusion (DAF) [33]. Since the feature maps differ in channels, stride, and receptive field, simple concatenation or weighted averaging lowers quality [34]. DAF addresses this by implementing feature alignment, adaptive weighting, and channel-wise modulation, enabling more precise fusion and thereby boosting object detection network performance [35]. The working principle of DAF is depicted in Figure 3. This approach is compatible with various architectures such as YOLOv8, EfficientDet [36], and PP-YOLO [37], enhancing detection accuracy while keeping additional computational costs low.
In the DAF structure [33], 1 × 1 convolutions are first applied to the input features X = { x 1 , x 2 } for channel alignment, ensuring that they can be fused within a common feature space. Since x1 and x2 may originate from different backbones and thus have mismatched channel dimensions, the following adjustments are required:
$\hat{x}_1 = \mathrm{Conv}_{1\times1}(x_1), \quad \hat{x}_2 = \mathrm{Conv}_{1\times1}(x_2)$
Here, $W_1^{(1\times1)}$ and $W_2^{(1\times1)}$ are the $1\times1$ convolution weights used for channel transformation to align $x_1$ and $x_2$ along the channel dimension. This operation balances the information distribution across channels, providing a foundation for subsequent fusion.
After channel alignment, the aligned features x ^ 1 and x ^ 2 are concatenated. To make the fusion process more adaptive, the Dynamic Align Weight (DAW) mechanism is introduced. Specifically, a 3 × 3 convolution is performed on the concatenated features to capture high-level integrated information:
$x_{\mathrm{concat}} = [\hat{x}_1, \hat{x}_2], \quad W_{\mathrm{align}} = \sigma(\mathrm{Conv}_{3\times3}(x_{\mathrm{concat}}))$
where $\mathrm{Conv}_{3\times3}$ denotes a $3\times3$ convolutional layer utilized to derive the fused features from the concatenated input.
$\sigma(\cdot)$ represents the logistic (sigmoid) function, which scales the weights to lie within the range (0, 1), enabling adaptive normalization of features from different sources. Subsequently, the output $W_{\mathrm{align}}$ is partitioned along the channel axis into two separate dynamic weight tensors, $w_1$ and $w_2$:
$w_1, w_2 = \mathrm{split}(W_{\mathrm{align}}, 2)$
Ultimately, the mechanism of DAF dynamically adjusts the role and degree of contribution of each input feature in the overall process through reasonable weight allocation; as indicated in Equation (7), its core purpose is to enable smoother and more coordinated mutual fusion among features generated by different backbone networks, thereby achieving a more optimal synergistic effect.
$\tilde{x}_1 = w_1 \odot \hat{x}_1, \quad \tilde{x}_2 = w_2 \odot \hat{x}_2$
where $\odot$ represents element-wise multiplication; the dynamic weights assign a weighted contribution to each input feature, ensuring that the effective information from different sources is better aligned with the fusion requirements.
However, relying solely on dynamic weights may lead to the excessive suppression of certain feature paths. Therefore, we further introduce Learnable Channel Weights to address this issue. $\lambda_1$ and $\lambda_2$ are used to optimize the ratio of the final fused features, allowing the network to adaptively modify the impact of the two routes according to the input representations:
$y = \mathrm{Conv}_{1\times1}(\lambda_1 \cdot \tilde{x}_1 + \lambda_2 \cdot \tilde{x}_2)$
where $\lambda_1$ and $\lambda_2$ are trainable parameters, initialized to 0.5 and automatically optimized during training so that the model learns the optimal feature fusion ratio. To prevent $\lambda_1$ and $\lambda_2$ from experiencing gradient explosion or numerical instability during training, DAF imposes the following constraints:
$\lambda_i = \mathrm{clip}\left(\lambda_i,\ 1.0 \cdot \mathrm{sign}(\lambda_i)\right), \quad i \in \{1, 2\}$
These constraints keep the channel weights in a stable range, safeguarding model consistency and generalization. Finally, we use a 1 × 1 convolution to adapt the features for downstream detection tasks.
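A minimal PyTorch sketch of DAF following the equations above is given below; the hidden width of the weighting convolution and the clamping range for $\lambda$ are assumptions, not the exact implementation.

```python
import torch
import torch.nn as nn

class DynamicAlignFusion(nn.Module):
    """Sketch of DAF: (1) 1x1 convs align the channel counts of the two backbone
    features; (2) a 3x3 conv + sigmoid on their concatenation yields dynamic
    weights w1, w2; (3) learnable scalars lambda1/lambda2 re-balance the weighted
    features; (4) a final 1x1 conv produces the fused output."""
    def __init__(self, c1: int, c2: int, c_out: int):
        super().__init__()
        self.align1 = nn.Conv2d(c1, c_out, 1)
        self.align2 = nn.Conv2d(c2, c_out, 1)
        self.weight = nn.Conv2d(2 * c_out, 2 * c_out, 3, padding=1)
        self.lam1 = nn.Parameter(torch.tensor(0.5))
        self.lam2 = nn.Parameter(torch.tensor(0.5))
        self.out = nn.Conv2d(c_out, c_out, 1)

    def forward(self, x1: torch.Tensor, x2: torch.Tensor) -> torch.Tensor:
        x1h, x2h = self.align1(x1), self.align2(x2)                 # channel alignment
        w = torch.sigmoid(self.weight(torch.cat([x1h, x2h], dim=1)))
        w1, w2 = torch.chunk(w, 2, dim=1)                           # split into two weight tensors
        lam1 = self.lam1.clamp(-1.0, 1.0)                           # keep channel weights in a stable range
        lam2 = self.lam2.clamp(-1.0, 1.0)
        return self.out(lam1 * (w1 * x1h) + lam2 * (w2 * x2h))
```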

2.3. C2f-DWR Module

Although YOLOv8’s C2f (Cross Stage Partial Feature Fusion) module effectively propagates information across layers and balances accuracy with efficiency [38], the standard instantiation proves inadequate for steel surface anomalies. Edge cracks, pits, scratches, and decarburized regions exhibit extreme scale variance, irregular topology, and blurred boundaries; under such conditions, C2f’s fixed receptive fields struggle to harvest sufficient context, leading to missed detections or spurious alarms. To remedy this limitation, we introduce C2f-DWR, a plug-in replacement that embeds Dilated Residual (DWR) units [39] into the original C2f framework. By systematically enlarging receptive fields with dilated convolutions while preserving residual connectivity, C2f-DWR enriches multi-scale representations and sharpens responses to defects of disparate shapes and sizes without increasing computational overhead.
The structural design of the DWR module is illustrated in Figure 4a. This module adopts a dual-branch structure consisting of Region Residual (RR) and Semantic Residual (SR) components, aimed at enhancing feature extraction and enabling the model to achieve enhanced proficiency in managing defects that exhibit a wide range of dimensions.
Initially, the input features are processed with a 3 × 3 standard convolution. Afterward, batch normalization is applied, followed by an element-wise nonlinear activation, so that the features are normalized and nonlinearly transformed in a single stage. This part belongs to the RR branch, which is primarily responsible for extracting local detail information, enabling the model to better recognize small defects.
Next, the features are passed to the SR branch for further feature extraction. After a 3 × 3 depthwise convolution (DConv), the feature maps pass through a pair of 3 × 3 depthwise convolutions with distinct dilation rates, facilitating multi-scale feature extraction. Specifically, one branch employs a dilation rate of 3 (denoted as D-3), while the other utilizes a dilation rate of 5 (denoted as D-5). This dual-branch approach effectively expands the receptive field, enabling the model to capture a broader range of contextual information.
Finally, the features from both branches are concatenated, normalized with batch normalization (BN), and reduced by a 1 × 1 convolutional layer to limit the computational burden. The resultant feature map is then merged with the initial input features through a residual connection. This process enables the network to preserve the original information, refine the feature representation, and improve detection precision.
As illustrated in Figure 4b, the C2f-DWR unit interleaves dilated convolutions with progressive residual connections, systematically expanding the receptive field without enlarging kernels or deepening the stack. This dilation-centric design trades marginal memory for marked contextual gain, yielding richer multi-scale features at constant FLOPs. Consequently, the detector remains equally sensitive to microscopic scratches and expansive surface damage, enhancing both stability and cross-domain generalization.
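A compact PyTorch sketch of the DWR unit described above is given below; the channel widths, activation choice, and the exact way the RR and SR outputs are concatenated are assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

def conv_bn_relu(c_in, c_out, k=3, d=1, groups=1):
    """3x3 conv -> BatchNorm -> ReLU, with optional dilation / depthwise groups."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, padding=d * (k // 2), dilation=d,
                  groups=groups, bias=False),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )

class DWR(nn.Module):
    """Sketch of the DWR unit: a Region Residual (RR) branch with a plain 3x3
    conv, followed by a Semantic Residual (SR) stage whose depthwise convs use
    dilation rates 3 and 5; branch outputs are concatenated, normalized, reduced
    by a 1x1 conv, and added back to the input via a residual connection."""
    def __init__(self, c: int):
        super().__init__()
        self.rr = conv_bn_relu(c, c)                                  # local detail (RR)
        self.dw = conv_bn_relu(c, c, groups=c)                        # depthwise pre-processing (SR)
        self.sr_d3 = conv_bn_relu(c, c, d=3, groups=c)                # dilation rate 3 (D-3)
        self.sr_d5 = conv_bn_relu(c, c, d=5, groups=c)                # dilation rate 5 (D-5)
        self.reduce = nn.Sequential(nn.BatchNorm2d(3 * c),
                                    nn.Conv2d(3 * c, c, 1, bias=False))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rr = self.rr(x)
        dw = self.dw(rr)
        y = torch.cat([rr, self.sr_d3(dw), self.sr_d5(dw)], dim=1)
        return x + self.reduce(y)                                     # residual linkage
```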

2.4. Detect_MultiSEAM

Steel surface defects—scratches, pits, cracks, oxidation flecks, and peeling patches—span millimeter-scale to sub-pixel sizes and exhibit highly irregular geometries. Their visual signatures are further confounded by rolling textures, specular reflections, and acquisition noise, rendering robust detection non-trivial.
To amplify discriminative cues, as shown in Figure 5, the MultiSEAM module is embedded into the YOLOv8n detection head. MultiSEAM re-calibrates and fuses multi-scale backbone features, capturing both fine-grained textures and global defect morphology within a single unified representation. This enriched embedding markedly suppresses false positives and missed detections across defect taxonomies. Moreover, by consolidating contextual evidence under occlusion and uneven illumination, MultiSEAM endows the detector with heightened resilience and cross-scene generalizability.
To improve classification accuracy and generalization, label smoothing is applied in the classification head. This regularization technique adjusts hard target labels (0 and 1) to softer values (e.g., 0.1 and 0.9), reducing model overconfidence and enhancing robustness against noisy data. In our implementation, a label-smoothing factor of 0.1 is used, which effectively mitigates overfitting and improves detection performance during training.
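A minimal sketch of this target-softening step, assuming a BCE-style classification head, is shown below.

```python
import torch

def smooth_targets(targets: torch.Tensor, eps: float = 0.1) -> torch.Tensor:
    """Map hard labels 1 -> 1 - eps and 0 -> eps (0.9 / 0.1 for eps = 0.1)
    before computing the classification loss."""
    return targets * (1.0 - 2.0 * eps) + eps

# usage with a binary-cross-entropy classification head:
# loss = torch.nn.functional.binary_cross_entropy_with_logits(logits, smooth_targets(hard_targets))
```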
MultiSEAM improves detection accuracy by strengthening feature extraction in unoccluded regions and alleviating the performance degradation caused by object occlusion. Its core component is the Channel and Spatial Mixed Module (CSMM), which processes input features using parallel convolution kernels of varying scales to more fully capture multi-scale information, allowing for more precise defect localization. Inside CSMM, depthwise-separable convolutions are coupled with residual links to cut computation yet retain feature fidelity. Since handling channel information individually might overlook inter-channel correlations, a 1 × 1 convolution is additionally employed to integrate channel features and enhance feature interactions. The processed features then undergo global average pooling to reduce spatial dimensions, minimizing redundant information and boosting feature stability. Subsequently, a dual-layer fully connected network further integrates channel information, thereby reinforcing the interconnections among distinct features. Finally, channel expansion operations integrate multi-scale information effectively, significantly reducing detection errors caused by occlusion or complex background interference.
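The following PyTorch sketch outlines one plausible realization of CSMM and its multi-scale aggregation in MultiSEAM; the kernel set, reduction ratio, and activation functions are assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

class CSMM(nn.Module):
    """Channel-and-Spatial Mixed Module (sketch): a depthwise conv with a
    residual link, a 1x1 conv for cross-channel mixing, global average pooling,
    and a two-layer FC producing per-channel re-weighting factors that are
    expanded back over the spatial map."""
    def __init__(self, c: int, k: int = 3, reduction: int = 4):
        super().__init__()
        self.dw = nn.Sequential(
            nn.Conv2d(c, c, k, padding=k // 2, groups=c, bias=False),
            nn.BatchNorm2d(c), nn.GELU())
        self.pw = nn.Sequential(
            nn.Conv2d(c, c, 1, bias=False), nn.BatchNorm2d(c), nn.GELU())
        self.fc = nn.Sequential(
            nn.Linear(c, c // reduction), nn.ReLU(inplace=True),
            nn.Linear(c // reduction, c), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.pw(x + self.dw(x))                 # spatial mixing with residual, then channel mixing
        w = self.fc(y.mean(dim=(2, 3)))             # global average pooling + two-layer FC
        return y * w[:, :, None, None]              # channel expansion / re-weighting

class MultiSEAM(nn.Module):
    """Run CSMM with several depthwise kernel sizes in parallel and average the
    re-calibrated maps to fuse multi-scale attention (aggregation is assumed)."""
    def __init__(self, c: int, kernels=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(CSMM(c, k) for k in kernels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.stack([b(x) for b in self.branches], dim=0).mean(dim=0)
```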

2.5. Inner-SIoU

Steel surface defects are minute and morphologically diverse, often yielding indistinct boundaries that undermine detection accuracy. The canonical Complete IoU (CIoU) loss frequently introduces false positives and negatives during box regression, degrading precision and hindering convergence. To remedy this, we adopt Inner-SIoU (Inner Shape Intersection over Union), a ratio-scaled extension of SIoU that adjusts an auxiliary bounding box to stabilize gradient flow and sharpen directional supervision.
CIoU’s supervision is limited to overlap, center distance, and aspect ratio. By contrast, Inner-SIoU enhances directional awareness, leading to tighter defect boundaries on complex steel surfaces and significantly faster convergence.
Inner-SIoU dynamically adjusts predicted bounding boxes based on the shape characteristics of the internal overlap with ground-truth boxes to improve alignment precision. When the predicted box is larger than the ground-truth box, Inner-SIoU identifies and reduces the excess regions by selectively adjusting width or height to better fit the ground-truth boundary, avoiding distortions caused by uniform scaling.
Conversely, if the predicted bounding box is undersized, it expands the box’s dimensions and adjusts its angle toward the uncovered ground-truth regions, ensuring accurate filling without over-extension or under-extension. This shape-aware adjustment enables more precise localization compared to traditional IoU metrics.
The Inner-SIoU loss modulates the auxiliary bounding box size via a scale factor, making it adaptable to various datasets and detectors and thereby addressing limitations in the detection accuracy of existing methods. Specifically, as illustrated in Figure 6, the target box (TB) and the anchor box are denoted as $b^{gt}$ and $b$ and represented as Box 1 and Box 2, respectively. The central coordinates of Box 1 are $(x_c^{gt}, y_c^{gt})$ and those of Box 2 are $(x_c, y_c)$; their respective widths and heights are $w^{gt}, h^{gt}$ and $w, h$. The scale factor $r_{ratio}$ generally varies between 0.5 and 1.5. When $r_{ratio}$ is less than 1, the auxiliary box is smaller than the ground-truth box, resulting in a larger IoU gradient magnitude for the auxiliary box compared to the original target box; this accelerates convergence for samples with high IoU values. In contrast, when $r_{ratio}$ is greater than 1, the auxiliary box is larger than the ground-truth box, which broadens the regression range and facilitates learning for samples with low IoU values.
The ratio for Inner IoU is defined as follows:
$b_l^{gt} = x_c^{gt} - \frac{w^{gt} \times r_{ratio}}{2}, \quad b_r^{gt} = x_c^{gt} + \frac{w^{gt} \times r_{ratio}}{2}$
where $b_l^{gt}$ signifies the abscissa of the auxiliary ground-truth box’s left edge, and $b_r^{gt}$ indicates the abscissa of its right edge.
$b_t^{gt} = y_c^{gt} - \frac{h^{gt} \times r_{ratio}}{2}, \quad b_b^{gt} = y_c^{gt} + \frac{h^{gt} \times r_{ratio}}{2}$
where $b_t^{gt}$ signifies the vertical coordinate of the top edge of the auxiliary ground-truth box, while $b_b^{gt}$ indicates the vertical coordinate of its bottom edge.
$b_l = x_c - \frac{w \times r_{ratio}}{2}, \quad b_r = x_c + \frac{w \times r_{ratio}}{2}$
$b_t = y_c - \frac{h \times r_{ratio}}{2}, \quad b_b = y_c + \frac{h \times r_{ratio}}{2}$
where $b_l$ and $b_r$ specify the horizontal positions of the left and right edges of the auxiliary anchor box, while $b_t$ and $b_b$ denote the vertical positions of its top and bottom edges.
$\mathrm{inter} = \left(\min(b_r^{gt}, b_r) - \max(b_l^{gt}, b_l)\right) \times \left(\min(b_b^{gt}, b_b) - \max(b_t^{gt}, b_t)\right)$
$\mathrm{union} = w^{gt} \times h^{gt} \times r_{ratio}^2 + w \times h \times r_{ratio}^2 - \mathrm{inter}$
$\mathrm{IoU}_{inner} = \frac{\mathrm{inter}}{\mathrm{union}}$
where $\mathrm{IoU}_{inner}$ corresponds to the Inner-IoU metric; $\mathrm{inter}$ indicates the area of overlap between the auxiliary anchor box and the auxiliary ground-truth box; $\mathrm{union}$ denotes the aggregate area encompassed by both boxes.
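The Inner-IoU computation above can be summarized in the following PyTorch sketch; boxes are assumed to be given in center-size form, and clamping the intersection to non-negative values is an implementation assumption.

```python
import torch

def inner_iou(box_p: torch.Tensor, box_gt: torch.Tensor, ratio: float = 0.75) -> torch.Tensor:
    """Inner-IoU: shrink (ratio < 1) or enlarge (ratio > 1) both boxes around
    their centres, then compute the IoU of the auxiliary boxes.
    Boxes are (xc, yc, w, h) tensors of shape (N, 4)."""
    xc, yc, w, h = box_p.unbind(-1)
    xcg, ycg, wg, hg = box_gt.unbind(-1)
    # auxiliary box edges
    l,  r  = xc  - w  * ratio / 2, xc  + w  * ratio / 2
    t,  b  = yc  - h  * ratio / 2, yc  + h  * ratio / 2
    lg, rg = xcg - wg * ratio / 2, xcg + wg * ratio / 2
    tg, bg = ycg - hg * ratio / 2, ycg + hg * ratio / 2
    inter = (torch.min(r, rg) - torch.max(l, lg)).clamp(min=0) * \
            (torch.min(b, bg) - torch.max(t, tg)).clamp(min=0)
    union = w * h * ratio**2 + wg * hg * ratio**2 - inter
    return inter / union.clamp(min=1e-7)
```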
Compared with CIoU, SIoU adds shape, center-distance, and angle terms to the basic IoU, yielding tighter bounding box alignment. The loss function is formulated as:
$L_{SIoU} = 1 - \mathrm{IoU} + \frac{\Delta + \Omega}{2}$
Here, the distance cost $\Delta$, which embeds the angle cost, is calculated as follows:
$\Delta = \sum_{t=x,y} \left(1 - e^{-\gamma \rho_t}\right), \quad \gamma = 2 - \Lambda$
$\rho_x = \left(\frac{b_x - b_x^{gt}}{c_w}\right)^2, \quad \rho_y = \left(\frac{b_y - b_y^{gt}}{c_h}\right)^2$
where $c_w$ and $c_h$ denote the width and height of the smallest enclosing box of the two boxes.
The angle Λ is calculated by Equation (20):
$\Lambda = \sin\left(2 \sin^{-1}\left(\frac{\min\left(\left|x_c^{gt} - x_c\right|,\ \left|y_c^{gt} - y_c\right|\right)}{\sqrt{\left(x_c^{gt} - x_c\right)^2 + \left(y_c^{gt} - y_c\right)^2}}\right)\right)$
The computation of the shape loss Ω is articulated subsequently:
$\Omega = \sum_{t=w,h} \left(1 - e^{-\omega_t}\right), \quad \omega_w = \frac{\left|w - w^{gt}\right|}{\max\left(w, w^{gt}\right)}, \quad \omega_h = \frac{\left|h - h^{gt}\right|}{\max\left(h, h^{gt}\right)}$
In summary, the Inner-SIoU loss amalgamates the SIoU loss and the Inner-IoU loss to refine bounding box regression within object detection frameworks. The Inner-SIoU loss function is formulated thusly:
$L_{Inner\text{-}SIoU} = L_{SIoU} + \mathrm{IoU} - \mathrm{IoU}_{inner}$
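For completeness, a sketch of the SIoU cost terms and their combination with the Inner-IoU term (reusing the inner_iou function from the previous sketch) is given below; the enclosing-box normalization follows the formulas above, while the remaining implementation details are assumptions.

```python
import torch

def siou_components(box_p: torch.Tensor, box_gt: torch.Tensor, eps: float = 1e-7):
    """Distance (with embedded angle) and shape costs of SIoU.
    Boxes are (xc, yc, w, h) tensors of shape (N, 4)."""
    xc, yc, w, h = box_p.unbind(-1)
    xcg, ycg, wg, hg = box_gt.unbind(-1)
    dx, dy = (xcg - xc).abs(), (ycg - yc).abs()
    # angle cost Lambda and gamma = 2 - Lambda
    lam = torch.sin(2 * torch.asin(torch.min(dx, dy) / (dx**2 + dy**2).sqrt().clamp(min=eps)))
    gamma = 2 - lam
    # distance cost over the smallest enclosing box (c_w, c_h)
    cw = torch.max(xc + w / 2, xcg + wg / 2) - torch.min(xc - w / 2, xcg - wg / 2)
    ch = torch.max(yc + h / 2, ycg + hg / 2) - torch.min(yc - h / 2, ycg - hg / 2)
    delta = (1 - torch.exp(-gamma * ((xcg - xc) / cw.clamp(min=eps)) ** 2)) + \
            (1 - torch.exp(-gamma * ((ycg - yc) / ch.clamp(min=eps)) ** 2))
    # shape cost
    omega = (1 - torch.exp(-(w - wg).abs() / torch.max(w, wg).clamp(min=eps))) + \
            (1 - torch.exp(-(h - hg).abs() / torch.max(h, hg).clamp(min=eps)))
    return delta, omega

def inner_siou_loss(box_p, box_gt, iou, ratio: float = 0.75):
    """L_Inner-SIoU = L_SIoU + IoU - IoU_inner, with L_SIoU = 1 - IoU + (Delta + Omega) / 2.
    `iou` is the ordinary IoU of the two boxes, pre-computed by the detector."""
    delta, omega = siou_components(box_p, box_gt)
    l_siou = 1 - iou + (delta + omega) / 2
    return l_siou + iou - inner_iou(box_p, box_gt, ratio)
```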

2.6. Object Detection Network MBDNet-Attention-YOLO (MBY)

YOLOv8 begins by resizing every input image to a canonical resolution, after which a CNN backbone extracts multi-scale features. These features are fed into the detection head, which predicts bounding boxes, class scores, and objectness across three spatial resolutions; post-processing via NMS subsequently yields the final detection results. In this research, the MBY detection algorithm is proposed (Figure 7); the model is primarily composed of three essential parts: the backbone, the neck, and the head.
The backbone is entrusted with extracting critical image features and comprises several modules, including the HGStem module, C2f-DWR module, Dynamic Align Fusion (DAF) feature fusion module, DWConv convolution module, HGBlock module, and SPPF module. Collectively, these layers extract rich hierarchical features and strengthen the network’s representation.
The neck primarily integrates and refines feature information. It derives motivation from the Feature Pyramid Network (FPN) [40] and Path Aggregation Network (PANet) [41], enabling the integration of feature representations at multiple scales. This design markedly improves small-object recall and detection accuracy across scales.
The detection head is tasked with generating the ultimate detection results. Utilizing a decoupled detection head architecture, the model optimizes classification and regression tasks independently. Moreover, to boost small-object detection performance, an additional dedicated small-object detection layer is incorporated in the detection head, enabling MBY to reliably detect minute targets even in complex backgrounds.

3. Experiments

3.1. Data Description

The ongoing research utilizes the NEU-DET dataset provided by Northeastern University to evaluate the efficacy of the proposed network. The dataset encompasses six categories of steel surface defects: crazing (Cr), inclusions (In), patches (Pa), pitted surfaces (Ps), rolled-in scale (Rs), and scratches (Sc). Each category consists of 300 images, with each image exhibiting at least one defect. To further evaluate the generalization capability of the proposed framework, experiments are conducted on the PVEL-AD dataset. Table 1 summarizes the number of images in each category together with the training and validation splits. These datasets support the subsequent experimental investigation and result analysis. Figure 8 illustrates the allocation of defect types in both datasets.

3.2. Hyperparameter Settings

To verify the efficacy of the introduced MBY algorithm, all model architectures were realized using the PyTorch 1.8 framework on an NVIDIA GeForce RTX 3090 GPU. The hyperparameter configurations for the models are detailed in Table 2.
To evaluate the sustainability of our models, we measured the energy consumption per 1000 images processed using the NVIDIA System Management Interface (nvidia-smi) tool on the same hardware setup. The results indicated that the NEU-DET model consumed 9.72 Wh, while the PVEL-AD model consumed 9.74 Wh per 1000 images. These findings highlight that our models achieve high performance while maintaining excellent energy efficiency, which is crucial for sustainable computing.
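The exact measurement procedure is not detailed here; one possible way to script such a measurement with nvidia-smi is sketched below (the power-draw query flags are standard, while the sampling interval and integration scheme are assumptions).

```python
import subprocess
import time

def sample_gpu_power_watts() -> float:
    """Read the instantaneous GPU power draw (W) via nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=power.draw", "--format=csv,noheader,nounits"])
    return float(out.decode().strip().splitlines()[0])

def energy_wh(duration_s: float, interval_s: float = 1.0) -> float:
    """Integrate sampled power over an inference run and convert joules to Wh."""
    joules, t0 = 0.0, time.time()
    while time.time() - t0 < duration_s:
        joules += sample_gpu_power_watts() * interval_s
        time.sleep(interval_s)
    return joules / 3600.0
```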

3.3. Evaluation Metrics

In object detection, Precision (P) and Recall (R) are the standard evaluation metrics. They are defined as follows:
$\mathrm{Precision} = \frac{TP}{TP + FP}$
$\mathrm{Recall} = \frac{TP}{TP + FN}$
True positives (TPs) denote the number of correctly detected instances, false positives (FPs) denote the number of erroneously detected instances, and false negatives (FNs) denote the number of ground-truth objects that the model fails to detect. To rigorously assess per-class performance, we adopt Average Precision (AP), computed as the area under the Precision–Recall (PR) curve, where Recall spans the horizontal axis and Precision the vertical. The mean Average Precision (mAP) is obtained by averaging the AP values over all categories. The equations are as follows:
$AP = \int_0^1 P(R)\, dR$
$mAP = \frac{1}{K} \sum_{i=1}^{K} AP_i$
where $K$ signifies the total number of object classes within the dataset. The mAP reflects the model’s aggregate detection accuracy across all categories; a higher mAP value indicates better overall detection efficacy.
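A minimal sketch of how AP and mAP can be computed from a sampled precision–recall curve is given below; the precision-envelope and trapezoidal-integration choices are common practice rather than the exact evaluation code used here.

```python
import numpy as np

def average_precision(recall: np.ndarray, precision: np.ndarray) -> float:
    """Area under the precision-recall curve, computed by trapezoidal
    integration over a monotonically non-increasing precision envelope."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([1.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]        # precision envelope
    return float(np.trapz(p, r))

def mean_average_precision(ap_per_class) -> float:
    """mAP = mean of per-class AP values over the K classes."""
    return float(np.mean(ap_per_class))
```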

3.4. An Analytical Comparison of Backbone Networks

To empirically verify the effectiveness of the proposed MBDNet backbone, we conduct comparative experiments on the NEU-DET and PVEL-AD datasets, evaluating it against several representative backbone architectures. Table 3 summarizes the performance of YOLOv8, HGNetV2 [42], MobileNetV4 [43], and EfficientFormerV2 [44], and the performance of our proposed MBDNet is assessed with respect to mAP@0.5, mAP@0.5–0.95, recall, and computational complexity (GFLOPs). The findings indicate that MBDNet attains an optimal trade-off between detection performance and computational cost on both datasets.
On the NEU-DET dataset, HGNetV2 integrates Graph Convolutional Networks to enlarge the receptive field and enhance global context aggregation. This refinement elevates recall to 75.1%, an absolute gain of 2.3 percentage points over the baseline, while the overall computational load remains modest at 6.9 GFLOPs. However, its feature representation remains limited, yielding an mAP@0.5–0.95 of only 52.0%, indicating room for improvement in overall detection accuracy. MobileNetV4 leverages depthwise separable convolutions to optimize computational efficiency at 22.5 GFLOPs, but achieves an mAP@0.5–0.95 of merely 53.6%. These results suggest that both HGNetV2 and MobileNetV4 exhibit constrained feature extraction abilities in complex environments, struggling to balance efficiency and accuracy. EfficientFormerV2 combines TransformerV2 and CNN architectures, effectively balancing efficiency and accuracy; however, its recall drops slightly, by 0.3%, compared to the baseline, revealing deficiencies in cross-scale feature extraction and local information capture. Replacing the original backbone with MBDNet yields more substantial gains: mAP@0.5–0.95 rises by 2.7% and recall reaches 77.3%, at the cost of 11.5 GFLOPs. Thus, MBDNet delivers higher precision while keeping the computational overhead moderate, offering a favorable balance between accuracy and efficiency.
On the PVEL-AD dataset, performance disparities among backbones become more pronounced. HGNetV2 attains a respectable recall of 70.8% but an mAP@0.5–0.95 of only 47.6%. MobileNetV4, despite its lightweight design, suffers a steep decline to 42.7% mAP@0.5–0.95 and reduced recall, indicating poor adaptability to the complex defect classes in this dataset. EfficientFormerV2 achieves a modest improvement with an mAP@0.5–0.95 of 51.0%, but overall performance remains limited. In direct comparison, MBDNet attains 75.9% mAP@0.5, a clear margin over YOLOv8’s 64.6%, while keeping the computational load at a comparable 8.1 GFLOPs and further improving both recall and overall detection accuracy.
In summary, the proposed MBDNet backbone consistently delivers excellent detection performance with lower computational costs across both datasets, demonstrating its strong potential for defect detection tasks.

3.5. An In-Depth Comparative Investigation of Loss Functions

To mitigate the imprecise regression and slow convergence caused by the standard CIoU loss, we adopt Inner-SIoU for bounding box localization. Results comparing different loss functions are provided in Table 4.
On the NEU-DET dataset, the model using the Inner-SIoU loss achieved an mAP@0.5 of 82.5%, outperforming CIoU [45], WIoU [46], Inner-IoU [47], and EIoU [48] by 2.3%, 2.5%, 1.3%, and 1.1%, respectively. For the mAP@0.5–0.95 metric, it improved by 1.0% compared to CIoU, reaching 51.5%. Recall also rose to 73.1%, showing that Inner-SIoU cuts missed detections. Tests on PVEL-AD confirm its robustness: mAP@0.5 reaches 71.6%, ahead of CIoU, WIoU, Inner-IoU, and EIoU, while mAP@0.5–0.95 is 45.6% and recall is 66.7%. Although computational costs among these loss functions are comparable, the consistent performance gains of Inner-SIoU make it more suitable for defect detection scenarios demanding high precision [49].
Figure 9 illustrates the performance trends of different loss functions during training. Inner-SIoU demonstrates faster convergence on both datasets, with more stable loss reduction during training and validation phases. The trajectories of recall and mean Average Precision (mAP) metrics, including mAP@0.5 and mAP@0.5–0.95, manifest a progressive elevation, culminating in markedly enhanced performance levels. In summary, Inner-SIoU delivers consistent gains in both detection accuracy and recall across NEU-DET and PVEL-AD, confirming its practical value for steel surface defect detection [31].

3.6. Ablation Experiments

To systematically validate the contribution of every introduced module, we carried out extensive ablation studies on both the NEU-DET and PVEL-AD datasets, examining the incremental impact of each enhancement on overall detection performance. Table 5 and Table 6 summarize the results under various experimental settings, where MB-DAF, DWR, MultiSEAM, and Inner-SIoU represent the four key modules. A checkmark (“√”) signifies the incorporation of the module, whereas a cross (“×”) indicates its omission.
In the initial experiment, the baseline YOLOv8n model attained an mAP@0.5 of 80.2%, an mAP@0.5–0.95 of 50.5%, and a recall (R) of 72.8% on the NEU-DET dataset. In the subsequent experiment, the MB-DAF module was incorporated, thereby augmenting the model’s capacity to adapt to intricate steel surface defects. This change increased mAP@0.5 to 83%, confirming that the MB-DAF module strengthens detection across scales and suppresses background noise [49].
The third experiment integrated the DWR module, designed to improve the model’s multi-scale feature representation and small-object detection accuracy. This modification increased mAP@0.5 to 81.3%. The fourth experiment incorporated the MultiSEAM module, which strengthens multi-scale feature extraction and improves detection under complex backgrounds, elevating mAP@0.5 to 82.4%, mAP@0.5–0.95 to 51.3%, and recall as well [31].
In the fifth experiment, we replaced the default loss with Inner-SIoU, which leverages an internal bounding box to tighten the alignment between predicted and ground-truth boxes. This improved mAP@0.5 to 82.5% and mAP@0.5–0.95 to 51.5%. In experiment six, MultiSEAM and Inner-SIoU were combined to exploit their complementary strengths: MultiSEAM’s enhanced multi-scale feature extraction and Inner-SIoU’s refined bounding box regression. This synergy raised mAP@0.5 to 84.5%, mAP@0.5–0.95 to 54.0%, and recall to 79.3%, confirming the significant contribution of these modules in steel surface defect detection [50].
In the seventh experiment, MB-DAF, DWR, and MultiSEAM were jointly applied to further improve detection robustness in complex environments. The results indicated that the mAP@0.5 and mAP@0.5–0.95 values ascended to 85.8% and 56.2%, with the recall attaining a level of 79.7%. This validates that the integration of multiple optimization strategies can simultaneously enhance detection accuracy and model generalization [51].
To clarify the design philosophy behind the overall architecture, it is important to emphasize that these modules were systematically designed with complementary functionalities. MB-DAF improves shallow feature adaptation, DWR strengthens deeper-level feature fusion, MultiSEAM bridges scale gaps through spatial-channel synergy, and Inner-SIoU refines localization precision. Their cumulative gains are visible in the progressive performance increase across experiments.
Furthermore, the proposed method leverages a dual-backbone structure, which decouples low-level and high-level feature learning. One branch emphasizes fine-grained texture details critical for detecting small defects, while the other learns high-level semantic context useful for discriminating defect vs. background. Compared to single-backbone designs with enhanced multi-scale fusion, this separation allows more targeted optimization at each feature level, which contributes significantly to performance gains, especially in complex industrial scenarios.
This architecture design avoids unnecessary redundancy while preserving detection efficiency, as evidenced by competitive inference speeds and ablation improvements. The results demonstrate that each component meaningfully contributes to the final performance rather than introducing overfitting or excessive complexity.
Furthermore, experiments on the PVEL-AD dataset exhibited consistent significant improvements. Compared to the baseline YOLOv8n, the optimized MBY structure achieved an 11.3% increase in mAP@0.5, demonstrating the broad applicability and strong generalization potential of the proposed enhancements across diverse datasets.
Overall, the ablation study highlights the advantages of each improvement and further confirms their effectiveness.
It is worth noting that although our method achieves strong performance in terms of mAP@0.5, the mAP@0.5–0.95 metric remains relatively low on the PVEL-AD dataset. This phenomenon is commonly observed in small-object detection tasks, especially in industrial defect detection settings. The PVEL-AD dataset contains a large proportion of tiny, irregularly shaped defects that are extremely difficult to localize precisely under high IoU thresholds.
While mAP@0.5–0.95 provides a comprehensive evaluation of localization accuracy, its stringent requirements can overly penalize detection results on such challenging datasets. In practical industrial applications, detecting the presence of defects is often prioritized over precise pixel-level localization. Therefore, mAP@0.5 is widely regarded as a more suitable metric for evaluating detection performance in this context.
Our method demonstrates consistent improvements in mAP@0.5 across datasets, which indicates its effectiveness in practical defect detection scenarios. This trade-off between detection sensitivity and localization precision reflects the inherent challenges of the task rather than a flaw in the model design. These patterns are clearly illustrated in the precision-recall curves: Figure 10 shows the comparison between YOLOv8 and our proposed model on the NEU-DET dataset, while Figure 11 presents the corresponding results on the PVEL-AD dataset.

4. Comparison of Different Object Detection Models

To assess the efficacy of the proposed MBY model in steel surface defect detection tasks, empirical trials were executed utilizing two publicly available datasets: NEU-DET and PVEL-AD. The selected benchmark models include YOLOv5 [52], YOLOv6 [53], YOLOv8, YOLOv10, YOLOv11 [54], as well as other advanced detection models such as CAY [55], YOLOv7-BA [56], EFd-YOLOv4 [57], MS-YOLOv5s [58], PD-DETR [59], SSA-YOLO [60], and EDTNet [61]. All experiments were performed under consistent settings, with the specific detection metrics shown in Table 7 and Table 8.
In the initial experimental series, the YOLOv5 model attained mAP@0.5 of 84.5% and mAP@0.5–0.95 of 53.6% on the NEU-DET dataset. The second set of experiments introduced YOLOv6, which performed similarly to YOLOv5, with mAP@0.5 of 84.4% and a marginal enhancement in mAP@0.5–0.95. YOLOv8 and YOLOv10 demonstrated certain advantages in detection speed, but their overall accuracy did not meet expectations. YOLOv11, on the other hand, showed weaker performance, with mAP@0.5 of only 77.3%, indicating its subpar performance when dealing with complex textures.
In the sixth set of experiments, the proposed MBY model achieved superior performance compared to all other models with respect to detection correctness. On the NEU-DET dataset, the mAP@0.5 value ascended to 85.8%, while the mAP@0.5–0.95 metric climbed to 56.2% and recall reached 79.7%, with significant improvements across all metrics. This indicates that MBY excels in feature extraction, target localization, and small-defect recognition. On the PVEL-AD dataset, the MBY model also demonstrated excellent performance, with mAP@0.5 reaching 75.9%, outperforming models such as YOLOv8 and YOLOv10. Despite the improved detection accuracy, MBY’s computational cost remains at 8.1 G, comparable to YOLOv8, and significantly lower than YOLOv6’s 11.8 G, reflecting the model’s balanced performance between accuracy and efficiency.
Figure 12 and Figure 13 show the detection visualization results of MBY, YOLOv8, and YOLOv11 on two datasets. MBY demonstrates more stable performance in detecting small objects such as cracks, scratches, and pits, with more accurate bounding box localization and a significant reduction in false positives and false negatives.
It is worth emphasizing that the proposed model enhances detection performance and, concurrently, mitigates computational overhead to a notable extent. Overall, the empirical investigations within this segment validate the potency of the propounded enhancements and clearly highlight their advantages across different dimensions.

5. Conclusions

This study proposes MBY (MBDNet-attention-YOLO), a steel surface defect detector that first introduces the carefully engineered MBDNet backbone to markedly improve multi-resolution feature encoding and representation of defects. At the same time, the integration of the designed MultiSEAM module effectively improves feature adaptability across different scales and complex background conditions, particularly showing higher accuracy in handling small defects.
Next, to tackle the problem of inadequate bounding box regression precision, we introduce the Inner-SIoU loss function, optimizing the correspondence between predicted boxes and reference boxes, accelerating the convergence speed during training, and improving detection stability. Finally, we conducted comprehensive evaluations on the public NEU-DET and PVEL-AD datasets; experimental results show that the MBY algorithm achieved an mAP@0.5 of 85.8% on NEU-DET and 75.9% on PVEL-AD. Compared to mainstream single-stage detection algorithms such as YOLOv5, YOLOv6, and YOLOv8, MBY outperforms them on multiple key metrics, demonstrating its strong adaptability and generalization capability under complex texture interference and in small-object recognition.
To further propel this research, future endeavors will focus on the following three aspects: (1) refining the backbone network architecture to magnify the model’s capacity for representing multi-scale defect features; (2) combining different feature fusion strategies and attention mechanisms to further strengthen the model’s robustness in complex backgrounds; and (3) exploring the application of the MBY algorithm to other types of surface defect detection tasks.

Author Contributions

X.W., S.M., S.W., Z.L., J.C. and P.X.: Conceptualization, Methodology, Software, Formal analysis, Writing—original draft preparation, and Visualization; P.X.: Supervision and Funding acquisition. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Natural Science Foundation of Shanghai (grant number: 20ZR1422700) and the Class III Peak Discipline of Shanghai—Materials Science and Engineering (High–Energy Beam Intelligent Processing and Green Manufacturing).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare that there are no conflicts of interest concerning the publication of this article.

Figure 1. (a) MBDNet: overall architecture of the dual-backbone network, including the inputs, processing stages, and fusion module of the two backbones. (b) SPPF: internal structure of the SPPF module, including multi-scale pooling and feature concatenation. (c) HGBlock: structure of the HGBlock module, including convolutional layers, lightweight convolution, and residual connections. (d) HGStem: structure of the HGStem module, describing its multi-stream feature processing and fusion mechanism.
Figure 2. Schematic diagrams of different convolution operations: (a) Standard convolution (Conv); (b) Depth-wise convolution (DWConv).
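To quantify the gap sketched in Figure 2, the following short example counts parameters for a standard convolution and for the common depth-wise plus point-wise replacement at one representative layer size; the channel and kernel settings are illustrative only.

```python
import torch.nn as nn

c_in, c_out, k = 64, 128, 3

standard = nn.Conv2d(c_in, c_out, k, padding=1)        # full C_in x C_out coupling
depthwise = nn.Sequential(
    nn.Conv2d(c_in, c_in, k, padding=1, groups=c_in),  # one k x k filter per channel
    nn.Conv2d(c_in, c_out, 1),                         # 1x1 point-wise channel mixing
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(standard), count(depthwise))  # roughly 73.9k vs 9.0k parameters
```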
Figure 3. Network architecture of the Dynamic Align Fusion (DAF) mechanism.
Figure 4. DWR module and C2f-DWR architecture diagram.
Figure 5. Architecture diagram of the MultiSEAM module.
Figure 6. Network architecture diagram of SIoU.
Figure 7. Architecture diagram of the MBY algorithm.
Figure 8. Category distribution in the two datasets.
Figure 9. Loss function performance on NEU-DET and PVEL-AD.
Figure 10. P–R curve of YOLOv8 and the proposed model on the NEU-DET dataset.
Figure 11. P–R curve of YOLOv8 and the proposed model on the PVEL-AD dataset.
Figure 12. Comparative visualization of networks on the NEU-DET dataset.
Figure 13. Comparison of visualization results of different networks on the PVEL-AD dataset.
Table 1. Statistics of dataset categories and image counts.

Name    | Number of Categories | Training Set | Validation Set | Test Set
NEU-DET | 6                    | 1260         | 360            | 180
PVEL-AD | 12                   | 3600         | 900            | 450
Table 2. Hyperparameter settings.

Parameter                               | NEU-DET   | PVEL-AD
Optimizer                               | SGD       | SGD
Learning rate                           | 0.01      | 0.01
Weight decay factor                     | 0.0005    | 0.0005
Batch size                              | 32        | 32
Epochs                                  | 300       | 300
Image size                              | 640 × 640 | 640 × 640
Energy consumption per 1000 images (Wh) | 9.72      | 9.74
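The settings in Table 2 correspond to a standard Ultralytics-style training call, sketched below for orientation; the dataset YAML path is a placeholder, and a stock YOLOv8n model is used here because MBY's MBDNet, MultiSEAM, C2f-DWR, and Inner-SIoU components require a customized model definition and loss that are not part of the stock library.

```python
from ultralytics import YOLO  # stock YOLOv8 API; MBY's custom modules are not included

# Placeholder dataset config; the MBDNet/MultiSEAM/Inner-SIoU modifications
# would require a customized model YAML and loss, not shown here.
model = YOLO("yolov8n.yaml")
model.train(
    data="NEU-DET.yaml",   # hypothetical dataset YAML
    epochs=300,
    imgsz=640,
    batch=32,
    optimizer="SGD",
    lr0=0.01,
    weight_decay=0.0005,
)
```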
Table 3. Performance on NEU-DET and PVEL-AD with different backbones.

Dataset | Method            | mAP@0.5 | mAP@0.5–0.95 | Recall | GFLOPs
NEU-DET | YOLOv8            | 80.2    | 50.5         | 72.8   | 8.1
        | HGNetV2           | 81.3    | 52.0         | 75.1   | 6.9
        | MobileNetV4       | 82.9    | 53.6         | 77.1   | 22.5
        | EfficientFormerV2 | 81.6    | 51.0         | 72.5   | 9.4
        | MBDNet            | 83.0    | 53.2         | 77.3   | 11.5
PVEL-AD | YOLOv8            | 64.6    | 44.1         | 60.2   | 8.1
        | HGNetV2           | 69.1    | 47.6         | 70.8   | 6.9
        | MobileNetV4       | 70.7    | 42.7         | 65.0   | 22.6
        | EfficientFormerV2 | 67.6    | 51.0         | 72.5   | 9.4
        | MBDNet            | 75.9    | 37.4         | 58.8   | 8.1
Table 4. Comparison of loss functions on the NEU-DET and PVEL-AD datasets.

Dataset | Method     | mAP@0.5 | mAP@0.5–0.95 | Recall | GFLOPs
NEU-DET | CIoU       | 80.2    | 50.5         | 72.8   | 8.1
        | WIoU       | 80.0    | 50.1         | 72.5   | 8.2
        | Inner-IoU  | 81.2    | 50.3         | 72.3   | 8.2
        | EIoU       | 81.4    | 50.6         | 72.8   | 8.2
        | Inner-SIoU | 82.5    | 51.5         | 73.1   | 8.1
PVEL-AD | CIoU       | 64.6    | 44.1         | 60.2   | 8.1
        | WIoU       | 70.8    | 50.3         | 69.0   | 8.1
        | Inner-IoU  | 68.5    | 43.4         | 73.4   | 8.1
        | EIoU       | 66.9    | 42.3         | 57.4   | 8.1
        | Inner-SIoU | 71.6    | 45.6         | 66.7   | 8.1
Table 5. Ablation experiments of MBY on the NEU-DET dataset.

Experiment | MB-DAF | DWR | MultiSEAM | Inner-SIoU | mAP@0.5 | mAP@0.5–0.95 | Recall | GFLOPs
1          | ×      | ×   | ×         | ×          | 80.2    | 50.5         | 72.8   | 8.1
2          | ✓      | ×   | ×         | ×          | 83.0    | 53.2         | 77.3   | 11.5
3          | ×      | ✓   | ×         | ×          | 81.3    | 50.8         | 71.2   | 8.1
4          | ×      | ×   | ✓         | ×          | 82.4    | 51.3         | 73.5   | 7.4
5          | ×      | ×   | ×         | ✓          | 82.5    | 51.5         | 73.1   | 8.1
6          |        |     |           |            | 84.5    | 54.0         | 79.3   | 10.5
7          | ✓      | ✓   | ✓         | ✓          | 85.8    | 56.2         | 79.7   | 10.5
Table 6. Ablation experiments of MBY on the PVEL-AD dataset.

Experiment | MB-DAF | DWR | MultiSEAM | Inner-SIoU | mAP@0.5 | mAP@0.5–0.95 | Recall | GFLOPs
1          | ×      | ×   | ×         | ×          | 64.6    | 44.1         | 60.2   | 8.1
2          | ✓      | ×   | ×         | ×          | 68.4    | 42.5         | 77.3   | 9.9
3          | ×      | ✓   | ×         | ×          | 68.8    | 44.8         | 60.2   | 8.0
4          | ×      | ×   | ✓         | ×          | 71.4    | 48.8         | 69.6   | 7.3
5          | ×      | ×   | ×         | ✓          | 71.6    | 45.6         | 66.7   | 8.1
6          |        |     |           |            | 69.3    | 36.6         | 71.2   | 9.7
7          | ✓      | ✓   | ✓         | ✓          | 75.9    | 37.4         | 58.8   | 8.1
Table 7. Comparison of detection performance for various single-stage algorithms on the NEU-DET dataset.

Model      | mAP@0.5 | mAP@0.5–0.95 | GFLOPs
YOLOv5     | 84.5    | 53.6         | 7.1
YOLOv6     | 84.4    | 54.1         | 11.8
YOLOv8     | 80.2    | 50.5         | 8.1
YOLOv10    | 82.9    | 52.2         | 6.5
YOLOv11    | 77.3    | 50.9         | 6.3
CAY        | 79.9    | 45.6         | \
YOLOv7-BA  | 74.8    | 38.8         | \
EFD-YOLOv4 | 79.88   | \            | \
MS-YOLOv5s | 80.5    | \            | \
SSA-YOLO   | 84.9    | 55.2         | 10.3
ETDNET     | 85.1    | 55.8         | 12.1
Ours       | 85.8    | 56.2         | 10.5
Table 8. Comparison of detection performance for various single-stage algorithms on the PVEL-AD dataset.

Model    | mAP@0.5 | mAP@0.5–0.95 | GFLOPs
YOLOv5   | 69.5    | 44.8         | 7.1
YOLOv6   | 65.2    | 43.4         | 11.8
YOLOv8   | 59.2    | 37.7         | 8.1
YOLOv10  | 69.2    | 44.3         | 6.5
YOLOv11  | 69.3    | 45.9         | 6.3
PD-DETR  | 64.7    | \            | \
SSA-YOLO | 70.3    | 40.1         | 10.3
ETDNET   | 71.2    | 39.8         | 12.1
Ours     | 75.9    | 37.4         | 8.1