1. Introduction
In modern manufacturing systems, the steel strip serves as a fundamental product of the metal industry, with extensive applications in construction, automotive engineering, high-end machinery, aerospace, and electronic packaging [1]. As precision manufacturing, intelligent production, and high-end equipment technologies advance, market demands for surface quality, safety, stability, and consistency of steel strip products have risen markedly [2]. During the production process, interactions among the metallurgical organization of raw materials, rolling parameters, and variations in ambient lighting often give rise to surface defects such as cracks, delamination, scratches, and pitting [3,4]. These defects not only increase the cost of subsequent heat treatment and coating processes but also compromise the reliability of end products. Consequently, steel strip defect detection is critical for enhancing product quality and manufacturing efficiency.
Since the late twentieth century, both academia and industry have proposed various steel strip defect detection techniques, encompassing traditional approaches such as infrared thermography and magnetic flux leakage detection [5,6,7]. Despite their widespread adoption, these techniques frequently suffer from high detection costs, limited coverage, and inadequate real-time performance. With rapid progress in computer vision, deep learning, and edge computing, real-time visual detection systems based on object detection have emerged as the mainstream solution [8,9].
Object detection is a longstanding research area within computer vision. Prior to the widespread adoption of deep learning, detection frameworks primarily relied on image processing and traditional machine learning methods. Existing research can be categorized into three main types. First, image processing methods that extract handcrafted features have been used to capture low-level surface defect cues [10,11,12,13]. Second, traditional signal-processing techniques based on frequency-domain and statistical feature extraction have been applied to characterize defect signals [14,15,16]. Third, classical machine learning algorithms—including decision trees and autoregressive models—have been employed to classify defect types [17,18]. Although these approaches have yielded some improvements in metal surface defect detection, their reliance on handcrafted features renders them sensitive to variations in lighting and background noise, and their shallow representations limit robustness in complex industrial scenarios. As a result, despite numerous proposed models, the practical deployment of such methods remains constrained [19].
Recent years have witnessed the emergence of deep learning as a powerful tool for metal surface defect detection, propelled by breakthroughs in artificial intelligence and the ever-increasing computational capacity of GPUs [20,21,22]. Convolutional neural networks (CNNs), with their ability to perform end-to-end feature extraction, have set new benchmarks in object detection and image classification, garnering extensive research interest [23,24,25,26]. To date, numerous studies have explored the application of deep learning techniques for the detection of surface defects in metals. Li et al. [27] proposed a steel surface defect detection model that integrates image enhancement strategies with a dense multi-backbone network. Compared to the then-popular YOLOv5s model, their approach achieved a 7.4% improvement in mAP50. However, the model’s complexity far exceeded that of YOLOv5s, leading to poor real-time performance. Lin et al. [28] designed a multi-scale cascaded CNN based on the lightweight MobileNet-v2 architecture, significantly reducing model complexity, but with negligible improvements in detection accuracy. Zhou et al. [29] introduced the CSPlayer and reparameterized GAM attention mechanisms into the YOLOv5s framework to enhance detection accuracy for steel surface defects. Despite a slight 1.4% improvement in mAP50, the increased model complexity undermined its practical applicability. Zhang et al. [30] incorporated the lightweight GSConv and attention mechanisms into YOLOv5 to create a more efficient steel strip defect detection model. However, no significant accuracy improvements were observed, and inference speed was not notably enhanced. Li et al. [31] proposed an enhanced YOLOv7 model, combining structures such as PConv and BiFPN to reduce the model’s FLOPs by 60%, thus effectively lightening the model while maintaining accuracy. However, no robustness testing was performed, leaving the model prone to overfitting. Zhou and Zhao [32] extended the YOLOv8 framework by integrating multi-path convolutional attention (MPCA) and partial self-attention (PSA) units to optimize detection accuracy for steel surface defects. However, the increased model complexity was not sufficiently considered. Wu et al. [33] optimized the YOLOv5n framework by introducing Ghost lightweight convolutions and attention mechanisms, achieving a good balance between model lightness and detection accuracy. Nevertheless, the model demonstrated poor robustness in experiments involving simulated environmental interference.
Although the aforementioned studies demonstrate advanced model performance compared to baseline models, an ideal balance between detection accuracy and inference efficiency in steel strip surface defect detection remains unachieved. Furthermore, many existing models are trained and evaluated on data collected in controlled laboratory environments, which fail to effectively address the real-world interference factors commonly encountered in industrial settings, such as overexposure and uneven lighting. In typical steel strip surface defect detection scenarios, the system must not only meet stringent robustness requirements but also possess high real-time capability. However, under these complex conditions, existing deep learning models often struggle to meet the dual demands of accuracy and speed, highlighting the urgent need for further optimization.
To address these challenges, this study proposes StripSurface-YOLO, a real-time steel strip surface defect detection method based on the YOLOv8n framework. This method significantly improves detection accuracy while maintaining a lightweight architecture and has been validated for stability through robustness testing. The proposed approach substantially reduces model parameters and computational complexity without compromising detection performance, thus fully meeting the stringent requirements of online inspection systems in the metal forming industry. The key innovations presented in this paper are as follows:
- (1)
A GSResBottleneck module was designed by integrating GSConv lightweight convolutions with a one-shot aggregation strategy to form a cross-stage partial network (CSP) module called ResGSCSP. This design simultaneously elevates detection precision and curtails both computational burden and inference latency.
- (2)
An EMA mechanism was introduced before the SPPF (Spatial Pyramid Pooling-Fast) layer to strengthen the model’s focus on critical features, improve feature representation and robustness, and enhance object generalization across scales.
- (3)
Instead of conventional nearest-neighbor interpolation, DySample upsampling was integrated within the neck network to produce higher-fidelity feature maps, thereby enhancing the fusion of deep, multi-level features.
- (4)
Focal Loss was adopted as the loss function; by dynamically assigning weights to easy and hard samples, the model’s recognition accuracy for challenging defect samples and overall robustness were significantly improved.
- (5)
To evaluate robustness under production conditions, this study applied five types of data interference to the original dataset. The results indicate that StripSurface-YOLO maintains superior generalization capability, making it highly applicable to real-world steel strip manufacturing scenarios and effectively augmenting both quality control and throughput.
The structure of this paper is as follows: Section 2 surveys existing research on the foundational YOLOv8 architecture, lightweight network designs, and multi-scale feature fusion techniques. Section 3 details the dataset and elaborates on the proposed StripSurface-YOLO approach. Section 4 describes the experimental methodology and presents the corresponding analyses. Finally, Section 5 offers concluding remarks.
With reductions of 11.6% in FLOPs and 7.4% in parameter count relative to the baseline, StripSurface-YOLO demonstrates an improvement of over 4% in mAP50. Its lightweight architecture ensures robust performance under varying illumination and noise conditions.
2. Related Work
2.1. YOLOv8
YOLOv8 adopts a modular architecture comprising three principal components—the backbone, neck, and head—that together achieve state-of-the-art performance in object detection tasks. The complete architecture is shown in Figure 1.
Backbone: This module is tasked with extracting multi-scale features, including shallow, intermediate, and deep representations. After fusion in the neck, these feature maps are forwarded to the head for the detection of large, medium, and small targets. YOLOv8 employs CSPDarknet53 as its backbone and optimizes the C2F (“CSP2-Fast”) module by reducing the original three convolutional layers to two, thereby lowering computational complexity. The C2F module leverages gradient connections across feature layers, halving the feature channels relative to the previous stage; this design both reduces FLOPs and enhances representational capacity. In addition, YOLOv8 uses the CBS (Conv2D-BatchNorm-SiLU) block as its basic unit: Conv2D performs two-dimensional convolution, BatchNorm provides regularization, and SiLU serves as the activation function to strengthen feature expressiveness. Feature maps processed by CBS and C2F are then input into the SPPF (Spatial Pyramid Pooling-Fast) module, which replaces the parallel large-kernel pooling of the original SPP (Spatial Pyramid Pooling) with a cascade of small max-pooling operations, maintaining SPP’s precision while significantly decreasing computational overhead.
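To make the CBS unit concrete, the following is a minimal PyTorch sketch of the Conv2D + BatchNorm + SiLU block described above (a generic illustration, not the Ultralytics source code):

```python
import torch.nn as nn

class CBS(nn.Module):
    """Conv2D -> BatchNorm -> SiLU: YOLOv8's basic convolutional unit."""
    def __init__(self, c_in, c_out, k=3, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)   # regularizes and stabilizes training
        self.act = nn.SiLU()              # smooth activation for expressiveness

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))
```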
Neck: YOLOv8’s neck employs a PAN-FPN-based feature-fusion structure to enhance multi-scale information exchange. Compared to earlier YOLO versions, PAN-FPN omits the convolutional operation on upsampled feature maps within its PAN structure. Traditional FPNs construct feature pyramids via lateral connections, but the upsampling process often introduces blur and noise, which can degrade the semantic richness of high-level features. In contrast, PAN conducts both bottom-up and top-down information flow to fuse features at different scales, thereby improving detection accuracy and augmenting the model’s adaptability to targets of varying sizes and shapes.
Head: Within the detection head, YOLOv8 adopts a decoupled architecture that separates category classification from bounding-box regression, thereby improving inference efficiency. Furthermore, YOLOv8 replaces the traditional IoU-based, one-sided sample assignment with a task-aligned matching strategy and transitions from an anchor-based to an anchor-free framework, further strengthening detection robustness.
2.2. Lightweight Network
Lightweight neural networks are engineered to enable efficient deployment in resource-constrained environments by minimizing both parameter count and computational overhead while preserving high task performance. To further curtail computational demands in object detection while maintaining detection efficacy, researchers have developed various lightweighting techniques. One class of methods reduces the precision of network weights through quantization, thereby shrinking storage requirements and accelerating arithmetic operations (e.g., [34]). Another approach prunes redundant channels or convolutional kernels to streamline network topology and optimize computation. For example, Wu et al. [33] introduced a CGH lightweight C3 structure built on YOLOv5n to create an efficient steel surface defect detection framework, albeit with a modest decline in accuracy. Liu et al. [35] combined dilated convolutions with an attention module and multi-scale pooling to enrich semantic encoding. MADNet [36] leverages dense connectivity to strengthen multi-scale feature representation and correlation learning. Su et al. [25] integrated GhostConv with a one-shot aggregation paradigm within a CSP module to significantly reduce model complexity. Liang et al. [2] proposed LAD-Net, a compact ultrasonic-welding defect detector that incorporates SAM-Conv into a lightweight stride-attention module (LSAM), reducing model complexity while preserving fine-grained defect details. Lu and Qu [37] replaced YOLOv8’s downsampling with SPD-Conv, integrated a CBAM module into the backbone, and substituted the neck’s C2f with a C2f-Ghost module, thereby lightening the model while maintaining its performance. However, these approaches predominantly focus on compressing pretrained networks or training small-scale models from scratch, which can unbalance overall performance.
To address these limitations, the present study introduces GSResBottleneck and ResGSCSP modules—both built upon the GSConv operator—to construct the StripSurface-YOLO detection model. The proposed design achieves substantial reductions in computational complexity and inference latency while safeguarding detection precision, thereby delivering an enhanced solution for lightweight object detection.
2.3. Attention Mechanism
The attention mechanism—originating from insights into the human visual system and neural information processing—dynamically allocates computational resources according to the relative importance of input features and has attracted widespread interest in deep learning research. In recent years, this module has been successfully incorporated into various neural network architectures; by emphasizing the expression of critical information and suppressing redundant features, it has demonstrably enhanced model discriminative power and task sensitivity [38]. Concurrently, the attention mechanism reduces network redundancy and increases robustness to noisy samples, thereby strengthening generalization performance on unseen data.
Standard neural architectures customarily ascribe equal significance to every input feature, subjecting them uniformly to subsequent layers. In contrast, real-world applications frequently depend on a critical subset of features for accurate inference; indiscriminate processing of the entire feature repertoire can squander computational resources, exacerbate overfitting, and impair model interpretability [39]. The attention mechanism instead assigns learnable weights to each feature, enabling the network to distinguish adaptively between critical and nonessential information. Visualization of these weight distributions further elucidates the model’s internal decision logic, guiding network-structure optimization and model tuning.
In computer vision, attention mechanisms allow models to emphasize image regions and feature channels most pertinent to the target class, markedly improving classification and detection accuracy. For instance, Zhou and Zhao [32] embed a multi-path convolutional attention (MPCA) block into both backbone and bottleneck layers—via a C2f–MPCA module—and introduce partial self-attention (PSA) units in the backbone to capture long-range dependencies. This attention-centric design yields a significant boost in mean average precision with only a marginal increase in computation. Zhao et al. [40] introduce HSC-YOLO, which augments the YOLOv10n backbone by replacing the neck’s dual concatenation operations with an SDI module to strengthen multi-scale feature fusion, and by integrating iterative attentional feature fusion (iAFF) alongside C2f.
Overall, through refined feature selection and targeted information filtering, attention mechanisms not only improve deep-model efficacy across a variety of tasks but also pave the way for enhanced interpretability and reduced computational overhead. As such, they have emerged as a central focus of contemporary deep learning research.
3. Methodology
3.1. StripSurface-YOLO Object-Detection Model
In strip surface defect detection tasks, YOLOv8n—despite its lightweight design—continues to exhibit limitations in detection accuracy and insufficient sensitivity to minute defects. To address these challenges, this study builds upon the YOLOv8n framework to propose an improved object-detection network—StripSurface-YOLO—combining high precision with a streamlined architecture; its overall structure is shown in Figure 2.
An Efficient Multi-Scale Attention (EMA) module is incorporated immediately upstream of the SPPF stage, thereby sharpening the network’s focus on salient feature regions, enhancing representational robustness, and improving cross-scale detection fidelity. To further elevate feature-map quality and enhance the distinction of overlapping wear-type defects during fusion, this study replaces the conventional nearest-neighbor interpolation with a lightweight DySample upsampling method. In the loss-function design, Focal Loss is employed: by dynamically assigning distinct weights to easily classified and challenging samples, the model’s sensitivity to difficult defect instances and overall detection robustness are significantly enhanced.
Furthermore, this study introduces a GSResBottleneck residual block built upon GSConv, combined with a one-shot aggregation paradigm, to realize an efficient cross-stage partial (CSP) module, termed ResGSCSP. This configuration markedly reduces computational complexity while preserving detection accuracy, thereby enabling lightweight feature extraction and efficient information fusion.
Figure 2 depicts the complete network architecture of StripSurface-YOLO.
3.2. Design of the ResGSCSP Architecture Based on GSConv
3.2.1. GSConv
In the backbone network of YOLOv8n, this study introduces a novel convolutional operation—GSConv—to replace the standard convolution (SC) layers, thereby optimizing overall network performance. In conventional CNN architectures, an input image undergoes successive transformations that map spatial information into channel space. However, this process invariably sacrifices some semantic information, and standard convolutions incur substantial computational overhead that increases dramatically as model size grows, leading to slower inference. By contrast, depthwise-separable convolutions (DSCs) reduce complexity by processing each channel independently, but at the cost of somewhat diminished feature-extraction and fusion capabilities compared with SC, which can limit representational power.
GSConv achieves an effective compromise between computational efficiency and expressive capacity by integrating SC and DSC within a single module. Specifically, GSConv retains implicit connections between channels while roughly halving the computational cost of standard convolution. As illustrated in Figure 3, GSConv first applies grouped shift operations to redistribute spatial information across channels, and then employs a pointwise (1 × 1) convolution to fuse those shifted features. This combination preserves nearly the same feature-learning ability as SC but reduces FLOPs by approximately 50%.
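For concreteness, the following minimal PyTorch sketch follows the widely circulated open-source GSConv formulation (a standard convolution producing half the output channels, a depthwise convolution on that result, concatenation, and a channel shuffle); the exact variant used here may differ in details:

```python
import torch
import torch.nn as nn

def cbs(c_in, c_out, k, s=1, groups=1):
    """Conv-BN-SiLU helper."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, s, k // 2, groups=groups, bias=False),
        nn.BatchNorm2d(c_out),
        nn.SiLU())

class GSConv(nn.Module):
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        c_half = c_out // 2
        self.dense = cbs(c_in, c_half, k, s)                   # SC branch
        self.cheap = cbs(c_half, c_half, 5, 1, groups=c_half)  # DSC branch

    def forward(self, x):
        x1 = self.dense(x)
        y = torch.cat((x1, self.cheap(x1)), dim=1)
        # channel shuffle: interleave SC- and DSC-generated channels
        b, c, h, w = y.shape
        return y.view(b, 2, c // 2, h, w).transpose(1, 2).reshape(b, c, h, w)
```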
3.2.2. ResGSCSP Structure
Building upon GSConv’s low-cost yet effective feature-generation properties, this study proposes a GSResBottleneck module (Figure 4a) and further integrates it with a cross-stage partial (CSP) design to form the ResGSCSP block (Figure 4b). Within GSResBottleneck, a dual-branch bottleneck structure is adopted. One branch performs lightweight feature mapping via GSConv, while the other implements identity mapping; these two branches are fused through a residual connection.
This residual design not only preserves feature-learning capacity but also mitigates gradient vanishing and explosion during deep network training, as expressed by
$$y = x + \mathcal{F}(x),$$
where $x$ and $y$ denote the input and output of the residual block, and $\mathcal{F}(\cdot)$ comprises a sequence of GSConv operations, activation functions, and normalization layers.
To further enhance nonlinearity and accelerate inference without incurring additional computational cost, this study incorporates one-shot aggregation into a CSP framework, yielding the ResGSCSP module. This specific fusion was motivated by the complementary strengths of GSConv and one-shot aggregation: GSConv’s channel-wise grouping and depthwise–pointwise factorization drastically reduce redundant computation, while one-shot aggregation maximizes information flow across parallel feature streams, together delivering richer representations at minimal cost. After channel grouping, multiple layers of GSConv extract features from each group in parallel; these groupwise features are then concatenated and fused via a 1 × 1 pointwise convolution. Compared with traditional CSP, ResGSCSP leverages GSConv’s low-cost characteristics to significantly improve the network’s expressive power and inference speed without increasing parameter count.
In summary, by inheriting GSConv’s efficient feature-generation advantage and carefully designing residual-based and cross-stage information flows, the GSResBottleneck and ResGSCSP modules achieve an optimal balance of lightweight architecture and high precision. This design provides robust technical support for online detection of surface defects in steel strip manufacturing.
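Since the modules are specified above at the block-diagram level, the following sketch shows one plausible PyTorch layout of GSResBottleneck and ResGSCSP, reusing the GSConv and cbs helpers from the Section 3.2.1 sketch; the block depth and channel splits are illustrative assumptions rather than the paper's exact configuration:

```python
import torch
import torch.nn as nn
# reuses GSConv and cbs from the Section 3.2.1 sketch

class GSResBottleneck(nn.Module):
    """Dual-branch bottleneck: GSConv mapping fused with identity, y = x + F(x)."""
    def __init__(self, c):
        super().__init__()
        self.body = nn.Sequential(GSConv(c, c, 3), GSConv(c, c, 3))

    def forward(self, x):
        return x + self.body(x)   # residual connection eases gradient flow

class ResGSCSP(nn.Module):
    """CSP split with one-shot aggregation: every intermediate output is
    concatenated once, then fused by a single 1x1 pointwise convolution."""
    def __init__(self, c_in, c_out, n=2):
        super().__init__()
        c = c_in // 2
        self.blocks = nn.ModuleList(GSResBottleneck(c) for _ in range(n))
        self.fuse = cbs(c * (n + 2), c_out, 1)

    def forward(self, x):
        a, b = x.chunk(2, dim=1)     # cross-stage partial split
        outs = [a, b]
        for blk in self.blocks:
            b = blk(b)
            outs.append(b)           # one-shot aggregation of every stage
        return self.fuse(torch.cat(outs, dim=1))
```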
3.3. Efficient Multi-Scale Attention (EMA)
In steel strip surface defect detection tasks, the model must capture detailed features of defects at varying scales while simultaneously preserving global semantic context to avoid overreliance on convolutional translation invariance that can lead to contextual neglect [41]. To address these requirements, this study incorporates the EMA mechanism [42]. EMA achieves efficient extraction and fusion of multi-scale features through a combination of channel reshaping and cross-spatial learning strategies, thereby significantly reducing both computational and memory overhead.
Let the input feature-map tensor be $X \in \mathbb{R}^{C \times H \times W}$, where $C$, $H$, and $W$ denote the number of channels, height, and width, respectively. EMA first partitions $X$ along the channel dimension into $G$ groups of sub-features:
$$X = [X_0, X_1, \ldots, X_{G-1}], \quad X_i \in \mathbb{R}^{C/G \times H \times W}.$$
Next, the module constructs three parallel multi-scale feature-extraction paths: two 1 × 1 convolution branches (each combined with one-dimensional average pooling along either the horizontal or vertical axis) and one 3 × 3 convolution branch. Specifically,
- Each 1 × 1 convolution branch preserves the number of channels and employs global one-dimensional pooling (horizontal or vertical) to capture long-range dependencies. Both branches share the same convolutional kernels to minimize parameter redundancy.
- The 3 × 3 convolution branch enhances local spatial information while interleaving channel interactions to expand feature dimensionality.
The outputs of all three branches are then encoded via two-dimensional global average pooling:
$$z_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} x_c(i, j).$$
Subsequently, spatial attention weights are generated through a Softmax-based linear combination of these pooled features. In this way, while keeping the total channel count constant, features corresponding to small- or medium-scale defects—often difficult to distinguish—are assigned higher response weights.
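For reference, the sketch below condenses the public reference implementation of EMA [42] (channel grouping, two directionally pooled 1 × 1 paths with a shared kernel, a 3 × 3 path, and Softmax-weighted cross-spatial aggregation); minor details may differ from the configuration used in this paper:

```python
import torch
import torch.nn as nn

class EMA(nn.Module):
    def __init__(self, channels, groups=8):
        super().__init__()
        self.g = groups
        cg = channels // groups
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # 1-D pooling along width
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # 1-D pooling along height
        self.gap = nn.AdaptiveAvgPool2d(1)             # 2-D global average pooling
        self.conv1x1 = nn.Conv2d(cg, cg, 1)            # shared by both 1x1 paths
        self.conv3x3 = nn.Conv2d(cg, cg, 3, padding=1)
        self.gn = nn.GroupNorm(cg, cg)

    def forward(self, x):
        b, c, h, w = x.shape
        g = x.reshape(b * self.g, c // self.g, h, w)   # channel grouping
        # two 1x1 branches with directional pooling and a shared kernel
        x_h = self.pool_h(g)
        x_w = self.pool_w(g).permute(0, 1, 3, 2)
        hw = self.conv1x1(torch.cat([x_h, x_w], dim=2))
        x_h, x_w = torch.split(hw, [h, w], dim=2)
        x1 = self.gn(g * x_h.sigmoid() * x_w.permute(0, 1, 3, 2).sigmoid())
        x2 = self.conv3x3(g)                            # local 3x3 branch
        # cross-spatial learning: Softmax-pooled descriptors re-weight the other path
        a1 = torch.softmax(self.gap(x1).flatten(2).transpose(1, 2), dim=-1)
        a2 = torch.softmax(self.gap(x2).flatten(2).transpose(1, 2), dim=-1)
        wts = (a1 @ x2.flatten(2) + a2 @ x1.flatten(2)).reshape(b * self.g, 1, h, w)
        return (g * wts.sigmoid()).reshape(b, c, h, w)
```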
As shown in Figure 5, EMA’s carefully designed channel grouping and cross-spatial learning enable efficient perception and collaborative enhancement of features across scales. Compared with traditional deep convolutional networks, this mechanism maintains computational efficiency while markedly improving detection of subtle surface defects, such as microcracks and pits. Consequently, EMA provides a reliable and lightweight attention solution for industrial online defect detection systems.
3.4. DySample Upsampling
In contemporary convolutional neural network frameworks, upsampling of feature maps constitutes a core operation whereby coarse, low-resolution activations are transformed into fine-grained, high-resolution representations, thereby enhancing the network’s ability to discern subtle patterns and local structures. Two mainstream upsampling strategies currently prevail. The first employs interpolation methods—such as nearest-neighbor or bilinear interpolation—which are widely used in subpixel space but fail to capture semantic richness, often resulting in feature degradation. The second approach utilizes transposed convolution (“deconvolution”) to expand spatial dimensions through learned convolutional kernels. However, transposed convolution applies identical kernels uniformly across the feature map, limiting responsiveness to local variations, compromising fine-detail preservation, and increasing parameter overhead.
In steel strip defect detection, diminutive flaws—such as pores, tiny pits, and microcracks—may be concealed by pixel-level distortions. To address this challenge, this study introduces a dynamically adaptive, sampling-based upsampling module—DySample—within the YOLOv8 framework, replacing the conventional UpSample operation. DySample improves responsiveness to minute defects while preserving the lightweight nature of the model.
At the core of DySample lies a combination of learned offset generation and coordinate-based bilinear interpolation. Let $X \in \mathbb{R}^{C \times H \times W}$ be the input feature map, and let the upsampling scale factor be $s$. First, a linear projection maps the $C$ channels of $X$ to generate an offset tensor $O$ of shape $2s^2 \times H \times W$:
$$O = \mathrm{Linear}(X).$$
Next, the PixelShuffle operation rearranges $O$ into a coordinate-offset field $O'$ of dimensions $2 \times sH \times sW$. With the regular sampling grid denoted as $G$, the dynamic sampling grid $S$ is obtained by elementwise addition of $O'$ to $G$:
$$S = G + O'.$$
Finally, DySample performs bilinear interpolation on $X$ using the dynamic sampling grid $S$, producing the upsampled feature map $X' \in \mathbb{R}^{C \times sH \times sW}$:
$$X' = \mathrm{GridSample}(X, S).$$
By learning pixel-level offsets, DySample amplifies regions containing small defects without requiring computationally expensive dynamic convolutions, thereby substantially reducing resource consumption. In steel strip surface defect detection, DySample preserves defect edges and texture information more effectively than conventional interpolation methods, improving the detector’s accuracy on small, low-contrast anomalies while maintaining real-time performance. The network architecture of DySample is illustrated in Figure 6.
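A minimal sketch of this procedure is shown below, assuming a 1 × 1 offset projection, PixelShuffle rearrangement, and grid_sample-based bilinear interpolation as described above; the offset scaling constant and the (x, y) channel ordering of the offsets are illustrative choices:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DySample(nn.Module):
    def __init__(self, channels, scale=2):
        super().__init__()
        self.scale = scale
        # linear projection C -> 2*s^2 (an x- and y-offset per output subpixel)
        self.offset = nn.Conv2d(channels, 2 * scale * scale, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        s = self.scale
        # O -> PixelShuffle -> (B, 2, sH, sW) coordinate-offset field
        o = F.pixel_shuffle(self.offset(x) * 0.25, s)
        # regular grid G in normalized [-1, 1] coordinates (x first, then y)
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, s * h, device=x.device),
            torch.linspace(-1, 1, s * w, device=x.device), indexing="ij")
        grid = torch.stack((xs, ys), -1).unsqueeze(0).expand(b, -1, -1, -1)
        # dynamic grid S = G + O', with offsets rescaled to normalized units
        offs = o.permute(0, 2, 3, 1) * torch.tensor(
            [2.0 / w, 2.0 / h], device=x.device)
        return F.grid_sample(x, grid + offs, mode="bilinear",
                             align_corners=True, padding_mode="border")
```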
3.5. Focal Loss
The loss function plays a pivotal role in model optimization, affecting both training efficiency and ultimate performance. In steel strip surface defect detection tasks, the classification difficulty varies across defect types, and certain easily confounded samples can impede training. To increase emphasis on hard-to-classify instances, this study incorporates a dynamic modulation factor into the conventional multi-class cross-entropy loss, thereby diminishing the gradient contribution of well-classified samples and enhancing overall detection accuracy.
The standard cross-entropy loss is defined as
$$\mathrm{CE}(p, y) = -\sum_{i} y_i \log(p_i),$$
where $p_i$ denotes the model’s predicted probability for class $i$, and $y_i$ represents the corresponding ground-truth label.
Focal Loss extends this by introducing a modulation term $(1 - p_t)^{\gamma}$, which directs the network’s attention toward low-confidence—and thus more difficult—examples during training. Formally, Focal Loss is expressed as
$$\mathrm{FL}(p_t) = -(1 - p_t)^{\gamma} \log(p_t),$$
where $p_t$ is the predicted probability of the ground-truth class and $\gamma \geq 0$ is a tunable hyperparameter. For a misclassified sample, $p_t$ is small, causing $(1 - p_t)^{\gamma}$ to approach unity, thus making the Focal Loss functionally equivalent to the standard cross-entropy loss. Conversely, when processing easily classified instances with high $p_t$ values, this modulation term diminishes toward zero, substantially suppressing their influence on the total loss. As $\gamma$ increases, the network’s focus on hard-to-classify samples intensifies, which benefits the recognition of subtle defect features and confusable cases in steel strip surface defect detection.
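In implementation terms, the modulation factor can be applied directly on top of per-sample cross-entropy, as in the sketch below; the γ and α values shown are common defaults rather than the settings tuned in this work:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Focal Loss sketch: FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t)."""
    ce = F.cross_entropy(logits, targets, reduction="none")  # -log(p_t) per sample
    p_t = torch.exp(-ce)                                     # recover p_t
    return (alpha * (1.0 - p_t) ** gamma * ce).mean()

# usage: logits (N, num_classes), targets (N,) with class indices
loss = focal_loss(torch.randn(8, 6), torch.randint(0, 6, (8,)))
```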
4. Experimental Findings and Discussion
4.1. Description of the Dataset
To assess the proposed StripSurface-YOLO framework, this study utilizes the NEU Surface Defect Detection benchmark (NEU-DET) as the primary evaluation corpus. NEU-DET encompasses six quintessential classes of steel surface anomalies, with each category comprising 300 gray-scale images at a uniform resolution of 200 × 200 pixels, summing to 1800 total samples.
The dataset is partitioned into training and test subsets in an 80:20 ratio: the training set is used for model parameter optimization and loss minimization, while the test set assesses classification performance in the surface defect recognition task.
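For reproducibility, such an 80:20 split can be realized as follows (a sketch; details such as per-class stratification and the random seed are not specified in this paper):

```python
import random

def split_dataset(samples, train_ratio=0.8, seed=0):
    """Shuffle sample indices and split into train/test subsets."""
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    cut = int(train_ratio * len(idx))
    return [samples[i] for i in idx[:cut]], [samples[i] for i in idx[cut:]]

# e.g. 1800 NEU-DET image paths -> 1440 training / 360 test samples
```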
Figure 7 presents exemplary images of the six defect categories from the dataset.
4.2. Evaluation Metrics for Experiments
4.2.1. Precision and Recall
Precision ($P$) reflects the fraction of correctly predicted positive instances among all instances labeled as positive by the model. It is expressed as
$$P = \frac{TP}{TP + FP}.$$
Recall ($R$) indicates the fraction of actual positive instances that the model correctly identifies. It is given by
$$R = \frac{TP}{TP + FN},$$
where $TP$ denotes true positives, $FP$ false positives, and $FN$ false negatives.
4.2.2. mAP50 and mAP50:95
Mean average precision (mAP) represents the area beneath the precision–recall curve. For an individual class, the average precision (AP) is defined as
$$AP = \int_{0}^{1} P(R)\, dR,$$
where $P(R)$ represents precision as a function of recall. The mean average precision over $N$ classes is then
$$mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i.$$
In this study, mAP50 denotes the mAP calculated at a single Intersection over Union (IoU) threshold of 0.50. By contrast, mAP50:95 provides a more stringent evaluation by averaging AP over multiple IoU thresholds, ranging from 0.50 to 0.95 in increments of 0.05.
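As a concrete reference, AP can be computed numerically as the area under a monotonically processed precision–recall curve; the sketch below implements the common all-point interpolation, though evaluation toolkits differ in details:

```python
import numpy as np

def average_precision(recall, precision):
    """Area under the PR curve with all-point interpolation."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]   # enforce non-increasing precision
    step = np.where(r[1:] != r[:-1])[0]        # points where recall changes
    return float(np.sum((r[step + 1] - r[step]) * p[step + 1]))

# mAP50: mean of per-class APs at IoU 0.50
# mAP50:95: additionally averaged over IoU thresholds 0.50, 0.55, ..., 0.95
```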
4.2.3. Parameter Count, Computational Complexity, and Inference Throughput
The aggregate count of trainable parameters—including every weight and bias in the network—provides a direct measure of the model’s memory footprint and its representational capacity. In practice, architectures with larger parameter budgets incur greater storage overhead and typically demand longer durations for both training and inference.
FLOPs quantify the model’s computational complexity; a lower FLOP count indicates faster runtime. For a convolutional layer, FLOPs are calculated as follows:
$$\mathrm{FLOPs} = 2 \times C_{in} \times K^2 \times C_{out} \times W \times H,$$
where $C_{in}$ and $C_{out}$ denote the numbers of input and output channels, $K$ is the convolutional kernel size, and $W$ and $H$ represent the width and height of the output feature map (each multiply–accumulate is counted as two operations). When only pointwise ($K = 1$) convolutions are considered, this simplifies to
$$\mathrm{FLOPs} = 2 \times C_{in} \times C_{out} \times W \times H.$$
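A worked example of this formula, under the common convention of counting one multiply–accumulate as two FLOPs:

```python
def conv_flops(c_in, c_out, k, h_out, w_out):
    macs = c_in * k * k * c_out * h_out * w_out   # multiply-accumulates
    return 2 * macs                                # 2 FLOPs per MAC

# e.g. a 3x3 convolution, 64 -> 128 channels, on an 80 x 80 output map
print(conv_flops(64, 128, 3, 80, 80) / 1e9, "GFLOPs")  # ~0.94 GFLOPs
```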
Inference throughput is measured in frames per second (FPS), indicating how many images the model can process within one second—an essential metric for applications requiring real-time performance. In this work, FPS is obtained by averaging the per-image inference time over 300 test samples. Higher FPS values signify superior real-time suitability, making the model more viable for scenarios with stringent latency constraints.
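The FPS protocol described above can be reproduced with a timing loop of the following form; `model` and the preprocessed `images` list are placeholders, and explicit GPU synchronization is needed for honest timings:

```python
import time
import torch

@torch.no_grad()
def measure_fps(model, images, warmup=10):
    model.eval()
    for img in images[:warmup]:        # warm-up to exclude one-off setup costs
        model(img)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    for img in images:                 # e.g. 300 test samples
        model(img)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return len(images) / (time.perf_counter() - start)
```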
4.3. Experimental Setup and Parameter Initialization
Computational evaluations were carried out on a Windows 10-based workstation using PyCharm 2024.1 (Professional Edition) as the development environment. The deep learning stack comprised CUDA 12.4, Python 3.11.0, and PyTorch 2.5.1. The hardware configuration included an NVIDIA RTX 4090D GPU with 24 GB of VRAM and an AMD Ryzen 9 7950X CPU clocked at 4.50 GHz. Detailed hyperparameter configurations are provided in Table 1.
4.4. Model Performance Assessment
To demonstrate the merits of StripSurface-YOLO, this study conducted a comparative analysis against the baseline YOLOv8n model, with the results summarized in Table 2.
Evaluation on the test set reveals that StripSurface-YOLO requires 11.6% fewer floating-point operations and 7.4% fewer parameters than YOLOv8n, while achieving improvements of 1.4% in precision, 3.1% in recall, 4.1% in mAP50, and 3.0% in mAP50:95.
To further elucidate StripSurface-YOLO’s defect detection capabilities, this study presents qualitative comparisons on representative test images (Figure 8). As shown in Figure 8, the baseline YOLOv8n model exhibits both missed detections and false positives when confronted with subtle defects, particularly cracks (Cr), patches (Pa), and rolled-in scale (Rs). In contrast, StripSurface-YOLO demonstrates enhanced robustness, effectively mitigating detection errors and exhibiting superior defect recognition performance.
4.5. Investigation of Attention Mechanism Effects
To rigorously evaluate the efficacy of the EMA mechanism for detecting surface defects on strip steel, this study inserted each of five alternative attention modules into the lightweight YOLOv8n baseline in place of EMA and conducted comparative experiments under an identical dataset and training regimen. CA integrates spatial coordinate encoding with channel-wise perception, thereby establishing pixel-level positional dependencies that improve localization accuracy over defect regions. SE leverages global channel statistics to adaptively recalibrate channel weights, mitigating information loss due to uneven inter-channel responses. NonlocalBlockND computes similarity between any two feature locations via nonlocal operations, enabling global context modeling of long-range dependencies; however, its high-order affinity computations incur substantial overhead on high-resolution feature maps. CBAM sequentially applies two attention submodules—first performing global average and max pooling to refine channel allocation, and then compressing across channels to highlight critical spatial regions—which can enhance detection performance but is less favorable for real-time deployment and resource efficiency compared to EMA’s lightweight design. SCSA combines spatial and channel interaction with multi-scale semantic fusion to balance local detail and global context, though its multi-branch architecture increases network complexity.
As shown in Table 3, the overall detection performance of the model reaches its peak with the incorporation of the EMA mechanism into the baseline YOLOv8n framework. Compared to the original YOLOv8n, the EMA-enhanced model achieves precision (P), recall (R), mAP50, and mAP50:95 scores of 69.5%, 72.5%, 76.4%, and 43.2%, respectively, with the latter three metrics representing the highest performance among all configurations. Although the precision of the EMA model is marginally lower than that achieved by a few alternative attention mechanisms, it demonstrates an optimal trade-off between recall and precision, thereby significantly improving the model’s robustness in defect detection tasks. Moreover, the integration of EMA does not incur a substantial increase in FLOPs or parameter count, underscoring the mechanism’s efficacy in enhancing detection performance while maintaining a lightweight architecture—an essential attribute for practical deployment in steel strip surface defect inspection.
4.6. Findings from Ablation Studies
To systematically evaluate the contribution of each improved module to the strip steel surface defect detection task, a series of ablation experiments was conducted based on the lightweight YOLOv8n architecture (Table 4). Model 0 serves as the baseline YOLOv8n, and subsequent models incrementally incorporate EMA, DySample, ResGSCSP, and Focal Loss to quantify the impact of each component on detection performance, computational complexity, and parameter footprint.
The ablation results in Table 4 reveal the following key findings:
- (1)
Effectiveness of the EMA Module: The integration of the EMA module substantially augments detection performance. EMA introduces a cross-spatial learning scheme in which a portion of the channel dimension is reinterpreted as additional batch samples, allowing these channel subsets to be processed concurrently. By combining this grouping strategy with 3 × 3 convolutions, EMA captures features at multiple spatial scales, enabling joint modeling of both channel and spatial dependencies. Its parallel subnetwork architecture applies multi-scale convolutions to the grouped sub-features, while a dynamic modulation mechanism strengthens object responses without incurring the dimensionality-reduction penalties typical of conventional attention methods. Additionally, EMA adopts a cross-spatial feature-aggregation strategy—leveraging lightweight decomposition and depthwise-separable convolutions—to reinforce the representation of fine steel strip defect details while simultaneously reducing computational complexity. Experimentally, the addition of EMA yields increases in precision (P), recall (R), mAP50, and mAP50:95 of 1.9%, 2.2%, 2.4%, and 2.4%, respectively, thereby validating EMA’s efficient feature-representation capabilities.
- (2)
Impact of the ResGSCSP Lightweight Branch: The ResGSCSP structure, based on GSConv, substantially reduced FLOPs (from 6.9 G to 6.0 G) and parameters (from 2.57 M to 2.32 M) while maintaining comparable detection performance. This demonstrates the module’s effectiveness in balancing accuracy with computational efficiency, which is crucial for deployment in resource-constrained industrial environments.
- (3)
Contribution of DySample: Upon integrating the DySample upsampling module, precision (P), mAP50, and mAP50:95 all show improvements—1.0%, 0.8%, and 0.7%, respectively—indicating that DySample’s dynamic upsampling mechanism and lightweight decomposition design boost multi-scale feature representation with only a marginal parameter overhead. Notably, recall (R) exhibits a slight decrease from 70.3% to 70.2%; this minor drop may be attributable to DySample’s lowered response threshold for edge-blurred regions or background noise when optimizing feature sensitivity. Nevertheless, from a holistic performance standpoint, DySample’s cross-scale feature-aggregation strategy combined with dynamic sampling-point partitioning effectively enhances detection robustness while preserving overall model efficiency.
- (4)
Impact of the Focal Loss: Substituting the original loss function with Focal Loss improved recall by 0.8%, mAP50 by 0.7%, and mAP50:95 by 0.6%, with only a marginal drop in precision (–0.5%). This indicates that Focal Loss enhances the model’s focus on hard-to-classify samples, particularly benefiting the detection of small or subtle defects, without increasing computational or parameter burdens.
- (5)
Combinatorial Optimization of Multiple Components: Various combinations of the above modules were tested, showing synergistic improvements in detection accuracy. These enhancements incurred only minimal increases in computational complexity and parameter count. Specifically, the incorporation of ResGSCSP produced consistent decreases in computational overhead while maintaining detection accuracy, thereby demonstrating the practicality of multimodule collaborative optimization for industrial defect inspection.
- (6)
Overall Performance of StripSurface-YOLO: The fully integrated model, termed StripSurface-YOLO, outperformed the baseline YOLOv8n in all metrics. It achieved a 1.4% increase in precision, 3.1% in recall, 4.1% in mAP50, and 3.0% in mAP50:95, while reducing FLOPs and parameters by 11.6% and 7.4%, respectively. These results underscore StripSurface-YOLO’s superior capacity to detect minute surface defects with high accuracy and efficiency, making it a compelling solution for real-time industrial inspection applications.
4.7. Comparison of Different Methods
Several object-detection models with a computational footprint comparable to YOLOv8n were chosen as baselines, including YOLOv3-tiny, YOLOv5n, YOLOv9t, YOLOv10n, and YOLO11n. Larger mainstream models (YOLOv3, YOLOv9c, and RTDETR-ResNet50) were also included to evaluate the performance advantages of StripSurface-YOLO. Additionally, representative enhanced models from recent research—SDD-YOLO and EcoDetect-YOLOv2—were introduced to further validate model advancement. Under consistent experimental settings, the training results for each model are presented in Table 5.
As shown in Table 5, StripSurface-YOLO, which maintains a lightweight profile (6.0 G FLOPs and 2.32 M parameters), achieves the highest mAP50 (78.1%) and mAP50:95 (43.8%). Although its precision and recall are not the highest among all evaluated models, it demonstrates a more balanced precision–recall trade-off at an extremely low computational cost.
Specifically, compared to models of similar scale—YOLOv8n, YOLOv9t, YOLO11n, SDD-YOLO, and EcoDetect-YOLOv2—StripSurface-YOLO exhibits superior detection performance. Relative to larger-scale models such as YOLOv5s, YOLOv8s, YOLOv9c, and RTDETR-ResNet50, StripSurface-YOLO remains leading in mAP metrics while reducing computational complexity by approximately 60–95%. Collectively, StripSurface-YOLO delivers superior overall detection performance with minimal computational overhead, thereby underscoring its efficiency and robustness.
To evaluate the detection performance of StripSurface-YOLO across different defect types, this study conducted a comparative study against several representative object-detection models. Table 6 summarizes the average precision (AP) values achieved by each model in six defect categories.
The results demonstrate that StripSurface-YOLO achieves the highest AP for four defect types—cracking (Cr), inclusion (In), patch (Pa), and pitted surface (Ps)—surpassing all other models by a considerable margin, especially for cracking and inclusion. For rolled-in scale (Rs) and scratch (Sc), StripSurface-YOLO remains among the top-performing lightweight and mid-scale networks, thereby confirming its advantage in detecting both minuscule and structurally complex defects.
Figure 9 provides a visual comparison of per-category AP values and centroid distributions for each model, offering an intuitive view of both aggregate and per-category performance. StripSurface-YOLO distinctly leads in overall detection capability and attains the highest AP in four key defect categories—cracking (Cr), inclusion (In), patch (Pa), and pitted surface (Ps)—illustrating its exceptional adaptability and robustness for multi-class surface defect detection.
4.8. Robustness Evaluation
In industrial settings, metal surface images are frequently degraded by various environmental factors—such as overexposure, insufficient illumination, and motion blur—which can significantly impede defect detection. To systematically assess the adaptability and robustness of the proposed StripSurface-YOLO model under such challenging conditions, this study applied five types of typical perturbations to the original dataset: a 50% increase in brightness, a 50% decrease in brightness, a 50% increase in contrast, a 50% decrease in contrast, and the addition of 5% Gaussian noise. A subset of the original images alongside their perturbed counterparts is illustrated in Figure 10.
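These perturbations can be generated with simple pixel-level transforms; the sketch below shows one plausible realization, since the paper's exact scaling and noise conventions are not specified:

```python
import numpy as np

def perturb(img, mode):
    """img: uint8 HxWxC array; returns the perturbed image."""
    f = img.astype(np.float32)
    if mode == "brightness_up":
        out = f * 1.5
    elif mode == "brightness_down":
        out = f * 0.5
    elif mode == "contrast_up":
        out = (f - f.mean()) * 1.5 + f.mean()
    elif mode == "contrast_down":
        out = (f - f.mean()) * 0.5 + f.mean()
    elif mode == "gaussian_noise":
        out = f + np.random.normal(0.0, 0.05 * 255, f.shape)  # 5% noise
    else:
        raise ValueError(mode)
    return np.clip(out, 0, 255).astype(np.uint8)
```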
Table 7 reports the detection performance of the baseline YOLOv8n and our StripSurface-YOLO across each perturbed dataset.
Table 7 reveals the following key insights:
- (1)
Impact of Brightness Variations: In real industrial production environments, factors such as light occlusion and exposure instability frequently compromise the performance of strip steel surface defect detection systems. As shown in Table 7, under substantial brightness reduction, both the proposed StripSurface-YOLO and the baseline YOLOv8n model suffer considerable performance degradation. Specifically, the mAP50 of YOLOv8n drops sharply from 74.0% to 61.4%, a decrease of 12.6 percentage points, while StripSurface-YOLO exhibits a smaller decline from 78.1% to 67.4%, amounting to a reduction of 10.7 percentage points. When brightness is significantly increased, the performance impact on both models is marginal, with mAP50 decreasing by only 3.5% and 3.0%, respectively. These results suggest that although StripSurface-YOLO is also affected under extreme illumination fluctuations—particularly under low-light conditions—it consistently maintains superior detection accuracy, demonstrating enhanced robustness and practical applicability in environments with variable lighting.
- (2)
Impact of Contrast Variations: In industrial settings, imaging contrast may deteriorate due to equipment aging, lens contamination, or dust accumulation on conveyor belts, whereas highly reflective stripes or background lighting may cause excessive contrast scenarios. Table 7 indicates that when contrast is reduced by 50%, the mAP50 of StripSurface-YOLO decreases modestly from 78.1% to 76.2%, a reduction of only 1.9 percentage points, whereas the baseline model suffers a larger drop of 2.5 percentage points. A similar trend is observed for the mAP50:95 metric, where StripSurface-YOLO again exhibits a smaller decrease compared to the baseline (2.0% vs. 2.7%). Likewise, under a 50% increase in contrast, the mAP50 reduction for StripSurface-YOLO is merely 1.4 percentage points, noticeably lower than the 1.8-percentage-point drop seen in YOLOv8n. These findings indicate that StripSurface-YOLO maintains greater stability in preserving discriminative defect features under severe contrast fluctuations caused by lens or lighting inconsistencies, effectively suppressing background texture interference under both low- and high-contrast conditions and ensuring more consistent detection performance.
- (3)
Impact of Gaussian Noise: In online strip steel inspection, factors such as sensor read/write errors, electromagnetic interference, and high-frequency vibrations can introduce additive Gaussian noise, leading to loss of image detail and texture blurring. After the introduction of 5% Gaussian noise, StripSurface-YOLO’s mAP50 drops only slightly from 78.1% to 77.0%, a reduction of 1.1 percentage points; in contrast, the baseline model suffers a larger decline of 2.2 percentage points. Similarly, in terms of the mAP50:95 metric, the performance degradation of StripSurface-YOLO is approximately 1.8%, which is also noticeably lower than the 2.1% decrease observed for the baseline. This comparison highlights the superior ability of StripSurface-YOLO to retain fine defect textures and edge features under noisy conditions, ensuring consistently high detection accuracy and exhibiting greater robustness and operational stability.
In summary, across various challenging scenarios involving substantial brightness and contrast variations as well as additive noise interference, StripSurface-YOLO consistently outperforms YOLOv8n in terms of detection precision, recall, mAP50, and mAP50:95. When aggregating results under all perturbation conditions, YOLOv8n shows average decreases of 4.5% and 4.1% in mAP50 and mAP50:95, respectively, whereas StripSurface-YOLO exhibits smaller corresponding reductions of approximately 3.6% and 2.4%. Overall, StripSurface-YOLO not only demonstrates reduced performance fluctuation under noise and illumination disturbances but also achieves superior detection accuracy on perturbed images, even surpassing the baseline YOLOv8n’s performance on original, interference-free images. These results robustly validate the proposed method’s excellent robustness and generalization capability for online strip steel surface defect detection.
5. Conclusions
To address the challenge of poor model robustness in strip steel surface defect detection under complex industrial environments, this study proposes StripSurface-YOLO, a robust and lightweight real-time detection framework built upon the YOLOv8n architecture. The framework integrates a one-shot aggregation strategy and introduces an efficient cross-stage local network, ResGSCSP, constructed with the lightweight convolutional module GSConv, which substantially reduces both parameter count and computational complexity. Furthermore, an Efficient Multi-Scale Attention (EMA) mechanism is incorporated before feature fusion to enhance the model’s sensitivity to subtle and small-scale defects. In the neck network, the traditional nearest-neighbor interpolation is replaced with DySample upsampling, improving the quality of deep feature maps while accelerating inference speed. Additionally, Focal Loss is employed to dynamically adjust the weighting of easy- and difficult-to-classify samples, significantly boosting the model’s capability in detecting complex and minor surface defects. Experimental results demonstrate that, compared to the baseline YOLOv8n model, StripSurface-YOLO achieves reductions of 11.6% in computational load and 7.4% in parameter count, while delivering improvements of 1.4%, 3.1%, 4.1%, and 3.0% in precision, recall, mAP50, and mAP50:95, respectively. Under typical perturbations such as contrast variation, brightness fluctuation, and Gaussian noise—common in real-world industrial scenarios—the performance degradation of mAP50 and mAP50:95 remains limited to approximately 3.6% and 2.4%, respectively, validating the proposed method’s superior accuracy, real-time capability, and robustness in practical manufacturing environments.
To further advance the practical deployment of StripSurface-YOLO and close the gap between controlled experiments and real-world industrial applications, future work should pursue the following directions:
- (1)
Cross-Domain Generalization. Systematic pretraining on diverse, multi-source field datasets—encompassing different steel grades, rolling parameters, and operational environments—followed by targeted fine-tuning, will rigorously evaluate and enhance the model’s adaptability to unseen production conditions.
- (2)
Incremental and Continual Learning. The integration of online update strategies—such as self-supervised or weakly supervised incremental learning—will enable the detector to accommodate evolving production characteristics and emerging defect types, thereby preserving detection accuracy and reliability over extended operational periods.
- (3)
Multi-Modal Sensor Fusion. Fusing visual inputs with complementary data streams (e.g., infrared thermography, ultrasonic or acoustic emissions) promises to heighten sensitivity to subsurface anomalies and nascent defect signatures, particularly in scenarios of low illumination, surface occlusion, or complex background noise.
By addressing these research avenues, future efforts can extend the robustness, versatility, and autonomy of StripSurface-YOLO, ultimately facilitating the realization of fully adaptive, end-to-end inspection systems that satisfy the stringent throughput and quality requirements of modern steel manufacturing.