1. Introduction
Galvanized steel sheets are an important category of metallic materials, widely used in the construction, automotive, and electrical appliance industries because of their excellent corrosion resistance and formability [1]. In the manufacturing process, coiled steel sheets undergo pickling followed by immersion in molten zinc, forming a protective zinc coating that acts as a physical barrier [2]. This hot-dip galvanizing process is cost-effective and provides superior rust prevention, shielding the steel from moisture, oxygen, and other corrosive elements and thereby inhibiting oxidation [3]. It can extend the service life of steel by 5–8 years in humid environments, which makes it a crucial method for enhancing steel longevity [4]. The integrity of the zinc coating directly determines the reliability of final products [5] such as automobiles and household appliances, while a high-quality surface appearance is essential for consumer appeal and market competitiveness [6]. Furthermore, surface imperfections can adversely affect coating adhesion and weldability [7], compromising subsequent processing quality. Therefore, rigorous surface quality inspection is imperative to ensure that galvanized sheets meet high quality standards while remaining economically viable.
In practical hot-dip galvanizing production, factors such as limited equipment precision and process fluctuations can introduce over 50 distinct types of surface defects [8]. These defects can be broadly categorized into mechanical damage (e.g., scratches) and coating defects (e.g., uneven zinc distribution) [9]. Left unaddressed, these defects may progressively expand, leading to localized zinc spallation and initiating cascading corrosion issues [10].
Consequently, high-speed online detection of surface defects on galvanized sheets has become a focal point in visual inspection research. Detection methodologies in this domain can be classified into traditional manual inspection, conventional machine learning-based detection, and deep learning-based detection [11]. Initially, steel enterprises worldwide relied primarily on manual visual inspection [12], in which workers observed the production line in real time and periodically unrolled coils for sampling. However, with advancements in industrial technology and rising demand for high-grade steel, the limitations of this method (susceptibility to subjective bias, lack of quantitative standards, and difficulty in promptly identifying minute defects) became increasingly apparent [13]. Moreover, challenging factory conditions involving noise, high temperatures, and low illumination contribute to visual fatigue among inspectors, resulting in high missed-detection rates and delayed response times. As a result, manual inspection is largely obsolete in modern industrial settings [14].
Subsequently, advancements in hardware capabilities and machine learning techniques fostered the growing application of machine vision in industrial defect detection [15]. This approach involves capturing images of galvanized sheets using industrial cameras, followed by feature extraction via methods such as edge detection and texture analysis [16], and finally classification using algorithms such as Support Vector Machines (SVM) [17]. Recognized for its speed, accuracy, and high degree of automation [18], this technology has garnered significant attention from both industry and research institutions [19]. Nevertheless, conventional machine learning methods depend heavily on handcrafted feature extraction [20], a process that is often complex and time-consuming. In contrast, deep learning-based methods learn features autonomously from image samples [21], demonstrating superior performance in defect localization and recognition. They are characterized by high accuracy, strong adaptability, ease of deployment, efficiency, and enhanced robustness [22]. Because of these advantages, applying deep learning to surface defect detection in galvanized sheets has become a research hotspot and a prevailing trend.
Although YOLOv5 exhibits robust performance in general object detection tasks, its baseline architecture suffers from several critical limitations when directly deployed for defect detection on metallic surfaces (e.g., galvanized sheets). These limitations constitute the core justification for the targeted improvements proposed in this work:
Insufficient Capability for Detecting Small Defects: Certain defects on galvanized sheet surfaces (e.g., non-uniform spangle size, fine scratches) occupy only a minimal number of pixels in high-resolution images acquired by industrial cameras [23]. The semantic information of such small objects is highly susceptible to loss during feature extraction and down-sampling in the standard YOLOv5 pipeline. The original Feature Pyramid Network (FPN)/PANet in YOLOv5 exhibits limited efficiency in cross-scale feature fusion and inadequate representational capacity for subtle features, resulting in a lower recall rate for small defects.
Limited Modeling Capacity for Defects with Complex Textures and Irregular Shapes: Defects such as “diagonal streaking” on metallic surfaces often exhibit complex textural patterns and irregular geometric characteristics. The backbone network of standard YOLOv5, which is primarily based on CNN convolutional operations, has an inherent limitation in capturing long-range dependencies and global contextual information, stemming from its local receptive field. This hinders the model’s ability to grasp the overall morphology of defects and their complex contextual relationships with the surrounding background, thereby compromising the accurate classification and localization of irregular defects.
Robustness to Reflective Surfaces and Illumination Variations Needs Enhancement: Galvanized sheet surfaces are highly reflective, and uneven illumination in production environments can easily induce highlights and glare in acquired images. These regions exhibit pixel-level similarities to certain defects (e.g., bright spot defects). The standard model lacks an explicit mechanism for the adaptive recalibration of feature channels, rendering it vulnerable to interference from such semantically irrelevant strong noise, which may result in false positives or missed detections.
Feature Utilization Efficiency Can Be Further Improved: In complex industrial scenarios, not all channel-wise information in the rich feature maps extracted by the network is equally critical for the final detection task. Standard YOLOv5 treats all feature channels equally [24], without adaptively emphasizing key defect-relevant features or suppressing redundant and noisy information. This limits the model’s precision and efficiency to a certain extent.
Thus, direct deployment of the standard YOLOv5 model fails to meet the stringent requirements for high accuracy, strong robustness, and real-time performance in galvanized sheet surface defect detection. The proposed S-C-B-YOLO model in this work is specifically designed to address these limitations in a systematic manner: it achieves this by introducing the Squeeze-and-Excitation (SE) attention mechanism to adaptively recalibrate channel-wise features, thereby enhancing the model’s focus on critical information and its interference resistance; by designing the C3TR module that integrates the global modeling capabilities of Transformer, thereby improving the model’s understanding of complex textures and irregular defects; and by adopting Bi-FPN to optimize the multi-scale feature fusion pipeline, thereby significantly enhancing the model’s sensitivity to small defects. These improvements are synergistic and targeted, with the goal of adapting the model more effectively to the specific challenges associated with metallic surface defect detection.
2. Review
To address the limitations of existing methods in detecting small, irregular, and reflective defects on galvanized steel surfaces, this work proposes an enhanced YOLOv5-based model, named S-C-B-YOLO. The research focuses on three key improvements: integrating a channel attention mechanism to enhance feature selectivity, incorporating transformer-based modules to capture global context, and optimizing the multi-scale feature fusion pathway. The study systematically develops and validates this model using a dedicated dataset of galvanized sheet defects, aiming to achieve a balance between high detection accuracy and real-time processing speed [25] suitable for industrial deployment.
We constructed a surface spangle defect dataset for galvanized sheets from images of factory-produced finished products. The dataset primarily includes two defect types: ‘diagonal streaking’ and ‘non-uniform spangle size’ [26].
Diagonal streaking refers to streaks that appear at the edges of hot-dip galvanized sheets, forming a certain angle with the rolling direction of the strip steel, as shown in the green box in Figure 1a. The zinc coating is noticeably thicker in these streaked areas. In extreme cases, the streaks may extend across the entire cross-section of the galvanized sheet, resulting in a penetrating streaking defect.
Non-uniform spangle size is a common defect characterized by a significant, often pronounced, difference in spangle dimensions between the head/tail and the middle region of the steel sheet, as illustrated in the yellow and orange boxes in Figure 1b.
The main contributions of this paper are summarized as follows:
Dataset Construction: We compile and annotate a practical galvanized sheet defect dataset, applying advanced augmentation techniques like Mosaic to improve data diversity and model robustness.
Architecture Design: We propose the S-C-B-YOLO model, which systematically integrates three key enhancements into the YOLOv5 framework: the SE attention mechanism for feature recalibration, the C3TR module for global context modeling, and the Bi-FPN for efficient multi-scale fusion.
Comprehensive Validation: Through extensive ablation studies and comparisons with state-of-the-art detectors, we demonstrate the effectiveness of our model, showing significant improvements in both accuracy and speed, and provide an analysis of its performance and failure modes.
The remainder of this paper is organized as follows. Section 2 gives the review. Section 3 introduces the two methods used for data augmentation. Section 4 details the proposed methodology and architecture improvements. Section 5 describes the dataset preparation and experimental setup, including implementation details and evaluation metrics; it also presents and discusses the results, including ablation studies, comparative analysis, and failure case examination. Finally, Section 6 concludes the paper and suggests future research directions.
3. Method
To improve model robustness, prevent overfitting, and enhance the detection capability for small-scale defects, systematic data augmentation techniques were applied during the training phase.
3.1. Mosaic Data Augmentation
Mosaic data augmentation is an effective strategy widely used in object detection tasks. Its core principle involves randomly selecting four training images, scaling and cropping them, and stitching them into a new composite image while correspondingly fusing their bounding box annotations. The primary advantages of this method include:
Enriched Context: Forces the model to learn to recognize multiple objects and their spatial relationships within a single, complex scene.
Improved Small-Object Detection: Small defects in the original images may become more prominent in the composite image after stitching and rescaling.
Optimized Batch Normalization: A single composite image contains richer pixel statistics, contributing to more stable training.
Increased Training Efficiency: Effectively exposes the model to more diverse data combinations within the same number of iterations, potentially reducing the total epochs required for convergence.
The specific workflow consists of the following key steps:
Random Image Selection: Four original images are randomly sampled from the training set.
Random Scaling and Cropping: Each image undergoes random resizing and region cropping to introduce variations in scale and composition.
Stitching: The processed images are placed into the four quadrants of a new canvas to form a composite image.
Label Fusion: The bounding box coordinates from each original image are affinely transformed according to their new position and scale within the composite image, generating a unified label file for the synthesized sample.
In this work, Mosaic augmentation was activated by setting the parameter “mosaic = 1” during training. It is important to note that this technique is applied exclusively during the training phase and is disabled for model validation and testing to ensure performance evaluation reflects the model’s capability on real, single images.
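The four workflow steps above can be illustrated with a minimal NumPy sketch. It is not the YOLOv5 implementation: the function name `mosaic4`, the grey fill value, and the choice to resize each image to fill its quadrant (rather than crop it) are simplifying assumptions made here for brevity. The label-fusion step is the scale-and-translate transform applied to each box.

```python
import numpy as np

def mosaic4(images, boxes, size=640, rng=None):
    """Stitch four images into one size x size canvas, one image per quadrant.

    images: list of four H x W x 3 uint8 arrays
    boxes:  list of four lists of (x1, y1, x2, y2) pixel boxes, one list per image
    Returns the composite image and the fused list of transformed boxes.
    """
    if rng is None:
        rng = np.random.default_rng()
    canvas = np.full((size, size, 3), 114, dtype=np.uint8)  # grey background (assumed value)

    # A random mosaic centre varies the scale of each quadrant.
    cx = int(rng.integers(size // 4, 3 * size // 4))
    cy = int(rng.integers(size // 4, 3 * size // 4))
    quadrants = [(0, 0, cx, cy), (cx, 0, size, cy),
                 (0, cy, cx, size), (cx, cy, size, size)]

    fused = []
    for img, img_boxes, (x0, y0, x1, y1) in zip(images, boxes, quadrants):
        h, w = img.shape[:2]
        qw, qh = x1 - x0, y1 - y0
        # Nearest-neighbour resize so the image exactly fills its quadrant.
        rows = np.arange(qh) * h // qh
        cols = np.arange(qw) * w // qw
        canvas[y0:y1, x0:x1] = img[rows[:, None], cols[None, :]]
        # Label fusion: scale each box, then translate it into the quadrant.
        sx, sy = qw / w, qh / h
        for bx1, by1, bx2, by2 in img_boxes:
            fused.append((x0 + bx1 * sx, y0 + by1 * sy,
                          x0 + bx2 * sx, y0 + by2 * sy))
    return canvas, fused
```

Because every box goes through the same affine map as its image, the fused labels stay aligned with the pixels in the composite. A production pipeline would additionally crop images and discard boxes that become too small.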
3.2. Gaussian Noise
Overfitting often occurs when a neural network learns recurring high-frequency features that may not be useful. Gaussian noise contains components at all frequencies, so adding it effectively distorts these high-frequency features; however, it also perturbs low-frequency features, which makes the useful signal somewhat less accurate. Nonetheless, the network can learn to look past this perturbation and ultimately arrive at a more accurate model. Adding an appropriate amount of noise can therefore improve the learning capability and accuracy of neural networks.
Gaussian noise is noise whose amplitude follows a Gaussian (normal) distribution. It is present at almost every pixel, with a random intensity at each one. Its probability density function follows directly from the normal distribution in probability theory and is given by Equation (1):
$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x-\mu)^{2}}{2\sigma^{2}}\right) \qquad (1)$$
Here, $\mu$ represents the mean (expected value) and $\sigma^{2}$ represents the variance. For each input pixel, the output pixel is obtained by adding a random number drawn from this Gaussian distribution.
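The per-pixel procedure described above can be sketched as follows. The function name `add_gaussian_noise` and the default values of `mean` and `sigma` are illustrative assumptions; the source does not state the noise level used in the experiments.

```python
import numpy as np

def add_gaussian_noise(image, mean=0.0, sigma=10.0, rng=None):
    """Add independent Gaussian noise to every pixel of a uint8 image."""
    if rng is None:
        rng = np.random.default_rng()
    # One sample from N(mean, sigma^2) per pixel.
    noise = rng.normal(mean, sigma, size=image.shape)
    noisy = image.astype(np.float64) + noise
    # Round and clip back to the valid 8-bit range.
    return np.clip(np.rint(noisy), 0, 255).astype(np.uint8)
```

Clipping to [0, 255] slightly distorts the distribution near pure black or white pixels, which is usually acceptable for augmentation purposes.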
4. Establishment of Galvanized Sheet Inspection Model Based on S-C-B-YOLO
This work employs the YOLOv5 model. After considering the application scenario and resource requirements, the YOLOv5s version was selected. This version has the smallest network depth and the narrowest feature map width, resulting in the fastest processing speed and making it more suitable for real-time detection. Its architecture primarily consists of four parts: the Input, the Backbone, the Neck, and the Output (Figure 2).
4.1. SE Attention Mechanism
The SE (Squeeze-and-Excitation) attention mechanism primarily functions to re-calibrate channel-wise feature responses by performing adaptive weighting of feature map channels. This enables the network to focus more on channels containing crucial information while suppressing less important ones, thereby enhancing the representational capacity of the features. This dynamic adjustment of channel-wise feature response strengths is achieved through a three-step process applied to each channel of the input feature map: Squeeze (global information compression), Excitation (importance weight learning), and Scale (channel re-weighting).
In the present improvement, the SE attention module is incorporated into the deeper layers of the Backbone network (preceding the SPPF module). Placing the SE module here aids the network in better integrating and utilizing information from all preceding layers, leveraging its global receptive field to guide feature selection. It takes the output from the previous C3 module as its input. For this specific implementation, the input feature map has a shape of [B, 1024, Height, Width], where the channel count C is 1024, and the compression ratio is set to 2.
The operational procedure is as follows:
Level 1. Squeeze (Compression):
The objective of this step is to aggregate global spatial information from each channel, discarding the spatial distribution details. This is accomplished by performing Global Average Pooling (GAP) over the spatial dimensions (H, W) of each channel. The result is a channel-wise descriptor vector Z with a shape of [B, 1024, 1, 1]. Each element $z_c$ in this vector represents the global average response intensity of the c-th channel in the original feature map U, as shown in Equation (2):
$$z_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_c(i, j) \qquad (2)$$
where $u_c(i, j)$ is the value of the c-th channel of U at spatial position (i, j).
Level 2. Excitation (Adaptive Gating/Weight Learning):
This step aims to capture non-linear dependencies between channels and generate a set of modulation weights.
1. Dimensionality Reduction: The descriptor vector Z from the previous step is passed through a fully connected (FC) layer, reducing its dimensionality. Typically, the dimension is reduced to $C/r$, where r is the reduction ratio. Here, with reduction_ratio = 2 and C = 1024, the dimension is compressed to 512, which lowers the computational cost of the module. A non-linear activation function (e.g., ReLU) is then applied to capture non-linear interactions between channels.
2. Dimensionality Restoration: The compressed features (512-dimensional) are then fed through another FC layer to restore the original channel dimensionality (1024-dimensional).
3. Activation Function: A Sigmoid activation function is applied to the output of this second FC layer, constraining the weight for each channel to the range (0, 1).
Upon completion of these steps, a channel weighting vector S with a shape of [B, 1024, 1, 1] is generated.
Level 3. Scale (Re-weighting):
The final step involves re-scaling the original input feature map U using the learned weights. This is done by performing channel-wise multiplication between the original feature map U and the channel weighting vector S. Specifically, for each channel c, all spatial elements (i, j) within that channel are multiplied by the corresponding scalar weight $s_c$, as defined in Equation (3):
$$\hat{u}_c(i, j) = s_c \cdot u_c(i, j) \qquad (3)$$
The result is the recalibrated feature map Û, which possesses the same shape as the input U.
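The Squeeze, Excitation, and Scale steps can be traced end to end in a few lines of NumPy. This is a minimal forward-pass sketch with externally supplied weights, not the trained module used in the paper; the function name `se_block` and the argument layout are assumptions chosen for illustration.

```python
import numpy as np

def se_block(u, w1, b1, w2, b2):
    """Forward pass of a Squeeze-and-Excitation block.

    u:  input feature map, shape [B, C, H, W]
    w1: first FC weight,  shape [C // r, C]  (dimensionality reduction)
    w2: second FC weight, shape [C, C // r]  (dimensionality restoration)
    Returns the recalibrated feature map and the channel weights s.
    """
    # Squeeze: global average pooling over H and W -> [B, C]
    z = u.mean(axis=(2, 3))
    # Excitation: FC + ReLU, then FC + sigmoid -> [B, C], each weight in (0, 1)
    hidden = np.maximum(z @ w1.T + b1, 0.0)
    s = 1.0 / (1.0 + np.exp(-(hidden @ w2.T + b2)))
    # Scale: multiply every spatial element of channel c by s_c
    return u * s[:, :, None, None], s
```

With C = 1024 and a reduction ratio of 2, `w1` would have shape [512, 1024] and `w2` shape [1024, 512], matching the dimensions given in the text.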
Role of the Incorporated SE Attention Mechanism:
The integration of the SE attention mechanism delivers the following key benefits:
(1) Amplification of Salient Feature Channels: The SE module enables the network to autonomously learn and emphasize the feature channels that contain the most critical information for the specific task (e.g., object detection). For instance, it can identify channels that are particularly sensitive to the textures, shapes, or contextual cues of certain defect categories among the 1024 available channels.
(2) Suppression of Noisy or Irrelevant Channels: Concurrently, the mechanism attenuates the contributions from channels that harbor redundant information, noise, or details less pertinent to the current detection objective.
(3) Enhanced Feature Discriminability: This dynamic, adaptive, channel-wise feature recalibration significantly boosts the representational capacity and discriminative power of the feature maps produced by the network.
4.2. C3TR Module
Prior research indicates that hybrid models often outperform pure Transformer or pure CNN architectures when applied to small-scale datasets. Coincidentally, the volume of defective galvanized steel image data generated in industrial production settings is typically limited. Applying such a hybrid model to the galvanized sheet defect detection task can potentially enhance the network’s capacity for capturing global contextual information. Therefore, this paper integrates a Transformer Block with the final C3 module in the YOLOv5 backbone to form a C3TR module, while retaining the original C3 modules in other layers. This design strikes a balance between overall feature extraction accuracy and computational speed, augments global perceptual capability, and delivers significant performance improvements for complex detection tasks.
The network structure of the Transformer Block is depicted in Figure 3. For the input, the feature map extracted by the preceding layer of this module is first flattened into a 1D vector sequence, which is then passed to a Patch Embedding layer. Subsequently, each vector undergoes a linear transformation via a fully connected layer, is concatenated with a class token, and then summed with positional encodings that incorporate spatial information about the image. The resulting tokens are fed into the Encoder, a core component comprising mainly a Multi-Head Attention module and a Multilayer Perceptron (MLP).
The output from the Multi-Head Attention layer is combined with its original input via a residual connection, and the result is then passed to the MLP block. In this implementation, the MLP block, constructed with two fully connected layers, enables the Transformer Block to model complex relationships. Another residual connection adds the input of the MLP to its output, facilitating better extraction of image information and yielding the final output of the encoder. Multiple identical Transformer encoder layers are connected and stacked sequentially, forming the complete structure of the Transformer Block module. Replacing the Bottleneck module within the original C3 module with this Transformer Block creates the proposed C3TR module. A structural comparison between the improved C3TR module and the original C3 module is presented in Figure 4.
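The data flow through one encoder layer (flatten, multi-head attention with a residual connection, two-layer MLP with a second residual connection) can be sketched in NumPy as below. This is a simplified illustration under stated assumptions: the class token, positional encoding, and any normalization layers are omitted, ReLU is assumed between the two MLP layers, and the names `feature_map_to_tokens`, `encoder_layer`, and the parameter dictionary keys are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def feature_map_to_tokens(fmap):
    """Flatten a [C, H, W] feature map into H*W tokens of dimension C."""
    c, h, w = fmap.shape
    return fmap.reshape(c, h * w).T

def encoder_layer(tokens, p, num_heads):
    """One Transformer encoder layer on tokens of shape [N, C]."""
    n, c = tokens.shape
    d = c // num_heads  # per-head dimension

    def split_heads(x):  # [N, C] -> [heads, N, d]
        return x.reshape(n, num_heads, d).transpose(1, 0, 2)

    q = split_heads(tokens @ p["wq"])
    k = split_heads(tokens @ p["wk"])
    v = split_heads(tokens @ p["wv"])
    # Scaled dot-product attention: every token attends to every other token.
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d))   # [heads, N, N]
    ctx = (attn @ v).transpose(1, 0, 2).reshape(n, c)        # merge heads -> [N, C]
    x = tokens + ctx @ p["wo"]                               # first residual connection
    x = x + np.maximum(x @ p["w1"], 0.0) @ p["w2"]           # MLP + second residual
    return x, attn
```

The [N, N] attention map is what gives each spatial position access to the whole feature map, in contrast to the local receptive field of a convolution.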
4.3. Bi-FPN
Feature pyramids have become a standard component in object detection networks, enhancing the capability to detect objects across various scales. They facilitate the extraction of multi-scale feature information, which is then fused across different hierarchical feature maps to improve model accuracy. Since small objects inherently contain limited pixel information and are prone to being lost during down-sampling, effectively detecting objects with significant size variations is challenging. The traditional Feature Pyramid Network (FPN) addresses this by employing a top-down pathway with lateral connections, fusing high-resolution, shallow features with semantically rich, deep features.
YOLOv5 uses PANet (Path Aggregation Network) in its neck. In this paper, we enhance the neck by replacing PANet with the Weighted Bidirectional Feature Pyramid Network (Bi-FPN). Bi-FPN enables simple yet efficient multi-scale feature fusion, and its structure is illustrated in Figure 5.
Bi-FPN simplifies the architecture by removing nodes that have only one input edge, since a node that performs no feature fusion contributes little to a network whose purpose is to fuse different features. The result is a simplified bidirectional network. If the input and output nodes reside at the same level, an additional edge is added between them to fuse more features without introducing significant extra cost. Unlike PANet, which has a single top-down and a single bottom-up path, Bi-FPN treats each bidirectional (top-down and bottom-up) pathway as one feature network layer. It assigns learnable weights to each input feature and repeats these bidirectional layers to achieve higher-level feature fusion.
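Furthermore, Bi-FPN employs fast normalized fusion for feature integration, as shown in Equation (4):
$$O = \sum_{i} \frac{w_i}{\varepsilon + \sum_{j} w_j} \cdot I_i \qquad (4)$$
Here, $O$ is the fused output; $I_i$ denotes the i-th input feature map; $w_i$ is the learnable weight corresponding to the i-th input feature $I_i$; $w_j$ refers to any of the learnable weights in the set; and ε is a small constant added to the denominator to avoid numerical instability.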
This method offers a good balance of accuracy and speed. The core mechanism involves using a learnable weight for each input feature; the weights are normalized and used to form a weighted sum. For instance, considering the features at level 6 in Figure 5, $P_6^{td}$ represents the intermediate feature from the top-down pathway at level 6, while $P_6^{out}$ denotes the output feature from the bottom-up pathway at level 6. In summary, Bi-FPN integrates bidirectional cross-scale connections and fast normalized fusion. In the network’s neck, the original Concat operation is replaced with the Bi-FPN_ADD operation, resulting in superior fusion performance compared to its predecessor.
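Fast normalized fusion is compact enough to show directly. The sketch below is illustrative only: the function name `fast_normalized_fusion` and the value of `eps` are assumptions, and clamping the weights with ReLU before normalization is the common way to keep them non-negative rather than something the text states. The input features are assumed to have already been resized to a common shape.

```python
import numpy as np

def fast_normalized_fusion(features, weights, eps=1e-4):
    """Fuse same-shaped feature maps as a normalized weighted sum.

    features: list of arrays with identical shape
    weights:  one learnable scalar per feature (raw, possibly negative)
    """
    # ReLU keeps every weight non-negative (assumed here).
    w = np.maximum(np.asarray(weights, dtype=np.float64), 0.0)
    # Normalize so the weights sum to (almost) one; eps guards the division.
    w = w / (eps + w.sum())
    return sum(wi * f for wi, f in zip(w, features))
```

A two-input call corresponds to a fusion node with two incoming edges, and a three-input call to a node with three. Unlike a softmax over the weights, this normalization needs no exponentials, which is where the speed advantage comes from.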
The improved backbone network is tightly integrated with the BiFPN module through its multi-scale feature outputs (P3, P4, P5). In the feature pyramid neck, BiFPN employs a weighted bidirectional connection mechanism to perform cross-scale fusion of feature maps from different depths of the backbone. Specifically, the shallow high-resolution features (P3), mid-level features (P4), and deep semantic-rich features (P5) output by the backbone are simultaneously fed into the multi-layered stacked structure of BiFPN. Within each BiFPN layer, adaptive weighted fusion is applied to the different input features using learnable weights, and both top-down and bottom-up bidirectional information flow are executed. This architecture stacks three BiFPN layers in total (including 2 BiFPN_Add2 layers and 1 BiFPN_Add3 layer). Through this multi-level, iterative bidirectional fusion, the representational capacity of multi-scale features is enhanced and semantic information is effectively integrated, thereby significantly improving the model’s robustness in detecting objects with varying scales.
The overall architecture of the improved algorithm is depicted in Figure 6.
This study implements systematic architectural improvements to the original YOLOv5s model. In the backbone, the final C3 module is replaced with a C3TR module that integrates the Transformer’s self-attention mechanism, and a Squeeze-and-Excitation (SE) channel attention module is inserted after the deep feature extraction layers, immediately before the SPPF module. In the neck, the original PANet and its simple concatenation operations are replaced with a BiFPN, which achieves adaptive weighted fusion of multi-level features through BiFPN_Add2 and BiFPN_Add3 operations. Finally, the original detection head structure is retained in the output module, and the feature map input indices are adjusted according to the depth changes in the preceding network. Together, these enhancements yield an improved object detection architecture that integrates local feature extraction, global contextual modeling, and adaptive multi-scale fusion.
6. Conclusions
This study addressed the critical challenge of automated, high-accuracy defect detection on galvanized steel sheets, a task complicated by the presence of small, irregular defects and reflective surfaces. To overcome the limitations of standard detection models in this specific industrial context, we proposed S-C-B-YOLO, an enhanced architecture based on YOLOv5. The core innovation lies in the synergistic integration of three key modifications: the incorporation of an SE attention mechanism to adaptively emphasize defect-relevant features, the design of a C3TR module to capture global contextual information for irregular defects, and the replacement of PANet with Bi-FPN to optimize multi-scale feature fusion, particularly for small targets.
Experimental results on a dedicated galvanized sheet defect dataset demonstrate the effectiveness of our approach. The proposed model achieved a mean average precision (mAP@0.5) of 92.6% and an inference speed of 62 FPS. Ablation studies confirmed the individual and collective contribution of each proposed component to the final performance. Furthermore, comparative experiments showed that S-C-B-YOLO surpasses several mainstream detectors, including YOLOv3, YOLOv7, and Faster R-CNN, in terms of overall accuracy while maintaining competitive inference speed, showcasing a superior balance suitable for real-time industrial inspection.
In summary, this work provides a robust and efficient deep learning solution for galvanized sheet surface defect detection. The proposed model’s design effectively tackles the specific challenges of the domain, and its performance validates the potential for practical deployment in quality control systems. Future work will focus on expanding the defect dataset to include more rare defect categories, further optimizing the model for edge deployment, and exploring its adaptability to other types of metallic surface inspections.